* [PATCH 0/3] oom reaper follow ups v1
From: Michal Hocko @ 2016-04-06 14:13 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Daniel Vetter,
	Michael S. Tsirkin, Michal Hocko, Oleg Nesterov,
	Paul E. McKenney, Raushaniya Maksudova

Hi,
the following three patches should help to narrow the corner-case space
for OOM livelocks even further.

Patch 1 is something we probably should have done quite some time ago:
GFP_NOFS requests have never been given access to memory reserves, even
when a task was killed. As this has some side effects on oom notifiers,
I have CCed the current users of this interface to hear from them. The
patch contains more detailed information.

Patch 2 builds on top of that and allows tasks which skip the regular
OOM killer (e.g. those with fatal_signal_pending) to be queued for the
oom reaper when there is no risk that somebody sharing the mm with them
could observe the reaping from userspace. I have CCed Oleg on this patch
because I am not entirely sure I am doing it properly.

Finally, the last patch relaxes TIF_MEMDIE clearing and makes sure that
no task queued for the oom reaper keeps it once it is processed (whether
successfully or not).

Any feedback is highly appreciated.



* [PATCH 1/3] mm, oom: move GFP_NOFS check to out_of_memory
From: Michal Hocko @ 2016-04-06 14:13 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko,
	Daniel Vetter, Raushaniya Maksudova, Michael S . Tsirkin,
	Paul E . McKenney

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom is the central place to decide when out_of_memory
should be invoked. This is a good approach for most checks there because
they are page allocator specific and the allocation fails right after
each of them.

The notable exception is the GFP_NOFS context, which fakes
did_some_progress and keeps the page allocator looping even though there
could not have been any progress from the OOM killer. This patch doesn't
change that behavior because we are not ready to allow those allocation
requests to fail yet (and maybe we will have to face the reality that we
will never manage to fail these requests safely). Instead, the __GFP_FS
check is moved down to out_of_memory, where it prevents OOM victim
selection. There are two reasons for that:
	- OOM notifiers might release some memory even from this context,
	  as none of the registered notifiers seems to be FS related
	- this might help a dying thread to get access to memory reserves
	  and move on, which makes the behavior more consistent with the
	  case when the task gets killed from a different context.

Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.

Note to the current oom_notifier users:
The observable difference for you is that oom notifiers cannot depend on
any fs locks, because that could deadlock. Not that this would be
allowed today: it would just lock up the machine in most cases and rule
out the OOM killer along the way. Another difference is that callbacks
might be invoked sooner now, because GFP_NOFS is a weaker reclaim
context and so there could be reclaimable memory which is just not
reachable yet. That would require GFP_NOFS-only workloads, which are
really rare; more importantly, the observable result would be the
dropping of reconstructible objects and a potential performance drop,
which is not such a big deal when we are struggling to fulfill other
important allocation requests.

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  9 +++++++++
 mm/page_alloc.c | 24 ++++++++++--------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 86349586eacb..32d8210b8773 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -877,6 +877,15 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
+	 * The OOM killer does not compensate for IO-less reclaim.
+	 * pagefault_out_of_memory lost its gfp context so we have to
+	 * make sure to exclude the 0 mask - all other users should have at least
+	 * ___GFP_DIRECT_RECLAIM to get here.
+	 */
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+		return true;
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b889dba7bd4..736ea28abfcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2872,22 +2872,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
+		/*
+		 * XXX: GFP_NOFS allocations should rather fail than rely on
+		 * other requests to make forward progress.
+		 * We are in an unfortunate situation where out_of_memory cannot
+		 * do much for this context, but let's try it to at least get
+		 * access to memory reserves if the current task is killed (see
+		 * out_of_memory). Once filesystems are ready to handle allocation
+		 * failures more gracefully we should just bail out here.
+		 */
+
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
-- 
2.8.0.rc3



* [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
From: Michal Hocko @ 2016-04-06 14:13 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko,
	Oleg Nesterov

From: Michal Hocko <mhocko@suse.com>

If the current task is already killed or PF_EXITING, or a selected task
is PF_EXITING, then the oom killer is suppressed and so is the oom
reaper. This patch adds try_oom_reaper, which checks the given task and
queues it for the oom reaper if that is safe to do, meaning that the
task doesn't share its mm with a live process.

This might help to relieve memory pressure while the task tries to
exit.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 72 insertions(+), 18 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32d8210b8773..74c38f5fffef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,6 +412,25 @@ bool oom_killer_disabled __read_mostly;
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
+/*
+ * task->mm can be NULL if the task is the exited group leader.  So to
+ * determine whether the task is using a particular mm, we examine all the
+ * task's threads: if one of those is using this mm then this task was also
+ * using it.
+ */
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm)
+			return t_mm == mm;
+	}
+	return false;
+}
+
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
+/* Check if we can reap the given task. This has to be called with stable
+ * tsk->mm
+ */
+static void try_oom_reaper(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *p;
+
+	if (!mm)
+		return;
+
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	if (atomic_read(&mm->mm_users) > 1) {
+		rcu_read_lock();
+		for_each_process(p) {
+			bool exiting;
+
+			if (!process_shares_mm(p, mm))
+				continue;
+			if (same_thread_group(p, tsk))
+				continue;
+			if (fatal_signal_pending(p))
+				continue;
+
+			/*
+			 * If the task is exiting make sure the whole thread group
+			 * is exiting and cannot access mm anymore.
+			 */
+			spin_lock_irq(&p->sighand->siglock);
+			exiting = signal_group_exit(p->signal);
+			spin_unlock_irq(&p->sighand->siglock);
+			if (exiting)
+				continue;
+
+			/* Give up */
+			rcu_read_unlock();
+			return;
+		}
+		rcu_read_unlock();
+	}
+
+	wake_oom_reaper(tsk);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -575,6 +641,10 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
+static void try_oom_reaper(struct task_struct *tsk)
+{
+}
+
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
@@ -653,24 +723,6 @@ void oom_killer_enable(void)
 }
 
 /*
- * task->mm can be NULL if the task is the exited group leader.  So to
- * determine whether the task is using a particular mm, we examine all the
- * task's threads: if one of those is using this mm then this task was also
- * using it.
- */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
-{
-	struct task_struct *t;
-
-	for_each_thread(p, t) {
-		struct mm_struct *t_mm = READ_ONCE(t->mm);
-		if (t_mm)
-			return t_mm == mm;
-	}
-	return false;
-}
-
-/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
@@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
 		mark_oom_victim(p);
+		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
+		try_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.0.rc3



* [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
From: Michal Hocko @ 2016-04-06 14:13 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Right now the oom reaper clears TIF_MEMDIE only for tasks which were
successfully reaped. This is the safest option because we know that such
an oom victim would only block forward progress of the oom killer
without a good reason, as it is highly unlikely to release much more
memory: most of its memory has already been torn down.

We can relax this assumption to catch more corner cases though.

The first obvious one is when the oom victim clears its mm and gets
stuck later on. oom_reaper would back off on find_lock_task_mm returning
NULL. We can safely try to clear TIF_MEMDIE in this case because such a
task would be ignored by the oom killer anyway. Most of the time the
flag would have already been cleared by then.

The less obvious one is when the oom reaper fails due to mmap_sem
contention. Even if we clear TIF_MEMDIE for this task, it is not very
likely that we would select another task too easily, because we haven't
reaped the last victim and so it would still be the #1 candidate. There
is a rare race possible when the current victim terminates before the
next select_bad_process, but considering that oom_reap_task has retried
several times before giving up, this sounds like a borderline case.

After this patch we should have a guarantee that the OOM killer will
not be blocked for an unbounded amount of time in most cases.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 74c38f5fffef..7098104b7475 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,14 +510,10 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * This task can be safely ignored because we cannot do much more
+	 * to release its memory.
 	 */
 	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
@@ -538,6 +534,14 @@ static void oom_reap_task(struct task_struct *tsk)
 		debug_show_all_locks();
 	}
 
+	/*
+	 * Clear TIF_MEMDIE because the task shouldn't be sitting on
+	 * reasonably reclaimable memory anymore, or it is not a good oom
+	 * victim candidate right now because its memory can be released
+	 * neither by itself nor by the oom reaper.
+	 */
+	exit_oom_victim(tsk);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
-- 
2.8.0.rc3



* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
From: Tetsuo Handa @ 2016-04-07 11:38 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, akpm, linux-kernel, mhocko, oleg

Michal Hocko wrote:
> @@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk)
>  	wake_up(&oom_reaper_wait);
>  }
>  
> +/* Check if we can reap the given task. This has to be called with stable
> + * tsk->mm
> + */
> +static void try_oom_reaper(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +
> +	if (!mm)
> +		return;
> +
> +	/*
> +	 * There might be other threads/processes which are either not
> +	 * dying or even not killable.
> +	 */
> +	if (atomic_read(&mm->mm_users) > 1) {
> +		rcu_read_lock();
> +		for_each_process(p) {
> +			bool exiting;
> +
> +			if (!process_shares_mm(p, mm))
> +				continue;
> +			if (same_thread_group(p, tsk))
> +				continue;
> +			if (fatal_signal_pending(p))
> +				continue;
> +
> +			/*
> +			 * If the task is exiting make sure the whole thread group
> +			 * is exiting and cannot acces mm anymore.
> +			 */
> +			spin_lock_irq(&p->sighand->siglock);
> +			exiting = signal_group_exit(p->signal);
> +			spin_unlock_irq(&p->sighand->siglock);
> +			if (exiting)
> +				continue;
> +
> +			/* Give up */
> +			rcu_read_unlock();
> +			return;
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	wake_oom_reaper(tsk);
> +}
> +

I think you want to split out "try_oom_reaper() without
wake_oom_reaper()" as mm_is_reapable() and use it from oom_kill_process()
in order to skip the p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN test
which needlessly makes can_oom_reap false.

> @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	task_lock(p);
>  	if (p->mm && task_will_free_mem(p)) {
>  		mark_oom_victim(p);
> +		try_oom_reaper(p);
>  		task_unlock(p);
>  		put_task_struct(p);
>  		return;
> @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
>  	if (current->mm &&
>  	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
>  		mark_oom_victim(current);
> +		try_oom_reaper(current);
>  		return true;
>  	}
>  

Why don't you call try_oom_reaper() from the shortcuts in
mem_cgroup_out_of_memory() as well?

Why don't you embed try_oom_reaper() into mark_oom_victim() like I did at
http://lkml.kernel.org/r/201602052014.HBG52666.HFMOQVLFOSFJtO@I-love.SAKURA.ne.jp ?



* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-06 14:13   ` Michal Hocko
@ 2016-04-07 11:55     ` Tetsuo Handa
  -1 siblings, 0 replies; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-07 11:55 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> The first obvious one is when the oom victim clears its mm and gets
> stuck later on. oom_reaper would back off on find_lock_task_mm returning
> NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> task would be ignored by the oom killer anyway. The flag would be
> cleared by that time already most of the time anyway.

I didn't understand what this is trying to say. The OOM victim will clear
TIF_MEMDIE as soon as it sets current->mm = NULL. Even if the oom victim
clears its mm and gets stuck later on (e.g. at exit_task_work()),
TIF_MEMDIE has already been cleared by the OOM victim by that moment.

> 
> The less obvious one is when the oom reaper fails due to mmap_sem
> contention. Even if we clear TIF_MEMDIE for this task then it is not
> very likely that we would select another task too easily because
> we haven't reaped the last victim and so it would be still the #1
> candidate. There is a rare race condition possible when the current
> victim terminates before the next select_bad_process but considering
> that oom_reap_task had retried several times before giving up then
> this sounds like a borderline thing.

Is that helpful? Allowing the OOM killer to select the same thread again
simply floods the kernel log buffer with OOM kill messages.

I think we should not allow the OOM killer to select the same thread again
by e.g. doing tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN regardless of
whether reaping that thread's memory succeeded or not.

^ permalink raw reply	[flat|nested] 40+ messages in thread


* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-07 11:38     ` Tetsuo Handa
@ 2016-04-08 11:19       ` Tetsuo Handa
  -1 siblings, 0 replies; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-08 11:19 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, akpm, linux-kernel, mhocko, oleg

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	task_lock(p);
> >  	if (p->mm && task_will_free_mem(p)) {
> >  		mark_oom_victim(p);
> > +		try_oom_reaper(p);
> >  		task_unlock(p);
> >  		put_task_struct(p);
> >  		return;
> > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
> >  	if (current->mm &&
> >  	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> >  		mark_oom_victim(current);
> > +		try_oom_reaper(current);
> >  		return true;
> >  	}
> >  

oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" because of the

	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

test in wake_oom_reaper(), if "[PATCH 3/3] mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper" allows the same thread to
be selected again. Though I think we should not allow the OOM killer to
select the same thread again.

> 
> Why don't you call try_oom_reaper() from the shortcuts in
> mem_cgroup_out_of_memory() as well?

I looked at next-20160408 but I again came to think that we should remove
these shortcuts (something like a patch shown bottom).

These shortcuts might be called without sending SIGKILL to threads sharing
the victim's memory. It is possible that try_oom_reaper() fails to call
wake_oom_reaper() because of a thread without fatal_signal_pending(). If such
a thread is holding mmap_sem for write without fatal_signal_pending(),
the victim will get stuck at exit_mm() but wake_oom_reaper() is never called.
Therefore, use of these shortcuts can put us back to square one.

My perspective is that the OOM reaper should behave as a guaranteed
unlocking mechanism rather than a preemptive memory reaping mechanism. It is
therefore critically important that the OOM reaper kernel thread is always
woken up and clears TIF_MEMDIE some time later, even if it is known that the
memory used by the caller of try_oom_reaper() is not reapable. That is why I
moved the mm_is_reapable() test into the OOM reaper kernel thread in the
patch shown at the bottom.

Also, setting TIF_MEMDIE on all threads at oom_kill_process() not only
eliminates the need for "[PATCH 1/3] mm, oom: move GFP_NOFS check to
out_of_memory" but also suppresses needless OOM killer messages by holding
off the OOM killer until "all threads which should release the victim's mm
have released it" or "the OOM reaper has marked all threads which failed
to release the victim's mm as being stuck".

Setting TIF_MEMDIE on all threads at oom_kill_process() also increases the
possibility of reclaiming memory when racing with oom_killer_disable(), by
waiting until "all threads which should release the victim's mm have
released it" or "oom_killer_disable() gives up waiting for oom_victims
to become 0". It sounds strange to me that we currently thaw only the
victim's thread when we need to thaw all threads sharing the victim's mm
in order to reclaim memory used by the victim. (And an even crazier thing is
that we loop forever without providing a guarantee of forward progress at

	/* Exhausted what can be done so it's blamo time */
	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
		*did_some_progress = 1;

		if (gfp_mask & __GFP_NOFAIL) {
			page = get_page_from_freelist(gfp_mask, order,
					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
			/*
			 * fallback to ignore cpuset restriction if our nodes
			 * are depleted
			 */
			if (!page)
				page = get_page_from_freelist(gfp_mask, order,
					ALLOC_NO_WATERMARKS, ac);
		}
	}

but that thing is outside of this discussion.)

----------
 include/linux/oom.h   |    4 -
 include/linux/sched.h |    2
 kernel/exit.c         |    2
 mm/memcontrol.c       |   13 ---
 mm/oom_kill.c         |  200 +++++++++++++++++++-------------------------------
 5 files changed, 83 insertions(+), 138 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index abaab8e..9c99956 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,8 +67,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
@@ -86,7 +84,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd286b1..a93b24d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -786,6 +786,8 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	/* Already OOM-killed but cannot terminate. Don't count on me. */
+	bool oom_skip_me;
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e135..c742c37 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e90d48..57611b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1278,17 +1278,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	mutex_lock(&oom_lock);
-
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		goto unlock;
-	}
+	if (mutex_lock_killable(&oom_lock))
+		return NULL;
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104..7bda655 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -279,6 +279,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		if (task->signal->oom_skip_me)
+			return OOM_SCAN_CONTINUE;
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
 	}
@@ -441,6 +443,42 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *p;
+
+	if (atomic_read(&mm->mm_users) <= 2)
+		return true;
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		bool exiting;
+
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (fatal_signal_pending(p))
+			continue;
+
+		/*
+		 * If the task is exiting make sure the whole thread group
+		 * is exiting and cannot access mm anymore.
+		 */
+		spin_lock_irq(&p->sighand->siglock);
+		exiting = signal_group_exit(p->signal);
+		spin_unlock_irq(&p->sighand->siglock);
+		if (exiting)
+			continue;
+
+		/* Give up */
+		rcu_read_unlock();
+		return false;
+	}
+	rcu_read_unlock();
+	return true;
+}
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -470,7 +508,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);

-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -509,11 +547,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
-	/*
-	 * This task can be safely ignored because we cannot do much more
-	 * to release its memory.
-	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
 out:
 	mmput(mm);
 	return ret;
@@ -535,12 +568,12 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
+	 * Tell oom_scan_process_thread() not to wait for this process, for
+	 * the task shouldn't be sitting on a reasonably reclaimable memory
+	 * anymore or it is not a good candidate for the oom victim right now
+	 * because it cannot release its memory itself nor by the oom reaper.
 	 */
-	exit_oom_victim(tsk);
+	tsk->signal->oom_skip_me = true;
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -586,53 +619,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-static void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (same_thread_group(p, tsk))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -645,10 +631,6 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void try_oom_reaper(struct task_struct *tsk)
-{
-}
-
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
@@ -661,7 +643,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk)
+static void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
@@ -680,10 +662,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -741,21 +722,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
-
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -804,13 +770,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
-	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
-	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -819,37 +778,47 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_unlock(victim);
 
 	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * Kill all user processes sharing victim->mm in other thread groups,
+	 * if any. This reduces possibility of hitting mm->mmap_sem livelock
+	 * when an oom killed thread cannot exit because it requires the
+	 * semaphore and its contended by another thread trying to allocate
+	 * memory itself.
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
-		if (same_thread_group(p, victim))
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (is_global_init(p))
 			continue;
-		}
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
+		/*
+		 * We should make sure that oom_badness() will treat this
+		 * process as unkillable because wake_oom_reaper() might do
+		 * nothing.
+		 * Note that this will change sysctl_oom_kill_allocating_task
+		 * behavior.
+		 */
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		/*
+		 * We should send SIGKILL before setting TIF_MEMDIE in order to
+		 * prevent the OOM victim from depleting the memory reserves
+		 * from the user space under its control.
+		 */
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
+		wake_oom_reaper(p);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -920,21 +889,6 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
-	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		try_oom_reaper(current);
-		return true;
-	}
-
-	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least
----------

^ permalink raw reply related	[flat|nested] 40+ messages in thread

 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
-
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -804,13 +770,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
-	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
-	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -819,37 +778,47 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_unlock(victim);
 
 	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * Kill all user processes sharing victim->mm in other thread groups,
+	 * if any. This reduces possibility of hitting mm->mmap_sem livelock
+	 * when an oom killed thread cannot exit because it requires the
+	 * semaphore and it's contended by another thread trying to allocate
+	 * memory itself.
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
-		if (same_thread_group(p, victim))
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (is_global_init(p))
 			continue;
-		}
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
+		/*
+		 * We should make sure that oom_badness() will treat this
+		 * process as unkillable because wake_oom_reaper() might do
+		 * nothing.
+		 * Note that this will change sysctl_oom_kill_allocating_task
+		 * behavior.
+		 */
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		/*
+		 * We should send SIGKILL before setting TIF_MEMDIE in order to
+		 * prevent the OOM victim from depleting the memory reserves
+		 * from the user space under its control.
+		 */
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
+		wake_oom_reaper(p);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -920,21 +889,6 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
-	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		try_oom_reaper(current);
-		return true;
-	}
-
-	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-07 11:55     ` Tetsuo Handa
@ 2016-04-08 11:34       ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 11:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel

On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The first obvious one is when the oom victim clears its mm and gets
> > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > task would be ignored by the oom killer anyway. The flag would be
> > cleared by that time already most of the time anyway.
> 
> I didn't understand what this wants to tell. The OOM victim will clear
> TIF_MEMDIE as soon as it sets current->mm = NULL.

No, it clears the flag _after_ it returns from mmput(). There is no
guarantee it won't get stuck somewhere on the way there - e.g. exit_aio()
waits for completion, and who knows what else might get stuck.

> Even if the oom victim
> clears its mm and gets stuck later on (e.g. at exit_task_work()),
> TIF_MEMDIE was already cleared by that moment by the OOM victim.
> 
> > 
> > The less obvious one is when the oom reaper fails due to mmap_sem
> > contention. Even if we clear TIF_MEMDIE for this task then it is not
> > very likely that we would select another task too easily because
> > we haven't reaped the last victim and so it would be still the #1
> > candidate. There is a rare race condition possible when the current
> > victim terminates before the next select_bad_process but considering
> > that oom_reap_task had retried several times before giving up then
> > this sounds like a borderline thing.
> 
> Is it helpful? Allowing the OOM killer to select the same thread again
> simply makes the kernel log buffer flooded with the OOM kill messages.

I am trying to be as conservative as possible here. The likelihood of
mmap_sem contention will be reduced considerably once my
down_write_killable series gets merged. If this turns out to be a
problem (trivial to spot, as the same task will be killed again) then we
can think about a fix for that (e.g. ignore the task if it has been
selected more than N times).

> I think we should not allow the OOM killer to select the same thread again
> by e.g. doing tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN regardless of
> whether reaping that thread's memory succeeded or not.

I think this comes with some risk and so it should go as a separate
patch with a full justification why the outcome is better. Especially
after the mmap_sem contention will be reduced by other means.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-07 11:38     ` Tetsuo Handa
@ 2016-04-08 11:34       ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 11:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

On Thu 07-04-16 20:38:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk)
> >  	wake_up(&oom_reaper_wait);
> >  }
> >  
> > +/* Check if we can reap the given task. This has to be called with stable
> > + * tsk->mm
> > + */
> > +static void try_oom_reaper(struct task_struct *tsk)
> > +{
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct task_struct *p;
> > +
> > +	if (!mm)
> > +		return;
> > +
> > +	/*
> > +	 * There might be other threads/processes which are either not
> > +	 * dying or even not killable.
> > +	 */
> > +	if (atomic_read(&mm->mm_users) > 1) {
> > +		rcu_read_lock();
> > +		for_each_process(p) {
> > +			bool exiting;
> > +
> > +			if (!process_shares_mm(p, mm))
> > +				continue;
> > +			if (same_thread_group(p, tsk))
> > +				continue;
> > +			if (fatal_signal_pending(p))
> > +				continue;
> > +
> > +			/*
> > +			 * If the task is exiting make sure the whole thread group
> > +			 * is exiting and cannot acces mm anymore.
> > +			 */
> > +			spin_lock_irq(&p->sighand->siglock);
> > +			exiting = signal_group_exit(p->signal);
> > +			spin_unlock_irq(&p->sighand->siglock);
> > +			if (exiting)
> > +				continue;
> > +
> > +			/* Give up */
> > +			rcu_read_unlock();
> > +			return;
> > +		}
> > +		rcu_read_unlock();
> > +	}
> > +
> > +	wake_oom_reaper(tsk);
> > +}
> > +
> 
> I think you want to change "try_oom_reaper() without wake_oom_reaper()"
> as mm_is_reapable() and use it from oom_kill_process() in order to skip
> p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN test which needlessly makes
> can_oom_reap false.

Not sure I understand the OOM_SCORE_ADJ_MIN part. We cannot reap the
task if somebody sharing the mm is OOM_SCORE_ADJ_MIN. We have to check
this in oom_kill_process because we are sending SIGKILL but we do not
have to check for this explicitly in try_oom_reaper because we only care
about exiting/killed tasks.

[...]

> > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
> >  	if (current->mm &&
> >  	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> >  		mark_oom_victim(current);
> > +		try_oom_reaper(current);
> >  		return true;
> >  	}
> >  
> 
> Why don't you call try_oom_reaper() from the shortcuts in
> mem_cgroup_out_of_memory() as well?

I have focused on the global case and the correctness for now. But I
agree we can safely squash mem_cgroup_out_of_memory part into the patch
as well. Thanks for pointing this out.

> Why don't you embed try_oom_reaper() into mark_oom_victim() like I did at
> http://lkml.kernel.org/r/201602052014.HBG52666.HFMOQVLFOSFJtO@I-love.SAKURA.ne.jp ?

It didn't fit in the current flow of oom_kill_process where we do:
do_send_sig_info(victim)
mark_oom_victim(victim)
kill_sharing_tasks

so in the case of shared mm we wouldn't schedule the task for the reaper
most likely because we have to kill them first.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-08 11:19       ` Tetsuo Handa
@ 2016-04-08 11:50         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 11:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  	task_lock(p);
> > >  	if (p->mm && task_will_free_mem(p)) {
> > >  		mark_oom_victim(p);
> > > +		try_oom_reaper(p);
> > >  		task_unlock(p);
> > >  		put_task_struct(p);
> > >  		return;
> > > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
> > >  	if (current->mm &&
> > >  	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > >  		mark_oom_victim(current);
> > > +		try_oom_reaper(current);
> > >  		return true;
> > >  	}
> > >  
> 
> oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" due to
> 
> 	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> 		return;
> 
> test in wake_oom_reaper() if "[PATCH 3/3] mm, oom_reaper: clear
> TIF_MEMDIE for all tasks queued for oom_reaper" will select the same
> thread again.

true, will update my patch.

> Though I think we should not allow the OOM killer to select the same
> thread again.
> 
> > 
> > Why don't you call try_oom_reaper() from the shortcuts in
> > mem_cgroup_out_of_memory() as well?
> 
> I looked at next-20160408 but I again came to think that we should remove
> these shortcuts (something like a patch shown bottom).

feel free to send the patch with the full description. But I would
really encourage you to check the history to learn why those have been
added and describe why those concerns are not valid/important anymore.
Your way of throwing a large patch based on an extreme load which is
basically DoSing the machine is not the ideal one.

I do respect your different opinion. It is well possible that you are
right here and you can convince all the reviewers that your changes
are safe. I would be more than happy to drop my smaller steps approach
then. But I will be honest with you, you haven't convinced me yet and
I have seen so many subtle issues in this code area that the risk is
really non-trivial for any larger changes. This is the primary reason I
am taking small steps, each focusing on a single improvement which can be
argued about and is known to help a particular case without introducing
a risk of different problems. I am not the maintainer so it is not up to
me to select the right approach.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-06 14:13   ` Michal Hocko
@ 2016-04-08 13:07     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 13:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Tetsuo Handa, linux-mm, LKML

Andrew, could you please fold this in?
---
>From 4e23df3584966f6d7885245f3071c15669b8a5fc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 8 Apr 2016 15:04:29 +0200
Subject: [PATCH] mm, oom_reaper: clear oom_reaper_list before clearing
 TIF_MEMDIE

As per Tetsuo:
: oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" due to
:
:         if (tsk == oom_reaper_list || tsk->oom_reaper_list)
:                 return;
:
: test in wake_oom_reaper() if "[PATCH 3/3] mm, oom_reaper: clear
: TIF_MEMDIE for all tasks queued for oom_reaper" will select the same
: thread again

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104b7475..ca34036f3ae1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -540,6 +540,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
 	 */
+	tsk->oom_reaper_list = NULL;
 	exit_oom_victim(tsk);
 
 	/* Drop a reference taken by wake_oom_reaper */
-- 
2.8.0.rc3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-06 14:13   ` Michal Hocko
@ 2016-04-08 13:14     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 13:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Tetsuo Handa, linux-mm, LKML, Oleg Nesterov

Andrew, could you fold this in?
---
>From 9645b0518c20da67decfd2a94b1cc99ef1b4218a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 8 Apr 2016 15:11:59 +0200
Subject: [PATCH] mm, oom_reaper: Try to reap tasks which skip regular memcg
 OOM killer path

No reason not to include memcg oom killed tasks as well.

Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 8 ++++++++
 mm/memcontrol.c     | 1 +
 mm/oom_kill.c       | 2 +-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 628a43242a34..83b9c39bd8b7 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,6 +72,14 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk);
 
+#ifdef CONFIG_MMU
+extern void try_oom_reaper(struct task_struct *tsk);
+#else
+static inline void try_oom_reaper(struct task_struct *tsk)
+{
+}
+#endif
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17a847c96618..6d2f699e4d3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1256,6 +1256,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		mark_oom_victim(current);
+		try_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 74c38f5fffef..e656ee2c0be4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -585,7 +585,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-static void try_oom_reaper(struct task_struct *tsk)
+void try_oom_reaper(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
 	struct task_struct *p;
-- 
2.8.0.rc3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-08 11:50         ` Michal Hocko
@ 2016-04-09  4:39           ` Tetsuo Handa
  -1 siblings, 0 replies; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-09  4:39 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

Michal Hocko wrote:
> On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > I looked at next-20160408 but I again came to think that we should remove
> > these shortcuts (something like a patch shown bottom).
>
> feel free to send the patch with the full description. But I would
> really encourage you to check the history to learn why those have been
> added and describe why those concerns are not valid/important anymore.

I believe that past discussions and decisions about current code are too
optimistic because they did not take 'The "too small to fail" memory-
allocation rule' problem into account.

If you ignore me with "check the history to learn why those have been added
and describe why those concerns are not valid/important anymore", I can do
nothing. What are the valid/important concerns that have higher priority than
fixing 'The "too small to fail" memory-allocation rule' problem, such that we
continue telling a lie to end users? Please enumerate such concerns.

> Your way of throwing a large patch based on an extreme load which is
> basically DoSing the machine is not the ideal one.

You are not paying attention to the real-world limitations I'm facing.
I have to waste my resources trying to identify and fix problems on behalf
of customers before they determine the kernel version to use for their
systems, because your way of thinking is "We don't need to worry about
it because I have never received such a report" while the reality of
customers is "I'm not skillful enough to catch the problematic
behavior and make a reproducer" or "I have a little skill but I'm not
permitted to modify my systems for reporting the problematic behavior".
If you listened to me, I wouldn't need to do such things. It is very very sad.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-09  4:39           ` Tetsuo Handa
@ 2016-04-11 12:02             ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-11 12:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

On Sat 09-04-16 13:39:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > > I looked at next-20160408 but I again came to think that we should remove
> > > these shortcuts (something like a patch shown bottom).
> >
> > feel free to send the patch with the full description. But I would
> > really encourage you to check the history to learn why those have been
> > added and describe why those concerns are not valid/important anymore.
> 
> I believe that past discussions and decisions about current code are too
> optimistic because they did not take 'The "too small to fail" memory-
> allocation rule' problem into account.

In most cases they were driven by _real_ usecases though. And that
is what matters. Theoretically possible issues which only happen under
crazy workloads that are already DoSing the machine are not something
to optimize for. Sure, we should try to cope with them as gracefully
as possible, no question about that, but we should try hard not to
reintroduce previous issues under _sensible_ workloads.

> If you ignore me with "check the history to learn why those have been added
> and describe why those concerns are not valid/important anymore", I can do
> nothing. What are valid/important concerns that have higher priority than
> keeping 'The "too small to fail" memory-allocation rule' problem and continue
> telling a lie to end users? Please enumerate such concerns.

I feel like we are going in circles and I do not want to waste my
time repeating arguments which have already been made several times.
I have already told you that you have to justify potentially disruptive
changes properly. So far you are more focused on extreme cases, while
you do not seem to care all that much about those which happen most of
the time. We surely do not want to regress there. If I am telling you
to study the history of our heuristics, it is to _help_ you understand
why they were introduced, so that you can argue with the reasoning
and/or come up with improvements. Unless you start doing this, chances
are that your patches will not see an overly warm welcome.

> > Your way of throwing a large patch based on an extreme load which is
> > basically DoSing the machine is not the ideal one.
> 
> You are not paying attention to real world's limitations I'm facing.

So far I haven't seen any _real_world_ example from you, to be honest.
All I can see is hammering the system with DoS scenarios which
trigger different corner cases in the behavior. Those are good for making
us think about our limitations and about long-term solutions.

> I have to waste my resource trying to identify and fix on behalf of
> customers before they determine the kernel version to use for their
> systems, for your way of thinking is that "We don't need to worry about
> it because I have never received such report"

No, I am not saying that. I am saying that I have never seen a _properly_
configured system blow up in a way that would trigger the pathological
cases you are mentioning. And that is a big difference. You can
misconfigure your system in so many ways and bring it to its knees
without a way out.

With all due respect I will not continue in this line of discussion
because it doesn't lead anywhere.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-11 12:02             ` Michal Hocko
@ 2016-04-11 13:26               ` Tetsuo Handa
  -1 siblings, 0 replies; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-11 13:26 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

Michal Hocko wrote:
> On Sat 09-04-16 13:39:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > > > I looked at next-20160408 but I again came to think that we should remove
> > > > these shortcuts (something like a patch shown bottom).
> > >
> > > feel free to send the patch with the full description. But I would
> > > really encourage you to check the history to learn why those have been
> > > added and describe why those concerns are not valid/important anymore.
> > 
> > I believe that past discussions and decisions about current code are too
> > optimistic because they did not take 'The "too small to fail" memory-
> > allocation rule' problem into account.
> 
> In most cases they were driven by _real_ usecases though. And that
> is what matters. Theoretically possible issues which happen under
> crazy workloads which are DoSing the machine already are not something
> to optimize for. Sure we should try to cope with them as gracefully
> as possible, no questions about that, but we should try hard not to
> reintroduce previous issues during _sensible_ workloads.

I'm not asking you to optimize for crazy workloads. None of my
customers intentionally run crazy workloads, but they are getting silent
hangups, and I suspect that something went wrong in memory management.
But there is no evidence, because the memory management subsystem remains
silent. You call my testcases DoS, but there is no evidence that my
customers are not hitting the same problems my testcases found.

I'm suggesting that you at least emit diagnostic messages when something
goes wrong. That is what kmallocwd is for. And if you do not want to emit
diagnostic messages, I'm fine with a timeout based approach.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-11 13:26               ` Tetsuo Handa
@ 2016-04-11 13:43                 ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-11 13:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

On Mon 11-04-16 22:26:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 09-04-16 13:39:30, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > > > > I looked at next-20160408 but I again came to think that we should remove
> > > > > these shortcuts (something like a patch shown bottom).
> > > >
> > > > feel free to send the patch with the full description. But I would
> > > > really encourage you to check the history to learn why those have been
> > > > added and describe why those concerns are not valid/important anymore.
> > > 
> > > I believe that past discussions and decisions about current code are too
> > > optimistic because they did not take 'The "too small to fail" memory-
> > > allocation rule' problem into account.
> > 
> > In most cases they were driven by _real_ usecases though. And that
> > is what matters. Theoretically possible issues which happen under
> > crazy workloads which are DoSing the machine already are not something
> > to optimize for. Sure we should try to cope with them as gracefully
> > as possible, no questions about that, but we should try hard not to
> > reintroduce previous issues during _sensible_ workloads.
> 
> I'm not requesting you to optimize for crazy workloads. None of my
> customers intentionally put crazy workloads, but they are getting silent
> hangups and I'm suspecting that something went wrong with memory management.

There are many other possible reasons for these symptoms. Have you
actually seen any _evidence_ that the hangs they are seeing are due to
an oom deadlock, though? Even a single crash dump or consistent sysrq
output pointing in that direction would do.

> But there is no evidence because memory management subsystem remains silent.
> You call my testcases DoS, but there is no evidence that my customers
> are not hitting the same problem my testcases found.

This is really impossible to comment on.

> I'm suggesting you to at least emit diagnostic messages when something went
> wrong. That is what kmallocwd is for. And if you do not want to emit
> diagnostic messages, I'm fine with timeout based approach.

I am all for more diagnostics, but what you were proposing was so
heavyweight that it doesn't really seem worth it.

Anyway yet again this is getting largely off-topic...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
  2016-04-11 13:43                 ` Michal Hocko
@ 2016-04-13 11:08                   ` Tetsuo Handa
  -1 siblings, 0 replies; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-13 11:08 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

Michal Hocko wrote:
> There are many other possible reasons for these symptoms. Have you
> actually seen any _evidence_ that the hangs they are seeing are due to
> an oom deadlock, though? Even a single crash dump or consistent sysrq
> output pointing in that direction would do.

Yes. I saw several OOM livelock cases occur on customers' servers.

One case whose cause I was able to identify was a request_module() local DoS
( https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2012-4398 ). That server
was running a Java based enterprise application. The java process tried to
create an IPv6 socket on a system where IPv6 was configured to be disabled,
so request_module() was called from the socket() syscall every time, trying
to load the ipv6.ko module; the OOM killer was invoked, the java process
was selected as the OOM victim, and since request_module() was not killable,
the system got into an OOM livelock.

Another case I saw was interrupts from virtio being disabled due to a bug in
qemu-kvm. Since the cron daemon continued starting cron jobs even after
storage I/O started stalling (because qemu-kvm had stopped sending
interrupts), all memory was consumed by cron jobs and async file write
requests, and the OOM killer was invoked. And since a cron job which was
selected as the OOM victim while trying to write to a file was unable to
terminate because it was waiting for fs writeback, the system got into an
OOM livelock.

Yet another case I saw was a hangup where a process was blocked at
down_read(&mm->mmap_sem) in __access_remote_vm() while reading /proc/pid/
entries. Since I had zero knowledge about OOM livelocks at that time, I was
not able to tell whether it was an OOM livelock or not.

There would be some more, but I can't recall them because I left the support
center one year ago and have no chance to re-examine those cases.

But in general, it is rare that I can find the OOM killer messages, because
their servers are force rebooted without capturing kdump or SysRq output.
The hints I can use are limited to /var/log/messages (which lacks suspicious
messages), /var/log/sa/ (which shows that there was little free memory) and
/proc/sys/kernel/hung_task_warnings already being 0 (if a sosreport is also
provided).

> > I'm suggesting you to at least emit diagnostic messages when something went
> > wrong. That is what kmallocwd is for. And if you do not want to emit
> > diagnostic messages, I'm fine with timeout based approach.
>
> I am all for more diagnostic but what you were proposing was so heavy
> weight it doesn't really seem worth it.

I suspect that the reason hung_task_warnings becomes 0 is related to the
use of the same watermark for GFP_KERNEL/GFP_NOFS/GFP_NOIO, but I can't
ask customers to replace their kernels for debugging. So the first step is
to merge kmallocwd upstream, then wait until customers start using that
kernel (it may be within a few months if they are about to develop a new
server, but it may be 10 years away if they have already decided not to
update kernels for their servers' lifetime).

> Anyway yet again this is getting largely off-topic...

OK. I'll stop posting to this thread.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-08 11:34       ` Michal Hocko
  (?)
@ 2016-04-16  2:51       ` Tetsuo Handa
  2016-04-17 11:54         ` Michal Hocko
  -1 siblings, 1 reply; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-16  2:51 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > The first obvious one is when the oom victim clears its mm and gets
> > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > task would be ignored by the oom killer anyway. The flag would be
> > > cleared by that time already most of the time anyway.
> > 
> > I didn't understand what this wants to tell. The OOM victim will clear
> > TIF_MEMDIE as soon as it sets current->mm = NULL.
> 
> No it clears the flag _after_ it returns from mmput. There is no
> guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> waits for completion and who knows what else might get stuck.

OK. Then, I think an OOM livelock scenario shown below is possible.

 (1) First OOM victim (where mm->mm_users == 1) is selected by the first
     round of out_of_memory() call.

 (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).

 (3) The OOM victim calls mmput() from exit_mm() from do_exit().
     mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
     returns false because of (2).

 (4) The OOM reaper reaps memory and then calls mmput().
     mmput() calls exit_aio() etc. and waits for completion because
     atomic_dec_and_test(&mm->mm_users) is now true.

 (5) Second OOM victim (which is the parent of the first OOM victim)
     is selected by the next round of out_of_memory() call.

 (6) The OOM reaper is waiting for completion of the first OOM victim's
     memory while the second OOM victim is waiting for the OOM reaper to
     reap memory.

Where is the guarantee that exit_aio() etc. called from mmput() by the
OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
not blocked forever inside __oom_reap_task())?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-16  2:51       ` Tetsuo Handa
@ 2016-04-17 11:54         ` Michal Hocko
  2016-04-18 11:59           ` Tetsuo Handa
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2016-04-17 11:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > The first obvious one is when the oom victim clears its mm and gets
> > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > task would be ignored by the oom killer anyway. The flag would be
> > > > cleared by that time already most of the time anyway.
> > > 
> > > I didn't understand what this wants to tell. The OOM victim will clear
> > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > 
> > No it clears the flag _after_ it returns from mmput. There is no
> > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > waits for completion and who knows what else might get stuck.
> 
> OK. Then, I think an OOM livelock scenario shown below is possible.
> 
>  (1) First OOM victim (where mm->mm_users == 1) is selected by the first
>      round of out_of_memory() call.
> 
>  (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> 
>  (3) The OOM victim calls mmput() from exit_mm() from do_exit().
>      mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
>      returns false because of (2).
> 
>  (4) The OOM reaper reaps memory and then calls mmput().
>      mmput() calls exit_aio() etc. and waits for completion because
>      atomic_dec_and_test(&mm->mm_users) is now true.
> 
>  (5) Second OOM victim (which is the parent of the first OOM victim)
>      is selected by the next round of out_of_memory() call.
> 
>  (6) The OOM reaper is waiting for completion of the first OOM victim's
>      memory while the second OOM victim is waiting for the OOM reaper to
>      reap memory.
> 
> Where is the guarantee that exit_aio() etc. called from mmput() by the
> OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> not blocked forever inside __oom_reap_task())?

You should realize that the mmput is called _after_ we have reclaimed the
victim's address space. So there should be some memory freed by that
time, which reduces the likelihood of a lockup due to a memory allocation
request, if one is really needed for exit_aio.

But you have a good point here. We want to make the oom_reaper as robust
as possible. We dropped the munlock patch for robustness reasons, so I
guess we want this fixed as well. And the reason for blocking might be
something other than memory pressure, I guess.

Here is what should work - I have only compile tested it. I will prepare
the proper patch later this week with other oom reaper patches or after
I come back from LSF/MM.
---
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 624b78b848b8..5113e0e7e8ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -509,6 +509,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct work_struct async_put_work;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7bd0fa9db199..df8778e72211 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2604,6 +2604,11 @@ static inline void mmdrop(struct mm_struct * mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+/* same as above but performs the slow path from an async context. Can
+ * be called from atomic context as well.
+ */
+extern void mmput_async(struct mm_struct *);
+
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index accb7221d547..10b0f771d795 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -696,6 +696,26 @@ void __mmdrop(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
+static inline void __mmput(struct mm_struct *mm)
+{
+	VM_BUG_ON(atomic_read(&mm->mm_users));
+
+	uprobe_clear_state(mm);
+	exit_aio(mm);
+	ksm_exit(mm);
+	khugepaged_exit(mm); /* must run before exit_mmap */
+	exit_mmap(mm);
+	set_mm_exe_file(mm, NULL);
+	if (!list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		list_del(&mm->mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+	if (mm->binfmt)
+		module_put(mm->binfmt->module);
+	mmdrop(mm);
+}
+
 /*
  * Decrement the use count and release all resources for an mm.
  */
@@ -703,24 +723,24 @@ void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
+	if (atomic_dec_and_test(&mm->mm_users))
+		__mmput(mm);
+}
+EXPORT_SYMBOL_GPL(mmput);
+
+static void mmput_async_fn(struct work_struct *work)
+{
+	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+	__mmput(mm);
+}
+
+void mmput_async(struct mm_struct *mm)
+{
 	if (atomic_dec_and_test(&mm->mm_users)) {
-		uprobe_clear_state(mm);
-		exit_aio(mm);
-		ksm_exit(mm);
-		khugepaged_exit(mm); /* must run before exit_mmap */
-		exit_mmap(mm);
-		set_mm_exe_file(mm, NULL);
-		if (!list_empty(&mm->mmlist)) {
-			spin_lock(&mmlist_lock);
-			list_del(&mm->mmlist);
-			spin_unlock(&mmlist_lock);
-		}
-		if (mm->binfmt)
-			module_put(mm->binfmt->module);
-		mmdrop(mm);
+		INIT_WORK(&mm->async_put_work, mmput_async_fn);
+		schedule_work(&mm->async_put_work);
 	}
 }
-EXPORT_SYMBOL_GPL(mmput);
 
 /**
  * set_mm_exe_file - change a reference to the mm's executable file
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 44683a2f8fa7..65f2acbaad29 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -446,7 +446,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -520,7 +519,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
 out:
-	mmput(mm);
+	/*
+	 * Drop our reference but make sure the mmput slow path is called from
+	 * a different context, because we shouldn't risk getting stuck there
+	 * and putting the oom_reaper out of the way.
+	 */
+	mmput_async(mm);
 	return ret;
 }
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-17 11:54         ` Michal Hocko
@ 2016-04-18 11:59           ` Tetsuo Handa
  2016-04-19 14:17             ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-18 11:59 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > task would be ignored by the oom killer anyway. The flag would be
> > > > > cleared by that time already most of the time anyway.
> > > > 
> > > > I didn't understand what this is trying to say. The OOM victim will clear
> > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > > 
> > > No it clears the flag _after_ it returns from mmput. There is no
> > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > waits for completion and who knows what else might get stuck.
> > 
> > OK. Then, I think an OOM livelock scenario shown below is possible.
> > 
> >  (1) First OOM victim (where mm->mm_users == 1) is selected by the first
> >      round of out_of_memory() call.
> > 
> >  (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> > 
> >  (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> >      mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> >      returns false because of (2).
> > 
> >  (4) The OOM reaper reaps memory and then calls mmput().
> >      mmput() calls exit_aio() etc. and waits for completion because
> >      atomic_dec_and_test(&mm->mm_users) is now true.
> > 
> >  (5) Second OOM victim (which is the parent of the first OOM victim)
> >      is selected by the next round of out_of_memory() call.
> > 
> >  (6) The OOM reaper is waiting for completion of the first OOM victim's
> >      memory while the second OOM victim is waiting for the OOM reaper to
> >      reap memory.
> > 
> > Where is the guarantee that exit_aio() etc. called from mmput() by the
> > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> > not blocked forever inside __oom_reap_task())?
> 
> You should realize that the mmput is called _after_ we have reclaimed the
> victim's address space. So there should be some memory freed by that
> time, which reduces the likelihood of a lockup due to a memory allocation
> request, if one is really needed for exit_aio.

Not always true. mmput() is also called when down_read_trylock(&mm->mmap_sem)
fails. It is possible that the OOM victim was about to call
up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and
that the victim then ran all the way through mmput() from exit_mm() from
do_exit() while the OOM reaper was preempted between
down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
reaper will call mmput() without having reclaimed the OOM victim's address
space.

> 
> > But you have a good point here. We want to make the oom_reaper as robust
> > as possible. We dropped the munlock patch for robustness reasons, so I
> > guess we want this fixed as well. And the reason for blocking might be
> > something other than memory pressure, I guess.

The reality of race/dependency is more complicated than we can imagine.

> 
> Here is what should work - I have only compile tested it. I will prepare
> the proper patch later this week with other oom reaper patches or after
> I come back from LSF/MM.

Excuse me, but is system_wq suitable for queuing operations which may take
unpredictable duration to flush?

  system_wq is the one used by schedule[_delayed]_work[_on]().
  Multi-CPU multi-threaded.  There are users which expect relatively
  short queue flush time.  Don't queue works which can run for too
  long.

Many users, including SysRq-f, depend on system_wq being flushed quickly. We
have never guaranteed that SysRq-f can always fire and select a different OOM
victim, but you proposed always clearing TIF_MEMDIE without considering the
possibility that the OOM victim, holding mmap_sem for write, might be stuck
in an unkillable wait.

I wonder about your definition of "robustness". You are almost always missing
the worst scenario. You are trying to manage OOM without defining a default:
label in a switch statement. I don't think your approach is robust.



* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-18 11:59           ` Tetsuo Handa
@ 2016-04-19 14:17             ` Michal Hocko
  2016-04-19 15:07               ` Tetsuo Handa
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2016-04-19 14:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > > task would be ignored by the oom killer anyway. The flag would be
> > > > > > cleared by that time already most of the time anyway.
> > > > > 
> > > > > I didn't understand what this is trying to say. The OOM victim will clear
> > > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > > > 
> > > > No it clears the flag _after_ it returns from mmput. There is no
> > > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > > waits for completion and who knows what else might get stuck.
> > > 
> > > OK. Then, I think an OOM livelock scenario shown below is possible.
> > > 
> > >  (1) First OOM victim (where mm->mm_users == 1) is selected by the first
> > >      round of out_of_memory() call.
> > > 
> > >  (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> > > 
> > >  (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> > >      mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> > >      returns false because of (2).
> > > 
> > >  (4) The OOM reaper reaps memory and then calls mmput().
> > >      mmput() calls exit_aio() etc. and waits for completion because
> > >      atomic_dec_and_test(&mm->mm_users) is now true.
> > > 
> > >  (5) Second OOM victim (which is the parent of the first OOM victim)
> > >      is selected by the next round of out_of_memory() call.
> > > 
> > >  (6) The OOM reaper is waiting for completion of the first OOM victim's
> > >      memory while the second OOM victim is waiting for the OOM reaper to
> > >      reap memory.
> > > 
> > > Where is the guarantee that exit_aio() etc. called from mmput() by the
> > > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> > > not blocked forever inside __oom_reap_task())?
> > 
> > You should realize that the mmput is called _after_ we have reclaimed the
> > victim's address space. So there should be some memory freed by that
> > time, which reduces the likelihood of a lockup due to a memory allocation
> > request, if one is really needed for exit_aio.
> 
> Not always true. mmput() is also called when down_read_trylock(&mm->mmap_sem)
> fails. It is possible that the OOM victim was about to call
> up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and
> that the victim then ran all the way through mmput() from exit_mm() from
> do_exit() while the OOM reaper was preempted between
> down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
> reaper will call mmput() without having reclaimed the OOM victim's address
> space.

You are right! For some reason I have missed that.

> > But you have a good point here. We want to make the oom_reaper as robust
> > as possible. We dropped the munlock patch for robustness reasons, so I
> > guess we want this fixed as well. And the reason for blocking might be
> > something other than memory pressure, I guess.
> 
> The reality of race/dependency is more complicated than we can imagine.
> 
> > 
> > Here is what should work - I have only compile tested it. I will prepare
> > the proper patch later this week with other oom reaper patches or after
> > I come back from LSF/MM.
> 
> Excuse me, but is system_wq suitable for queuing operations which may take
> unpredictable duration to flush?
> 
>   system_wq is the one used by schedule[_delayed]_work[_on]().
>   Multi-CPU multi-threaded.  There are users which expect relatively
>   short queue flush time.  Don't queue works which can run for too
>   long.

An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I
am not really sure would be justified considering we are talking about a
highly unlikely event. You do not want to consume resources permanently
for an unlikely and non-fatal event.

> Many users, including SysRq-f, depend on system_wq being flushed quickly.

Critical work shouldn't really rely on system_wq, full stop. There is
just too much going on on that WQ and it is simply impossible to
guarantee anything.

> We have never guaranteed that SysRq-f can always fire and select a different
> OOM victim, but you proposed always clearing TIF_MEMDIE without considering
> the possibility that the OOM victim, holding mmap_sem for write, might be
> stuck in an unkillable wait.
> 
> I wonder about your definition of "robustness". You are almost always missing
> the worst scenario. You are trying to manage OOM without defining a default:
> label in a switch statement. I don't think your approach is robust.

I am trying to be as robust as is viable. You have to realize we are
in the catastrophic path already and there is simply no deterministic
way out.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-19 14:17             ` Michal Hocko
@ 2016-04-19 15:07               ` Tetsuo Handa
  2016-04-19 19:32                 ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-19 15:07 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Here is what should work - I have only compile tested it. I will prepare
> > > the proper patch later this week with other oom reaper patches or after
> > > I come back from LSF/MM.
> > 
> > Excuse me, but is system_wq suitable for queuing operations which may take
> > unpredictable duration to flush?
> > 
> >   system_wq is the one used by schedule[_delayed]_work[_on]().
> >   Multi-CPU multi-threaded.  There are users which expect relatively
> >   short queue flush time.  Don't queue works which can run for too
> >   long.
> 
> An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I
> am not really sure would be justified considering we are talking about a
> highly unlikely event. You do not want to consume resources permanently
> for an unlikely and non-fatal event.

Yes, the reason SysRq-f still does not use a dedicated WQ with WQ_MEM_RECLAIM
is presumably the same.

> 
> > We have never guaranteed that SysRq-f can always fire and select a different
> > OOM victim, but you proposed always clearing TIF_MEMDIE without considering
> > the possibility that the OOM victim, holding mmap_sem for write, might be
> > stuck in an unkillable wait.
> > 
> > I wonder about your definition of "robustness". You are almost always missing
> > the worst scenario. You are trying to manage OOM without defining a default:
> > label in a switch statement. I don't think your approach is robust.
> 
> I am trying to be as robust as is viable. You have to realize we are
> in the catastrophic path already and there is simply no deterministic
> way out.

I know we are talking about the catastrophic situation. It is because you
insist on a deterministic approach that we are struggling so much.
If you tolerate
http://lkml.kernel.org/r/201604152111.JBD95763.LMFOOHQOtFSFJV@I-love.SAKURA.ne.jp
approach as the fastpath (deterministic but could fail) and
http://lkml.kernel.org/r/201604200006.FBG45192.SOHFQJFOOLFMtV@I-love.SAKURA.ne.jp
approach as the slowpath (non-deterministic but never fail), we don't need to
use a dedicated WQ with WQ_MEM_RECLAIM for avoiding this mmput() trap and the
SysRq-f trap. What a simple answer. ;-)



* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
  2016-04-19 15:07               ` Tetsuo Handa
@ 2016-04-19 19:32                 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-19 19:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 20-04-16 00:07:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Here is what should work - I have only compile tested it. I will prepare
> > > > the proper patch later this week with other oom reaper patches or after
> > > > I come back from LSF/MM.
> > > 
> > > Excuse me, but is system_wq suitable for queuing operations which may take
> > > unpredictable duration to flush?
> > > 
> > >   system_wq is the one used by schedule[_delayed]_work[_on]().
> > >   Multi-CPU multi-threaded.  There are users which expect relatively
> > >   short queue flush time.  Don't queue works which can run for too
> > >   long.
> > 
> > An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I
> > am not really sure would be justified considering we are talking about a
> > highly unlikely event. You do not want to consume resources permanently
> > for an unlikely and non-fatal event.
> 
> Yes, the reason SysRq-f still does not use a dedicated WQ with WQ_MEM_RECLAIM
> is presumably the same.

SysRq-f can use the oom_reaper kernel thread.
-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2016-04-19 19:32 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-06 14:13 [PATCH 0/3] oom reaper follow ups v1 Michal Hocko
2016-04-06 14:13 ` Michal Hocko
2016-04-06 14:13 ` [PATCH 1/3] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
2016-04-06 14:13   ` Michal Hocko
2016-04-06 14:13 ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path Michal Hocko
2016-04-06 14:13   ` Michal Hocko
2016-04-07 11:38   ` Tetsuo Handa
2016-04-07 11:38     ` Tetsuo Handa
2016-04-08 11:19     ` Tetsuo Handa
2016-04-08 11:19       ` Tetsuo Handa
2016-04-08 11:50       ` Michal Hocko
2016-04-08 11:50         ` Michal Hocko
2016-04-09  4:39         ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skipregular " Tetsuo Handa
2016-04-09  4:39           ` Tetsuo Handa
2016-04-11 12:02           ` Michal Hocko
2016-04-11 12:02             ` Michal Hocko
2016-04-11 13:26             ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular " Tetsuo Handa
2016-04-11 13:26               ` Tetsuo Handa
2016-04-11 13:43               ` Michal Hocko
2016-04-11 13:43                 ` Michal Hocko
2016-04-13 11:08                 ` Tetsuo Handa
2016-04-13 11:08                   ` Tetsuo Handa
2016-04-08 11:34     ` Michal Hocko
2016-04-08 11:34       ` Michal Hocko
2016-04-08 13:14   ` Michal Hocko
2016-04-08 13:14     ` Michal Hocko
2016-04-06 14:13 ` [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper Michal Hocko
2016-04-06 14:13   ` Michal Hocko
2016-04-07 11:55   ` Tetsuo Handa
2016-04-07 11:55     ` Tetsuo Handa
2016-04-08 11:34     ` Michal Hocko
2016-04-08 11:34       ` Michal Hocko
2016-04-16  2:51       ` Tetsuo Handa
2016-04-17 11:54         ` Michal Hocko
2016-04-18 11:59           ` Tetsuo Handa
2016-04-19 14:17             ` Michal Hocko
2016-04-19 15:07               ` Tetsuo Handa
2016-04-19 19:32                 ` Michal Hocko
2016-04-08 13:07   ` Michal Hocko
2016-04-08 13:07     ` Michal Hocko
