linux-kernel.vger.kernel.org archive mirror
* [RFC 0/3] oom: few enhancements
@ 2016-01-12 21:00 Michal Hocko
  2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML

Hi,
based on the recent discussions I have accumulated the following three
patches. I haven't tested them yet but I would like to hear your
opinions. The first patch only affects the sysrq+f OOM killer. I believe
it should be relatively uncontroversial.

Patch 2 tweaks how we handle child tasks standing in for the parent
oom victim. This should help with the test case Tetsuo has shown [1].

Patch 3 is just a rough idea. I can see objections to it, but it is
mainly meant to start a discussion about how to deal with small children
which basically do not sit on any memory. Maybe we do not need anything
like that at all and can rely on multiple OOM invocations as the safer
option. I don't know, but I would like to hear your opinions.

---
[1] http://lkml.kernel.org/r/201512292258.ABF87505.OFOSJLHMFVOQFt%40I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-12 21:00 [RFC 0/3] oom: few enhancements Michal Hocko
@ 2016-01-12 21:00 ` Michal Hocko
  2016-01-13  0:41   ` David Rientjes
  2016-01-12 21:00 ` [RFC 2/3] oom: Do not sacrifice already OOM killed children Michal Hocko
  2016-01-12 21:00 ` [RFC 3/3] oom: Do not try to sacrifice small children Michal Hocko
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When the OOM killer is invoked by the administrator via sysrq+f, it is
expected that a task will be killed to release the memory pressure.
Unlike the regular OOM killer, the forced one doesn't abort when
there already is an OOM victim selected. Instead oom_scan_process_thread
forces select_bad_process to check this thread. If this happens to be
the largest OOM hog then it will be selected again, and the forced
OOM killer wouldn't make any change in case the current OOM victim
is not able to terminate and free up the resources it is sitting on.

This patch makes sure that the forced OOM killer will skip over all
oom victims (with TIF_MEMDIE) and tasks with fatal_signal_pending,
on the basis that there is no guarantee those tasks are making progress
and there is no way to check whether they ever will. It is more
conservative to simply try another task.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index abefeeb42504..2b9dc5129a89 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 		case OOM_SCAN_OK:
 			break;
 		};
+
+		/*
+		 * If we are doing sysrq+f then it doesn't make any sense to
+		 * check OOM victim or killed task because it might be stuck
+		 * and unable to terminate while the forced OOM might be the
+		 * only option left to get the system back to work.
+		 */
+		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
+				fatal_signal_pending(p)))
+			continue;
+
 		points = oom_badness(p, NULL, oc->nodemask, totalpages);
 		if (!points || points < chosen_points)
 			continue;
-- 
2.6.4


* [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-12 21:00 [RFC 0/3] oom: few enhancements Michal Hocko
  2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko
@ 2016-01-12 21:00 ` Michal Hocko
  2016-01-13  0:45   ` David Rientjes
  2016-01-12 21:00 ` [RFC 3/3] oom: Do not try to sacrifice small children Michal Hocko
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_kill_process tries to sacrifice a child process of the selected
victim to save as much of the already done work as possible. This is
all well and good, but the current heuristic doesn't check the status of
the children before they are examined. So we might end up selecting the
same child over and over again just because it cannot terminate.

Tetsuo Handa has reported exactly this when trying to use sysrq+f to
resolve an OOM situation:
[   86.767482] a.out invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[   86.769905] a.out cpuset=/ mems_allowed=0
[   86.771393] CPU: 2 PID: 9573 Comm: a.out Not tainted 4.4.0-next-20160112+ #279
(...snipped...)
[   86.874710] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[   86.945286] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   86.947457] [ 9574]  1000  9574     1078       21       7       3        0             0 a.out
[   86.949568] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   86.951538] Killed process 9574 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   86.955296] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[   86.958035] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[   87.128808] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   87.130926] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[   87.133055] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   87.134989] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  116.979564] sysrq: SysRq : Manual OOM execution
[  116.984119] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  116.986367] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  117.157045] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  117.159191] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  117.161302] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  117.163250] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  119.043685] sysrq: SysRq : Manual OOM execution
[  119.046239] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  119.048453] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  119.215982] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  119.218122] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  119.220237] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  119.222129] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  120.179644] sysrq: SysRq : Manual OOM execution
[  120.206938] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  120.209152] kworker/0:8 cpuset=/ mems_allowed=0
[  120.376821] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  120.378924] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  120.381065] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  120.382929] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This patch simply rules out all children which are already OOM victims,
have fatal_signal_pending, or are already exiting. It simply doesn't
make any sense to kill such tasks again: they have already been killed
and the OOM situation wasn't resolved, which is why we had to select a
new OOM victim. This is true for both the regular and the forced oom
killer invocation.

While we are at it, let's separate this specific logic into its own
function.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 89 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 58 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2b9dc5129a89..8bca0b1e97f7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -671,6 +671,63 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+
+/*
+ * If any of victim's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent.  This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+static struct task_struct *
+try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
+		       unsigned long totalpages, struct mem_cgroup *memcg)
+{
+	struct task_struct *child_victim = NULL;
+	unsigned int victim_points = 0;
+	struct task_struct *t;
+
+	read_lock(&tasklist_lock);
+	for_each_thread(victim, t) {
+		struct task_struct *child;
+
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			/*
+			 * Skip over already OOM killed children as this hasn't
+			 * helped to resolve the situation obviously.
+			 */
+			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
+					fatal_signal_pending(child) ||
+					task_will_free_mem(child))
+				continue;
+
+			if (process_shares_mm(child, victim->mm))
+				continue;
+
+			child_points = oom_badness(child, memcg, oc->nodemask,
+								totalpages);
+			if (child_points > victim_points) {
+				if (child_victim)
+					put_task_struct(child_victim);
+				child_victim = child;
+				victim_points = child_points;
+				get_task_struct(child_victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	if (!child_victim)
+		goto out;
+
+	put_task_struct(victim);
+	victim = child_victim;
+
+out:
+	return victim;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -680,10 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		      struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
@@ -707,34 +761,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, oc->nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
+	victim = try_to_sacrifice_child(oc, victim, totalpages, memcg);
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
-- 
2.6.4


* [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-12 21:00 [RFC 0/3] oom: few enhancements Michal Hocko
  2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko
  2016-01-12 21:00 ` [RFC 2/3] oom: Do not sacrifice already OOM killed children Michal Hocko
@ 2016-01-12 21:00 ` Michal Hocko
  2016-01-13  0:51   ` David Rientjes
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

try_to_sacrifice_child will select the largest child of the selected OOM
victim in order to protect the parent and potentially save some of the
work it has done. We can, however, select a small child which has barely
touched any memory; killing it wouldn't lead to OOM recovery and would
only prolong the OOM condition, which is not desirable.

This patch simply skips the largest-child selection and falls back to
the parent (the original victim) if the child hasn't accumulated even
1MB worth of oom score. We are not checking the memory consumption
directly because we want to honor oom_score_adj here; that is the only
way to protect children from this heuristic in case they are more
important than the parent.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bca0b1e97f7..b5c0021c6462 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
 	if (!child_victim)
 		goto out;
 
-	put_task_struct(victim);
-	victim = child_victim;
+	/*
+	 * Protecting the parent makes sense only if killing the child
+	 * would release at least some memory (at least 1MB).
+	 */
+	if (K(victim_points) >= 1024) {
+		put_task_struct(victim);
+		victim = child_victim;
+	} else {
+		put_task_struct(child_victim);
+	}
 
 out:
 	return victim;
-- 
2.6.4


* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko
@ 2016-01-13  0:41   ` David Rientjes
  2016-01-13  9:30     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-13  0:41 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index abefeeb42504..2b9dc5129a89 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
>  		case OOM_SCAN_OK:
>  			break;
>  		};
> +
> +		/*
> +		 * If we are doing sysrq+f then it doesn't make any sense to
> +		 * check OOM victim or killed task because it might be stuck
> +		 * and unable to terminate while the forced OOM might be the
> +		 * only option left to get the system back to work.
> +		 */
> +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> +				fatal_signal_pending(p)))
> +			continue;
> +
>  		points = oom_badness(p, NULL, oc->nodemask, totalpages);
>  		if (!points || points < chosen_points)
>  			continue;

I think you can make a case for testing TIF_MEMDIE here since there is no 
chance of a panic from the sysrq trigger.  However, I'm not convinced that 
checking fatal_signal_pending() is appropriate.  I think it would be 
better for sysrq+f to first select a process with fatal_signal_pending() 
set so it silently gets access to memory reserves and then a second 
sysrq+f to choose a different process, if necessary, because of 
TIF_MEMDIE.


* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-12 21:00 ` [RFC 2/3] oom: Do not sacrifice already OOM killed children Michal Hocko
@ 2016-01-13  0:45   ` David Rientjes
  2016-01-13  9:36     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-13  0:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 2b9dc5129a89..8bca0b1e97f7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -671,6 +671,63 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> +
> +/*
> + * If any of victim's children has a different mm and is eligible for kill,
> + * the one with the highest oom_badness() score is sacrificed for its
> + * parent.  This attempts to lose the minimal amount of work done while
> + * still freeing memory.
> + */
> +static struct task_struct *
> +try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
> +		       unsigned long totalpages, struct mem_cgroup *memcg)
> +{
> +	struct task_struct *child_victim = NULL;
> +	unsigned int victim_points = 0;
> +	struct task_struct *t;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_thread(victim, t) {
> +		struct task_struct *child;
> +
> +		list_for_each_entry(child, &t->children, sibling) {
> +			unsigned int child_points;
> +
> +			/*
> +			 * Skip over already OOM killed children as this hasn't
> +			 * helped to resolve the situation obviously.
> +			 */
> +			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
> +					fatal_signal_pending(child) ||
> +					task_will_free_mem(child))
> +				continue;
> +

What guarantees that child had time to exit after it has been oom killed 
(better yet, what guarantees that it has even scheduled after it has been 
oom killed)?  It seems like this would quickly kill many children 
unnecessarily.

> +			if (process_shares_mm(child, victim->mm))
> +				continue;
> +
> +			child_points = oom_badness(child, memcg, oc->nodemask,
> +								totalpages);
> +			if (child_points > victim_points) {
> +				if (child_victim)
> +					put_task_struct(child_victim);
> +				child_victim = child;
> +				victim_points = child_points;
> +				get_task_struct(child_victim);
> +			}
> +		}
> +	}
> +	read_unlock(&tasklist_lock);
> +
> +	if (!child_victim)
> +		goto out;
> +
> +	put_task_struct(victim);
> +	victim = child_victim;
> +
> +out:
> +	return victim;
> +}
> +
>  /*
>   * Must be called while holding a reference to p, which will be released upon
>   * returning.
> @@ -680,10 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  		      struct mem_cgroup *memcg, const char *message)
>  {
>  	struct task_struct *victim = p;
> -	struct task_struct *child;
> -	struct task_struct *t;
>  	struct mm_struct *mm;
> -	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
>  	bool can_oom_reap = true;
> @@ -707,34 +761,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
>  		message, task_pid_nr(p), p->comm, points);
>  
> -	/*
> -	 * If any of p's children has a different mm and is eligible for kill,
> -	 * the one with the highest oom_badness() score is sacrificed for its
> -	 * parent.  This attempts to lose the minimal amount of work done while
> -	 * still freeing memory.
> -	 */
> -	read_lock(&tasklist_lock);
> -	for_each_thread(p, t) {
> -		list_for_each_entry(child, &t->children, sibling) {
> -			unsigned int child_points;
> -
> -			if (process_shares_mm(child, p->mm))
> -				continue;
> -			/*
> -			 * oom_badness() returns 0 if the thread is unkillable
> -			 */
> -			child_points = oom_badness(child, memcg, oc->nodemask,
> -								totalpages);
> -			if (child_points > victim_points) {
> -				put_task_struct(victim);
> -				victim = child;
> -				victim_points = child_points;
> -				get_task_struct(victim);
> -			}
> -		}
> -	}
> -	read_unlock(&tasklist_lock);
> -
> +	victim = try_to_sacrifice_child(oc, victim, totalpages, memcg);
>  	p = find_lock_task_mm(victim);
>  	if (!p) {
>  		put_task_struct(victim);


* Re: [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-12 21:00 ` [RFC 3/3] oom: Do not try to sacrifice small children Michal Hocko
@ 2016-01-13  0:51   ` David Rientjes
  2016-01-13  9:40     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-13  0:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8bca0b1e97f7..b5c0021c6462 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
>  	if (!child_victim)
>  		goto out;
>  
> -	put_task_struct(victim);
> -	victim = child_victim;
> +	/*
> +	 * Protecting the parent makes sense only if killing the child
> +	 * would release at least some memory (at least 1MB).
> +	 */
> +	if (K(victim_points) >= 1024) {
> +		put_task_struct(victim);
> +		victim = child_victim;
> +	} else {
> +		put_task_struct(child_victim);
> +	}
>  
>  out:
>  	return victim;

The purpose of sacrificing a child has always been to prevent a process 
that has been running with a substantial amount of work done from being 
terminated and losing all that work if it can be avoided.  This happens a 
lot: imagine a long-living front end client forking a child which simply 
collects stats and malloc information at regular intervals and writes 
them out to disk or over the network.  These processes may be quite small, 
and we're willing to happily sacrifice them if it will save the parent.  
This was, and still is, the intent of the sacrifice in the first place.

We must be able to deal with oom victims that are very small, since 
userspace has complete control in prioritizing these processes in the 
first place.


* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-13  0:41   ` David Rientjes
@ 2016-01-13  9:30     ` Michal Hocko
  2016-01-14  0:38       ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-13  9:30 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 12-01-16 16:41:50, David Rientjes wrote:
> On Tue, 12 Jan 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index abefeeb42504..2b9dc5129a89 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> >  		case OOM_SCAN_OK:
> >  			break;
> >  		};
> > +
> > +		/*
> > +		 * If we are doing sysrq+f then it doesn't make any sense to
> > +		 * check OOM victim or killed task because it might be stuck
> > +		 * and unable to terminate while the forced OOM might be the
> > +		 * only option left to get the system back to work.
> > +		 */
> > +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> > +				fatal_signal_pending(p)))
> > +			continue;
> > +
> >  		points = oom_badness(p, NULL, oc->nodemask, totalpages);
> >  		if (!points || points < chosen_points)
> >  			continue;
> 
> I think you can make a case for testing TIF_MEMDIE here since there is no 
> chance of a panic from the sysrq trigger.  However, I'm not convinced that 
> checking fatal_signal_pending() is appropriate. 

My thinking was that such a process would get TIF_MEMDIE once it hits
the OOM killer from the allocator.

> I think it would be 
> better for sysrq+f to first select a process with fatal_signal_pending() 
> set so it silently gets access to memory reserves and then a second 
> sysrq+f to choose a different process, if necessary, because of 
> TIF_MEMDIE.

The disadvantage of this approach is that sysrq+f might silently be
ignored and the administrator wouldn't get any indication of that. IMHO
sysrq+f would be much better defined if it _always_ selected and killed
a task. After all it is an explicit administrator action.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-13  0:45   ` David Rientjes
@ 2016-01-13  9:36     ` Michal Hocko
  2016-01-14  0:42       ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-13  9:36 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 12-01-16 16:45:35, David Rientjes wrote:
> On Tue, 12 Jan 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 2b9dc5129a89..8bca0b1e97f7 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -671,6 +671,63 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
> >  }
> >  
> >  #define K(x) ((x) << (PAGE_SHIFT-10))
> > +
> > +/*
> > + * If any of victim's children has a different mm and is eligible for kill,
> > + * the one with the highest oom_badness() score is sacrificed for its
> > + * parent.  This attempts to lose the minimal amount of work done while
> > + * still freeing memory.
> > + */
> > +static struct task_struct *
> > +try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
> > +		       unsigned long totalpages, struct mem_cgroup *memcg)
> > +{
> > +	struct task_struct *child_victim = NULL;
> > +	unsigned int victim_points = 0;
> > +	struct task_struct *t;
> > +
> > +	read_lock(&tasklist_lock);
> > +	for_each_thread(victim, t) {
> > +		struct task_struct *child;
> > +
> > +		list_for_each_entry(child, &t->children, sibling) {
> > +			unsigned int child_points;
> > +
> > +			/*
> > +			 * Skip over already OOM killed children as this hasn't
> > +			 * helped to resolve the situation obviously.
> > +			 */
> > +			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
> > +					fatal_signal_pending(child) ||
> > +					task_will_free_mem(child))
> > +				continue;
> > +
> 
> What guarantees that child had time to exit after it has been oom killed 
> (better yet, what guarantees that it has even scheduled after it has been 
> oom killed)?  It seems like this would quickly kill many children 
> unnecessarily.

If the child hasn't released any memory after all the allocator's
attempts to free memory, which take quite some time, then what is the
advantage of waiting even longer and possibly getting stuck? This is a
heuristic: we should have killed the selected victim but we have chosen
to reduce the impact by selecting the child process instead. If that
hasn't led to any improvement, I believe we should move on rather than
loop on a potentially unresolvable situation _just because_ of the said
heuristic.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-13  0:51   ` David Rientjes
@ 2016-01-13  9:40     ` Michal Hocko
  2016-01-14  0:43       ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-13  9:40 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 12-01-16 16:51:43, David Rientjes wrote:
> On Tue, 12 Jan 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8bca0b1e97f7..b5c0021c6462 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
> >  	if (!child_victim)
> >  		goto out;
> >  
> > -	put_task_struct(victim);
> > -	victim = child_victim;
> > +	/*
> > +	 * Protecting the parent makes sense only if killing the child
> > +	 * would release at least some memory (at least 1MB).
> > +	 */
> > +	if (K(victim_points) >= 1024) {
> > +		put_task_struct(victim);
> > +		victim = child_victim;
> > +	} else {
> > +		put_task_struct(child_victim);
> > +	}
> >  
> >  out:
> >  	return victim;
> 
> The purpose of sacrificing a child has always been to prevent a process 
> that has been running with a substantial amount of work done from being 
> terminated and losing all that work if it can be avoided.  This happens a 
> lot: imagine a long-living front end client forking a child which simply 
> collects stats and malloc information at a regular intervals and writes 
> them out to disk or over the network.  These processes may be quite small, 
> and we're willing to happily sacrifice them if it will save the parent.  
> This was, and still is, the intent of the sacrifice in the first place.

Yes, I understand the intention of the heuristic. I am just
contemplating what is way too small to sacrifice, because it clearly
doesn't make much sense to kill a task which is sitting on basically no
memory (just a few pages backing page tables and the stack) as this
would only prolong the OOM agony.

> We must be able to deal with oom victims that are very small, since 
> userspace has complete control in prioritizing these processes in the 
> first place.

Sure, the patch is not great, but I would like to come up with some
threshold below which children are way too small to be worth
considering. Or maybe there is another measure we can use.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-13  9:30     ` Michal Hocko
@ 2016-01-14  0:38       ` David Rientjes
  2016-01-14 11:00         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-14  0:38 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 13 Jan 2016, Michal Hocko wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index abefeeb42504..2b9dc5129a89 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> > >  		case OOM_SCAN_OK:
> > >  			break;
> > >  		};
> > > +
> > > +		/*
> > > +		 * If we are doing sysrq+f then it doesn't make any sense to
> > > +		 * check OOM victim or killed task because it might be stuck
> > > +		 * and unable to terminate while the forced OOM might be the
> > > +		 * only option left to get the system back to work.
> > > +		 */
> > > +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> > > +				fatal_signal_pending(p)))
> > > +			continue;
> > > +
> > >  		points = oom_badness(p, NULL, oc->nodemask, totalpages);
> > >  		if (!points || points < chosen_points)
> > >  			continue;
> > 
> > I think you can make a case for testing TIF_MEMDIE here since there is no 
> > chance of a panic from the sysrq trigger.  However, I'm not convinced that 
> > checking fatal_signal_pending() is appropriate. 
> 
> My thinking was that such a process would get TIF_MEMDIE if it hits the
> OOM from the allocator.
> 

It certainly would get TIF_MEMDIE set if it needs to allocate memory 
itself and it calls the oom killer.  That doesn't mean that we should kill 
a different process, though, when the killed process should exit and free 
its memory.  So NACK to the fatal_signal_pending() check here.

> > I think it would be 
> > better for sysrq+f to first select a process with fatal_signal_pending() 
> > set so it silently gets access to memory reserves and then a second 
> > sysrq+f to choose a different process, if necessary, because of 
> > TIF_MEMDIE.
> 
> The disadvantage of this approach is that sysrq+f might silently be
> ignored and the administrator doesn't have any signal about that.

The administrator can check the kernel log for an oom kill.  Killing 
additional processes is not going to help and has never been the semantics 
of the sysrq trigger; it is quite clearly defined as killing a process 
when out of memory, not serially killing everything on the machine.
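For reference, the selection change under discussion is small enough to model in plain userspace C. This is only a sketch: `struct task`, the `memdie` and `fatal_sig_pending` flags, and `select_victim()` are stand-ins invented here for `task_struct`, `TIF_MEMDIE`, `fatal_signal_pending()` and `select_bad_process()`, not the kernel API, and the existing `OOM_SCAN_*` handling for regular OOMs is not modeled:

```c
#include <stdbool.h>
#include <stddef.h>

struct task {
	const char *name;
	unsigned int points;      /* stand-in for an oom_badness() score */
	bool memdie;              /* stand-in for TIF_MEMDIE */
	bool fatal_sig_pending;   /* stand-in for fatal_signal_pending() */
};

/*
 * On a forced (sysrq+f) OOM, skip tasks that are already OOM victims or
 * already have a fatal signal pending, so the forced kill always lands
 * on a task that can still free memory.
 */
static const struct task *select_victim(const struct task *tasks, size_t n,
					bool is_sysrq_oom)
{
	const struct task *chosen = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct task *p = &tasks[i];

		if (is_sysrq_oom && (p->memdie || p->fatal_sig_pending))
			continue;
		if (!chosen || p->points > chosen->points)
			chosen = p;
	}
	return chosen;	/* NULL: nothing eligible to kill */
}
```

With `is_sysrq_oom` false the highest-scoring task wins regardless of its state, which is the behavior David defends; with it true, a stuck victim or an already-killed task can no longer shadow the next memory hog.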

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-13  9:36     ` Michal Hocko
@ 2016-01-14  0:42       ` David Rientjes
  0 siblings, 0 replies; 22+ messages in thread
From: David Rientjes @ 2016-01-14  0:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 13 Jan 2016, Michal Hocko wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 2b9dc5129a89..8bca0b1e97f7 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -671,6 +671,63 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
> > >  }
> > >  
> > >  #define K(x) ((x) << (PAGE_SHIFT-10))
> > > +
> > > +/*
> > > + * If any of victim's children has a different mm and is eligible for kill,
> > > + * the one with the highest oom_badness() score is sacrificed for its
> > > + * parent.  This attempts to lose the minimal amount of work done while
> > > + * still freeing memory.
> > > + */
> > > +static struct task_struct *
> > > +try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
> > > +		       unsigned long totalpages, struct mem_cgroup *memcg)
> > > +{
> > > +	struct task_struct *child_victim = NULL;
> > > +	unsigned int victim_points = 0;
> > > +	struct task_struct *t;
> > > +
> > > +	read_lock(&tasklist_lock);
> > > +	for_each_thread(victim, t) {
> > > +		struct task_struct *child;
> > > +
> > > +		list_for_each_entry(child, &t->children, sibling) {
> > > +			unsigned int child_points;
> > > +
> > > +			/*
> > > +			 * Skip over already OOM killed children as this hasn't
> > > +			 * helped to resolve the situation obviously.
> > > +			 */
> > > +			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
> > > +					fatal_signal_pending(child) ||
> > > +					task_will_free_mem(child))
> > > +				continue;
> > > +
> > 
> > What guarantees that child had time to exit after it has been oom killed 
> > (better yet, what guarantees that it has even scheduled after it has been 
> > oom killed)?  It seems like this would quickly kill many children 
> > unnecessarily.
> 
> If the child hasn't released any memory after all the allocator's attempts to
> free memory, which take quite some time, then what is the advantage of
> waiting even longer and possibly getting stuck?

No, we don't rely on implicit page allocator behavior or implementation to 
decide when additional processes should randomly be killed.  It is quite 
simple to get dozens of processes oom killed if your patch is introduced, 
just as it is possible to get dozens of processes oom killed unnecessarily 
if you remove TIF_MEMDIE checks from select_bad_process().  If you are 
concerned about the child never exiting, then it is quite simple to 
provide access to memory reserves in the page allocator in such 
situations, this is no different than TIF_MEMDIE threads failing to exit.
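Stripped of the tasklist locking and thread iteration, the loop being debated reduces to "pick the highest-scoring child that has not already been OOM killed". A userspace sketch (the names are invented here; `already_killed` folds together the patch's TIF_MEMDIE, fatal-signal and task_will_free_mem() checks):

```c
#include <stdbool.h>
#include <stddef.h>

struct child {
	const char *name;
	unsigned int points;	/* stand-in for an oom_badness() score */
	bool already_killed;	/* TIF_MEMDIE, fatal signal pending, or exiting */
};

/*
 * Pick the eligible child with the highest badness score, skipping
 * children that were already OOM killed -- killing them again evidently
 * did not resolve the OOM situation.
 */
static const struct child *pick_child_victim(const struct child *c, size_t n)
{
	const struct child *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (c[i].already_killed)
			continue;
		if (!best || c[i].points > best->points)
			best = &c[i];
	}
	return best;	/* NULL: fall back to killing the parent itself */
}
```

David's objection is precisely about the `already_killed` predicate: a child flagged here may simply not have been scheduled yet, so the skip can cascade through the sibling list.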

* Re: [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-13  9:40     ` Michal Hocko
@ 2016-01-14  0:43       ` David Rientjes
  0 siblings, 0 replies; 22+ messages in thread
From: David Rientjes @ 2016-01-14  0:43 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 13 Jan 2016, Michal Hocko wrote:

> > The purpose of sacrificing a child has always been to prevent a process 
> > that has been running with a substantial amount of work done from being 
> > terminated and losing all that work if it can be avoided.  This happens a 
> > lot: imagine a long-living front end client forking a child which simply 
> > collects stats and malloc information at a regular intervals and writes 
> > them out to disk or over the network.  These processes may be quite small, 
> > and we're willing to happily sacrifice them if it will save the parent.  
> > This was, and still is, the intent of the sacrifice in the first place.
> 
> Yes, I understand the intention of the heuristic. I am just contemplating
> what is too small to sacrifice, because it clearly doesn't make
> much sense to kill a task which is sitting on basically no memory (well,
> just a few pages backing page tables and stack); this would just
> prolong the OOM agony.
> 

Nothing is ever too small to kill, since we allow userspace to define its 
own oom priority.  We would rather kill your bash shell when you log in 
than your sshd.
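The userspace knob referred to here is /proc/&lt;pid&gt;/oom_score_adj, which ranges over [-1000, 1000]. A rough model of how it biases the score (the shape follows oom_badness(), but this is a simplification for illustration, not the kernel's exact arithmetic):

```c
/*
 * Rough model of oom_badness(): the score is the task's memory footprint
 * in pages plus a userspace bias, oom_score_adj in [-1000, 1000], scaled
 * against the total pages under consideration.  A negative result clamps
 * to zero, i.e. oom_score_adj == -1000 makes a task effectively
 * unkillable by the heuristic.
 */
static long model_badness(long pages, int oom_score_adj, long totalpages)
{
	long points = pages + (long)oom_score_adj * totalpages / 1000;

	return points > 0 ? points : 0;
}
```

With sshd biased to -1000 and a freshly logged-in bash at the default 0, even a tiny shell outranks a much larger sshd, which is exactly the "kill your bash shell rather than your sshd" outcome described above.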

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14  0:38       ` David Rientjes
@ 2016-01-14 11:00         ` Michal Hocko
  2016-01-14 21:51           ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-14 11:00 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed 13-01-16 16:38:26, David Rientjes wrote:
> On Wed, 13 Jan 2016, Michal Hocko wrote:
[...]
> > > I think it would be 
> > > better for sysrq+f to first select a process with fatal_signal_pending() 
> > > set so it silently gets access to memory reserves and then a second 
> > > sysrq+f to choose a different process, if necessary, because of 
> > > TIF_MEMDIE.
> > 
> > The disadvantage of this approach is that sysrq+f might silently be
> > ignored and the administrator doesn't have any signal about that.

Sorry I meant to say "administrator doesn't know why it has been
ignored". But it would have been better to say "administrator cannot do
anything about that".

> The administrator can check the kernel log for an oom kill.

What should the admin do when the request got ignored, though? sysrq+i?
sysrq+b?

> Killing additional processes is not going to help

Whether it is going to help or not is a different topic. My point is
that we have a sysrq action which might get ignored without providing
any explanation. But what I consider a much bigger issue is that the
deliberate request of the admin is ignored in the first place. As an
admin, I do not expect the system to know better than me when I perform
some action.

> and has never been the semantics 
> of the sysrq trigger, it is quite clearly defined as killing a process 
> when out of memory,

I disagree. Being OOM has never been the requirement for sysrq+f to kill
a task. It should kill _a_ memory hog. Your system might be thrashing to
the point that you are not able to log in and resolve the situation in a
reasonable time, yet you are still not OOM. sysrq+f is your only choice
then.

> not serial killing everything on the machine.

Which is not proposed here. The only thing I would like to achieve is to
get rid of OOM killer heuristics which assume some forward progress in
order to avoid killing a task. Those make perfect sense when the
system tries to resolve the OOM condition, but when the administrator has
clearly asked to _kill_ _a_ memory hog, then the result should be killing a
task which consumes memory (ideally the largest one).

What would be a regression scenario for this change? I can clearly see
deficiency of the current implementation so we should weigh cons and
pros here I believe.
-- 
Michal Hocko
SUSE Labs

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14 11:00         ` Michal Hocko
@ 2016-01-14 21:51           ` David Rientjes
  2016-01-15 10:12             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-14 21:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Thu, 14 Jan 2016, Michal Hocko wrote:

> > > > I think it would be 
> > > > better for sysrq+f to first select a process with fatal_signal_pending() 
> > > > set so it silently gets access to memory reserves and then a second 
> > > > sysrq+f to choose a different process, if necessary, because of 
> > > > TIF_MEMDIE.
> > > 
> > > The disadvantage of this approach is that sysrq+f might silently be
> > > ignored and the administrator doesn't have any signal about that.
> 
> Sorry I meant to say "administrator doesn't know why it has been
> ignored". But it would have been better to say "administrator cannot do
> anything about that".
> 
> > The administrator can check the kernel log for an oom kill.
> 
> What should the admin do when the request got ignored, though? sysrq+i?
> sysrq+b?
> 

We're not striving for a solution to general process exiting issues or oom 
livelock situations by requiring admins to use a sysrq trigger.  Sysrq+F 
could arguably be removed at this point since it solely existed to trigger 
the oom killer when newly rewritten reclaim thrashed and the page 
allocator didn't call it fast enough.  Since the oom killer allows killed 
processes to gain access to memory reserves, we could extend that to 
contexts that do not allow calling the oom killer to set TIF_MEMDIE, and 
then simply require root to send a SIGKILL rather than do sysrq+F.  I 
think it's time to kill sysrq+F and I'll send those two patches unless 
there is a usecase I'm not aware of.

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14 21:51           ` David Rientjes
@ 2016-01-15 10:12             ` Michal Hocko
  2016-01-15 15:37               ` One Thousand Gnomes
  2016-01-19 22:57               ` David Rientjes
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2016-01-15 10:12 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Thu 14-01-16 13:51:16, David Rientjes wrote:
> I think it's time to kill sysrq+F and I'll send those two patches
> unless there is a usecase I'm not aware of.

I have described one in the part you haven't quoted here. Let me repeat:
: Your system might be thrashing to the point that you are not able to log in
: and resolve the situation in a reasonable time, yet you are still not
: OOM. sysrq+f is your only choice then.

Could you clarify why it is better to ditch a potentially useful
emergency tool rather than make it work reliably and predictably?
-- 
Michal Hocko
SUSE Labs

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-15 10:12             ` Michal Hocko
@ 2016-01-15 15:37               ` One Thousand Gnomes
  2016-01-19 23:01                 ` David Rientjes
  2016-01-19 22:57               ` David Rientjes
  1 sibling, 1 reply; 22+ messages in thread
From: One Thousand Gnomes @ 2016-01-15 15:37 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, linux-mm, Tetsuo Handa, LKML

On Fri, 15 Jan 2016 11:12:18 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 14-01-16 13:51:16, David Rientjes wrote:
> > I think it's time to kill sysrq+F and I'll send those two patches
> > unless there is a usecase I'm not aware of.
> 
> I have described one in the part you haven't quoted here. Let me repeat:
> : Your system might be thrashing to the point that you are not able to log in
> : and resolve the situation in a reasonable time, yet you are still not
> : OOM. sysrq+f is your only choice then.
> 
> Could you clarify why it is better to ditch a potentially useful
> emergency tool rather than make it work reliably and predictably?

Even if it doesn't work reliably and predictably it is *still* better
than removing it as it works currently. Today we have "might save you a
reboot", the removal turns it into "you'll have to reboot". That's a
regression.

Alan

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-15 10:12             ` Michal Hocko
  2016-01-15 15:37               ` One Thousand Gnomes
@ 2016-01-19 22:57               ` David Rientjes
  2016-01-20  9:49                 ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-19 22:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Fri, 15 Jan 2016, Michal Hocko wrote:

> > I think it's time to kill sysrq+F and I'll send those two patches
> > unless there is a usecase I'm not aware of.
> 
> I have described one in the part you haven't quoted here. Let me repeat:
> : Your system might be thrashing to the point that you are not able to log in
> : and resolve the situation in a reasonable time, yet you are still not
> : OOM. sysrq+f is your only choice then.
> 
> Could you clarify why it is better to ditch a potentially useful
> emergency tool rather than make it work reliably and predictably?

I'm concerned about your usecase where the kernel requires admin 
intervention to resolve such an issue and there is nothing in the VM we 
can do to fix it.

If you have a specific test that demonstrates when your usecase is needed, 
please provide it so we can address the issue that it triggers.  I'd 
prefer to fix the issue in the VM rather than require human intervention, 
especially when we try to keep a very large number of machines running in 
our datacenters.

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-15 15:37               ` One Thousand Gnomes
@ 2016-01-19 23:01                 ` David Rientjes
  0 siblings, 0 replies; 22+ messages in thread
From: David Rientjes @ 2016-01-19 23:01 UTC (permalink / raw)
  To: One Thousand Gnomes; +Cc: Michal Hocko, linux-mm, Tetsuo Handa, LKML

On Fri, 15 Jan 2016, One Thousand Gnomes wrote:

> > > I think it's time to kill sysrq+F and I'll send those two patches
> > > unless there is a usecase I'm not aware of.
> > 
> > I have described one in the part you haven't quoted here. Let me repeat:
> > : Your system might be thrashing to the point that you are not able to log in
> > : and resolve the situation in a reasonable time, yet you are still not
> > : OOM. sysrq+f is your only choice then.
> > 
> > Could you clarify why it is better to ditch a potentially useful
> > emergency tool rather than make it work reliably and predictably?
> 
> Even if it doesn't work reliably and predictably it is *still* better
> than removing it as it works currently. Today we have "might save you a
> reboot", the removal turns it into "you'll have to reboot". That's a
> regression.
> 

Under what circumstance do you suppose sysrq+f would be used in your 
hypothetical?  If you have access to the shell, then you can kill any 
process at random (and you may even be able to make better realtime 
decisions than the oom killer), and it will gain access to memory reserves 
immediately under my proposal when it tries to allocate memory.  The net 
result is that calling the oom killer is no better than issuing the 
SIGKILL yourself.

This doesn't work if you are supposing sysrq+f is used without the ability 
to get access to the shell.  That's the point, I believe, that Michal has 
raised in this thread.  I'd like to address that issue directly rather 
than requiring human intervention to fix it.  If you have deployed a very 
large number of machines to your datacenters, you can't possibly have the 
resources to do this.

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-19 22:57               ` David Rientjes
@ 2016-01-20  9:49                 ` Michal Hocko
  2016-01-21  0:01                   ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-01-20  9:49 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 19-01-16 14:57:33, David Rientjes wrote:
> On Fri, 15 Jan 2016, Michal Hocko wrote:
> 
> > > I think it's time to kill sysrq+F and I'll send those two patches
> > > unless there is a usecase I'm not aware of.
> > 
> > I have described one in the part you haven't quoted here. Let me repeat:
> > : Your system might be thrashing to the point that you are not able to log in
> > : and resolve the situation in a reasonable time, yet you are still not
> > : OOM. sysrq+f is your only choice then.
> > 
> > Could you clarify why it is better to ditch a potentially useful
> > emergency tool rather than make it work reliably and predictably?
> 
> I'm concerned about your usecase where the kernel requires admin 
> intervention to resolve such an issue and there is nothing in the VM we 
> can do to fix it.
> 
> If you have a specific test that demonstrates when your usecase is needed, 
> please provide it so we can address the issue that it triggers.

No, I do not have a specific load in mind. But let's be realistic. There
will _always_ be corner cases where the VM cannot react properly or in a
timely fashion.

> I'd prefer to fix the issue in the VM rather than require human
> intervention, especially when we try to keep a very large number of
> machines running in our datacenters.

It is always preferable to resolve an mm-related issue automagically,
of course. We should strive for robustness as much as possible, but that
doesn't mean we should take the only emergency tool out of the
administrator's hands.

To be honest, I really fail to understand your line of argumentation
here. Just because you think that sysrq+f might not be helpful in the
large datacenters you care about doesn't mean that it is not helpful in
other setups.

Removing the functionality is out of the question IMHO, so can we
please start discussing how to make it more predictable?
-- 
Michal Hocko
SUSE Labs

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-20  9:49                 ` Michal Hocko
@ 2016-01-21  0:01                   ` David Rientjes
  2016-01-21  9:15                     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2016-01-21  0:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 20 Jan 2016, Michal Hocko wrote:

> No, I do not have a specific load in mind. But let's be realistic. There
> will _always_ be corner cases where the VM cannot react properly or in a
> timely fashion.
> 

Then let's identify it and fix it, like we do with any other bug?  I'm 99% 
certain you are not advocating that human intervention is the ideal 
solution to prevent lengthy stalls or livelocks.

I can't speak for all possible configurations and workloads; the only 
thing we use sysrq+f for is automated testing of the oom killer itself.  
It would help to know of any situations when people actually need to use 
this to solve issues and then fix those issues rather than insisting that 
this is the ideal solution.

> To be honest, I really fail to understand your line of argumentation
> here. Just because you think that sysrq+f might not be helpful in the
> large datacenters you care about doesn't mean that it is not helpful in
> other setups.
> 

This type of message isn't really contributing anything.  You don't have a 
specific load in mind, you can't identify a pending bug that people have 
complained about, you presumably can't show a testcase that demonstrates 
how it's required, yet you're arguing that we should keep a debugging tool 
around because you think somebody somewhere sometime might use it.

 [ I would imagine that users would be unhappy they have to kill processes 
   already, and would have reported how ridiculous it is that they had to
   use sysrq+f, but I haven't seen those bug reports. ]

I want the VM to be responsive, I don't want it to thrash forever, and I 
want it to not require root to trigger a sysrq to have the kernel kill a 
process for the VM to work properly.  We either need to fix the issue that 
causes the unresponsiveness or oom kill processes earlier.  This is very 
simple.

* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-21  0:01                   ` David Rientjes
@ 2016-01-21  9:15                     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-01-21  9:15 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed 20-01-16 16:01:54, David Rientjes wrote:
> On Wed, 20 Jan 2016, Michal Hocko wrote:
> 
> > No, I do not have a specific load in mind. But let's be realistic. There
> > will _always_ be corner cases where the VM cannot react properly or in a
> > timely fashion.
> > 
> 
> Then let's identify it and fix it, like we do with any other bug?  I'm 99% 
> certain you are not advocating that human intervention is the ideal 
> solution to prevent lengthy stalls or livelocks.

I didn't claim that! Please read what I have written. I consider sysrq+f
a _last resort_ emergency tool for when the system doesn't behave in the
expected way.

> I can't speak for all possible configurations and workloads; the only 
> thing we use sysrq+f for is automated testing of the oom killer itself.  

That is your use case, and it is not the one this functionality was
introduced for. This is _not a debugging_ tool. Back in 2005 it was
added precisely to allow an immediate intervention while the
system was thrashing heavily.

> It would help to know of any situations when people actually need to use 
> this to solve issues and then fix those issues rather than insisting that 
> this is the ideal solution.

I fully agree that such issues should be investigated and fixed. That
is no argument against having an emergency tool that allows the admin to
intervene right away when they happen.

> > To be honest, I really fail to understand your line of argumentation
> > here. Just because you think that sysrq+f might not be helpful in the
> > large datacenters you care about doesn't mean that it is not helpful in
> > other setups.
> > 
> 
> This type of message isn't really contributing anything.  You don't have a 
> specific load in mind, you can't identify a pending bug that people have 
> complained about, you presumably can't show a testcase that demonstrates 
> how it's required, yet you're arguing that we should keep a debugging tool 
> around because you think somebody somewhere sometime might use it.

Look, I am getting tired of this discussion. You seem to completely
ignore the emergency aspect of sysrq+f just because it doesn't seem to
fit _your_ particular usecase. I have seen admins using sysrq+f when
a large application went crazy and started thrashing to the point where
even ssh to the machine took ages and sysrq+f over a serial console was
the only deterministic way to make the system usable again. Such things
are still real. Just look at the linux-mm ML (just off hand
http://lkml.kernel.org/r/20151221123557.GE3060%40orkisz). You can argue
we should fix them, and I agree, but swap and page cache thrashing have
been real for ages; those are hard problems and very likely to be with
us for some time yet. Until our MM subsystem and everything else that
might interfere are perfect, we need a sledgehammer. And if we have a
hammer then we should really make sure it hits something when used
rather than hitting thin air.

The patch proposed here doesn't make the code more complicated or harder
to maintain. It even doesn't have any side effects outside of sysrq+f
triggered OOM. Your only argument so far was:
"
: It certainly would get TIF_MEMDIE set if it needs to allocate memory
: itself and it calls the oom killer.  That doesn't mean that we should
: kill a different process, though, when the killed process should exit
: and free its memory.  So NACK to the fatal_signal_pending() check here.
"

And that argument is fundamentally broken, because a killed process is
not guaranteed to exit and free its memory. Moreover, sysrq+f is by
definition an async action, so it might race with a task being killed,
and that would deactivate it. The race is quite unlikely, but emergency
tools should be as robust and reliable as possible. You have also
ignored my question about what kind of regression such a change would
cause.
-- 
Michal Hocko
SUSE Labs

end of thread, other threads:[~2016-01-21  9:15 UTC | newest]

Thread overview: 22+ messages
2016-01-12 21:00 [RFC 0/3] oom: few enahancements Michal Hocko
2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko
2016-01-13  0:41   ` David Rientjes
2016-01-13  9:30     ` Michal Hocko
2016-01-14  0:38       ` David Rientjes
2016-01-14 11:00         ` Michal Hocko
2016-01-14 21:51           ` David Rientjes
2016-01-15 10:12             ` Michal Hocko
2016-01-15 15:37               ` One Thousand Gnomes
2016-01-19 23:01                 ` David Rientjes
2016-01-19 22:57               ` David Rientjes
2016-01-20  9:49                 ` Michal Hocko
2016-01-21  0:01                   ` David Rientjes
2016-01-21  9:15                     ` Michal Hocko
2016-01-12 21:00 ` [RFC 2/3] oom: Do not sacrifice already OOM killed children Michal Hocko
2016-01-13  0:45   ` David Rientjes
2016-01-13  9:36     ` Michal Hocko
2016-01-14  0:42       ` David Rientjes
2016-01-12 21:00 ` [RFC 3/3] oom: Do not try to sacrifice small children Michal Hocko
2016-01-13  0:51   ` David Rientjes
2016-01-13  9:40     ` Michal Hocko
2016-01-14  0:43       ` David Rientjes
