* [PATCH] mm,oom: Re-enable OOM killer using timeout.
@ 2016-04-19 15:06 Tetsuo Handa
  2016-04-19 20:07 ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-19 15:06 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

From dbe7dcb4ff757d5b1865e935f840d0418af804d3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 19 Apr 2016 23:11:31 +0900
Subject: [PATCH] mm,oom: Re-enable OOM killer using timeout.

We are trying to reduce the possibility of hitting OOM livelock by
introducing the OOM reaper. But the OOM reaper cannot reap the victim's
memory if the victim's mmap_sem is held for write. It is possible that
a thread which got TIF_MEMDIE while holding mmap_sem for write gets
stuck in an unkillable wait, waiting for another thread's memory
allocation. This problem cannot be avoided even after we convert
down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem).
Since we cannot afford to convert all waits to killable ones, we should
prepare for such a situation.
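
For illustration, the problematic pattern looks roughly like the sketch
below. This is only a minimal sketch of the scenario, not code from this
patch; some_wq and resource_ready are placeholders for whatever the
writer ends up waiting on:

----------
/* Victim thread: takes mmap_sem for write, then blocks unkillably. */
static void victim_path(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* already acquired, so down_write_killable() would not help */
	/* TASK_UNINTERRUPTIBLE: SIGKILL cannot wake this thread up */
	wait_event(some_wq, resource_ready);	/* placeholders, not real kernel symbols */
	up_write(&mm->mmap_sem);
}

/* OOM reaper: can only reap if it can take mmap_sem for read. */
static void oom_reaper_path(struct mm_struct *mm)
{
	if (!down_read_trylock(&mm->mmap_sem))
		return;	/* the writer above never releases it, so reaping gives up */
	/* ... unmap the victim's anonymous memory ... */
	up_read(&mm->mmap_sem);
}
----------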

The simplest way is to mark the victim's thread group as no longer
OOM-killable by updating the victim's signal->oom_score_adj to
OOM_SCORE_ADJ_MIN.

But doing so is not sufficient for the !oom_kill_allocating_task case
because oom_scan_process_thread() will find a TIF_MEMDIE thread and
continue waiting. We would need to revoke TIF_MEMDIE from all victim
threads, but TIF_MEMDIE will be automatically granted again to
potentially all victim threads due to fatal_signal_pending() or
task_will_free_mem() in out_of_memory(). We do not want to walk the
process list over and over in a racy loop just to revoke TIF_MEMDIE
from all victim threads.

Also, doing so breaks the oom_kill_allocating_task case: since
oom_scan_process_thread() is not called, we will not wait for existing
TIF_MEMDIE threads, and as a result all children of the calling process
will be needlessly OOM-killed.

Therefore, we should not play with the victim's signal->oom_score_adj
value and/or the victim's TIF_MEMDIE flag.

This patch adds a timeout for handling corner cases where a TIF_MEMDIE
thread gets stuck. Since the timeout is checked in oom_unkillable_task(),
oom_scan_process_thread() will not find a TIF_MEMDIE thread
(for the !oom_kill_allocating_task case) and oom_badness() will return 0
(for the oom_kill_allocating_task case).

With this patch applied, the kernel will in effect press SysRq-f
automatically if the OOM reaper cannot reap the victim's memory, so the
system will never stay in OOM livelock forever as long as the OOM killer
is invoked.

For those who prefer the existing behavior (i.e. let the kernel OOM
livelock forever if the OOM reaper cannot reap the victim's memory), the
timeout is set to a very large value (effectively no timeout) by default.
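
As a usage illustration (a sketch, not part of this patch): an
administrator who wants the next victim to be selected after 60 seconds
would set the new knob, e.g. via "sysctl vm.oom_victim_wait_timeout=60"
or from a small helper:

----------
#include <stdio.h>

int main(void)
{
	/* the path follows from the vm_table entry added by this patch */
	FILE *f = fopen("/proc/sys/vm/oom_victim_wait_timeout", "w");

	if (!f) {
		perror("oom_victim_wait_timeout");
		return 1;
	}
	fprintf(f, "60\n");	/* give each victim 60 seconds */
	return fclose(f) ? 1 : 0;
}
----------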

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  2 ++
 include/linux/sched.h |  1 +
 kernel/sysctl.c       | 12 ++++++++++++
 mm/oom_kill.c         | 18 ++++++++++++++++++
 4 files changed, 33 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index abaab8e..4d2a97e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -112,4 +112,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_oom_victim_wait_timeout;
+
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d8f366c..6e701b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -790,6 +790,7 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	unsigned long oom_start; /* If not 0, timestamp of being OOM-killed. */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d11c22d..3092ec2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -147,6 +147,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
 static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
 #endif
 
+static unsigned long oom_victim_wait_timeout_min = 1;
+static unsigned long oom_victim_wait_timeout_max = (LONG_MAX / HZ);
+
 #ifdef CONFIG_INOTIFY_USER
 #include <linux/inotify.h>
 #endif
@@ -1222,6 +1225,15 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_victim_wait_timeout",
+		.data		= &sysctl_oom_victim_wait_timeout,
+		.maxlen		= sizeof(sysctl_oom_victim_wait_timeout),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1         = &oom_victim_wait_timeout_min,
+		.extra2         = &oom_victim_wait_timeout_max,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104..a2e1543a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -47,6 +47,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
+unsigned long sysctl_oom_victim_wait_timeout = (LONG_MAX / HZ);
 
 DEFINE_MUTEX(oom_lock);
 
@@ -149,6 +150,12 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
+	/* Already OOM-killed p might get stuck at unkillable wait */
+	if (p->signal->oom_start &&
+	    time_after(jiffies, p->signal->oom_start
+		       + sysctl_oom_victim_wait_timeout * HZ))
+		return true;
+
 	return false;
 }
 
@@ -668,6 +675,17 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	/*
+	 * The task might get stuck at unkillable wait with mmap_sem held for
+	 * write. In that case, even the OOM reaper will not help.
+	 */
+	if (!tsk->signal->oom_start) {
+		unsigned long oom_start = jiffies;
+
+		if (!oom_start)
+			oom_start--;
+		tsk->signal->oom_start = oom_start;
+	}
+	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * any memory and livelock. freezing_slow_path will tell the freezer
-- 
1.8.3.1

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-19 15:06 [PATCH] mm,oom: Re-enable OOM killer using timeout Tetsuo Handa
@ 2016-04-19 20:07 ` Michal Hocko
  2016-04-19 21:55   ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-19 20:07 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 20-04-16 00:06:05, Tetsuo Handa wrote:
> >From dbe7dcb4ff757d5b1865e935f840d0418af804d3 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 19 Apr 2016 23:11:31 +0900
> Subject: [PATCH] mm,oom: Re-enable OOM killer using timeout.
> 
> We are trying to reduce the possibility of hitting OOM livelock by
> introducing the OOM reaper. But the OOM reaper cannot reap the victim's
> memory if the victim's mmap_sem is held for write. It is possible that
> the thread which got TIF_MEMDIE while holding mmap_sem for write gets
> stuck at unkillable wait waiting for other thread's memory allocation.
> This problem cannot be avoided even after we convert
> down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem).
> Since we cannot afford converting all waits killable, we should prepare
> for such situation.
> 
> The simplest way is to mark the victim's thread group as no longer
> OOM-killable by updating victim's signal->oom_score_adj to
> OOM_SCORE_ADJ_MIN.
> 
> But doing so is not sufficient for !oom_kill_allocating_task case
> because oom_scan_process_thread() will find TIF_MEMDIE thread and
> continue waiting. We will need to revoke TIF_MEMDIE from all victim
> threads but TIF_MEMDIE will be automatically granted to potentially all
> victim threads due to fatal_signal_pending() or task_will_free_mem() in
> out_of_memory(). We don't want to walk the process list so many times
> in order to revoke TIF_MEMDIE from all victim threads from racy loop.
> 
> Also, doing so breaks oom_kill_allocating_task case because we will not
> wait for existing TIF_MEMDIE threads because oom_scan_process_thread()
> is not called. As a result, all children of the calling process will be
> needlessly OOM-killed.
> 
> Therefore, we should not play with victim's signal->oom_score_adj value
> and/or victim's TIF_MEMDIE flag.
> 
> This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> oom_scan_process_thread() will not find TIF_MEMDIE thread
> (for !oom_kill_allocating_task case) and oom_badness() will return 0
> (for oom_kill_allocating_task case).
> 
> By applying this patch, the kernel will automatically press SysRq-f if
> the OOM reaper cannot reap the victim's memory, and we will never OOM
> livelock forever as long as the OOM killer is called.

Which will not guarantee anything, as already pointed out several times
before. So I think this is not really that useful. I have said it
earlier and will repeat it again: any timeout-based solution which
doesn't guarantee that the system will be in a consistent state (reboot,
panic or kill all existing tasks) after the specified timeout is
pointless.

I believe that a lockup is much less likely with the oom reaper and that
we are not really urged to provide a new knob with a random semantic. If
we really want to have a timeout-based thing, we had better make it
behave reliably.

> For those who prefer existing behavior (i.e. let the kernel OOM livelock
> forever if the OOM reaper cannot reap the victim's memory), the timeout
> is set to very large value (effectively no timeout) by default.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h   |  2 ++
>  include/linux/sched.h |  1 +
>  kernel/sysctl.c       | 12 ++++++++++++
>  mm/oom_kill.c         | 18 ++++++++++++++++++
>  4 files changed, 33 insertions(+)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index abaab8e..4d2a97e 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -112,4 +112,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
>  extern int sysctl_panic_on_oom;
> +extern unsigned long sysctl_oom_victim_wait_timeout;
> +
>  #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d8f366c..6e701b3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -790,6 +790,7 @@ struct signal_struct {
>  	short oom_score_adj;		/* OOM kill score adjustment */
>  	short oom_score_adj_min;	/* OOM kill score adjustment min value.
>  					 * Only settable by CAP_SYS_RESOURCE. */
> +	unsigned long oom_start; /* If not 0, timestamp of being OOM-killed. */
>  
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index d11c22d..3092ec2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -147,6 +147,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
>  static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
>  #endif
>  
> +static unsigned long oom_victim_wait_timeout_min = 1;
> +static unsigned long oom_victim_wait_timeout_max = (LONG_MAX / HZ);
> +
>  #ifdef CONFIG_INOTIFY_USER
>  #include <linux/inotify.h>
>  #endif
> @@ -1222,6 +1225,15 @@ static struct ctl_table vm_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.procname	= "oom_victim_wait_timeout",
> +		.data		= &sysctl_oom_victim_wait_timeout,
> +		.maxlen		= sizeof(sysctl_oom_victim_wait_timeout),
> +		.mode		= 0644,
> +		.proc_handler	= proc_doulongvec_minmax,
> +		.extra1         = &oom_victim_wait_timeout_min,
> +		.extra2         = &oom_victim_wait_timeout_max,
> +	},
> +	{
>  		.procname	= "overcommit_ratio",
>  		.data		= &sysctl_overcommit_ratio,
>  		.maxlen		= sizeof(sysctl_overcommit_ratio),
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7098104..a2e1543a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -47,6 +47,7 @@
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
>  int sysctl_oom_dump_tasks = 1;
> +unsigned long sysctl_oom_victim_wait_timeout = (LONG_MAX / HZ);
>  
>  DEFINE_MUTEX(oom_lock);
>  
> @@ -149,6 +150,12 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (!has_intersects_mems_allowed(p, nodemask))
>  		return true;
>  
> +	/* Already OOM-killed p might get stuck at unkillable wait */
> +	if (p->signal->oom_start &&
> +	    time_after(jiffies, p->signal->oom_start
> +		       + sysctl_oom_victim_wait_timeout * HZ))
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -668,6 +675,17 @@ void mark_oom_victim(struct task_struct *tsk)
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
>  	/*
> +	 * The task might get stuck at unkillable wait with mmap_sem held for
> +	 * write. In that case, even the OOM reaper will not help.
> +	 */
> +	if (!tsk->signal->oom_start) {
> +		unsigned long oom_start = jiffies;
> +
> +		if (!oom_start)
> +			oom_start--;
> +		tsk->signal->oom_start = oom_start;
> +	}
> +	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
>  	 * if it is frozen because OOM killer wouldn't be able to free
>  	 * any memory and livelock. freezing_slow_path will tell the freezer
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-19 20:07 ` Michal Hocko
@ 2016-04-19 21:55   ` Tetsuo Handa
  2016-04-20 10:37     ` [PATCH v2] " Tetsuo Handa
  2016-04-20 14:47     ` [PATCH] " Michal Hocko
  0 siblings, 2 replies; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-19 21:55 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> > This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> > thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> > oom_scan_process_thread() will not find TIF_MEMDIE thread
> > (for !oom_kill_allocating_task case) and oom_badness() will return 0
> > (for oom_kill_allocating_task case).
> > 
> > By applying this patch, the kernel will automatically press SysRq-f if
> > the OOM reaper cannot reap the victim's memory, and we will never OOM
> > livelock forever as long as the OOM killer is called.
> 
> Which will not guarantee anything as already pointed out several times
> before. So I think this is not really that useful. I have said it
> earlier and will repeat it again. Any timeout based solution which
> doesn't guarantee that the system will be in a consistent state (reboot,
> panic or kill all existing tasks) after the specified timeout is
> pointless.

Triggering the reboot/panic is the worst action. Killing all existing tasks
is the next worst action. Thus, I prefer killing tasks one by one.

I'm OK with shortening the timeout like N (when waiting for the 1st victim)
+ N/2 (the 2nd victim) + N/4 (the 3rd victim) + N/8 (the 4th victim) + ...,
but is it worth complicating such an unlikely path?
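
To make the idea above concrete, a minimal sketch (not proposed code):
"base" would be the configured timeout in jiffies, and oom_victims is the
existing atomic count of current victims:

----------
static unsigned long shrinking_victim_timeout(unsigned long base)
{
	/* N for the 1st victim, N/2 for the 2nd, N/4 for the 3rd, ... */
	return base >> atomic_read(&oom_victims);
}
----------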

> 
> I believe that the chances of the lockup are much less likely with the
> oom reaper and that we are not really urged to provide a new knob with a
> random semantic. If we really want to have a timeout based thing better
> make it behave reliably.

The threshold that administrators can tolerate waiting for varies. Some
may want to set a few seconds because of a 10-second /dev/watchdog
timeout; others may want to set one minute because they do not use a
watchdog. Thus, I think we should not hard-code the timeout.

* [PATCH v2] mm,oom: Re-enable OOM killer using timeout.
  2016-04-19 21:55   ` Tetsuo Handa
@ 2016-04-20 10:37     ` Tetsuo Handa
  2016-04-25 11:47       ` Michal Hocko
  2016-04-20 14:47     ` [PATCH] " Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-20 10:37 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> > > thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> > > oom_scan_process_thread() will not find TIF_MEMDIE thread
> > > (for !oom_kill_allocating_task case) and oom_badness() will return 0
> > > (for oom_kill_allocating_task case).
> > >
> > > By applying this patch, the kernel will automatically press SysRq-f if
> > > the OOM reaper cannot reap the victim's memory, and we will never OOM
> > > livelock forever as long as the OOM killer is called.
> >
> > Which will not guarantee anything as already pointed out several times
> > before. So I think this is not really that useful. I have said it
> > earlier and will repeat it again. Any timeout based solution which
> > doesn't guarantee that the system will be in a consistent state (reboot,
> > panic or kill all existing tasks) after the specified timeout is
> > pointless.
>
> Triggering the reboot/panic is the worst action. Killing all existing tasks
> is the next worst action. Thus, I prefer killing tasks one by one.
>
> I'm OK with shortening the timeout like N (when waiting for the 1st victim)
> + N/2 (the 2nd victim) + N/4 (the 3rd victim) + N/8 (the 4th victim) + ...
> but does it worth complicating the least unlikely path?

Well, (N / (1 << atomic_read(&oom_victims))) is not accurate.
Having another timeout is simpler.

>
> >
> > I believe that the chances of the lockup are much less likely with the
> > oom reaper and that we are not really urged to provide a new knob with a
> > random semantic. If we really want to have a timeout based thing better
> > make it behave reliably.
>
> The threshold which the administrator can wait for ranges. Some may want to
> set few seconds because of 10 seconds /dev/watchdog timeout, others may want
> to set one minute because of not using watchdog. Thus, I think we should not
> hard code the timeout.

Well, I already tried to propose it using two timeouts (one for selecting
the next victim, the other for triggering a kernel panic) at
http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp .

I thought I could apply these timeouts on a per-signal_struct basis rather
than a per-task_struct basis because you proposed changing
task_will_free_mem() to guarantee that the whole thread group is going
down. But it turned out that since signal_struct has a wider scope than
the OOM livelock check (i.e. signal_struct remains even after TASK_DEAD,
while the OOM livelock check must stop as of clearing TIF_MEMDIE), we need
to apply these timestamps only while one of the threads has TIF_MEMDIE.
This avoids the need for find_lock_non_victim_task_mm() proposed at
http://lkml.kernel.org/r/201602171929.IFG12927.OVFJOQHOSMtFFL@I-love.SAKURA.ne.jp .

----------------------------------------
From f3da6c8e98365e59226500142fcc3855f336d61f Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 20 Apr 2016 12:57:37 +0900
Subject: [PATCH v2] mm,oom: Re-enable OOM killer using timeout.

We are trying to reduce the possibility of hitting OOM livelock by
introducing the OOM reaper. But the OOM reaper cannot reap the victim's
memory if the victim's mmap_sem is held for write. It is possible that
a thread which got TIF_MEMDIE while holding mmap_sem for write gets
stuck in an unkillable wait, waiting for another thread's memory
allocation. This problem cannot be avoided even after we convert
down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem).
Since we cannot afford to convert all waits to killable ones, we should
prepare for such a situation.

The simplest way is to mark the victim's thread group as no longer
OOM-killable by updating the victim's signal->oom_score_adj to
OOM_SCORE_ADJ_MIN in oom_kill_process().

But doing so is not sufficient for the !oom_kill_allocating_task case
because oom_scan_process_thread() will find a TIF_MEMDIE thread and
continue waiting. We would need to revoke TIF_MEMDIE from all victim
threads, but TIF_MEMDIE will be automatically granted again to
potentially all victim threads due to fatal_signal_pending() or
task_will_free_mem() in out_of_memory(). We do not want to walk the
process list over and over in a racy loop just to revoke TIF_MEMDIE
from all victim threads.

Also, doing so breaks the oom_kill_allocating_task case: since
oom_scan_process_thread() is not called, we will not wait for existing
TIF_MEMDIE threads, and as a result all children of the calling process
will be needlessly OOM-killed if existing TIF_MEMDIE threads fail to
terminate immediately.

Therefore, we should not play with the victim's signal->oom_score_adj
value and/or the victim's TIF_MEMDIE flag.

This patch adds two timeouts for handling corner cases where a TIF_MEMDIE
thread gets stuck: one for selecting the next OOM victim and the other
for triggering a kernel panic. Since these timeouts are checked in
oom_unkillable_task(), oom_scan_process_thread() will not find a
TIF_MEMDIE thread (for the !oom_kill_allocating_task case) and
oom_badness() will return 0 (for the oom_kill_allocating_task case) once
the timeout for selecting the next OOM victim has expired.

With this patch applied, the kernel will in effect press SysRq-f or
trigger a kernel panic automatically if the OOM reaper cannot reap the
victim's memory, so the system will never stay in OOM livelock forever as
long as the OOM killer is invoked. An example of a panic on a 10-second
timeout (taken with the OOM reaper disabled for demonstration) is shown
below.

----------
[   75.736534] Out of memory: Kill process 1241 (oom-write) score 851 or sacrifice child
[   75.740947] Killed process 1250 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   85.741009] write           D ffff88003b493cb8     0  1250   1241 0x20120084
[   85.745823]  ffff88003b493cb8 ffff88003fb88040 ffff88003aeb0000 ffff88003b494000
[   85.750348]  ffff880038736548 0000000000000246 ffff88003aeb0000 00000000ffffffff
[   85.754851]  ffff88003b493cd0 ffffffff81667f30 ffff880038736540 ffff88003b493ce0
[   85.759346] Call Trace:
[   85.761274]  [<ffffffff81667f30>] schedule+0x30/0x80
[   85.764546]  [<ffffffff81668239>] schedule_preempt_disabled+0x9/0x10
[   85.768375]  [<ffffffff81669def>] mutex_lock_nested+0x14f/0x3a0
[   85.771986]  [<ffffffffa0240e1f>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[   85.776421]  [<ffffffffa0240e1f>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[   85.780679]  [<ffffffffa024103a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[   85.784627]  [<ffffffff811c37b7>] __vfs_write+0xc7/0x100
[   85.787923]  [<ffffffff811c442d>] vfs_write+0x9d/0x190
[   85.791169]  [<ffffffff811e3dca>] ? __fget_light+0x6a/0x90
[   85.794577]  [<ffffffff811c5893>] SyS_write+0x53/0xd0
[   85.797721]  [<ffffffff810037b4>] do_int80_syscall_32+0x64/0x190
[   85.801378]  [<ffffffff8166ef3b>] entry_INT80_compat+0x3b/0x50
[   85.804933] Kernel panic - not syncing: Out of memory and 1250 (write) can not die...
----------

For those who prefer the existing behavior (i.e. let the kernel OOM
livelock forever if the OOM reaper cannot reap the victim's memory),
these timeouts are set to a very large value (effectively no timeout)
by default.
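
As a usage illustration (a sketch, not part of this patch): with these two
knobs an administrator might allow re-selecting a victim after 30 seconds
and a panic after 300 seconds, keeping the skip timeout well below the
panic timeout so that several victims can be tried first:

----------
#include <stdio.h>
#include <stdlib.h>

/* the paths follow from the vm_table entries added by this patch */
static void set_knob(const char *path, unsigned long secs)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%lu\n", secs) < 0 || fclose(f)) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	set_knob("/proc/sys/vm/oom_victim_skip_secs", 30);
	set_knob("/proc/sys/vm/oom_victim_panic_secs", 300);
	return 0;
}
----------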

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  2 ++
 include/linux/sched.h |  2 ++
 kernel/sysctl.c       | 21 +++++++++++++++++++++
 mm/oom_kill.c         | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index abaab8e..7fcb586 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -112,4 +112,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_oom_victim_skip_secs;
+extern unsigned long sysctl_oom_victim_panic_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d8f366c..e6cb766 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -790,6 +790,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	/* If not 0, timestamp of getting TIF_MEMDIE for the first time. */
+	unsigned long oom_start;
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d11c22d..bdd8a78 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -147,6 +147,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
 static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
 #endif
 
+static unsigned long oom_victim_wait_timeout_min = 1;
+static unsigned long oom_victim_wait_timeout_max = (LONG_MAX / HZ);
+
 #ifdef CONFIG_INOTIFY_USER
 #include <linux/inotify.h>
 #endif
@@ -1222,6 +1225,24 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_victim_skip_secs",
+		.data		= &sysctl_oom_victim_skip_secs,
+		.maxlen		= sizeof(sysctl_oom_victim_skip_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1         = &oom_victim_wait_timeout_min,
+		.extra2         = &oom_victim_wait_timeout_max,
+	},
+	{
+		.procname	= "oom_victim_panic_secs",
+		.data		= &sysctl_oom_victim_panic_secs,
+		.maxlen		= sizeof(sysctl_oom_victim_panic_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1         = &oom_victim_wait_timeout_min,
+		.extra2         = &oom_victim_wait_timeout_max,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104..1a4f54f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -47,6 +47,8 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
+unsigned long sysctl_oom_victim_skip_secs = (LONG_MAX / HZ);
+unsigned long sysctl_oom_victim_panic_secs = (LONG_MAX / HZ);
 
 DEFINE_MUTEX(oom_lock);
 
@@ -132,6 +134,34 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
 	return oc->order == -1;
 }
 
+static bool is_killable_memdie_task(struct task_struct *p)
+{
+	const unsigned long oom_start = p->signal->oom_start;
+	struct task_struct *t;
+	bool memdie_pending = false;
+
+	if (!oom_start)
+		return false;
+	rcu_read_lock();
+	for_each_thread(p, t) {
+		if (!test_tsk_thread_flag(t, TIF_MEMDIE))
+			continue;
+		memdie_pending = true;
+		break;
+	}
+	rcu_read_unlock();
+	if (!memdie_pending)
+		return false;
+	if (time_after(jiffies, oom_start +
+		       sysctl_oom_victim_panic_secs * HZ)) {
+		sched_show_task(p);
+		panic("Out of memory and %u (%s) can not die...\n",
+		      p->pid, p->comm);
+	}
+	return time_after(jiffies, oom_start +
+			  sysctl_oom_victim_skip_secs * HZ);
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -149,7 +179,8 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
-	return false;
+	/* Already OOM-killed p might get stuck at unkillable wait */
+	return is_killable_memdie_task(p);
 }
 
 /**
@@ -668,6 +699,22 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	/*
+	 * The task might get stuck at unkillable wait with mmap_sem held for
+	 * write. In that case, even the OOM reaper will not help. Therefore,
+	 * record timestamp of setting TIF_MEMDIE for the first time of this
+	 * thread group, and check the timestamp at oom_unkillable_task().
+	 * If we record timestamp of setting TIF_MEMDIE for the first time of
+	 * this task, find_lock_task_mm() will select this task forever and
+	 * the OOM killer will wait for this thread group forever.
+	 */
+	if (!tsk->signal->oom_start) {
+		unsigned long oom_start = jiffies;
+
+		if (!oom_start)
+			oom_start = 1;
+		tsk->signal->oom_start = oom_start;
+	}
+	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * any memory and livelock. freezing_slow_path will tell the freezer
-- 
1.8.3.1

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-19 21:55   ` Tetsuo Handa
  2016-04-20 10:37     ` [PATCH v2] " Tetsuo Handa
@ 2016-04-20 14:47     ` Michal Hocko
  2016-04-21 11:49       ` Tetsuo Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-20 14:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 20-04-16 06:55:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> > > thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> > > oom_scan_process_thread() will not find TIF_MEMDIE thread
> > > (for !oom_kill_allocating_task case) and oom_badness() will return 0
> > > (for oom_kill_allocating_task case).
> > > 
> > > By applying this patch, the kernel will automatically press SysRq-f if
> > > the OOM reaper cannot reap the victim's memory, and we will never OOM
> > > livelock forever as long as the OOM killer is called.
> > 
> > Which will not guarantee anything as already pointed out several times
> > before. So I think this is not really that useful. I have said it
> > earlier and will repeat it again. Any timeout based solution which
> > doesn't guarantee that the system will be in a consistent state (reboot,
> > panic or kill all existing tasks) after the specified timeout is
> > pointless.
> 
> Triggering the reboot/panic is the worst action. Killing all existing tasks
> is the next worst action. Thus, I prefer killing tasks one by one.

Killing task after task doesn't guarantee any convergence to a usable
state. If somebody really cares about these highly unlikely lockups,
I am pretty sure he would really appreciate having a _reliable_ and
_guaranteed_ way out of that situation. Having a fuzzy mechanism that
does something in the hope of resolving that state is just unhelpful.

If I were an admin with a machine on the other side of the globe and
that machine just locked up due to OOM, I would pretty much want to
force a reboot, as my other means of fixing that situation would
otherwise be pretty much nil.

> I'm OK with shortening the timeout like N (when waiting for the 1st victim)
> + N/2 (the 2nd victim) + N/4 (the 3rd victim) + N/8 (the 4th victim) + ...
> but does it worth complicating the least unlikely path?

No it is not IMHO.
 
> > I believe that the chances of the lockup are much less likely with the
> > oom reaper and that we are not really urged to provide a new knob with a
> > random semantic. If we really want to have a timeout based thing better
> > make it behave reliably.
> 
> The threshold which the administrator can wait for ranges. Some may want to
> set few seconds because of 10 seconds /dev/watchdog timeout, others may want
> to set one minute because of not using watchdog. Thus, I think we should not
> hard code the timeout.

I guess you missed my point here. I didn't say this should be hardcoded
in any way. I am just saying that if we really want to make timeout-based
decisions, we had better think about the semantics, and those should
provide a reliable and deterministic means of resolving the problem.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-20 14:47     ` [PATCH] " Michal Hocko
@ 2016-04-21 11:49       ` Tetsuo Handa
  2016-04-21 13:07         ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-21 11:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Wed 20-04-16 06:55:42, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> > > > thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> > > > oom_scan_process_thread() will not find TIF_MEMDIE thread
> > > > (for !oom_kill_allocating_task case) and oom_badness() will return 0
> > > > (for oom_kill_allocating_task case).
> > > > 
> > > > By applying this patch, the kernel will automatically press SysRq-f if
> > > > the OOM reaper cannot reap the victim's memory, and we will never OOM
> > > > livelock forever as long as the OOM killer is called.
> > > 
> > > Which will not guarantee anything as already pointed out several times
> > > before. So I think this is not really that useful. I have said it
> > > earlier and will repeat it again. Any timeout based solution which
> > > doesn't guarantee that the system will be in a consistent state (reboot,
> > > panic or kill all existing tasks) after the specified timeout is
> > > pointless.
> > 
> > Triggering the reboot/panic is the worst action. Killing all existing tasks
> > is the next worst action. Thus, I prefer killing tasks one by one.
> 
> killing a task by task doesn't guarantee any convergence to a usable
> state. If somebody really cares about these highly unlikely lockups
> I am pretty sure he would really appreciate to have a _reliable_ and
> _guaranteed_ way out of that situation. Having a fuzzy mechanism to do
> something in a good hope of resolving that state is just unhelpful.

Killing tasks one by one shall eventually converge to a kernel panic.
But since we now have the OOM reaper, the possibility of needing to kill
the next task is very low. Killing tasks one by one via a timeout is
insurance for rare situations where the OOM reaper cannot reap the
OOM-killed thread's memory due to mmap_sem being held for write. (If
TIF_MEMDIE were set on all OOM-killed thread groups, the OOM killer could
converge to the kernel panic more quickly by ignoring the rest of the
OOM-killed threads sharing the same memory, but that is a different
patch.)

> 
> If I was an admin and had a machine on the other side of the globe and
> that machine just locked up due to OOM I would pretty much wanted to
> force reboot as my other means of fixing that situation would be pretty
> much close to zero otherwise.

I posted v2 of the patch, which also allows triggering a kernel panic via
a timeout.

> 
> > I'm OK with shortening the timeout like N (when waiting for the 1st victim)
> > + N/2 (the 2nd victim) + N/4 (the 3rd victim) + N/8 (the 4th victim) + ...
> > but does it worth complicating the least unlikely path?
> 
> No it is not IMHO.
>  
> > > I believe that the chances of the lockup are much less likely with the
> > > oom reaper and that we are not really urged to provide a new knob with a
> > > random semantic. If we really want to have a timeout based thing better
> > > make it behave reliably.
> > 
> > The threshold which the administrator can wait for ranges. Some may want to
> > set few seconds because of 10 seconds /dev/watchdog timeout, others may want
> > to set one minute because of not using watchdog. Thus, I think we should not
> > hard code the timeout.
> 
> I guess you missed my point here. I didn't say this should be hardcoded
> in any way. I am just saying that if we really want to do some timeout
> based decisions we should better think about the semantic and that
> should provide a reliable and deterministic means to resolve the problem.

I thought you did not like a tunable timeout because a tunable timeout
leads to a "what is the best duration" discussion.

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-21 11:49       ` Tetsuo Handa
@ 2016-04-21 13:07         ` Michal Hocko
  2016-04-24 14:19           ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-21 13:07 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Thu 21-04-16 20:49:16, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 20-04-16 06:55:42, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > This patch adds a timeout for handling corner cases where a TIF_MEMDIE
> > > > > thread got stuck. Since the timeout is checked at oom_unkillable_task(),
> > > > > oom_scan_process_thread() will not find TIF_MEMDIE thread
> > > > > (for !oom_kill_allocating_task case) and oom_badness() will return 0
> > > > > (for oom_kill_allocating_task case).
> > > > > 
> > > > > By applying this patch, the kernel will automatically press SysRq-f if
> > > > > the OOM reaper cannot reap the victim's memory, and we will never OOM
> > > > > livelock forever as long as the OOM killer is called.
> > > > 
> > > > Which will not guarantee anything as already pointed out several times
> > > > before. So I think this is not really that useful. I have said it
> > > > earlier and will repeat it again. Any timeout based solution which
> > > > doesn't guarantee that the system will be in a consistent state (reboot,
> > > > panic or kill all existing tasks) after the specified timeout is
> > > > pointless.
> > > 
> > > Triggering the reboot/panic is the worst action. Killing all existing tasks
> > > is the next worst action. Thus, I prefer killing tasks one by one.
> > 
> > killing a task by task doesn't guarantee any convergence to a usable
> > state. If somebody really cares about these highly unlikely lockups
> > I am pretty sure he would really appreciate to have a _reliable_ and
> > _guaranteed_ way out of that situation. Having a fuzzy mechanism to do
> > something in a good hope of resolving that state is just unhelpful.
> 
> Killing a task by task shall eventually converge to the kernel panic.

I (as an admin) do not want to wait an unbounded amount of time though.
This is just not practical.

> But since we now have the OOM reaper, the possibility of needing to kill
> next task is very low. Killing a task by task via timeout is an insurance
> for rare situations where the OOM reaper cannot reap the OOM-killed thread's
> memory due to mmap_sem being held for write.

You are trading one unlikely situation for another, and that's why I
think this is basically unusable in real life and why I am so strongly
opposed to it.

> (If TIF_MEMDIE were set to all
> OOM-kiled thread groups, the OOM killer can converge to the kernel panic
> more quickly by ignoring the rest of OOM-killed threads sharing the same
> memory, but that is a different patch.)
> 
> > 
> > If I was an admin and had a machine on the other side of the globe and
> > that machine just locked up due to OOM I would pretty much wanted to
> > force reboot as my other means of fixing that situation would be pretty
> > much close to zero otherwise.
> 
> I posted V2 of patch which also allows triggering the kernel panic via timeout.

I have seen that patch. I haven't got around to reviewing it properly yet
as I am still travelling. From a quick look I think it is conflating two
things. I could see arguments for the panic part, but I do not consider
the move-to-kill-another timeout justified. I would have to see a clear
indication that this is actually useful for real-life use cases.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-21 13:07         ` Michal Hocko
@ 2016-04-24 14:19           ` Tetsuo Handa
  2016-04-25  9:55             ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-24 14:19 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> I have seen that patch. I didn't get to review it properly yet as I am
> still travelling. From a quick view I think it is conflating two things
> together. I could see arguments for the panic part but I do not consider
> the move-to-kill-another timeout as justified. I would have to see a
> clear indication this is actually useful for real life usecases.

You admit that it is possible that the TIF_MEMDIE thread is blocked in an
unkillable wait (due to memory allocation requests by somebody else)
while the OOM reaper cannot reap the victim's memory (because mmap_sem is
held for write), don't you?

Then, I think this patch makes little sense unless accompanied by the
move-to-kill-another timeout. If the OOM reaper fails to reap the
victim's memory, it simply clears TIF_MEMDIE from the victim thread. But
since nothing has changed (i.e. the victim continues waiting, the
victim's memory is not reclaimed, and the victim's oom_score_adj is not
updated to OOM_SCORE_ADJ_MIN), the OOM killer will select that same
victim again. This forms an infinite loop. You will want to call panic()
as soon as the OOM reaper fails to reap the victim's memory (rather than
waiting for the panic timeout).

For both system operators at customers' companies and staff at a support
center, avoiding a hangup (due to OOM livelock) and a panic (due to the
OOM panic timeout) eliminates a lot of overhead. This is a practical
benefit for them.

I also think that the purpose of killing only one task at a time rather
than calling panic() is to save as much work as possible. Therefore, I
can't understand why you don't think that killing just one more task via
the move-to-kill-another timeout is a useful real-life use case.

Panic on timeout is a practical benefit for you, but giving several
chances on timeout is a practical benefit for someone you don't know.

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-24 14:19           ` Tetsuo Handa
@ 2016-04-25  9:55             ` Michal Hocko
  2016-04-26 13:54               ` Michal Hocko
  2016-04-26 14:00               ` Tetsuo Handa
  0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2016-04-25  9:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I have seen that patch. I didn't get to review it properly yet as I am
> > still travelling. From a quick view I think it is conflating two things
> > together. I could see arguments for the panic part but I do not consider
> > the move-to-kill-another timeout as justified. I would have to see a
> > clear indication this is actually useful for real life usecases.
> 
> You admit that it is possible that the TIF_MEMDIE thread is blocked at
> unkillable wait (due to memory allocation requests by somebody else) but
> the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> for write), don't you?

I have never said this is impossible.

> Then, I think this patch makes little sense unless accompanied with the
> move-to-kill-another timeout. If the OOM reaper failed to reap the victim's
> memory, the OOM reaper simply clears TIF_MEMDIE from the victim thread. But
> since nothing has changed (i.e. the victim continues waiting, and the victim's
> memory is not reclaimed, and the victim's oom_score_adj is not updated to
> OOM_SCORE_ADJ_MIN), the OOM killer will select that same victim again.

Yes, a patch to introduce a reliable panic-on-timeout would have to solve
this, and it is not really trivial to do so.

> This forms an infinite loop. You will want to call panic() as soon as the OOM
> reaper failed to reap the victim's memory (than waiting for the panic timeout).
> 
> For both system operators at customer's companies and staffs at support center,
> avoiding hangup (due to OOM livelock) and panic (due to the OOM panic timeout)
> eliminates a lot of overhead. This is a practical benefit for them.
> 
> I also think that the purpose of killing only one task at a time than calling
> panic() is to save as much work as possible.

If we are locked up then there is no room to try to save some work. We
want the machine to recover rather than hope for anything.

> Therefore, I can't understand why
> you don't think that killing only another task via the move-to-kill-another
> timeout is a useful real life usecase.

I feel like I have to repeat myself. The argument is really simple. If
you have an unlikely possibility of a lockup, then you really want a
_reliable_ way to get out of this unfortunate state. Kill-another-task
is a mere optimization which has to be evaluated for maintenance vs.
feasibility aspects. So far I am not really convinced about the second,
while the first seems like a real concern because the oom code is
complex enough already.

You also have to consider that exporting sysctl knobs for one-off
use cases which are very specific to the implementation at the time has
proven bad. The implementation is moving on, and there is no guarantee
that the OOM killer won't see changes after which a single oom victim no
longer even makes sense - e.g. we might change the semantics to kill
whole containers, or the killing logic might be put under the control of
the admin (e.g. BPF filters or kernel modules or whatever).

Panic on timeout, on the other hand, has a _clear_ semantic independent
of the current oom implementation, while move-to-other-victim is not so
clear in that aspect.

> panic on timeout is a practical benefit for you, but giving several chances
> on timeout is a practical benefit for someone you don't know.

Then I would like to hear about that "someone I don't know" with a
clear use case. So far you have been only fuzzy about those, and that is
not sufficient to justify adding more subtle code. Did I make myself
clear?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm,oom: Re-enable OOM killer using timeout.
  2016-04-20 10:37     ` [PATCH v2] " Tetsuo Handa
@ 2016-04-25 11:47       ` Michal Hocko
  2016-04-26 14:00         ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-25 11:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 20-04-16 19:37:30, Tetsuo Handa wrote:
[...]
> +static bool is_killable_memdie_task(struct task_struct *p)
> +{
> +	const unsigned long oom_start = p->signal->oom_start;
> +	struct task_struct *t;
> +	bool memdie_pending = false;
> +
> +	if (!oom_start)
> +		return false;
> +	rcu_read_lock();
> +	for_each_thread(p, t) {
> +		if (!test_tsk_thread_flag(t, TIF_MEMDIE))
> +			continue;
> +		memdie_pending = true;
> +		break;
> +	}
> +	rcu_read_unlock();
> +	if (!memdie_pending)
> +		return false;
> +	if (time_after(jiffies, oom_start +
> +		       sysctl_oom_victim_panic_secs * HZ)) {
> +		sched_show_task(p);
> +		panic("Out of memory and %u (%s) can not die...\n",
> +		      p->pid, p->comm);
> +	}
> +	return time_after(jiffies, oom_start +
> +			  sysctl_oom_victim_skip_secs * HZ);
> +}
> +
>  /* return true if the task is not adequate as candidate victim task. */
>  static bool oom_unkillable_task(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask)
> @@ -149,7 +179,8 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (!has_intersects_mems_allowed(p, nodemask))
>  		return true;
>  
> -	return false;
> +	/* Already OOM-killed p might get stuck at unkillable wait */
> +	return is_killable_memdie_task(p);
>  }

Hmm, I guess we have already discussed this in the past, but I might
misremember. The above relies on the oom killer being triggered again
after the previous victim was selected. There is no guarantee this will
happen. Why can't we go back to the timer-based solution, at least for
the panic timeout?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-25  9:55             ` Michal Hocko
@ 2016-04-26 13:54               ` Michal Hocko
  2016-04-27 10:43                 ` Tetsuo Handa
  2016-04-26 14:00               ` Tetsuo Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-26 13:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Mon 25-04-16 11:55:08, Michal Hocko wrote:
> On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I have seen that patch. I didn't get to review it properly yet as I am
> > > still travelling. From a quick view I think it is conflating two things
> > > together. I could see arguments for the panic part but I do not consider
> > > the move-to-kill-another timeout as justified. I would have to see a
> > > clear indication this is actually useful for real life usecases.
> > 
> > You admit that it is possible that the TIF_MEMDIE thread is blocked at
> > unkillable wait (due to memory allocation requests by somebody else) but
> > the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> > for write), don't you?
> 
> I have never said this to be impossible.

And just to clarify: I consider an unkillable sleep while holding
mmap_sem for write to be a _bug_ which should be fixed rather than worked
around by some timeout-based heuristics.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-25  9:55             ` Michal Hocko
  2016-04-26 13:54               ` Michal Hocko
@ 2016-04-26 14:00               ` Tetsuo Handa
  1 sibling, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-26 14:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I have seen that patch. I didn't get to review it properly yet as I am
> > > still travelling. From a quick view I think it is conflating two things
> > > together. I could see arguments for the panic part but I do not consider
> > > the move-to-kill-another timeout as justified. I would have to see a
> > > clear indication this is actually useful for real life usecases.
> > 
> > You admit that it is possible that the TIF_MEMDIE thread is blocked at
> > unkillable wait (due to memory allocation requests by somebody else) but
> > the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> > for write), don't you?
> 
> I have never said this to be impossible.
> 

OK. You might think it happens once per million OOM killer invocations.
I might think it happens once per thousand OOM killer invocations. But
someone might have a setup and applications which make it happen once
per ten OOM killer invocations. We need to be prepared for it anyway.

> > Then, I think this patch makes little sense unless accompanied with the
> > move-to-kill-another timeout. If the OOM reaper failed to reap the victim's
> > memory, the OOM reaper simply clears TIF_MEMDIE from the victim thread. But
> > since nothing has changed (i.e. the victim continues waiting, and the victim's
> > memory is not reclaimed, and the victim's oom_score_adj is not updated to
> > OOM_SCORE_ADJ_MIN), the OOM killer will select that same victim again.
> 
> Yes a patch to introduce a reliable panic-on-timeout would have to
> solved this and it is not really trivial to do so.
> 
> > This forms an infinite loop. You will want to call panic() as soon as the OOM
> > reaper failed to reap the victim's memory (than waiting for the panic timeout).
> > 
> > For both system operators at customer's companies and staffs at support center,
> > avoiding hangup (due to OOM livelock) and panic (due to the OOM panic timeout)
> > eliminates a lot of overhead. This is a practical benefit for them.
> > 
> > I also think that the purpose of killing only one task at a time than calling
> > panic() is to save as much work as possible.
> 
> If we are locked up then there is no room to try to save some work. We
> want the machine to recover rather than hope for anything.
> 

We might be locked up with the first OOM victim due to mmap_sem being
held for write, but we are likely no longer locked up with the
second/third OOM victims (assuming that the OOM killer selects different
mm users). I never want the machine to panic/reboot without trying
SysRq-f several times (but I'm not always sitting in front of the
console, able to try SysRq-f).

> > Therefore, I can't understand why
> > you don't think that killing only another task via the move-to-kill-another
> > timeout is a useful real life usecase.
> 
> I feel like I have to repeat myself. The argument is really simple. If
> you have an unlikely possibility of a lockup then you you really want to
> a _reliable_ way to get out of this unfortunate state. Kill-another-task
> is a mere optimization which has to be evaluated for maintenance vs.
> feasibility aspects. So far I am not really convicend about the second
> while the first seems like a real concern because the oom code is
> complex enough already.
> 

Quite the opposite. Panic on timeout is a mere optimization for those who
don't want to wait for too long. The basic direction for panic_on_oom == 0
is to lose as little work as possible while avoiding OOM lockup. We added
the OOM reaper in order to assist that direction. We are talking about
situations where the OOM reaper fails to assist. You think of this as "an
unlikely possibility of a lockup", but that assumption is not always true.

> You also have to consider that exporting sysctl knobs for one-off
> usecases which are very specific to the implementation at the time have
> proven bad. The implementation is moving on and there is no guarantee
> that the OOM killer will see changes where the single oom victim will
> make even sense - e.g. we might change the semantic to kill whole
> containers or that the killing logic would be under control of the admin
> (e.g. BPF filters or kernel modules or whatever).
> 

Yes, the OOM killer might change in the future. But that is not an excuse
to desert current users. You can deprecate and then remove such sysctl
knobs once you have developed the perfect model and mechanism. Until that
moment, please don't desert current and future users.

> No, panic on timeout has a _clear_ semantic independent of the current
> oom implementation, while move-to-other-victim is not so clear in that
> aspect.
> 
> > panic on timeout is a practical benefit for you, but giving several chances
> > on timeout is a practical benefit for someone you don't know.
> 
> Then I would like to hear about that "someone I don't know" with a
> clear usecase. So far you are only fuzzy about those and that is not
> sufficient to add another subtle code. Did I make myself clear?

I still cannot understand what you want to hear about the "usecase".

For CONFIG_MMU=n systems, the possibility is not small because the OOM
reaper is not available.

For large servers which take 10 minutes to reboot, trying to survive
for up to 60 seconds using the move-to-other-victim mechanism is helpful
to some administrators. (Of course, the period to retry is just an example.)

For desktop PCs running an Office application and a Web browser, trying
to save not-yet-saved Office documents when the Web browser happens to
trigger the OOM killer is helpful to some users.


* Re: [PATCH v2] mm,oom: Re-enable OOM killer using timeout.
  2016-04-25 11:47       ` Michal Hocko
@ 2016-04-26 14:00         ` Tetsuo Handa
  2016-04-26 14:31           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-26 14:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> Hmm, I guess we have already discussed that in the past but I might
> misremember. The above relies on oom killer to be triggered after the
> previous victim was selected. There is no guarantee this will happen.

Why is there no guarantee that this will happen?

This OOM livelock is caused by waiting for TIF_MEMDIE threads forever
unconditionally. If oom_unkillable_task() is not called, it is not
the OOM killer's problem.

Are you talking about doing did_some_progress = 1 for !__GFP_FS && !__GFP_NOFAIL
allocations without calling oom_unkillable_task()? Then, I insist that this is
the page allocator's problem. GFP_NOIO and GFP_NOFS allocations wake up kswapd,
and kswapd does __GFP_FS reclaim. If kswapd is unable to make forward
progress (an example is
http://I-love.SAKURA.ne.jp/tmp/serial-20160314-too-many-isolated2.txt.xz

----------
[  485.216878] Out of memory: Kill process 1356 (a.out) score 999 or sacrifice child
[  485.219170] Killed process 1356 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  514.255929] MemAlloc-Info: stalling=146 dying=0 exiting=0 victim=0 oom_count=1/226
(...snipped...)
[  540.998623] MemAlloc-Info: stalling=146 dying=0 exiting=0 victim=0 oom_count=1/226
[  571.003817] MemAlloc-Info: stalling=152 dying=0 exiting=0 victim=0 oom_count=1/226
(...snipped...)
[  585.888300] MemAlloc-Info: stalling=152 dying=0 exiting=0 victim=0 oom_count=1/226
----------
), we are already OOM and we need the administrator's decision (i.e.
either fail !__GFP_FS && !__GFP_NOFAIL allocations or select an OOM victim).
This OOM livelock is caused by waiting for somebody else forever unconditionally.
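
For reference, this is (condensed) what that shortcut looks like in
__alloc_pages_may_oom() of kernels from that era; the exact code differs in
detail, but the effect is the same:

	/* The OOM killer does not compensate for IO-less reclaim */
	if (!(gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NOFAIL)) {
		/*
		 * Pretend that reclaim made progress and keep looping,
		 * without calling out_of_memory() and therefore without
		 * ever reaching oom_unkillable_task().
		 */
		*did_some_progress = 1;
		goto out;
	}

So such an allocation loops in the page allocator forever while nobody
selects an OOM victim on its behalf.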

These OOM livelocks are caused by lack of mechanism for hearing administrator's
policy. We are missing rescue mechanisms which are needed for recovering from
situations your model did not expect.

I'm talking about corner cases where your deterministic approach fails. What we
need is "stop waiting for something forever unconditionally" and "hear what the
administrator wants to do". You can deprecate and then remove the sysctl knobs
for hearing what the administrator wants to do once you have developed a
perfect model and mechanism.

> Why cannot we get back to the timer based solution at least for the
> panic timeout?

Use of a global timer can cause false-positive panic() calls.
The timeout should be tracked on a per-task_struct or per-signal_struct basis.

Also, although it is a different problem, a global-timer-based solution does
not work for the case of an OOM livelock without any TIF_MEMDIE thread (an
example is shown above).
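
A rough sketch of what a per-victim timeout could look like (the field, helper
and knob names below are made up for illustration; this is not the posted
patch):

	/* stamp the victim when TIF_MEMDIE is granted */
	static void start_oom_victim_timeout(struct task_struct *tsk)
	{
		tsk->signal->oom_victim_expires =
			jiffies + oom_victim_timeout_secs * HZ;
	}

	/* checked wherever the victim is (re)examined */
	static bool oom_victim_timed_out(struct task_struct *tsk)
	{
		return time_after(jiffies, tsk->signal->oom_victim_expires);
	}

A single global timer, by contrast, keeps ticking across unrelated victims and
can expire for a victim which has only just been selected.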


* Re: [PATCH v2] mm,oom: Re-enable OOM killer using timeout.
  2016-04-26 14:00         ` Tetsuo Handa
@ 2016-04-26 14:31           ` Michal Hocko
  2016-04-27 10:43             ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-26 14:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Tue 26-04-16 23:00:15, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hmm, I guess we have already discussed that in the past but I might
> > misremember. The above relies on oom killer to be triggered after the
> > previous victim was selected. There is no guarantee this will happen.
> 
> Why is there no guarantee that this will happen?

What happens if you do not even hit the out_of_memory path? E.g. a
GFP_FS allocation being stuck somewhere in shrinkers waiting for
somebody to make forward progress, which never happens. Because this is
essentially what would block the mmap_sem write holder as well and what
you are trying to work around by the timeout based approach.

> This OOM livelock is caused by waiting for TIF_MEMDIE threads forever
> unconditionally. If oom_unkillable_task() is not called, it is not
> the OOM killer's problem.

It really doesn't matter whose problem that is, because whoever it is
doesn't have the full picture needed to draw any conclusions.

[...]

> These OOM livelocks are caused by lack of mechanism for hearing administrator's
> policy. We are missing rescue mechanisms which are needed for recovering from
> situations your model did not expect.

I am not opposed to a rescue policy defined by the admin. All I
am saying is that the only safe and reasonably maintainable one with
_predictable_ behavior I can see is to reboot/panic/killall-tasks after
a certain timeout. You consider this to be too harsh but do you at
least agree that the semantic of this is clear and an admin knows what
the behavior would be? As we are not able to find a consensus on the
go-to-other-victim approach can we at least agree on the absolute last
resort first?

We will surely hear complaints if this is too coarse and users really
need something more fine grained.
 
> I'm talking about corner cases where your deterministic approach fails. What we
> need is "stop waiting for something forever unconditionally" and "hear what the
> administrator wants to do". You can deprecate and then remove the sysctl knobs
> for hearing what the administrator wants to do once you have developed a
> perfect model and mechanism.
> 
> > Why cannot we get back to the timer based solution at least for the
> > panic timeout?
> 
> Use of a global timer can cause false-positive panic() calls.

A race window on the order of tens of seconds, which is the range the chosen
value would most probably fall into, doesn't matter that much IMHO.

> The timeout should be tracked on a per-task_struct or per-signal_struct basis.
> 
> Also, although it is a different problem, a global-timer-based solution does
> not work for the case of an OOM livelock without any TIF_MEMDIE thread (an
> example is shown above).

which is a technical detail which can be solved.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-26 13:54               ` Michal Hocko
@ 2016-04-27 10:43                 ` Tetsuo Handa
  2016-04-27 11:11                   ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-27 10:43 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Mon 25-04-16 11:55:08, Michal Hocko wrote:
> > On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > I have seen that patch. I didn't get to review it properly yet as I am
> > > > still travelling. From a quick view I think it is conflating two things
> > > > together. I could see arguments for the panic part but I do not consider
> > > > the move-to-kill-another timeout as justified. I would have to see a
> > > > clear indication this is actually useful for real life usecases.
> > > 
> > > You admit that it is possible that the TIF_MEMDIE thread is blocked at
> > > unkillable wait (due to memory allocation requests by somebody else) but
> > > the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> > > for write), don't you?
> > 
> > I have never said this to be impossible.
> 
> And just to clarify. I consider unkillable sleep while holding mmap_sem
> for write to be a _bug_ which should be fixed rather than worked around
> by some timeout based heuristics.

Excuse me, but I think that it is difficult to fix.
Since currently it is legal to block kswapd from memory reclaim paths
( http://lkml.kernel.org/r/20160211225929.GU14668@dastard ) and there
are allocation requests with mmap_sem held for write, you will need to
make memory reclaim paths killable. (I wish memory reclaim paths were
completely killable, because the fatal_signal_pending(current) check done
in throttle_direct_reclaim() is racy.)
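
The killable-wait pattern being asked for here is the same one used by the
down_write_killable() conversions; a minimal sketch:

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;	/* the caller must be able to handle failure */
	/* ... critical section ... */
	up_write(&mm->mmap_sem);

Making memory reclaim paths killable means pushing this kind of error handling
through every wait that can currently block unkillably with mmap_sem held for
write.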


* Re: [PATCH v2] mm,oom: Re-enable OOM killer using timeout.
  2016-04-26 14:31           ` Michal Hocko
@ 2016-04-27 10:43             ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2016-04-27 10:43 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Tue 26-04-16 23:00:15, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Hmm, I guess we have already discussed that in the past but I might
> > > misremember. The above relies on oom killer to be triggered after the
> > > previous victim was selected. There is no guarantee this will happen.
> > 
> > Why is there no guarantee that this will happen?
> 
> What happens if you do not even hit the out_of_memory path? E.g. a
> GFP_FS allocation being stuck somewhere in shrinkers waiting for
> somebody to make forward progress, which never happens. Because this is
> essentially what would block the mmap_sem write holder as well and what
> you are trying to work around by the timeout based approach.
> 

Are you talking about situations where the system hangs up before
mark_oom_victim() is called? If yes, starting a timer from mark_oom_victim()
is too late. We will need to start that timer from __alloc_pages_slowpath()
because __alloc_pages_slowpath() can sleep. Then, we don't need to consider
"OOM livelock before mark_oom_victim() is called" and "OOM livelock after
mark_oom_victim() is called" separately.
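
As a rough illustration of that idea (not the posted kmallocwd patch; the
field name below is hypothetical), the stamp would be taken where the slowpath
begins and cleared when the allocation finishes:

	/* at the top of __alloc_pages_slowpath() */
	if (!current->memalloc_start)
		current->memalloc_start = jiffies;	/* hypothetical field */

	/* ... allocation retry loop ... */

	/* before returning from __alloc_pages_slowpath() */
	current->memalloc_start = 0;

A watchdog can then report any task whose stamp is older than the chosen
timeout, whether or not mark_oom_victim() was ever reached.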

> > These OOM livelocks are caused by lack of mechanism for hearing administrator's
> > policy. We are missing rescue mechanisms which are needed for recovering from
> > situations your model did not expect.
> 
> I am not opposed to a rescue policy defined by the admin. All I
> am saying is that the only safe and reasonably maintainable one with
> _predictable_ behavior I can see is to reboot/panic/killall-tasks after
> a certain timeout. You consider this to be too harsh but do you at
> least agree that the semantic of this is clear and an admin knows what
> the behavior would be? As we are not able to find a consensus on the
> go-to-other-victim approach can we at least agree on the absolute last
> resort first?
> 

Which one ("OOM livelock before mark_oom_victim() is called" or "OOM livelock
after mark_oom_victim() is called") does this "the absolute last resort" apply to?

If this "the absolute last resort" applies to "OOM livelock after mark_oom_victim()
is called", what is your "the absolute last resort" for "OOM livelock before
mark_oom_victim() is called"? My suggestion is to workaround by per task_struct
timeout based approach until such workaround becomes no longer needed.


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-27 10:43                 ` Tetsuo Handa
@ 2016-04-27 11:11                   ` Michal Hocko
  2016-05-14  0:39                     ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-04-27 11:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 27-04-16 19:43:08, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 25-04-16 11:55:08, Michal Hocko wrote:
> > > On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > I have seen that patch. I didn't get to review it properly yet as I am
> > > > > still travelling. From a quick view I think it is conflating two things
> > > > > together. I could see arguments for the panic part but I do not consider
> > > > > the move-to-kill-another timeout as justified. I would have to see a
> > > > > clear indication this is actually useful for real life usecases.
> > > > 
> > > > You admit that it is possible that the TIF_MEMDIE thread is blocked at
> > > > unkillable wait (due to memory allocation requests by somebody else) but
> > > > the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> > > > for write), don't you?
> > > 
> > > I have never said this to be impossible.
> > 
> > And just to clarify. I consider unkillable sleep while holding mmap_sem
> > for write to be a _bug_ which should be fixed rather than worked around
> > by some timeout based heuristics.
> 
> Excuse me, but I think that it is difficult to fix.
> Since currently it is legal to block kswapd from memory reclaim paths
> ( http://lkml.kernel.org/r/20160211225929.GU14668@dastard ) and there
> are allocation requests with mmap_sem held for write, you will need to
> make memory reclaim paths killable. (I wish memory reclaim paths were
> completely killable, because the fatal_signal_pending(current) check done
> in throttle_direct_reclaim() is racy.)

Be it difficult or not it is something that should be fixed.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-04-27 11:11                   ` Michal Hocko
@ 2016-05-14  0:39                     ` Tetsuo Handa
  2016-05-16 14:18                       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-05-14  0:39 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Wed 27-04-16 19:43:08, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 25-04-16 11:55:08, Michal Hocko wrote:
> > > > On Sun 24-04-16 23:19:03, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > I have seen that patch. I didn't get to review it properly yet as I am
> > > > > > still travelling. From a quick view I think it is conflating two things
> > > > > > together. I could see arguments for the panic part but I do not consider
> > > > > > the move-to-kill-another timeout as justified. I would have to see a
> > > > > > clear indication this is actually useful for real life usecases.
> > > > > 
> > > > > You admit that it is possible that the TIF_MEMDIE thread is blocked at
> > > > > unkillable wait (due to memory allocation requests by somebody else) but
> > > > > the OOM reaper cannot reap the victim's memory (due to holding the mmap_sem
> > > > > for write), don't you?
> > > > 
> > > > I have never said this to be impossible.
> > > 
> > > And just to clarify. I consider unkillable sleep while holding mmap_sem
> > > for write to be a _bug_ which should be fixed rather than worked around
> > > by some timeout based heuristics.
> > 
> > Excuse me, but I think that it is difficult to fix.
> > Since currently it is legal to block kswapd from memory reclaim paths
> > ( http://lkml.kernel.org/r/20160211225929.GU14668@dastard ) and there
> > are allocation requests with mmap_sem held for write, you will need to
> > make memory reclaim paths killable. (I wish memory reclaim paths were
> > completely killable, because the fatal_signal_pending(current) check done
> > in throttle_direct_reclaim() is racy.)
> 
> Be it difficult or not it is something that should be fixed.
> 
I hit "allowing the OOM killer to select the same thread again" problem
using reproducer shown below on linux-next-20160513 + kmallocwd v5.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

static int memory_eater(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	srand(getpid());
	while (1) {
		int size = rand() % 1048576;
		void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, size);
	}
	return 0;
}

static int self_killer(void *unused)
{
	srand(getpid());
	poll(NULL, 0, rand() % 1000);
	kill(getpid(), SIGKILL);
	return 0;
}

static void child(void)
{
	static char *stack[256] = { };
	char buf[32] = { };
	int i;
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	write(fd, "1000", 4);
	close(fd);
	snprintf(buf, sizeof(buf), "tgid=%u", getpid());
	prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
	for (i = 0; i < 256; i++)
		stack[i] = malloc(4096 * 2);
	for (i = 1; i < 256 - 2; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			_exit(1);
	if (clone(memory_eater, stack[i++] + 8192, CLONE_VM, NULL) == -1)
		_exit(1);
	if (clone(self_killer, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
		_exit(1);
	_exit(0);
}

int main(int argc, char *argv[])
{
	static cpu_set_t set = { { 1 } };
	sched_setaffinity(0, sizeof(set), &set);
	if (fork() > 0) {
		char *buf = NULL;
		unsigned long size;
		unsigned long i;
		sleep(1);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		/* Will cause OOM due to overcommit */
		for (i = 0; i < size; i += 4096)
			buf[i] = 0;
		while (1)
			pause();
	}
	while (1)
		if (fork() == 0)
			child();
		else
			wait(NULL);
	return 0;
}
---------- Reproducer end ----------

This reproducer tried to attack the shortcuts

        /*
         * If the task is already exiting, don't alarm the sysadmin or kill
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                try_oom_reaper(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

in out_of_memory() and/or

        /*
         * If the task is already exiting, don't alarm the sysadmin or kill
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                try_oom_reaper(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

in oom_kill_process(), by making try_oom_reaper() not wake up the OOM
reaper: it reproduces a situation where one thread group receives SIGKILL
(and is hence blocked at down_read() after reaching do_exit()) while the other
thread group does not receive SIGKILL (and hence remains blocked at
down_write_killable()), in order to demonstrate how optimistic it is
to wait for a TIF_MEMDIE thread unconditionally forever.

What I got is that the OOM victim is blocked at
down_write(&vma->vm_file->f_mapping->i_mmap_rwsem) in i_mmap_lock_write()
called from vma_link() etc.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160514.txt.xz .
----------
[  156.182149] oom_reaper: reaped process 13333 (tgid=13079), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  157.113150] oom_reaper: reaped process 4372 (tgid=4118), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  157.995910] oom_reaper: reaped process 11029 (tgid=10775), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  158.181043] oom_reaper: reaped process 11285 (tgid=11031), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  169.049766] oom_reaper: reaped process 11541 (tgid=11287), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  169.323695] oom_reaper: reaped process 11797 (tgid=11543), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  176.294340] oom_reaper: reaped process 12309 (tgid=12055), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  240.458346] MemAlloc-Info: stalling=16 dying=1 exiting=1 victim=0 oom_count=729
[  241.950461] MemAlloc-Info: stalling=16 dying=1 exiting=1 victim=0 oom_count=729
[  301.956044] MemAlloc-Info: stalling=19 dying=1 exiting=1 victim=0 oom_count=729
[  303.654382] MemAlloc-Info: stalling=19 dying=1 exiting=1 victim=0 oom_count=729
[  349.771068] oom_reaper: reaped process 13589 (tgid=13335), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  349.996636] oom_reaper: reaped process 13845 (tgid=13591), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  350.704767] oom_reaper: reaped process 14357 (tgid=14103), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  351.656833] Out of memory: Kill process 5652 (tgid=5398) score 999 or sacrifice child
[  351.659127] Killed process 5652 (tgid=5398) total-vm:6348kB, anon-rss:1116kB, file-rss:12kB, shmem-rss:0kB
[  352.664419] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  357.238418] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  358.621747] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  359.970605] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  361.423518] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  362.704023] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  363.832115] MemAlloc-Info: stalling=1 dying=3 exiting=1 victim=1 oom_count=25279
[  364.148948] MemAlloc: tgid=5398(5652) flags=0x400040 switches=266 dying victim
[  364.150851] tgid=5398       R  running task    12920  5652      1 0x00100084
[  364.152773]  ffff88000637fbe8 ffffffff8172b257 000091fa78a0caf8 ffff8800389de440
[  364.154843]  ffff880006376440 ffff880006380000 ffff880078a0caf8 ffff880078a0caf8
[  364.156898]  ffff880078a0cb10 ffff880078a0cb00 ffff88000637fc00 ffffffff81725e1a
[  364.158972] Call Trace:
[  364.159979]  [<ffffffff8172b257>] ? _raw_spin_unlock_irq+0x27/0x50
[  364.161691]  [<ffffffff81725e1a>] schedule+0x3a/0x90
[  364.163170]  [<ffffffff8172a366>] rwsem_down_write_failed+0x106/0x220
[  364.164925]  [<ffffffff813bd2c7>] call_rwsem_down_write_failed+0x17/0x30
[  364.166737]  [<ffffffff81729877>] down_write+0x47/0x60
[  364.168258]  [<ffffffff811c3284>] ? vma_link+0x44/0xc0
[  364.169773]  [<ffffffff811c3284>] vma_link+0x44/0xc0
[  364.171255]  [<ffffffff811c5c05>] mmap_region+0x3a5/0x5b0
[  364.172822]  [<ffffffff811c6204>] do_mmap+0x3f4/0x4c0
[  364.174324]  [<ffffffff811a64dc>] vm_mmap_pgoff+0xbc/0x100
[  364.175894]  [<ffffffff811c4060>] SyS_mmap_pgoff+0x1c0/0x290
[  364.177499]  [<ffffffff81002c91>] ? do_syscall_64+0x21/0x170
[  364.179118]  [<ffffffff81022b7d>] SyS_mmap+0x1d/0x20
[  364.180592]  [<ffffffff81002ccc>] do_syscall_64+0x5c/0x170
[  364.182140]  [<ffffffff8172b9da>] entry_SYSCALL64_slow_path+0x25/0x25
[  364.183855] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  365.199023] MemAlloc-Info: stalling=1 dying=3 exiting=1 victim=1 oom_count=28254
[  366.283955] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  368.158264] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  369.568325] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  371.416533] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  373.159185] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  374.835808] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  376.386226] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  378.223962] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  379.601584] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  381.067290] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  382.394818] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  383.918460] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  385.540088] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  386.915094] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  388.297575] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  391.598638] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  393.580423] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  395.744709] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  397.377497] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  399.614030] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  401.103803] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  402.484887] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  404.503755] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  406.433219] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  407.958772] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  410.094990] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  413.509253] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  416.820991] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  420.485121] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  422.302336] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  424.623738] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  425.204811] MemAlloc-Info: stalling=13 dying=3 exiting=1 victim=0 oom_count=161064
[  425.592191] MemAlloc-Info: stalling=13 dying=3 exiting=1 victim=0 oom_count=161064
[  430.507619] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  432.487807] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  436.810127] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  439.310553] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  441.404857] oom_reaper: unable to reap pid:5652 (tgid=5398)
----------

static inline void i_mmap_lock_write(struct address_space *mapping)
{
        down_write(&mapping->i_mmap_rwsem);
}

static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
                        struct vm_area_struct *prev, struct rb_node **rb_link,
                        struct rb_node *rb_parent)
{
        struct address_space *mapping = NULL;

        if (vma->vm_file) {
                mapping = vma->vm_file->f_mapping;
                i_mmap_lock_write(mapping);
        }

        __vma_link(mm, vma, prev, rb_link, rb_parent); /* [<ffffffff811c3284>] vma_link+0x44/0xc0 */
        __vma_link_file(vma);

        if (mapping)
                i_mmap_unlock_write(mapping);

        mm->map_count++;
        validate_mm(mm);
}

As you said "I consider unkillable sleep while holding mmap_sem
for write to be a _bug_ which should be fixed rather than worked around
by some timeout based heuristics.", you of course have a plan to rewrite
the functions which are currently "void" to return "int" so that they can
use killable waits, don't you?
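
Illustratively, such a conversion for the vma_link() path quoted above might
look like this (i_mmap_lock_write_killable() does not exist; it is a
hypothetical killable variant, used only to show the shape of the change):

	static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
			struct vm_area_struct *prev, struct rb_node **rb_link,
			struct rb_node *rb_parent)
	{
		struct address_space *mapping = NULL;

		if (vma->vm_file) {
			mapping = vma->vm_file->f_mapping;
			if (i_mmap_lock_write_killable(mapping)) /* hypothetical */
				return -EINTR;
		}

		__vma_link(mm, vma, prev, rb_link, rb_parent);
		__vma_link_file(vma);

		if (mapping)
			i_mmap_unlock_write(mapping);

		mm->map_count++;
		validate_mm(mm);
		return 0;
	}

and every caller (mmap_region(), copy_vma(), ...) would have to learn to
handle the new failure case.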

I think that clearing TIF_MEMDIE even if the OOM reaper failed to reap the
OOM victim's memory is confusing for the panic_on_oom_timeout timer (which stops
itself when TIF_MEMDIE is cleared) and for kmallocwd (which prints victim=0 in
the MemAlloc-Info: line). Until you complete rewriting all functions which could
be called with mmap_sem held for write, we should allow the OOM killer to
select the next OOM victim upon timeout; otherwise calling panic() is premature.


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-05-14  0:39                     ` Tetsuo Handa
@ 2016-05-16 14:18                       ` Michal Hocko
  2016-05-17 11:08                         ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-05-16 14:18 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Sat 14-05-16 09:39:49, Tetsuo Handa wrote:
[...]
> What I got is that the OOM victim is blocked at
> down_write(&vma->vm_file->f_mapping->i_mmap_rwsem) in i_mmap_lock_write()
> called from vma_link() etc.
> 
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160514.txt.xz .
> ----------
[...]
> [  364.158972] Call Trace:
> [  364.159979]  [<ffffffff8172b257>] ? _raw_spin_unlock_irq+0x27/0x50
> [  364.161691]  [<ffffffff81725e1a>] schedule+0x3a/0x90
> [  364.163170]  [<ffffffff8172a366>] rwsem_down_write_failed+0x106/0x220
> [  364.164925]  [<ffffffff813bd2c7>] call_rwsem_down_write_failed+0x17/0x30
> [  364.166737]  [<ffffffff81729877>] down_write+0x47/0x60
> [  364.168258]  [<ffffffff811c3284>] ? vma_link+0x44/0xc0
> [  364.169773]  [<ffffffff811c3284>] vma_link+0x44/0xc0
> [  364.171255]  [<ffffffff811c5c05>] mmap_region+0x3a5/0x5b0
> [  364.172822]  [<ffffffff811c6204>] do_mmap+0x3f4/0x4c0
> [  364.174324]  [<ffffffff811a64dc>] vm_mmap_pgoff+0xbc/0x100
> [  364.175894]  [<ffffffff811c4060>] SyS_mmap_pgoff+0x1c0/0x290
> [  364.177499]  [<ffffffff81002c91>] ? do_syscall_64+0x21/0x170
> [  364.179118]  [<ffffffff81022b7d>] SyS_mmap+0x1d/0x20
> [  364.180592]  [<ffffffff81002ccc>] do_syscall_64+0x5c/0x170
> [  364.182140]  [<ffffffff8172b9da>] entry_SYSCALL64_slow_path+0x25/0x25
> [  364.183855] oom_reaper: unable to reap pid:5652 (tgid=5398)
[...]
> static inline void i_mmap_lock_write(struct address_space *mapping)
> {
>         down_write(&mapping->i_mmap_rwsem);
> }
> 
> static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
>                         struct vm_area_struct *prev, struct rb_node **rb_link,
>                         struct rb_node *rb_parent)
> {
>         struct address_space *mapping = NULL;
> 
>         if (vma->vm_file) {
>                 mapping = vma->vm_file->f_mapping;
>                 i_mmap_lock_write(mapping);
>         }
> 
>         __vma_link(mm, vma, prev, rb_link, rb_parent); /* [<ffffffff811c3284>] vma_link+0x44/0xc0 */
>         __vma_link_file(vma);
> 
>         if (mapping)
>                 i_mmap_unlock_write(mapping);
> 
>         mm->map_count++;
>         validate_mm(mm);
> }
> 
> As you said "I consider unkillable sleep while holding mmap_sem
> for write to be a _bug_ which should be fixed rather than worked around
> by some timeout based heuristics.", you of course have a plan to rewrite
> the functions which are currently "void" to return "int" so that they can
> use killable waits, don't you?

Thanks for the report. Yes this seems quite simple to fix actually. All
the callers are able to handle the failure. Well, copy_vma is a bit
complicated because it already did anon_vma_clone and other state
related stuff so it would be slightly more tricky but nothing unfixable.

> I think that clearing TIF_MEMDIE even if the OOM reaper failed to reap the
> OOM victim's memory is confusing for the panic_on_oom_timeout timer (which stops
> itself when TIF_MEMDIE is cleared) and for kmallocwd (which prints victim=0 in
> the MemAlloc-Info: line). Until you complete rewriting all functions which could
> be called with mmap_sem held for write, we should allow the OOM killer to
> select the next OOM victim upon timeout; otherwise calling panic() is premature.

I would agree if this was an easily triggerable issue in the real life.
You are basically DoSing your machine and that leads to corner cases of
course. We can and should try to plug them but I still do not see any
reason to rush into any solutions.

You seem to be bound to the timeout solution so much that you even
refuse to think about any other potential ways to move on. I think that
is counter productive. I have tried to explain many times that once you
define a _user_ _visible_ knob you should better define a proper semantic
for it. Do something with a random outcome is not it.

So let's move on and try to think outside of the box:
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index df8778e72211..027d5bc1e874 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c0e37dd1422f..b1a1e3317231 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -538,8 +538,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 

See the difference? This is 11LOC and we do not have to export any knobs
which would tie our hands for future implementations. We will cap the number
of times each mm struct is retried by the OOM killer and do not have
to touch any subtle oom killer paths so the patch would be quite easy
to review. We can change this implementation if it turns out to be
impractical, too optimistic or pessimistic.
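
For completeness, the reason setting MMF_OOM_REAPED makes the OOM killer move
on is that oom_badness() already skips such mms; roughly (the exact check may
differ in the current tree):

	/*
	 * Do not even consider tasks which are explicitly marked oom
	 * unkillable or have been already oom reaped.
	 */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
		task_unlock(p);
		return 0;
	}
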
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-05-16 14:18                       ` Michal Hocko
@ 2016-05-17 11:08                         ` Tetsuo Handa
  2016-05-17 12:51                           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-05-17 11:08 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> > I think that clearing TIF_MEMDIE even if the OOM reaper failed to reap the
> > OOM victim's memory is confusing for the panic_on_oom_timeout timer (which stops
> > itself when TIF_MEMDIE is cleared) and for kmallocwd (which prints victim=0 in
> > the MemAlloc-Info: line). Until you complete rewriting all functions which could
> > be called with mmap_sem held for write, we should allow the OOM killer to
> > select the next OOM victim upon timeout; otherwise calling panic() is premature.
> 
> I would agree if this was an easily triggerable issue in the real life.

Please don't assume that this isn't an easily triggerable issue in the real life.

http://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp
is a bug which was not identified for 5 years. I was not able to trigger it
in my environment, but the customer was able to trigger it easily in his
environment soon after he started testing his enterprise applications.
Although the bug was fixed and the distributor's kernel was updated, he
gave up using cpuset cgroup because it was too late to update the kernel
for his environment.

I haven't heard a response from you about the silent hangup bug where all
allocating tasks are trapped in the too_many_isolated() loop in shrink_inactive_list()
( http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp ).
His cpuset cgroup case was lucky because he had a workaround: not using the
cpuset cgroup. But for page allocator bugs, no one has a workaround of not
using the page allocator. Therefore, proactive detection is important.

You can't determine whether corner case bugs are triggerable before
such bugs bite users. It is too late to wait for feedback from users.

> You are basically DoSing your machine and that leads to corner cases of
> course. We can and should try to plug them but I still do not see any
> reason to rush into any solutions.

My intent of doing what-you-call-DoS stress tests is

   You had better realize that we can't find all corner cases.
   It is not a responsible attitude that you knowingly preserve
   corner cases with "can you trigger it?".

.

The OOM killer is a safety net in case something went wrong (e.g.
a runaway program). You refuse to care about corner cases. How can it be
called "robust / reliable" without the ability to handle corner cases?
As long as the minimal infrastructure for handling the OOM situation (e.g.
the scheduler) is alive, we should strive to recover from the OOM situation
(just as you strive to make the OOM reaper context as reliable as
possible).

> 
> You seem to be bound to the timeout solution so much that you even
> refuse to think about any other potential ways to move on. I think that
> is counter productive. I have tried to explain many times that once you
> define a _user_ _visible_ knob you should better define a proper semantic
> for it. Do something with a random outcome is not it.

Waiting for feedback without offering a workaround is counterproductive
when we are already aware of bugs. Offering a workaround first and then
trying to fix easily triggerable bugs is appreciated for those who can not
update kernels for their systems due to their constraints. It is up to
users to decide whether to use workarounds.

The reason I insist on the timeout based approach is the robustness.

   (A) It can work on CONFIG_MMU=n kernels.

   (B) It can work even if kthread_run(oom_reaper, NULL, "oom_reaper")
       returned an error.

   (C) It gives a more accurately bounded delay (compared to waiting for
       TIF_MEMDIE being sequentially cleared by the OOM reaper) even if
       there are many threads on the oom_reaper_list list.

   (D) It can work even if the OOM reaper cannot run for a long time
       for unknown reasons (e.g. preempted by realtime priority tasks).

   (E) We can handle all corner cases without proving that they are
       triggerable issues in the real life.

> 
> So let's move on and try to think outside of the box:
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index df8778e72211..027d5bc1e874 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_REAPED		21	/* mm has been already reaped */
> +#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c0e37dd1422f..b1a1e3317231 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -538,8 +538,27 @@ static void oom_reap_task(struct task_struct *tsk)
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts > MAX_OOM_REAP_RETRIES) {
> +		struct task_struct *p;
> +
>  		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  				task_pid_nr(tsk), tsk->comm);
> +
> +		/*
> +		 * If we've already tried to reap this task in the past and
> +		 * failed it probably doesn't make much sense to try yet again
> +		 * so hide the mm from the oom killer so that it can move on
> +		 * to another task with a different mm struct.
> +		 */
> +		p = find_lock_task_mm(tsk);
> +		if (p) {
> +			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> +				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> +						task_pid_nr(tsk), tsk->comm);
> +				set_bit(MMF_OOM_REAPED, &p->mm->flags);
> +			}
> +			task_unlock(p);
> +		}
> +
>  		debug_show_all_locks();
>  	}
>  
> 
> See the difference? This is 11LOC and we do not have export any knobs
> which would tie us for future implementations. We will cap the number
> of times each mm struct is attempted for OOM killer and do not have
> to touch any subtle oom killer paths so the patch would be quite easy
> to review. We can change this implementation if it turns out to be
> impractical, too optimistic or pesimistic.

Oh, this is a drastic change for you. You are trying to be very conservative
and you refused to select the next OOM victim unless progress is made.
If you can accept selecting the next OOM victim when progress is not made,
I might be able to move away from the timeout based approach.

The requirements for getting me away from the timeout based approach would be

   (1) The OOM reaper is always invoked even if the OOM victim's mm is
       known to be not reapable. That is, wake up the OOM reaper whenever
       TIF_MEMDIE is set. mark_oom_victim() is a good place for doing so.

   (2) The OOM reaper marks the OOM victim's mm_struct or signal_struct
       as "not suitable for OOM victims" regardless of whether the OOM
       reaper reaped the OOM victim's mm.

       Marking the OOM victim's mm_struct might be unsafe because
       it is possible that the OOM reaper is called without sending
       SIGKILL to all thread groups sharing the OOM victim's mm_struct
       (i.e. a situation where this reproducer tried to attack is
       reproduced).

       Marking the OOM victim's signal_struct might be unsafe unless
       the OOM reaper is called with at least a thread group which
       the OOM victim thread belongs to is guaranteed to be dying or
       exiting. "oom: consider multi-threaded tasks in task_will_free_mem"
       ( http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org )
       will help.

   (3) The OOM reaper does not defer marking as "not suitable for OOM
       victims" for unbounded duration.

       Imagine a situation where a thread group with 1000 threads was
       OOM killed, 1 thread is holding mmap_sem held for write and 999
       threads are doing allocation. All 1000 threads will be queued to
       the oom_reaper_list and 999 threads are blocked on mmap_sem at
       exit_mm(). Since the OOM reaper waits for 1 second on each OOM
       victim, the allocating task could wait for 1000 seconds.
       If oom_reap_task() checks for "not suitable for OOM victims"
       before calling __oom_reap_task(), we can shorten the delay to
       1 second.

       Imagine a situation where 100 thread groups in different memcg
       are OOM killed, each thread group is holding mmap_sem for write.
       Since the OOM reaper waits for 1 second on each memcg, the last
       OOM victim could wait for 100 seconds. If oom_reap_task() does
       parallel reaping, we can shorten the delay to 1 second.

So, can you accept these requirements?
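
For requirement (1), a minimal sketch of what I mean (simplified; the real
mark_oom_victim() does more, and wake_oom_reaper() is currently called from
elsewhere, so this placement is the proposal rather than current behaviour):

	void mark_oom_victim(struct task_struct *tsk)
	{
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;
		__thaw_task(tsk);
		atomic_inc(&oom_victims);
		wake_oom_reaper(tsk);	/* always queue the victim for the reaper */
	}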


* Re: [PATCH] mm,oom: Re-enable OOM killer using timeout.
  2016-05-17 11:08                         ` Tetsuo Handa
@ 2016-05-17 12:51                           ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-05-17 12:51 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Tue 17-05-16 20:08:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
[... Skipping parts unrelated to this discussion ...]

> > You are basically DoSing your machine and that leads to corner cases of
> > course. We can and should try to plug them but I still do not see any
> > reason to rush into any solutions.
> 
> My intent of doing what-you-call-DoS stress tests is
> 
>    You had better realize that we can't find all corner cases.
>    It is not a responsible attitude that you knowingly preserve
>    corner cases with "can you trigger it?".

You are still missing the important point. The scenarios you are testing are
killing the machine anyway. They are interesting for catching some corner
cases which might happen even without killing the machine, and that is
why they are worth running, but they are light years away from any
reasonably working system. And that is probably why nobody gave a damn
about possible deadlocks during OOM all those years. Don't you think?

> The OOM killer is a safety net in case something went wrong (e.g.
> a runaway program).

The OOM killer is a best effort to handle an out-of-memory condition. It's
never been perfect and never will be. Full stop. If you want your system to
behave reasonably, better configure it properly so that you do not hit the
OOM situation. This is what every reasonable admin will tell you.

> You refuse to care about corner cases.

Pardon me, but I have spent quite some time trying to understand and
cope with those corner cases as much as is reasonable without introducing new
wormholes.

> How can it be
> called "robust / reliable" without the ability to handle corner cases?
> As long as the minimal infrastructure for handling the OOM situation (e.g.
> the scheduler) is alive, we should strive to recover from the OOM situation
> (just as you strive to make the OOM reaper context as reliable as
> possible).

I am not going to repeat my arguments here for 101st time.

> > You seem to be bound to the timeout solution so much that you even
> > refuse to think about any other potential ways to move on. I think that
> > is counter productive. I have tried to explain many times that once you
> > define a _user_ _visible_ knob you should better define a proper semantic
> > for it. Do something with a random outcome is not it.
> 
> Waiting for feedback without offering a workaround is counterproductive
> when we are already aware of bugs. Offering a workaround first and then
> trying to fix easily triggerable bugs is appreciated for those who can not
> update kernels for their systems due to their constraints. It is up to
> users to decide whether to use workarounds.

Not if the workaround is a user-visible API which basically gets carved
in stone once we release it. We have a long history of making this
mistake over and over.

> The reason I insist on the timeout based approach is the robustness.
> 
>    (A) It can work on CONFIG_MMU=n kernels.
> 
>    (B) It can work even if kthread_run(oom_reaper, NULL, "oom_reaper")
>        returned an error.
> 
>    (C) It gives a more accurately bounded delay (compared to waiting for
>        TIF_MEMDIE being sequentially cleared by the OOM reaper) even if
>        there are many threads on the oom_reaper_list list.
> 
>    (D) It can work even if the OOM reaper cannot run for a long time
>        for unknown reasons (e.g. preempted by realtime priority tasks).
> 
>    (E) We can handle all corner cases without proving that they are
>        triggerable issues in the real life.

This just doesn't make any sense to answer. Most of those points have
been discussed in the past and I am not going to waste my time on them
again.

> > So let's move on and try to think outside of the box:
> > ---
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index df8778e72211..027d5bc1e874 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
> >  #define MMF_HAS_UPROBES		19	/* has uprobes */
> >  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> >  #define MMF_OOM_REAPED		21	/* mm has been already reaped */
> > +#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
> >  
> >  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index c0e37dd1422f..b1a1e3317231 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -538,8 +538,27 @@ static void oom_reap_task(struct task_struct *tsk)
> >  		schedule_timeout_idle(HZ/10);
> >  
> >  	if (attempts > MAX_OOM_REAP_RETRIES) {
> > +		struct task_struct *p;
> > +
> >  		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> >  				task_pid_nr(tsk), tsk->comm);
> > +
> > +		/*
> > +		 * If we've already tried to reap this task in the past and
> > +		 * failed it probably doesn't make much sense to try yet again
> > +		 * so hide the mm from the oom killer so that it can move on
> > +		 * to another task with a different mm struct.
> > +		 */
> > +		p = find_lock_task_mm(tsk);
> > +		if (p) {
> > +			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> > +				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> > +						task_pid_nr(tsk), tsk->comm);
> > +				set_bit(MMF_OOM_REAPED, &p->mm->flags);
> > +			}
> > +			task_unlock(p);
> > +		}
> > +
> >  		debug_show_all_locks();
> >  	}
> >  
> > 
> > See the difference? This is 11LOC and we do not have to export any knobs
> > which would tie our hands for future implementations. We will cap the number
> > of times each mm struct is retried by the OOM killer and do not have
> > to touch any subtle oom killer paths so the patch would be quite easy
> > to review. We can change this implementation if it turns out to be
> > impractical, too optimistic or pessimistic.
> 
> Oh, this is a drastic change for you. You are trying to be very conservative
> and you refused to select the next OOM victim unless progress is made.

Right. And, unlike previous attempts, this approach would be
justifiable because it is based on actual feedback. We know why the
oom killer/reaper cannot make any progress, so let's kill something
else. We know that all of our efforts have failed and we know why! So we
can actually make an educated decision. This is a huge difference from
any timeout based decision.

> If you can accept selecting the next OOM victim when progress is not made,
> I might be able to move away from the timeout based approach.

OK, so let's focus on being productive and reasonable and make educated
decisions, ideally in small and understandable steps. If you absolutely
want to have a guaranteed hand-brake user tunable then I have told you
what my requirement for its semantics is, in order to support you in that.

I am skipping your further points, sorry about that, but I have more
important tasks on my todo list than to end up in an endless discussion
again.
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2016-05-17 12:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-19 15:06 [PATCH] mm,oom: Re-enable OOM killer using timeout Tetsuo Handa
2016-04-19 20:07 ` Michal Hocko
2016-04-19 21:55   ` Tetsuo Handa
2016-04-20 10:37     ` [PATCH v2] " Tetsuo Handa
2016-04-25 11:47       ` Michal Hocko
2016-04-26 14:00         ` Tetsuo Handa
2016-04-26 14:31           ` Michal Hocko
2016-04-27 10:43             ` Tetsuo Handa
2016-04-20 14:47     ` [PATCH] " Michal Hocko
2016-04-21 11:49       ` Tetsuo Handa
2016-04-21 13:07         ` Michal Hocko
2016-04-24 14:19           ` Tetsuo Handa
2016-04-25  9:55             ` Michal Hocko
2016-04-26 13:54               ` Michal Hocko
2016-04-27 10:43                 ` Tetsuo Handa
2016-04-27 11:11                   ` Michal Hocko
2016-05-14  0:39                     ` Tetsuo Handa
2016-05-16 14:18                       ` Michal Hocko
2016-05-17 11:08                         ` Tetsuo Handa
2016-05-17 12:51                           ` Michal Hocko
2016-04-26 14:00               ` Tetsuo Handa
