* [PATCH] mm,oom: Re-enable OOM killer using timers.
@ 2016-01-07 11:26 ` Tetsuo Handa
  0 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-07 11:26 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

From 2f73abcec47535062d41c04bd7d9068cd71214b0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 7 Jan 2016 11:34:41 +0900
Subject: [PATCH] mm,oom: Re-enable OOM killer using timers.

This patch introduces two timers (a holdoff timer and a victim wait
timer) and sysctl variables for changing their timeouts
(oomkiller_holdoff_ms and oomkiller_victim_wait_ms), which respectively
address the collateral OOM victim problem and the OOM livelock problem.
When you are trying to analyze problems under OOM conditions, you can
emulate the current behavior by setting the holdoff timer's timeout to 0
and the victim wait timer's timeout to a very large value.
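For illustration, emulating the current behavior would look like this
(hypothetical paths: these sysctls exist only with this patch applied,
and writing them requires root):

```shell
# Disable the holdoff window entirely ...
echo 0 > /proc/sys/vm/oomkiller_holdoff_ms
# ... and wait practically forever (here: one hour) for the current
# victim to exit, which emulates the pre-patch behavior.
echo 3600000 > /proc/sys/vm/oomkiller_victim_wait_ms
```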


About collateral OOM victim problem:

We can observe a collateral victim being OOM-killed immediately after
the memory hog process is OOM-killed. This is caused by a race:

   (1) The process which called oom_kill_process() releases the oom_lock
       mutex before the memory reclaimed by OOM-killing the memory hog
       process becomes allocatable for others.

   (2) Another process acquires the oom_lock mutex and calls
       get_page_from_freelist() before the memory reclaimed by OOM-killing
       the memory hog process becomes allocatable for others.
       get_page_from_freelist() fails and thus the process proceeds to
       call out_of_memory().

   (3) The memory hog process exits and clears the TIF_MEMDIE flag.

   (4) select_bad_process() in out_of_memory() fails to find a task with
       TIF_MEMDIE set. Thus the process proceeds to choose the next OOM
       victim.

   (5) The memory reclaimed by OOM-killing the memory hog process becomes
       allocatable for others. But get_page_from_freelist() is no longer
       called by anybody holding the oom_lock mutex.

   (6) oom_kill_process() is called even though get_page_from_freelist()
       could now succeed. If get_page_from_freelist() would succeed, the
       newly chosen task is a collateral victim.

We cannot completely avoid this race because we cannot predict when the
memory reclaimed by OOM-killing the memory hog process becomes allocatable
for others. But we can reduce the possibility of hitting this race by
keeping the OOM killer disabled for an administrator-controlled period,
instead of relying on a sleep with the oom_lock mutex held.

This patch adds /proc/sys/vm/oomkiller_holdoff_ms for that purpose.
Since the OOM reaper retries 10 times at 0.1 second intervals, this
timeout can be relatively short (e.g. between 0.1 second and a few
seconds). The longer the period, the less likely we are to hit this
race, but the more likely we are to stall when the OOM reaper fails to
reclaim memory.


About OOM livelock problem:

We are trying to reduce the possibility of hitting OOM livelock by
introducing the OOM reaper, but we can still observe OOM livelock when
the OOM reaper fails to reclaim enough memory.

When the OOM reaper fails, we need to take some action to make forward
progress. Possible candidates are: choosing the next OOM victim,
allowing access to memory reserves, or triggering a kernel panic.

Allowing access to memory reserves might help, but on rare occasions we
already observe depletion of the memory reserves with the current
behavior. Thus, this is not a reliable candidate.

Triggering a kernel panic upon timeout might help, but can be overkill
for those who run with /proc/sys/vm/panic_on_oom = 0. At least some of
them would prefer choosing the next OOM victim, because it is very
likely that the OOM reaper can eventually reclaim memory if we continue
choosing subsequent OOM victims.

Therefore, this patch adds /proc/sys/vm/oomkiller_victim_wait_ms for
overriding the current behavior in order to choose subsequent OOM
victims. Since the victim wait timer should expire only after the OOM
reaper has failed, this timeout should be longer than the holdoff
timer's timeout (e.g. between a few seconds and a minute).


Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
  include/linux/oom.h |  2 ++
  kernel/sysctl.c     | 14 ++++++++++++++
  mm/oom_kill.c       | 31 ++++++++++++++++++++++++++++++-
  3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257..633e92a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -117,4 +117,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
  extern int sysctl_oom_dump_tasks;
  extern int sysctl_oom_kill_allocating_task;
  extern int sysctl_panic_on_oom;
+extern unsigned int sysctl_oomkiller_holdoff_ms;
+extern unsigned int sysctl_oomkiller_victim_wait_ms;
  #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9142036..7102212 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1209,6 +1209,20 @@ static struct ctl_table vm_table[] = {
  		.proc_handler	= proc_dointvec,
  	},
  	{
+		.procname       = "oomkiller_holdoff_ms",
+		.data           = &sysctl_oomkiller_holdoff_ms,
+		.maxlen         = sizeof(sysctl_oomkiller_holdoff_ms),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+	},
+	{
+		.procname       = "oomkiller_victim_wait_ms",
+		.data           = &sysctl_oomkiller_victim_wait_ms,
+		.maxlen         = sizeof(sysctl_oomkiller_victim_wait_ms),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+	},
+	{
  		.procname	= "overcommit_ratio",
  		.data		= &sysctl_overcommit_ratio,
  		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b8a4210..9548dce 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -48,9 +48,17 @@
  int sysctl_panic_on_oom;
  int sysctl_oom_kill_allocating_task;
  int sysctl_oom_dump_tasks = 1;
+unsigned int sysctl_oomkiller_holdoff_ms = 100; /* 0.1 second */
+unsigned int sysctl_oomkiller_victim_wait_ms = 5000; /* 5 seconds */

  DEFINE_MUTEX(oom_lock);

+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_holdoff_timer, oomkiller_reset, 0, 0);
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
  #ifdef CONFIG_NUMA
  /**
   * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -292,8 +300,14 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
  	 * Don't allow any other task to have access to the reserves.
  	 */
  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		/*
+		 * If one of TIF_MEMDIE tasks cannot die after victim wait
+		 * timeout expires, treat such tasks as unkillable because
+		 * they are likely stuck at OOM livelock.
+		 */
  		if (!is_sysrq_oom(oc))
-			return OOM_SCAN_ABORT;
+			return timer_pending(&oomkiller_victim_wait_timer) ?
+				OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
  	}
  	if (!task->mm)
  		return OOM_SCAN_CONTINUE;
@@ -575,6 +589,14 @@ void mark_oom_victim(struct task_struct *tsk)
  	/* OOM killer might race with memcg OOM */
  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
  		return;
+	/* Start holdoff timer and victim wait timer. */
+	if (sysctl_oomkiller_holdoff_ms)
+		mod_timer(&oomkiller_holdoff_timer, jiffies +
+			  msecs_to_jiffies(sysctl_oomkiller_holdoff_ms));
+	if (sysctl_oomkiller_victim_wait_ms)
+		mod_timer(&oomkiller_victim_wait_timer, jiffies +
+			  msecs_to_jiffies(sysctl_oomkiller_victim_wait_ms));
+
  	/*
  	 * Make sure that the task is woken up from uninterruptible sleep
  	 * if it is frozen because OOM killer wouldn't be able to free
@@ -865,6 +887,13 @@ bool out_of_memory(struct oom_control *oc)
  	}

  	/*
+	 * Do not try to choose next OOM victim until holdoff timer expires
+	 * so that we can reduce possibility of making a collateral victim.
+	 */
+	if (timer_pending(&oomkiller_holdoff_timer))
+		return true;
+
+	/*
  	 * Check if there were limitations on the allocation (only relevant for
  	 * NUMA) that may require different handling.
  	 */
-- 
1.8.3.1


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-07 11:26 ` Tetsuo Handa
@ 2016-01-13  1:36   ` David Rientjes
  -1 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-13  1:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

On Thu, 7 Jan 2016, Tetsuo Handa wrote:

> This patch introduces two timers (a holdoff timer and a victim wait
> timer) and sysctl variables for changing their timeouts
> (oomkiller_holdoff_ms and oomkiller_victim_wait_ms), which respectively
> address the collateral OOM victim problem and the OOM livelock problem.
> When you are trying to analyze problems under OOM conditions, you can
> emulate the current behavior by setting the holdoff timer's timeout to 0
> and the victim wait timer's timeout to a very large value.
> 
> 
> About collateral OOM victim problem:
> 
> We can observe a collateral victim being OOM-killed immediately after
> the memory hog process is OOM-killed. This is caused by a race:
> 
>    (1) The process which called oom_kill_process() releases the oom_lock
>        mutex before the memory reclaimed by OOM-killing the memory hog
>        process becomes allocatable for others.
> 
>    (2) Another process acquires the oom_lock mutex and calls
>        get_page_from_freelist() before the memory reclaimed by OOM-killing
>        the memory hog process becomes allocatable for others.
>        get_page_from_freelist() fails and thus the process proceeds to
>        call out_of_memory().
> 
>    (3) The memory hog process exits and clears the TIF_MEMDIE flag.
> 
>    (4) select_bad_process() in out_of_memory() fails to find a task with
>        TIF_MEMDIE set. Thus the process proceeds to choose the next OOM
>        victim.
> 
>    (5) The memory reclaimed by OOM-killing the memory hog process becomes
>        allocatable for others. But get_page_from_freelist() is no longer
>        called by anybody holding the oom_lock mutex.
> 
>    (6) oom_kill_process() is called even though get_page_from_freelist()
>        could now succeed. If get_page_from_freelist() would succeed, the
>        newly chosen task is a collateral victim.
> 
> We cannot completely avoid this race because we cannot predict when the
> memory reclaimed by OOM-killing the memory hog process becomes allocatable
> for others. But we can reduce the possibility of hitting this race by
> keeping the OOM killer disabled for an administrator-controlled period,
> instead of relying on a sleep with the oom_lock mutex held.
> 
> This patch adds /proc/sys/vm/oomkiller_holdoff_ms for that purpose.
> Since the OOM reaper retries 10 times at 0.1 second intervals, this
> timeout can be relatively short (e.g. between 0.1 second and a few
> seconds). The longer the period, the less likely we are to hit this
> race, but the more likely we are to stall when the OOM reaper fails to
> reclaim memory.
> 
> 
> About OOM livelock problem:
> 
> We are trying to reduce the possibility of hitting OOM livelock by
> introducing the OOM reaper, but we can still observe OOM livelock when
> the OOM reaper fails to reclaim enough memory.
> 
> When the OOM reaper fails, we need to take some action to make forward
> progress. Possible candidates are: choosing the next OOM victim,
> allowing access to memory reserves, or triggering a kernel panic.
> 
> Allowing access to memory reserves might help, but on rare occasions we
> already observe depletion of the memory reserves with the current
> behavior. Thus, this is not a reliable candidate.
> 
> Triggering a kernel panic upon timeout might help, but can be overkill
> for those who run with /proc/sys/vm/panic_on_oom = 0. At least some of
> them would prefer choosing the next OOM victim, because it is very
> likely that the OOM reaper can eventually reclaim memory if we continue
> choosing subsequent OOM victims.
> 
> Therefore, this patch adds /proc/sys/vm/oomkiller_victim_wait_ms for
> overriding the current behavior in order to choose subsequent OOM
> victims. Since the victim wait timer should expire only after the OOM
> reaper has failed, this timeout should be longer than the holdoff
> timer's timeout (e.g. between a few seconds and a minute).
> 
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>   include/linux/oom.h |  2 ++
>   kernel/sysctl.c     | 14 ++++++++++++++
>   mm/oom_kill.c       | 31 ++++++++++++++++++++++++++++++-
>   3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 03e6257..633e92a 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -117,4 +117,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
>   extern int sysctl_oom_dump_tasks;
>   extern int sysctl_oom_kill_allocating_task;
>   extern int sysctl_panic_on_oom;
> +extern unsigned int sysctl_oomkiller_holdoff_ms;
> +extern unsigned int sysctl_oomkiller_victim_wait_ms;
>   #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 9142036..7102212 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1209,6 +1209,20 @@ static struct ctl_table vm_table[] = {
>   		.proc_handler	= proc_dointvec,
>   	},
>   	{
> +		.procname       = "oomkiller_holdoff_ms",
> +		.data           = &sysctl_oomkiller_holdoff_ms,
> +		.maxlen         = sizeof(sysctl_oomkiller_holdoff_ms),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +	},
> +	{
> +		.procname       = "oomkiller_victim_wait_ms",
> +		.data           = &sysctl_oomkiller_victim_wait_ms,
> +		.maxlen         = sizeof(sysctl_oomkiller_victim_wait_ms),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +	},
> +	{
>   		.procname	= "overcommit_ratio",
>   		.data		= &sysctl_overcommit_ratio,
>   		.maxlen		= sizeof(sysctl_overcommit_ratio),

I'm not sure why you are proposing adding both of these in the same
patch; they have very different use cases and semantics.

oomkiller_holdoff_ms, as indicated by the changelog, seems to be 
correcting some deficiency in the oom reaper.  I haven't reviewed that, 
but it seems like something that shouldn't need to be fixed with a 
timeout-based solution.  We either know we have completed oom reaping or 
we haven't; it is something we should easily be able to figure out 
without requiring heuristics such as this.

This does not seem to have anything to do with current upstream code, 
which does not have the oom reaper, since the oom killer clearly has 
synchronization through oom_lock and we carefully defer for TIF_MEMDIE 
processes and abort for those that have not yet fully exited to free 
their memory.  If patches are going to be proposed on top of the oom 
reaper, please explicitly state that.

I believe any such race described in the changelog could be corrected by 
deferring the oom killer entirely until the oom reaper has been able to 
free memory or the oom victim has fully exited.  I haven't reviewed that, 
so I can't speak definitively, but I think we should avoid _any_ timeout 
based solution if possible and there's no indication this is the only way 
to solve such a problem.

oomkiller_victim_wait_ms seems to be another manifestation of the same 
patch which has been nack'd over and over again.  It does not address the 
situation where there are no additional eligible processes to kill and we 
end up panicking the machine when additional access to memory reserves 
might have allowed the victim to exit.  Randomly killing additional 
processes makes that problem worse, since they may not be able to exit 
(which becomes more likely than not if all victims are waiting on a 
mutex held by an allocating thread).

My solution for that has always been to grant allocating threads temporary 
access to memory reserves in the hope that the mutex will be dropped and 
the victim may make forward progress.  We have this implemented internally 
and I've posted a test module that easily exhibits the problem and how it 
is fixed.

The reality that we must confront, however, is very stark: if the system 
is out of memory and we have killed processes to free memory, we do not 
have the ability to recover from this situation in all cases.  We make a 
trade-off: either allow processes waiting for memory to have access to 
reserves so that the livelock hopefully gets worked out, or randomly kill 
additional processes.  Neither of these works 100% of the time and you 
can easily create test modules which exhibit both.  We cannot do both 
because memory reserves are a finite resource and we can't spread them 
across every process on the system and reasonably expect recovery.

The ability to recover from these situations is going to have a direct 
correlation to the size of your memory reserves, although even 
min_free_kbytes being very large will also not solve the problem 100% of 
the time.  Complaining that the oom reaper can be shown to not free enough 
memory, and that you can show that memory reserves can be fully depleted 
isn't interesting.  I could easily change my module that allocates while 
holding a mutex to have the lowest oom priority on the system and show 
that this timeout-based solution also doesn't work because reserves are 
finite.

We can make a best-effort to improve the situation: things like the oom 
reaper and access to memory reserves for processes looping forever due to 
a deferred oom killer.  It doesn't help to randomly oom kill processes 
that may be waiting on the same mutex itself.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
@ 2016-01-13  1:36   ` David Rientjes
  0 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-13  1:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

On Thu, 7 Jan 2016, Tetsuo Handa wrote:

> This patch introduces two timers ( holdoff timer and victim wait timer)
> and sysctl variables for changing timeout ( oomkiller_holdoff_ms and
> oomkiller_victim_wait_ms ) for respectively handling collateral OOM
> victim problem and OOM livelock problem. When you are trying to analyze
> problems under OOM condition, you can set holdoff timer's timeout to 0
> and victim wait timer's timeout to very large value for emulating
> current behavior.
> 
> 
> About collateral OOM victim problem:
> 
> We can observe collateral victim being OOM-killed immediately after
> the memory hog process is OOM-killed. This is caused by a race:
> 
>    (1) The process which called oom_kill_process() releases the oom_lock
>        mutex before the memory reclaimed by OOM-killing the memory hog
>        process becomes allocatable for others.
> 
>    (2) Another process acquires the oom_lock mutex and checks for
>        get_page_from_freelist() before the memory reclaimed by OOM-killing
>        the memory hog process becomes allocatable for others.
>        get_page_from_freelist() fails and thus the process proceeds
>        calling out_of_memory().
> 
>    (3) The memory hog process exits and clears TIF_MEMDIE flag.
> 
>    (4) select_bad_process() in out_of_memory() fails to find a task with
>        TIF_MEMDIE pending. Thus the process proceeds choosing next OOM
>        victim.
> 
>    (5) The memory reclaimed by OOM-killing the memory hog process becomes
>        allocatable for others. But get_page_from_freelist() is no longer
>        called by somebody which held the oom_lock mutex.
> 
>    (6) oom_kill_process() is called although get_page_from_freelist()
>        could now succeed. If get_page_from_freelist() can succeed, this
>        is a collateral victim.
> 
> We cannot completely avoid this race because we cannot predict when the
> memory reclaimed by OOM-killing the memory hog process becomes allocatable
> for others. But we can reduce possibility of hitting this race by keeping
> the OOM killer disabled for some administrator controlled period, instead
> of relying on a sleep with oom_lock mutex held.
> 
> This patch adds /proc/sys/vm/oomkiller_holdoff_ms for that purpose.
> Since the OOM reaper retries for 10 times with 0.1 second interval,
> this timeout can be relatively short (e.g. between 0.1 second and few
> seconds). Longer the period is, more unlikely to hit this race but more
> likely to stall longer when the OOM reaper failed to reclaim memory.
> 
> 
> About OOM livelock problem:
> 
> We are trying to reduce the possibility of hitting OOM livelock by
> introducing the OOM reaper, but we can still observe OOM livelock
> when the OOM reaper failed to reclaim enough memory.
> 
> When the OOM reaper failed, we need to take some action for making forward
> progress. Possible candidates are: choose next OOM victim, allow access to
> memory reserves, trigger kernel panic.
> 
> Allowing access to memory reserves might help, but on rare occasions
> we are already observing depletion of the memory reserves with current
> behavior. Thus, this is not a reliable candidate.
> 
> Triggering kernel panic upon timeout might help, but can be overkilling
> for those who use with /proc/sys/vm/panic_on_oom = 0. At least some of
> them prefer choosing next OOM victim because it is very likely that the
> OOM reaper can eventually reclaim memory if we continue choosing
> subsequent OOM victims.
> 
> Therefore, this patch adds /proc/sys/vm/oomkiller_victim_wait_ms for
> ignoring current behavior in order to choose subsequent OOM victims.
> Since wait victim timer should expire after the OOM reaper fails,
> this timeout should be longer than holdoff timer's timeout (e.g.
> between few seconds and a minute).
> 
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>   include/linux/oom.h |  2 ++
>   kernel/sysctl.c     | 14 ++++++++++++++
>   mm/oom_kill.c       | 31 ++++++++++++++++++++++++++++++-
>   3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 03e6257..633e92a 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -117,4 +117,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
>   extern int sysctl_oom_dump_tasks;
>   extern int sysctl_oom_kill_allocating_task;
>   extern int sysctl_panic_on_oom;
> +extern unsigned int sysctl_oomkiller_holdoff_ms;
> +extern unsigned int sysctl_oomkiller_victim_wait_ms;
>   #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 9142036..7102212 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1209,6 +1209,20 @@ static struct ctl_table vm_table[] = {
>   		.proc_handler	= proc_dointvec,
>   	},
>   	{
> +		.procname       = "oomkiller_holdoff_ms",
> +		.data           = &sysctl_oomkiller_holdoff_ms,
> +		.maxlen         = sizeof(sysctl_oomkiller_holdoff_ms),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +	},
> +	{
> +		.procname       = "oomkiller_victim_wait_ms",
> +		.data           = &sysctl_oomkiller_victim_wait_ms,
> +		.maxlen         = sizeof(sysctl_oomkiller_victim_wait_ms),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +	},
> +	{
>   		.procname	= "overcommit_ratio",
>   		.data		= &sysctl_overcommit_ratio,
>   		.maxlen		= sizeof(sysctl_overcommit_ratio),

I'm not sure why you are proposing adding both of these in the same patch; 
they have very different usecases and semantics.

oomkiller_holdoff_ms, as indicated by the changelog, seems to be 
correcting some deficiency in the oom reaper.  I haven't reviewed that, 
but it seems like something that wouldn't need to be fixed with a 
timeout-based solution.  Either we have completed oom reaping or we 
haven't; that is something we should easily be able to figure out 
without requiring heuristics such as this.

This does not seem to have anything to do with current upstream code that 
does not have the oom reaper since the oom killer clearly has 
synchronization through oom_lock and we carefully defer for TIF_MEMDIE 
processes and abort for those that have not yet fully exited to free their 
memory.  If patches are going to be proposed on top of the oom reaper, 
please explicitly state that.

I believe any such race described in the changelog could be corrected by 
deferring the oom killer entirely until the oom reaper has been able to 
free memory or the oom victim has fully exited.  I haven't reviewed that, 
so I can't speak definitively, but I think we should avoid _any_ timeout 
based solution if possible and there's no indication this is the only way 
to solve such a problem.

oomkiller_victim_wait_ms seems to be another manifestation of the same 
patch which has been nack'd over and over again.  It does not address the 
situation where there are no additional eligible processes to kill and we 
end up panicking the machine when additional access to memory reserves may 
have allowed the victim to exit.  Randomly killing additional processes 
makes that problem worse, since they may not be able to exit either (which 
becomes more likely than not if all victims are waiting on a mutex held by 
an allocating thread).

My solution for that has always been to grant allocating threads temporary 
access to memory reserves in the hope that the mutex be dropped and the 
victim may make forward progress.  We have this implemented internally and 
I've posted a test module that easily exhibits the problem and how it is 
fixed.

The reality that we must confront, however, is very stark: if the system 
is out of memory and we have killed processes to free memory, we do not 
have the ability to recover from this situation in all cases.  We make a 
trade-off: either allow processes waiting for memory to have access to 
reserves so that the livelock hopefully gets worked out, or we randomly 
kill additional processes.  Neither of these works 100% of the time, and 
you can easily create test modules which exhibit both.  We cannot do both 
solutions because memory reserves are a finite resource and we can't spread 
them across every process on the system and reasonably expect recovery.

The ability to recover from these situations is going to have a direct 
correlation to the size of your memory reserves, although even 
min_free_kbytes being very large will also not solve the problem 100% of 
the time.  Complaining that the oom reaper can be shown to not free enough 
memory, and that you can show that memory reserves can be fully depleted 
isn't interesting.  I could easily change my module that allocates while 
holding a mutex to have the lowest oom priority on the system and show 
that this timeout-based solution also doesn't work because reserves are 
finite.

We can make a best-effort to improve the situation: things like the oom 
reaper and access to memory reserves for processes looping forever due to 
a deferred oom killer.  It doesn't help to randomly oom kill processes 
that may themselves be waiting on the same mutex.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-13  1:36   ` David Rientjes
@ 2016-01-13 12:11     ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-13 12:11 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

David Rientjes wrote:
> I'm not sure why you are proposing adding both of these in the same patch; 
> they have very different usecases and semantics.

Because both of these are for tuning the OOM killer.



> 
> oomkiller_holdoff_ms, as indicated by the changelog, seems to be 
> correcting some deficiency in the oom reaper.

It is not a deficiency in the OOM reaper but a deficiency in the OOM killer
or in the page allocator.

>                                                I haven't reviewed that, 
> but it seems like something that wouldn't need to be fixed with a 
> timeout-based solution.

The problem is that it takes some time for reclaimed memory to be returned
to the freelist. Unless we add a callback mechanism for notifying that the
memory used by the TIF_MEMDIE task was reclaimed and returned to the
freelist, there is no way to fix this problem.

>                          Either we have completed oom reaping or we 
> haven't; that is something we should easily be able to figure out 
> without requiring heuristics such as this.
> 
> This does not seem to have anything to do with current upstream code that 
> does not have the oom reaper since the oom killer clearly has 
> synchronization through oom_lock and we carefully defer for TIF_MEMDIE 
> processes and abort for those that have not yet fully exited to free their 
> memory.  If patches are going to be proposed on top of the oom reaper, 
> please explicitly state that.

The OOM reaper is irrelevant here. It is merely an accelerator that reclaims
a victim's memory earlier than would otherwise happen.

> 
> I believe any such race described in the changelog could be corrected by 
> deferring the oom killer entirely until the oom reaper has been able to 
> free memory or the oom victim has fully exited.  I haven't reviewed that, 
> so I can't speak definitively, but I think we should avoid _any_ timeout 
> based solution if possible and there's no indication this is the only way 
> to solve such a problem.

The OOM killer cannot know when the reclaimed memory is returned to the
freelist (and therefore when get_page_from_freelist() might succeed).

Currently, a timeout is the only way to mitigate this problem.



> 
> oomkiller_victim_wait_ms seems to be another manifestation of the same 
> patch which has been nack'd over and over again.

I believe the situation is changing due to the introduction of the OOM reaper.

>                                                   It does not address the 
> situation where there are no additional eligible processes to kill and we 
> end up panicking the machine when additional access to memory reserves may 
> have allowed the victim to exit.  Randomly killing additional processes 
> makes that problem worse, since they may not be able to exit either (which 
> becomes more likely than not if all victims are waiting on a mutex held by 
> an allocating thread).
> 
> My solution for that has always been to grant allocating threads temporary 
> access to memory reserves in the hope that the mutex be dropped and the 
> victim may make forward progress.  We have this implemented internally and 
> I've posted a test module that easily exhibits the problem and how it is 
> fixed.

Those who use panic_on_oom = 1 expect that the system triggers a kernel
panic rather than stalling forever. This is a translation of the
administrator's wish: "Please press SysRq-c on my behalf if memory is
exhausted. That way, I don't need to stand by in front of the console
twenty-four seven."

Those who use panic_on_oom = 0 expect that the OOM killer resolves the OOM
condition rather than stalling forever. This is a translation of the
administrator's wish: "Please press SysRq-f on my behalf if memory is
exhausted. That way, I don't need to stand by in front of the console
twenty-four seven."

However, since the OOM killer never presses SysRq-f again until the OOM
victim terminates, this annoys administrators.

  Administrator:  "I asked you to press SysRq-f on my behalf. Why did you
                   let the system stall forever?"

  The OOM killer: "I did. The system did not recover from the OOM condition."

  Administrator:  "Why don't you try pressing SysRq-f again on my behalf?"

  The OOM killer: "I am not programmed to do so."

  Administrator:  "You are a really inattentive assistant. OK. Here is a patch
                   that programs you to press SysRq-f again on my behalf."

What I want to say to the OOM killer, when it answers "I did something in
the hope that the OOM condition would be resolved", is "Please don't toss
the OOM killer's duty away." And MM people are still NACKing this innocent
wish of administrators.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-13 12:11     ` Tetsuo Handa
@ 2016-01-13 16:26       ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2016-01-13 16:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 13-01-16 21:11:30, Tetsuo Handa wrote:
[...]
> Those who use panic_on_oom = 1 expect that the system triggers a kernel
> panic rather than stalling forever. This is a translation of the
> administrator's wish: "Please press SysRq-c on my behalf if memory is
> exhausted. That way, I don't need to stand by in front of the console
> twenty-four seven."
> 
> Those who use panic_on_oom = 0 expect that the OOM killer resolves the OOM
> condition rather than stalling forever. This is a translation of the
> administrator's wish: "Please press SysRq-f on my behalf if memory is
> exhausted. That way, I don't need to stand by in front of the console
> twenty-four seven."

I think you are missing an important point. There is _no reliable_ way
to resolve the OOM condition in general except to panic the system. Even
killing all user-space tasks might not be sufficient in general, because
they might be blocked by an unkillable context (e.g. a kernel thread).
So if you need reliable behavior, either use panic_on_oom=1 or provide
a measure to panic after a fixed timeout if the OOM condition cannot be
resolved. We have seen patches in that regard, but there was no general
interest in merging them.

All we can do is a best-effort approach which tries to be optimized to
reduce the impact of an unexpected SIGKILL sent to a "random" task. And
this is a reasonable objective IMHO. This works well in 99% of cases.
You can argue that you do care about that 1%, and I sympathize with you,
but steps to mitigate those cases shouldn't involve measures which bring
another level of non-determinism into an already complicated system. This
was the biggest issue of the early OOM killer.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-13 16:26       ` Michal Hocko
@ 2016-01-13 16:56         ` Johannes Weiner
  -1 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2016-01-13 16:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, rientjes, akpm, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Wed, Jan 13, 2016 at 05:26:10PM +0100, Michal Hocko wrote:
> On Wed 13-01-16 21:11:30, Tetsuo Handa wrote:
> [...]
> > Those who use panic_on_oom = 1 expect that the system triggers a kernel
> > panic rather than stalling forever. This is a translation of the
> > administrator's wish: "Please press SysRq-c on my behalf if memory is
> > exhausted. That way, I don't need to stand by in front of the console
> > twenty-four seven."
> > 
> > Those who use panic_on_oom = 0 expect that the OOM killer resolves the OOM
> > condition rather than stalling forever. This is a translation of the
> > administrator's wish: "Please press SysRq-f on my behalf if memory is
> > exhausted. That way, I don't need to stand by in front of the console
> > twenty-four seven."
> 
> I think you are missing an important point. There is _no reliable_ way
> to resolve the OOM condition in general except to panic the system. Even
> killing all user-space tasks might not be sufficient in general, because
> they might be blocked by an unkillable context (e.g. a kernel thread).
> So if you need reliable behavior, either use panic_on_oom=1 or provide
> a measure to panic after a fixed timeout if the OOM condition cannot be
> resolved. We have seen patches in that regard, but there was no general
> interest in merging them.

While what you're saying about there not being a failsafe way is true,
I don't understand why we should panic the machine before we tried to
kill every single userspace task. That's what I never understood about
your timeout-panic patches: if the OOM victim doesn't exit in a fixed
amount of time, why is it better to panic the machine than to try the
second-best, third-best, fourth-best etc. OOM candidates?

Yes, you can say that at least the kernel will make a decision in a
fixed amount of time and it'll be more useful in practice. But the
reality of most scenarios is that moving on to other victims will
increase the chance of success dramatically while the chance of
continued hanging would converge toward 0.

And for the more extreme scenarios, where you have a million tasks all
blocked on the same resource, we can decay the timeout exponentially
to cap the decision time to a reasonable worst case; wait 8s for the
first victim, 4s for the next one etc. and the machine will still
recover or panic within 15s after the deadlock first occurs.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-13 16:56         ` Johannes Weiner
@ 2016-01-13 18:01           ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2016-01-13 18:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, akpm, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Wed 13-01-16 11:56:09, Johannes Weiner wrote:
> On Wed, Jan 13, 2016 at 05:26:10PM +0100, Michal Hocko wrote:
> > On Wed 13-01-16 21:11:30, Tetsuo Handa wrote:
> > [...]
> > > Those who use panic_on_oom = 1 expect that the system triggers a kernel
> > > panic rather than stalling forever. This is a translation of the
> > > administrator's wish: "Please press SysRq-c on my behalf if memory is
> > > exhausted. That way, I don't need to stand by in front of the console
> > > twenty-four seven."
> > > 
> > > Those who use panic_on_oom = 0 expect that the OOM killer resolves the OOM
> > > condition rather than stalling forever. This is a translation of the
> > > administrator's wish: "Please press SysRq-f on my behalf if memory is
> > > exhausted. That way, I don't need to stand by in front of the console
> > > twenty-four seven."
> > 
> > I think you are missing an important point. There is _no reliable_ way
> > to resolve the OOM condition in general except to panic the system. Even
> > killing all user-space tasks might not be sufficient in general, because
> > they might be blocked by an unkillable context (e.g. a kernel thread).
> > So if you need reliable behavior, either use panic_on_oom=1 or provide
> > a measure to panic after a fixed timeout if the OOM condition cannot be
> > resolved. We have seen patches in that regard, but there was no general
> > interest in merging them.
> 
> While what you're saying about there not being a failsafe way is true,
> I don't understand why we should panic the machine before we tried to
> kill every single userspace task. That's what I never understood about
> your timeout-panic patches: if the OOM victim doesn't exit in a fixed
> amount of time, why is it better to panic the machine than to try the
> second-best, third-best, fourth-best etc. OOM candidates?
> 
> Yes, you can say that at least the kernel will make a decision in a
> fixed amount of time and it'll be more useful in practice.

Yes, this is the main argument. The predictability of the behavior is
the main concern for any timeout-based solution. If you know that your
failover mechanism requires N + reboot_time, then you want the system
to act very close to N. Acting after _at least N_ with no upper bound
sounds more than impractical to me.
If you can handle more tasks within that time period, as you are
suggesting below, then such a solution would satisfy this requirement as
well. I haven't explored how complex such a solution would be, though.

Timeout-to-panic patches were just trying to be as simple as possible
to guarantee the predictability requirement. No other timeout-based
solution proposed so far guaranteed the same, AFAIR.

> But the
> reality of most scenarios is that moving on to other victims will
> increase the chance of success dramatically while the chance of
> continued hanging would converge toward 0.

While theoretically the probability of a hang decreases with the number
of OOM victims (if the amount of memory reserves scales as well), most of
the cases where we really got stuck during OOM required a major screw-up;
something must have gone really wrong. There are many tasks blocked on
the shared resource, and it would take killing them all (if that is even
possible) to move on.

In other words, I haven't seen a real-world use case where a reasonably
configured system locked up during OOM. In my experience it is much more
probable that the system thrashes for hours before it even hits the OOM
killer.

> And for the more extreme scenarios, where you have a million tasks all
> blocked on the same resource, we can decay the timeout exponentially
> to cap the decision time to a reasonable worst case; wait 8s for the
> first victim, 4s for the next one etc. and the machine will still
> recover or panic within 15s after the deadlock first occurs.

Yes, this would be usable. Basically any timeout-based solution must
converge to a known state to be usable, as said above IMHO.
I am just not convinced such a sophisticated mechanism will really be
needed once we have functional async memory reaping in place. I might
be wrong here, of course, and then we can explore other ways to mitigate
issues that pop up.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
@ 2016-01-13 18:01           ` Michal Hocko
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2016-01-13 18:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, akpm, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Wed 13-01-16 11:56:09, Johannes Weiner wrote:
> On Wed, Jan 13, 2016 at 05:26:10PM +0100, Michal Hocko wrote:
> > On Wed 13-01-16 21:11:30, Tetsuo Handa wrote:
> > [...]
> > > Those who use panic_on_oom = 1 expect that the system triggers kernel panic
> > > rather than stall forever. This is a translation of administrator's wish that
> > > "Please press SysRq-c on behalf of me if the memory exhausted. In that way,
> > > I don't need to stand by in front of the console twenty-four seven."
> > > 
> > > Those who use panic_on_oom = 0 expect that the OOM killer solves OOM condition
> > > rather than stall forever. This is a translation of administrator's wish that
> > > "Please press SysRq-f on behalf of me if the memory exhausted. In that way,
> > > I don't need to stand by in front of the console twenty-four seven."
> > 
> > I think you are missing an important point. There is _no reliable_ way
> > to resolve the OOM condition in general except to panic the system. Even
> > killing all user space tasks might not be sufficient in general because
> > they might be blocked by an unkillable context (e.g. kernel thread).
> > So if you need a reliable behavior then either use panic_on_oom=1 or
> > provide a measure to panic after fixed timeout if the OOM cannot get
> > resolved. We have seen patches in that regards but there was no general
> > interest in them to merge them.
> 
> While what you're saying about there not being a failsafe way is true,
> I don't understand why we should panic the machine before we tried to
> kill every single userspace task. That's what I never understood about
> your timeout-panic patches: if the OOM victim doesn't exit in a fixed
> amount of time, why is it better to panic the machine than to try the
> second-best, third-best, fourth-best etc. OOM candidates?
> 
> Yes, you can say that at least the kernel will make a decision in a
> fixed amount of time and it'll be more useful in practice.

Yes, this is the main argument. The predictability of the behavior is
the main concern for any timeout based solution. If you know that your
failover mechanisms requires N + reboot_time then you want the
system to act very close to N. To act after _at least N_ without upper
bound sounds more than impractical to me.
If you can handle more tasks within that time period, as you are
suggesting below, then such a solution would satisfy this requirement as
well. I haven't explored how complex such a solution would be though.

Timeout-to-panic patches were just trying to be as simple as possible
to guarantee the predictability requirement. No other timeout based
solutions, which were proposed so far, did guarantee the same AFAIR.

> But the
> reality of most scenarios is that moving on to other victims will
> increase the chance of success dramatically while the chance of
> continued hanging would converge toward 0.

While theoretically the probability of hang decreases with the number of
oom victims (if the amount of memory reserves scales as well) most of
the cases where we really got stuck during OOM required a major screw up
and something must have gone really wrong. There are many tasks blocked
on the shared resource and it would take killing them all (if that is
even possible) to move on.

In other words, I haven't seen a reasonably configured real world system
lock up during OOM. In my experience it is much more probable that the
system is thrashing for hours before it even hits the OOM killer.

> And for the more extreme scenarios, where you have a million tasks all
> blocked on the same resource, we can decay the timeout exponentially
> to cap the decision time to a reasonable worst case; wait 8s for the
> first victim, 4s for the next one etc. and the machine will still
> recover or panic within 15s after the deadlock first occurs.
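The halving scheme quoted above can be sketched roughly as follows (a hedged illustration, not kernel code; the function name is hypothetical):

```python
def victim_timeouts(initial_s=8.0, floor_s=1.0):
    """Yield per-victim wait times that halve each round down to a floor,
    so the total decision time is bounded by roughly twice the initial
    timeout: 8s + 4s + 2s + 1s = 15s."""
    t = initial_s
    while t >= floor_s:
        yield t
        t /= 2.0

timeouts = list(victim_timeouts())   # [8.0, 4.0, 2.0, 1.0]
worst_case = sum(timeouts)           # 15.0 seconds before recovery or panic
```

This is what makes the scheme converge to a known state: no matter how many victims are tried, the sum of the waits is a geometric series with a fixed upper bound.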

Yes, this would be usable. Basically any timeout based solution must
converge to a known state to be usable, as said above IMHO.
I am just not convinced such a sophisticated mechanism will really be
needed after we have functional async memory reaping in place. I might
be wrong here of course and then we can explore other ways to mitigate
issues that pop out.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-13 18:01           ` Michal Hocko
@ 2016-01-14 11:26             ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-14 11:26 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> I think you are missing an important point. There is _no reliable_ way
> to resolve the OOM condition in general except to panic the system. Even
> killing all user space tasks might not be sufficient in general because
> they might be blocked by an unkillable context (e.g. kernel thread).

I know. What I'm proposing is to try to recover by killing more OOM-killable
tasks, because I think the impact of crashing the kernel is larger than the
impact of killing all OOM-killable tasks. We should at least try to OOM-kill
all OOM-killable processes before crashing the kernel. Some servers take many
minutes to reboot, whereas restarting OOM-killed services takes only a few
seconds. Also, SysRq-i is inconvenient because it kills the OOM-unkillable ssh
daemon process.

An example is:

  (1) Kill a victim and start a timeout counter.

  (2) Kill all oom_score_adj > 0 tasks if the OOM condition is not
      resolved within 5 seconds of (1).

  (3) Kill all oom_score_adj = 0 tasks if the OOM condition is not
      resolved within 5 seconds of (2).

  (4) Kill all oom_score_adj >= -500 tasks if the OOM condition is not
      resolved within 5 seconds of (3).

  (5) Kill all oom_score_adj >= -999 tasks if the OOM condition is not
      resolved within 5 seconds of (4).

  (6) Trigger a kernel panic, because only OOM-unkillable tasks are left,
      if the OOM condition is not resolved within 5 seconds of (5).
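The escalation ladder above widens the killable set at each stage. A minimal sketch of the stage logic (hypothetical names, not kernel code; each stage is cumulative, so tasks already selected in earlier stages stay selected):

```python
# Stage i kills every remaining task with oom_score_adj >= STAGE_THRESHOLDS[i];
# running out of stages corresponds to step (6), the kernel panic.
STAGE_THRESHOLDS = [1, 0, -500, -999]

def targets_for_stage(tasks, stage):
    """tasks: list of (name, oom_score_adj) pairs; stage: 0-based index."""
    if stage >= len(STAGE_THRESHOLDS):
        raise RuntimeError("panic: only OOM-unkillable tasks are left")
    threshold = STAGE_THRESHOLDS[stage]
    return [name for name, adj in tasks if adj >= threshold]
```

Note that adj == -1000 (OOM_SCORE_ADJ_MIN, fully OOM-disabled) is never matched by any stage, which is what forces step (6) when only such tasks remain.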

> All we can do is a best effort approach which tries to be optimized to
> reduce the impact of an unexpected SIGKILL sent to a "random" task. And
> this is a reasonable objective IMHO.

A best effort approach which is optimized to reduce the possibility of
a kernel panic should exist as well.



Michal Hocko wrote:
> Timeout-to-panic patches were just trying to be as simple as possible
> to guarantee the predictability requirement. No other timeout based
> solution proposed so far guaranteed the same, AFAIR.

What did "[PATCH] mm: Introduce timeout based OOM killing" miss
( http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp )?
It provided:

  (1) warn about an OOM victim not dying, using the memdie_task_warn_secs timeout
  (2) select the next OOM victim, using the memdie_task_skip_secs timeout
  (3) trigger a kernel panic, using the memdie_task_panic_secs timeout
  (4) warn about a thrashing condition, using the memalloc_task_warn_secs timeout
  (5) trigger the OOM killer, using the memalloc_task_retry_secs timeout
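A rough illustration of how the memdie knobs relate (all values below are hypothetical, not the defaults of that patch): they only make sense if the warning fires first, then victim selection, then panic.

```python
# Hypothetical timeout values for the sysctls listed above.
timeouts = {
    "memdie_task_warn_secs":    10,   # warn that an OOM victim is not dying
    "memdie_task_skip_secs":    30,   # then select the next OOM victim
    "memdie_task_panic_secs":  300,   # finally trigger a kernel panic
    "memalloc_task_warn_secs":  60,   # warn about a thrashing condition
    "memalloc_task_retry_secs": 120,  # then trigger the OOM killer
}

def memdie_ordering_ok(t):
    """Warn must fire before skip, and skip before panic."""
    return (t["memdie_task_warn_secs"]
            <= t["memdie_task_skip_secs"]
            <= t["memdie_task_panic_secs"])
```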

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-14 11:26             ` Tetsuo Handa
@ 2016-01-14 22:01               ` David Rientjes
  -1 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-14 22:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, Andrew Morton, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Thu, 14 Jan 2016, Tetsuo Handa wrote:

> I know. What I'm proposing is to try to recover by killing more OOM-killable
> tasks, because I think the impact of crashing the kernel is larger than the
> impact of killing all OOM-killable tasks. We should at least try to OOM-kill
> all OOM-killable processes before crashing the kernel. Some servers take many
> minutes to reboot, whereas restarting OOM-killed services takes only a few
> seconds. Also, SysRq-i is inconvenient because it kills the OOM-unkillable ssh
> daemon process.
> 

This is where you and I disagree; the goal should not be to continue to
oom kill more and more processes, since there is no guarantee that further
kills will result in forward progress.  These additional kills can result
in the same livelock that is already problematic, and killing additional
processes makes the situation worse, since memory reserves become further
depleted.

I believe what is better is to exhaust reclaim, check whether the page
allocator is constantly looping while waiting for the same victim to
exit, and then allow that allocation to use memory reserves; see the
attached patch, which I have proposed before.

The livelock often occurs when a single thread is holding a kernel mutex
and is allocating memory, and the oom victim requires that mutex to exit.
This approach allows that holder to allocate and (hopefully) eventually
drop the mutex.  It doesn't require that we randomly oom kill processes
that don't require the mutex, which the oom killer can't know, and it also
minimizes the amount of lost work.  Oom killing many threads is never
a solution to anything since reserves are a finite resource.

I know you'll object to this and propose some theory or usecase that shows 
a mutex holder can allocate a ton of memory while holding that mutex and 
memory reserves get depleted.  If you can show such a case in the kernel 
where that happens, we can fix that.  However, the patch also allows for 
identifying such processes by showing their stack trace so these issues 
can be fixed.

Remember, we can never recover 100% of the time, and arguing that this is 
not a perfect solution is not going to help us make forward progress in 
this discussion.

----- >o -----

mm, oom: add global access to memory reserves on livelock

On system oom, a process may fail to exit if its thread depends on a lock
held by another allocating process.

In this case, we can detect an oom kill livelock that requires memory
allocation to be successful to resolve.

This patch introduces an oom expiration, set to 5s, that defines how long
a thread has to exit after being oom killed.

When this period elapses, it is assumed that the thread cannot make
forward progress without help.  The only help the VM may provide is to
allow pending allocations to succeed, so it grants all allocators access
to memory reserves after reclaim and compaction have failed.

This patch does not allow global access to memory reserves on memcg oom
kill, but the functionality is there if extended.
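The expiration bookkeeping described above can be sketched in a few lines (a hedged Python model, not the kernel implementation; times are seconds on an abstract clock rather than jiffies):

```python
OOM_EXPIRE_SECS = 5.0   # mirrors the patch's OOM_EXPIRE_MSECS of 5000

def mark_oom_victim(victim, now):
    """Record when the victim's grace period to exit ends, as
    mark_oom_victim() does with signal->oom_kill_expire."""
    victim["oom_kill_expire"] = now + OOM_EXPIRE_SECS

def may_use_reserves(victim, now):
    """True once the victim has outlived its expiration without exiting;
    at that point all allocators are granted access to reserves."""
    expire = victim.get("oom_kill_expire")
    return expire is not None and now >= expire
```

Until the deadline passes the allocator keeps waiting on the victim; only after it expires does `ALLOC_NO_WATERMARKS`-style access open up, which is the escape hatch for a mutex holder that must allocate before the victim can exit.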

An example livelock is possible with a kernel module (requires
EXPORT_SYMBOL(min_free_kbytes)):

#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/oom.h>

/*
 * The contended mutex that the allocating kthread is holding and that the oom
 * killed process is trying to acquire.
 */
static DEFINE_MUTEX(shared_mutex);

static int alloc_thread(void *param)
{
	struct task_struct *locking_thread = param;
	struct page *page, *next;
	unsigned int allocs_left;
	bool oom_killed = false;
	LIST_HEAD(pagelist);

	/* Allocate half of memory reserves after oom */
	allocs_left = (min_free_kbytes >> (PAGE_SHIFT - 10)) / 2;

	mutex_lock(&shared_mutex);

	/* Give locking_thread a chance to wakeup */
	msleep(1000);

	while (!oom_killed || allocs_left) {
		page = alloc_pages(GFP_KERNEL, 0);
		if (likely(page))
			list_add(&page->lru, &pagelist);

		cond_resched();

		if (unlikely(kthread_should_stop()))
			break;
		if (oom_killed) {
			allocs_left--;
			continue;
		}
		if (unlikely(fatal_signal_pending(locking_thread))) {
			/*
			 * The process trying to acquire shared_mutex has been
			 * killed.  Continue to allocate some memory to use
			 * reserves, and then drop the mutex.
			 */
			oom_killed = true;
		}
	}
	mutex_unlock(&shared_mutex);

	/* Release memory back to the system */
	list_for_each_entry_safe(page, next, &pagelist, lru) {
		list_del(&page->lru);
		__free_page(page);
		cond_resched();
	}
	return 0;
}

static int __init oom_livelock_init(void)
{
	struct task_struct *allocating_thread;

	allocating_thread = kthread_run(alloc_thread, current,
					"oom_test_alloc");
	if (unlikely(IS_ERR(allocating_thread))) {
		pr_err("oom_test_alloc: could not create oom_test_alloc\n");
		return PTR_ERR(allocating_thread);
	}

	/* Prefer to be the first process killed on system oom */
	set_current_oom_origin();

	/* Wait until the kthread has acquired the mutex */
	while (!mutex_is_locked(&shared_mutex))
		schedule();

	/* This should livelock without VM intervention */
	mutex_lock(&shared_mutex);
	mutex_unlock(&shared_mutex);

	kthread_stop(allocating_thread);
	return 0;
}

module_init(oom_livelock_init);

This will livelock a system running without this patch.  With an oom
expiration, allocating_thread eventually gets access to allocate memory
so that it can drop shared_mutex and the oom victim, insmod, exits.

The format of the kernel log for the oom killed victim is:

	oom_test_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0, oom_score_badness=5000 (enabled), memcg_scoring=disabled
	oom_test_alloc cpuset=/ mems_allowed=0-1
	...
	Out of Memory: Kill process 13692 (insmod) score 1142282919 or sacrifice child
	Out of Memory: Killed process 13692 (insmod) total-vm:4356kB, anon-rss:68kB, file-rss:284kB
	GOOGLE: oom-kill constraint=CONSTRAINT_NONE origin_memcg= kill_memcg= task=insmod pid=13692 uid=0

and then access to memory reserves being granted:

	WARNING: CPU: 17 PID: 13046 at mm/oom_kill.c:314 oom_scan_process_thread+0x16b/0x1f0()
	insmod (13692) has failed to exit -- global access to memory reserves started
	...
	Call Trace:
	 [<ffffffffbb5f0e67>] dump_stack+0x46/0x58
	 [<ffffffffbb069394>] warn_slowpath_common+0x94/0xc0
	 [<ffffffffbb069476>] warn_slowpath_fmt+0x46/0x50
	 [<ffffffffbb14ad1b>] oom_scan_process_thread+0x16b/0x1f0
	 [<ffffffffbb14b81b>] out_of_memory+0x39b/0x6f0
	 ...
	---[ end trace a26a290e84699a90 ]---
	Call Trace of insmod/13692:
	 [<ffffffffbb0d620c>] load_module+0x1cdc/0x23c0
	 [<ffffffffbb0d6996>] SyS_init_module+0xa6/0xd0
	 [<ffffffffbb5fb322>] system_call_fastpath+0x16/0x1b
	 [<ffffffffffffffff>] 0xffffffffffffffff

load_module+0x1cdc is the acquisition site of a shared mutex in the unit
test that insmod is trying to acquire and that oom_test_alloc holds while
allocating.

And then the eventual exit of insmod:

	insmod (13692) exited -- global access to memory reserves ended

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/tty/sysrq.c   |  3 +-
 include/linux/oom.h   |  5 +--
 include/linux/sched.h |  1 +
 kernel/exit.c         |  4 +++
 mm/memcontrol.c       |  4 ++-
 mm/oom_kill.c         | 92 +++++++++++++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c       | 18 ++++++----
 7 files changed, 110 insertions(+), 17 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -360,9 +360,10 @@ static void moom_callback(struct work_struct *ignored)
 		.gfp_mask = gfp_mask,
 		.order = -1,
 	};
+	bool unused;
 
 	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
+	if (!out_of_memory(&oc, &unused))
 		pr_info("OOM request ignored because killer is disabled\n");
 	mutex_unlock(&oom_lock);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -87,9 +87,10 @@ extern void check_panic_on_oom(struct oom_control *oc,
 			       struct mem_cgroup *memcg);
 
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-		struct task_struct *task, unsigned long totalpages);
+		struct task_struct *task, unsigned long totalpages,
+		bool *expire);
 
-extern bool out_of_memory(struct oom_control *oc);
+extern bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill);
 
 extern void exit_oom_victim(void);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -767,6 +767,7 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	unsigned long oom_kill_expire;	/* expiration time */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -427,6 +427,10 @@ static void exit_mm(struct task_struct *tsk)
 	}
 	atomic_inc(&mm->mm_count);
 	BUG_ON(mm != tsk->active_mm);
+	if (tsk->signal->oom_kill_expire &&
+	    time_after_eq(jiffies, tsk->signal->oom_kill_expire))
+		pr_info("%s (%d) exited -- global access to memory reserves ended\n",
+			tsk->comm, tsk->pid);
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1330,10 +1330,12 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		bool unused;
 
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
-			switch (oom_scan_process_thread(&oc, task, totalpages)) {
+			switch (oom_scan_process_thread(&oc, task, totalpages,
+							&unused)) {
 			case OOM_SCAN_SELECT:
 				if (chosen)
 					put_task_struct(chosen);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,7 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/stacktrace.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -45,6 +46,14 @@ int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
 
+/*
+ * If an oom killed thread cannot exit because another thread is holding a lock
+ * that it requires, then the oom killer cannot ensure forward progress.  When
+ * OOM_EXPIRE_MSECS elapses, provide all threads access to memory reserves so the
+ * thread holding the lock may drop it and the oom victim may exit.
+ */
+#define OOM_EXPIRE_MSECS	(5000)
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -254,8 +263,57 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+#ifdef CONFIG_STACKTRACE
+#define MAX_STACK_TRACE_ENTRIES	(64)
+static unsigned long stack_trace_entries[MAX_STACK_TRACE_ENTRIES *
+					 sizeof(unsigned long)];
+static DEFINE_MUTEX(stack_trace_mutex);
+
+static void print_stacks_expired(struct task_struct *task)
+{
+	/* One set of stack traces every OOM_EXPIRE_MSECS */
+	static DEFINE_RATELIMIT_STATE(expire_rs, OOM_EXPIRE_MSECS / 1000 * HZ,
+				      1);
+	struct stack_trace trace = {
+		.nr_entries = 0,
+		.max_entries = ARRAY_SIZE(stack_trace_entries),
+		.entries = stack_trace_entries,
+		.skip = 2,
+	};
+
+	if (!__ratelimit(&expire_rs))
+		return;
+
+	WARN(true,
+	     "%s (%d) has failed to exit -- global access to memory reserves started\n",
+	     task->comm, task->pid);
+
+	/*
+	 * If cred_guard_mutex can't be acquired, this may be a mutex that is
+	 * being held causing the livelock.  Return without printing the stack.
+	 */
+	if (!mutex_trylock(&task->signal->cred_guard_mutex))
+		return;
+
+	mutex_lock(&stack_trace_mutex);
+	save_stack_trace_tsk(task, &trace);
+
+	pr_info("Call Trace of %s/%d:\n", task->comm, task->pid);
+	print_stack_trace(&trace, 0);
+
+	mutex_unlock(&stack_trace_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
+}
+#else
+static inline void print_stacks_expired(struct task_struct *task)
+{
+}
+#endif /* CONFIG_STACKTRACE */
+
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-			struct task_struct *task, unsigned long totalpages)
+			struct task_struct *task, unsigned long totalpages,
+			bool *expired_oom_kill)
 {
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
@@ -265,8 +323,14 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
+		if (oc->order != -1) {
+			if (task->mm && time_after_eq(jiffies,
+					task->signal->oom_kill_expire)) {
+				get_task_struct(task);
+				*expired_oom_kill = true;
+			}
 			return OOM_SCAN_ABORT;
+		}
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
@@ -289,7 +353,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
  * number of 'points'.  Returns -1 on scan abort.
  */
 static struct task_struct *select_bad_process(struct oom_control *oc,
-		unsigned int *ppoints, unsigned long totalpages)
+		unsigned int *ppoints, unsigned long totalpages,
+		bool *expired_oom_kill)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -299,7 +364,8 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 	for_each_process_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(oc, p, totalpages)) {
+		switch (oom_scan_process_thread(oc, p, totalpages,
+						expired_oom_kill)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
 			chosen_points = ULONG_MAX;
@@ -308,6 +374,10 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			continue;
 		case OOM_SCAN_ABORT:
 			rcu_read_unlock();
+			if (*expired_oom_kill) {
+				print_stacks_expired(p);
+				put_task_struct(p);
+			}
 			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
@@ -414,6 +484,10 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
+	tsk->signal->oom_kill_expire = jiffies +
+				       msecs_to_jiffies(OOM_EXPIRE_MSECS);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -637,8 +711,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * expired_oom_kill is true if waiting on a process that has exceeded its oom
+ * expiration to exit.
  */
-bool out_of_memory(struct oom_control *oc)
+bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill)
 {
 	struct task_struct *p;
 	unsigned long totalpages;
@@ -686,7 +763,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	p = select_bad_process(oc, &points, totalpages);
+	p = select_bad_process(oc, &points, totalpages, expired_oom_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && oc->order != -1) {
 		dump_header(oc, NULL, NULL);
@@ -717,6 +794,7 @@ void pagefault_out_of_memory(void)
 		.gfp_mask = 0,
 		.order = 0,
 	};
+	bool unused;
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -724,7 +802,7 @@ void pagefault_out_of_memory(void)
 	if (!mutex_trylock(&oom_lock))
 		return;
 
-	if (!out_of_memory(&oc)) {
+	if (!out_of_memory(&oc, &unused)) {
 		/*
 		 * There shouldn't be any user tasks runnable while the
 		 * OOM killer is disabled, so the current task has to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2717,7 +2717,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	const struct alloc_context *ac, unsigned long *did_some_progress,
+	bool *expired_oom_kill)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -2776,7 +2777,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+	if (out_of_memory(&oc, expired_oom_kill) ||
+	    WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;
 out:
 	mutex_unlock(&oom_lock);
@@ -2947,7 +2949,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 }
 
 static inline int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+gfp_to_alloc_flags(gfp_t gfp_mask, const bool expired_oom_kill)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
@@ -2987,6 +2989,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 				((current->flags & PF_MEMALLOC) ||
 				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (expired_oom_kill)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2997,7 +3001,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+	return !!(gfp_to_alloc_flags(gfp_mask, false) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -3012,6 +3016,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	bool expired_oom_kill = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3041,7 +3046,7 @@ retry:
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
 	 */
-	alloc_flags = gfp_to_alloc_flags(gfp_mask);
+	alloc_flags = gfp_to_alloc_flags(gfp_mask, expired_oom_kill);
 
 	/*
 	 * Find the true preferred zone if the allocation is unconstrained by
@@ -3166,7 +3171,8 @@ retry:
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress,
+				     &expired_oom_kill);
 	if (page)
 		goto got_pg;
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
@ 2016-01-14 22:01               ` David Rientjes
  0 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-14 22:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, Andrew Morton, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Thu, 14 Jan 2016, Tetsuo Handa wrote:

> I know. What I'm proposing is try to recover by killing more OOM-killable
> tasks because I think impact of crashing the kernel is larger than impact
> of killing all OOM-killable tasks. We should at least try OOM-kill all
> OOM-killable processes before crashing the kernel. Some servers take many
> minutes to reboot whereas restarting OOM-killed services takes only a few
> seconds. Also, SysRq-i is inconvenient because it kills OOM-unkillable ssh
> daemon process.
> 

This is where me and you disagree; the goal should not be to continue to 
oom kill more and more processes since there is no guarantee that further 
kills will result in forward progress.  These additional kills can result 
in the same livelock that is already problematic, and killing additional 
processes has made the situation worse since memory reserves are more 
depleted.

I believe what is better is to exhaust reclaim, check if the page 
allocator is constantly looping due to waiting for the same victim to 
exit, and then allowing that allocation with memory reserves, see the 
attached patch which I have proposed before.

The livelock often occurs when a single thread is holding a kernel mutex 
and is allocating memory and the oom victim requires that mutex to exit.  
This approach allows that holder to allocate and (hopefully) eventually 
drop the mutex.  It doesn't require that we randomly oom kill processes 
that don't requrie the mutex, which the oom killer can't know, and it also 
avoids losing the most amount of work.  Oom killing many threads is never 
a solution to anything since reserves are a finite resource.

I know you'll object to this and propose some theory or usecase that shows 
a mutex holder can allocate a ton of memory while holding that mutex and 
memory reserves get depleted.  If you can show such a case in the kernel 
where that happens, we can fix that.  However, the patch also allows for 
identifying such processes by showing their stack trace so these issues 
can be fixed.

Remember, we can never recover 100% of the time, and arguing that this is 
not a perfect solution is not going to help us make forward progress in 
this discussion.

----- >o -----

mm, oom: add global access to memory reserves on livelock

On system oom, a process may fail to exit if its thread depends on a lock
held by another allocating process.

In this case, we can detect an oom kill livelock that requires memory
allocation to be successful to resolve.

This patch introduces an oom expiration, set to 5s, that defines how long
a thread has to exit after being oom killed.

When this period elapses, it is assumed that the thread cannot make
forward progress without help.  The only help the VM may provide is to
allow pending allocations to succeed, so it grants all allocators access
to memory reserves after reclaim and compaction have failed.

This patch does not allow global access to memory reserves on memcg oom
kill, but the functionality is there if extended.

An example livelock is possible with a kernel module (requires
EXPORT_SYMBOL(min_free_kbytes)):

#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/oom.h>

/*
 * The contended mutex that the allocating kthread is holding and that the oom
 * killed process is trying to acquire.
 */
static DEFINE_MUTEX(shared_mutex);

static int alloc_thread(void *param)
{
	struct task_struct *locking_thread = param;
	struct page *page, *next;
	unsigned int allocs_left;
	bool oom_killed = false;
	LIST_HEAD(pagelist);

	/* Allocate half of memory reserves after oom */
	allocs_left = (min_free_kbytes >> (PAGE_SHIFT - 10)) / 2;

	mutex_lock(&shared_mutex);

	/* Give locking_thread a chance to wakeup */
	msleep(1000);

	while (!oom_killed || allocs_left) {
		page = alloc_pages(GFP_KERNEL, 0);
		if (likely(page))
			list_add(&page->lru, &pagelist);

		cond_resched();

		if (unlikely(kthread_should_stop()))
			break;
		if (oom_killed) {
			allocs_left--;
			continue;
		}
		if (unlikely(fatal_signal_pending(locking_thread))) {
			/*
			 * The process trying to acquire shared_mutex has been
			 * killed.  Continue to allocate some memory to use
			 * reserves, and then drop the mutex.
			 */
			oom_killed = true;
		}
	}
	mutex_unlock(&shared_mutex);

	/* Release memory back to the system */
	list_for_each_entry_safe(page, next, &pagelist, lru) {
		list_del(&page->lru);
		__free_page(page);
		cond_resched();
	}
	return 0;
}

static int __init oom_livelock_init(void)
{
	struct task_struct *allocating_thread;

	allocating_thread = kthread_run(alloc_thread, current,
					"oom_test_alloc");
	if (unlikely(IS_ERR(allocating_thread))) {
		pr_err("oom_test_alloc: could not create oom_test_alloc\n");
		return PTR_ERR(allocating_thread);
	}

	/* Prefer to be the first process killed on system oom */
	set_current_oom_origin();

	/* Wait until the kthread has acquired the mutex */
	while (!mutex_is_locked(&shared_mutex))
		schedule();

	/* This should livelock without VM intervention */
	mutex_lock(&shared_mutex);
	mutex_unlock(&shared_mutex);

	kthread_stop(allocating_thread);
	return 0;
}

module_init(oom_livelock_init);

This will livelock a system without this patch.  With an oom
expiration, allocating_thread eventually gets access to allocate memory
so that it can drop shared_mutex, and the oom victim, insmod, exits.

The kernel log first shows the oom killed victim:

	oom_test_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0, oom_score_badness=5000 (enabled), memcg_scoring=disabled
	oom_test_alloc cpuset=/ mems_allowed=0-1
	...
	Out of Memory: Kill process 13692 (insmod) score 1142282919 or sacrifice child
	Out of Memory: Killed process 13692 (insmod) total-vm:4356kB, anon-rss:68kB, file-rss:284kB
	GOOGLE: oom-kill constraint=CONSTRAINT_NONE origin_memcg= kill_memcg= task=insmod pid=13692 uid=0

and then access to memory reserves being granted:

	WARNING: CPU: 17 PID: 13046 at mm/oom_kill.c:314 oom_scan_process_thread+0x16b/0x1f0()
	insmod (13692) has failed to exit -- global access to memory reserves started
	...
	Call Trace:
	 [<ffffffffbb5f0e67>] dump_stack+0x46/0x58
	 [<ffffffffbb069394>] warn_slowpath_common+0x94/0xc0
	 [<ffffffffbb069476>] warn_slowpath_fmt+0x46/0x50
	 [<ffffffffbb14ad1b>] oom_scan_process_thread+0x16b/0x1f0
	 [<ffffffffbb14b81b>] out_of_memory+0x39b/0x6f0
	 ...
	---[ end trace a26a290e84699a90 ]---
	Call Trace of insmod/13692:
	 [<ffffffffbb0d620c>] load_module+0x1cdc/0x23c0
	 [<ffffffffbb0d6996>] SyS_init_module+0xa6/0xd0
	 [<ffffffffbb5fb322>] system_call_fastpath+0x16/0x1b
	 [<ffffffffffffffff>] 0xffffffffffffffff

load_module+0x1cdc is where insmod blocks on the shared mutex in the unit
test, which oom_test_alloc is holding while allocating.

And then the eventual exit of insmod:

	insmod (13692) exited -- global access to memory reserves ended

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/tty/sysrq.c   |  3 +-
 include/linux/oom.h   |  5 +--
 include/linux/sched.h |  1 +
 kernel/exit.c         |  4 +++
 mm/memcontrol.c       |  4 ++-
 mm/oom_kill.c         | 92 +++++++++++++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c       | 18 ++++++----
 7 files changed, 110 insertions(+), 17 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -360,9 +360,10 @@ static void moom_callback(struct work_struct *ignored)
 		.gfp_mask = gfp_mask,
 		.order = -1,
 	};
+	bool unused;
 
 	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
+	if (!out_of_memory(&oc, &unused))
 		pr_info("OOM request ignored because killer is disabled\n");
 	mutex_unlock(&oom_lock);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -87,9 +87,10 @@ extern void check_panic_on_oom(struct oom_control *oc,
 			       struct mem_cgroup *memcg);
 
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-		struct task_struct *task, unsigned long totalpages);
+		struct task_struct *task, unsigned long totalpages,
+		bool *expire);
 
-extern bool out_of_memory(struct oom_control *oc);
+extern bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill);
 
 extern void exit_oom_victim(void);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -767,6 +767,7 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	unsigned long oom_kill_expire;	/* expiration time */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -427,6 +427,10 @@ static void exit_mm(struct task_struct *tsk)
 	}
 	atomic_inc(&mm->mm_count);
 	BUG_ON(mm != tsk->active_mm);
+	if (tsk->signal->oom_kill_expire &&
+	    time_after_eq(jiffies, tsk->signal->oom_kill_expire))
+		pr_info("%s (%d) exited -- global access to memory reserves ended\n",
+			tsk->comm, tsk->pid);
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1330,10 +1330,12 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		bool unused;
 
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
-			switch (oom_scan_process_thread(&oc, task, totalpages)) {
+			switch (oom_scan_process_thread(&oc, task, totalpages,
+							&unused)) {
 			case OOM_SCAN_SELECT:
 				if (chosen)
 					put_task_struct(chosen);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,7 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/stacktrace.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -45,6 +46,14 @@ int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
 
+/*
+ * If an oom killed thread cannot exit because another thread is holding a lock
+ * that it requires, then the oom killer cannot ensure forward progress.  When
+ * OOM_EXPIRE_MSECS lapses, provide all threads access to memory reserves so the
+ * thread holding the lock may drop it and the oom victim may exit.
+ */
+#define OOM_EXPIRE_MSECS	(5000)
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -254,8 +263,57 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+#ifdef CONFIG_STACKTRACE
+#define MAX_STACK_TRACE_ENTRIES	(64)
+static unsigned long stack_trace_entries[MAX_STACK_TRACE_ENTRIES *
+					 sizeof(unsigned long)];
+static DEFINE_MUTEX(stack_trace_mutex);
+
+static void print_stacks_expired(struct task_struct *task)
+{
+	/* One set of stack traces every OOM_EXPIRE_MSECS */
+	static DEFINE_RATELIMIT_STATE(expire_rs, OOM_EXPIRE_MSECS / 1000 * HZ,
+				      1);
+	struct stack_trace trace = {
+		.nr_entries = 0,
+		.max_entries = ARRAY_SIZE(stack_trace_entries),
+		.entries = stack_trace_entries,
+		.skip = 2,
+	};
+
+	if (!__ratelimit(&expire_rs))
+		return;
+
+	WARN(true,
+	     "%s (%d) has failed to exit -- global access to memory reserves started\n",
+	     task->comm, task->pid);
+
+	/*
+	 * If cred_guard_mutex can't be acquired, this may be a mutex that is
+	 * being held causing the livelock.  Return without printing the stack.
+	 */
+	if (!mutex_trylock(&task->signal->cred_guard_mutex))
+		return;
+
+	mutex_lock(&stack_trace_mutex);
+	save_stack_trace_tsk(task, &trace);
+
+	pr_info("Call Trace of %s/%d:\n", task->comm, task->pid);
+	print_stack_trace(&trace, 0);
+
+	mutex_unlock(&stack_trace_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
+}
+#else
+static inline void print_stacks_expired(struct task_struct *task)
+{
+}
+#endif /* CONFIG_STACKTRACE */
+
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-			struct task_struct *task, unsigned long totalpages)
+			struct task_struct *task, unsigned long totalpages,
+			bool *expired_oom_kill)
 {
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
@@ -265,8 +323,14 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
+		if (oc->order != -1) {
+			if (task->mm && time_after_eq(jiffies,
+					task->signal->oom_kill_expire)) {
+				get_task_struct(task);
+				*expired_oom_kill = true;
+			}
 			return OOM_SCAN_ABORT;
+		}
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
@@ -289,7 +353,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
  * number of 'points'.  Returns -1 on scan abort.
  */
 static struct task_struct *select_bad_process(struct oom_control *oc,
-		unsigned int *ppoints, unsigned long totalpages)
+		unsigned int *ppoints, unsigned long totalpages,
+		bool *expired_oom_kill)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -299,7 +364,8 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 	for_each_process_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(oc, p, totalpages)) {
+		switch (oom_scan_process_thread(oc, p, totalpages,
+						expired_oom_kill)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
 			chosen_points = ULONG_MAX;
@@ -308,6 +374,10 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			continue;
 		case OOM_SCAN_ABORT:
 			rcu_read_unlock();
+			if (*expired_oom_kill) {
+				print_stacks_expired(p);
+				put_task_struct(p);
+			}
 			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
@@ -414,6 +484,10 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
+	tsk->signal->oom_kill_expire = jiffies +
+				       msecs_to_jiffies(OOM_EXPIRE_MSECS);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -637,8 +711,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * expired_oom_kill is true if waiting on a process that has exceeded its oom
+ * expiration to exit.
  */
-bool out_of_memory(struct oom_control *oc)
+bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill)
 {
 	struct task_struct *p;
 	unsigned long totalpages;
@@ -686,7 +763,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	p = select_bad_process(oc, &points, totalpages);
+	p = select_bad_process(oc, &points, totalpages, expired_oom_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && oc->order != -1) {
 		dump_header(oc, NULL, NULL);
@@ -717,6 +794,7 @@ void pagefault_out_of_memory(void)
 		.gfp_mask = 0,
 		.order = 0,
 	};
+	bool unused;
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -724,7 +802,7 @@ void pagefault_out_of_memory(void)
 	if (!mutex_trylock(&oom_lock))
 		return;
 
-	if (!out_of_memory(&oc)) {
+	if (!out_of_memory(&oc, &unused)) {
 		/*
 		 * There shouldn't be any user tasks runnable while the
 		 * OOM killer is disabled, so the current task has to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2717,7 +2717,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	const struct alloc_context *ac, unsigned long *did_some_progress,
+	bool *expired_oom_kill)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -2776,7 +2777,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+	if (out_of_memory(&oc, expired_oom_kill) ||
+	    WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;
 out:
 	mutex_unlock(&oom_lock);
@@ -2947,7 +2949,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 }
 
 static inline int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+gfp_to_alloc_flags(gfp_t gfp_mask, const bool expired_oom_kill)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
@@ -2987,6 +2989,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 				((current->flags & PF_MEMALLOC) ||
 				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (expired_oom_kill)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2997,7 +3001,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+	return !!(gfp_to_alloc_flags(gfp_mask, false) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -3012,6 +3016,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	bool expired_oom_kill = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3041,7 +3046,7 @@ retry:
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
 	 */
-	alloc_flags = gfp_to_alloc_flags(gfp_mask);
+	alloc_flags = gfp_to_alloc_flags(gfp_mask, expired_oom_kill);
 
 	/*
 	 * Find the true preferred zone if the allocation is unconstrained by
@@ -3166,7 +3171,8 @@ retry:
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress,
+				     &expired_oom_kill);
 	if (page)
 		goto got_pg;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-14 22:01               ` David Rientjes
@ 2016-01-14 22:58                 ` Johannes Weiner
  -1 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2016-01-14 22:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, mhocko, Andrew Morton, mgorman, torvalds, oleg,
	hughd, andrea, riel, linux-mm, linux-kernel

On Thu, Jan 14, 2016 at 02:01:45PM -0800, David Rientjes wrote:
> On Thu, 14 Jan 2016, Tetsuo Handa wrote:
> > I know. What I'm proposing is try to recover by killing more OOM-killable
> > tasks because I think impact of crashing the kernel is larger than impact
> > of killing all OOM-killable tasks. We should at least try OOM-kill all
> > OOM-killable processes before crashing the kernel. Some servers take many
> > minutes to reboot whereas restarting OOM-killed services takes only a few
> > seconds. Also, SysRq-i is inconvenient because it kills OOM-unkillable ssh
> > daemon process.
> 
> This is where me and you disagree; the goal should not be to continue to 
> oom kill more and more processes since there is no guarantee that further 
> kills will result in forward progress.  These additional kills can result 
> in the same livelock that is already problematic, and killing additional 
> processes has made the situation worse since memory reserves are more 
> depleted.
> 
> I believe what is better is to exhaust reclaim, check if the page 
> allocator is constantly looping due to waiting for the same victim to 
> exit, and then allowing that allocation with memory reserves, see the 
> attached patch which I have proposed before.

If giving the reserves to another OOM victim is bad, how is giving
them to the *allocating* task supposed to be better? Which path is
more likely to release memory? That doesn't seem to follow.

We need to make the OOM killer conclude in a fixed amount of time, no
matter what happens. If the system is irrecoverably deadlocked on
memory it needs to panic (and reboot) so we can get on with it. And
it's silly to panic while there are still killable tasks available.

Hence my proposal to wait a decaying amount of time after each OOM
victim before moving on, until we killed everything in the system and
panic (and reboot). What else is there we can do once out of memory?


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-14 22:58                 ` Johannes Weiner
@ 2016-01-14 23:09                   ` David Rientjes
  -1 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-14 23:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, mhocko, Andrew Morton, mgorman, torvalds, oleg,
	hughd, andrea, riel, linux-mm, linux-kernel

On Thu, 14 Jan 2016, Johannes Weiner wrote:

> > This is where me and you disagree; the goal should not be to continue to 
> > oom kill more and more processes since there is no guarantee that further 
> > kills will result in forward progress.  These additional kills can result 
> > in the same livelock that is already problematic, and killing additional 
> > processes has made the situation worse since memory reserves are more 
> > depleted.
> > 
> > I believe what is better is to exhaust reclaim, check if the page 
> > allocator is constantly looping due to waiting for the same victim to 
> > exit, and then allowing that allocation with memory reserves, see the 
> > attached patch which I have proposed before.
> 
> If giving the reserves to another OOM victim is bad, how is giving
> them to the *allocating* task supposed to be better?

Unfortunately, due to rss and oom priority, it is possible to repeatedly 
select processes which are all waiting for the same mutex.  This is 
possible when loading shards, for example, and all processes have the same 
oom priority and are livelocked on i_mutex which is the most common 
occurrence in our environments.  The livelock came about because we 
selected a process that could not make forward progress, there is no 
guarantee that we will not continue to select such processes.

Giving access to memory reserves eventually allows all allocators to 
successfully allocate, giving the holder of i_mutex the ability to 
eventually drop it.  This happens in a very rate-limited manner depending 
on how you define when the page allocator has looped long enough waiting 
for the same process to exit in my patch.

In the past, we have even increased the scheduling priority of oom killed 
processes so that they have a greater likelihood of picking up i_mutex and 
exiting.

> We need to make the OOM killer conclude in a fixed amount of time, no
> matter what happens. If the system is irrecoverably deadlocked on
> memory it needs to panic (and reboot) so we can get on with it. And
> it's silly to panic while there are still killable tasks available.
> 

What is the solution when there are no additional processes that may be 
killed?  It is better to give access to memory reserves so that a single 
stalling allocation can succeed and the livelock can be resolved, rather 
than panicking.



* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-14 23:09                   ` David Rientjes
@ 2016-01-15 10:36                     ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-15 10:36 UTC (permalink / raw)
  To: rientjes, hannes
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

David Rientjes wrote:
> On Thu, 14 Jan 2016, Tetsuo Handa wrote:
>
> > I know. What I'm proposing is try to recover by killing more OOM-killable
> > tasks because I think impact of crashing the kernel is larger than impact
> > of killing all OOM-killable tasks. We should at least try OOM-kill all
> > OOM-killable processes before crashing the kernel. Some servers take many
> > minutes to reboot whereas restarting OOM-killed services takes only a few
> > seconds. Also, SysRq-i is inconvenient because it kills OOM-unkillable ssh
> > daemon process.
> >
>
> This is where me and you disagree; the goal should not be to continue to
> oom kill more and more processes since there is no guarantee that further
> kills will result in forward progress.

Leaving a system OOM-livelocked forever is a very annoying outcome.
My goal is to keep the OOM killer from abandoning its duty.
What is important for me is that the OOM killer takes the next action when
the current action did not solve the OOM situation.

For me, when and whether to allow global access to memory reserves is not
critical, for that is nothing but one of the actions administrators might
want the OOM killer to take.

>                                         These additional kills can result
> in the same livelock that is already problematic, and killing additional
> processes has made the situation worse since memory reserves are more
> depleted.

Why are you still assuming that memory reserves are more depleted if we kill
additional processes? We are introducing the OOM reaper, which can replenish
memory reserves when we kill additional processes. We can make the OOM reaper
update the oom priority of all processes that use a mm the OOM killer chose
( http://lkml.kernel.org/r/201601131915.BCI35488.FHSFQtVMJOOOLF@I-love.SAKURA.ne.jp )
so that the OOM reaper can replenish memory reserves by helping the OOM
killer select a different mm.

Current rule for allowing access to memory reserves is "Pick and Pray" (like
"Plug and Pray"), for only TIF_MEMDIE threads can access all of memory reserves
whereas !TIF_MEMDIE SIGKILL threads can access none of memory reserves.
The OOM killer picks up the first thread from all threads which use a mm
the OOM killer chose, without taking into account which thread is doing
a __GFP_FS || __GFP_NOFAIL allocation request, with an assumption that
!TIF_MEMDIE SIGKILL tasks can access all of memory reserves even if they are
doing a !__GFP_FS && !__GFP_NOFAIL allocation request. If we want SIGKILL tasks
to access all of memory reserves, we would need below change.

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2744,7 +2744,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
+		if (!(gfp_mask & __GFP_FS) && !fatal_signal_pending(current)) {
 			/*
 			 * XXX: Page reclaim didn't yield anything,
 			 * and the OOM killer can't be invoked, but
----------

But from my examination results, in most cases we don't need to allow
TIF_MEMDIE tasks to access all of memory reserves. They are likely able to
satisfy their memory allocation requests using watermarks for GFP_ATOMIC
allocations. There are cases where evenly treating all SIGKILL tasks can
solve the OOM condition without allowing access to all of memory reserves.

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2955,10 +2955,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (fatal_signal_pending(current))
+			alloc_flags |= ALLOC_HARDER | ALLOC_HIGH;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
----------

>
> I believe what is better is to exhaust reclaim, check if the page
> allocator is constantly looping due to waiting for the same victim to
> exit, and then allowing that allocation with memory reserves, see the
> attached patch which I have proposed before.
>
> The livelock often occurs when a single thread is holding a kernel mutex
> and is allocating memory and the oom victim requires that mutex to exit.
> This approach allows that holder to allocate and (hopefully) eventually
> drop the mutex.

That "hopefully" is bad. It risks losing all the work by leaving the system
OOM-livelocked forever when that "hopefully" does not come true.

>                  It doesn't require that we randomly oom kill processes
> that don't require the mutex, which the oom killer can't know, and it also
> avoids losing the most amount of work.  Oom killing many threads is never
> a solution to anything since reserves are a finite resource.

Are you aware of the OOM reaper?

>
> I know you'll object to this and propose some theory or usecase that shows
> a mutex holder can allocate a ton of memory while holding that mutex and
> memory reserves get depleted.  If you can show such a case in the kernel
> where that happens, we can fix that.  However, the patch also allows for
> identifying such processes by showing their stack trace so these issues
> can be fixed.

I already tried your patch, but the "global access to memory reserves ended"
message was not always printed
( http://lkml.kernel.org/r/201509221433.ICI00012.VFOQMFHLFJtSOO@I-love.SAKURA.ne.jp ).
What I don't like is that your patch gives up when it fails to solve the
OOM situation.

Your patch might be good for kernel developers who want to identify what
the cause is, but is bad for administrators who want to give priority to
not leaving a system OOM-livelocked forever.

>
> Remember, we can never recover 100% of the time, and arguing that this is
> not a perfect solution is not going to help us make forward progress in
> this discussion.

I know. What I'm trying to do is to let the OOM killer take the next action
when the current action did not solve the OOM situation. I don't object to
your patch if a mechanism that guarantees that the OOM killer takes the next
action when the current action did not solve the OOM situation is
implemented. The simplest way is to re-enable the OOM killer and choose more
OOM victims, thanks to introducing the OOM reaper.



David Rientjes wrote:
> On Thu, 14 Jan 2016, Johannes Weiner wrote:
>
> > > This is where me and you disagree; the goal should not be to continue to
> > > oom kill more and more processes since there is no guarantee that further
> > > kills will result in forward progress.  These additional kills can result
> > > in the same livelock that is already problematic, and killing additional
> > > processes has made the situation worse since memory reserves are more
> > > depleted.
> > >
> > > I believe what is better is to exhaust reclaim, check if the page
> > > allocator is constantly looping due to waiting for the same victim to
> > > exit, and then allowing that allocation with memory reserves, see the
> > > attached patch which I have proposed before.
> >
> > If giving the reserves to another OOM victim is bad, how is giving
> > them to the *allocating* task supposed to be better?
>
> Unfortunately, due to rss and oom priority, it is possible to repeatedly
> select processes which are all waiting for the same mutex.  This is
> possible when loading shards, for example, and all processes have the same
> oom priority and are livelocked on i_mutex which is the most common
> occurrence in our environments.  The livelock came about because we
> selected a process that could not make forward progress, there is no
> guarantee that we will not continue to select such processes.

RSS and oom_score_adj affect which process the OOM killer selects. But I
wonder how RSS and oom_score_adj can cause the OOM killer to repeatedly
select processes which are all waiting for the same mutex.
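As a side note, the way RSS and oom_score_adj feed into victim selection can
be sketched in plain userspace C (the task list, page counts, and the
simplified score formula below are illustrative, not the exact oom_badness()
arithmetic): with equal oom_score_adj values, the largest-RSS task wins every
round, which is how a set of identical shard loaders blocked on the same
mutex can be picked one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task descriptor; the real kernel walks the tasklist. */
struct task {
	const char *comm;
	long rss_pages;		/* resident pages */
	long swap_pages;	/* swapped-out pages */
	int oom_score_adj;	/* -1000 .. 1000 */
};

/*
 * Simplified badness score modeled on oom_badness(): memory footprint
 * plus an oom_score_adj bias scaled by total RAM. Illustrative only,
 * not the exact kernel formula.
 */
static long badness(const struct task *t, long totalpages)
{
	long points;

	if (t->oom_score_adj == -1000)	/* OOM-disabled */
		return 0;
	points = t->rss_pages + t->swap_pages;
	points += (long)t->oom_score_adj * totalpages / 1000;
	return points > 0 ? points : 1;
}

/* Pick the highest-scoring killable task. */
static const struct task *select_victim(const struct task *tasks,
					size_t n, long totalpages)
{
	const struct task *victim = NULL;
	long best = 0;

	for (size_t i = 0; i < n; i++) {
		long p = badness(&tasks[i], totalpages);
		if (p > best) {
			best = p;
			victim = &tasks[i];
		}
	}
	return victim;
}
```

Note that nothing in this score reflects which locks a task holds or waits
for, which is the blind spot being discussed here.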

>
> Giving access to the memory allocator eventually allows all allocators to
> successfully allocate, giving the holder of i_mutex the ability to
> eventually drop it.  This happens in a very rate-limited manner depending
> on how you define when the page allocator has looped enough waiting for
> the same process to exit in my patch.

Not always true. See serial-20150922-1.txt.xz and serial-20150922-2.txt.xz
in the link above. The "global access to memory reserves ended" message was
not always printed.

>
> In the past, we have even increased the scheduling priority of oom killed
> processes so that they have a greater likelihood of picking up i_mutex and
> exiting.

That was also "Pick and Pray", an optimistic bet made without considering
the worst case.

>
> > We need to make the OOM killer conclude in a fixed amount of time, no
> > matter what happens. If the system is irrecoverably deadlocked on
> > memory it needs to panic (and reboot) so we can get on with it. And
> > it's silly to panic while there are still killable tasks available.
> >

I agree with Johannes. Leaving the system OOM-livelocked forever
(even if we allowed global access to memory reserves) is bad.

>
> What is the solution when there are no additional processes that may be
> killed?  It is better to give access to memory reserves so a single
> stalling allocation can succeed so the livelock can be resolved rather
> than panicking.

How can your patch be a solution when memory reserves are depleted because
of your patch? I'm proposing this "re-enable the OOM killer" patch for the
case where the OOM reaper fails to reclaim enough memory to terminate the
first OOM victim. Since your patch does not choose a second OOM victim, the
OOM reaper cannot replenish memory reserves by reaping a second OOM victim
whose mm differs from the first OOM victim's.
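The point about choosing a second victim with a different mm can be sketched
in userspace C (the task and mm modeling here is hypothetical, not kernel
code): once a victim's mm has failed to be reaped, the next selection skips
every task sharing that mm, so the OOM reaper gets a fresh address space to
reclaim from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: tasks reference a shared mm. */
struct mm_model { int id; };
struct task {
	const char *comm;
	struct mm_model *mm;
	long rss_pages;
};

/*
 * Sketch of the proposed policy: pick the largest task whose mm is
 * not the one that already failed to be reaped, so killing it can
 * actually replenish memory reserves.
 */
static struct task *select_next_victim(struct task *tasks, size_t n,
				       const struct mm_model *failed_mm)
{
	struct task *victim = NULL;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].mm == failed_mm)
			continue;	/* same mm: killing it frees nothing new */
		if (!victim || tasks[i].rss_pages > victim->rss_pages)
			victim = &tasks[i];
	}
	return victim;
}
```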

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-15 10:36                     ` Tetsuo Handa
@ 2016-01-19 23:13                       ` David Rientjes
  -1 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-19 23:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, mhocko, Andrew Morton, mgorman, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Fri, 15 Jan 2016, Tetsuo Handa wrote:

> Leaving a system OOM-livelocked forever is very very annoying thing.

Agreed.

> My goal is to ask the OOM killer not to toss the OOM killer's duty away.
> What is important for me is that the OOM killer takes next action when
> current action did not solve the OOM situation.
> 

What is the "next action" when there are no more processes on your system, 
or attached to your memcg hierarchy, that are killable?

Of course your proposal offers no solution for that.  Extend it further: 
what is the "next action" when the process holding the mutex needed by the 
victim is oom disabled?

I don't think it's in the best interest of the user to randomly kill 
processes until one exits, implicitly hoping that one of your selections 
will be able to do so (your notion of "pick and pray").

> >                                         These additional kills can result
> > in the same livelock that is already problematic, and killing additional
> > processes has made the situation worse since memory reserves are more
> > depleted.
> 
> Why are you still assuming that memory reserves are more depleted if we kill
> additional processes? We are introducing the OOM reaper which can compensate
> memory reserves if we kill additional processes. We can make the OOM reaper
> update oom priority of all processes that use a mm the OOM killer chose
> ( http://lkml.kernel.org/r/201601131915.BCI35488.FHSFQtVMJOOOLF@I-love.SAKURA.ne.jp )
> so that we can help the OOM reaper compensate memory reserves by helping
> the OOM killer to select a different mm.
> 

We are not adjusting the selection heuristic, which is already 
deterministic and which people fine-tune through procfs, based on what the 
oom reaper can free.

Even if you can free memory immediately, there is no guarantee that a 
process holding a mutex needed for the victim to exit will be able to 
allocate from that memory.  Continuing to kill more and more processes may 
eventually solve a situation that simply granting temporary access to 
memory reserves would also have solved, but at the cost of, well, many 
processes.

The final solution may combine both approaches, which are the only real 
approaches for making forward progress.  We could first try allowing 
temporary access to memory reserves when a livelock has been detected, 
similar to my patch, and then, if that fails, fall back to killing 
additional processes, since the oom reaper should be able to immediately 
free at least some of that memory.
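The escalation order described above could be sketched as a tiny policy
function (the 10-second threshold and all names here are invented for
illustration): wait for the victim, then grant reserve access, then fall
back to another victim, and panic only when nothing is left.

```c
#include <assert.h>

enum oom_action { OOM_WAIT, OOM_GRANT_RESERVES, OOM_KILL_NEXT, OOM_PANIC };

/*
 * Hypothetical escalation policy; the 10-second threshold is made up.
 * stalled_secs: how long the allocator has waited for the current
 * victim to exit.  reserves_granted: temporary reserve access has
 * already been tried.  more_victims: killable processes remain.
 */
static enum oom_action next_action(int stalled_secs, int reserves_granted,
				   int more_victims)
{
	if (stalled_secs < 10)
		return OOM_WAIT;		/* give the victim time to exit */
	if (!reserves_granted)
		return OOM_GRANT_RESERVES;	/* try reserve access first */
	if (more_victims)
		return OOM_KILL_NEXT;		/* then kill another victim */
	return OOM_PANIC;			/* nothing left to try */
}
```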

However, I think the best course of action at the moment is to review and 
get the oom reaper merged, if applicable, since it should greatly aid this 
issue and then look at livelock issues as they arise once it is deployed.  
I'm not enthusiastic about adding additional heuristics and tunables for 
theoretical issues that may arise, especially considering the oom reaper 
is not even upstream.

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-19 23:13                       ` David Rientjes
@ 2016-01-20 14:36                         ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-20 14:36 UTC (permalink / raw)
  To: rientjes
  Cc: hannes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

David Rientjes wrote:
> On Fri, 15 Jan 2016, Tetsuo Handa wrote:
> 
> > Leaving a system OOM-livelocked forever is very very annoying thing.
> 
> Agreed.
> 
> > My goal is to ask the OOM killer not to toss the OOM killer's duty away.
> > What is important for me is that the OOM killer takes next action when
> > current action did not solve the OOM situation.
> > 
> 
> What is the "next action" when there are no more processes on your system, 

Just call panic(), as when select_bad_process() called from out_of_memory()
returns NULL.

> or attached to your memcg hierarchy, that are killable?

I think we have nothing to do for the mem_cgroup_out_of_memory() case.



> The final solution may combine both approaches, which are the only real 
> approaches on how to make forward progress.  We could first try allowing 
> temporary access to memory reserves when a livelock has been detected, 
> similar to my patch, and then fallback to killing additional processes 
> since the oom reaper should be able to at least free some of that memory 
> immediately, if it fails.

If we can agree on combining both approaches, I'm OK with it. That will
keep the OOM reaper simple, because the OOM reaper will not need to clear
the TIF_MEMDIE flag (clearing it is unfriendly to wait_event() in
oom_killer_disable()), and the OOM reaper will not need to care about
situations where the TIF_MEMDIE flag is set when it is not safe to reap.

What we need to do before falling back to killing additional processes is
make sure that the OOM killer won't select processes which already have
the TIF_MEMDIE flag, as in the SysRq-f case.
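That precondition can be sketched as a selection filter in userspace C (the
flag modeling is a simplification; in the kernel TIF_MEMDIE is a per-thread
flag tested with test_tsk_thread_flag()): tasks already marked as victims
are skipped instead of being selected again.

```c
#include <assert.h>
#include <stddef.h>

#define MEMDIE 0x1	/* simplified stand-in for TIF_MEMDIE */

struct task {
	const char *comm;
	unsigned int flags;
	long score;
};

/*
 * Sketch: skip tasks that are already OOM victims, as SysRq-f does,
 * so a "kill more victims" fallback never re-selects the task it is
 * already waiting for.
 */
static struct task *pick_next(struct task *tasks, size_t n)
{
	struct task *chosen = NULL;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].flags & MEMDIE)
			continue;	/* already a victim; re-selecting is a no-op */
		if (!chosen || tasks[i].score > chosen->score)
			chosen = &tasks[i];
	}
	return chosen;
}
```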

> 
> However, I think the best course of action at the moment is to review and 
> get the oom reaper merged, if applicable, since it should greatly aid this 
> issue and then look at livelock issues as they arise once it is deployed.  
> I'm not enthusiastic about adding additional heuristics and tunables for 
> theoretical issues that may arise, especially considering the oom reaper 
> is not even upstream.
> 

We already know there is a flaw. For example,

	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}

in out_of_memory() omits sending SIGKILL to processes sharing the same
memory when the current process has already received SIGKILL (a SIGKILL
that was not sent by oom_kill_process()) or the current thread is exiting
normally, which can result in the problems that "Kill all user processes
sharing victim->mm in other thread groups, if any." tried to avoid. And the
OOM reaper does not help, because it does not know whether it is safe to
reap memory used by the current thread.

I think we should decide how to manage (or paper over) such corner cases
before we get the OOM reaper merged. I'm OK with a combination of your
global access to memory reserves and my OOM killer re-enabling.

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-20 14:36                         ` Tetsuo Handa
@ 2016-01-20 23:49                           ` David Rientjes
  -1 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-20 23:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Wed, 20 Jan 2016, Tetsuo Handa wrote:

> > > My goal is to ask the OOM killer not to toss the OOM killer's duty away.
> > > What is important for me is that the OOM killer takes next action when
> > > current action did not solve the OOM situation.
> > > 
> > 
> > What is the "next action" when there are no more processes on your system, 
> 
> Just call panic(), as with select_bad_process() from out_of_memory() returned
> NULL.
> 

No way is that a possible solution for a system-wide oom condition.  We 
could have megabytes of memory available in memory reserves, and a simple 
allocation succeeding could fix the livelock quite easily (as can be 
demonstrated with my testcase).  A panic is never better than allowing an 
allocation to succeed through the use of available memory reserves.

For the memcg case, we wouldn't panic() when there are no more killable 
processes, and this livelock problem can easily be exhibited in memcg 
hierarchy oom conditions as well (quite easily, in fact, since it's in 
isolation and doesn't get interfered with by external process freeing 
elsewhere on the system).  So, again, your approach offers no solution to 
this case, and you presumably suggest that we should leave the hierarchy 
livelocked forever.  Again, not a possible solution.

> If we can agree on combining both approaches, I'm OK with it. That will keep
> the OOM reaper simple, for the OOM reaper will not need to clear TIF_MEMDIE
> flag which is unfriendly for wait_event() in oom_killer_disable(), and the
> OOM reaper will not need to care about situations where TIF_MEMDIE flag is
> set when it is not safe to reap.
> 

Please allow us to review and merge the oom reaper first, and then 
evaluate the problem.

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-20 23:49                           ` David Rientjes
@ 2016-01-21 11:44                             ` Tetsuo Handa
  0 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-21 11:44 UTC (permalink / raw)
  To: rientjes
  Cc: hannes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

David Rientjes wrote:
> On Wed, 20 Jan 2016, Tetsuo Handa wrote:
> 
> > > > My goal is to ask the OOM killer not to toss the OOM killer's duty away.
> > > > What is important for me is that the OOM killer takes next action when
> > > > current action did not solve the OOM situation.
> > > > 
> > > 
> > > What is the "next action" when there are no more processes on your system, 
> > 
> > Just call panic(), as with select_bad_process() from out_of_memory() returned
> > NULL.
> > 
> 
> No way is that a possible solution for a system-wide oom condition.  We 
> could have megabytes of memory available in memory reserves and a simple 
> allocation succeeding could fix the livelock quite easily (and can be 
> demonstrated with my testcase).  A panic is never better than allowing an 
> allocation to succeed through the use of available memory reserves.
> 

While it seems to me that you are mainly interested in memcg OOM events,
I'm interested only in system-wide OOM events. I'm not using memcg, and my
patches target the handling of system-wide OOM events.

I consider the phases for managing system-wide OOM events to be as follows.

  (1) Design and use a system with appropriate memory capacity in mind.

  (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
      an OOM victim and allows that victim access to memory reserves by
      setting TIF_MEMDIE on it.

  (3) When (2) did not solve the OOM condition, start allowing all tasks
      access to memory reserves by your approach.

  (4) When (3) did not solve the OOM condition, start selecting more OOM
      victims by my approach.

  (5) When (4) did not solve the OOM condition, trigger the kernel panic.

Introducing the OOM reaper increases the possibility of solving the OOM
condition at (2), provided the OOM reaper can reap the OOM victim's memory.
But when the OOM reaper does not help, it's time to fall back to (3).

Your approach will open the memory reserves. Therefore, when the memory
reserves are depleted, it's time to fall back to (4).

My approach will choose the next OOM victim, let the system return from (4)
to (2), and allow the OOM reaper to reap the next victim's memory. Therefore,
when there are no more OOM-killable processes, it's time to fall back to (5).
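As a rough illustration, this escalation order can be modeled in plain C.
This is a userspace sketch only; the names oom_phase and oom_escalate are
hypothetical and exist nowhere in the kernel:

```c
/* Userspace sketch of the escalation order discussed above.
 * All identifiers are illustrative; none of them exist in the kernel. */
enum oom_phase {
	OOM_KILL_VICTIM = 2,	/* (2) kill one victim, give it TIF_MEMDIE */
	OOM_OPEN_RESERVES,	/* (3) open memory reserves to all tasks   */
	OOM_KILL_MORE,		/* (4) select additional OOM victims       */
	OOM_PANIC,		/* (5) no more options: panic              */
};

/* Advance one phase, but only while the previous phase failed to solve
 * the OOM condition, and never past the final phase. */
enum oom_phase oom_escalate(enum oom_phase phase, int condition_solved)
{
	if (condition_solved || phase == OOM_PANIC)
		return phase;
	return phase + 1;
}
```

The point of the sketch is simply that each phase is attempted only after
the previous one has demonstrably failed, which is the ordering being
proposed here.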

I agree that we might have megabytes of memory available in memory reserves
and a simple allocation succeeding might solve the OOM condition. I posted a
patch ( http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp )
for that reason.

But when the system arrives at (5), there will be no memory available in the
memory reserves, because the path to (5) passes through (3). Thus, triggering
a kernel panic should be OK.

> For the memcg case, we wouldn't panic() when there are no more killable 
> processes, and this livelock problem can easily be exhibited in memcg 
> hierarchy oom conditions as well (and quite easier since it's in 
> isolation and doesn't get interfered with by external process freeing 
> elsewhere on the system).  So, again, your approach offers no solution to 
> this case and you presumably suggest that we should leave the hierarchy 
> livelocked forever.  Again, not a possible solution.
> 

I don't know how to trigger the memcg OOM livelock problem after killing all
(i.e. both OOM-killable and OOM-unkillable) tasks in a memcg. If only
OOM-unkillable tasks remain in a memcg after killing all of its OOM-killable
tasks, it's time for the administrator to manually send SIGKILL or loosen
that memcg's quota. Unless the administrator encounters a system-wide OOM
event while trying to do so, I don't think it is a problem.

Just leave that memcg hierarchy livelocked forever for now. I'm talking about
managing system-wide OOM events now.

> > If we can agree on combining both approaches, I'm OK with it. That will keep
> > the OOM reaper simple, for the OOM reaper will not need to clear TIF_MEMDIE
> > flag which is unfriendly for wait_event() in oom_killer_disable(), and the
> > OOM reaper will not need to care about situations where TIF_MEMDIE flag is
> > set when it is not safe to reap.
> > 
> 
> Please, allow us to review and get the oom reaper merged first and then 
> evaluate the problem afterwards.
> 

Best is that we never need to invoke the OOM killer. Next best is that the
OOM killer solves the OOM condition; next, that the OOM reaper solves it.
Worst is that we need to trigger a kernel panic. Next worst is that we need
to kill all processes; next worst, that the OOM killer needs to kill all
OOM-killable processes.

We are currently violating Linux users' expectation that "the OOM killer
resolves the OOM condition". I don't want to violate it again by advertising
the OOM reaper as "a reliable last resort for resolving the OOM condition".

Why don't we start by establishing an evacuation route for the worst case
(i.e. making sure that the OOM killer can choose a !TIF_MEMDIE process)
before we make the kernel's worse cases even harder to test?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-21 11:44                             ` Tetsuo Handa
@ 2016-01-21 23:15                               ` David Rientjes
  0 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-21 23:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Thu, 21 Jan 2016, Tetsuo Handa wrote:

> I consider phases for managing system-wide OOM events as follows.
> 
>   (1) Design and use a system with appropriate memory capacity in mind.
> 
>   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
>       an OOM victim and allow that victim access to memory reserves by
>       setting TIF_MEMDIE to it.
> 
>   (3) When (2) did not solve the OOM condition, start allowing all tasks
>       access to memory reserves by your approach.
> 
>   (4) When (3) did not solve the OOM condition, start selecting more OOM
>       victims by my approach.
> 
>   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> 

This was all mentioned previously, and I suggested that the panic only 
occur when memory reserves have been depleted, otherwise there is still 
the potential for the livelock to be solved.  That is a patch that would 
apply today, before any of this work, since we never want to loop 
endlessly in the page allocator when memory reserves are fully depleted.
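A minimal sketch of that policy, assuming a hypothetical reserve_free_pages
count (the real check would live in the page allocator slowpath, and no such
function exists in the kernel):

```c
/* Sketch of "panic only once reserves are depleted": keep looping while
 * any reserve memory remains, since a single free elsewhere may still
 * break the livelock. reserve_free_pages is a stand-in for a real
 * watermark query, not an existing kernel interface. */
enum oom_action { OOM_KEEP_LOOPING, OOM_DO_PANIC };

enum oom_action oom_reserve_policy(unsigned long reserve_free_pages)
{
	if (reserve_free_pages > 0)
		return OOM_KEEP_LOOPING; /* livelock may still resolve */
	return OOM_DO_PANIC;	/* fully depleted: no forward progress */
}
```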

This is all really quite simple.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-21 23:15                               ` David Rientjes
@ 2016-01-22 13:59                                 ` Tetsuo Handa
  0 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-01-22 13:59 UTC (permalink / raw)
  To: rientjes, hannes, mhocko
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

David Rientjes wrote:
> On Thu, 21 Jan 2016, Tetsuo Handa wrote:
> 
> > I consider phases for managing system-wide OOM events as follows.
> > 
> >   (1) Design and use a system with appropriate memory capacity in mind.
> > 
> >   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
> >       an OOM victim and allow that victim access to memory reserves by
> >       setting TIF_MEMDIE to it.
> > 
> >   (3) When (2) did not solve the OOM condition, start allowing all tasks
> >       access to memory reserves by your approach.
> > 
> >   (4) When (3) did not solve the OOM condition, start selecting more OOM
> >       victims by my approach.
> > 
> >   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> > 
> 
> This was all mentioned previously, and I suggested that the panic only 
> occur when memory reserves have been depleted, otherwise there is still 
> the potential for the livelock to be solved.  That is a patch that would 
> apply today, before any of this work, since we never want to loop 
> endlessly in the page allocator when memory reserves are fully depleted.
> 
> This is all really quite simple.
> 

So, David is OK with the above approach, right?
Then, Michal and Johannes, are you OK with it?



What I'm not sure about in the above approach is the handling of
!__GFP_NOFAIL && !__GFP_FS allocation requests and the use of
ALLOC_NO_WATERMARKS without TIF_MEMDIE.

Basically, we want small allocation requests to succeed unless __GFP_NORETRY
is given. Currently such allocation requests do not fail unless TIF_MEMDIE is
given by the OOM killer. But how hard do we want to keep looping once we
reach (3) because the timeout for waiting for the TIF_MEMDIE task at (2) has
expired?

Should we give up waiting for the TIF_MEMDIE task and make !__GFP_NOFAIL
allocation requests fail (as with an OOM condition after oom_killer_disable()
has been called)? If our answer is "yes", there is no need to open the
memory reserves. Therefore, I guess our answer is "no".

Now, we open the memory reserves at (3). Since !__GFP_NOFAIL && !__GFP_FS
allocation requests currently do not call out_of_memory(), the current
version of "mm, oom: add global access to memory reserves on livelock" does
not allow such allocation requests access to memory reserves on OOM livelock.

If the cause of the OOM livelock is that an OOM victim is waiting for a lock
held by somebody else who is performing !__GFP_NOFAIL && !__GFP_FS allocation
requests, we will fully deplete the memory reserves, because only
__GFP_NOFAIL || __GFP_FS allocation requests (e.g. page faults by memory hog
processes) can access them. How do we want to handle this case?

Should we allow !__GFP_NOFAIL && !__GFP_FS allocation requests access to
memory reserves, by allowing them to call out_of_memory(), in order to avoid
needlessly depleting the reserves? Or do we not care at all, because we can
reach (4) anyway?

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6463426..2299374 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2743,16 +2743,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
----------
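Before a change like the diff above, the behavior being discussed (only
__GFP_NOFAIL or __GFP_FS requests may invoke the OOM killer) can be modeled
by a small predicate. This is a userspace sketch; the bit values are invented
for illustration and are not the kernel's:

```c
/* Illustrative GFP bits; the numeric values are made up for this sketch. */
#define __GFP_FS	0x1u
#define __GFP_NOFAIL	0x2u

/* Mirrors the behavior described above: without the diff, a request can
 * reach out_of_memory(), and hence the reserves on livelock, only if it
 * carries __GFP_FS or __GFP_NOFAIL. */
static int can_invoke_oom_killer(unsigned int gfp_mask)
{
	return (gfp_mask & (__GFP_FS | __GFP_NOFAIL)) != 0;
}
```

Applying the diff corresponds to making this predicate return true for
!__GFP_FS requests as well.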

Regardless of our answer, we need to decide whether to continue looping, for
there is no guarantee that the memory reserves are sufficient to solve the
OOM livelock.

Should we give up waiting for the TIF_MEMDIE task and make !__GFP_NOFAIL
allocation requests fail (as if TIF_MEMDIE had already been given by the OOM
killer, because we used ALLOC_NO_WATERMARKS)?
If our answer is "yes", there is no need to choose the next OOM victim.
Therefore, I guess our answer is "no".

Regardless of our answer, we need to prepare for reaching (4), for the
request might be a __GFP_NOFAIL allocation. What is the requirement for
choosing the next OOM victim at (4)? That an allocating task sees a
TIF_MEMDIE task again after get_page_from_freelist(ALLOC_NO_WATERMARKS)
failed and the timeout for waiting for that task at (2) has expired? Then it
would kill all tasks almost immediately, because reaping an OOM victim's
memory takes some time. We will want to check another timeout here.
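That extra timeout could be tracked per victim, roughly as follows. This is
a userspace sketch; the struct and function names are hypothetical:

```c
/* Record when the current victim received TIF_MEMDIE, so that expiry of
 * a wait timeout, not the mere sight of a TIF_MEMDIE task, is what
 * triggers selecting the next victim. Timestamps are jiffies-like longs. */
struct victim_timer {
	long granted_at;	/* when TIF_MEMDIE was set on the victim */
	long wait_timeout;	/* grace period before choosing another  */
};

static int should_select_next_victim(const struct victim_timer *t, long now)
{
	return now - t->granted_at >= t->wait_timeout;
}
```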



Finally, we will automatically reach (5) after all OOM-killable tasks have
been chosen as OOM victims at (4). Here is just an idea for (5): if we change
the OOM killer not to call panic() when there are no more OOM-killable tasks,
we do not have to give up immediately after TIF_MEMDIE was given by the OOM
killer. This will increase the possibility that small allocation requests by
TIF_MEMDIE tasks succeed, for reaching (5) is an almost impossibly unlikely
case.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6ebc0351..de22c44 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -884,8 +884,10 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	p = select_bad_process(oc, &points, totalpages);
-	/* Found nothing?!?! Either we hang forever, or we panic. */
+	/* Found nothing?!?! Either we fail the allocation, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
+		if (!(oc->gfp_mask & __GFP_NOFAIL))
+			return false;
 		dump_header(oc, NULL, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6463426..798fd68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3181,10 +3171,6 @@ retry:
 		goto nopage;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
-		goto nopage;
-
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
----------

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-22 13:59                                 ` Tetsuo Handa
@ 2016-01-22 14:57                                   ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2016-01-22 14:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri, Jan 22, 2016 at 10:59:10PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Thu, 21 Jan 2016, Tetsuo Handa wrote:
> > 
> > > I consider phases for managing system-wide OOM events as follows.
> > > 
> > >   (1) Design and use a system with appropriate memory capacity in mind.
> > > 
> > >   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
> > >       an OOM victim and allow that victim access to memory reserves by
> > >       setting TIF_MEMDIE to it.
> > > 
> > >   (3) When (2) did not solve the OOM condition, start allowing all tasks
> > >       access to memory reserves by your approach.
> > > 
> > >   (4) When (3) did not solve the OOM condition, start selecting more OOM
> > >       victims by my approach.
> > > 
> > >   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> > > 
> > 
> > This was all mentioned previously, and I suggested that the panic only 
> > occur when memory reserves have been depleted, otherwise there is still 
> > the potential for the livelock to be solved.  That is a patch that would 
> > apply today, before any of this work, since we never want to loop 
> > endlessly in the page allocator when memory reserves are fully depleted.
> > 
> > This is all really quite simple.
> 
> So, David is OK with above approach, right?
> Then, Michal and Johannes, are you OK with above approach?

Yes, that order of events sounds reasonable to me. Personally, I'm not
entirely sure whether it's better to give out the last reserves to the
allocating task or subsequent OOM victims, but it's likely not even
that important. The most important part is to guarantee a predictable
and reasonable decision time.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
@ 2016-01-22 14:57                                   ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2016-01-22 14:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri, Jan 22, 2016 at 10:59:10PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Thu, 21 Jan 2016, Tetsuo Handa wrote:
> > 
> > > I consider phases for managing system-wide OOM events as follows.
> > > 
> > >   (1) Design and use a system with appropriate memory capacity in mind.
> > > 
> > >   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
> > >       an OOM victim and allow that victim access to memory reserves by
> > >       setting TIF_MEMDIE to it.
> > > 
> > >   (3) When (2) did not solve the OOM condition, start allowing all tasks
> > >       access to memory reserves by your approach.
> > > 
> > >   (4) When (3) did not solve the OOM condition, start selecting more OOM
> > >       victims by my approach.
> > > 
> > >   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> > > 
> > 
> > This was all mentioned previously, and I suggested that the panic only 
> > occur when memory reserves have been depleted, otherwise there is still 
> > the potential for the livelock to be solved.  That is a patch that would 
> > apply today, before any of this work, since we never want to loop 
> > endlessly in the page allocator when memory reserves are fully depleted.
> > 
> > This is all really quite simple.
> 
> So, David is OK with above approach, right?
> Then, Michal and Johannes, are you OK with above approach?

Yes, that order of events sounds reasonable to me. Personally, I'm not
entirely sure whether it's better to give out the last reserves to the
allocating task or subsequent OOM victims, but it's likely not even
that important. The most important part is to guarantee a predictable
and reasonable decision time.
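For illustration only, the five-phase escalation order agreed on above can be modeled as a small userspace state machine. All identifiers here (oom_phase, next_oom_phase) are hypothetical and do not exist in the kernel; this is a sketch of the decision order, not the implementation:

```c
#include <stdbool.h>

/* Illustrative model of the escalation order discussed in this
 * thread; the enum and function names are hypothetical, not kernel
 * identifiers. */
enum oom_phase {
	OOM_PHASE_CAPACITY,	/* (1) adequate memory by system design  */
	OOM_PHASE_KILL_VICTIM,	/* (2) pick a victim, grant TIF_MEMDIE   */
	OOM_PHASE_OPEN_RESERVES,/* (3) let all tasks use memory reserves */
	OOM_PHASE_KILL_MORE,	/* (4) select additional OOM victims     */
	OOM_PHASE_PANIC,	/* (5) give up: trigger a kernel panic   */
};

/* Escalate to the next phase only when the current one failed to
 * resolve the OOM condition; panic is terminal. */
enum oom_phase next_oom_phase(enum oom_phase p, bool oom_resolved)
{
	if (oom_resolved || p == OOM_PHASE_PANIC)
		return p;
	return p + 1;
}
```

The point Johannes stresses is that each transition must happen in predictable, bounded time, which is what the proposed timers provide.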

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH] mm,oom: Re-enable OOM killer using timers.
  2016-01-22 13:59                                 ` Tetsuo Handa
@ 2016-01-26 23:44                                   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: David Rientjes @ 2016-01-26 23:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri, 22 Jan 2016, Tetsuo Handa wrote:

> > >   (1) Design and use a system with appropriate memory capacity in mind.
> > > 
> > >   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
> > >       an OOM victim and allows that victim access to memory reserves by
> > >       setting TIF_MEMDIE on it.
> > > 
> > >   (3) When (2) did not solve the OOM condition, start allowing all tasks
> > >       access to memory reserves by your approach.
> > > 
> > >   (4) When (3) did not solve the OOM condition, start selecting more OOM
> > >       victims by my approach.
> > > 
> > >   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> > > 
> > 
> > This was all mentioned previously, and I suggested that the panic only 
> > occur when memory reserves have been depleted, otherwise there is still 
> > the potential for the livelock to be solved.  That is a patch that would 
> > apply today, before any of this work, since we never want to loop 
> > endlessly in the page allocator when memory reserves are fully depleted.
> > 
> > This is all really quite simple.
> > 
> 
> So, David is OK with the above approach, right?
> Then, Michal and Johannes, are you OK with the above approach?
> 

The first step before implementing access to memory reserves on livelock 
(my patch) and oom killing additional processes on livelock (your patch) 
is to detect the appropriate place to panic() when reserves are depleted.

This has historically been done in the oom killer when there are no oom 
killable processes left.  That's easy to figure out and should still be 
done, but we are now introducing the possibility of memory reserves being 
fully depleted while there are oom killable processes left or victims that
cannot exit.

So we need a patch to the page allocator that would be applicable today 
before any of the above is worked on to detect when reserves are depleted 
and panic() rather than loop forever in the page allocator.  I'd suggest 
that this work be done as a follow-up to Michal's patchset to rework the 
page allocator retry logic.

It's not entirely trivial because we want to detect situations when 
high-order < PAGE_ALLOC_COSTLY_ORDER allocations are looping forever and 
we are failing due to fragmentation as well.  If all cpus are looping 
trying to allocate a task_struct, and there are eligible zones with some 
free memory but it is not allocatable, we still want to panic().
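As a rough userspace sketch of the check being proposed (all names and the struct are hypothetical; in the kernel these facts would come from watermark checks and the free lists in the allocator's retry path):

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* same value as the kernel's */

/* Hypothetical summary of the allocation's situation. */
struct alloc_state {
	unsigned int order;		/* allocation order requested     */
	bool reserves_depleted;		/* nothing usable even with all
					 * watermarks ignored             */
	bool order_block_available;	/* a free block of this order
					 * exists or could be formed      */
};

/* Sketch of the proposed check: panic instead of looping forever when
 * a low-order request can never succeed, either because reserves are
 * fully depleted or because free memory exists but fragmentation
 * makes it unusable at this order. */
bool should_panic(const struct alloc_state *s)
{
	if (s->order >= PAGE_ALLOC_COSTLY_ORDER)
		return false;	/* costly allocations may simply fail */
	if (s->reserves_depleted)
		return true;
	return !s->order_block_available;
}
```

The hard part David alludes to is the last case: distinguishing "fragmented now, but compaction may help" from "can never be satisfied" is what makes the real detection non-trivial.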

> What I'm not sure about in the above approach is the handling of
> !__GFP_NOFAIL && !__GFP_FS allocation requests and the use of
> ALLOC_NO_WATERMARKS without TIF_MEMDIE.
> 
> Basically, we want to make small allocation requests succeed unless
> __GFP_NORETRY is given. Currently such allocation requests do not fail
> unless TIF_MEMDIE is set by the OOM killer. But how hard do we want to
> keep looping once we reach (3) because the timeout for waiting for the
> TIF_MEMDIE task at (2) has expired?
> 

In my patch, that is tunable by the user with a new sysctl that defines 
when the oom killer is considered livelocked because the victim cannot 
exit.  I think we'd set *did_some_progress = 1 for !__GFP_FS, as is done 
today, before this expiration happens, and otherwise trigger the oom 
killer livelock detection in my patch to allow the allocation to succeed 
with ALLOC_NO_WATERMARKS.
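The behavior described here can be sketched as a small decision helper. This is a userspace model under stated assumptions: the sysctl name, its default, and the function are hypothetical stand-ins for whatever the patch actually defines:

```c
#include <stdbool.h>

/* Hypothetical sysctl (milliseconds); the real knob name and default
 * are whatever the patch ends up defining. */
static unsigned long sysctl_oom_victim_wait_ms = 5000;

enum oom_retry_action {
	OOM_RETRY,		/* keep looping in the page allocator  */
	OOM_USE_NO_WATERMARKS,	/* livelock assumed: dip into reserves */
};

/* Sketch of the decision above: until the victim-wait timeout
 * expires, a !__GFP_FS request reports progress (as today) so the
 * allocation keeps looping; once it expires, the OOM killer is
 * considered livelocked and ALLOC_NO_WATERMARKS is granted. */
enum oom_retry_action oom_retry_decide(bool gfp_fs, unsigned long waited_ms,
				       int *did_some_progress)
{
	if (waited_ms >= sysctl_oom_victim_wait_ms)
		return OOM_USE_NO_WATERMARKS;
	if (!gfp_fs)
		*did_some_progress = 1;
	return OOM_RETRY;
}
```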



end of thread, other threads:[~2016-01-26 23:44 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-07 11:26 [PATCH] mm,oom: Re-enable OOM killer using timers Tetsuo Handa
2016-01-13  1:36 ` David Rientjes
2016-01-13 12:11   ` Tetsuo Handa
2016-01-13 16:26     ` Michal Hocko
2016-01-13 16:56       ` Johannes Weiner
2016-01-13 18:01         ` Michal Hocko
2016-01-14 11:26           ` Tetsuo Handa
2016-01-14 22:01             ` David Rientjes
2016-01-14 22:58               ` Johannes Weiner
2016-01-14 23:09                 ` David Rientjes
2016-01-15 10:36                   ` Tetsuo Handa
2016-01-19 23:13                     ` David Rientjes
2016-01-20 14:36                       ` Tetsuo Handa
2016-01-20 23:49                         ` David Rientjes
2016-01-21 11:44                           ` Tetsuo Handa
2016-01-21 23:15                             ` David Rientjes
2016-01-22 13:59                               ` Tetsuo Handa
2016-01-22 14:57                                 ` Johannes Weiner
2016-01-26 23:44                                 ` David Rientjes
