linux-mm.kvack.org archive mirror
* [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
@ 2015-09-17 17:59 Kyle Walker
  2015-09-17 19:22 ` Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Kyle Walker @ 2015-09-17 17:59 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, hannes, vdavydov, oleg, linux-mm, linux-kernel,
	Kyle Walker

Currently, the oom killer will attempt to kill a process that is in
TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
period of time, such as processes writing to a frozen filesystem during
a lengthy backup operation, this can result in a deadlock condition as
related processes' memory accesses will stall within the page fault
handler.

Within oom_unkillable_task(), check for processes in
TASK_UNINTERRUPTIBLE (TASK_KILLABLE sleeps are excluded by the exact
state match) so that the oom killer will move on to another task.

Signed-off-by: Kyle Walker <kwalker@redhat.com>
---
 mm/oom_kill.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..66f03f8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (memcg && !task_in_mem_cgroup(p, memcg))
 		return true;
 
+	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
+	if (p->state == TASK_UNINTERRUPTIBLE)
+		return true;
+
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
-- 
2.4.3
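
For reference, TASK_KILLABLE is a composite state, so the exact-equality
test in the hunk above leaves killable sleeps eligible for the OOM killer
(constants as in include/linux/sched.h of kernels of that era):

	#define TASK_UNINTERRUPTIBLE	2
	#define TASK_WAKEKILL		128
	#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

	/*
	 * A task sleeping via wait_event_killable() has
	 * state == TASK_WAKEKILL | TASK_UNINTERRUPTIBLE, so the
	 * "p->state == TASK_UNINTERRUPTIBLE" comparison matches
	 * plain "D"-state sleeps only.
	 */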


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-17 17:59 [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks Kyle Walker
@ 2015-09-17 19:22 ` Oleg Nesterov
  2015-09-18 15:41   ` Christoph Lameter
  2015-09-19  8:22 ` Michal Hocko
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
  2 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-17 19:22 UTC (permalink / raw)
  To: Kyle Walker
  Cc: akpm, mhocko, rientjes, hannes, vdavydov, linux-mm, linux-kernel,
	Tetsuo Handa, Stanislav Kozina

Add cc's.

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes' memory accesses will stall within the page fault
> handler.
>
> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE sleeps are excluded by the exact
> state match) so that the oom killer will move on to another task.
>
> Signed-off-by: Kyle Walker <kwalker@redhat.com>
> ---
>  mm/oom_kill.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (memcg && !task_in_mem_cgroup(p, memcg))
>  		return true;
>
> +	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> +	if (p->state == TASK_UNINTERRUPTIBLE)
> +		return true;
> +

So we can skip a memory hog which, say, sleeps in mutex_lock(). And this
can't help if the task is multithreaded: unless all of its sub-threads are
in "D" state too, the oom killer will pick another thread with the same
->mm. Plus other problems.

But yes, such a deadlock is possible. I would really like to see comments
from the maintainers. In particular, I seem to recall that someone suggested
trying to kill another !TIF_MEMDIE process after a timeout; perhaps this is
what we should actually do...

Oleg.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-17 19:22 ` Oleg Nesterov
@ 2015-09-18 15:41   ` Christoph Lameter
  2015-09-18 16:24     ` Oleg Nesterov
  2015-09-19  8:25     ` Michal Hocko
  0 siblings, 2 replies; 109+ messages in thread
From: Christoph Lameter @ 2015-09-18 15:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, Tetsuo Handa, Stanislav Kozina

> But yes, such a deadlock is possible. I would really like to see comments
> from the maintainers. In particular, I seem to recall that someone suggested
> trying to kill another !TIF_MEMDIE process after a timeout; perhaps this is
> what we should actually do...

Well, yes, here is a patch that allows killing another process even when a
memdie process already exists, but there is some risk with such an approach
of overusing the reserves.


Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already has
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted, but on the other hand this also allows
only one process to be oom-killed at a time. That process may be stuck
in D state.

The patch simply removes the aborting of the scan so that other processes
may be killed if one is stuck in D state.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
+++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
@@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
+		if (unlikely(frozen(task)))
+			__thaw_task(task);
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 15:41   ` Christoph Lameter
@ 2015-09-18 16:24     ` Oleg Nesterov
  2015-09-18 16:39       ` Tetsuo Handa
  2015-09-18 17:00       ` Christoph Lameter
  2015-09-19  8:25     ` Michal Hocko
  1 sibling, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-18 16:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, Tetsuo Handa, Stanislav Kozina

On 09/18, Christoph Lameter wrote:
>
> > But yes, such a deadlock is possible. I would really like to see comments
> > from the maintainers. In particular, I seem to recall that someone suggested
> > trying to kill another !TIF_MEMDIE process after a timeout; perhaps this is
> > what we should actually do...
>
> Well, yes, here is a patch that allows killing another process even when a
> memdie process already exists, but there is some risk with such an approach
> of overusing the reserves.

Yes, I understand it is not that simple. And probably this is all I can
understand ;)

> --- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
> @@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);

To simplify the discussion let's ignore PF_FROZEN; this is another issue.

I am not sure this change is enough; we need to ensure that
select_bad_process() won't pick the same task (or its sub-thread) again.

And perhaps something like

	wait_event_timeout(oom_victims_wait, !oom_victims,
				configurable_timeout);

before select_bad_process() makes sense?

Oleg.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 16:24     ` Oleg Nesterov
@ 2015-09-18 16:39       ` Tetsuo Handa
  2015-09-18 16:54         ` Oleg Nesterov
  2015-09-18 17:00       ` Christoph Lameter
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-18 16:39 UTC (permalink / raw)
  To: oleg, cl
  Cc: kwalker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Oleg Nesterov wrote:
> To simplify the discussion let's ignore PF_FROZEN; this is another issue.
> 
> I am not sure this change is enough; we need to ensure that
> select_bad_process() won't pick the same task (or its sub-thread) again.

SysRq-f is sometimes unusable because it continues choosing the same thread.
oom_kill_process() should not choose a thread which already has TIF_MEMDIE.
I think we need to rewrite oom_kill_process().

> 
> And perhaps something like
> 
> 	wait_event_timeout(oom_victims_wait, !oom_victims,
> 				configurable_timeout);
> 
> before select_bad_process() makes sense?

I think you should not sleep for long with the oom_lock mutex held.
http://marc.info/?l=linux-mm&m=143031212312459
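
A sketch of how the wait could be done without holding it, using the
oom_victims counter and oom_victims_wait waitqueue that already exist in
mm/oom_kill.c (the timeout itself is a hypothetical tunable):

	/* Illustrative only: drop oom_lock around the wait. */
	mutex_unlock(&oom_lock);
	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims),
			   oom_retry_timeout);	/* hypothetical tunable */
	mutex_lock(&oom_lock);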


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 16:39       ` Tetsuo Handa
@ 2015-09-18 16:54         ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-18 16:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: cl, kwalker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

On 09/19, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > To simplify the discussion let's ignore PF_FROZEN; this is another issue.
> >
> > I am not sure this change is enough; we need to ensure that
> > select_bad_process() won't pick the same task (or its sub-thread) again.
>
> SysRq-f is sometimes unusable because it continues choosing the same thread.
> oom_kill_process() should not choose a thread which already has TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > 	wait_event_timeout(oom_victims_wait, !oom_victims,
> > 				configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with the oom_lock mutex held.
> http://marc.info/?l=linux-mm&m=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 16:24     ` Oleg Nesterov
  2015-09-18 16:39       ` Tetsuo Handa
@ 2015-09-18 17:00       ` Christoph Lameter
  2015-09-18 19:07         ` Oleg Nesterov
                           ` (2 more replies)
  1 sibling, 3 replies; 109+ messages in thread
From: Christoph Lameter @ 2015-09-18 17:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> To simplify the discussion let's ignore PF_FROZEN; this is another issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already has
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted, but on the other hand this also allows
only one process to be oom-killed at a time. That process may be stuck
in D state.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
-	}
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		return OOM_SCAN_CONTINUE;
+
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 17:00       ` Christoph Lameter
@ 2015-09-18 19:07         ` Oleg Nesterov
  2015-09-18 19:19           ` Christoph Lameter
  2015-09-19  8:32         ` Michal Hocko
  2015-09-21 23:27         ` David Rientjes
  2 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-18 19:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, Tetsuo Handa, Stanislav Kozina

On 09/18, Christoph Lameter wrote:
>
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand the oom killer.

But I still think this is not enough, and we need some (configurable?)
timeout before we pick another victim...


And btw, yes, this is a bit off-topic, but I think another change makes
sense too. We should report the fact that we are going to kill another task
because the previous victim refuses to die, and print its stack trace.

Oleg.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 19:07         ` Oleg Nesterov
@ 2015-09-18 19:19           ` Christoph Lameter
  2015-09-18 21:28             ` Kyle Walker
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Lameter @ 2015-09-18 19:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> And btw, yes, this is a bit off-topic, but I think another change makes
> sense too. We should report the fact that we are going to kill another task
> because the previous victim refuses to die, and print its stack trace.

What happens is that the previous victim did not enter exit processing. If
it had, it would have been excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. That's why I thought it may be safe to go this way.

An issue could result from another process being terminated and the first
victim finally reacting to the signal and also beginning termination. Then
we have contention on the reserves.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 19:19           ` Christoph Lameter
@ 2015-09-18 21:28             ` Kyle Walker
  2015-09-18 22:07               ` Christoph Lameter
  0 siblings, 1 reply; 109+ messages in thread
From: Kyle Walker @ 2015-09-18 21:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Oleg Nesterov, akpm, mhocko, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

> On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> > And btw, yes, this is a bit off-topic, but I think another change makes
> > sense too. We should report the fact that we are going to kill another task
> > because the previous victim refuses to die, and print its stack trace.

Thank you for the review and feedback! I think that would be a nice
touch, and I would definitely throw my hat in as wanting the above, but
in the interests of keeping things as simple as possible, I kept myself
out of that level of change.

> What happens is that the previous victim did not enter exit processing. If
> it had, it would have been excluded by other checks. The first victim never
> reacted and never started using the memory resources available for
> exiting. That's why I thought it may be safe to go this way.
>
> An issue could result from another process being terminated and the first
> victim finally reacting to the signal and also beginning termination. Then
> we have contention on the reserves.
>

I do like the idea of not stalling completely in an oom just because the
first attempt didn't go so well. Is there any possibility of simply having
our cake and eating it too? Specifically, omitting TASK_UNINTERRUPTIBLE
tasks as low-hanging fruit and allowing the oom killer to continue in the
event that the first attempt stalls?

Just a thought.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 21:28             ` Kyle Walker
@ 2015-09-18 22:07               ` Christoph Lameter
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Lameter @ 2015-09-18 22:07 UTC (permalink / raw)
  To: Kyle Walker
  Cc: Oleg Nesterov, akpm, mhocko, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Kyle Walker wrote:

> I do like the idea of not stalling completely in an oom just because the
> first attempt didn't go so well. Is there any possibility of simply having
> our cake and eating it too? Specifically, omitting TASK_UNINTERRUPTIBLE
> tasks as low-hanging fruit and allowing the oom killer to continue in the
> event that the first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long and they
*should react* in a reasonable timeframe. There is an alternative API for
those cases that cannot. Typically this is a write that is stalling. If we
kill the process then it's pointless to wait on the write to complete. See

https://lwn.net/Articles/288056/

http://www.ibm.com/developerworks/library/l-task-killable/
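
The alternative API boils down to the _killable primitives; a minimal
sketch (the mutex, waitqueue, and condition names here are made up):

	/* The wait either succeeds or is broken by a fatal signal,
	 * such as the SIGKILL sent by the OOM killer. */
	if (mutex_lock_killable(&dev_mutex))	/* hypothetical mutex */
		return -EINTR;
	err = wait_event_killable(dev_wq, write_done);	/* hypothetical */
	mutex_unlock(&dev_mutex);
	if (err)
		return -EINTR;	/* err is -ERESTARTSYS here */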


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-17 17:59 [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks Kyle Walker
  2015-09-17 19:22 ` Oleg Nesterov
@ 2015-09-19  8:22 ` Michal Hocko
  2015-09-21 23:08   ` David Rientjes
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
  2 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-09-19  8:22 UTC (permalink / raw)
  To: Kyle Walker
  Cc: akpm, rientjes, hannes, vdavydov, oleg, linux-mm, linux-kernel

On Thu 17-09-15 13:59:43, Kyle Walker wrote:
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes' memory accesses will stall within the page fault
> handler.

I am not familiar with the fs freezing code so I might be missing
something important here. __sb_start_write waits for the frozen fs via
wait_event, which is an uninterruptible sleep. Why can't we sleep here
interruptibly and return EINTR when interrupted? I would consider this
a better behavior, and not only because of OOM: having unkillable
tasks in general is undesirable. AFAIU the fs might be frozen forever
and the admin cannot do anything about the pending processes.
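
A sketch of the killable variant of that idea, against the sb_writers
layout of kernels of that era (untested, and callers of __sb_start_write()
would have to learn to handle the new error):

	/* Sketch: make the frozen-fs wait killable so that a fatal
	 * signal, e.g. the OOM killer's SIGKILL, can break it. */
	if (wait_event_killable(sb->s_writers.wait_unfrozen,
				sb->s_writers.frozen < level))
		return -EINTR;	/* fatal signal pending */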

> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE sleeps are excluded by the exact
> state match) so that the oom killer will move on to another task.

Nack to this. TASK_UNINTERRUPTIBLE should be a time-constrained/bounded
state. Using it as an oom victim criterion makes the victim selection
less deterministic, which is undesirable. As much as I am aware of
potential issues with the current implementation, making the behavior
more random doesn't really help.

> Signed-off-by: Kyle Walker <kwalker@redhat.com>
> ---
>  mm/oom_kill.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (memcg && !task_in_mem_cgroup(p, memcg))
>  		return true;
>  
> +	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> +	if (p->state == TASK_UNINTERRUPTIBLE)
> +		return true;
> +
>  	/* p may not have freeable memory in nodemask */
>  	if (!has_intersects_mems_allowed(p, nodemask))
>  		return true;
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 15:41   ` Christoph Lameter
  2015-09-18 16:24     ` Oleg Nesterov
@ 2015-09-19  8:25     ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-09-19  8:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Oleg Nesterov, Kyle Walker, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri 18-09-15 10:41:09, Christoph Lameter wrote:
[...]
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);

TIF_MEMDIE processes will get thawed automatically and then cannot be
frozen again. Have a look at mark_oom_victim (sketched below).

>  	}
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
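
For reference, mark_oom_victim() of that era looks roughly like this
(abbreviated):

	void mark_oom_victim(struct task_struct *tsk)
	{
		WARN_ON(oom_killer_disabled);
		/* OOM killer might race with memcg OOM */
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;
		/*
		 * Wake the victim even if it is frozen;
		 * freezing_slow_path() then keeps TIF_MEMDIE tasks
		 * from being frozen again.
		 */
		__thaw_task(tsk);
		atomic_inc(&oom_victims);
	}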

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 17:00       ` Christoph Lameter
  2015-09-18 19:07         ` Oleg Nesterov
@ 2015-09-19  8:32         ` Michal Hocko
  2015-09-19 14:33           ` Tetsuo Handa
  2015-09-19 14:44           ` Oleg Nesterov
  2015-09-21 23:27         ` David Rientjes
  2 siblings, 2 replies; 109+ messages in thread
From: Michal Hocko @ 2015-09-19  8:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Oleg Nesterov, Kyle Walker, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri 18-09-15 12:00:59, Christoph Lameter wrote:
[...]
> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already has
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted, but on the other hand this also allows
> only one process to be oom-killed at a time. That process may be stuck
> in D state.

This has been posted in various forms many times over the past years. I
still do not think this is the right approach to dealing with the problem.
You can quickly deplete memory reserves this way without making further
progress (I am afraid you can even trigger this from userspace without
having big privileges) so even the administrator will have no way to
intervene.

> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/oom_kill.c
> ===================================================================
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-19  8:32         ` Michal Hocko
@ 2015-09-19 14:33           ` Tetsuo Handa
  2015-09-19 15:51             ` Michal Hocko
  2015-09-21 23:33             ` David Rientjes
  2015-09-19 14:44           ` Oleg Nesterov
  1 sibling, 2 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-19 14:33 UTC (permalink / raw)
  To: mhocko, cl
  Cc: oleg, kwalker, akpm, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Michal Hocko wrote:
> This has been posted in various forms many times over the past years. I
> still do not think this is the right approach to dealing with the problem.

I do not think the "GFP_NOFS can fail" patch is the right approach because
that patch easily causes messages like those below.

  Buffer I/O error on dev sda1, logical block 34661831, lost async page write
  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages but the OOM stall remains anyway.

I believe choosing more OOM victims is the only way to solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even the administrator will have no way to
> intervene.

I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
OOM victim task as soon as possible, but it turned out that it will not
work if there is an invisible lock dependency. Therefore, why not give up
the "there should be only up to 1 TIF_MEMDIE task" rule?

What this patch (and many others posted in various forms many times over
past years) does is to give up the "there should be only up to 1 TIF_MEMDIE
task" rule. I think that we need to tolerate more than one TIF_MEMDIE task
and somehow manage things so that the memory reserves do not deplete.

My proposal, which favors all fatal_signal_pending() tasks evenly
( http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp ),
suggests that the OOM victim task is unlikely to need all of the memory
reserves. In other words, the OOM victim task can likely make forward
progress if some amount of memory reserves is allowed (compared to normal
tasks waiting for memory).

So, I think that getting rid of the "ALLOC_NO_WATERMARKS via TIF_MEMDIE"
rule and replacing test_thread_flag(TIF_MEMDIE) with
fatal_signal_pending(current) will handle many cases, provided
fatal_signal_pending() tasks are allowed to access some amount of memory
reserves. And my proposal, which chooses the next OOM victim upon timeout,
will handle the remaining cases without depleting the memory reserves.
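
As an illustrative sketch of that direction, with ALLOC_HARDER standing
in for "some amount of memory reserves" (not a tested patch):

	/* In gfp_to_alloc_flags() (sketch): rather than granting
	 * TIF_MEMDIE tasks ALLOC_NO_WATERMARKS, give any fatally
	 * signalled task partial access to the reserves. */
	if (unlikely(fatal_signal_pending(current)))
		alloc_flags |= ALLOC_HARDER;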

If you still want to keep the "there should be only up to 1 TIF_MEMDIE task"
rule, what alternative do you have? (I do not like panic_on_oom_timeout
because it is a more data-lossy approach than choosing the next OOM victim.)


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-19  8:32         ` Michal Hocko
  2015-09-19 14:33           ` Tetsuo Handa
@ 2015-09-19 14:44           ` Oleg Nesterov
  1 sibling, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-19 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Kyle Walker, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

On 09/19, Michal Hocko wrote:
>
> This has been posted in various forms many times over the past years. I
> still do not think this is the right approach to dealing with the problem.

Agreed. But still I think it makes sense to try to kill another task
if the victim refuses to die. Yes, the details are not clear to me.

Oleg.


* can't oom-kill zap the victim's memory?
  2015-09-17 17:59 [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks Kyle Walker
  2015-09-17 19:22 ` Oleg Nesterov
  2015-09-19  8:22 ` Michal Hocko
@ 2015-09-19 15:03 ` Oleg Nesterov
  2015-09-19 15:10   ` Oleg Nesterov
                     ` (3 more replies)
  2 siblings, 4 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-19 15:03 UTC (permalink / raw)
  To: Kyle Walker, Christoph Lameter, Linus Torvalds, Michal Hocko
  Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel,
	Stanislav Kozina, Tetsuo Handa

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes' memory accesses will stall within the page fault
> handler.

And there are other potential reasons for deadlock.

Stupid idea. Can't we help the memory hog to free its memory? This is
orthogonal to other improvements we can do.

Please don't tell me the patch below is ugly, incomplete and suboptimal
in many ways, I know ;) I am not sure it is even correct. Just to explain
what I mean.

Perhaps oom_unmap_func() should only zap the anonymous vmas... and there
are a lot of other details which should be discussed if this can make any
sense.

Oleg.
---

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,6 +493,26 @@ void oom_killer_enable(void)
 	up_write(&oom_sem);
 }
 
+static struct mm_struct *oom_unmap_mm;
+
+static void oom_unmap_func(struct work_struct *work)
+{
+	struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
+
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return;
+
+	// If this is not safe we can do use_mm() + unuse_mm()
+	down_read(&mm->mmap_sem);
+	if (mm->mmap)
+		zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
+	mmdrop(mm);
+}
+static DECLARE_WORK(oom_unmap_work, oom_unmap_func);
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		victim = p;
 	}
 
-	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
+	atomic_inc(&mm->mm_count);
 	mark_tsk_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -604,6 +624,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	rcu_read_unlock();
 
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	if (cmpxchg(&oom_unmap_mm, NULL, mm))
+		mmdrop(mm);
+	else
+		queue_work(system_unbound_wq, &oom_unmap_work);
 	put_task_struct(victim);
 }
 #undef K


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
@ 2015-09-19 15:10   ` Oleg Nesterov
  2015-09-19 15:58   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-19 15:10 UTC (permalink / raw)
  To: Kyle Walker, Christoph Lameter, Linus Torvalds, Michal Hocko
  Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel,
	Stanislav Kozina, Tetsuo Handa

(off-topic)

On 09/19, Oleg Nesterov wrote:
>
> @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		victim = p;
>  	}
>
> -	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> +	atomic_inc(&mm->mm_count);

Btw, I think we need this change anyway. This is purely theoretical, but
otherwise the task can exit and free its mm_struct right after task_unlock(),
then this mm_struct can be reallocated and used by another task, so we
can't trust the "p->mm == mm" check below.

Oleg.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-19 14:33           ` Tetsuo Handa
@ 2015-09-19 15:51             ` Michal Hocko
  2015-09-21 23:33             ` David Rientjes
  1 sibling, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-09-19 15:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: cl, oleg, kwalker, akpm, rientjes, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

On Sat 19-09-15 23:33:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This has been posted in various forms many times over the past years. I
> > still do not think this is the right approach to dealing with the problem.
> 
> I do not think "GFP_NOFS can fail" patch is a right approach because
> that patch easily causes messages like below.
> 
>   Buffer I/O error on dev sda1, logical block 34661831, lost async page write
>   XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
>   XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
>   XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

These messages just tell you that the allocation fails repeatedly. Have
a look and check the code: they are basically open-coded NOFAIL
allocations. They haven't been converted to actually tell the MM layer
that they cannot fail because Dave said they have a long-term plan to
change this code and basically implement different failing strategies.

> Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.
> 
> I believe choosing more OOM victims is the only way which can solve OOM stalls.

I am very well aware of your position and of all the attempts to tweak
different code paths to get your corner case through. I, however, care
more about the longer-term goals. And I believe that the page allocator
and the reclaim should strive to be less deadlock prone in the first
place. That includes more natural semantics, and a non-failing default
semantic is really error prone IMHO. We have been through this discussion
many times already and I've tried to express that this is a long-term
goal with incremental steps.
I really hate to do "easy" things now just to feel better about a
particular case which will kick us back a little bit later. And from my
own experience I can tell you that a more non-deterministic OOM behavior
is something people complain about.

> > You can quickly deplete memory reserves this way without making further
> > progress (I am afraid you can even trigger this from userspace without
> > having big privileges) so even the administrator will have no way to
> > intervene.
> 
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is an invisible lock dependency.

Of course. This is a heuristic and as such it cannot ever work in 100%
of situations. And it is not the first heuristic we have for the OOM
killer. The last time this was all rewritten was because the OOM
killer was too unreliable/non-deterministic. Reports have decreased
considerably since then.

> Therefore, why not give up
> the "there should be only up to 1 TIF_MEMDIE task" rule?

This has been explained several times. There is no guarantee this would
help, and _your_ own usecase shows how you can end up with lock dependency
chains so long that you can easily eat up the whole memory reserves
before you can make any progress.

I do agree that a hand-brake mechanism is really desirable for those who
really care.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up the "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than one TIF_MEMDIE task
> and somehow manage things so that the memory reserves do not deplete.

But those two goals go against each other.

[...]

> If you still want to keep the "there should be only up to 1 TIF_MEMDIE task"
> rule, what alternative do you have? (I do not like panic_on_oom_timeout
> because it is a more data-lossy approach than choosing the next OOM victim.)

I am not married to the "1 TIF_MEMDIE task" thing. I just think that there
is still a lot of room for other improvements. The original issue which
triggered this discussion again is a good example. I completely miss why
a writer has to be unkillable when the fs is frozen. There are others
which are more complicated, of course, including the whole class
represented by GFP_NOFS allocations, as you have noted. But we still have
room for improvements even in the reclaim. It was suggested quite some
time ago that the memory mapped by the OOM victim might be unmapped;
basically what Oleg is proposing in another email. I didn't get to read
his email properly yet but that should certainly help to reduce the
problem space.

-- 
Michal Hocko
SUSE Labs


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
  2015-09-19 15:10   ` Oleg Nesterov
@ 2015-09-19 15:58   ` Michal Hocko
  2015-09-20 13:16     ` Oleg Nesterov
  2015-09-19 22:24   ` Linus Torvalds
  2015-09-20 14:50   ` Tetsuo Handa
  3 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-09-19 15:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Linus Torvalds, akpm, rientjes,
	hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina,
	Tetsuo Handa

On Sat 19-09-15 17:03:16, Oleg Nesterov wrote:
> On 09/17, Kyle Walker wrote:
> >
> > Currently, the oom killer will attempt to kill a process that is in
> > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > period of time, such as processes writing to a frozen filesystem during
> > a lengthy backup operation, this can result in a deadlock condition as
> > related processes' memory accesses will stall within the page fault
> > handler.
> 
> And there are other potential reasons for deadlock.
> 
> Stupid idea. Can't we help the memory hog to free its memory? This is
> orthogonal to other improvements we can do.
> 
> Please don't tell me the patch below is ugly, incomplete and suboptimal
> in many ways, I know ;) I am not sure it is even correct. Just to explain
> what I mean.

Unmapping the memory of the oom victim has already been mentioned as a
way to improve the OOM killer behavior. Nobody has implemented that yet,
though, unfortunately. I have had that on my TODO list since Mel and I
discussed it at LSF.

> Perhaps oom_unmap_func() should only zap the anonymous vmas... and there
> are a lot of other details which should be discussed if this can make any
> sense.

I have just returned from an internal conference so my head is
completely cabbaged. I will have a look on Monday. From a quick look
the idea is feasible. You cannot rely on the worker context because
workqueues might be completely stuck at this stage. You also cannot
take mmap_sem directly because that might be held already, so you need
a trylock instead. Focusing on anonymous vmas first sounds like a good
idea to me because that would be simpler, I guess.

> 
> Oleg.
> ---
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -493,6 +493,26 @@ void oom_killer_enable(void)
>  	up_write(&oom_sem);
>  }
>  
> +static struct mm_struct *oom_unmap_mm;
> +
> +static void oom_unmap_func(struct work_struct *work)
> +{
> +	struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> +
> +	if (!atomic_inc_not_zero(&mm->mm_users))
> +		return;
> +
> +	// If this is not safe we can do use_mm() + unuse_mm()
> +	down_read(&mm->mmap_sem);
> +	if (mm->mmap)
> +		zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
> +	up_read(&mm->mmap_sem);
> +
> +	mmput(mm);
> +	mmdrop(mm);
> +}
> +static DECLARE_WORK(oom_unmap_work, oom_unmap_func);
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		victim = p;
>  	}
>  
> -	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> +	atomic_inc(&mm->mm_count);
>  	mark_tsk_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@ -604,6 +624,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	rcu_read_unlock();
>  
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	if (cmpxchg(&oom_unmap_mm, NULL, mm))
> +		mmdrop(mm);
> +	else
> +		queue_work(system_unbound_wq, &oom_unmap_work);
>  	put_task_struct(victim);
>  }
>  #undef K

-- 
Michal Hocko
SUSE Labs


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
  2015-09-19 15:10   ` Oleg Nesterov
  2015-09-19 15:58   ` Michal Hocko
@ 2015-09-19 22:24   ` Linus Torvalds
  2015-09-19 22:54     ` Raymond Jennings
                       ` (3 more replies)
  2015-09-20 14:50   ` Tetsuo Handa
  3 siblings, 4 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-09-19 22:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> +
> +static void oom_unmap_func(struct work_struct *work)
> +{
> +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> +
> +       if (!atomic_inc_not_zero(&mm->mm_users))
> +               return;
> +
> +       // If this is not safe we can do use_mm() + unuse_mm()
> +       down_read(&mm->mmap_sem);

I don't think this is safe.

What makes you sure that we might not deadlock on the mmap_sem here?
For all we know, the process that is going out of memory is in the
middle of a mmap(), and already holds the mmap_sem for writing. No?

So at the very least that needs to be a trylock, I think. And I'm not
sure zap_page_range() is ok with the mmap_sem only held for reading.
Normally our rule is that you can *populate* the page tables
concurrently, but you can't tear them down.
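
A trylock variant of the earlier sketch would look roughly like this
(illustrative only):

	/* Never block on the victim's mmap_sem: if someone, possibly
	 * the victim itself in mmap(), holds it, just retry later. */
	if (!down_read_trylock(&mm->mmap_sem))
		return;		/* or requeue the work */
	if (mm->mmap)
		zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
	up_read(&mm->mmap_sem);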

                Linus


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24   ` Linus Torvalds
@ 2015-09-19 22:54     ` Raymond Jennings
  2015-09-19 23:00     ` Raymond Jennings
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Raymond Jennings @ 2015-09-19 22:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina,
	Tetsuo Handa

On Sat, Sep 19, 2015 at 3:24 PM, Linus Torvalds 
<torvalds@linux-foundation.org> wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> 
> wrote:
>>  +
>>  +static void oom_unmap_func(struct work_struct *work)
>>  +{
>>  +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
>>  +
>>  +       if (!atomic_inc_not_zero(&mm->mm_users))
>>  +               return;
>>  +
>>  +       // If this is not safe we can do use_mm() + unuse_mm()
>>  +       down_read(&mm->mmap_sem);
> 
> I don't think this is safe.
> 
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?
> 
> So at the very least that needs to be a trylock, I think. And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down.

Is it also possible to have mmap fail with EINTR?  Presumably that 
would let a pending SIGKILL from the oom handler punch it out of the 
kernel and back to userspace.

> 
> 
>                 Linus
> 

* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24   ` Linus Torvalds
  2015-09-19 22:54     ` Raymond Jennings
@ 2015-09-19 23:00     ` Raymond Jennings
  2015-09-19 23:13       ` Linus Torvalds
  2015-09-20  9:33     ` Michal Hocko
  2015-09-20 12:56     ` Oleg Nesterov
  3 siblings, 1 reply; 109+ messages in thread
From: Raymond Jennings @ 2015-09-19 23:00 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/19/15 15:24, Linus Torvalds wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> +
>> +static void oom_unmap_func(struct work_struct *work)
>> +{
>> +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
>> +
>> +       if (!atomic_inc_not_zero(&mm->mm_users))
>> +               return;
>> +
>> +       // If this is not safe we can do use_mm() + unuse_mm()
>> +       down_read(&mm->mmap_sem);
> I don't think this is safe.
>
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?

Potentially stupid question that others may be asking: Is it legal to 
return EINTR from mmap() to let a SIGKILL from the OOM handler punch the 
task out of the kernel and back to userspace?

(sorry for the dupe btw, new email client snuck in html and I got bounced)

> So at the very least that needs to be a trylock, I think. And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down.
>
>                  Linus
>

* Re: can't oom-kill zap the victim's memory?
  2015-09-19 23:00     ` Raymond Jennings
@ 2015-09-19 23:13       ` Linus Torvalds
  0 siblings, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-09-19 23:13 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina,
	Tetsuo Handa

On Sat, Sep 19, 2015 at 4:00 PM, Raymond Jennings <shentino@gmail.com> wrote:
>
> Potentially stupid question that others may be asking: Is it legal to return
> EINTR from mmap() to let a SIGKILL from the OOM handler punch the task out
> of the kernel and back to userspace?

Yes. Note that mmap() itself seldom sleeps or allocates much memory
(yeah, there's the vma itself and some minimal stuff), so it's mainly
an issue for things like MAP_POPULATE etc.

The more common situation is things like uninterruptible reads when a
device (or network) is not responding, and we have special support for
"killable" waits that act like normal uninterruptible waits but can be
interrupted by deadly signals, exactly because for those cases we
don't need to worry about things like POSIX return value guarantees
("all or nothing" for file reads) etc.

So you do generally have to write extra code for the "killable sleep".
But it's a good thing to do, if you notice that certain cases aren't
responding well to oom killing because they keep on waiting.
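
One existing example of that extra code is the page-lock wait in the
generic read path, roughly (abbreviated from do_generic_file_read() of
that era):

	error = lock_page_killable(page);
	if (unlikely(error))	/* -EINTR: a fatal signal ended the wait */
		goto readpage_error;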

                Linus


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24   ` Linus Torvalds
  2015-09-19 22:54     ` Raymond Jennings
  2015-09-19 23:00     ` Raymond Jennings
@ 2015-09-20  9:33     ` Michal Hocko
  2015-09-20 13:06       ` Oleg Nesterov
  2015-09-20 12:56     ` Oleg Nesterov
  3 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-09-20  9:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +       if (!atomic_inc_not_zero(&mm->mm_users))
> > +               return;
> > +
> > +       // If this is not safe we can do use_mm() + unuse_mm()
> > +       down_read(&mm->mmap_sem);
> 
> I don't think this is safe.
> 
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?
> 
> So at the very least that needs to be a trylock, I think.

Agreed.

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down

Actually mmap_sem for reading should be sufficient because we do not
alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
for example.

-- 
Michal Hocko
SUSE Labs


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2015-09-20  9:33     ` Michal Hocko
@ 2015-09-20 12:56     ` Oleg Nesterov
  2015-09-20 18:05       ` Linus Torvalds
  3 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-20 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/19, Linus Torvalds wrote:
>
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +       if (!atomic_inc_not_zero(&mm->mm_users))
> > +               return;
> > +
> > +       // If this is not safe we can do use_mm() + unuse_mm()
> > +       down_read(&mm->mmap_sem);
>
> I don't think this is safe.
>
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?

In this case the workqueue thread will block. But it cannot block
forever. I mean, if it can, then the killed process will never exit
(exit_mm does down_read) and release its memory, so we lose anyway.

But let me repeat, this patch is obviously not complete/etc,

> So at the very least that needs to be a trylock, I think.

And we want to avoid using workqueues when the caller can do this
directly. And in this case we certainly need trylock. But this needs
some refactoring: we do not want to do this under oom_lock, otoh it
makes sense to do this from mark_oom_victim() if current && killed,
and a lot more details.

The workqueue thread has other reasons for trylock, but probably not
in the initial version of this patch. And perhaps we should use a
dedicated kthread and not use workqueues at all. And yes, a single
"mm_struct *oom_unmap_mm" is ugly; it should be a list of mm's to
unmap, but then at least we need MMF_MEMDIE.

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down.

Well, according to madvise_need_mmap_write() MADV_DONTNEED does this
under down_read().

But yes, yes, this is probably not right anyway. Say, VM_LOCKED...
That is why I mentioned that perhaps this should only unmap the
anonymous pages. We can probably add a zap_details->for_oom hint.
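
Just to illustrate what I mean (purely hypothetical, no such field
exists anywhere yet):

	/* zap_details currently only carries the mapping-truncation
	 * fields; ->for_oom would let the zapping code skip anything
	 * that is not trivially freeable anonymous memory */
	struct zap_details {
		struct address_space *check_mapping;
		pgoff_t first_index;
		pgoff_t last_index;
		bool for_oom;		/* new, illustrative only */
	};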



Another question is whether it is safe to abuse a foreign mm this way.
Well, zap_page_range_single() does this, so it is probably safe.
But we can do use_mm().
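
Something like this, as a minimal sketch (illustrative only, no error
handling, and the anon-only filter is deliberately crude):

	static void zap_victim_mm(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		use_mm(mm);	/* borrow the victim's mm, so we do not
				 * zap a "foreign" mm */
		down_read(&mm->mmap_sem);
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			if (!vma->vm_file && !(vma->vm_flags & VM_LOCKED))
				zap_page_range(vma, vma->vm_start,
					       vma->vm_end - vma->vm_start,
					       NULL);
		up_read(&mm->mmap_sem);
		unuse_mm(mm);
	}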

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-20  9:33     ` Michal Hocko
@ 2015-09-20 13:06       ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-20 13:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20, Michal Hocko wrote:
>
> On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> > On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > +
> > > +static void oom_unmap_func(struct work_struct *work)
> > > +{
> > > +       struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > > +
> > > +       if (!atomic_inc_not_zero(&mm->mm_users))
> > > +               return;
> > > +
> > > +       // If this is not safe we can do use_mm() + unuse_mm()
> > > +       down_read(&mm->mmap_sem);
> >
> > I don't think this is safe.
> >
> > What makes you sure that we might not deadlock on the mmap_sem here?
> > For all we know, the process that is going out of memory is in the
> > middle of a mmap(), and already holds the mmap_sem for writing. No?
> >
> > So at the very least that needs to be a trylock, I think.
>
> Agreed.

Why? See my reply to Linus's email.

Just in case: yes, sure, the unconditional down_read() is suboptimal, but
this is minor compared to other problems we need to solve.

> > And I'm not
> > sure zap_page_range() is ok with the mmap_sem only held for reading.
> > Normally our rule is that you can *populate* the page tables
> > concurrently, but you can't tear them down
>
> Actually mmap_sem for reading should be sufficient because we do not
> alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
> for example.

Yes, but see the ->vm_flags check in madvise_dontneed().
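
For the archives, the check in question, paraphrasing mm/madvise.c of
this era:

	static long madvise_dontneed(struct vm_area_struct *vma,
				     struct vm_area_struct **prev,
				     unsigned long start, unsigned long end)
	{
		*prev = vma;
		/* exactly the vmas that make a blind zap unsafe */
		if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
			return -EINVAL;

		zap_page_range(vma, start, end - start, NULL);
		return 0;
	}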

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 15:58   ` Michal Hocko
@ 2015-09-20 13:16     ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-20 13:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kyle Walker, Christoph Lameter, Linus Torvalds, akpm, rientjes,
	hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina,
	Tetsuo Handa

On 09/19, Michal Hocko wrote:
>
> On Sat 19-09-15 17:03:16, Oleg Nesterov wrote:
> >
> > Stupid idea. Can't we help the memory hog to free its memory? This is
> > orthogonal to other improvements we can do.
> >
> > Please don't tell me the patch below is ugly, incomplete and suboptimal
> > in many ways, I know ;) I am not sure it is even correct. Just to explain
> > what I mean.
>
> Unmapping the memory for the oom victim has been already mentioned as a
> way to improve the OOM killer behavior. Nobody has implemented that yet
> though unfortunately. I have that on my TODO list since we have
> discussed it with Mel at LSF.

OK, good. So perhaps we should try to do this.

>
> > Perhaps oom_unmap_func() should only zap the anonymous vmas... and there
> > are a lot of other details which should be discussed if this can make any
> > sense.
>
> I have just returned from an internal conference so my head is
> completely cabbaged. I will have a look on Monday. From a quick look
> the idea is feasible. You cannot rely on the worker context because
> workqueues might be completely stuck at this stage.

Yes, this is true. See another email; probably oom-kill.c needs its own
kthread.

And again, we should actually try to avoid queue_work or queue_kthread_work
in any case. But not in the initial implementation. And the initial
implementation could use workqueues, I think. In the likely case the
system_unbound_wq pool should have an idle thread.

> You also cannot
> take mmap_sem directly because that might be held already so you need
> a try_lock instead.

Still can't understand this part. See other emails, perhaps I missed
something.

> Focusing on anonymous vmas first sounds like a good
> idea to me because that would be simpler I guess.

And safer.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
                     ` (2 preceding siblings ...)
  2015-09-19 22:24   ` Linus Torvalds
@ 2015-09-20 14:50   ` Tetsuo Handa
  2015-09-20 14:55     ` Oleg Nesterov
  3 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-20 14:50 UTC (permalink / raw)
  To: oleg, kwalker, cl, torvalds, mhocko
  Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina

Oleg Nesterov wrote:
> On 09/17, Kyle Walker wrote:
> >
> > Currently, the oom killer will attempt to kill a process that is in
> > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > period of time, such as processes writing to a frozen filesystem during
> > a lengthy backup operation, this can result in a deadlock condition as
> > related processes memory access will stall within the page fault
> > handler.
> 
> And there are other potential reasons for deadlock.
> 
> Stupid idea. Can't we help the memory hog to free its memory? This is
> orthogonal to other improvements we can do.

So, we are trying to release memory without waiting for the victim to
arrive at exit_mm() from do_exit(), right? If it works, it will be a
simple and small change that will be easy to backport.

The idea is that since fatal_signal_pending() tasks no longer return to
user space, we can release memory allocated for use by user space, right?

Then, I think that this approach can be applied to not only OOM-kill case
but also regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1)
or somewhere?). A dedicated kernel thread (not limited to OOM-kill purpose)
scans for fatal_signal_pending() tasks and releases that task's memory.


* Re: can't oom-kill zap the victim's memory?
  2015-09-20 14:50   ` Tetsuo Handa
@ 2015-09-20 14:55     ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-20 14:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: kwalker, cl, torvalds, mhocko, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On 09/20, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 09/17, Kyle Walker wrote:
> > >
> > > Currently, the oom killer will attempt to kill a process that is in
> > > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > > period of time, such as processes writing to a frozen filesystem during
> > > a lengthy backup operation, this can result in a deadlock condition as
> > > related processes memory access will stall within the page fault
> > > handler.
> >
> > And there are other potential reasons for deadlock.
> >
> > Stupid idea. Can't we help the memory hog to free its memory? This is
> > orthogonal to other improvements we can do.
>
> So, we are trying to release memory without waiting for the victim to
> arrive at exit_mm() from do_exit(), right? If it works, it will be a
> simple and small change that will be easy to backport.
>
> The idea is that since fatal_signal_pending() tasks no longer return to
> user space, we can release memory allocated for use by user space, right?

Yes.

> Then, I think that this approach can be applied to not only OOM-kill case
> but also regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1)
> or somewhere?).

I don't think so... but we might want to do this if (say) we are not going
to kill someone else because fatal_signal_pending(current).

> A dedicated kernel thread (not limited to OOM-kill purpose)
> scans for fatal_signal_pending() tasks and releases that task's memory.

Perhaps a dedicated kernel thread makes sense (see other emails),
but I don't think it should scan the killed threads. oom-kill should
kick it.

Anyway, let me repeat, there are a lot of details we might want to
discuss. But the initial changes should be as simple as possible, imo.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-20 12:56     ` Oleg Nesterov
@ 2015-09-20 18:05       ` Linus Torvalds
  2015-09-20 18:21         ` Raymond Jennings
                           ` (3 more replies)
  0 siblings, 4 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-09-20 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> In this case the workqueue thread will block.

What workqueue thread?

   pagefault_out_of_memory ->
      out_of_memory ->
         oom_kill_process

as far as I can tell, this can be called by any task. Now, that
pagefault case should only happen when the page fault comes from user
space, but we also have

   __alloc_pages_slowpath ->
      __alloc_pages_may_oom ->
         out_of_memory ->
            oom_kill_process

which can be called from just about any context (but atomic
allocations will never get here, so it can schedule etc).

So what's your point? Explain again just how do you guarantee that you
can take the mmap_sem.

                       Linus


* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05       ` Linus Torvalds
@ 2015-09-20 18:21         ` Raymond Jennings
  2015-09-20 18:23         ` Raymond Jennings
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Raymond Jennings @ 2015-09-20 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina,
	Tetsuo Handa

On Sun, Sep 20, 2015 at 11:05 AM, Linus Torvalds 
<torvalds@linux-foundation.org> wrote:
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> 
> wrote:
>> 
>>  In this case the workqueue thread will block.
> 
> What workqueue thread?
> 
>    pagefault_out_of_memory ->
>       out_of_memory ->
>          oom_kill_process
> 
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
> 
>    __alloc_pages_slowpath ->
>       __alloc_pages_may_oom ->
>          out_of_memory ->
>             oom_kill_process
> 
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).
> 
> So what's your point? Explain again just how do you guarantee that you
> can take the mmap_sem.
> 
>                        Linus

Would it be a cleaner design in general to require all threads to 
completely exit kernel space before being terminated?  Possibly 
expedited by noticing fatal signals and riding the EINTR rocket back up 
the stack?

My two cents:  If we do that we won't have to worry about fatally 
wounded tasks slipping into a coma before they cough up any semaphores 
or locks.




* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05       ` Linus Torvalds
  2015-09-20 18:21         ` Raymond Jennings
@ 2015-09-20 18:23         ` Raymond Jennings
  2015-09-20 19:07         ` Raymond Jennings
  2015-09-21 13:44         ` Oleg Nesterov
  3 siblings, 0 replies; 109+ messages in thread
From: Raymond Jennings @ 2015-09-20 18:23 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa


On 09/20/15 11:05, Linus Torvalds wrote:
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> In this case the workqueue thread will block.
> What workqueue thread?
>
>     pagefault_out_of_memory ->
>        out_of_memory ->
>           oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
>     __alloc_pages_slowpath ->
>        __alloc_pages_may_oom ->
>           out_of_memory ->
>              oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).
>
> So what's your point? Explain again just how do you guarantee that you
> can take the mmap_sem.
>
>                         Linus
Would it be a cleaner design in general to require all threads to 
completely exit kernel space before being terminated?  Possibly 
expedited by noticing fatal signals and riding the EINTR rocket back up 
the stack?

My two cents:  If we do that we won't have to worry about fatally 
wounded tasks slipping into a coma before they cough up any semaphores 
or locks.



* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05       ` Linus Torvalds
  2015-09-20 18:21         ` Raymond Jennings
  2015-09-20 18:23         ` Raymond Jennings
@ 2015-09-20 19:07         ` Raymond Jennings
  2015-09-21 13:57           ` Oleg Nesterov
  2015-09-21 13:44         ` Oleg Nesterov
  3 siblings, 1 reply; 109+ messages in thread
From: Raymond Jennings @ 2015-09-20 19:07 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20/15 11:05, Linus Torvalds wrote:
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> In this case the workqueue thread will block.
> What workqueue thread?
>
>     pagefault_out_of_memory ->
>        out_of_memory ->
>           oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
>     __alloc_pages_slowpath ->
>        __alloc_pages_may_oom ->
>           out_of_memory ->
>              oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).

I think in this case the oom killer should just slap a SIGKILL on the 
task and then back out, and whatever needed the memory should just wait 
patiently for the sacrificial lamb to commit seppuku.

Which, btw, we should IMO encourage ASAP in the context of the lamb by 
having anything potentially locky or semaphory pay attention to whether the 
task in question has a fatal signal pending, and if so, drop everything 
and run like hell so that the task can cough up any locks or semaphores.
> So what's your point? Explain again just how do you guarantee that you
> can take the mmap_sem.
>
>                         Linus

Also, I observed that a task in the middle of dumping core doesn't 
respond to signals while it's dumping, and I would guess that might be 
the case even if the task receives a SIGKILL from the OOM handler.  Just 
a potential observation.


* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2015-09-20 19:07         ` Raymond Jennings
@ 2015-09-21 13:44         ` Oleg Nesterov
  2015-09-21 14:24           ` Michal Hocko
  2015-09-21 16:55           ` Linus Torvalds
  3 siblings, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-21 13:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20, Linus Torvalds wrote:
>
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > In this case the workqueue thread will block.
>
> What workqueue thread?

I must have missed something. I can't understand your and Michal's
concerns.

>    pagefault_out_of_memory ->
>       out_of_memory ->
>          oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
>    __alloc_pages_slowpath ->
>       __alloc_pages_may_oom ->
>          out_of_memory ->
>             oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).

So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
takes mmap_sem and frees the memory allocated by user space.

If this can lead to deadlock somehow, then we can hit the same deadlock
when an oom-killed thread calls exit_mm().

> So what's your point?

This can help if the killed process refuses to die and (of course) it
doesn't hold the mmap_sem for writing. Say, it waits for some mutex
held by the task which tries to alloc the memory and triggers oom.

> Explain again just how do you guarantee that you
> can take the mmap_sem.

This is not guaranteed, down_read(mmap_sem) can block forever. But this
means that the (killed) victim never drops mmap_sem / never exits, so
we lose anyway. We have no memory, oom-killer is blocked, etc.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-20 19:07         ` Raymond Jennings
@ 2015-09-21 13:57           ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-21 13:57 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina,
	Tetsuo Handa

On 09/20, Raymond Jennings wrote:
>
> On 09/20/15 11:05, Linus Torvalds wrote:
>>
>> which can be called from just about any context (but atomic
>> allocations will never get here, so it can schedule etc).
>
> I think in this case the oom killer should just slap a SIGKILL on the
> task and then back out, and whatever needed the memory should just wait
> patiently for the sacrificial lamb to commit seppuku.

Not sure I understand you correctly, but this is what we currently do.
The only problem is that this doesn't work sometimes.

> Also, I observed that a task in the middle of dumping core doesn't
> respond to signals while it's dumping,

How did you observe this? The coredumping is killable.

Although yes, we have problems here in oom condition. In particular
with CLONE_VM tasks.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 13:44         ` Oleg Nesterov
@ 2015-09-21 14:24           ` Michal Hocko
  2015-09-21 15:32             ` Oleg Nesterov
  2015-09-21 16:55           ` Linus Torvalds
  1 sibling, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-09-21 14:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
[...]
> So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> takes mmap_sem and frees the memory allocated by user space.

OK, this might have been a bit confusing. I didn't mean you cannot use
mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
mentioned that you _shouldn't_ use workqueue context in the first place
because all the workers might be blocked on locks and new workers cannot
be created due to memory pressure. This has been demonstrated already
where sysrq+f couldn't trigger the OOM killer because the work item to do so
was waiting for a worker which never came...

So I think we probably need to do this in the OOM killer context (with
try_lock) or hand it over to a special kernel thread. I am not sure a
special kernel thread is really worth it, but maybe it will turn out to
be a better choice.
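
To make the try_lock variant concrete, a minimal sketch (names are
illustrative, nothing here is from a posted patch):

	/* called from the OOM killer context after the victim has been
	 * killed; backs off when the victim holds mmap_sem for writing,
	 * in which case a dedicated thread could retry later */
	static void try_oom_zap(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		if (!down_read_trylock(&mm->mmap_sem))
			return;		/* e.g. victim is inside mmap() */
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			if (!vma->vm_file && !(vma->vm_flags & VM_LOCKED))
				zap_page_range(vma, vma->vm_start,
					       vma->vm_end - vma->vm_start,
					       NULL);
		up_read(&mm->mmap_sem);
	}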
-- 
Michal Hocko
SUSE Labs


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 14:24           ` Michal Hocko
@ 2015-09-21 15:32             ` Oleg Nesterov
  2015-09-21 16:12               ` Michal Hocko
                                 ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-21 15:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/21, Michal Hocko wrote:
>
> On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> [...]
> > So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> > takes mmap_sem and frees the memory allocated by user space.
>
> OK, this might have been a bit confusing. I didn't mean you cannot use
> mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> mentioned that you _shouldn't_ use workqueue context in the first place
> because all the workers might be blocked on locks and new workers cannot
> be created due to memory pressure.

Yes, yes, and I already tried to comment this part. We probably need a
dedicated kernel thread, but I still think (although I am not sure) that
the initial change can use a workqueue. In the likely case the
system_unbound_wq pool should have an idle thread; if not - OK, this
change won't help in this case. This is minor.

> So I think we probably need to do this in the OOM killer context (with
> try_lock)

Yes we should try to do this in the OOM killer context, and in this case
(of course) we need trylock. Let me quote my previous email:

	And we want to avoid using workqueues when the caller can do this
	directly. And in this case we certainly need trylock. But this needs
	some refactoring: we do not want to do this under oom_lock, otoh it
	makes sense to do this from mark_oom_victim() if current && killed,
	and a lot more details.

and probably this is another reason why do we need MMF_MEMDIE. But again,
I think the initial change should be simple.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 15:32             ` Oleg Nesterov
@ 2015-09-21 16:12               ` Michal Hocko
  2015-09-22 16:06                 ` Oleg Nesterov
  2015-09-21 16:51               ` Tetsuo Handa
  2015-09-21 23:42               ` David Rientjes
  2 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-09-21 16:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
> On 09/21, Michal Hocko wrote:
> >
> > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> > [...]
> > > So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> > > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> > > takes mmap_sem and frees the memory allocated by user space.
> >
> > OK, this might have been a bit confusing. I didn't mean you cannot use
> > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> > mentioned that you _shouldn't_ use workqueue context in the first place
> > because all the workers might be blocked on locks and new workers cannot
> > be created due to memory pressure.
> 
> Yes, yes, and I already tried to comment this part.

OK then we are on the same page, good.

> We probably need a
> dedicated kernel thread, but I still think (although I am not sure) that
> the initial change can use a workqueue. In the likely case the
> system_unbound_wq pool should have an idle thread; if not - OK, this
> change won't help in this case. This is minor.

The point is that the implementation should be robust from the very
beginning. I am not sure what you mean by the idle thread here but the
rescuer can get stuck the very same way as other workers. So I think that
we cannot rely on WQ for a real solution here.

> > So I think we probably need to do this in the OOM killer context (with
> > try_lock)
> 
> Yes we should try to do this in the OOM killer context, and in this case
> (of course) we need trylock. Let me quote my previous email:
> 
> 	And we want to avoid using workqueues when the caller can do this
> 	directly. And in this case we certainly need trylock. But this needs
> 	some refactoring: we do not want to do this under oom_lock,

Why do you think oom_lock would be a big deal? Address space of the
victim might be really large but we can back off after a batch of
unmapped pages.

>       otoh it
> 	makes sense to do this from mark_oom_victim() if current && killed,
> 	and a lot more details.
> 
> and probably this is another reason why do we need MMF_MEMDIE. But again,
> I think the initial change should be simple.

I definitely agree with the simplicity for the first iteration. That
means only unmap private exclusive pages and release at most a few megs of
them. I am still not sure about some details, e.g. a futex sitting in such
memory. Wouldn't threads blow up when they see an unmapped futex page,
try to page it in, and find it in an uninitialized state? Maybe this
is safe because they will die anyway, but I am not familiar with that
code.
-- 
Michal Hocko
SUSE Labs


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 15:32             ` Oleg Nesterov
  2015-09-21 16:12               ` Michal Hocko
@ 2015-09-21 16:51               ` Tetsuo Handa
  2015-09-22 12:43                 ` Oleg Nesterov
  2015-09-21 23:42               ` David Rientjes
  2 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-21 16:51 UTC (permalink / raw)
  To: oleg, mhocko
  Cc: torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Oleg Nesterov wrote:
> Yes, yes, and I already tried to comment this part. We probably need a
> dedicated kernel thread, but I still think (although I am not sure) that
> the initial change can use a workqueue. In the likely case the
> system_unbound_wq pool should have an idle thread; if not - OK, this
> change won't help in this case. This is minor.
> 
I imagined a dedicated kernel thread doing something like shown below.
(I don't know about mm->mmap management.)
mm->mmap_zapped corresponds to MMF_MEMDIE.
I think this kernel thread can be used for normal kill(pid, SIGKILL) cases.

----------
bool has_sigkill_task;
wait_queue_head_t kick_mm_zapper;

static void mm_zapper(void *unused)
{
	struct task_struct *g, *p;
	struct mm_struct *mm;

sleep:
	wait_event(kick_mm_zapper, has_sigkill_task);
	has_sigkill_task = false;
restart:
	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (likely(!fatal_signal_pending(p)))
			continue;
		task_lock(p);
		mm = p->mm;
		if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
			atomic_inc(&mm->mm_users);
			task_unlock(p);
			rcu_read_unlock();
			if (mm->mmap && !mm->mmap_zapped)
				zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
			mm->mmap_zapped = 1;
			up_read(&mm->mmap_sem);
			mmput(mm);
			cond_resched();
			goto restart;
		}
		task_unlock(p);
	}
	rcu_read_unlock();
	goto sleep;
}

kthread_run(mm_zapper, NULL, "mm_zapper");
----------


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 13:44         ` Oleg Nesterov
  2015-09-21 14:24           ` Michal Hocko
@ 2015-09-21 16:55           ` Linus Torvalds
  1 sibling, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-09-21 16:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon, Sep 21, 2015 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I must have missed something. I can't understand your and Michal's
> concerns.

Heh.  I looked at that patch, and apparently entirely missed the
queue_work() part of the whole patch, thinking it was a direct call.

So never mind.

                Linus


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-19  8:22 ` Michal Hocko
@ 2015-09-21 23:08   ` David Rientjes
  0 siblings, 0 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-21 23:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kyle Walker, akpm, hannes, vdavydov, oleg, linux-mm, linux-kernel

On Sat, 19 Sep 2015, Michal Hocko wrote:

> Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded
> state. Using it as an oom victim criteria makes the victim selection
> less deterministic which is undesirable. As much as I am aware of
> potential issues with the current implementation, making the behavior
> more random doesn't really help.
> 

Agreed, we can't avoid killing a process simply because it is in D state;
this isn't an indication that the process will not be able to exit, and in
the worst case this could panic the system if all other processes cannot be
oom killed.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-18 17:00       ` Christoph Lameter
  2015-09-18 19:07         ` Oleg Nesterov
  2015-09-19  8:32         ` Michal Hocko
@ 2015-09-21 23:27         ` David Rientjes
  2 siblings, 0 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-21 23:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Oleg Nesterov, Kyle Walker, akpm, mhocko, hannes, vdavydov,
	linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Christoph Lameter wrote:

> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already has
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted, but on the other hand this also allows
> only one process to be oom killed at a time. That process may be stuck
> in D state.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/oom_kill.c
> ===================================================================
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
> 

If this would result in the newly chosen process being guaranteed to exit, 
this would be fine.  Unfortunately, no such guarantee is possible.  If a 
thread is holding a contended mutex that the victim(s) require, this 
serial oom killer could eventually panic the system if that thread is 
OOM_DISABLE.

The solution that we have merged internally is described at 
http://marc.info/?l=linux-kernel&m=144010444913702 -- we provide access to 
memory reserves to processes that find a stalled exit in the oom killer so 
that they may allocate.  It comes along with a test module that takes a 
contended mutex and ensures that forward progress is made as long as 
memory reserves are not depleted.  We can't actually guarantee that memory 
reserves won't be depleted, but we (1) hope that nobody is actually 
allocating a lot of memory before dropping a mutex and (2) want to avoid 
the alternative, which is a system livelock.

This will address situations such as

	allocator			oom victim
	---------			----------
	mutex_lock(lock)
	alloc_pages(GFP_KERNEL)
					mutex_lock(lock)
					mutex_unlock(lock)
					handle SIGKILL

since this otherwise results in a livelock without a solution such as 
mine since the GFP_KERNEL allocation stalls forever waiting for the oom 
victim to acquire the mutex and exit.  This also works if the allocator is 
OOM_DISABLE.

This won't handle other situations where the victim gets wedged in D state 
and is not allocating memory, but this is by far the more common 
occurrence that we have dealt with.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-19 14:33           ` Tetsuo Handa
  2015-09-19 15:51             ` Michal Hocko
@ 2015-09-21 23:33             ` David Rientjes
  2015-09-22  5:33               ` Tetsuo Handa
  1 sibling, 1 reply; 109+ messages in thread
From: David Rientjes @ 2015-09-21 23:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

On Sat, 19 Sep 2015, Tetsuo Handa wrote:

> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency. Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?
> 

I don't see the connection between TIF_MEMDIE and ALLOC_NO_WATERMARKS 
being problematic.  It is simply the mechanism by which we give oom killed 
processes access to memory reserves if they need it.  I believe you are 
referring only to the oom killer stalling when it finds an oom victim.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE task
> and somehow manage in a way that memory reserves will not deplete.
> 

Your proposal, which I mostly agree with, tries to kill additional 
processes so that they allocate and drop the lock that the original victim 
depends on.  My approach, from 
http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but 
without the killing.  It's unnecessary to kill every process on the system 
that is depending on the same lock, and we can't know which processes are 
stalling on that lock and which are not.

I think it's much easier to simply identify such a situation where a 
process has not exited in a timely manner and then provide processes 
access to memory reserves without being killed.  We hope that the victim 
will have queued its mutex_lock() and allocators that are holding the lock 
will drop it after successfully utilizing memory reserves.

We can mitigate immediate depletion of memory reserves by requiring all 
allocators to reclaim (or compact) and calling the oom killer to identify 
the timeout before granting access to memory reserves for a single 
allocation before schedule_timeout_killable(1) and returning.

I don't know of any alternative solutions where we can guarantee that 
memory reserves cannot be depleted unless memory reserves are 100% of 
memory.


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 15:32             ` Oleg Nesterov
  2015-09-21 16:12               ` Michal Hocko
  2015-09-21 16:51               ` Tetsuo Handa
@ 2015-09-21 23:42               ` David Rientjes
  2 siblings, 0 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-21 23:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon, 21 Sep 2015, Oleg Nesterov wrote:

> Yes we should try to do this in the OOM killer context, and in this case
> (of course) we need trylock. Let me quote my previous email:
> 
> 	And we want to avoid using workqueues when the caller can do this
> 	directly. And in this case we certainly need trylock. But this needs
> 	some refactoring: we do not want to do this under oom_lock, otoh it
> 	makes sense to do this from mark_oom_victim() if current && killed,
> 	and a lot more details.
> 
> and probably this is another reason why do we need MMF_MEMDIE. But again,
> I think the initial change should be simple.
> 

I agree with the direction and I don't think it would be too complex to 
have a dedicated kthread that is kicked when we queue an mm to do 
MADV_DONTNEED behavior, and have that happen only if a trylock in 
oom_kill_process() fails to do it itself for anonymous mappings.  We may 
have different opinions of simplicity.


* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-21 23:33             ` David Rientjes
@ 2015-09-22  5:33               ` Tetsuo Handa
  2015-09-22 23:32                 ` David Rientjes
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-22  5:33 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

David Rientjes wrote:
> Your proposal, which I mostly agree with, tries to kill additional 
> processes so that they allocate and drop the lock that the original victim 
> depends on.  My approach, from 
> http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but 
> without the killing.  It's unnecessary to kill every process on the system 
> that is depending on the same lock, and we can't know which processes are 
> stalling on that lock and which are not.

Would you try your approach with the program below?
(My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)

---------- oom-depleter3.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int zero_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
	static char buffer[4096] = { };
	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer) &&
	       fsync(fd) == 0);
	return 0;
}

static int trigger(void *unused)
{
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned long i;
	zero_fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/*
	 * Create many child threads in order to enlarge time lag between
	 * the OOM killer sets TIF_MEMDIE to thread group leader and
	 * the OOM killer sends SIGKILL to that thread.
	 */
	for (i = 0; i < 1000; i++) {
		clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
		      NULL);
	}
	/* Let a child thread trigger the OOM killer. */
	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
	/* Deplete all memory reserve using the time lag. */
	for (i = size; i; i -= 4096)
		buf[i - 1] = 1;
	return * (char *) NULL; /* Kill all threads. */
}
---------- oom-depleter3.c end ----------

uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
shows that the memory reserves were completely depleted, and
uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
shows that the memory reserves were not used at all.
Is this result what you expected?


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 16:51               ` Tetsuo Handa
@ 2015-09-22 12:43                 ` Oleg Nesterov
  2015-09-22 14:30                   ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-22 12:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On 09/22, Tetsuo Handa wrote:
>
> I imagined a dedicated kernel thread doing something like shown below.
> (I don't know about mm->mmap management.)
> mm->mmap_zapped corresponds to MMF_MEMDIE.

No, it doesn't, please see below.

> bool has_sigkill_task;
> wait_queue_head_t kick_mm_zapper;

OK, if this kthread is kicked by oom this makes more sense, but still
doesn't look right at least initially.

Let me repeat, I do think we need MMF_MEMDIE or something like it before
we do something more clever. And in fact I think this flag makes sense
regardless.

> static void mm_zapper(void *unused)
> {
> 	struct task_struct *g, *p;
> 	struct mm_struct *mm;
>
> sleep:
> 	wait_event(kick_mm_zapper, has_sigkill_task);
> 	has_sigkill_task = false;
> restart:
> 	rcu_read_lock();
> 	for_each_process_thread(g, p) {
> 		if (likely(!fatal_signal_pending(p)))
> 			continue;
> 		task_lock(p);
> 		mm = p->mm;
> 		if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
                                       ^^^^^^^^^^^^^^^

We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
set by oom_kill_process() and cleared after zap_page_range().

Because otherwise we can not handle CLONE_VM correctly. Suppose that
an innocent process P does vfork() and the child is killed but not
exited yet. mm_zapper() can find the child, do zap_page_range(), and
surprise its alive parent P which uses the same ->mm.
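
To spell out the handshake I have in mind (hypothetical, the field and
the helpers do not exist anywhere yet):

	static DECLARE_WAIT_QUEUE_HEAD(mm_zapper_wait);

	/* oom_kill_process(), only after SIGKILL was sent to *every*
	 * task that can run on victim->mm, so a vfork() parent can
	 * never see its pages vanish underneath it */
	static void mark_mm_for_zap(struct mm_struct *mm)
	{
		mm->needs_zap = 1;	/* proposed field */
		wake_up(&mm_zapper_wait);
	}

	/* the zapper, after zap_page_range() has finished */
	static void mm_zap_done(struct mm_struct *mm)
	{
		mm->needs_zap = 0;
	}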

And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then
for_each_process_thread() doesn't really make sense. And if we have
a single MMF_MEMDIE process (likely case) then the unconditional
_trylock is suboptimal.

Tetsuo, can't we do something simple which "obviously can't hurt at
least" and then discuss the potential improvements?

And yes, yes, the "Kill all user processes sharing victim->mm" logic
in oom_kill_process() doesn't 100% look right, at least wrt the change
we discuss.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-22 12:43                 ` Oleg Nesterov
@ 2015-09-22 14:30                   ` Tetsuo Handa
  2015-09-22 14:45                     ` Oleg Nesterov
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-22 14:30 UTC (permalink / raw)
  To: oleg
  Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Oleg Nesterov wrote:
> On 09/22, Tetsuo Handa wrote:
> >
> > I imagined a dedicated kernel thread doing something like shown below.
> > (I don't know about mm->mmap management.)
> > mm->mmap_zapped corresponds to MMF_MEMDIE.
> 
> No, it doesn't, please see below.
> 
> > bool has_sigkill_task;
> > wait_queue_head_t kick_mm_zapper;
> 
> OK, if this kthread is kicked by oom this makes more sense, but still
> doesn't look right at least initially.

Yes, I meant this kthread is kicked upon sending SIGKILL. But I forgot that

> 
> Let me repeat, I do think we need MMF_MEMDIE or something like it before
> we do something more clever. And in fact I think this flag makes sense
> regardless.
> 
> > static void mm_zapper(void *unused)
> > {
> > 	struct task_struct *g, *p;
> > 	struct mm_struct *mm;
> >
> > sleep:
> > 	wait_event(kick_mm_zapper, has_sigkill_task);
> > 	has_sigkill_task = false;
> > restart:
> > 	rcu_read_lock();
> > 	for_each_process_thread(g, p) {
> > 		if (likely(!fatal_signal_pending(p)))
> > 			continue;
> > 		task_lock(p);
> > 		mm = p->mm;
> > 		if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
>                                        ^^^^^^^^^^^^^^^
> 
> We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> set by oom_kill_process() and cleared after zap_page_range().
> 
> Because otherwise we can not handle CLONE_VM correctly. Suppose that
> an innocent process P does vfork() and the child is killed but not
> exited yet. mm_zapper() can find the child, do zap_page_range(), and
> surprise its alive parent P which uses the same ->mm.

kill(P's-child, SIGKILL) does not kill P sharing the same ->mm.
Thus, mm_zapper() can be used only for the OOM-kill case, and
test_tsk_thread_flag(p, TIF_MEMDIE) should be used rather than
fatal_signal_pending(p).

> 
> And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then
> for_each_process_thread() doesn't really make sense. And if we have
> a single MMF_MEMDIE process (likely case) then the unconditional
> _trylock is suboptimal.

I guess the more likely case is that the OOM victim successfully exits
before mm_zapper() finds it.

I thought that a dedicated kernel thread which scans the task list can do
deferred zapping by automatically retrying (at a few seconds' interval?)
when down_read_trylock() fails.

> 
> Tetsuo, can't we do something simple which "obviously can't hurt at
> least" and then discuss the potential improvements?

No problem. I can wait for your version.

> 
> And yes, yes, the "Kill all user processes sharing victim->mm" logic
> in oom_kill_process() doesn't 100% look right, at least wrt the change
> we discuss.

If we use test_tsk_thread_flag(p, TIF_MEMDIE), we will need to set
TIF_MEMDIE on the victim after sending SIGKILL to all processes sharing
the victim's mm. Well, the likely case that the OOM victim exits before
mm_zapper() finds it becomes a not-so-likely case? Then, MMF_MEMDIE is
better than test_tsk_thread_flag(p, TIF_MEMDIE)...

> 
> Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-22 14:30                   ` Tetsuo Handa
@ 2015-09-22 14:45                     ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-22 14:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On 09/22, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 09/22, Tetsuo Handa wrote:
> > > 	rcu_read_lock();
> > > 	for_each_process_thread(g, p) {
> > > 		if (likely(!fatal_signal_pending(p)))
> > > 			continue;
> > > 		task_lock(p);
> > > 		mm = p->mm;
> > > 		if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
> >                                        ^^^^^^^^^^^^^^^
> >
> > We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> > set by oom_kill_process() and cleared after zap_page_range().
> >
> > Because otherwise we can not handle CLONE_VM correctly. Suppose that
> > an innocent process P does vfork() and the child is killed but not
> > exited yet. mm_zapper() can find the child, do zap_page_range(), and
> > surprise its alive parent P which uses the same ->mm.
>
> kill(P's-child, SIGKILL) does not kill P sharing the same ->mm.
> Thus, mm_zapper() can be used only for the OOM-kill case

Yes, and only if we know for sure that all tasks which can use
this ->mm were killed.

> and
> test_tsk_thread_flag(p, TIF_MEMDIE) should be used rather than
> fatal_signal_pending(p).

No. For example, just look at mark_oom_victim() at the start of
out_of_memory().

> > Tetsuo, can't we do something simple which "obviously can't hurt at
> > least" and then discuss the potential improvements?
>
> No problem. I can wait for your version.

All I wanted to say is that this all is a bit more complicated than it
looks at first glance.

Oleg.


* Re: can't oom-kill zap the victim's memory?
  2015-09-21 16:12               ` Michal Hocko
@ 2015-09-22 16:06                 ` Oleg Nesterov
  2015-09-22 23:04                   ` David Rientjes
  2015-09-23 20:59                   ` Michal Hocko
  0 siblings, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-09-22 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/21, Michal Hocko wrote:
>
> On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
> > On 09/21, Michal Hocko wrote:
> > >
> > > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> > > [...]
> > > > So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> > > > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> > > > takes mmap_sem and frees the memory allocated by user space.
> > >
> > > OK, this might have been a bit confusing. I didn't mean you cannot use
> > > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> > > mentioned that you _shouldn't_ use workqueue context in the first place
> > > because all the workers might be blocked on locks and new workers cannot
> > > be created due to memory pressure.
> >
> > Yes, yes, and I already tried to comment this part.
>
> OK then we are on the same page, good.

Yes, yes.

> > We probably need a
> > dedicated kernel thread, but I still think (although I am not sure) that
> > initial change can use workqueue. In the likely case system_unbound_wq pool
> > should have an idle thread, if not - OK, this change won't help in this
> > case. This is minor.
>
> The point is that the implementation should be robust from the very
> beginning.

OK, let it be a kthread from the very beginning, I won't argue. This
is really minor compared to other problems.

> > > So I think we probably need to do this in the OOM killer context (with
> > > try_lock)
> >
> > Yes we should try to do this in the OOM killer context, and in this case
> > (of course) we need trylock. Let me quote my previous email:
> >
> > 	And we want to avoid using workqueues when the caller can do this
> > 	directly. And in this case we certainly need trylock. But this needs
> > 	some refactoring: we do not want to do this under oom_lock,
>
> Why do you think oom_lock would be a big deal?

I don't really know... This doesn't look sane to me, but perhaps this
is just because I don't understand this code enough.

And note that the caller can hold other locks we do not even know about.
Most probably we should not deadlock, at least if we only unmap the anon
pages, but still this doesn't look safe.

But I agree, this probably needs more discussion.

> Address space of the
> victim might be really large but we can back off after a batch of
> unmapped pages.

Hmm. If we already have mmap_sem and started zap_page_range() then
I do not think it makes sense to stop until we free everything we can.

> I definitely agree with the simplicity for the first iteration. That
> means only unmap private exclusive pages and release at most few megs of
> them.

See above, I am not sure this makes sense. And in any case this will
complicate the initial changes, not simplify.

> I am still not sure about some details, e.g. futex sitting in such
> a memory. Wouldn't threads blow up when they see an unmapped futex page,
> try to page it in and it would be in an uninitialized state? Maybe this
> is safe

But this must be safe.

We do not care about userspace (assuming that all mm users have a
pending SIGKILL).

If this can (say) crash the kernel somehow, then we have a bug which
should be fixed. Simply because userspace can exploit this bug doing
MADV_DONTNEED from another thread or CLONE_VM process.



Finally. Whatever we do, we need to change oom_kill_process() first,
and I think we should do this regardless. The "Kill all user processes
sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
I'll try to make some patches tomorrow if I have time...
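
(For reference, the loop in question looks roughly like the following in
this era's mm/oom_kill.c — reconstructed from memory rather than quoted
exactly; the OOM_SCORE_ADJ_MIN check is the one discussed below:)

	rcu_read_lock();
	for_each_process(p) {
		if (p->mm != mm || same_thread_group(p, victim) ||
		    (p->flags & PF_KTHREAD))
			continue;
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;

		task_lock(p);	/* protects ->comm, see also below */
		pr_err("Kill process %d (%s) sharing same memory\n",
		       task_pid_nr(p), p->comm);
		task_unlock(p);
		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();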

But. Can't we just remove another ->oom_score_adj check when we try
to kill all mm users (the last for_each_process loop)? If yes, this
all can be simplified.

I guess we can't and it's a pity. Because it looks simply pointless
to not kill all mm users. This just means the select_bad_process()
picked the wrong task.


Say, vfork(). OK, it is possible that parent is OOM_SCORE_ADJ_MIN and
the child has already updated its oom_score_adj before exec. Now if
we kill the child we will only upset the parent for no reason; this
won't help to free the memory.
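
A hypothetical userspace timeline of that window, for illustration (note
that the fopen() in the vfork() child is itself the kind of "undefined"
usage the man page quote later in this thread warns about):

	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		/* the child borrows the parent's ->mm until execve() */
		pid_t pid = vfork();

		if (pid == 0) {
			/* parent runs with oom_score_adj = -1000; the child
			 * makes itself killable before exec */
			FILE *f = fopen("/proc/self/oom_score_adj", "w");
			if (f) {
				fprintf(f, "500");
				fclose(f);
			}
			/* killing the child in this window frees nothing:
			 * the mm still belongs to the unkillable parent */
			execl("/bin/true", "true", (char *)NULL);
			_exit(1);
		}
		return 0;
	}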



And while this is completely offtopic... why does it take task_lock()
to protect ->comm? Sure, without task_lock() we can print garbage.
Is it really that important? I am asking because sometimes people
think that it is not safe to use ->comm lockless, but this is not
true.

Oleg.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-22 16:06                 ` Oleg Nesterov
@ 2015-09-22 23:04                   ` David Rientjes
  2015-09-23 20:59                   ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-22 23:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Tue, 22 Sep 2015, Oleg Nesterov wrote:

> Finally. Whatever we do, we need to change oom_kill_process() first,
> and I think we should do this regardless. The "Kill all user processes
> sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> I'll try to make some patches tomorrow if I have time...
> 

Killing all processes sharing the ->mm has been done in the past to
obviously ensure that memory is eventually freed, but also to solve
mm->mmap_sem livelocks where a thread is holding a contended mutex and
needs a fatal signal to acquire TIF_MEMDIE if it calls into the oom killer,
so that it is able to allocate and may eventually drop the mutex.

> But. Can't we just remove another ->oom_score_adj check when we try
> to kill all mm users (the last for_each_process loop). If yes, this
> all can be simplified.
> 

For complete correctness, we would avoid killing any process that shares
memory with an oom-disabled thread, since the oom killer shall not kill it
and we would otherwise not free any memory.

> I guess we can't and its a pity. Because it looks simply pointless
> to not kill all mm users. This just means the select_bad_process()
> picked the wrong task.
> 

This is a side-effect of moving oom scoring to signal_struct from 
mm_struct.  It could be improved separately by flagging mm_structs that 
are unkillable, which would also allow for an optimization in
find_lock_task_mm().

> And while this is completely offtopic... why does it take task_lock()
> to protect ->comm? Sure, without task_lock() we can print garbage.
> Is it really that important? I am asking because sometimes people
> think that it is not safe to use ->comm lockless, but this is not
> true.
> 

This has come up a couple times in the past and, from what I recall,
Andrew has said that we don't actually care: the string will always be
NUL-terminated, and a race merely risks printing a garbled name.  There are other
places in the kernel where task_lock() isn't used solely to protect 
->comm.  It can be removed from the oom_kill_process() loop checking for 
other potential victims.
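
In other words, the lockless pattern amounts to nothing more than (sketch):

	/*
	 * Racy against a concurrent set_task_comm(), so the name may come
	 * out garbled, but ->comm is a fixed TASK_COMM_LEN buffer which is
	 * always NUL-terminated, so this can never overrun.
	 */
	pr_info("victim: %d (%s)\n", task_pid_nr(p), p->comm);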


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-22  5:33               ` Tetsuo Handa
@ 2015-09-22 23:32                 ` David Rientjes
  2015-09-23 12:03                   ` Kyle Walker
  0 siblings, 1 reply; 109+ messages in thread
From: David Rientjes @ 2015-09-22 23:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

On Tue, 22 Sep 2015, Tetsuo Handa wrote:

> David Rientjes wrote:
> > Your proposal, which I mostly agree with, tries to kill additional 
> > processes so that they allocate and drop the lock that the original victim 
> > depends on.  My approach, from 
> > http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but 
> > without the killing.  It's unnecessary to kill every process on the system 
> > that is depending on the same lock, and we can't know which processes are 
> > stalling on that lock and which are not.
> 
> Would you try your approach with below program?
> (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)
> 
> ---------- oom-depleter3.c start ----------
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> 
> static int zero_fd = EOF;
> static char *buf = NULL;
> static unsigned long size = 0;
> 
> static int dummy(void *unused)
> {
> 	static char buffer[4096] = { };
> 	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
> 	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer) &&
> 	       fsync(fd) == 0);
> 	return 0;
> }
> 
> static int trigger(void *unused)
> {
> 	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
> 	return 0;
> }
> 
> int main(int argc, char *argv[])
> {
> 	unsigned long i;
> 	zero_fd = open("/dev/zero", O_RDONLY);
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	/*
> 	 * Create many child threads in order to enlarge time lag between
> 	 * the OOM killer sets TIF_MEMDIE to thread group leader and
> 	 * the OOM killer sends SIGKILL to that thread.
> 	 */
> 	for (i = 0; i < 1000; i++) {
> 		clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
> 		      NULL);
> 	}
> 	/* Let a child thread trigger the OOM killer. */
> 	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
> 	/* Deplete all memory reserves using the time lag. */
> 	for (i = size; i; i -= 4096)
> 		buf[i - 1] = 1;
> 	return * (char *) NULL; /* Kill all threads. */
> }
> ---------- oom-depleter3.c end ----------
> 
> uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
> shows that the memory reserves completely depleted and
> uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
> shows that the memory reserves were not used at all.
> Is this result what you expected?
> 

What are the results when the kernel isn't patched at all?  The trade-off 
being made is that we want to attempt to make forward progress when there 
is an excessive stall in an oom victim making its exit rather than 
livelock the system forever waiting for memory that can never be 
allocated.

I struggle to understand how the approach of randomly continuing to kill 
more and more processes in the hope that it slows down usage of memory 
reserves or that we get lucky is better.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-22 23:32                 ` David Rientjes
@ 2015-09-23 12:03                   ` Kyle Walker
  2015-09-24 11:50                     ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: Kyle Walker @ 2015-09-23 12:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, mhocko, Christoph Lameter, Oleg Nesterov, akpm,
	Johannes Weiner, vdavydov, linux-mm, linux-kernel,
	Stanislav Kozina

On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes <rientjes@google.com> wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback.

I agree. Compared with treating TASK_UNINTERRUPTIBLE tasks as unkillable
and omitting them from the oom selection process, continuing the
carnage is likely to produce even less predictable results. At this
time, I believe Oleg's solution of zapping the process's memory while
it sleeps with the fatal signal en route is ideal.

Kyle Walker


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-22 16:06                 ` Oleg Nesterov
  2015-09-22 23:04                   ` David Rientjes
@ 2015-09-23 20:59                   ` Michal Hocko
  2015-09-24 21:15                     ` David Rientjes
  2015-10-06 18:45                     ` Oleg Nesterov
  1 sibling, 2 replies; 109+ messages in thread
From: Michal Hocko @ 2015-09-23 20:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Tue 22-09-15 18:06:08, Oleg Nesterov wrote:
> On 09/21, Michal Hocko wrote:
> >
> > On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
[...]
> > > We probably need a
> > > dedicated kernel thread, but I still think (although I am not sure) that
> > > initial change can use workqueue. In the likely case system_unbound_wq pool
> > > should have an idle thread, if not - OK, this change won't help in this
> > > case. This is minor.
> >
> > The point is that the implementation should be robust from the very
> > beginning.
> 
> OK, let it be a kthread from the very beginning, I won't argue. This
> is really minor compared to other problems.

I am still not sure how you want to implement that kernel thread but I
am quite skeptical it would be very useful because all the current
allocations which end up in the OOM killer path cannot simply back off
and drop the locks with the current allocator semantic.  So they will
be sitting on top of unknown pile of locks whether you do an additional
reclaim (unmap the anon memory) in the direct OOM context or looping
in the allocator and waiting for kthread/workqueue to do its work. The
only argument that I can see is the stack usage but I haven't seen stack
overflows in the OOM path AFAIR.

> > > > So I think we probably need to do this in the OOM killer context (with
> > > > try_lock)
> > >
> > > Yes we should try to do this in the OOM killer context, and in this case
> > > (of course) we need trylock. Let me quote my previous email:
> > >
> > > 	And we want to avoid using workqueues when the caller can do this
> > > 	directly. And in this case we certainly need trylock. But this needs
> > > 	some refactoring: we do not want to do this under oom_lock,
> >
> > Why do you think oom_lock would be a big deal?
> 
> I don't really know... This doesn't look sane to me, but perhaps this
> is just because I don't understand this code enough.

Well, one of the purposes of this lock is to throttle all the concurrent
allocators so they do not step on each other's toes, because only one task
is allowed to get killed currently. So they wouldn't be of any use anyway.

> And note that the caller can held other locks we do not even know about.
> Most probably we should not deadlock, at least if we only unmap the anon
> pages, but still this doesn't look safe.

The unmapper cannot fall back to reclaim and/or trigger the OOM so
we should indeed be very careful and mark the allocation context
appropriately. I can remember mmu_gather but it is only doing
opportunistic allocation AFAIR.

> But I agree, this probably needs more discussion.
> 
> > Address space of the
> > victim might be really large but we can back off after a batch of
> > unmapped pages.
> 
> Hmm. If we already have mmap_sem and started zap_page_range() then
> I do not think it makes sense to stop until we free everything we can.

Zapping a huge address space can take quite some time, and we really do
not have to free it all on behalf of the killer: once enough memory is
freed to allow further progress, the rest can be done by the victim. If
one batch doesn't seem sufficient then another retry can continue.

I do not think that a limited scan would make the implementation more
complicated but I will leave the decision to you of course.
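
A limited scan along those lines might look like this (sketch only; the
16M budget is an arbitrary example, and a later attempt would simply
rescan from the beginning):

	struct vm_area_struct *vma;
	unsigned long budget = SZ_16M;	/* hypothetical per-attempt batch */

	for (vma = mm->mmap; vma && budget; vma = vma->vm_next) {
		unsigned long len;

		if (vma->vm_file)	/* private anon memory only */
			continue;
		len = min(vma->vm_end - vma->vm_start, budget);
		zap_page_range(vma, vma->vm_start, len, NULL);
		budget -= len;
	}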

> > I definitely agree with the simplicity for the first iteration. That
> > means only unmap private exclusive pages and release at most few megs of
> > them.
> 
> See above, I am not sure this makes sense. And in any case this will
> complicate the initial changes, not simplify.
> 
> > I am still not sure about some details, e.g. futex sitting in such
> > a memory. Wouldn't threads blow up when they see an unmapped futex page,
> > try to page it in and it would be in an uninitialized state? Maybe this
> > is safe
> 
> But this must be safe.
> 
> We do not care about userspace (assuming that all mm users have a
> pending SIGKILL).
> 
> If this can (say) crash the kernel somehow, then we have a bug which
> should be fixed. Simply because userspace can exploit this bug doing
> MADV_DONTNEED from another thread or CLONE_VM process.

OK, that makes perfect sense. I should have realized that an in-kernel
state for a futex must not be controlled from the userspace. So you are
right and futex shouldn't be a big deal.

> Finally. Whatever we do, we need to change oom_kill_process() first,
> and I think we should do this regardless. The "Kill all user processes
> sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> I'll try to make some patches tomorrow if I have time...

That would be appreciated. I do not like that part either. At least we
shouldn't go over the whole list when we have a good chance that the mm
is not shared with other processes.

> But. Can't we just remove another ->oom_score_adj check when we try
> to kill all mm users (the last for_each_process loop)? If yes, this
> all can be simplified.
> 
> I guess we can't and it's a pity. Because it looks simply pointless
> to not kill all mm users. This just means the select_bad_process()
> picked the wrong task.

Yes I am not really sure why oom_score_adj is not per-mm and we are
doing that per signal struct to be honest. It doesn't make much sense as
the mm_struct is the primary source of information for the oom victim
selection. And the fact that mm might be shared without sharing signals
makes it doubly a reason to have it in mm.

It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
value from task_struct to mm_struct") but it was later reverted by
0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
with the reasoning there because vfork is documented to have undefined
behavior
"
       if the process created by vfork() either modifies any data other
       than a variable of type pid_t used to store the return value
       from vfork(), or returns from the function in which vfork() was
       called, or calls any other function before successfully calling
       _exit(2) or one of the exec(3) family of functions.
"
Maybe we can revisit this... It would make the whole semantics much more
straightforward. The current situation, where you can kill a task which
might share the mm with an OOM-unkillable task, is clearly suboptimal and
confusing.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
  2015-09-23 12:03                   ` Kyle Walker
@ 2015-09-24 11:50                     ` Tetsuo Handa
  0 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-24 11:50 UTC (permalink / raw)
  To: kwalker, rientjes
  Cc: mhocko, cl, oleg, akpm, hannes, vdavydov, linux-mm, linux-kernel,
	skozina

Kyle Walker wrote:
> I agree. Compared with treating TASK_UNINTERRUPTIBLE tasks as unkillable
> and omitting them from the oom selection process, continuing the
> carnage is likely to produce even less predictable results. At this
> time, I believe Oleg's solution of zapping the process's memory while
> it sleeps with the fatal signal en route is ideal.

I cannot help thinking about the worst case.

(1) If memory zapping code successfully reclaimed some memory from
    the mm struct used by the OOM victim, what guarantees that the
    reclaimed memory is used by OOM victims (and processes which
    are blocking OOM victims)?

    David's "global access to memory reserves" allows a local unprivileged
    user to deplete memory reserves; it could allow that user to deplete the
    reclaimed memory as well.

    I think that my "Favor kthread and dying threads over normal threads"
    ( http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )
    would allow the reclaimed memory to be used by OOM victims and kernel
    threads if the reclaimed memory is added to the free list bit by bit,
    in a way that the watermark remains low enough to prevent normal threads
    from allocating the reclaimed memory.

    But my patch still fails if normal threads are blocking the OOM
    victims or unrelated kernel threads consume the reclaimed memory.

(2) If memory zapping code failed to reclaim enough memory from the mm
    struct used by the OOM victim, what mechanism can solve the OOM
    stalls?

    Some administrator sets /proc/pid/oom_score_adj to -1000 to most of
    enterprise processes (e.g. java) and as a consequence only trivial
    processes (e.g. grep / sed) are candidates for OOM victims.

    Moreover, a local unprivileged user can easily fool the OOM killer using
    decoy tasks (which consume little memory and whose /proc/pid/oom_score_adj is
    set to 999).

(3) If memory zapping code reclaimed no memory due to ->mmap_sem contention,
    what mechanism can solve the OOM stalls?

    While we don't allocate much memory with ->mmap_sem held for writing,
    the task which is holding ->mmap_sem for writing can be chosen as
    one of the OOM victims. If such a task receives SIGKILL but TIF_MEMDIE is
    not set, it can form an OOM livelock unless all memory allocations with
    ->mmap_sem held for writing are __GFP_FS allocations and that task can
    reach out_of_memory() (i.e. not blocked by unexpected factors such as
    waiting for filesystem's writeback).

After all, I think we have to consider what to do if the memory zapping
code fails.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-23 20:59                   ` Michal Hocko
@ 2015-09-24 21:15                     ` David Rientjes
  2015-09-25  9:35                       ` Michal Hocko
  2015-10-06 18:45                     ` Oleg Nesterov
  1 sibling, 1 reply; 109+ messages in thread
From: David Rientjes @ 2015-09-24 21:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Wed, 23 Sep 2015, Michal Hocko wrote:

> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very much useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic.  So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.
> 

Which locks are you specifically interested in?  We have already discussed 
the usefulness of killing all threads on the system sharing the same ->mm, 
meaning all threads that are either holding or want to hold mm->mmap_sem 
will be able to allocate into memory reserves.  Any allocator holding 
down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
(Are you concerned about MAP_POPULATE?)

> > Finally. Whatever we do, we need to change oom_kill_process() first,
> > and I think we should do this regardless. The "Kill all user processes
> > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > I'll try to make some patches tomorrow if I have time...
> 
> That would be appreciated. I do not like that part either. At least we
> shouldn't go over the whole list when we have a good chance that the mm
> is not shared with other processes.
> 

Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
it's the reason the code exists.  Any optimizations to that is certainly 
welcome, but we definitely need to send SIGKILL to all threads sharing the 
mm to make forward progress, otherwise we are going back to pre-2008 
livelocks.

> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest. It doesn't make much sense as
> the mm_struct is the primary source of information for the oom victim
> selection. And the fact that mm might be shared without sharing signals
> makes it doubly a reason to have it in mm.
> 
> It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> value from task_struct to mm_struct") but it was later reverted by
> 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> with the reasoning there because vfork is documented to have undefined
> behavior
> "
>        if the process created by vfork() either modifies any data other
>        than a variable of type pid_t used to store the return value
>        from vfork(), or returns from the function in which vfork() was
>        called, or calls any other function before successfully calling
>        _exit(2) or one of the exec(3) family of functions.
> "
> Maybe we can revisit this... It would make the whole semantic much more
> straightforward. The current situation when you kill a task which might
> share the mm with OOM unkillable task is clearly suboptimal and
> confusing.
> 

How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj 
value from task_struct to signal_struct")?  We also must appreciate the 
real-world use case of an oom-disabled process doing fork(), setting 
/proc/child/oom_score_adj to non-disabled, and exec().


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-24 21:15                     ` David Rientjes
@ 2015-09-25  9:35                       ` Michal Hocko
  2015-09-25 16:14                         ` Tetsuo Handa
  2015-09-28 22:24                         ` can't oom-kill zap the victim's memory? David Rientjes
  0 siblings, 2 replies; 109+ messages in thread
From: Michal Hocko @ 2015-09-25  9:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Thu 24-09-15 14:15:34, David Rientjes wrote:
> On Wed, 23 Sep 2015, Michal Hocko wrote:
> 
> > I am still not sure how you want to implement that kernel thread but I
> > am quite skeptical it would be very useful because all the current
> > allocations which end up in the OOM killer path cannot simply back off
> > and drop the locks with the current allocator semantic.  So they will
> > be sitting on top of unknown pile of locks whether you do an additional
> > reclaim (unmap the anon memory) in the direct OOM context or looping
> > in the allocator and waiting for kthread/workqueue to do its work. The
> > only argument that I can see is the stack usage but I haven't seen stack
> > overflows in the OOM path AFAIR.
> > 
> 
> Which locks are you specifically interested in?

Any locks they were holding before they entered the page allocator (e.g.
i_mutex is the easiest one to trigger from the userspace but mmap_sem
might be involved as well because we are doing kmalloc(GFP_KERNEL) with
mmap_sem held for write). Those would be locked until the page allocator
returns, which with the current semantic might be _never_.

> We have already discussed 
> the usefulness of killing all threads on the system sharing the same ->mm, 
> meaning all threads that are either holding or want to hold mm->mmap_sem 
> will be able to allocate into memory reserves.  Any allocator holding 
> down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
> (Are you concerned about MAP_POPULATE?)

I am not sure I understand. We would have to fail the request in order
for the context which requested the memory to drop the lock. Are we
talking about the same thing here?

The point I've tried to make is that oom unmapper running in a detached
context (e.g. kernel thread) vs. directly in the oom context doesn't
make any difference wrt. lock because the holders of the lock would loop
inside the allocator anyway because we do not fail small allocations.

> > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > and I think we should do this regardless. The "Kill all user processes
> > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > I'll try to make some patches tomorrow if I have time...
> > 
> > That would be appreciated. I do not like that part either. At least we
> > shouldn't go over the whole list when we have a good chance that the mm
> > is not shared with other processes.
> > 
> 
> Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> it's the reason the code exists.  Any optimizations to that is certainly 
> welcome, but we definitely need to send SIGKILL to all threads sharing the 
> mm to make forward progress, otherwise we are going back to pre-2008 
> livelocks.

Yes but mm is not shared between processes most of the time. CLONE_VM
without CLONE_THREAD is more a corner case yet we have to crawl all the
task_structs for _each_ OOM killer invocation. Yes this is an extreme
slow path but still might take quite some unnecessary time.
 
> > Yes I am not really sure why oom_score_adj is not per-mm and we are
> > doing that per signal struct to be honest. It doesn't make much sense as
> > the mm_struct is the primary source of information for the oom victim
> > selection. And the fact that mm might be shared without sharing signals
> > makes it doubly a reason to have it in mm.
> > 
> > It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> > value from task_struct to mm_struct") but it was later reverted by
> > 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> > with the reasoning there because vfork is documented to have undefined
> > behavior
> > "
> >        if the process created by vfork() either modifies any data other
> >        than a variable of type pid_t used to store the return value
> >        from vfork(), or returns from the function in which vfork() was
> >        called, or calls any other function before successfully calling
> >        _exit(2) or one of the exec(3) family of functions.
> > "
> > Maybe we can revisit this... It would make the whole semantic much more
> > straightforward. The current situation when you kill a task which might
> > share the mm with OOM unkillable task is clearly suboptimal and
> > confusing.
> > 
> 
> How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj 
> value from task_struct to signal_struct")?

If the oom_score_adj is per mm then all the threads and processes which
share the mm would share the same value. So that would naturally extend
the per-process semantics to all tasks sharing the address space, in line
with the above commit.

> We also must appreciate the 
> real-world usecase for an oom disabled process doing fork(), setting 
> /proc/child/oom_score_adj to non-disabled, and exec().

I guess you meant vfork mentioned in 0753ba01e126. I am not sure this
is a valid use of set_oom_adj. As the documentation explicitly states
this leads to undefined behavior. But if we really want to support
this particular case, and I can see a reason we would, then we can work
around it and store the oom_score_adj temporarily in task_struct and
move it to mm_struct after exec. Not nice for sure, but such usage is a
clear violation of the vfork semantics.

The per-mm oom_score_adj has a better semantic but if there is a general
consensus that an inconsistent value among processes sharing the same mm
is a configuration bug I can live with that. It surely makes the code
uglier and more subtle, though.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-25  9:35                       ` Michal Hocko
@ 2015-09-25 16:14                         ` Tetsuo Handa
  2015-09-28 16:18                           ` Tetsuo Handa
  2015-09-28 22:24                         ` can't oom-kill zap the victim's memory? David Rientjes
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-25 16:14 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Michal Hocko wrote:
> On Thu 24-09-15 14:15:34, David Rientjes wrote:
> > > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > > and I think we should do this regardless. The "Kill all user processes
> > > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > > I'll try to make some patches tomorrow if I have time...
> > > 
> > > That would be appreciated. I do not like that part either. At least we
> > > shouldn't go over the whole list when we have a good chance that the mm
> > > is not shared with other processes.
> > > 
> > 
> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> > it's the reason the code exists.  Any optimizations to that is certainly 
> > welcome, but we definitely need to send SIGKILL to all threads sharing the 
> > mm to make forward progress, otherwise we are going back to pre-2008 
> > livelocks.
> 
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessary time.

Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
Isn't there a possibility of hitting livelocks at

        /*
         * If current has a pending SIGKILL or is exiting, then automatically
         * select it.  The goal is to allow it to allocate so that it may
         * quickly exit and free its memory.
         *
         * But don't select if current has already released its mm and cleared
         * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
         */
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

if the current thread receives SIGKILL just before reaching here, given that
we don't send SIGKILL to all threads sharing the mm?

Hopefully the current thread is not holding inode->i_mutex, because reaching
here (i.e. calling out_of_memory()) suggests that we are doing a GFP_KERNEL
allocation. But it could be a !__GFP_FS && __GFP_NOFAIL allocation, or
different locks could be contended by another thread sharing the mm.

I don't like "That thread will now get access to memory reserves since it
has a pending fatal signal." line in comments for the "Kill all user
processes sharing victim->mm" logic. That thread won't get access to memory
reserves unless that thread can call out_of_memory() (i.e. doing __GFP_FS or
__GFP_NOFAIL allocations). Since I can observe that such a thread may be doing
a !__GFP_FS allocation, I think that this comment needs to be updated.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-25 16:14                         ` Tetsuo Handa
@ 2015-09-28 16:18                           ` Tetsuo Handa
  2015-09-28 22:28                             ` David Rientjes
  2015-10-02 12:36                             ` Michal Hocko
  0 siblings, 2 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-28 16:18 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Michal Hocko wrote:
> The point I've tried to make is that oom unmapper running in a detached
> context (e.g. kernel thread) vs. directly in the oom context doesn't
> make any difference wrt. lock because the holders of the lock would loop
> inside the allocator anyway because we do not fail small allocations.

We tried to allow small allocations to fail. It resulted in an unstable
system with obscure bugs.

We tried to allow small !__GFP_FS allocations to fail. That attempt was
defeated by allocations which are effectively __GFP_NOFAIL.

We are now trying to allow zapping the OOM victim's mm. Michal is already
skeptical about this approach due to lock dependencies.

We have already spent 9 months on this OOM livelock. No silver bullet yet.
The proposed approaches are too drastic to backport for existing users.
I think we are out of bullets.

Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) in most
of the callsites, a timeout-based workaround will be the only bullet we can use.

Michal's panic_on_oom_timeout and David's "global access to memory reserves"
will be acceptable for some users if these approaches are used as opt-in.
Likewise, my memdie_task_skip_secs / memdie_task_panic_secs will be
acceptable for those who want to retry a bit more rather than panic on
accidental livelock if this approach is used as opt-in.
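
As an illustration, such an opt-in timeout could sit in the victim-scan
path roughly as follows (memdie_task_skip_secs and the ->memdie_start
timestamp are hypothetical: the knob proposed above, and a jiffies value
which mark_oom_victim() would have to record):

	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (memdie_task_skip_secs &&
		    time_after(jiffies, task->memdie_start +
					memdie_task_skip_secs * HZ))
			/* stuck victim: look for another one */
			return OOM_SCAN_CONTINUE;
		/* default behavior: keep waiting for the victim */
		return OOM_SCAN_ABORT;
	}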

Tetsuo Handa wrote:
> Excuse me, but thinking about CLONE_VM without CLONE_THREAD case...
> Isn't there possibility of hitting livelocks at
> 
>         /*
>          * If current has a pending SIGKILL or is exiting, then automatically
>          * select it.  The goal is to allow it to allocate so that it may
>          * quickly exit and free its memory.
>          *
>          * But don't select if current has already released its mm and cleared
>          * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>          */
>         if (current->mm &&
>             (fatal_signal_pending(current) || task_will_free_mem(current))) {
>                 mark_oom_victim(current);
>                 return true;
>         }
> 
> if the current thread receives SIGKILL just before reaching here, given that
> we don't send SIGKILL to all threads sharing the mm?

Seems that CLONE_VM without CLONE_THREAD is irrelevant here.
We have sequences like

  Do a GFP_KERNEL allocation.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release a lock.

where an example is seen in VFS operations which receive a pathname from
user space using getname() and then call VFS functions; the filesystem
code takes locks which can contend with other threads.

------------------------------------------------------------
diff --git a/fs/namei.c b/fs/namei.c
index d68c21f..d51c333 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4005,6 +4005,8 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
        if (error)
                return error;

+       if (fatal_signal_pending(current))
+               printk(KERN_INFO "Calling symlink with SIGKILL pending\n");
        error = dir->i_op->symlink(dir, dentry, oldname);
        if (!error)
                fsnotify_create(dir, dentry);
@@ -4021,6 +4023,10 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
        struct path path;
        unsigned int lookup_flags = 0;

+       if (!strcmp(current->comm, "a.out")) {
+               printk(KERN_INFO "Sending SIGKILL to current thread\n");
+               do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+       }
        from = getname(oldname);
        if (IS_ERR(from))
                return PTR_ERR(from);
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 996481e..2b6faa5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -240,6 +240,8 @@ xfs_symlink(
        if (error)
                goto out_trans_cancel;

+       if (fatal_signal_pending(current))
+               printk(KERN_INFO "Calling xfs_ilock() with SIGKILL pending\n");
        xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
                      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
        unlock_dp_on_error = true;
------------------------------------------------------------

[  119.534976] Sending SIGKILL to current thread
[  119.535898] Calling symlink with SIGKILL pending
[  119.536870] Calling xfs_ilock() with SIGKILL pending

Any program can potentially hit this silent livelock. We can't predict
what locks the OOM victim threads will depend on after TIF_MEMDIE was
set by the OOM killer. Therefore, I think that the fact that TIF_MEMDIE
disables the OOM killer indefinitely is one of the possible causes of
silent hangup troubles.

Michal Hocko wrote:
> I really hate to do "easy" things now just to feel better about
> particular case which will kick us back little bit later. And from my
> own experience I can tell you that a more non-deterministic OOM behavior
> is thing people complain about.

I believe that not waiting indefinitely for a TIF_MEMDIE thread is the first
choice we can propose that people try. From my own experience I can tell you
that some customers are really sensitive about bugs which halt their systems
(e.g. https://access.redhat.com/solutions/68466 ).
An opt-in version of the TIF_MEMDIE timeout should be acceptable for people
who prefer avoiding silent hangups over non-deterministic OOM behavior, once
the real behavior of the current memory allocator has been explained to them.


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-25  9:35                       ` Michal Hocko
  2015-09-25 16:14                         ` Tetsuo Handa
@ 2015-09-28 22:24                         ` David Rientjes
  2015-09-29  7:57                           ` Tetsuo Handa
  2015-10-01 14:48                           ` Michal Hocko
  1 sibling, 2 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-28 22:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Fri, 25 Sep 2015, Michal Hocko wrote:

> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic.  So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> > > 
> > 
> > Which locks are you specifically interested in?
> 
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.
> 

I agree that i_mutex seems to be one of the most common offenders.  
However, I'm not sure I understand why holding it while trying to allocate 
infinitely for an order-0 allocation is problematic wrt the proposed 
kthread.  The kthread itself need only take mmap_sem for read.  If all 
threads sharing the mm with a victim have been SIGKILL'd, they should get 
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
deplete memory reserves without properly checking for 
fatal_signal_pending().
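
That check amounts to the pattern below, a sketch in the spirit of the
__get_user_pages() handling (pin_pages_killable() is a made-up name):

	static long pin_pages_killable(unsigned long nr_pages)
	{
		long done = 0;

		while (nr_pages--) {
			/* stop consuming reserves once oom-killed */
			if (fatal_signal_pending(current))
				return done ? done : -ERESTARTSYS;
			/* ... fault in and pin one page ... */
			done++;
		}
		return done;
	}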

> > We have already discussed 
> > the usefulness of killing all threads on the system sharing the same ->mm, 
> > meaning all threads that are either holding or want to hold mm->mmap_sem 
> > will be able to allocate into memory reserves.  Any allocator holding 
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
> > (Are you concerned about MAP_POPULATE?)
> 
> I am not sure I understand. We would have to fail the request in order
> the context which requested the memory could drop the lock. Are we
> talking about the same thing here?
> 

Not fail the request, they should be able to allocate from memory reserves 
when TIF_MEMDIE gets set.  This would require that threads in all gfp 
contexts are able to get TIF_MEMDIE set without an explicit call to 
out_of_memory() for !__GFP_FS.

> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> > it's the reason the code exists.  Any optimizations to that is certainly 
> > welcome, but we definitely need to send SIGKILL to all threads sharing the 
> > mm to make forward progress, otherwise we are going back to pre-2008 
> > livelocks.
> 
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessary time.
>  

It must solve the issue you describe, killing other processes that share 
the ->mm, otherwise we have mm->mmap_sem livelock.  We are not concerned 
about iterating over all task_structs in the oom killer as a pain point; 
such users should already be using oom_kill_allocating_task, which is why 
it was introduced.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-28 16:18                           ` Tetsuo Handa
@ 2015-09-28 22:28                             ` David Rientjes
  2015-10-02 12:36                             ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: David Rientjes @ 2015-09-28 22:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> > The point I've tried to make is that oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
> 
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.
> 

These are helpful to identify regardless of the outcome of this 
discussion.  I'm not sure where the best place to report them would be, 
or whether it's even feasible to dig through looking for possibilities, but 
I think it would be interesting to see which callers are relying on the 
internal page allocator implementation to work properly, since it may 
uncover bugs that would occur later if it were changed.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-28 22:24                         ` can't oom-kill zap the victim's memory? David Rientjes
@ 2015-09-29  7:57                           ` Tetsuo Handa
  2015-09-29 22:56                             ` David Rientjes
  2015-10-01 14:48                           ` Michal Hocko
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-29  7:57 UTC (permalink / raw)
  To: rientjes, mhocko
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic.  So they will
> > > > be sitting on top of unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > > 
> > > 
> > > Which locks are you specifically interested in?
> > 
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> > 
> 
> I agree that i_mutex seems to be one of the most common offenders.  
> However, I'm not sure I understand why holding it while trying to allocate 
> infinitely for an order-0 allocation is problematic wrt the proposed 
> kthread.  The kthread itself need only take mmap_sem for read.  If all 
> threads sharing the mm with a victim have been SIGKILL'd, they should get 
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
> deplete memory reserves without properly checking for 
> fatal_signal_pending().

Is the story that simple? I think there are factors which disturb memory
allocation with mmap_sem held for writing.

  down_write(&mm->mmap_sem);
  kmalloc(GFP_KERNEL);
  up_write(&mm->mmap_sem);

can involve locks inside __alloc_pages_slowpath().

Say, there are three userspace tasks named P1, P2T1, P2T2 and
one kernel thread named KT1. Only P2T1 and P2T2 share the same mm.
KT1 is a kernel thread for fs writeback (maybe kswapd?).
I think sequence shown below is possible.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2T1 enters into kernel mode and calls kmalloc().

(4) P2T1 arrives at __alloc_pages_may_oom() because there was no
    reclaimable memory. (Memory allocated by P1 is not reclaimable
    as of this moment.)

(5) P1 dirties memory allocated for buffered write.

(6) P2T2 enters into kernel mode and calls kmalloc() with
    mmap_sem held for writing.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.

(10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
     TIF_MEMDIE on both P2T1 and P2T2.

(11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
     held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
     P2T2 cannot release memory because memory unmapping code cannot hold
     mmap_sem for reading, KT1 waits forever.... OOM livelock completed!

I think sequence shown below is also possible. Say, there are three
userspace tasks named P1, P2, P3 and one kernel thread named KT1.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2 enters into kernel mode and holds mmap_sem for writing.

(4) P3 enters into kernel mode and calls kmalloc().

(5) P3 arrives at __alloc_pages_may_oom() because there was no
    reclaimable memory. (Memory allocated by P1 is not reclaimable
    as of this moment.)

(6) P1 dirties memory allocated for buffered write.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
    held by KT1.

(10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
     TIF_MEMDIE on P2.

(11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
     held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
     release memory because memory unmapping code cannot hold mmap_sem for
     reading, KT1 waits forever.... OOM livelock completed!

So, allowing all OOM victim threads to use memory reserves does not guarantee
that a thread which held mmap_sem for writing will make forward progress.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-29  7:57                           ` Tetsuo Handa
@ 2015-09-29 22:56                             ` David Rientjes
  2015-09-30  4:25                               ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: David Rientjes @ 2015-09-29 22:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> Is the story that simple? I think there are factors which disturb memory
> allocation with mmap_sem held for writing.
> 
>   down_write(&mm->mmap_sem);
>   kmalloc(GFP_KERNEL);
>   up_write(&mm->mmap_sem);
> 
> can involve locks inside __alloc_pages_slowpath().
> 
> Say, there are three userspace tasks named P1, P2T1, P2T2 and
> one kernel thread named KT1. Only P2T1 and P2T2 share the same mm.
> KT1 is a kernel thread for fs writeback (maybe kswapd?).
> I think sequence shown below is possible.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2T1 enters into kernel mode and calls kmalloc().
> 
> (4) P2T1 arrives at __alloc_pages_may_oom() because there was no
>     reclaimable memory. (Memory allocated by P1 is not reclaimable
>     as of this moment.)
> 
> (5) P1 dirties memory allocated for buffered write.
> 
> (6) P2T2 enters into kernel mode and calls kmalloc() with
>     mmap_sem held for writing.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.
> 
> (10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
>      TIF_MEMDIE on both P2T1 and P2T2.
> 
> (11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>      held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
>      P2T2 cannot release memory because the memory unmapping code cannot
>      take mmap_sem for reading, KT1 waits forever.... OOM livelock completed!
> 
> I think the sequence shown below is also possible. Say, there are three
> userspace tasks named P1, P2, P3 and one kernel thread named KT1.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2 enters into kernel mode and holds mmap_sem for writing.
> 
> (4) P3 enters into kernel mode and calls kmalloc().
> 
> (5) P3 arrives at __alloc_pages_may_oom() because there was no
>     reclaimable memory. (Memory allocated by P1 is not reclaimable
>     as of this moment.)
> 
> (6) P1 dirties memory allocated for buffered write.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
>     held by KT1.
> 
> (10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
>      TIF_MEMDIE on P2.
> 
> (11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>      held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
>      release memory because the memory unmapping code cannot take mmap_sem
>      for reading, KT1 waits forever.... OOM livelock completed!
> 
> So, allowing all OOM victim threads to use memory reserves does not guarantee
> that a thread which holds mmap_sem for writing will make forward progress.
> 

Thank you for writing this all out; it definitely helps to understand the 
concerns.

This, in my understanding, is the same scenario that requires not only oom 
victims to be able to access memory reserves, but also any thread after an 
oom victim has failed to make a timely exit.

I point out mm->mmap_sem as a special case because we have had fixes in 
the past, such as the special fatal_signal_pending() handling in 
__get_user_pages(), that try to ensure forward progress since we know that 
we need exclusive mm->mmap_sem for the victim to make an exit.
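
For reference, that bail-out pattern looks roughly like this (a simplified
sketch of the __get_user_pages() loop, not the exact mainline code;
fault_in_next_page() is a hypothetical stand-in for the real fault logic):

static long gup_sketch(unsigned long nr_pages, struct page **pages)
{
	long i = 0;

	while (nr_pages--) {
		/* Let an OOM victim abort a long fault-in so it can
		 * return to its caller, die, and release mmap_sem. */
		if (fatal_signal_pending(current))
			return i ? i : -ERESTARTSYS;
		cond_resched();
		pages[i++] = fault_in_next_page();	/* hypothetical */
	}
	return i;
}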

I think both of your illustrations show why it is not helpful to kill 
additional processes after a time period has elapsed and a victim has 
failed to exit.  In both of your scenarios, it would require that KT1 be 
killed to allow forward progress and we know that's not possible.

Perhaps this is an argument that we need to provide access to memory 
reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
but I would wait to make that extension until we see it in practice.

Killing all mm->mmap_sem threads certainly isn't meant to solve all oom 
killer livelocks, as you show.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-29 22:56                             ` David Rientjes
@ 2015-09-30  4:25                               ` Tetsuo Handa
  2015-09-30 10:21                                 ` Tetsuo Handa
  2015-09-30 21:11                                 ` David Rientjes
  0 siblings, 2 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-30  4:25 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

David Rientjes wrote:
> I think both of your illustrations show why it is not helpful to kill 
> additional processes after a time period has elapsed and a victim has 
> failed to exit.  In both of your scenarios, it would require that KT1 be 
> killed to allow forward progress and we know that's not possible.

My illustrations show why it is helpful to kill additional processes after
a time period has elapsed and a victim has failed to exit. We don't need
to kill KT1 if we combine the memory unmapping approach with the timeout
based OOM killing approach.

Simply choosing more OOM victims (processes which do not share another OOM
victim's mm) based on a timeout does not by itself guarantee that the other
OOM victims can exit. But if timeout based OOM killing is used together with
the memory unmapping approach, the possibility that OOM victims can exit
increases significantly, because the only case where the memory unmapping
approach gets stuck is when mm->mmap_sem is held for writing (which should
be unlikely).

If we choose only 1 OOM victim, the possibility of hitting this memory
unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
possibility becomes (almost) 0%. And if we still hit this livelock even
after choosing many OOM victims, it is time to call panic().

(Well, do we need to change __alloc_pages_slowpath() so that OOM victims do
not enter direct reclaim paths, in order to avoid being blocked by unkillable
fs locks?)

> 
> Perhaps this is an argument that we need to provide access to memory 
> reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
> but I would wait to make that extension until we see it in practice.

I think that GFP_ATOMIC allocations already access memory reserves via
ALLOC_HIGH priority.

> 
> Killing all mm->mmap_sem threads certainly isn't meant to solve all oom 
> killer livelocks, as you show.
> 

Good.

I'm not denying the memory unmapping approach. I'm just pointing out that
using the memory unmapping approach alone still leaves room for hangups.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-30  4:25                               ` Tetsuo Handa
@ 2015-09-30 10:21                                 ` Tetsuo Handa
  2015-09-30 21:11                                 ` David Rientjes
  1 sibling, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-09-30 10:21 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> (Well, do we need to change __alloc_pages_slowpath() so that OOM victims do
> not enter direct reclaim paths, in order to avoid being blocked by unkillable
> fs locks?)

I'm not familiar with how fs writeback manages memory. I feel I'm missing
something. Can somebody please re-check whether my illustrations are really
possible?

If they are really possible, I think we have yet another silent hang up
sequence. Say, there are one userspace task named P1 and one kernel thread
named KT1.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P1 dirties memory allocated for buffered write.

(4) P1 leaves kernel mode.

(5) KT1 finds dirtied memory.

(6) KT1 holds fs's unkillable lock for fs writeback.

(7) KT1 tries to allocate memory for fs writeback, but fails to allocate
    because the watermark is low. KT1 cannot call out_of_memory() because
    of the !__GFP_FS allocation (see the sketch below).

(8) P1 enters into kernel mode.

(9) P1 calls kmalloc(GFP_KERNEL) and is blocked at unkillable lock for fs
    writeback held by KT1.

How do we allow KT1 to make forward progress? Are we giving access to
memory reserves (e.g. ALLOC_NO_WATERMARKS priority) to KT1?
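
For what it's worth, the constraint in step (7) comes from a check of
roughly this shape in the allocator's OOM path (a simplified sketch of the
logic around __alloc_pages_may_oom(), not the exact mainline code):

/*
 * A !__GFP_FS allocation must not invoke the OOM killer, because reclaim
 * was not allowed to touch filesystem state; the caller just keeps
 * looping inside the allocator instead.
 */
static bool may_invoke_oom_killer(gfp_t gfp_mask, unsigned int order)
{
	if (!(gfp_mask & __GFP_FS))
		return false;	/* KT1's case: loop, no OOM kill */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return false;	/* OOM killing won't help high orders */
	return true;
}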

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-30  4:25                               ` Tetsuo Handa
  2015-09-30 10:21                                 ` Tetsuo Handa
@ 2015-09-30 21:11                                 ` David Rientjes
  2015-10-01 12:13                                   ` Tetsuo Handa
  1 sibling, 1 reply; 109+ messages in thread
From: David Rientjes @ 2015-09-30 21:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, oleg, kwalker, cl, Andrew Morton, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed, 30 Sep 2015, Tetsuo Handa wrote:

> If we choose only 1 OOM victim, the possibility of hitting this memory
> unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
> possibility becomes (almost) 0%. And if we still hit this livelock even
> after choosing many OOM victims, it is time to call panic().
> 

Again, this is a fundamental disagreement between your approach of 
randomly killing processes hoping that we target one that can make a quick 
exit vs. my approach where we give threads access to memory reserves after 
reclaim has failed in an oom livelock so they at least make forward 
progress.  We're going around in circles.

> (Well, do we need to change __alloc_pages_slowpath() so that OOM victims do
> not enter direct reclaim paths, in order to avoid being blocked by unkillable
> fs locks?)
> 

OOM victims shouldn't need to enter reclaim, and there have been patches 
before to abort reclaim if current has a pending SIGKILL, if they have 
access to memory reserves.  Nothing prevents the victim from already being 
in reclaim, however, when it is killed.

> > Perhaps this is an argument that we need to provide access to memory 
> > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
> > but I would wait to make that extension until we see it in practice.
> 
> I think that GFP_ATOMIC allocations already access memory reserves via
> ALLOC_HIGH priority.
> 

Yes, that's true.  It doesn't help for GFP_NOFS, however.  It may be 
possible that GFP_ATOMIC reserves have been depleted, or that there is a 
GFP_NOFS allocation that gets stuck looping forever because it doesn't get 
the ability to allocate without watermarks.  I'd wait to see it in practice 
before making this extension since it relies on scanning the tasklist.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-30 21:11                                 ` David Rientjes
@ 2015-10-01 12:13                                   ` Tetsuo Handa
  0 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-01 12:13 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

David Rientjes wrote:
> On Wed, 30 Sep 2015, Tetsuo Handa wrote:
> 
> > If we choose only 1 OOM victim, the possibility of hitting this memory
> > unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
> > possibility becomes (almost) 0%. And if we still hit this livelock even
> > after choosing many OOM victims, it is time to call panic().
> > 
> 
> Again, this is a fundamental disagreement between your approach of 
> randomly killing processes hoping that we target one that can make a quick 
> exit vs. my approach where we give threads access to memory reserves after 
> reclaim has failed in an oom livelock so they at least make forward 
> progress.  We're going around in circles.

I don't like that the memory management subsystem silently waits and hopes
when memory allocation is failing. There are many possible silent hang-up
paths, and my customer's servers might be hitting such paths. But I can't
go in front of their servers and capture SysRq output. Thus, I want the
memory management subsystem to try to recover automatically, or at least
emit some diagnostic kernel messages automatically.

> 
> > (Well, do we need to change __alloc_pages_slowpath() so that OOM victims do
> > not enter direct reclaim paths, in order to avoid being blocked by unkillable
> > fs locks?)
> > 
> 
> OOM victims shouldn't need to enter reclaim, and there have been patches 
> before to abort reclaim if current has a pending SIGKILL,

Yes. shrink_inactive_list() and throttle_direct_reclaim() recognize
fatal_signal_pending() tasks.

>                                                           if they have 
> access to memory reserves.

What does this mean?

shrink_inactive_list() and throttle_direct_reclaim() do not check whether
OOM victims have access to memory reserves, do they?

We don't allow access to memory reserves by OOM victims without TIF_MEMDIE.
I think that we should favor kthreads and dying threads over normal threads
at __alloc_pages_slowpath(), but there has been no response on
http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
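
For reference, the TIF_MEMDIE based access to memory reserves comes from
gfp_to_alloc_flags(); a simplified sketch (the mainline version has extra
cases for PF_MEMALLOC and softirq context):

static int alloc_flags_sketch(gfp_t gfp_mask, int alloc_flags)
{
	if (gfp_mask & __GFP_MEMALLOC)
		alloc_flags |= ALLOC_NO_WATERMARKS;
	else if (!in_interrupt() &&
		 unlikely(test_thread_flag(TIF_MEMDIE)))
		alloc_flags |= ALLOC_NO_WATERMARKS;	/* OOM victim */
	return alloc_flags;
}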

>                             Nothing prevents the victim from already being 
> in reclaim, however, when it is killed.

I think this is problematic because there are unkillable locks in reclaim
paths. The memory management subsystem reports nothing.

> 
> > > Perhaps this is an argument that we need to provide access to memory 
> > > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
> > > but I would wait to make that extension until we see it in practice.
> > 
> > I think that GFP_ATOMIC allocations already access memory reserves via
> > ALLOC_HIGH priority.
> > 
> 
> Yes, that's true.  It doesn't help for GFP_NOFS, however.  It may be 
> possible that GFP_ATOMIC reserves have been depleted or there is a 
> GFP_NOFS allocation that gets stuck looping forever that doesn't get the 
> ability to allocate without watermarks.

Why can't we emit some diagnostic kernel messages automatically?
Memory allocation requests which do not complete within e.g. 30 seconds
deserve a "possible memory allocation deadlock" warning message.

>                                          I'd wait to see it in practice 
> before making this extension since it relies on scanning the tasklist.
> 

Is this extension something like check_hung_uninterruptible_tasks()?

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-28 22:24                         ` can't oom-kill zap the victim's memory? David Rientjes
  2015-09-29  7:57                           ` Tetsuo Handa
@ 2015-10-01 14:48                           ` Michal Hocko
  2015-10-02 13:06                             ` Tetsuo Handa
  1 sibling, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-01 14:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon 28-09-15 15:24:06, David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> 
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic.  So they will
> > > > be sitting on top of an unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > > 
> > > 
> > > Which locks are you specifically interested in?
> > 
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> > 
> 
> I agree that i_mutex seems to be one of the most common offenders.  
> However, I'm not sure I understand why holding it while trying to allocate 
> infinitely for an order-0 allocation is problematic wrt the proposed 
> kthread. 

I didn't say it would be problematic. We are talking past each other
here. All I wanted to say was that a separate kernel oom thread wouldn't
_help_ with the lock dependencies.

> The kthread itself need only take mmap_sem for read.  If all 
> threads sharing the mm with a victim have been SIGKILL'd, they should get 
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> drop mmap_sem. 

which is the case if the direct oom context used trylock...
So just to make it clear: I am not objecting to a specialized oom kernel
thread. It would work as well. I am just not convinced that it is really
needed, because the direct oom context can use trylock and do the same
work directly.
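
A minimal sketch of that direct trylock variant, assuming we only zap
private, non-hugetlb mappings (hypothetical helper, not an actual patch
from this thread):

static void oom_zap_victim_mm(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	if (!down_read_trylock(&mm->mmap_sem))
		return;		/* write-locked; do not block the OOM path */

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SHARED))
			continue;	/* skip what zapping cannot free */
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, NULL);
	}
	up_read(&mm->mmap_sem);
}
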
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-28 16:18                           ` Tetsuo Handa
  2015-09-28 22:28                             ` David Rientjes
@ 2015-10-02 12:36                             ` Michal Hocko
  2015-10-02 19:01                               ` Linus Torvalds
  2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
  1 sibling, 2 replies; 109+ messages in thread
From: Michal Hocko @ 2015-10-02 12:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The point I've tried to make is that an oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
> 
> We tried to allow small allocations to fail. It resulted in an unstable
> system with obscure bugs.

Have they been reported/fixed? All kernel paths doing an allocation are
_supposed_ to check and handle ENOMEM. If they are not then they are
buggy and should be fixed.

> We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> effectively __GFP_NOFAIL allocations.

What do you mean by that? An opencoded __GFP_NOFAIL?
 
> We are now trying to allow zapping OOM victim's mm. Michal is already
> skeptical about this approach due to lock dependency.

I am not sure where this came from. I am all for this approach. It will
not solve the problem completely for sure but it can help in many cases
already.

> We already spent 9 months on this OOM livelock. No silver bullet yet.
> Proposed approaches are too drastic to backport for existing users.
> I think we are out of bullets.

Not at all. We have had this problem basically forever. And we have a lot
of legacy issues to care about. But nobody could reasonably expect this
to be solved in a short time period.

> Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> of callsites,

This is simply not doable. There are thousands of allocation sites all
over the kernel.

> a timeout based workaround will be the only bullet we can use.

Those are a last resort which only papers over real bugs which should
be fixed. I would agree with your urging if this were something that could
easily happen on a _properly_ configured system. A system which can blow
up into an OOM storm is far from being configured properly. If you have
untrusted users running on your system you had better put them into a
highly restricted environment and limit them as much as possible.

I can completely understand your frustration about the pace of the
progress here, but this is nothing new and we should strive for a long term
vision which would be much less fragile than what we have right now. A
timeout based solution is not a step in that direction.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-01 14:48                           ` Michal Hocko
@ 2015-10-02 13:06                             ` Tetsuo Handa
  0 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-02 13:06 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Michal Hocko wrote:
> On Mon 28-09-15 15:24:06, David Rientjes wrote:
> > I agree that i_mutex seems to be one of the most common offenders.  
> > However, I'm not sure I understand why holding it while trying to allocate 
> > infinitely for an order-0 allocation is problematic wrt the proposed 
> > kthread. 
> 
> I didn't say it would be problematic. We are talking past each other
> here. All I wanted to say was that a separate kernel oom thread wouldn't
> _help_ with the lock dependencies.
> 
Oops. I misunderstood you to be skeptical about the memory unmapping approach
due to lock dependency. Rather, you are skeptical about the use of a dedicated
kernel thread for the memory unmapping approach.

> > The kthread itself need only take mmap_sem for read.  If all 
> > threads sharing the mm with a victim have been SIGKILL'd, they should get 
> > TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> > drop mmap_sem. 
> 
> which is the case if the direct oom context used trylock...
> So just to make it clear. I am not objecting a specialized oom kernel
> thread. It would work as well. I am just not convinced that it is really
> needed because the direct oom context can use trylock and do the same
> work directly.

Well, I think it depends on where we call the memory unmapping code from.

The first candidate is oom_kill_process() because it is the location where
the mm struct to unmap is determined. But since select_bad_process()
aborts upon encountering a TIF_MEMDIE task, we will fail to call the memory
unmapping code again if the first down_trylock(&mm->mmap_sem) attempt in
oom_kill_process() fails. (Here I assumed that we allow all OOM victims
to access memory reserves so that subsequent down_trylock(&mm->mmap_sem)
attempts could succeed.)

The second candidate is select_bad_process() because it is a location
where we can call the memory unmapping code again upon encountering a
TIF_MEMDIE task.

The third candidate is the caller of out_of_memory() because it is a location
where we can call the memory unmapping code again even when the OOM victims
are blocked. (Our discussion seems to assume that TIF_MEMDIE tasks can
make forward progress and die. But since TIF_MEMDIE tasks might encounter
unkillable locks after returning from allocation (e.g.
http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ),
it will be safer not to assume that out_of_memory() can always be called.)
So, I thought that a dedicated kernel thread makes it easy to call the memory
unmapping code periodically, again and again.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-02 12:36                             ` Michal Hocko
@ 2015-10-02 19:01                               ` Linus Torvalds
  2015-10-05 14:44                                 ` Michal Hocko
  2015-10-06  7:55                                 ` Eric W. Biederman
  2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
  1 sibling, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-10-02 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.

No. Stop this theoretical idiocy.

We've tried it. I objected before people tried it, and it turns out
that it was a horrible idea.

Small kernel allocations should basically never fail, because we end
up needing memory for random things, and if a kmalloc() fails it's
because some application is using too much memory, and the application
should be killed. Never should the kernel allocation fail. It really
is that simple. If we are out of memory, that does not mean that we
should start failing random kernel things.

So this "people should check for allocation failures" is bullshit.
It's a computer science myth. It's simply not true in all cases.

Kernel allocators that know that they do large allocations (ie bigger
than a few pages) need to be able to handle the failure, but not the
general case. Also, kernel allocators that know they have a good
fallback (eg they try a large allocation first but can fall back to a
smaller one) should use __GFP_NORETRY, but again, that does *not* in
any way mean that general kernel allocations should randomly fail.

So no. The answer is ABSOLUTELY NOT "everybody should check allocation
failure". Get over it. I refuse to go through that circus again. It's
stupid.

             Linus

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Can't we use timeout based OOM warning/killing?
  2015-10-02 12:36                             ` Michal Hocko
  2015-10-02 19:01                               ` Linus Torvalds
@ 2015-10-03  6:02                               ` Tetsuo Handa
  2015-10-06 14:51                                 ` Tetsuo Handa
                                                   ` (2 more replies)
  1 sibling, 3 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-03  6:02 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > The point I've tried to make is that an oom unmapper running in a detached
> > > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > > make any difference wrt. lock because the holders of the lock would loop
> > > inside the allocator anyway because we do not fail small allocations.
> >
> > We tried to allow small allocations to fail. It resulted in an unstable
> > system with obscure bugs.
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.
>

Kernel developers are not interested in testing OOM cases. I proposed a
SystemTap-based mandatory memory allocation failure injection for testing
OOM cases, but there was no response. Most of the memory allocation failure
paths in the kernel remain untested. Unless you persuade all kernel
developers to test OOM cases, add a gfp flag which bypasses the memory
allocation failure injection test (e.g. __GFP_FITv1_PASSED), and change
any !__GFP_FITv1_PASSED && !__GFP_NOFAIL allocation to always fail, we can't
check that "all kernel paths doing an allocation are _supposed_ to check
and handle ENOMEM".

> > We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> > effectively __GFP_NOFAIL allocations.
>
> What do you mean by that? An opencoded __GFP_NOFAIL?
>  

Yes. XFS livelock is an example I can trivially reproduce.
Loss of reliability of buffered write()s is another example.

  [ 1721.405074] buffer_io_error: 36 callbacks suppressed
  [ 1721.406263] Buffer I/O error on dev sda1, logical block 34652401, lost async page write
  [ 1721.406996] Buffer I/O error on dev sda1, logical block 34650278, lost async page write
  [ 1721.407125] Buffer I/O error on dev sda1, logical block 34652330, lost async page write
  [ 1721.407197] Buffer I/O error on dev sda1, logical block 34653485, lost async page write
  [ 1721.407203] Buffer I/O error on dev sda1, logical block 34652398, lost async page write
  [ 1721.407232] Buffer I/O error on dev sda1, logical block 34650494, lost async page write
  [ 1721.407356] Buffer I/O error on dev sda1, logical block 34652361, lost async page write
  [ 1721.407386] Buffer I/O error on dev sda1, logical block 34653484, lost async page write
  [ 1721.407481] Buffer I/O error on dev sda1, logical block 34652396, lost async page write
  [ 1721.407504] Buffer I/O error on dev sda1, logical block 34650291, lost async page write
  [ 1723.369963] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1723.810033] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1725.434057] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1725.448049] XFS: a.out(7810) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1725.470757] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1725.474061] XFS: a.out(7881) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1725.586610] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1726.026702] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1726.043988] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1727.682001] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1727.688661] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [ 1727.785214] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1728.226640] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1728.290648] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [ 1729.930028] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
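
Those "possible memory allocation deadlock" messages come from XFS's own
allocation wrapper, which retries a failed allocation forever unless the
caller passes KM_MAYFAIL, i.e. an open-coded __GFP_NOFAIL. A simplified
sketch (modeled on fs/xfs/kmem.c of this era, not the exact code):

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & KM_MAYFAIL))
			return ptr;
		if (!(++retries % 100))	/* source of the messages above */
			xfs_err(NULL,
	"possible memory allocation deadlock in %s (mode:0x%x)",
				__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	} while (1);
}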

> > We are now trying to allow zapping OOM victim's mm. Michal is already
> > skeptical about this approach due to lock dependency.
>
> I am not sure where this came from. I am all for this approach. It will
> not solve the problem completely for sure but it can help in many cases
> already.
>

Sorry. This was my misunderstanding. But I still think that we need to be
prepared for cases where the approach of zapping the OOM victim's mm fails.
( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

> > We already spent 9 months on this OOM livelock. No silver bullet yet.
> > Proposed approaches are too drastic to backport for existing users.
> > I think we are out of bullet.
>
> Not at all. We have this problem since ever basically. And we have a lot
> of legacy issues to care about. But nobody could reasonably expect this
> will be solved in a short time period.
>

What people generally imagine about the OOM killer is that it is invoked
when the system is out of memory. But we know that there are many possible
cases where OOM killer messages are not printed. We made no effort to break
people free from the belief that the OOM killer is always invoked when the
system is out of memory, nor to provide people a means to be warned of an
OOM situation, even after we recognized the "too small to fail"
memory-allocation rule ( https://lwn.net/Articles/627419/ ) nine months ago.

> > Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> > of callsites,
>
> This is simply not doable. There are thousands of allocation sites all
> over the kernel.

But changing the default behavior (i.e. implicitly behaving like __GFP_NORETRY
inside the memory allocator unless __GFP_NOFAIL is passed) is also not doable.
You would need to ask for ACKs from thousands of allocation sites all over
the kernel, and that is not realistic.

An example: I proposed a patch which changes the default behavior in XFS and
got feedback ( http://marc.info/?l=linux-mm&m=144279862227010 ) that
fundamentally changing the allocation behavior of the filesystem requires
some indication of the testing and characterization of how the change has
impacted the low memory balance and performance of the filesystem.
You would need to ask for ACKs from all filesystem developers.

Another example: I don't like that permission checks for access requests from
user space start failing with ENOMEM when memory is tight. It is not
acceptable that access requests by critical processes fail because of an
inconsequential process's memory consumption.
( https://www.mail-archive.com/tomoyo-users-en@lists.osdn.me/msg00008.html )
This problem is not limited to permission checks. If a process executes a
program using execve() and reaches the point of no return in the execve()
operation, then any memory allocation failure before the point where ENOMEM
errors can be handled (e.g. failing to load shared libraries before calling
the main() function of the new program) will get the process killed. If the
process were the global init process, the system would panic().

Even though we mean to simply enforce "all kernel paths doing an allocation
are _supposed_ to check and handle ENOMEM", we have a window where a memory
allocation failure in user space results in an unrecoverable failure.
We depend on /proc/$pid/oom_score_adj to protect critical processes from
inconsequential ones.

I'm happy to give up a memory allocation upon SIGKILL, but I'm not happy to
give up upon ENOMEM without making an effort to solve the OOM situation.

>
> > a timeout based workaround will be the only bullet we can use.
>
> Those are the last resort which only paper over real bugs which should
> be fixed. I would agree with your urging if this was something that can
> easily happen on a _properly_ configured system. System which can blow
> into an OOM storm is far from being configured properly. If you have an
> untrusted users running on your system you should better put them into a
> highly restricted environment and limit as much as possible.

People are reporting hang-up problems, and I suspect that some of them are
caused by silent OOM. I showed you that there are many possible paths which
can lead to a silent hang-up. But we are forcing people to use kernels with
no means to find out what is happening. Therefore, "there is no report" does
not mean that "we are not hitting OOM livelock problems".

Without a means to find out what is happening, we will "overlook real bugs"
rather than "paper over real bugs". Such a means is expected to work without
knowledge of the tracepoints functionality, to run without allocating memory,
to dump its output without an administrator's intervention, and to act before
a watchdog timer resets the machine.

>
> I can completely understand your frustration about the pace of the
> progress here but this is nothing new and we should strive for long term
> vision which would be much less fragile than what we have right now. No
> timeout based solution is the way in that direction.

Can we stop randomly setting TIF_MEMDIE on only one task and staying silent
forever in the hope that the task can make a quick exit? As long as small
allocations do not fail, this TIF_MEMDIE logic is prone to livelock.

We won't be able to make small allocations fail (as Linus said at
http://lkml.kernel.org/r/CA+55aFw=OLSdh-5Ut2vjy=4Yf1fTXqpzoDHdF7XnT5gDHs6sYA@mail.gmail.com
and as I said in this post) in the near future.

Like I said at http://lkml.kernel.org/r/201510012113.HEA98301.SVFQOFtFOHLMOJ@I-love.SAKURA.ne.jp ,
can't we start adding a means to emit some diagnostic kernel messages
automatically?
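
As a strawman, such a means could be as simple as a watchdog kthread that
complains when an allocation attempt has been stalling (hypothetical sketch;
nothing like this exists in mainline, and a real version would need per-task
bookkeeping instead of one global timestamp):

static unsigned long memalloc_start;	/* jiffies; 0 = nothing stalling */

static int memalloc_watchdog(void *unused)
{
	while (!kthread_should_stop()) {
		unsigned long start = READ_ONCE(memalloc_start);

		if (start && time_after(jiffies, start + 30 * HZ))
			pr_warn("memory allocation stalling for %u ms\n",
				jiffies_to_msecs(jiffies - start));
		schedule_timeout_interruptible(10 * HZ);
	}
	return 0;
}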

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-02 19:01                               ` Linus Torvalds
@ 2015-10-05 14:44                                 ` Michal Hocko
  2015-10-07  5:16                                   ` Vlastimil Babka
  2015-10-06  7:55                                 ` Eric W. Biederman
  1 sibling, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-05 14:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Fri 02-10-15 15:01:06, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > Have they been reported/fixed? All kernel paths doing an allocation are
> > _supposed_ to check and handle ENOMEM. If they are not then they are
> > buggy and should be fixed.
> 
> No. Stop this theoretical idiocy.
> 
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
> 
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.

But you do realize that killing a task as a memory reclaim technique is
not 100% reliable, right?

Any task might be blocked in an uninterruptible context (e.g. on a mutex)
waiting for a completion which depends on the allocation succeeding. The page
allocator (resp. the OOM killer) is not aware of these dependencies, and I am
really skeptical it ever will be, because dependency tracking is way too
expensive. So killing a task doesn't guarantee forward progress.
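
The dependency can be as small as this (kernel-style sketch with
hypothetical functions):

static DEFINE_MUTEX(shared_lock);

static void holder(void)		/* task A */
{
	mutex_lock(&shared_lock);
	(void)kmalloc(PAGE_SIZE, GFP_KERNEL);	/* may loop forever under OOM */
	mutex_unlock(&shared_lock);
}

static void victim(void)		/* task B, chosen by the OOM killer */
{
	mutex_lock(&shared_lock);	/* uninterruptible: SIGKILL cannot wake it */
	mutex_unlock(&shared_lock);
}

The OOM killer waits for B to exit, B waits for A's mutex, and A waits for
the allocator, which waits for the OOM killer. Nobody makes progress.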

So I can see basically only a few ways out of this deadlock situation.
Either we face reality and allow small allocations (without
__GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
(so after even the OOM killer hasn't made any progress).
Or we can start killing other tasks but this might end up in the same
state and the time to resolve the problem might be basically unbounded
(it is trivial to construct loads where hundreds of tasks are bashing
against a single i_mutex and all of them depending on an allocation...).
Or we can panic/reboot the system if the OOM situation cannot be solved
within a selected timeout.

There are other ways to micro-optimize the current implementation by
playing with memory reserves but all that is just postponing the final
disaster and there is still a point of no further progress that we have
to deal with somehow.

> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.

Sure it is not true in _all_ cases. If some paths cannot fail they can
use __GFP_NOFAIL for that purpose. The point is that most allocations
_can_ handle the failure. People are taught to check for allocation
failures. We even have scripts/coccinelle/null/kmerr.cocci which helps
to detect slab allocator users to some degree.

> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
> 
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.
> 
>              Linus

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-02 19:01                               ` Linus Torvalds
  2015-10-05 14:44                                 ` Michal Hocko
@ 2015-10-06  7:55                                 ` Eric W. Biederman
  2015-10-06  8:49                                   ` Linus Torvalds
  1 sibling, 1 reply; 109+ messages in thread
From: Eric W. Biederman @ 2015-10-06  7:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>
>> Have they been reported/fixed? All kernel paths doing an allocation are
>> _supposed_ to check and handle ENOMEM. If they are not then they are
>> buggy and should be fixed.
>
> No. Stop this theoretical idiocy.
>
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
>
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.
>
> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.
>
> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
>
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.

Not to take away from your point about very small allocations.  However,
assuming allocations larger than a page will always succeed is downright
dangerous.  Last time this issue rose up and bit me I sat down and
did the math, and it is ugly.  You have to have more than 50% of memory free
to guarantee that an order-1 allocation will succeed.
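
To spell out the pigeonhole argument (a sketch under the worst-case layout):
group N physical pages into N/2 aligned order-1 buddy pairs. With N/2 pages
free, an adversarial layout puts exactly one free page in every pair, so no
order-1 block exists; only with N/2 + 1 free pages must some pair contain
two free buddies. In general, guaranteeing an order-k block requires strictly
more than (1 - 1/2^k) * N pages free: 50% for order 1, 75% for order 2,
87.5% for order 3.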

So quite frankly I think it is only safe to require order-0 allocations
to succeed.  Larger allocations do fail in practice, and it causes real
problems on real workloads when we loop forever waiting for
something that will never come.

My analysis from when it bit me.

commit 96c7a2ff21501691587e1ae969b83cbec8b78e08
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Feb 10 14:25:41 2014 -0800

    fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem
    
    Recently due to a spike in connections per second memcached on 3
    separate boxes triggered the OOM killer from accept.  At the time the
    OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
    problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
    hold a bitmap, and there was sufficient fragmentation that the largest
    page available was 8KiB.
    
    I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
    but I do agree that order 3 allocations are very likely to succeed.
    
    There are always pathologies where order > 0 allocations can fail when
    there are copious amounts of free memory available.  Using the pigeon
    hole principle it is easy to show that it requires 1 page more than 50%
    of the pages being free to guarantee an order 1 (8KiB) allocation will
    succeed, 1 page more than 75% of the pages being free to guarantee an
    order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
    the pages being free to guarantee an order 3 allocate will succeed.
    
    A server churning memory with a lot of small requests and replies like
    memcached is a common case that if anything can will skew the odds
    against large pages being available.
    
    Therefore let's not give external applications a practical way to kill
    linux server applications, and specify __GFP_NORETRY to the kmalloc in
    alloc_fdmem.  Unless I am misreading the code and by the time the code
    reaches should_alloc_retry in __alloc_pages_slowpath (where
    __GFP_NORETRY becomes significant).  We have already tried everything
    reasonable to allocate a page and the only thing left to do is wait.  So
    not waiting and falling back to vmalloc immediately seems like the
    reasonable thing to do even if there wasn't a chance of triggering the
    OOM killer.

Eric

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-06  7:55                                 ` Eric W. Biederman
@ 2015-10-06  8:49                                   ` Linus Torvalds
  2015-10-06  8:55                                     ` Linus Torvalds
  0 siblings, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2015-10-06  8:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Tue, Oct 6, 2015 at 8:55 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Not to take away from your point about very small allocations.  However
> assuming allocations larger than a page will always succeed is down
> right dangerous.

We've required retrying for *at least* order-1 allocations. Exactly
because things like fork() etc have wanted them, and:

 - as you say, you can be unlucky even with reasonable amounts of free memory

 - the page-out code is approximate and doesn't guarantee that you get
buddy coalescing

 - just failing after a couple of loops has been known to result in
fork() and similar friends returning -EAGAIN and breaking user space.

Really. Stop this idiocy. We have gone through this before. It's a disaster.

The basic fact remains: kernel allocations are so important that
rather than fail, you should kill user space. Only kernel allocations
that *explicitly* know that they have fallback code should fail, and
they should just do the __GFP_NORETRY.

So the rule ends up being that we retry the memory freeing loop for
small allocations (where "small" is something like "order 2 or less").

So really. If you find some particular case that is painful because it
wants an order-1 or order-2 allocation, then you do this:

 - do the allocation with GFP_NORETRY

 - have a fallback that uses vmalloc or just is able to make the
buffer even smaller.
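
In code, that fallback pattern is roughly (a sketch; essentially what
fs/file.c's alloc_fdmem() does after the commit Eric quoted):

static void *alloc_buffer(size_t size)
{
	/* Try the cheap contiguous allocation, but don't retry and
	 * don't trigger the OOM killer; vmalloc() only needs
	 * order-0 pages. */
	void *data = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

	if (data)
		return data;
	return vmalloc(size);
}

(The matching free side needs kvfree() or an is_vmalloc_addr() check,
omitted here.)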

But by default we will continue to make small orders retry. As
mentioned, we have tried the alternatives. It doesn't work.

            Linus

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-06  8:49                                   ` Linus Torvalds
@ 2015-10-06  8:55                                     ` Linus Torvalds
  2015-10-06 14:52                                       ` Eric W. Biederman
  0 siblings, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2015-10-06  8:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The basic fact remains: kernel allocations are so important that
> rather than fail, you should kill user space. Only kernel allocations
> that *explicitly* know that they have fallback code should fail, and
> they should just do the __GFP_NORETRY.

To be clear: "big" orders (I forget if the limit is at order-3 or
order-4) do fail much more aggressively. But no, we do not limit retry
to just order-0, because even small kmalloc sizes tend to often do
order-1 or order-2 just because of memory packing issues (ie trying to
pack into a single page wastes too much memory if the allocation sizes
don't come out right).

So no, order-0 isn't special. 1/2 are rather important too.

[ Checking /proc/slabinfo: it looks like several slabs are order-3,
for things like files_cache, signal_cache and sighand_cache for me at
least. So I think it's up to order-3 that we basically need to
consider "we'll need to shrink user space aggressively unless we have
an explicit fallback for the allocation" ]

            Linus

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
@ 2015-10-06 14:51                                 ` Tetsuo Handa
  2015-10-12  6:43                                   ` Tetsuo Handa
  2015-10-06 15:25                                 ` Can't we use timeout based OOM warning/killing? Linus Torvalds
  2015-10-10 12:50                                 ` Tetsuo Handa
  2 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-06 14:51 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Sorry. This was my misunderstanding. But I still think that we need to be
> prepared for cases where zapping OOM victim's mm approach fails.
> ( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

I tested how easy or difficult it is to make the approach of zapping the OOM
victim's mm fail. The result suggests that it is not difficult to make it fail.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

/* Each pread() of /proc/self/cmdline takes this process's mmap_sem for
 * reading (via access_remote_vm()), keeping a steady stream of readers
 * on the semaphore. */
static int reader(void *unused)
{
	char c;
	int fd = open("/proc/self/cmdline", O_RDONLY);
	while (pread(fd, &c, 1, 0) == 1);
	return 0;
}

/* The mmap()/munmap() loop takes mmap_sem for writing over and over,
 * so this thread regularly queues behind the readers above. */
static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static void *ptr[10000];
	int i;
	sleep(2);	/* let the readers start first */
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd,
				      0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int zero_fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;
	unsigned long size = 0;
	int i;
	/* Grow an anonymous buffer roughly as large as overcommit allows. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < 100; i++) {
		clone(reader, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
		      NULL);
	}
	clone(writer, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return * (char *) NULL; /* Kill all threads. */
}
---------- Reproducer end ----------

(I wrote this program to try to mimic a problem where a customer's system
 hung up with a lot of ps processes blocked at reading /proc/pid/ entries
 due to an unkillable down_read(&mm->mmap_sem) in __access_remote_vm(),
 though I couldn't identify what function was holding the mmap_sem for
 writing...)

Uptime > 429 of http://I-love.SAKURA.ne.jp/tmp/serial-20151006.txt.xz showed
an OOM livelock in which

  (1) thread group leader is blocked at down_read(&mm->mmap_sem) in exit_mm()
      called from do_exit().

  (2) writer thread is blocked at down_write(&mm->mmap_sem) in vm_mmap_pgoff()
      called from SyS_mmap_pgoff() called from SyS_mmap().

  (3) many reader threads are blocking the writer thread because of
      down_read(&mm->mmap_sem) called from proc_pid_cmdline_read().

  (4) while the thread group leader is blocked at down_read(&mm->mmap_sem),
      some of the reader threads are trying to allocate memory via page fault.

So, zapping the first OOM victim's mm might fail by chance.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-06  8:55                                     ` Linus Torvalds
@ 2015-10-06 14:52                                       ` Eric W. Biederman
  0 siblings, 0 replies; 109+ messages in thread
From: Eric W. Biederman @ 2015-10-06 14:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> The basic fact remains: kernel allocations are so important that
>> rather than fail, you should kill user space. Only kernel allocations
>> that *explicitly* know that they have fallback code should fail, and
>> they should just do the __GFP_NORETRY.

If you have reached the point of killing userspace you might as well
panic the box.  Userspace will recover more cleanly and more quickly.
The oom-killer is like an oops.  Nice for debugging but not something
you want on a production workload.

> To be clear: "big" orders (I forget if the limit is at order-3 or
> order-4) do fail much more aggressively. But no, we do not limit retry
> to just order-0, because even small kmalloc sizes tend to often do
> order-1 or order-2 just because of memory packing issues (ie trying to
> pack into a single page wastes too much memory if the allocation sizes
> don't come out right).

I am not asking that we limit retry to just order-0 pages.  I am asking
that we limit the oom-killer on failure to just order-0 pages.

> So no, order-0 isn't special. 1/2 are rather important too.

That is a justification for retrying.  That is not a justification for
killing the box.

> [ Checking /proc/slabinfo: it looks like several slabs are order-3,
> for things like files_cache, signal_cache and sighand_cache for me at
> least. So I think it's up to order-3 that we basically need to
> consider "we'll need to shrink user space aggressively unless we have
> an explicit fallback for the allocation" ]

What I know is that order-3 is definitely too big.  I had 4G of RAM
free.  I needed 16K to expand the fd table.  The box died.  That is
not good.

We have static checkers now; failure to check and handle errors tends to
be caught.

So yes for the rare case of order-[123] allocations failing we should
return the failure to the caller.  The kernel can handle it.  Userspace
can handle just about anything better than random processes dying.

Eric
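
As an illustration of the explicit-fallback pattern Linus describes above
(a minimal sketch, not code from this thread; alloc_table() is a made-up
name): __GFP_NORETRY makes the physically contiguous attempt fail fast
instead of retrying into the OOM killer, and the caller falls back to
vmalloc():

	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	static void *alloc_table(size_t size)
	{
		/* fail fast; do not loop or invoke the OOM killer */
		void *p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY);

		if (!p)	/* fall back to order-0 pages */
			p = vmalloc(size);
		return p;	/* free with kvfree() */
	}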


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
  2015-10-06 14:51                                 ` Tetsuo Handa
@ 2015-10-06 15:25                                 ` Linus Torvalds
  2015-10-08 15:33                                   ` Tetsuo Handa
  2015-10-10 12:50                                 ` Tetsuo Handa
  2 siblings, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2015-10-06 15:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Lameter, Linux Kernel Mailing List, Michal Hocko,
	Kyle Walker, Oleg Nesterov, Vladimir Davydov, Stanislav Kozina,
	linux-mm, David Rientjes, Johannes Weiner, Andrew Morton

On Oct 3, 2015 7:02 AM, "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>
wrote:
>
> Kernel developers are not interested in testing OOM cases. I proposed a
> SystemTap-based mandatory memory allocation failure injection for testing
> OOM cases, but there was no response.

I don't know if it's so much "not interested" as just "it's fairly hard to
be realistic and on the same page". We used to have some simple oom testing
that just did tons of allocations in user space, but then all the actual
allocations that go on tend to be just the normal anonymous pages.

Or then it's the same thing with shared memory (which is harder) or some
other case.  It's seldom a complex and varied load with lots of different
allocations.

I think it might be interesting to have some VM image case with fairly
limited memory (so you can easily run it on different machines, whether you
have a workstation with 16GB or some big iron with 1TB of ram). And a
reasonable load that does at least a few different cases (ie not just
some server load, but maybe Xorg and chrome or something).

Because another thing that tends to affect this is that oom without swap is
very different from oom with lots of swap, so different people will see
very different issues. If you have some particular case you want to check,
and could make a VM image for it, maybe that would get more mm people
looking at it and agreeing about the issues.

Would something like that perhaps work? I dunno, but it *might* get more
people on the same page (although maybe then people just start complaining
about the choice of load instead..)

    Linus (on mobile at LinuxCon, so
            the mailing list will bounce this) Torvalds


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-09-23 20:59                   ` Michal Hocko
  2015-09-24 21:15                     ` David Rientjes
@ 2015-10-06 18:45                     ` Oleg Nesterov
  2015-10-07 11:03                       ` Tetsuo Handa
  2015-10-08 14:01                       ` Michal Hocko
  1 sibling, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2015-10-06 18:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

Damn. I can't believe this, but I still can't make the initial change.
And no, it is not that I hit some technical problems; I just can't
decide what exactly the first step should do to be a) really simple
and b) useful. I am starting to think I'll just update my draft patch
which uses queue_work() and send it tomorrow (yes, tomorrow again ;).

But let me at least answer this email,

On 09/23, Michal Hocko wrote:
>
> On Tue 22-09-15 18:06:08, Oleg Nesterov wrote:
> >
> > OK, let it be a kthread from the very beginning, I won't argue. This
> > is really minor compared to other problems.
>
> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic.  So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.

Please see below,

> > And note that the caller can held other locks we do not even know about.
> > Most probably we should not deadlock, at least if we only unmap the anon
> > pages, but still this doesn't look safe.
>
> The unmapper cannot fall back to reclaim and/or trigger the OOM so
> we should be indeed very careful and mark the allocation context
> appropriately. I can remember mmu_gather but it is only doing
> opportunistic allocation AFAIR.

And I was going to make V1 which avoids queue_work/kthread and zaps the
memory in oom_kill_process() context.

But this can't work because we need to increment ->mm_users to avoid
the race with exit_mmap/etc. And this means that we need mmput() after
that, and as we recently discussed it can deadlock if mm_users goes
to zero, we can't do exit_mmap/etc in oom_kill_process().
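
A sketch of the problem (zap_victim_memory() is a hypothetical name for
the proposed unmapping helper):

	atomic_inc(&mm->mm_users);	/* pin the mm against exit_mmap() */
	zap_victim_memory(mm);		/* unmap the victim's anon memory */
	mmput(mm);			/* if this drops the last reference,
					 * it runs exit_mmap() and friends
					 * right here, inside
					 * oom_kill_process() -> can deadlock */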

> > Hmm. If we already have mmap_sem and started zap_page_range() then
> > I do not think it makes sense to stop until we free everything we can.
>
> Zapping a huge address space can take quite some time

Yes, and this is another reason we should do this asynchronously.

> and we really do
> not have to free it all on behalf of the killer when enough memory is
> freed to allow for further progress and the rest can be done by the
> victim. If one batch doesn't seem sufficient then another retry can
> continue.
>
> I do not think that a limited scan would make the implementation more
> complicated

But we can't even know how much memory unmap_single_vma() actually frees.
Even if we could, how can we know we freed enough?

Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
freed_enough_mem() == T. But it is absolutely not clear to me how we
should define this freed_enough_mem(), so I think we should do this
later.

> > But. Can't we just remove another ->oom_score_adj check when we try
> > to kill all mm users (the last for_each_process loop). If yes, this
> > all can be simplified.
> >
> > I guess we can't and it's a pity. Because it looks simply pointless
> > to not kill all mm users. This just means the select_bad_process()
> > picked the wrong task.
>
> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest.

Heh ;) Yes, but I guess it is too late to move it back.

> Maybe we can revisit this...

I hope, but I am not going to try to remove this OOM_SCORE_ADJ_MIN
check now. We just should not zap this mm if we find an OOM-unkillable
user.

Oleg.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-05 14:44                                 ` Michal Hocko
@ 2015-10-07  5:16                                   ` Vlastimil Babka
  2015-10-07 10:43                                     ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: Vlastimil Babka @ 2015-10-07  5:16 UTC (permalink / raw)
  To: Michal Hocko, Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On 5.10.2015 16:44, Michal Hocko wrote:
> So I can see basically only a few ways out of this deadlock situation.
> Either we face the reality and allow small allocations (without
> __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
> (so after even the OOM killer hasn't made any progress).

Note that small allocations already *can* fail if they are done in the context
of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case
where they failed in code that "handled" the allocation failure with a
BUG_ON(!page).
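
The bail-out in question, simplified from the 4.3-era
__alloc_pages_slowpath() in mm/page_alloc.c (the same check appears as
context in the debug patch later in this thread); a TIF_MEMDIE task skips
further reclaim retries and can see NULL even for a small order:

	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;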


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-07  5:16                                   ` Vlastimil Babka
@ 2015-10-07 10:43                                     ` Tetsuo Handa
  2015-10-08  9:40                                       ` Vlastimil Babka
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-07 10:43 UTC (permalink / raw)
  To: vbabka
  Cc: mhocko, torvalds, rientjes, oleg, kwalker, cl, akpm, hannes,
	vdavydov, linux-mm, linux-kernel, skozina

Vlastimil Babka wrote:
> On 5.10.2015 16:44, Michal Hocko wrote:
> > So I can see basically only a few ways out of this deadlock situation.
> > Either we face the reality and allow small allocations (without
> > __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
> > (so after even the OOM killer hasn't made any progress).
> 
> Note that small allocations already *can* fail if they are done in the context
> of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case
> where they failed in code that "handled" the allocation failure with a
> BUG_ON(!page).
> 
Did you hit the race described below?
http://lkml.kernel.org/r/201508272249.HDH81838.FtQOLMFFOVSJOH@I-love.SAKURA.ne.jp

Where was the BUG_ON(!page)? Maybe it is a candidate for adding __GFP_NOFAIL.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-06 18:45                     ` Oleg Nesterov
@ 2015-10-07 11:03                       ` Tetsuo Handa
  2015-10-07 12:00                         ` Oleg Nesterov
  2015-10-08 14:01                       ` Michal Hocko
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-07 11:03 UTC (permalink / raw)
  To: oleg, mhocko
  Cc: torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Oleg Nesterov wrote:
> > > Hmm. If we already have mmap_sem and started zap_page_range() then
> > > I do not think it makes sense to stop until we free everything we can.
> >
> > Zapping a huge address space can take quite some time
> 
> Yes, and this is another reason we should do this asynchronously.
> 
> > and we really do
> > not have to free it all on behalf of the killer when enough memory is
> > freed to allow for further progress and the rest can be done by the
> > victim. If one batch doesn't seem sufficient then another retry can
> > continue.
> >
> > I do not think that a limited scan would make the implementation more
> > complicated
> 
> But we can't even know how much memory unmap_single_vma() actually frees.
> Even if we could, how can we know we freed enough?
> 
> Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> freed_enough_mem() == T. But it is absolutely not clear to me how we
> should define this freed_enough_mem(), so I think we should do this
> later.

Maybe

  bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }

if we change to call mark_oom_victim() on all threads which should be
killed as OOM victims.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-07 11:03                       ` Tetsuo Handa
@ 2015-10-07 12:00                         ` Oleg Nesterov
  2015-10-08 14:04                           ` Michal Hocko
  0 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2015-10-07 12:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On 10/07, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> > freed_enough_mem() == T. But it is absolutely not clear to me how we
> > should define this freed_enough_mem(), so I think we should do this
> > later.
>
> Maybe
>
>   bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }
>
> if we change to call mark_oom_victim() on all threads which should be
> killed as OOM victims.

Well, in this case

	if (atomic_read(&mm->mm_users) == 1)
		break;

makes much more sense. Plus we do not need to change mark_oom_victim().

Let's discuss this later?

Oleg.
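
For completeness, a sketch of where such a check could sit in the (not yet
written) zapping loop; oom_zap_victim_mm() is an illustrative name, locking
is simplified, and only private anonymous mappings are touched:

	static void oom_zap_victim_mm(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		if (!down_read_trylock(&mm->mmap_sem))
			return;	/* held for writing; retry later */
		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			if (is_vm_hugetlb_page(vma) ||
			    (vma->vm_flags & VM_SHARED))
				continue;
			zap_page_range(vma, vma->vm_start,
				       vma->vm_end - vma->vm_start, NULL);
			/* the early exit discussed above: the victim is
			 * the last remaining user, so its exit_mmap()
			 * will finish the job */
			if (atomic_read(&mm->mm_users) == 1)
				break;
		}
		up_read(&mm->mmap_sem);
	}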


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-07 10:43                                     ` Tetsuo Handa
@ 2015-10-08  9:40                                       ` Vlastimil Babka
  0 siblings, 0 replies; 109+ messages in thread
From: Vlastimil Babka @ 2015-10-08  9:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, torvalds, rientjes, oleg, kwalker, cl, akpm, hannes,
	vdavydov, linux-mm, linux-kernel, skozina

On 10/07/2015 12:43 PM, Tetsuo Handa wrote:
> Vlastimil Babka wrote:
>> On 5.10.2015 16:44, Michal Hocko wrote:
>>> So I can see basically only a few ways out of this deadlock situation.
>>> Either we face the reality and allow small allocations (without
>>> __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
>>> (so after even the OOM killer hasn't made any progress).
>>
>> Note that small allocations already *can* fail if they are done in the context
>> of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case
>> where they failed in code that "handled" the allocation failure with a
>> BUG_ON(!page).
>>
> Did you hit the race described below?

I don't know; I don't even have direct evidence of TIF_MEMDIE being set,
but OOMs were happening all over the place, and I haven't found any other
reason why the too-small-to-fail guarantee would not have applied.

> http://lkml.kernel.org/r/201508272249.HDH81838.FtQOLMFFOVSJOH@I-love.SAKURA.ne.jp
>
> Where was the BUG_ON(!page)? Maybe it is a candidate for adding __GFP_NOFAIL.

Yes, I suggested so:
http://marc.info/?l=linux-kernel&m=144181523115244&w=2


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-06 18:45                     ` Oleg Nesterov
  2015-10-07 11:03                       ` Tetsuo Handa
@ 2015-10-08 14:01                       ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-10-08 14:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Tue 06-10-15 20:45:02, Oleg Nesterov wrote:
[...]
> And I was going to make V1 which avoids queue_work/kthread and zaps the
> memory in oom_kill_process() context.
> 
> But this can't work because we need to increment ->mm_users to avoid
> the race with exit_mmap/etc. And this means that we need mmput() after
> that, and as we recently discussed it can deadlock if mm_users goes
> to zero, we can't do exit_mmap/etc in oom_kill_process().

Right. I hoped we could rely on mm_count just to pin the mm, but that is
not sufficient because exit_mmap doesn't take mmap_sem, so we do not have
any synchronization there. Unfortunate. This means that we indeed have
to do it asynchronously. Maybe we can come up with some trickery, but
let's do it later. I do agree that going with a kernel thread for now
would be easier. Sorry about misleading you, I should have realized that
mmput from the oom killing path is dangerous.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: can't oom-kill zap the victim's memory?
  2015-10-07 12:00                         ` Oleg Nesterov
@ 2015-10-08 14:04                           ` Michal Hocko
  0 siblings, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-10-08 14:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, torvalds, kwalker, cl, akpm, rientjes, hannes,
	vdavydov, linux-mm, linux-kernel, skozina

On Wed 07-10-15 14:00:16, Oleg Nesterov wrote:
> On 10/07, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> > > freed_enough_mem() == T. But it is absolutely not clear to me how we
> > > should define this freed_enough_mem(), so I think we should do this
> > > later.
> >
> > Maybe
> >
> >   bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }
> >
> > if we change to call mark_oom_victim() on all threads which should be
> > killed as OOM victims.
> 
> Well, in this case
> 
> 	if (atomic_read(&mm->mm_users) == 1)
> 		break;
> 
> makes much more sense. Plus we do not need to change mark_oom_victim().
> 
> Let's discuss this later?

Yes I do not think this is that important if a kernel thread is going to
reclaim the address space. It will effectively free memory on behalf of
the victim so a longer scan shouldn't be such a big problem. At least
not for the first implementation.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Can't we use timeout based OOM warning/killing?
  2015-10-06 15:25                                 ` Can't we use timeout based OOM warning/killing? Linus Torvalds
@ 2015-10-08 15:33                                   ` Tetsuo Handa
  0 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-08 15:33 UTC (permalink / raw)
  To: torvalds
  Cc: cl, linux-kernel, mhocko, kwalker, oleg, vdavydov, skozina,
	linux-mm, rientjes, hannes, akpm

Linus Torvalds wrote:
> Because another thing that tends to affect this is that oom without swap is
> very different from oom with lots of swap, so different people will see
> very different issues. If you have some particular case you want to check,
> and could make a VM image for it, maybe that would get more mm people
> looking at it and agreeing about the issues.

I was working at a support center troubleshooting RHEL systems. I saw
many trouble cases where customers' servers hung up or rebooted unexpectedly.
In most cases, their servers hung up without OOM killer messages. (In a
few cases OOM killer messages were discovered by analyzing vmcore.)

No messages were recorded to log files such as /var/log/messages and
/var/log/sa/ when their servers hung up. According to /var/log/sa/,
there was little free memory just before their servers hung up.
I suspected that some memory-related problem had happened and suggested
that customers install serial console or netconsole in case the kernel was
printing some messages, but I don't know whether they were able to install
serial console or netconsole on their production systems.

The origin of this OOM livelock discussion was a local OOM-DoS vulnerability
which has existed since Linux 2.0. When I tested this vulnerability on RHEL 7,
I saw strange stalls on XFS. The discussion moved to the public lists after
I developed a reproducer which does not make use of the vulnerability. We
recognized the "too small to fail" memory-allocation rule. I tested various
corner cases using variants of the reproducer. I realized that we have a
race window where a memory allocation can fall into an infinite loop without
OOM killer messages.

I made a hypothesis that customers' servers hit a race where __GFP_FS
allocations are blocked at too_many_isolated() or on unkillable locks in
direct reclaim paths whereas !__GFP_FS allocations are retrying forever
without calling out_of_memory(). But even if they install serial console
or netconsole, we currently emit no warning messages. The timeout
based OOM warning corresponds to check_memalloc_delay() in
http://marc.info/?l=linux-kernel&m=143239201905479 . The timeout based
OOM warning is not only for stalls after an OOM victim was chosen but also
for stalls before an OOM victim is chosen.

Whether we should call out_of_memory() upon timeout might depend on
hardware / ram / swap / workload etc. But I think that whether we can
have a mechanism for warning about possible OOM livelock is independent
of that. Thus, I think that making a VM image would not be helpful here.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
  2015-10-06 14:51                                 ` Tetsuo Handa
  2015-10-06 15:25                                 ` Can't we use timeout based OOM warning/killing? Linus Torvalds
@ 2015-10-10 12:50                                 ` Tetsuo Handa
  2 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-10 12:50 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Without means to find out what was happening, we will "overlook real bugs"
> rather than "paper over real bugs". The means are expected to work without
> knowledge of the tracepoints functionality, to run without memory
> allocation, to dump output without administrator intervention, and to
> work before power reset by watchdog timers.

I want to use something like this patch (CONFIG_DEBUG_something is fine).
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151010.txt.xz
----------------------------------------
>From 0f749ddbc2bd9ce57ba56787e77595c3f13e9cc3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 10 Oct 2015 20:48:09 +0900
Subject: [PATCH] Memory allocation watchdog kernel thread.

This patch adds a kernel thread which periodically reports the number of
memory-allocating tasks, dying tasks and OOM victim tasks.
This kernel thread helps report whether we are failing to resolve OOM
conditions after the OOM killer is invoked, in addition to reporting stalls
before the OOM killer is invoked (e.g. all __GFP_FS allocating tasks are
blocked by locks or throttling whereas all !__GFP_FS allocating tasks
are unable to invoke the OOM killer).

$ grep MemAlloc serial.txt | grep -A 5 MemAlloc-Info:
[  101.937548] MemAlloc-Info: 4 stalling task, 32 dying task, 1 victim task.
[  101.939460] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=17338
[  101.975433] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=17115
[  102.015519] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=17097
[  102.053884] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=15970
[  112.094349] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[  112.098411] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=27494
[  112.138381] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=27271
[  112.178710] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=27253
[  112.218674] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=26126
[  112.257749] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=14083
--
[  128.952137] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[  128.954056] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=44352
[  128.992231] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=44129
[  129.034180] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=44111
[  129.071755] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=42984
[  129.109851] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=30941
--
[  145.683171] MemAlloc-Info: 175 stalling task, 32 dying task, 1 victim task.
[  145.685344] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=61084
[  145.736475] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=60843
[  145.778084] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=59716
[  145.815363] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=47673
[  145.853610] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=47673
--
[  158.030038] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[  158.031945] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=73430
[  158.071066] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=73189
[  158.108835] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=72062
[  158.146500] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=60019
[  158.184146] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=60019
--
[  174.851184] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[  174.853106] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=90252
[  174.896592] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=90011
[  174.935838] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=88884
[  174.978799] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=76841
[  175.022003] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=76841
--

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d6f540..0473eec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2972,6 +2972,147 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+#if 1
+
+static u8 memalloc_counter_active_index; /* Either 0 or 1. */
+static int memalloc_counter[2]; /* Number of tasks doing memory allocation. */
+
+struct memalloc {
+	struct list_head list; /* Connected to memalloc_list. */
+	struct task_struct *task; /* Initialized to current. */
+	unsigned long start; /* Initialized to jiffies. */
+	unsigned int order;
+	gfp_t gfp;
+	u8 index; /* Initialized to memalloc_counter_active_index. */
+};
+
+static LIST_HEAD(memalloc_list); /* List of "struct memalloc".*/
+static DEFINE_SPINLOCK(memalloc_list_lock); /* Lock for memalloc_list. */
+
+/*
+ * malloc_watchdog - A kernel thread for monitoring memory allocation stalls.
+ *
+ * @unused: Not used.
+ *
+ * This kernel thread does not terminate.
+ */
+static int malloc_watchdog(void *unused)
+{
+	static const unsigned long timeout = 10 * HZ;
+	struct memalloc *m;
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long spent;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+	u8 index;
+
+ not_stalling: /* Healthy case. */
+	/*
+	 * Switch active counter and wait for timeout duration.
+	 * This is a kind of open coded implementation of synchronize_srcu()
+	 * because synchronize_srcu_timeout() is missing.
+	 */
+	spin_lock(&memalloc_list_lock);
+	index = memalloc_counter_active_index;
+	memalloc_counter_active_index ^= 1;
+	spin_unlock(&memalloc_list_lock);
+	schedule_timeout_interruptible(timeout);
+	/*
+	 * If memory allocations are working, the counter should remain 0
+	 * because tasks will be able to call both start_memalloc_timer()
+	 * and stop_memalloc_timer() within timeout duration.
+	 */
+	if (likely(!memalloc_counter[index]))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	spin_lock(&memalloc_list_lock);
+	now = jiffies;
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		stalling_tasks++;
+	}
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	/* Next, report tasks stalled at memory allocation. */
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		p = m->task;
+		pr_warn("MemAlloc%s: %s(%u) gfp=0x%x order=%u delay=%lu\n",
+			test_tsk_thread_flag(p, TIF_MEMDIE) ? "-victim" :
+			(fatal_signal_pending(p) ? "-dying" : ""),
+			p->comm, p->pid, m->gfp, m->order, spent);
+		show_stack(p, NULL);
+	}
+	spin_unlock(&memalloc_list_lock);
+	/* Wait until next timeout duration. */
+	schedule_timeout_interruptible(timeout);
+	if (memalloc_counter[index])
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0;
+}
+
+static int __init start_malloc_watchdog(void)
+{
+	struct task_struct *task = kthread_run(malloc_watchdog, NULL,
+					       "MallocWatchdog");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_malloc_watchdog);
+
+#define DEFINE_MEMALLOC_TIMER(m) struct memalloc m = { .task = NULL }
+
+static void start_memalloc_timer(struct memalloc *m, gfp_t gfp_mask, int order)
+{
+	if (m->task)
+		return;
+	m->task = current;
+	m->start = jiffies;
+	m->gfp = gfp_mask;
+	m->order = order;
+	spin_lock(&memalloc_list_lock);
+	m->index = memalloc_counter_active_index;
+	memalloc_counter[m->index]++;
+	list_add_tail(&m->list, &memalloc_list);
+	spin_unlock(&memalloc_list_lock);
+}
+
+static void stop_memalloc_timer(struct memalloc *m)
+{
+	if (!m->task)
+		return;
+	spin_lock(&memalloc_list_lock);
+	memalloc_counter[m->index]--;
+	list_del(&m->list);
+	spin_unlock(&memalloc_list_lock);
+}
+#else
+#define DEFINE_MEMALLOC_TIMER(m)
+#define start_memalloc_timer(m, gfp_mask, order)
+#define stop_memalloc_timer(m)
+#endif
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2984,6 +3125,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	DEFINE_MEMALLOC_TIMER(m);
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3075,6 +3217,8 @@ retry:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	start_memalloc_timer(&m, gfp_mask, order);
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
@@ -3168,6 +3312,7 @@ noretry:
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	stop_memalloc_timer(&m);
 	return page;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: Can't we use timeout based OOM warning/killing?
  2015-10-06 14:51                                 ` Tetsuo Handa
@ 2015-10-12  6:43                                   ` Tetsuo Handa
  2015-10-12 15:25                                     ` Silent hang up caused by pages being not scanned? Tetsuo Handa
  2015-10-26 11:44                                     ` Newbie's question: memory allocation when reclaiming memory Tetsuo Handa
  0 siblings, 2 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-12  6:43 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	char buffer[128] = { };
	const pid_t pid = fork();
	if (pid == 0) { /* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
		int i;
		for (i = 0; i < 9; i++)
			clone(writer, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
		writer(NULL);
	}
	snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
	if (fork() == 0) { /* down_read(&mm->mmap_sem) requester. */
		const int fd = open(buffer, O_RDONLY);
		while (pread(fd, buffer, sizeof(buffer), 0) > 0);
		_exit(0);
	} else { /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long size = 0;
		const int fd = open("/dev/zero", O_RDONLY);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		read(fd, buf, size); /* Will cause OOM due to overcommit */
		return 0;
	}
}
---------- Reproducer end ----------
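
(To try the reproducer, assuming the source is saved as repro.c: build with
"gcc -Wall -O2 -o repro repro.c" and run it inside a disposable VM; it is
designed to drive the machine into the OOM killer.)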

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .

Uptime between 101 and 300 shows a silent hang up (i.e. no OOM killer
messages, no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I resolved
using SysRq-f at uptime = 289. I don't know the reason for this silent hang
up, but the memory unzapping kernel thread will not help because there is
no OOM victim.

----------
[  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 shows an mmap_sem livelock after the OOM killer
was invoked.

----------
[  380.039897] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  380.042500] [  467]     0   467    14047     1815      28       3        0             0 systemd-journal
[  380.045055] [  482]     0   482    10413      259      23       3        0         -1000 systemd-udevd
[  380.047637] [  504]     0   504    12795      119      25       3        0         -1000 auditd
[  380.050127] [ 1244]     0  1244    82428     4257      81       3        0             0 firewalld
[  380.052536] [ 1247]    70  1247     6988       61      21       3        0             0 avahi-daemon
[  380.055028] [ 1250]     0  1250    54104     1372      42       4        0             0 rsyslogd
[  380.057505] [ 1251]     0  1251   137547     2620      91       3        0             0 tuned
[  380.059996] [ 1255]     0  1255     4823       77      15       3        0             0 irqbalance
[  380.062552] [ 1256]     0  1256     1095       37       8       3        0             0 rngd
[  380.065020] [ 1259]     0  1259    53626      441      60       3        0             0 abrtd
[  380.067383] [ 1260]     0  1260    53001      341      58       5        0             0 abrt-watch-log
[  380.069965] [ 1265]     0  1265     8673       83      21       3        0             0 systemd-logind
[  380.072554] [ 1266]    81  1266     6663      117      18       3        0          -900 dbus-daemon
[  380.075122] [ 1272]     0  1272    31577      154      21       3        0             0 crond
[  380.077544] [ 1314]    70  1314     6988       57      19       3        0             0 avahi-daemon
[  380.080013] [ 1427]     0  1427    46741      225      44       3        0             0 vmtoolsd
[  380.082478] [ 1969]     0  1969    25942     3100      48       3        0             0 dhclient
[  380.084969] [ 1990]   999  1990   128626     1929      50       4        0             0 polkitd
[  380.087516] [ 2073]     0  2073    20629      214      45       3        0         -1000 sshd
[  380.090065] [ 2201]     0  2201     7320       68      21       3        0             0 xinetd
[  380.092465] [ 3215]     0  3215    22773      257      44       3        0             0 master
[  380.094879] [ 3217]    89  3217    22816      249      45       3        0             0 qmgr
[  380.097304] [ 3249]     0  3249    75245      315      97       3        0             0 nmbd
[  380.099666] [ 3259]     0  3259    92963      486     131       5        0             0 smbd
[  380.101956] [ 3282]     0  3282    27503       30      12       3        0             0 agetty
[  380.104277] [ 3283]     0  3283    21788      154      49       3        0             0 login
[  380.106574] [ 3286]     0  3286    92963      486     126       5        0             0 smbd
[  380.108835] [ 3296]  1000  3296    28864      117      13       3        0             0 bash
[  380.111073] [ 3374]    89  3374    22799      249      46       3        0             0 pickup
[  380.113298] [ 3378]    89  3378    22836      252      45       3        0             0 cleanup
[  380.115555] [ 3385]    89  3385    22800      248      44       3        0             0 trivial-rewrite
[  380.117811] [ 3392]     0  3392    22825      265      48       3        0             0 local
[  380.119995] [ 3393]     0  3393    30828       59      17       3        0             0 anacron
[  380.122183] [ 3417]  1000  3417   541715   397587     787       6        0             0 a.out
[  380.124315] [ 3418]  1000  3418     1081       24       8       3        0             0 a.out
[  380.126410] [ 3419]  1000  3419     1042       21       7       3        0             0 a.out
[  380.128535] Out of memory: Kill process 3417 (a.out) score 890 or sacrifice child
[  380.130392] Killed process 3418 (a.out) total-vm:4324kB, anon-rss:96kB, file-rss:0kB
[  392.704028] MemAlloc-Info: 7 stalling task, 10 dying task, 1 victim task.
(...snipped...)
[  601.129977] a.out           R  running task        0  3417   3296 0x00000080
[  601.131899]  ffff8800774dba10 ffffffff8112b174 0000000000000100 0000000000000000
[  601.134026]  0000000000000000 0000000000000000 00000000a23cb49d 0000000000000000
[  601.136076]  ffff880077603200 00000000024280ca 0000000000000000 ffff880077603200
[  601.138090] Call Trace:
[  601.139145]  [<ffffffff8112b174>] ? try_to_free_pages+0x94/0xc0
[  601.140831]  [<ffffffff8111a8c4>] ? out_of_memory+0x2f4/0x460
[  601.142489]  [<ffffffff8111fa63>] ? __alloc_pages_nodemask+0x613/0xc30
[  601.144328]  [<ffffffff81161c40>] ? alloc_pages_vma+0xb0/0x200
[  601.145994]  [<ffffffff81143056>] ? handle_mm_fault+0xfa6/0x1370
[  601.147677]  [<ffffffff8162f557>] ? native_iret+0x7/0x7
[  601.149258]  [<ffffffff81058217>] ? __do_page_fault+0x177/0x400
[  601.150966]  [<ffffffff810584d0>] ? do_page_fault+0x30/0x80
[  601.152625]  [<ffffffff81630518>] ? page_fault+0x28/0x30
[  601.154159]  [<ffffffff813230c0>] ? __clear_user+0x20/0x50
[  601.155723]  [<ffffffff81327a68>] ? iov_iter_zero+0x68/0x250
[  601.157329]  [<ffffffff813fc6c8>] ? read_iter_zero+0x38/0xa0
[  601.158923]  [<ffffffff81187f04>] ? __vfs_read+0xc4/0xf0
[  601.160453]  [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[  601.161961]  [<ffffffff811893a0>] ? SyS_read+0x50/0xc0
[  601.163513]  [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[  601.165254] a.out           D ffff8800777b7e08     0  3418   3417 0x00100084
[  601.167118]  ffff8800777b7e08 ffff880077606400 ffff8800777b8000 ffff880036032e00
[  601.169137]  ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff8800777b7e20
[  601.171159]  ffffffff8162a570 ffff880077606400 ffff8800777b7ea8 ffffffff8162d8eb
[  601.173183] Call Trace:
[  601.174193]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.175661]  [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[  601.177388]  [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[  601.179194]  [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[  601.180971]  [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[  601.182509]  [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[  601.183992]  [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[  601.185485]  [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[  601.187224] a.out           D ffff88007c60fdf0     0  3420   3417 0x00000084
[  601.189130]  ffff88007c60fdf0 ffff880078e15780 ffff88007c610000 ffff880036032de8
[  601.191158]  ffff880036032e00 ffff88007c60ff58 ffff880078e15780 ffff88007c60fe08
[  601.193180]  ffffffff8162a570 ffff880078e15780 ffff88007c60fe68 ffffffff8162d698
[  601.195217] Call Trace:
[  601.196226]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.197683]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.199407]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.201192]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.202711]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.204328]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.205874]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.207376] a.out           D ffff88007c24fdf0     0  3421   3417 0x00000084
[  601.209286]  ffff88007c24fdf0 ffff880078e13200 ffff88007c250000 ffff880036032de8
[  601.211316]  ffff880036032e00 ffff88007c24ff58 ffff880078e13200 ffff88007c24fe08
[  601.213335]  ffffffff8162a570 ffff880078e13200 ffff88007c24fe68 ffffffff8162d698
[  601.215356] Call Trace:
[  601.216377]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.217831]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.219529]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.221296]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.222802]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.224403]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.225958]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.227453] a.out           D ffff88007823bdf0     0  3422   3417 0x00000084
[  601.229348]  ffff88007823bdf0 ffff880078e10000 ffff88007823c000 ffff880036032de8
[  601.231395]  ffff880036032e00 ffff88007823bf58 ffff880078e10000 ffff88007823be08
[  601.233427]  ffffffff8162a570 ffff880078e10000 ffff88007823be68 ffffffff8162d698
[  601.235472] Call Trace:
[  601.236504]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.237989]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.239720]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.241583]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.243144]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.244777]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.246307]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.247823] a.out           D ffff88007c483df0     0  3423   3417 0x00000084
[  601.249719]  ffff88007c483df0 ffff880078e13e80 ffff88007c484000 ffff880036032de8
[  601.251765]  ffff880036032e00 ffff88007c483f58 ffff880078e13e80 ffff88007c483e08
[  601.253808]  ffffffff8162a570 ffff880078e13e80 ffff88007c483e68 ffffffff8162d698
[  601.255831] Call Trace:
[  601.256850]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.258286]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.260005]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.261803]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.263329]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.264936]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.266504]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.268019] a.out           D ffff880035893e08     0  3424   3417 0x00000084
[  601.269940]  ffff880035893e08 ffff880078e17080 ffff880035894000 ffff880036032e00
[  601.271945]  ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff880035893e20
[  601.273954]  ffffffff8162a570 ffff880078e17080 ffff880035893ea8 ffffffff8162d8eb
[  601.276000] Call Trace:
[  601.277007]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.278497]  [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[  601.280240]  [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[  601.282058]  [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[  601.283872]  [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[  601.285403]  [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[  601.286924]  [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[  601.288435]  [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[  601.290184] a.out           D ffff8800353b7df0     0  3425   3417 0x00000084
[  601.292108]  ffff8800353b7df0 ffff880078e10c80 ffff8800353b8000 ffff880036032de8
[  601.294165]  ffff880036032e00 ffff8800353b7f58 ffff880078e10c80 ffff8800353b7e08
[  601.296206]  ffffffff8162a570 ffff880078e10c80 ffff8800353b7e68 ffffffff8162d698
[  601.298267] Call Trace:
[  601.299300]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.300755]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.302437]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.304221]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.305764]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.307389]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.308968]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.310488] a.out           D ffff88007cf87df0     0  3426   3417 0x00000084
[  601.312380]  ffff88007cf87df0 ffff880078e16400 ffff88007cf88000 ffff880036032de8
[  601.314414]  ffff880036032e00 ffff88007cf87f58 ffff880078e16400 ffff88007cf87e08
[  601.316443]  ffffffff8162a570 ffff880078e16400 ffff88007cf87e68 ffffffff8162d698
[  601.318490] Call Trace:
[  601.319536]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.321036]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.322763]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.324504]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.326071]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.327715]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.329287]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.330761] a.out           D ffff8800792dfdf0     0  3427   3417 0x00000084
[  601.332705]  ffff8800792dfdf0 ffff880078e12580 ffff8800792e0000 ffff880036032de8
[  601.334699]  ffff880036032e00 ffff8800792dff58 ffff880078e12580 ffff8800792dfe08
[  601.336750]  ffffffff8162a570 ffff880078e12580 ffff8800792dfe68 ffffffff8162d698
[  601.338794] Call Trace:
[  601.339781]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.341280]  [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[  601.343009]  [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[  601.344813]  [<ffffffff8162d032>] ? down_read+0x12/0x20
[  601.346361]  [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[  601.347990]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.349521]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.351044] a.out           D ffff88007743faa8     0  3428   3417 0x00000084
[  601.352942]  ffff88007743faa8 ffff88007bda6400 ffff880077440000 ffff88007743fae0
[  601.354990]  ffff88007fccdfc0 00000001000484e5 0000000000000000 ffff88007743fac0
[  601.357024]  ffffffff8162a570 ffff88007fccdfc0 ffff88007743fb40 ffffffff8162dbed
[  601.359075] Call Trace:
[  601.360096]  [<ffffffff8162a570>] schedule+0x30/0x80
[  601.361540]  [<ffffffff8162dbed>] schedule_timeout+0x11d/0x1c0
[  601.363190]  [<ffffffff810c7e00>] ? cascade+0x90/0x90
[  601.364697]  [<ffffffff8162dce9>] schedule_timeout_uninterruptible+0x19/0x20
[  601.366574]  [<ffffffff8111fc9d>] __alloc_pages_nodemask+0x84d/0xc30
[  601.368332]  [<ffffffff811609a7>] alloc_pages_current+0x87/0x110
[  601.370002]  [<ffffffff811166cf>] __page_cache_alloc+0xaf/0xc0
[  601.371606]  [<ffffffff81119225>] filemap_fault+0x1e5/0x420
[  601.373203]  [<ffffffff81244f39>] xfs_filemap_fault+0x39/0x60
[  601.374798]  [<ffffffff8113d5e7>] __do_fault+0x47/0xd0
[  601.376315]  [<ffffffff81142ec5>] handle_mm_fault+0xe15/0x1370
[  601.377938]  [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[  601.379707]  [<ffffffff81058217>] __do_page_fault+0x177/0x400
[  601.381320]  [<ffffffff810584d0>] do_page_fault+0x30/0x80
[  601.382831]  [<ffffffff81630518>] page_fault+0x28/0x30
[  601.384337] a.out           R  running task        0  3419   3417 0x00000080
[  601.386257]  00000000f80745e8 ffff880034ab4400 ffff8800776d3f18 ffff8800776d3f18
[  601.388287]  0000000000000080 0000000000000000 ffff8800776d3ec8 ffffffff81187e72
[  601.390341]  ffff880034ab4400 ffff880034ab4410 0000000000020000 0000000000000000
[  601.392366] Call Trace:
[  601.393388]  [<ffffffff81187e72>] ? __vfs_read+0x32/0xf0
[  601.394952]  [<ffffffff81290aa9>] ? security_file_permission+0xa9/0xc0
[  601.396745]  [<ffffffff8118858d>] ? rw_verify_area+0x4d/0xd0
[  601.398359]  [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[  601.399897]  [<ffffffff81189560>] ? SyS_pread64+0x90/0xb0
[  601.401429]  [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
----------

I think I noticed three problems from this reproducer.

(1) While the likelihood of hitting the mmap_sem livelock would depend on how
    frequently down_read(&mm->mmap_sem) tasks and down_write(&mm->mmap_sem)
    tasks contend on the OOM victim's mm, we can hit the mmap_sem livelock
    with even a single down_read(&mm->mmap_sem) task. On systems where
    processes are monitored via the /proc/pid/ interface, we can hit this
    mmap_sem livelock by chance.

(2) The OOM killer tries to kill a child process of the memory hog. But the
    child process is not always consuming a lot of memory. The memory
    unzapping kernel thread might not be able to reclaim enough memory
    unless we choose subsequent OOM victims when the first OOM victim task
    is stuck in the mmap_sem livelock.

(3) I don't know the reason, but I can observe that (when there are many
    tasks which got SIGKILL from the OOM killer) many dying tasks participate
    in a memory allocation competition via page_fault() which cannot make
    forward progress, because dying tasks without TIF_MEMDIE are not allowed
    to access the memory reserves (see the sketch below).
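
Regarding (3), the behaviour follows from the alloc-flags logic
(paraphrased from the 4.3-era gfp_to_alloc_flags() in mm/page_alloc.c):
access to the memory reserves is keyed on TIF_MEMDIE or PF_MEMALLOC, not
on a pending SIGKILL:

	if (!in_interrupt() &&
	    ((current->flags & PF_MEMALLOC) ||
	     unlikely(test_thread_flag(TIF_MEMDIE))))
		alloc_flags |= ALLOC_NO_WATERMARKS;
	/* a task that is merely fatal_signal_pending() gets no
	 * ALLOC_NO_WATERMARKS access to the reserves */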


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Silent hang up caused by pages being not scanned?
  2015-10-12  6:43                                   ` Tetsuo Handa
@ 2015-10-12 15:25                                     ` Tetsuo Handa
  2015-10-12 21:23                                       ` Linus Torvalds
  2015-10-13 13:32                                       ` Michal Hocko
  2015-10-26 11:44                                     ` Newbie's question: memory allocation when reclaiming memory Tetsuo Handa
  1 sibling, 2 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-12 15:25 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Uptime between 101 and 300 shows a silent hang up (i.e. no OOM killer
> messages, no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I resolved
> using SysRq-f at uptime = 289. I don't know the reason for this silent hang
> up, but the memory unzapping kernel thread will not help because there is
> no OOM victim.
> 
> ----------
> [  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  289.343187] sysrq: SysRq : Manual OOM execution
> (...snipped...)
> [  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
> (...snipped...)
> [  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
> ----------

I examined this hang up using an additional debug printk() patch. It was
observed that when this silent hang up occurs, zone_reclaimable() called from
shrink_zones(), called from a __GFP_FS memory allocation request, keeps
returning true forever. Since the __GFP_FS memory allocation request can
never call out_of_memory() due to did_some_progress > 0, the system silently
hangs up with 100% CPU usage.
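
For reference, the check that keeps returning true here (4.3-era
mm/vmscan.c): a zone still counts as reclaimable until six times its
reclaimable pages have been scanned since a page was last freed, which is
exactly the "* 6 > PAGES_SCANNED" condition printed by the debug patch
below:

	bool zone_reclaimable(struct zone *zone)
	{
		return zone_page_state(zone, NR_PAGES_SCANNED) <
			zone_reclaimable_pages(zone) * 6;
	}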

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0473eec..fda0bb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,6 +2821,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 #endif /* CONFIG_COMPACTION */
 
+pid_t dump_target_pid;
+
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -2847,6 +2849,9 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "__perform_reclaim returned %u at line %u\n",
+		       progress, __LINE__);
 	return progress;
 }
 
@@ -3007,6 +3012,7 @@ static int malloc_watchdog(void *unused)
 	unsigned int memdie_pending;
 	unsigned int stalling_tasks;
 	u8 index;
+	pid_t pid;
 
  not_stalling: /* Healty case. */
 	/*
@@ -3025,12 +3031,16 @@ static int malloc_watchdog(void *unused)
 	 * and stop_memalloc_timer() within timeout duration.
 	 */
 	if (likely(!memalloc_counter[index]))
+	{
+		dump_target_pid = 0;
 		goto not_stalling;
+	}
  maybe_stalling: /* Maybe something is wrong. Let's check. */
 	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
 	sigkill_pending = 0;
 	memdie_pending = 0;
 	stalling_tasks = 0;
+	pid = 0;
 	preempt_disable();
 	rcu_read_lock();
 	for_each_process_thread(g, p) {
@@ -3062,8 +3072,11 @@ static int malloc_watchdog(void *unused)
 			(fatal_signal_pending(p) ? "-dying" : ""),
 			p->comm, p->pid, m->gfp, m->order, spent);
 		show_stack(p, NULL);
+		if (!pid && (m->gfp & __GFP_FS))
+			pid = p->pid;
 	}
 	spin_unlock(&memalloc_list_lock);
+	dump_target_pid = -pid;
 	/* Wait until next timeout duration. */
 	schedule_timeout_interruptible(timeout);
 	if (memalloc_counter[index])
@@ -3155,6 +3168,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 retry:
+	if (dump_target_pid == -current->pid)
+		dump_target_pid = -dump_target_pid;
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
@@ -3280,6 +3296,11 @@ retry:
 		goto noretry;
 
 	/* Keep reclaiming pages as long as there is reasonable progress */
+	if (dump_target_pid == current->pid) {
+		printk(KERN_INFO "did_some_progress=%lu at line %u\n",
+		       did_some_progress, __LINE__);
+		dump_target_pid = 0;
+	}
 	pages_reclaimed += did_some_progress;
 	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27d580b..cb0c22e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2527,6 +2527,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	return watermark_ok;
 }
 
+extern pid_t dump_target_pid;
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2619,16 +2621,41 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			if (nr_soft_reclaimed)
+			{
+				if (dump_target_pid == current->pid)
+					printk(KERN_INFO "nr_soft_reclaimed=%lu at line %u\n",
+					       nr_soft_reclaimed, __LINE__);
 				reclaimable = true;
+			}
 			/* need some check for avoid more shrink_zone() */
 		}
 
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
+		{
+			if (dump_target_pid == current->pid)
+				printk(KERN_INFO "shrink_zone returned 1 at line %u\n",
+				       __LINE__);
 			reclaimable = true;
+		}
 
 		if (global_reclaim(sc) &&
 		    !reclaimable && zone_reclaimable(zone))
+		{
+			if (dump_target_pid == current->pid) {
+				printk(KERN_INFO "zone_reclaimable returned 1 at line %u\n",
+				       __LINE__);
+				printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+				       zone_page_state(zone, NR_ACTIVE_FILE),
+				       zone_page_state(zone, NR_INACTIVE_FILE));
+				if (get_nr_swap_pages() > 0)
+					printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+					       zone_page_state(zone, NR_ACTIVE_ANON),
+					       zone_page_state(zone, NR_INACTIVE_ANON));
+				printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+				       zone_page_state(zone, NR_PAGES_SCANNED));
+			}
 			reclaimable = true;
+		}
 	}
 
 	/*
@@ -2674,6 +2701,9 @@ retry:
 				sc->priority);
 		sc->nr_scanned = 0;
 		zones_reclaimable = shrink_zones(zonelist, sc);
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "shrink_zones returned %u at line %u\n",
+			       zones_reclaimable, __LINE__);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2707,11 +2737,21 @@ retry:
 	delayacct_freepages_end();
 
 	if (sc->nr_reclaimed)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->nr_reclaimed=%lu at line %u\n",
+			       sc->nr_reclaimed, __LINE__);
 		return sc->nr_reclaimed;
+	}
 
 	/* Aborted reclaim to try compaction? don't OOM, then */
 	if (sc->compaction_ready)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->compaction_ready=%u at line %u\n",
+			       sc->compaction_ready, __LINE__);
 		return 1;
+	}
 
 	/* Untapped cgroup reserves?  Don't OOM, retry. */
 	if (!sc->may_thrash) {
@@ -2720,6 +2760,9 @@ retry:
 		goto retry;
 	}
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "zones_reclaimable=%u at line %u\n",
+		       zones_reclaimable, __LINE__);
 	/* Any of the zones still reclaimable?  Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
@@ -2875,7 +2918,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * point.
 	 */
 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "throttle_direct_reclaim returned 1 at line %u\n",
+			       __LINE__);
 		return 1;
+	}
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
@@ -2885,6 +2933,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "do_try_to_free_pages returned %lu at line %u\n",
+		       nr_reclaimed, __LINE__);
 	return nr_reclaimed;
 }
 
----------

What is strange is that the values printed by this debug printk() patch did
not change as time went by. Thus, I think that this is not a problem of lack
of CPU time for scanning pages. I suspect that there is a bug whereby nobody
is actually scanning pages.

----------
[   66.821450] zone_reclaimable returned 1 at line 2646
[   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[   66.824935] shrink_zones returned 1 at line 2706
[   66.826392] zones_reclaimable=1 at line 2765
[   66.827865] do_try_to_free_pages returned 1 at line 2938
[   67.102322] __perform_reclaim returned 1 at line 2854
[   67.103968] did_some_progress=1 at line 3301
(...snipped...)
[  281.439977] zone_reclaimable returned 1 at line 2646
[  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[  281.439978] shrink_zones returned 1 at line 2706
[  281.439978] zones_reclaimable=1 at line 2765
[  281.439979] do_try_to_free_pages returned 1 at line 2938
[  281.439979] __perform_reclaim returned 1 at line 2854
[  281.439980] did_some_progress=1 at line 3301
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151013.txt.xz
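
(For reference, the heuristic being hit here is roughly the following
sketch of mm/vmscan.c from this era; it matches the
"(ACTIVE_FILE+INACTIVE_FILE) * 6 > PAGES_SCANNED" condition printed by
the debug patch above:)

----------
/*
 * A zone counts as reclaimable until it has been scanned six times as
 * many pages as remain on its reclaimable LRU lists (anon LRUs only
 * count when swap space is available).
 */
static unsigned long zone_reclaimable_pages(struct zone *zone)
{
	unsigned long nr;

	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
	     zone_page_state(zone, NR_INACTIVE_FILE);
	if (get_nr_swap_pages() > 0)
		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
		      zone_page_state(zone, NR_INACTIVE_ANON);
	return nr;
}

static bool zone_reclaimable(struct zone *zone)
{
	return zone_page_state(zone, NR_PAGES_SCANNED) <
		zone_reclaimable_pages(zone) * 6;
}
----------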


* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25                                     ` Silent hang up caused by pages being not scanned? Tetsuo Handa
@ 2015-10-12 21:23                                       ` Linus Torvalds
  2015-10-13 12:21                                         ` Tetsuo Handa
  2015-10-13 13:32                                       ` Michal Hocko
  1 sibling, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2015-10-12 21:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I examined this hang up using an additional debug printk() patch. It was
> observed that when this silent hang up occurs, zone_reclaimable(), called from
> shrink_zones() during a __GFP_FS memory allocation request, keeps returning
> true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> with 100% CPU usage.

I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.

So the do_try_to_free_pages() logic that does that

        /* Any of the zones still reclaimable?  Don't OOM. */
        if (zones_reclaimable)
                return 1;

is rather dubious. The history of that odd line is pretty dubious too:
it used to be that we would return success if "shrink_zones()"
succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
logic got rewritten, and I don't think the current situation is all
that sane.

And returning 1 there is actively misleading to callers, since it
makes them think that it made progress.

So I think you should look at what happens if you just remove that
illogical and misleading return value.

HOWEVER.

I think that it's very true that we have then tuned all our *other*
heuristics for taking this thing into account, so I suspect that we'll
find that we'll need to tweak other places. But this crazy "let's say
that we made progress even when we didn't" thing looks just wrong.

In particular, I think that you'll find that you will have to change
the heuristics in __alloc_pages_slowpath() where we currently do

        if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..

when the "did_some_progress" logic changes that radically.

Because while the current return value looks insane, all the other
testing and tweaking has been done with that very odd return value in
place.

                Linus


* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 21:23                                       ` Linus Torvalds
@ 2015-10-13 12:21                                         ` Tetsuo Handa
  2015-10-13 16:37                                           ` Linus Torvalds
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-13 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I examined this hang up using an additional debug printk() patch. It was
> > observed that when this silent hang up occurs, zone_reclaimable(), called from
> > shrink_zones() during a __GFP_FS memory allocation request, keeps returning
> > true forever. Since the __GFP_FS memory allocation request can never call
> > out_of_memory() due to did_some_progress > 0, the system will silently hang up
> > with 100% CPU usage.
> 
> I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.
> 

I compared "hang up after the OOM killer is invoked" and "hang up before
the OOM killer is invoked" by always printing the values.

 			}
 			reclaimable = true;
 		}
+		else if (dump_target_pid == current->pid) {
+			printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+			       zone_page_state(zone, NR_ACTIVE_FILE),
+			       zone_page_state(zone, NR_INACTIVE_FILE));
+			if (get_nr_swap_pages() > 0)
+				printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+				       zone_page_state(zone, NR_ACTIVE_ANON),
+				       zone_page_state(zone, NR_INACTIVE_ANON));
+			printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+			       zone_page_state(zone, NR_PAGES_SCANNED));
+		}
 	}
 
 	/*

For the former case, most trials showed that

  (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0

. Sometimes PAGES_SCANNED > 0 (as grep'ed below), but ACTIVE_FILE and
INACTIVE_FILE seem to be always 0.

----------
[  195.905057] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  195.927430] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.317088] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.338007] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.723776] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.744618] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.129653] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.151238] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.650232] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.671343] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  277.980310] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  278.001481] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.339220] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.361908] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.682988] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.704055] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  350.368952] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  350.389770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.724821] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.746100] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.231887] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.233770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.253196] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.254910] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[ 1397.628073] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1397.649165] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.207041] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.228762] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
----------

For the latter case, most of the output showed that
ACTIVE_FILE + INACTIVE_FILE > 0.

----------
[  142.647201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.648883] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  142.842868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.955817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.086363] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.231120] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.359238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.473342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.618103] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.746210] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.908162] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.035415] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.161926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.306435] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.434265] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.436099] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  144.643374] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.773239] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.902309] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.046154] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.185410] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.317218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.460304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.654212] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.817362] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.945136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.086303] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.242127] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.489868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.491593] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  153.674246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.839478] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.003234] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.155085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.322187] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.447355] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.653150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.782216] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.939439] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.105921] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.278386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.440832] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.623970] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.625766] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.831074] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.996903] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.139137] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.318492] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.484300] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.667411] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.817246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.012323] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.159483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.323193] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.488399] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.654198] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.339172] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.340896] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.583026] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.797386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.965110] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.124935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.431304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.700317] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.862071] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.029257] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.198312] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.356224] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.559302] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.684486] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.898551] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.900496] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.175960] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.324390] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.526150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.693365] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.878407] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.061503] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.225306] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.416398] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.617395] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.783201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.989053] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  169.196126] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.361136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.362865] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.626817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.797361] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.006389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.211479] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.433890] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.630951] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.855509] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.049814] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.258218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.455404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.665085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.874173] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.057217] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.059056] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.350935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.559404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.782483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.982803] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.203930] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.428321] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.611349] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.851164] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.034220] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.279197] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.455284] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.811445] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.368405] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.370115] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.614733] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.845695] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.024274] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.211389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.427147] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.552333] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.734117] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.935811] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.138296] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.354041] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.559245] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.641776] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.716434] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.718199] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.015952] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.218976] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.440131] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.659238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.882360] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.087342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.314442] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.408926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.631240] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.850326] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.067488] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.283243] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
----------

So, something is preventing ACTIVE_FILE and INACTIVE_FILE from becoming 0?

I also tried the change below, but the result was the same. Therefore, this
problem seems to be independent of "!__GFP_FS allocations do not fail".
(The complete log with the change below (uptime > 101) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151013-2.txt.xz . )

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			 * and the OOM killer can't be invoked, but
 			 * keep looping as per tradition.
 			 */
-			*did_some_progress = 1;
 			goto out;
 		}
 		if (pm_suspended_storage())
----------

----------
[  102.719555] (ACTIVE_FILE=3+INACTIVE_FILE=3) * 6 > PAGES_SCANNED=19
[  102.721234] (ACTIVE_FILE=1+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  102.722908] shrink_zones returned 1 at line 2717
----------

> So the do_try_to_free_pages() logic that does that
> 
>         /* Any of the zones still reclaimable?  Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
> 
> is rather dubious. The history of that odd line is pretty dubious too:
> it used to be that we would return success if "shrink_zones()"
> succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
> logic got rewritten, and I don't think the current situation is all
> that sane.
> 
> And returning 1 there is actively misleading to callers, since it
> makes them think that it made progress.
> 
> So I think you should look at what happens if you just remove that
> illogical and misleading return value.
> 

If I remove

	/* Any of the zones still reclaimable?  Don't OOM. */
	if (zones_reclaimable)
		return 1;

the OOM killer is invoked even when there is so much memory which can be
reclaimed after being written to disk. This is definitely a premature
invocation of the OOM killer.

  $ cat < /dev/zero > /tmp/log & sleep 10; ./a.out

---------- When there is a lot of data to write ----------
[  489.952827] Mem-Info:
[  489.953840] active_anon:328227 inactive_anon:3033 isolated_anon:26
[  489.953840]  active_file:2309 inactive_file:80915 isolated_file:0
[  489.953840]  unevictable:0 dirty:53 writeback:80874 unstable:0
[  489.953840]  slab_reclaimable:4975 slab_unreclaimable:4256
[  489.953840]  mapped:2973 shmem:4192 pagetables:1939 bounce:0
[  489.953840]  free:12963 free_pcp:60 free_cma:0
[  489.963395] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5728kB inactive_anon:88kB active_file:140kB inactive_file:1276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:1300kB mapped:140kB shmem:160kB slab_reclaimable:256kB slab_unreclaimable:180kB kernel_stack:64kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9768 all_unreclaimable? yes
[  489.974035] lowmem_reserve[]: 0 1729 1729 1729
[  489.975813] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1307180kB inactive_anon:12044kB active_file:9096kB inactive_file:322384kB unevictable:0kB isolated(anon):104kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:216kB writeback:322196kB mapped:11752kB shmem:16608kB slab_reclaimable:19644kB slab_unreclaimable:16844kB kernel_stack:3584kB pagetables:7576kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:2419896 all_unreclaimable? yes
[  489.988452] lowmem_reserve[]: 0 0 0 0
[  489.990043] Node 0 DMA: 2*4kB (UE) 1*8kB (M) 4*16kB (UME) 1*32kB (E) 2*64kB (UE) 3*128kB (UME) 2*256kB (UM) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7280kB
[  489.995142] Node 0 DMA32: 578*4kB (UME) 726*8kB (UE) 447*16kB (UE) 253*32kB (UME) 155*64kB (UME) 42*128kB (UME) 3*256kB (UME) 2*512kB (UM) 4*1024kB (U) 0*2048kB 0*4096kB = 44552kB
[  490.000511] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  490.002914] 87434 total pagecache pages
[  490.004612] 0 pages in swap cache
[  490.006138] Swap cache stats: add 0, delete 0, find 0/0
[  490.007976] Free swap  = 0kB
[  490.009329] Total swap = 0kB
[  490.011033] 524157 pages RAM
[  490.012352] 0 pages HighMem/MovableOnly
[  490.013903] 76615 pages reserved
[  490.015260] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

  $ ./a.out

---------- When there is no data to write ----------
[  792.359024] Mem-Info:
[  792.360001] active_anon:413751 inactive_anon:6226 isolated_anon:0
[  792.360001]  active_file:0 inactive_file:0 isolated_file:0
[  792.360001]  unevictable:0 dirty:0 writeback:0 unstable:0
[  792.360001]  slab_reclaimable:1243 slab_unreclaimable:3638
[  792.360001]  mapped:104 shmem:6236 pagetables:1033 bounce:0
[  792.360001]  free:12965 free_pcp:126 free_cma:0
[  792.368559] Node 0 DMA free:7292kB min:400kB low:500kB high:600kB active_anon:7040kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:160kB slab_reclaimable:24kB slab_unreclaimable:172kB kernel_stack:64kB pagetables:460kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[  792.378240] lowmem_reserve[]: 0 1729 1729 1729
[  792.379834] Node 0 DMA32 free:44568kB min:44652kB low:55812kB high:66976kB active_anon:1647964kB inactive_anon:24744kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:0kB writeback:0kB mapped:416kB shmem:24784kB slab_reclaimable:4948kB slab_unreclaimable:14380kB kernel_stack:3104kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:504kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[  792.390085] lowmem_reserve[]: 0 0 0 0
[  792.391643] Node 0 DMA: 3*4kB (UE) 0*8kB 3*16kB (UE) 24*32kB (ME) 11*64kB (UME) 5*128kB (UM) 2*256kB (ME) 3*512kB (ME) 1*1024kB (E) 1*2048kB (E) 0*4096kB = 7292kB
[  792.396201] Node 0 DMA32: 242*4kB (UME) 386*8kB (UME) 397*16kB (UME) 199*32kB (UE) 105*64kB (UME) 37*128kB (UME) 24*256kB (UME) 20*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 44616kB
[  792.401136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  792.403356] 6250 total pagecache pages
[  792.404803] 0 pages in swap cache
[  792.406208] Swap cache stats: add 0, delete 0, find 0/0
[  792.407896] Free swap  = 0kB
[  792.409172] Total swap = 0kB
[  792.410460] 524157 pages RAM
[  792.411752] 0 pages HighMem/MovableOnly
[  792.413106] 76615 pages reserved
[  792.414493] 0 pages hwpoisoned
---------- When there is no data to write ----------

> HOWEVER.
> 
> I think that it's very true that we have then tuned all our *other*
> heuristics for taking this thing into account, so I suspect that we'll
> find that we'll need to tweak other places. But this crazy "let's say
> that we made progress even when we didn't" thing looks just wrong.
> 
> In particular, I think that you'll find that you will have to change
> the heuristics in __alloc_pages_slowpath() where we currently do
> 
>         if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
> 
> when the "did_some_progress" logic changes that radically.
> 

Yes. But we can't simply do

	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..

because we won't be able to call out_of_memory(), can we?

> Because while the current return value looks insane, all the other
> testing and tweaking has been done with that very odd return value in
> place.
> 
>                 Linus
> 

Well, did I encounter a difficult-to-fix problem?


* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25                                     ` Silent hang up caused by pages being not scanned? Tetsuo Handa
  2015-10-12 21:23                                       ` Linus Torvalds
@ 2015-10-13 13:32                                       ` Michal Hocko
  2015-10-13 16:19                                         ` Tetsuo Handa
  1 sibling, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-13 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange is that the values printed by this debug printk() patch did
> not change as time went by. Thus, I think that this is not a problem of lack
> of CPU time for scanning pages. I suspect that there is a bug whereby nobody
> is actually scanning pages.
> 
> ----------
> [   66.821450] zone_reclaimable returned 1 at line 2646
> [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [   66.824935] shrink_zones returned 1 at line 2706
> [   66.826392] zones_reclaimable=1 at line 2765
> [   66.827865] do_try_to_free_pages returned 1 at line 2938
> [   67.102322] __perform_reclaim returned 1 at line 2854
> [   67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [  281.439977] zone_reclaimable returned 1 at line 2646
> [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [  281.439978] shrink_zones returned 1 at line 2706
> [  281.439978] zones_reclaimable=1 at line 2765
> [  281.439979] do_try_to_free_pages returned 1 at line 2938
> [  281.439979] __perform_reclaim returned 1 at line 2854
> [  281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this low
we should eventually scan them enough times to convince zone_reclaimable
to fail. PAGES_SCANNED in your logs seems to be constant, though, which
suggests somebody manages to free a page every time before we get down
to priority 0 and finally manage to scan something. This is pretty much
pathological behavior and I have a hard time imagining how that would be
possible, but it clearly shows that the zone_reclaimable heuristic is not
working properly.

I can see two options here. Either we teach zone_reclaimable to be less
fragile or remove zone_reclaimable from shrink_zones altogether. Both of
them are risky because we have a long history of changes in this area
which made other subtle behavior changes, but I guess that the first
option should be less fragile. What about the following patch? I am not
happy about it because the condition is rather rough and a deeper
inspection is really needed to check all the call sites, but it should be
good for testing.
--- 


* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 13:32                                       ` Michal Hocko
@ 2015-10-13 16:19                                         ` Tetsuo Handa
  2015-10-14 13:22                                           ` Michal Hocko
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-13 16:19 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> I can see two options here. Either we teach zone_reclaimable to be less
> fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> them are risky because we have a long history of changes in this area
> which made other subtle behavior changes, but I guess that the first
> option should be less fragile. What about the following patch? I am not
> happy about it because the condition is rather rough and a deeper
> inspection is really needed to check all the call sites, but it should be
> good for testing.

While zone_reclaimable() for Node 0 DMA32 became false with your patch,
zone_reclaimable() for Node 0 DMA kept returning true, and as a result
the overall result (i.e. zones_reclaimable) remained true.

  $ ./a.out

---------- When there is no data to write ----------
[  162.942371] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=16
[  162.944541] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  162.946560] zone_reclaimable returned 1 at line 2665
[  162.948722] shrink_zones returned 1 at line 2716
(...snipped...)
[  164.897587] zones_reclaimable=1 at line 2775
[  164.899172] do_try_to_free_pages returned 1 at line 2948
[  167.087119] __perform_reclaim returned 1 at line 2854
[  167.088868] did_some_progress=1 at line 3301
(...snipped...)
[  261.577944] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  261.580093] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  261.582333] zone_reclaimable returned 1 at line 2665
[  261.583841] shrink_zones returned 1 at line 2716
(...snipped...)
[  264.728434] zones_reclaimable=1 at line 2775
[  264.730002] do_try_to_free_pages returned 1 at line 2948
[  268.191368] __perform_reclaim returned 1 at line 2854
[  268.193113] did_some_progress=1 at line 3301
---------- When there is no data to write ----------

Complete log (with your patch inside) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151014.txt.xz .

By the way, the OOM killer seems to be invoked prematurely for a different load
if your patch is applied.

  $ cat < /dev/zero > /tmp/log & sleep 10; ./a.out

---------- When there is a lot of data to write ----------
[   69.019271] Mem-Info:
[   69.019755] active_anon:335006 inactive_anon:2084 isolated_anon:23
[   69.019755]  active_file:12197 inactive_file:65310 isolated_file:31
[   69.019755]  unevictable:0 dirty:533 writeback:51020 unstable:0
[   69.019755]  slab_reclaimable:4753 slab_unreclaimable:4134
[   69.019755]  mapped:9639 shmem:2144 pagetables:2030 bounce:0
[   69.019755]  free:12972 free_pcp:45 free_cma:0
[   69.026260] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5232kB inactive_anon:96kB active_file:424kB inactive_file:1068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:164kB writeback:972kB mapped:416kB shmem:104kB slab_reclaimable:304kB slab_unreclaimable:244kB kernel_stack:96kB pagetables:256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[   69.037189] lowmem_reserve[]: 0 1729 1729 1729
[   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   69.052017] lowmem_reserve[]: 0 0 0 0
[   69.053818] Node 0 DMA: 17*4kB (UME) 8*8kB (UME) 6*16kB (UME) 2*32kB (UM) 2*64kB (UE) 4*128kB (UME) 1*256kB (U) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7332kB
[   69.059597] Node 0 DMA32: 632*4kB (UME) 454*8kB (UME) 507*16kB (UME) 310*32kB (UME) 177*64kB (UE) 61*128kB (UME) 15*256kB (ME) 19*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 67136kB
[   69.065810] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   69.068305] 72477 total pagecache pages
[   69.069932] 0 pages in swap cache
[   69.071435] Swap cache stats: add 0, delete 0, find 0/0
[   69.073354] Free swap  = 0kB
[   69.074822] Total swap = 0kB
[   69.076660] 524157 pages RAM
[   69.078113] 0 pages HighMem/MovableOnly
[   69.079930] 76615 pages reserved
[   69.081406] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------


* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 12:21                                         ` Tetsuo Handa
@ 2015-10-13 16:37                                           ` Linus Torvalds
  2015-10-14 12:21                                             ` Tetsuo Handa
  2015-10-15 13:14                                             ` Michal Hocko
  0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-10-13 16:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable?  Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there is so much memory which can be
> reclaimed after being written to disk. This is definitely a premature
> invocation of the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may be not able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.

But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heuristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

      if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

 - regardless of whether we made progress or not:

      if (retry count < X) goto retry;

      if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
goto retry

   where 'X' is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

    unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

    if (time_after(jiffies, timeout)) goto noretry;

(or make *that* trigger the oom logic, or whatever).
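
(For illustration only, a minimal sketch of that retry structure; the
identifiers here, e.g. MAX_RETRIES and the noretry label, are invented
names rather than existing mm/page_alloc.c code:)

----------
	int retries = 0;
	unsigned long timeout = jiffies + HZ/4;	/* hard cap per allocation */

retry:
	did_some_progress = __perform_reclaim(gfp_mask, order, ac);

	/* costly orders give up as soon as reclaim stops making progress */
	if (!did_some_progress && order > PAGE_ALLOC_COSTLY_ORDER)
		goto noretry;
	/* absolute deadline: stop retrying (or go OOM) after ~250ms */
	if (time_after(jiffies, timeout))
		goto noretry;
	if (retries++ < MAX_RETRIES)
		goto retry;
	if (retries < 2 * MAX_RETRIES) {
		/* stop burning CPU and give kswapd a chance to run */
		schedule_timeout_uninterruptible(HZ / 100);	/* ~10ms */
		goto retry;
	}
noretry:
	/* fall through to the OOM / allocation-failure path */
----------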

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

                    Linus


* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37                                           ` Linus Torvalds
@ 2015-10-14 12:21                                             ` Tetsuo Handa
  2015-10-15 13:14                                             ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-14 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > If I remove
> >
> >         /* Any of the zones still reclaimable?  Don't OOM. */
> >         if (zones_reclaimable)
> >                 return 1;
> >
> > the OOM killer is invoked even when there is so much memory which can be
> > reclaimed after being written to disk. This is definitely a premature
> > invocation of the OOM killer.
> 
> Right. The rest of the code knows that the return value right now
> means "there is no memory at all" rather than "I made progress".
> 
> > Yes. But we can't simply do
> >
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> >
> > because we won't be able to call out_of_memory(), can we?
> 
> So I think that whole thing is kind of senseless. Not just that
> particular conditional, but what it *does* too.
> 
> What can easily happen is that we are a blocking allocation, but
> because we're __GFP_FS or something, the code doesn't actually start
> writing anything out. Nor is anything congested. So the thing just
> loops.

congestion_wait() sounds like a source of silent hang ups.
http://lkml.kernel.org/r/201406052145.CIB35534.OQLVMSJFOHtFOF@I-love.SAKURA.ne.jp

> 
> And looping is stupid, because we may be not able to actually free
> anything exactly because of limitations like __GFP_FS.
> 
> So
> 
>  (a) the looping condition is senseless
> 
>  (b) what we do when looping is senseless
> 
> and we actually do try to wake up kswapd in the loop, but we never
> *wait* for it, so that's largely pointless too.

Aren't we waiting for kswapd forever?
In other words, we never check whether kswapd can make some progress.
http://lkml.kernel.org/r/20150812091104.GA14940@dhcp22.suse.cz

> 
> So *of*course* the direct reclaim code has to set "I made progress",
> because if it doesn't lie and say so, then the code will randomly not
> loop, and will oom, and things go to hell.
> 
> But I hate the "let's tweak the zone_reclaimable" idea, because it
> doesn't actually fix anything. It just perpetuates this "the code
> doesn't make sense, so let's add *more* senseless heuristics to this
> whole loop".

I also don't think that tweaking the current reclaim logic solves the bugs
which bothered me via unexplained hangups / reboots.
To me, the current memory allocator is so puzzling that it is as if

   if (there_is_much_free_memory() == TRUE)
       goto OK;
   if (do_some_heuristic1() == SUCCESS)
       goto OK;
   if (do_some_heuristic2() == SUCCESS)
       goto OK;
   if (do_some_heuristic3() == SUCCESS)
       goto OK;
   (...snipped...)
   if (do_some_heuristicN() == SUCCESS)
       goto OK;
   while (1);

and we don't know how many heuristics we need to add in order to avoid
reaching the "while (1);". (We are reaching the "while (1);" before

   if (out_of_memory() == SUCCESS)
       goto OK;

is called.)

> 
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.
> 
> I'd suggest something like:
> 
>  - add a "retry count"
> 
>  - if direct reclaim made no progress, or made less progress than the target:
> 
>       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

Yes.

> 
>  - regardless of whether we made progress or not:
> 
>       if (retry count < X) goto retry;
> 
>       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
> goto retry

I tried sleeping to reduce CPU usage and to allow reporting via SysRq-w.
http://lkml.kernel.org/r/201411231353.BDE90173.FQOMJtHOLVFOFS@I-love.SAKURA.ne.jp

I complained at http://lkml.kernel.org/r/201502162023.GGE26089.tJOOFQMFFHLOVS@I-love.SAKURA.ne.jp

| Oh, why every thread trying to allocate memory has to repeat
| the loop that might defer somebody who can make progress if CPU time was
| given? I wish only somebody like kswapd repeats the loop on behalf of all
| threads waiting at memory allocation slowpath...

Direct reclaim can defer termination upon SIGKILL if blocked on an
unkillable lock. If performance were not a problem, would direct reclaim
be mandatory?

Of course, performance is the problem. Thus we would try direct reclaim
at least once. But I wish the memory allocation logic were as simple as
the following (a rough sketch of this scheme follows the list):

  (1) If there are enough free memory, allocate it.

  (2) If there are not enough free memory, join on the
      waitqueue list

        wait_event_timeout(waiter, memory_reclaimed, timeout)

      and wait for reclaiming kernel threads (e.g. kswapd) to wake
      the waiters up. If the caller is willing to give up upon SIGKILL
      (e.g. __GFP_KILLABLE) then

        wait_event_killable_timeout(waiter, memory_reclaimed, timeout)

      and return NULL upon SIGKILL.

  (3) Whenever reclaiming kernel threads reclaimed memory and there are
      waiters, wake the waiters up.

  (4) If reclaiming kernel threads cannot reclaim memory,
      the caller will wake up due to timeout, and invoke the OOM
      killer unless the caller does not want (e.g. __GFP_NO_OOMKILL).
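
A rough sketch of steps (1)-(4) (everything here is hypothetical:
__GFP_KILLABLE and __GFP_NO_OOMKILL do not exist, and the helper
functions are invented names):

----------
static DECLARE_WAIT_QUEUE_HEAD(memalloc_waiters);
static bool memory_reclaimed;

struct page *alloc_pages_simple(gfp_t gfp, unsigned int order)
{
	struct page *page;

	for (;;) {
		/* (1) plenty of free memory: just take it */
		page = try_alloc_from_freelist(gfp, order);
		if (page)
			return page;
		/* (2) otherwise wait for reclaiming kernel threads */
		memory_reclaimed = false;
		if (gfp & __GFP_KILLABLE) {
			if (wait_event_killable_timeout(memalloc_waiters,
					memory_reclaimed, HZ) < 0)
				return NULL;	/* give up upon SIGKILL */
		} else {
			wait_event_timeout(memalloc_waiters,
					memory_reclaimed, HZ);
		}
		/* (4) no progress within the timeout: invoke the OOM
		   killer unless the caller opted out */
		if (!memory_reclaimed && !(gfp & __GFP_NO_OOMKILL))
			invoke_oom_killer();
	}
}

/* (3) kswapd and friends, whenever they reclaim memory, do:
 *	memory_reclaimed = true;
 *	wake_up_all(&memalloc_waiters);
 */
----------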

> 
>    where 'X' is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).

Isn't a second too short for waiting for swapping / writeback?

> 
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
> 
>     unsigned long timeout = jiffies + HZ/4;
> 
> at the top of the function, and making the whole retry logic actually
> say something like
> 
>     if (time_after(jiffies, timeout)) goto noretry;
> 
> (or make *that* trigger the oom logic, or whatever).

I prefer the time-based approach, for my customer's usage (where the
watchdog timeout is configured to 10 seconds) requires kernel messages
(maybe OOM killer messages) to be printed within a few seconds.

> 
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?
> 
>                     Linus

Yes, these will be big changes. But they will be better than living
with "no means for understanding what was happening are available" versus
"really interesting things are observed if means are available".


* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:19                                         ` Tetsuo Handa
@ 2015-10-14 13:22                                           ` Michal Hocko
  2015-10-14 14:38                                             ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-14 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this areas
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
> 
> While zone_reclaimable() for Node 0 DMA32 became false with your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> the overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on
LRUs while it is still protected from allocations which are not
targeted for this zone. My patch clearly hasn't considered that. The
fix for that would be quite straightforward. We have to consider the
lowmem_reserve of the zone wrt. the allocation/reclaim gfp target
zone. But this is getting more and more ugly (see the patch below just
for testing/demonstration purposes).
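
(For context, the protection referred to above, roughly: in
mm/page_alloc.c:__zone_watermark_ok() a zone fails the watermark check
for an allocation unless its free pages exceed the watermark plus the
lowmem_reserve entry for the allocation's classzone -- a simplified
sketch, not the exact source:)

----------
	/* simplified from __zone_watermark_ok() */
	if (free_pages - (1UL << order) + 1 <=
	    min + z->lowmem_reserve[classzone_idx])
		return false;	/* zone stays reserved for lower zones */
----------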

The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRUs are either dirty or under writeback and
reclaimable pages are below min wmark. This alone is quite suspicious.
Why hasn't balance_dirty_pages throttled the writers instead of allowing
them to make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
configuration on that system?

Also, why hasn't throttle_vm_writeout slowed the reclaim down?

Anyway this is exactly the case where zone_reclaimable helps us to
prevent OOM because we are looping over the remaining LRU pages without
making progress... This just shows how subtle all this is :/

I have to think about this much more..
---


* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 13:22                                           ` Michal Hocko
@ 2015-10-14 14:38                                             ` Tetsuo Handa
  2015-10-14 14:59                                               ` Michal Hocko
  0 siblings, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-14 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The OOM report is really interesting:
> 
> > [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> 
> so your whole file LRUs are either dirty or under writeback and
> reclaimable pages are below min wmark. This alone is quite suspicious.

I did

  $ cat < /dev/zero > /tmp/log

for 10 seconds before starting

  $ ./a.out

Thus, a lot of memory was waiting for writeback on the XFS filesystem.

> Why hasn't balance_dirty_pages throttled the writers instead of allowing
> them to make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> configuration on that system?

All values are the defaults of a plain CentOS 7 installation.

# sysctl -a | grep ^vm.
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256   256     32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 45056
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 54808
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

> 
> Also, why hasn't throttle_vm_writeout slowed the reclaim down?

That question is too difficult for me.

> 
> Anyway this is exactly the case where zone_reclaimable helps us to
> prevent OOM because we are looping over the remaining LRU pages without
> making progress... This just shows how subtle all this is :/
> 
> I have to think about this much more..

I'm suspicious about tweaking the current reclaim logic.
Could you please respond to Linus's comments?

There are more moles than kernel developers can find. I think that
what we can do for short term is to prepare for moles that kernel
developers could not find, and for long term is to reform page
allocator for preventing moles from living.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:38                                             ` Tetsuo Handa
@ 2015-10-14 14:59                                               ` Michal Hocko
  2015-10-14 15:06                                                 ` Tetsuo Handa
  0 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-14 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > configuration on that system?
> 
> All values are the defaults of a plain CentOS 7 installation.

So this is a 3.10 kernel, right?

> # sysctl -a | grep ^vm.
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
[...]

OK, this is nothing unusual. And I _suspect_ that the throttling
simply didn't cope with the writer speed plus a large anon memory
consumer. Dirtyable memory was quite high until your anon hammer
kicked in and shrank the dirtyable memory, so the file LRU was full of
dirty pages by the time we came under serious memory pressure.
Anonymous pages are not reclaimable (there is no swap), so the whole
memory pressure goes to the file LRUs, and bang.
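
Rough arithmetic from the OOM report above is consistent with this
(back-of-the-envelope only, ignoring the finer points of the
dirtyable-memory calculation): at OOM time the dirtyable estimate was
roughly free + file LRU = 74224kB + 48364kB + 230752kB ~= 353000kB, so
dirty_ratio=30 would allow only ~106000kB of dirty memory, yet dirty +
writeback stood at 9328kB + 199060kB ~= 208000kB - about twice that
limit. That is only possible if those pages were dirtied earlier,
while the anon consumer had not yet eaten into the dirtyable memory.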

> > Also, why hasn't throttle_vm_writeout slowed the reclaim down?
> 
> That question is too difficult for me.
> 
> > 
> > Anyway, this is exactly the case where zone_reclaimable helps us
> > prevent OOM, because we are looping over the remaining LRU pages
> > without making progress... This just shows how subtle all this is :/
> >
> > I have to think about this much more...
> 
> I'm wary of tweaking the current reclaim logic.
> Could you please respond to Linus's comments?

Yes, I plan to; I just haven't finished that email yet.
 
> There are more moles than kernel developers can find. I think that
> what we can do in the short term is to prepare for the moles that
> kernel developers could not find, and in the long term to reform the
> page allocator so that such moles cannot live there in the first place.

This is much easier said than done :/ The current code is full of
heuristics grown over time, based on very different requirements from
different kernel subsystems. There is no simple solution for this
problem, I am afraid.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:59                                               ` Michal Hocko
@ 2015-10-14 15:06                                                 ` Tetsuo Handa
  0 siblings, 0 replies; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-14 15:06 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > > configuration on that system?
> > 
> > All values are the defaults of a plain CentOS 7 installation.
> 
> So this is a 3.10 kernel, right?

The userland is CentOS 7, but the kernel is linux-next-20151009.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37                                           ` Linus Torvalds
  2015-10-14 12:21                                             ` Tetsuo Handa
@ 2015-10-15 13:14                                             ` Michal Hocko
  2015-10-16 15:57                                               ` Michal Hocko
  1 sibling, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-15 13:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

[CC Mel and Rik as well - this has diverged from the original thread
 considerably but the current topic started here:
 http://lkml.kernel.org/r/201510130025.EJF21331.FFOQJtVOMLFHSO%40I-love.SAKURA.ne.jp
]

On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.

I do agree that zone_reclaimable is a subtle and hackish way to wait
for writeback/kswapd to clean up pages which cannot be reclaimed by
direct reclaim.
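
For reference, the heuristic in question is essentially just this
check from mm/vmscan.c (quoted from memory for kernels of this era, so
treat it as approximate):

static bool zone_reclaimable(struct zone *zone)
{
	/* Give up only after scanning 6x the reclaimable pages. */
	return zone_page_state(zone, NR_PAGES_SCANNED) <
		zone_reclaimable_pages(zone) * 6;
}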

> I'd suggest something like:
> 
>  - add a "retry count"
> 
>  - if direct reclaim made no progress, or made less progress than the target:
> 
>       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
> 
>  - regardless of whether we made progress or not:
> 
>       if (retry count < X) goto retry;
> 
>       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
> goto retry
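
For concreteness, the quoted proposal rendered as code might look
roughly like the untested sketch below (X, the bound and the exact
sleep primitive are placeholders, not mainline code):

static bool should_retry_reclaim(unsigned int order, bool progress,
				 unsigned int *retries)
{
	const unsigned int X = 16;	/* placeholder bound */

	if (!progress && order > PAGE_ALLOC_COSTLY_ORDER)
		return false;		/* "goto noretry" */

	if ((*retries)++ < X)
		return true;		/* plain retry */

	if (*retries < 2 * X) {
		/* yield / sleep ~10ms / wait for kswapd, then retry */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(msecs_to_jiffies(10));
		return true;
	}

	return false;			/* out of retries: consider OOM */
}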

This will certainly cap the reclaim retries, but there are risks with
this approach afaics.

First of all, other allocators might piggyback on the current
reclaimer and push it to the OOM killer even when we are not really
OOM. Maybe this is possible currently as well, but it is less likely,
because NR_PAGES_SCANNED is reset whenever a page is freed, which
grants the reclaimer another round.

I am also not sure it would help with pathological cases like the
one discussed here. If you have only a small amount of reclaimable
memory on the LRU lists, then you scan it quite quickly, which will
consume the retries. Maybe a sufficient timeout can help, but I am
afraid we can still hit the OOM killer prematurely while a large part
of the memory is still under writeback (possibly to a slow device -
e.g. a USB stick).

We used to have this kind of problem in memcg reclaim. We do not
have (resp. didn't have until recently, with CONFIG_CGROUP_WRITEBACK)
dirty memory throttling for memory cgroups, so the LRU can become full
of dirty data really quickly, and that led to memcg OOM kills.
We do not apply zone_reclaimable and the other heuristics there, so we
had to call wait_on_page_writeback explicitly in the reclaim path to
prevent a premature OOM killer. An ugly hack, but the only thing that
worked reliably. Time-based solutions were tried and failed with
different workloads, quite randomly depending on the load/storage.
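
For illustration, that hack boils down to something like the following
fragment in the reclaim path (an untested sketch, not the exact
mainline code; sc, the page locking and the keep_locked label are
assumed shrink_page_list() context):

	if (PageWriteback(page)) {
		if (global_reclaim(sc)) {
			/*
			 * Global reclaim can rely on dirty throttling
			 * and kswapd, so do not stall here.
			 */
			nr_writeback++;
			goto keep_locked;
		}
		/*
		 * Memcg reclaim has no dirty throttling, so the LRU may
		 * be full of writeback pages; waiting here is the only
		 * reliable way to avoid a premature memcg OOM.
		 */
		wait_on_page_writeback(page);
	}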

>    where 'X' is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).
> 
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
> 
>     unsigned long timeout = jiffies + HZ/4;
> 
> at the top of the function, and making the whole retry logic actually
> say something like
> 
>     if (time_after(timeout, jiffies)) goto noretry;
> 
> (or make *that* trigger the oom logic, or whatever).
> 
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?

Yes, zone_reclaimable is subtle, and IMHO it is also used at the
wrong level. We should decide whether we are really OOM in
__alloc_pages_slowpath. We definitely need big-picture logic to tell
us when it makes sense to drop the ball and trigger the OOM killer or
fail the allocation request.

E.g. free + reclaimable + writeback < min_wmark on all usable zones for
more than X rounds of direct reclaim without any progress is a
sufficient signal to go OOM. Costly/noretry allocations can fail
earlier, of course. This is obviously a half-baked idea which needs
much more consideration; all I am trying to say is that we need a
high-level metric to detect the OOM condition.
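
As a completely untested sketch (the function, the retry bound and the
way no_progress_loops would be maintained are placeholders, not an
actual patch):

#define MAX_NO_PROGRESS_LOOPS	16	/* the "X rounds" above */

static bool should_go_oom(struct alloc_context *ac, unsigned int order,
			  int no_progress_loops)
{
	struct zoneref *z;
	struct zone *zone;

	if (no_progress_loops < MAX_NO_PROGRESS_LOOPS)
		return false;

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		/* free + reclaimable + writeback, as described above */
		unsigned long available;

		available = zone_page_state(zone, NR_FREE_PAGES) +
			    zone_reclaimable_pages(zone) +
			    zone_page_state(zone, NR_WRITEBACK);

		/* Some zone could still satisfy the request: retry. */
		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac->high_zoneidx, 0, available))
			return false;
	}

	return true;
}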

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-15 13:14                                             ` Michal Hocko
@ 2015-10-16 15:57                                               ` Michal Hocko
  2015-10-16 18:34                                                 ` Linus Torvalds
  0 siblings, 1 reply; 109+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
> 
> Yes, zone_reclaimable is subtle, and IMHO it is also used at the
> wrong level. We should decide whether we are really OOM in
> __alloc_pages_slowpath. We definitely need big-picture logic to tell
> us when it makes sense to drop the ball and trigger the OOM killer or
> fail the allocation request.
> 
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is a
> sufficient signal to go OOM. Costly/noretry allocations can fail
> earlier, of course. This is obviously a half-baked idea which needs
> much more consideration; all I am trying to say is that we need a
> high-level metric to detect the OOM condition.

OK, so here is what I am playing with currently. It is not complete
yet. Anyway, I have tested it with two scenarios on a swapless system
with 2G of RAM; both run the following writers in the background:

$ cat writer.sh
#!/bin/sh
# Two endless writers: each repeatedly writes a 1G file in 4k blocks,
# removes it and syncs, keeping up a steady stream of dirty pages.
size=$((1<<30))
block=$((4<<10))

writer()
{
	(
        while true
        do
                dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
                rm /mnt/data/file.$1
                sync
        done
	) &
}

writer 1
writer 2

sleep 10s # allow enough dirty pages to accumulate

1) massive OOM
Start 100 memeaters, 80M each, running in parallel (anonymous private
MAP_POPULATE mappings). This triggers many OOM kills, and the overall
count is what I was interested in. The test is considered finished when
we reach a steady state - the writers can make progress and there is no
more OOM killing for some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we triggered fewer OOM kills with the patch applied.
I haven't checked those too closely, but it seems that at least two
instances might not have triggered with the current implementation,
because the DMA32 zone is considered reclaimable there. This check is
inherently racy, though, so we cannot be sure.
$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost-OOM situation
Invoke 10 memeaters in parallel and try to fill up all the memory
without triggering the OOM killer. This is quite hard, and it required
a lot of tuning. I ended up with:
#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
# Each eater gets a tenth of the free memory minus 16M, so ten of them
# consume almost all free memory without quite tipping into OOM.
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
        memcg_test/tools/mem_eater $size &
done

wait

and this one doesn't hit the OOM killer with the original
implementation, while it does with the patch applied:
[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184k
B unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB
 shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_p
cp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory under writeback, but all_unreclaimable is
yes, so who knows - maybe it is just a coincidence that we haven't
triggered OOM on the original kernel.

Anyway, the two implementations will be hard to compare because the
workloads are very different, but I think something like the below
should be more readable and deterministic than what we have right now.
It will need some more tuning for sure, and I will keep playing with
it. I would just like to hear opinions on whether this approach makes
sense. If so, I will post it separately in a new thread for a wider
discussion; this email thread is full of detours already.
---

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 15:57                                               ` Michal Hocko
@ 2015-10-16 18:34                                                 ` Linus Torvalds
  2015-10-16 18:49                                                   ` Tetsuo Handa
  2015-10-19 12:53                                                   ` Michal Hocko
  0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2015-10-16 18:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> OK so here is what I am playing with currently. It is not complete
> yet.

So this looks like it's going in a reasonable direction. However:

> +               if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +                               ac->high_zoneidx, alloc_flags, target)) {
> +                       /* Wait for some write requests to complete then retry */
> +                       wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +                       goto retry;
> +               }

I still think we should at least spend some time re-thinking that
"wait_iff_congested()" thing. We may not actually be congested, but
we might be unable to write anything out because of our allocation
flags (i.e. not allowed to recurse into the filesystems), so we might
be in the situation where we have a lot of dirty pages that we can't
directly do anything about.

Now, we will have woken kswapd, so something *will* hopefully be done
about them eventually, but at no time do we actually really wait for
it. We'll just busy-loop.

So at a minimum, I think we should yield to kswapd. We do do that
"cond_resched()" in wait_iff_congested(), but I'm not entirely
convinced that is at all enough to wait for kswapd to *do* something.

So before we really decide whether we should oom, I think we should
have at least one forced io_schedule_timeout(), whether we're
congested or not.
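
Something as dumb as the following (untested, just to show the shape
of it) might already be better than busy-looping:

	/*
	 * Always sleep once before concluding OOM, so that kswapd and
	 * the flusher threads actually get some CPU time - whether or
	 * not the bdi happens to be marked congested (in which case
	 * wait_iff_congested() may not sleep at all).
	 */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(HZ / 50);	/* 20ms */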

And yes, as Tetsuo Handa said, any kind of short wait might be too
short for IO to really complete, but *something* will have completed.
Unless we're so far up the creek that we really should just oom.

But I suspect we'll have to just try things out and tweak it. This
patch looks like a reasonable starting point to me.

Tetsuo, mind trying it out and maybe tweaking it a bit for the load
you have? Does it seem to improve on your situation?

                   Linus


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34                                                 ` Linus Torvalds
@ 2015-10-16 18:49                                                   ` Tetsuo Handa
  2015-10-19 12:57                                                     ` Michal Hocko
  2015-10-19 12:53                                                   ` Michal Hocko
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-16 18:49 UTC (permalink / raw)
  To: torvalds, mhocko
  Cc: rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina, mgorman, riel

Linus Torvalds wrote:
> Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> you have? Does it seem to improve on your situation?

Yes, I already tried it and just replied to Michal.

I tested for one hour using various memory-stressing programs.
As far as I tested, I did not hit a silent hang up (

 MemAlloc-Info: X stalling task, 0 dying task, 0 victim task.

where X > 0).

----------------------------------------
[  134.510993] Mem-Info:
[  134.511940] active_anon:408777 inactive_anon:2088 isolated_anon:24
[  134.511940]  active_file:15 inactive_file:24 isolated_file:0
[  134.511940]  unevictable:0 dirty:4 writeback:1 unstable:0
[  134.511940]  slab_reclaimable:3109 slab_unreclaimable:5594
[  134.511940]  mapped:679 shmem:2156 pagetables:2077 bounce:0
[  134.511940]  free:12911 free_pcp:31 free_cma:0
[  134.521256] Node 0 DMA free:7256kB min:400kB low:500kB high:600kB active_anon:6560kB inactive_anon:180kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:184kB slab_reclaimable:236kB slab_unreclaimable:296kB kernel_stack:48kB pagetables:556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  134.532779] lowmem_reserve[]: 0 1714 1714 1714
[  134.534455] Node 0 DMA32 free:44388kB min:44652kB low:55812kB high:66976kB active_anon:1628548kB inactive_anon:8172kB active_file:60kB inactive_file:96kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:16kB writeback:4kB mapped:2636kB shmem:8440kB slab_reclaimable:12200kB slab_unreclaimable:22080kB kernel_stack:3584kB pagetables:7752kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1016 all_unreclaimable? yes
[  134.545830] lowmem_reserve[]: 0 0 0 0
[  134.547404] Node 0 DMA: 16*4kB (UME) 16*8kB (UME) 10*16kB (UME) 6*32kB (UME) 1*64kB (M) 2*128kB (UE) 1*256kB (M) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7264kB
[  134.552766] Node 0 DMA32: 1158*4kB (UME) 638*8kB (UE) 244*16kB (UME) 163*32kB (UE) 73*64kB (UE) 34*128kB (UME) 17*256kB (UME) 10*512kB (UME) 7*1024kB (UM) 0*2048kB 0*4096kB = 44520kB
[  134.558111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  134.560358] 2195 total pagecache pages
[  134.562043] 0 pages in swap cache
[  134.563604] Swap cache stats: add 0, delete 0, find 0/0
[  134.565441] Free swap  = 0kB
[  134.567015] Total swap = 0kB
[  134.568628] 524157 pages RAM
[  134.570034] 0 pages HighMem/MovableOnly
[  134.571681] 80368 pages reserved
[  134.573467] 0 pages hwpoisoned
----------------------------------------

The only problem I noticed is that the ratio of inactive_file to
writeback (shown below) was high (compared to the output shown above)
when I did

  $ cat < /dev/zero > /tmp/file1 & cat < /dev/zero > /tmp/file2 & cat < /dev/zero > /tmp/file3 & sleep 10; ./a.out; killall cat

but I think this patch is better than the current code.

----------------------------------------
[ 1135.909600] Mem-Info:
[ 1135.910686] active_anon:321011 inactive_anon:4664 isolated_anon:0
[ 1135.910686]  active_file:3170 inactive_file:78035 isolated_file:512
[ 1135.910686]  unevictable:0 dirty:0 writeback:78618 unstable:0
[ 1135.910686]  slab_reclaimable:5739 slab_unreclaimable:6170
[ 1135.910686]  mapped:4666 shmem:8300 pagetables:1966 bounce:0
[ 1135.910686]  free:12938 free_pcp:0 free_cma:0
[ 1135.925255] Node 0 DMA free:7232kB min:400kB low:500kB high:600kB active_anon:5852kB inactive_anon:196kB active_file:120kB inactive_file:980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:968kB mapped:248kB shmem:388kB slab_reclaimable:316kB slab_unreclaimable:272kB kernel_stack:64kB pagetables:100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7444 all_unreclaimable? yes
[ 1135.936728] lowmem_reserve[]: 0 1714 1714 1714
[ 1135.938486] Node 0 DMA32 free:44520kB min:44652kB low:55812kB high:66976kB active_anon:1278192kB inactive_anon:18460kB active_file:12560kB inactive_file:313176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:0kB writeback:313504kB mapped:18416kB shmem:32812kB slab_reclaimable:22640kB slab_unreclaimable:24408kB kernel_stack:4240kB pagetables:7764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2957668 all_unreclaimable? yes
[ 1135.950355] lowmem_reserve[]: 0 0 0 0
[ 1135.952011] Node 0 DMA: 7*4kB (U) 14*8kB (UM) 13*16kB (UM) 6*32kB (UME) 1*64kB (M) 4*128kB (UME) 2*256kB (UM) 3*512kB (UME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7260kB
[ 1135.957169] Node 0 DMA32: 241*4kB (UE) 929*8kB (UE) 496*16kB (UME) 277*32kB (UE) 135*64kB (UME) 17*128kB (UME) 3*256kB (E) 16*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 44972kB
[ 1135.963047] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1135.965472] 90009 total pagecache pages
[ 1135.967078] 0 pages in swap cache
[ 1135.968581] Swap cache stats: add 0, delete 0, find 0/0
[ 1135.970424] Free swap  = 0kB
[ 1135.971828] Total swap = 0kB
[ 1135.973248] 524157 pages RAM
[ 1135.974655] 0 pages HighMem/MovableOnly
[ 1135.976230] 80368 pages reserved
[ 1135.977745] 0 pages hwpoisoned
----------------------------------------

I can still hit the OOM livelock (

 MemAlloc-Info: X stalling task, Y dying task, Z victim task.

where X > 0 && Y > 0).


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34                                                 ` Linus Torvalds
  2015-10-16 18:49                                                   ` Tetsuo Handa
@ 2015-10-19 12:53                                                   ` Michal Hocko
  1 sibling, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-10-19 12:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
> 
> So this looks like it's going in a reasonable direction. However:
> 
> > +               if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +                               ac->high_zoneidx, alloc_flags, target)) {
> > +                       /* Wait for some write requests to complete then retry */
> > +                       wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +                       goto retry;
> > +               }
> 
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time
because of the heavy IO, but a quick test has shown that the zone is
marked congested while nr_wb_congested stays zero the whole time. That
is most probably because the IO itself is severely throttled by the
lack of memory as well.

> We may not actually be congested, but
> we might be unable to write anything out because of our allocation
> flags (i.e. not allowed to recurse into the filesystems), so we might
> be in the situation where we have a lot of dirty pages that we can't
> directly do anything about.
> 
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
> 
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait, which is what we used to do in the past
before wait_iff_congested was introduced. The primary reason for the
change was that congestion_wait used to cause unhealthy stalls in
direct reclaim when the bdi wasn't really congested, so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we did_some_progress, so we won't affect the regular
direct reclaim path, and we can restrict the sleep to the case where:

dirty + writeback > reclaimable/2

This is a good signal that the most likely reason for the lack of
progress is pending IO, and that we need to wait even if the bdi
itself is not congested. We can also increase the timeout to HZ/10,
because this is an extremely slow path - we are not making any
progress, and stalling is better than OOM.

This is a diff on top of the previous patch. I even think that this
part deserves a separate patch, for better bisectability. My testing
shows that the close-to-OOM case behaves better (I can use more memory
for the memeaters without hitting OOM).

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, target)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+				      dirty = zone_page_state(zone, NR_FILE_DIRTY);
+			if (did_some_progress)
+				goto retry;
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent a premature OOM
+			 */
+			if (2*(writeback + dirty) > reclaimable)
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else
+				cond_resched();
 			goto retry;
 		}
 	}

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:49                                                   ` Tetsuo Handa
@ 2015-10-19 12:57                                                     ` Michal Hocko
  0 siblings, 0 replies; 109+ messages in thread
From: Michal Hocko @ 2015-10-19 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina, mgorman, riel

On Sat 17-10-15 03:49:39, Tetsuo Handa wrote:
> Linus Torvalds wrote:
> > Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> > you have? Does it seem to improve on your situation?
> 
> Yes, I already tried it and just replied to Michal.
> 
> I tested for one hour using various memory stressing programs.
> As far as I tested, I did not hit silent hang up (

Thank you for your testing!

[...]

> Only problem I felt is that the ratio of inactive_file/writeback
> (shown below) was high (compared to shown above) when I did

Yes, this is the lack of congestion on the bdi, as Linus expected.
Another patch I've just posted should help in that regard. At least it
seems to help in my testing.

[...]

> I can still hit OOM livelock (
> 
>  MemAlloc-Info: X stalling task, Y dying task, Z victim task.
> 
> where X > 0 && Y > 0).

This seems to be a separate issue, though.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Newbie's question: memory allocation when reclaiming memory
  2015-10-12  6:43                                   ` Tetsuo Handa
  2015-10-12 15:25                                     ` Silent hang up caused by pages being not scanned? Tetsuo Handa
@ 2015-10-26 11:44                                     ` Tetsuo Handa
  2015-11-05  8:46                                       ` Vlastimil Babka
  1 sibling, 1 reply; 109+ messages in thread
From: Tetsuo Handa @ 2015-10-26 11:44 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

May I ask a newbie question? Say there are some memory pages which can
be reclaimed once they are flushed to storage. A lower layer might
itself issue memory allocation requests while flushing to storage, in
a way that avoids reclaim deadlock (e.g. using GFP_NOFS or GFP_NOIO),
right?

What I'm worried about is the following dependency chain: __GFP_FS
allocation requests think that there are reclaimable pages and
therefore see no need to call out_of_memory(); the GFP_NOFS allocation
requests which those __GFP_FS requests depend on (in order to flush to
storage) are waiting for GFP_NOIO allocation requests; and the GFP_NOIO
allocation requests which the GFP_NOFS requests depend on (in order to
flush to storage) are waiting for memory pages to be reclaimed, again
without calling out_of_memory() - because gfp_to_alloc_flags() does not
favor GFP_NOIO over GFP_NOFS, nor GFP_NOFS over __GFP_FS, and therefore
throttles all of these allocations at the same watermark level.

How do we guarantee that GFP_NOFS/GFP_NOIO allocations make forward
progress? What mechanism guarantees that the memory pages which the
__GFP_FS allocation requests are waiting for get reclaimed? I assume
that there is some mechanism; otherwise we could hit a silent livelock,
couldn't we?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Newbie's question: memory allocation when reclaiming memory
  2015-10-26 11:44                                     ` Newbie's question: memory allocation when reclaiming memory Tetsuo Handa
@ 2015-11-05  8:46                                       ` Vlastimil Babka
  0 siblings, 0 replies; 109+ messages in thread
From: Vlastimil Babka @ 2015-11-05  8:46 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On 10/26/2015 12:44 PM, Tetsuo Handa wrote:
> May I ask a newbie question? Say there are some memory pages which can
> be reclaimed once they are flushed to storage. A lower layer might
> itself issue memory allocation requests while flushing to storage, in
> a way that avoids reclaim deadlock (e.g. using GFP_NOFS or GFP_NOIO),
> right?
>
> What I'm worried about is the following dependency chain: __GFP_FS
> allocation requests think that there are reclaimable pages and
> therefore see no need to call out_of_memory(); the GFP_NOFS allocation
> requests which those __GFP_FS requests depend on (in order to flush to
> storage) are waiting for GFP_NOIO allocation requests; and the GFP_NOIO
> allocation requests which the GFP_NOFS requests depend on (in order to
> flush to storage) are waiting for memory pages to be reclaimed, again
> without calling out_of_memory() - because gfp_to_alloc_flags() does not
> favor GFP_NOIO over GFP_NOFS, nor GFP_NOFS over __GFP_FS, and therefore
> throttles all of these allocations at the same watermark level.
>
> How do we guarantee that GFP_NOFS/GFP_NOIO allocations make forward
> progress? What mechanism guarantees that the memory pages which the
> __GFP_FS allocation requests are waiting for get reclaimed? I assume
> that there is some mechanism; otherwise we could hit a silent livelock,
> couldn't we?

I've never studied the code myself, but IIRC in all the LSF/MM debates
I've heard it said that GFP_NOIO allocations have mempools that
guarantee forward progress. So when they allocate from such a mempool,
there should be nothing left to block the request other than waiting
for the actual hardware to finish the I/O, after which the memory is
returned to the mempool and another request can use it. So there
shouldn't be any waiting for reclaim at that level, which would break
the livelock you described?
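
As far as I understand it, the pattern is roughly the following
(illustrative only - the struct and the names are made up, not taken
from any particular driver):

#include <linux/mempool.h>
#include <linux/slab.h>

struct io_req {
	void *payload;			/* made-up per-request state */
};

static struct kmem_cache *req_cache;
static mempool_t *req_pool;

static int __init req_pool_init(void)
{
	req_cache = kmem_cache_create("io_req", sizeof(struct io_req),
				      0, 0, NULL);
	if (!req_cache)
		return -ENOMEM;
	/* Preallocate 16 elements; these guarantee forward progress. */
	req_pool = mempool_create_slab_pool(16, req_cache);
	if (!req_pool) {
		kmem_cache_destroy(req_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct io_req *get_req(void)
{
	/*
	 * With a sleeping gfp mask such as GFP_NOIO this cannot fail:
	 * if the slab allocation does not succeed, mempool_alloc()
	 * falls back to a reserved element, sleeping for one to be
	 * returned if necessary.
	 */
	return mempool_alloc(req_pool, GFP_NOIO);
}

static void put_req(struct io_req *req)
{
	/* Refills the reserve first, then frees to the slab cache. */
	mempool_free(req, req_pool);
}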



^ permalink raw reply	[flat|nested] 109+ messages in thread

end of thread, other threads:[~2015-11-05  8:46 UTC | newest]

Thread overview: 109+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-17 17:59 [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks Kyle Walker
2015-09-17 19:22 ` Oleg Nesterov
2015-09-18 15:41   ` Christoph Lameter
2015-09-18 16:24     ` Oleg Nesterov
2015-09-18 16:39       ` Tetsuo Handa
2015-09-18 16:54         ` Oleg Nesterov
2015-09-18 17:00       ` Christoph Lameter
2015-09-18 19:07         ` Oleg Nesterov
2015-09-18 19:19           ` Christoph Lameter
2015-09-18 21:28             ` Kyle Walker
2015-09-18 22:07               ` Christoph Lameter
2015-09-19  8:32         ` Michal Hocko
2015-09-19 14:33           ` Tetsuo Handa
2015-09-19 15:51             ` Michal Hocko
2015-09-21 23:33             ` David Rientjes
2015-09-22  5:33               ` Tetsuo Handa
2015-09-22 23:32                 ` David Rientjes
2015-09-23 12:03                   ` Kyle Walker
2015-09-24 11:50                     ` Tetsuo Handa
2015-09-19 14:44           ` Oleg Nesterov
2015-09-21 23:27         ` David Rientjes
2015-09-19  8:25     ` Michal Hocko
2015-09-19  8:22 ` Michal Hocko
2015-09-21 23:08   ` David Rientjes
2015-09-19 15:03 ` can't oom-kill zap the victim's memory? Oleg Nesterov
2015-09-19 15:10   ` Oleg Nesterov
2015-09-19 15:58   ` Michal Hocko
2015-09-20 13:16     ` Oleg Nesterov
2015-09-19 22:24   ` Linus Torvalds
2015-09-19 22:54     ` Raymond Jennings
2015-09-19 23:00     ` Raymond Jennings
2015-09-19 23:13       ` Linus Torvalds
2015-09-20  9:33     ` Michal Hocko
2015-09-20 13:06       ` Oleg Nesterov
2015-09-20 12:56     ` Oleg Nesterov
2015-09-20 18:05       ` Linus Torvalds
2015-09-20 18:21         ` Raymond Jennings
2015-09-20 18:23         ` Raymond Jennings
2015-09-20 19:07         ` Raymond Jennings
2015-09-21 13:57           ` Oleg Nesterov
2015-09-21 13:44         ` Oleg Nesterov
2015-09-21 14:24           ` Michal Hocko
2015-09-21 15:32             ` Oleg Nesterov
2015-09-21 16:12               ` Michal Hocko
2015-09-22 16:06                 ` Oleg Nesterov
2015-09-22 23:04                   ` David Rientjes
2015-09-23 20:59                   ` Michal Hocko
2015-09-24 21:15                     ` David Rientjes
2015-09-25  9:35                       ` Michal Hocko
2015-09-25 16:14                         ` Tetsuo Handa
2015-09-28 16:18                           ` Tetsuo Handa
2015-09-28 22:28                             ` David Rientjes
2015-10-02 12:36                             ` Michal Hocko
2015-10-02 19:01                               ` Linus Torvalds
2015-10-05 14:44                                 ` Michal Hocko
2015-10-07  5:16                                   ` Vlastimil Babka
2015-10-07 10:43                                     ` Tetsuo Handa
2015-10-08  9:40                                       ` Vlastimil Babka
2015-10-06  7:55                                 ` Eric W. Biederman
2015-10-06  8:49                                   ` Linus Torvalds
2015-10-06  8:55                                     ` Linus Torvalds
2015-10-06 14:52                                       ` Eric W. Biederman
2015-10-03  6:02                               ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
2015-10-06 14:51                                 ` Tetsuo Handa
2015-10-12  6:43                                   ` Tetsuo Handa
2015-10-12 15:25                                     ` Silent hang up caused by pages being not scanned? Tetsuo Handa
2015-10-12 21:23                                       ` Linus Torvalds
2015-10-13 12:21                                         ` Tetsuo Handa
2015-10-13 16:37                                           ` Linus Torvalds
2015-10-14 12:21                                             ` Tetsuo Handa
2015-10-15 13:14                                             ` Michal Hocko
2015-10-16 15:57                                               ` Michal Hocko
2015-10-16 18:34                                                 ` Linus Torvalds
2015-10-16 18:49                                                   ` Tetsuo Handa
2015-10-19 12:57                                                     ` Michal Hocko
2015-10-19 12:53                                                   ` Michal Hocko
2015-10-13 13:32                                       ` Michal Hocko
2015-10-13 16:19                                         ` Tetsuo Handa
2015-10-14 13:22                                           ` Michal Hocko
2015-10-14 14:38                                             ` Tetsuo Handa
2015-10-14 14:59                                               ` Michal Hocko
2015-10-14 15:06                                                 ` Tetsuo Handa
2015-10-26 11:44                                     ` Newbie's question: memory allocation when reclaiming memory Tetsuo Handa
2015-11-05  8:46                                       ` Vlastimil Babka
2015-10-06 15:25                                 ` Can't we use timeout based OOM warning/killing? Linus Torvalds
2015-10-08 15:33                                   ` Tetsuo Handa
2015-10-10 12:50                                 ` Tetsuo Handa
2015-09-28 22:24                         ` can't oom-kill zap the victim's memory? David Rientjes
2015-09-29  7:57                           ` Tetsuo Handa
2015-09-29 22:56                             ` David Rientjes
2015-09-30  4:25                               ` Tetsuo Handa
2015-09-30 10:21                                 ` Tetsuo Handa
2015-09-30 21:11                                 ` David Rientjes
2015-10-01 12:13                                   ` Tetsuo Handa
2015-10-01 14:48                           ` Michal Hocko
2015-10-02 13:06                             ` Tetsuo Handa
2015-10-06 18:45                     ` Oleg Nesterov
2015-10-07 11:03                       ` Tetsuo Handa
2015-10-07 12:00                         ` Oleg Nesterov
2015-10-08 14:04                           ` Michal Hocko
2015-10-08 14:01                       ` Michal Hocko
2015-09-21 16:51               ` Tetsuo Handa
2015-09-22 12:43                 ` Oleg Nesterov
2015-09-22 14:30                   ` Tetsuo Handa
2015-09-22 14:45                     ` Oleg Nesterov
2015-09-21 23:42               ` David Rientjes
2015-09-21 16:55           ` Linus Torvalds
2015-09-20 14:50   ` Tetsuo Handa
2015-09-20 14:55     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).