* [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
@ 2016-12-06 10:33 Tetsuo Handa
  2016-12-07  8:15 ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-06 10:33 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, Tetsuo Handa

If the OOM killer is invoked while many threads are looping inside the
page allocator, it is possible for the OOM killer to be preempted by those
threads. As a result, the OOM killer may be unable to send SIGKILL to OOM
victims and/or wake up the OOM reaper (by releasing oom_lock) for minutes,
because the other threads consume a lot of CPU time on pointless direct
reclaim.

----------
[ 2802.635229] Killed process 7267 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 2802.644296] oom_reaper: reaped process 7267 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 2802.650237] Out of memory: Kill process 7268 (a.out) score 999 or sacrifice child
[ 2803.653052] Killed process 7268 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 2804.426183] oom_reaper: reaped process 7268 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 2804.432524] Out of memory: Kill process 7269 (a.out) score 999 or sacrifice child
[ 2805.349380] a.out: page allocation stalls for 10047ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
[ 2805.349383] CPU: 2 PID: 7243 Comm: a.out Not tainted 4.9.0-rc8 #62
(...snipped...)
[ 3540.977499]           a.out  7269     22716.893359      5272   120
[ 3540.977499]         0.000000      1447.601063         0.000000
[ 3540.977499]  0 0
[ 3540.977500]  /autogroup-155
----------

This patch adds an extra sleep which is effectively equivalent to

  if (mutex_lock_killable(&oom_lock) == 0)
    mutex_unlock(&oom_lock);

before retrying the allocation at __alloc_pages_may_oom(), so that the
OOM killer is not preempted by other threads waiting for the OOM
killer/reaper to reclaim memory. Since the OOM reaper grabs oom_lock
due to commit e2fe14564d3316d1 ("oom_reaper: close race with exiting
task"), waking other threads up before the OOM reaper is woken, by
having them wait directly on oom_lock, might not help much.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51cbe1e..e5c1102 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3060,6 +3060,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		.order = order,
 	};
 	struct page *page;
+	static bool wait_more;
 
 	*did_some_progress = 0;
 
@@ -3070,6 +3071,9 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
 		schedule_timeout_uninterruptible(1);
+		while (wait_more)
+			if (schedule_timeout_killable(1) < 0)
+				break;
 		return NULL;
 	}
 
@@ -3109,6 +3113,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+	wait_more = true;
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
@@ -3125,6 +3130,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 					ALLOC_NO_WATERMARKS, ac);
 		}
 	}
+	wait_more = false;
 out:
 	mutex_unlock(&oom_lock);
 	return page;
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-06 10:33 [PATCH] mm/page_alloc: Wait for oom_lock before retrying Tetsuo Handa
@ 2016-12-07  8:15 ` Michal Hocko
  2016-12-07 15:29   ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-07  8:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> If the OOM killer is invoked when many threads are looping inside the
> page allocator, it is possible that the OOM killer is preempted by other
> threads.

Hmm, the only way I can see this would happen is when the task which
actually manages to take the lock is not invoking the OOM killer for
whatever reason. Is this what happens in your case? Are you able to
trigger this reliably?

> As a result, the OOM killer is unable to send SIGKILL to OOM
> victims and/or wake up the OOM reaper by releasing oom_lock for minutes
> because other threads consume a lot of CPU time for pointless direct
> reclaim.
> 
> ----------
> [ 2802.635229] Killed process 7267 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [ 2802.644296] oom_reaper: reaped process 7267 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [ 2802.650237] Out of memory: Kill process 7268 (a.out) score 999 or sacrifice child
> [ 2803.653052] Killed process 7268 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [ 2804.426183] oom_reaper: reaped process 7268 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [ 2804.432524] Out of memory: Kill process 7269 (a.out) score 999 or sacrifice child
> [ 2805.349380] a.out: page allocation stalls for 10047ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> [ 2805.349383] CPU: 2 PID: 7243 Comm: a.out Not tainted 4.9.0-rc8 #62
> (...snipped...)
> [ 3540.977499]           a.out  7269     22716.893359      5272   120
> [ 3540.977499]         0.000000      1447.601063         0.000000
> [ 3540.977499]  0 0
> [ 3540.977500]  /autogroup-155
> ----------
> 
> This patch adds extra sleeps which is effectively equivalent to
> 
>   if (mutex_lock_killable(&oom_lock) == 0)
>     mutex_unlock(&oom_lock);
> 
> before retrying allocation at __alloc_pages_may_oom() so that the
> OOM killer is not preempted by other threads waiting for the OOM
> killer/reaper to reclaim memory. Since the OOM reaper grabs oom_lock
> due to commit e2fe14564d3316d1 ("oom_reaper: close race with exiting
> task"), waking up other threads before the OOM reaper is woken up by
> directly waiting for oom_lock might not help so much.

So, why don't you simply s@mutex_trylock@mutex_lock_killable@ then?
The trylock is simply an optimistic heuristic to retry while the memory
is being freed. Making this part sync might help for the case you are
seeing.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/page_alloc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 51cbe1e..e5c1102 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3060,6 +3060,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  		.order = order,
>  	};
>  	struct page *page;
> +	static bool wait_more;
>  
>  	*did_some_progress = 0;
>  
> @@ -3070,6 +3071,9 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	if (!mutex_trylock(&oom_lock)) {
>  		*did_some_progress = 1;
>  		schedule_timeout_uninterruptible(1);
> +		while (wait_more)
> +			if (schedule_timeout_killable(1) < 0)
> +				break;
>  		return NULL;
>  	}
>  
> @@ -3109,6 +3113,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> +	wait_more = true;
>  	/* Exhausted what can be done so it's blamo time */
>  	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
>  		*did_some_progress = 1;
> @@ -3125,6 +3130,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  					ALLOC_NO_WATERMARKS, ac);
>  		}
>  	}
> +	wait_more = false;
>  out:
>  	mutex_unlock(&oom_lock);
>  	return page;

This is a joke, isn't it? Seriously, this made my eyes bleed.

violent NAK!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-07  8:15 ` Michal Hocko
@ 2016-12-07 15:29   ` Tetsuo Handa
  2016-12-08  8:20     ` Vlastimil Babka
  2016-12-08 13:27     ` Michal Hocko
  0 siblings, 2 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-07 15:29 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > If the OOM killer is invoked when many threads are looping inside the
> > page allocator, it is possible that the OOM killer is preempted by other
> > threads.
> 
> Hmm, the only way I can see this would happen is when the task which
> actually manages to take the lock is not invoking the OOM killer for
> whatever reason. Is this what happens in your case? Are you able to
> trigger this reliably?

Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
somebody called oom_kill_process() and reached the

  pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",

line, but did not reach the

  pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

line within a tolerable delay.

It is trivial to get the page allocator spammed by uncontrolled
warn_alloc() calls, like http://I-love.SAKURA.ne.jp/tmp/serial-20161207-2.txt.xz ,
and delayed by printk() using the stressor shown below. It seems to me that
most of the CPU time is spent on pointless direct reclaim and printk().

----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <poll.h>

int main(int argc, char *argv[])
{
        static char buffer[4096] = { };
        char *buf = NULL;
        unsigned long size;
        unsigned long i; /* not int: the touch loop below can exceed 2GB */
        for (i = 0; i < 1024; i++) {
                if (fork() == 0) {
                        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1000", 4);
                        close(fd);
                        sleep(1);
                        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
                        //snprintf(buffer, sizeof(buffer), "/tmp/file");
                        fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
                        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
                                poll(NULL, 0, 10);
                                fsync(fd);
                        }
                        _exit(0);
                }
        }
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(2);
        /* Will cause OOM due to overcommit */
        for (i = 0; i < size; i += 4096) {
                buf[i] = 0;
                if (i >= 1800 * 1048576) /* This VM has 2048MB RAM */
                        poll(NULL, 0, 10);
        }
        pause();
        return 0;
}
----------

> 
> > As a result, the OOM killer is unable to send SIGKILL to OOM
> > victims and/or wake up the OOM reaper by releasing oom_lock for minutes
> > because other threads consume a lot of CPU time for pointless direct
> > reclaim.
> > 
> > ----------
> > [ 2802.635229] Killed process 7267 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [ 2802.644296] oom_reaper: reaped process 7267 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [ 2802.650237] Out of memory: Kill process 7268 (a.out) score 999 or sacrifice child
> > [ 2803.653052] Killed process 7268 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [ 2804.426183] oom_reaper: reaped process 7268 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [ 2804.432524] Out of memory: Kill process 7269 (a.out) score 999 or sacrifice child
> > [ 2805.349380] a.out: page allocation stalls for 10047ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [ 2805.349383] CPU: 2 PID: 7243 Comm: a.out Not tainted 4.9.0-rc8 #62
> > (...snipped...)
> > [ 3540.977499]           a.out  7269     22716.893359      5272   120
> > [ 3540.977499]         0.000000      1447.601063         0.000000
> > [ 3540.977499]  0 0
> > [ 3540.977500]  /autogroup-155
> > ----------
> > 
> > This patch adds extra sleeps which is effectively equivalent to
> > 
> >   if (mutex_lock_killable(&oom_lock) == 0)
> >     mutex_unlock(&oom_lock);
> > 
> > before retrying allocation at __alloc_pages_may_oom() so that the
> > OOM killer is not preempted by other threads waiting for the OOM
> > killer/reaper to reclaim memory. Since the OOM reaper grabs oom_lock
> > due to commit e2fe14564d3316d1 ("oom_reaper: close race with exiting
> > task"), waking up other threads before the OOM reaper is woken up by
> > directly waiting for oom_lock might not help so much.
> 
> So, why don't you simply s@mutex_trylock@mutex_lock_killable@ then?
> The trylock is simply an optimistic heuristic to retry while the memory
> is being freed. Making this part sync might help for the case you are
> seeing.

May I? Something like the patch below? With it, the OOM killer can send
SIGKILL promptly and printk() can report smoothly (the frequency of
"** XXX printk messages dropped **" messages is significantly reduced).

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f6..ee0105b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {
 		*did_some_progress = 1;
 		schedule_timeout_uninterruptible(1);
 		return NULL;



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-07 15:29   ` Tetsuo Handa
@ 2016-12-08  8:20     ` Vlastimil Babka
  2016-12-08 11:00       ` Tetsuo Handa
  2016-12-08 13:27     ` Michal Hocko
  1 sibling, 1 reply; 96+ messages in thread
From: Vlastimil Babka @ 2016-12-08  8:20 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko; +Cc: linux-mm

On 12/07/2016 04:29 PM, Tetsuo Handa wrote:
>>> As a result, the OOM killer is unable to send SIGKILL to OOM
>>> victims and/or wake up the OOM reaper by releasing oom_lock for minutes
>>> because other threads consume a lot of CPU time for pointless direct
>>> reclaim.
>>>
>>> ----------
>>> [ 2802.635229] Killed process 7267 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
>>> [ 2802.644296] oom_reaper: reaped process 7267 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>> [ 2802.650237] Out of memory: Kill process 7268 (a.out) score 999 or sacrifice child
>>> [ 2803.653052] Killed process 7268 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
>>> [ 2804.426183] oom_reaper: reaped process 7268 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>> [ 2804.432524] Out of memory: Kill process 7269 (a.out) score 999 or sacrifice child
>>> [ 2805.349380] a.out: page allocation stalls for 10047ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
>>> [ 2805.349383] CPU: 2 PID: 7243 Comm: a.out Not tainted 4.9.0-rc8 #62
>>> (...snipped...)
>>> [ 3540.977499]           a.out  7269     22716.893359      5272   120
>>> [ 3540.977499]         0.000000      1447.601063         0.000000
>>> [ 3540.977499]  0 0
>>> [ 3540.977500]  /autogroup-155
>>> ----------
>>>
>>> This patch adds extra sleeps which is effectively equivalent to
>>>
>>>   if (mutex_lock_killable(&oom_lock) == 0)
>>>     mutex_unlock(&oom_lock);
>>>
>>> before retrying allocation at __alloc_pages_may_oom() so that the
>>> OOM killer is not preempted by other threads waiting for the OOM
>>> killer/reaper to reclaim memory. Since the OOM reaper grabs oom_lock
>>> due to commit e2fe14564d3316d1 ("oom_reaper: close race with exiting
>>> task"), waking up other threads before the OOM reaper is woken up by
>>> directly waiting for oom_lock might not help so much.
>>
>> So, why don't you simply s@mutex_trylock@mutex_lock_killable@ then?
>> The trylock is simply an optimistic heuristic to retry while the memory
>> is being freed. Making this part sync might help for the case you are
>> seeing.
>
> May I? Something like below? With patch below, the OOM killer can send
> SIGKILL smoothly and printk() can report smoothly (the frequency of
> "** XXX printk messages dropped **" messages is significantly reduced).
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c6d5f6..ee0105b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	 * Acquire the oom lock.  If that fails, somebody else is
>  	 * making progress for us.
>  	 */

The comment above could use some updating then. Although maybe "somebody 
killed us" is also technically "making progress for us" :)

> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
>  		*did_some_progress = 1;
>  		schedule_timeout_uninterruptible(1);

I think if we get here, it means somebody killed us, so we should not do 
this uninterruptible sleep anymore? (maybe also the caller could need 
some check to expedite the kill?).

>  		return NULL;
>



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-08  8:20     ` Vlastimil Babka
@ 2016-12-08 11:00       ` Tetsuo Handa
  2016-12-08 13:32         ` Michal Hocko
  2016-12-08 16:18         ` Sergey Senozhatsky
  0 siblings, 2 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-08 11:00 UTC (permalink / raw)
  To: vbabka, mhocko
  Cc: linux-mm, hannes, rientjes, aarcange, david, sergey.senozhatsky, akpm

Cc'ing people involved in commit dc56401fc9f25e8f ("mm: oom_kill: simplify
OOM killer locking") and Sergey as printk() expert. Topic started from
http://lkml.kernel.org/r/1481020439-5867-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

Vlastimil Babka wrote:
> > May I? Something like below? With patch below, the OOM killer can send
> > SIGKILL smoothly and printk() can report smoothly (the frequency of
> > "** XXX printk messages dropped **" messages is significantly reduced).
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2c6d5f6..ee0105b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  	 * Acquire the oom lock.  If that fails, somebody else is
> >  	 * making progress for us.
> >  	 */
> 
> The comment above could use some updating then. Although maybe "somebody 
> killed us" is also technically "making progress for us" :)

I think we can update the comment. But since __GFP_KILLABLE does not exist,
a pending SIGKILL does not imply that the current thread will make progress
by leaving the retry loop immediately. Therefore,

> 
> > -	if (!mutex_trylock(&oom_lock)) {
> > +	if (mutex_lock_killable(&oom_lock)) {
> >  		*did_some_progress = 1;
> >  		schedule_timeout_uninterruptible(1);
> 
> I think if we get here, it means somebody killed us, so we should not do 
> this uninterruptible sleep anymore? (maybe also the caller could need 
> some check to expedite the kill?).
> 
> >  		return NULL;

I guess we should still sleep.

----------------------------------------
From f294e5f53524d3b055857d35aa6f3dc16cf20d86 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 8 Dec 2016 09:27:18 +0900
Subject: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.

If the OOM killer is invoked when many threads are looping inside the
page allocator, it is possible that the OOM killer is blocked for an
unbounded period due to preemption and/or printk() with oom_lock held.

----------
[ 2802.635229] Killed process 7267 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 2802.644296] oom_reaper: reaped process 7267 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 2802.650237] Out of memory: Kill process 7268 (a.out) score 999 or sacrifice child
[ 2803.653052] Killed process 7268 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 2804.426183] oom_reaper: reaped process 7268 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 2804.432524] Out of memory: Kill process 7269 (a.out) score 999 or sacrifice child
[ 2805.349380] a.out: page allocation stalls for 10047ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
[ 2805.349383] CPU: 2 PID: 7243 Comm: a.out Not tainted 4.9.0-rc8 #62
(...snipped...)
[ 3540.977499]           a.out  7269     22716.893359      5272   120
[ 3540.977499]         0.000000      1447.601063         0.000000
[ 3540.977499]  0 0
[ 3540.977500]  /autogroup-155
----------

The problem triggered by preemption existed before commit 63f53dea0c98
("mm: warn about allocations which stall for too long"). But that commit
also made the problem triggerable via printk(), because printk() currently
tries to flush the printk log buffer ( https://lwn.net/Articles/705938/ ).

  Thread-1

    __alloc_pages_slowpath() {
      __alloc_pages_may_oom() {
        mutex_trylock(&oom_lock); // succeeds
        out_of_memory() {
          oom_kill_process() {
             pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
             do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
             pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", ...);
          }
        }
        mutex_unlock(&oom_lock);
      }
      // retry allocation due to did_some_progress == 1.
    }

  Thread-2

    __alloc_pages_slowpath() {
      __alloc_pages_may_oom() {
        mutex_trylock(&oom_lock); // fails and returns
      }
      // retry allocation due to did_some_progress == 1.
      warn_alloc() {
        pr_warn(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
      }
    }

Thread-1 was trying to flush the printk log buffer via printk() from
oom_kill_process() with oom_lock held. Thread-2 was appending to the printk
log buffer via printk() from warn_alloc() because Thread-2 could not take
oom_lock, which was held by Thread-1. As a result, this formed an AB-BA
livelock.

Although the fact that warn_alloc() calls printk() aggressively enough to
livelock is itself problematic, at least we can say that it is wasteful to
spend CPU time on pointless "direct reclaim and warn_alloc()" calls while
waiting for the OOM killer to send SIGKILL. Therefore, this patch replaces
mutex_trylock() with mutex_lock_killable().

Replacing mutex_trylock() with mutex_lock_killable() should be safe: if
somebody erroneously called __alloc_pages_may_oom() with oom_lock already
held, the current code would already livelock, because did_some_progress
is set to 1 despite the mutex_trylock() failure.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..6c43d8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3037,12 +3037,16 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
+	 * Give the OOM killer enough CPU time for sending SIGKILL.
+	 * Do not return without a short sleep unless TIF_MEMDIE is set, for
+	 * currently tsk_is_oom_victim(current) == true does not make
+	 * gfp_pfmemalloc_allowed() == true via TIF_MEMDIE until
+	 * mark_oom_victim(current) is called.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
+		if (!test_thread_flag(TIF_MEMDIE))
+			schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
-- 
1.8.3.1



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-07 15:29   ` Tetsuo Handa
  2016-12-08  8:20     ` Vlastimil Babka
@ 2016-12-08 13:27     ` Michal Hocko
  2016-12-09 14:23       ` Tetsuo Handa
  1 sibling, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-08 13:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > If the OOM killer is invoked when many threads are looping inside the
> > > page allocator, it is possible that the OOM killer is preempted by other
> > > threads.
> > 
> > Hmm, the only way I can see this would happen is when the task which
> > actually manages to take the lock is not invoking the OOM killer for
> > whatever reason. Is this what happens in your case? Are you able to
> > trigger this reliably?
> 
> Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> somebody called oom_kill_process() and reached
> 
>   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> 
> line but did not reach
> 
>   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> 
> line within tolerable delay.

I would be really interested in that. This can happen only if
find_lock_task_mm fails. This would mean that either we are selecting a
child without an mm or the selected victim has no mm anymore. Both cases
should be ephemeral because oom_badness will rule those tasks out on the
next round. So the primary question here is why no other task has hit
out_of_memory. Have you tried to instrument the kernel and see whether
GFP_NOFS contexts simply preempted any other attempt to get there?
I would find it quite unlikely but not impossible. If that is the case
we should really think how to move forward. One way is to make the oom
path fully synchronous as suggested below. Other is to tweak GFP_NOFS
some more and do not take the lock while we are evaluating that. This
sounds quite messy though.

[...]

> > So, why don't you simply s@mutex_trylock@mutex_lock_killable@ then?
> > The trylock is simply an optimistic heuristic to retry while the memory
> > is being freed. Making this part sync might help for the case you are
> > seeing.
> 
> May I? Something like below? With patch below, the OOM killer can send
> SIGKILL smoothly and printk() can report smoothly (the frequency of
> "** XXX printk messages dropped **" messages is significantly reduced).

Well, this has to be properly evaluated. The fact that
__oom_reap_task_mm requires the oom_lock makes it more complicated. We
definitely do not want to starve it. On the other hand the oom
invocation path shouldn't stall for too long, and even when we have
hundreds of tasks blocked on the lock and blocking the oom reaper,
the reaper should run _eventually_. It might take some time but this is
a glacial slow path so it should be acceptable.

That being said, this should be OK. But please make sure to mention all
these details in the changelog. Also make sure to document the actual
failure mode as mentioned above.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c6d5f6..ee0105b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	 * Acquire the oom lock.  If that fails, somebody else is
>  	 * making progress for us.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
>  		*did_some_progress = 1;
>  		schedule_timeout_uninterruptible(1);
>  		return NULL;

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-08 11:00       ` Tetsuo Handa
@ 2016-12-08 13:32         ` Michal Hocko
  2016-12-08 16:18         ` Sergey Senozhatsky
  1 sibling, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-08 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: vbabka, linux-mm, hannes, rientjes, aarcange, david,
	sergey.senozhatsky, akpm

On Thu 08-12-16 20:00:39, Tetsuo Handa wrote:
> Cc'ing people involved in commit dc56401fc9f25e8f ("mm: oom_kill: simplify
> OOM killer locking") and Sergey as printk() expert. Topic started from
> http://lkml.kernel.org/r/1481020439-5867-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> 
> Vlastimil Babka wrote:
> > > May I? Something like below? With patch below, the OOM killer can send
> > > SIGKILL smoothly and printk() can report smoothly (the frequency of
> > > "** XXX printk messages dropped **" messages is significantly reduced).
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 2c6d5f6..ee0105b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> > >  	 * Acquire the oom lock.  If that fails, somebody else is
> > >  	 * making progress for us.
> > >  	 */
> > 
> > The comment above could use some updating then. Although maybe "somebody 
> > killed us" is also technically "making progress for us" :)
> 
> I think we can update the comment. But since __GFP_KILLABLE does not exist,
> SIGKILL is pending does not imply that current thread will make progress by
> leaving the retry loop immediately. Therefore,

Although this is true, I do not think that cluttering the code with this
case is useful in any way. In the vast majority of cases a pending SIGKILL
will be a result of the oom killer.

[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440..6c43d8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3037,12 +3037,16 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * Give the OOM killer enough CPU time for sending SIGKILL.
> +	 * Do not return without a short sleep unless TIF_MEMDIE is set, for
> +	 * currently tsk_is_oom_victim(current) == true does not make
> +	 * gfp_pfmemalloc_allowed() == true via TIF_MEMDIE until
> +	 * mark_oom_victim(current) is called.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			schedule_timeout_uninterruptible(1);

I am not really sure this is necessary. Just return right away; those
unlikely cases where the current task was killed before entering the page
allocator simply do not matter, imho. I would rather go with simplicity
here.

>  		return NULL;
>  	}
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-08 11:00       ` Tetsuo Handa
  2016-12-08 13:32         ` Michal Hocko
@ 2016-12-08 16:18         ` Sergey Senozhatsky
  1 sibling, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-08 16:18 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: vbabka, mhocko, linux-mm, hannes, rientjes, aarcange, david,
	sergey.senozhatsky, akpm, Sergey Senozhatsky, Petr Mladek

Hello,

On (12/08/16 20:00), Tetsuo Handa wrote:
[..]
> Thread-1 was trying to flush the printk log buffer via printk() from
> oom_kill_process() with oom_lock held. Thread-2 was appending to the printk
> log buffer via printk() from warn_alloc() because Thread-2 could not take
> oom_lock, which was held by Thread-1. As a result, this formed an AB-BA livelock.
> 
> Although the fact that warn_alloc() calls printk() aggressively enough to
> livelock is problematic in itself, at least we can say that it is wasteful to
> spend CPU time on pointless "direct reclaim and warn_alloc()" calls while
> waiting for the OOM killer to send SIGKILL. Therefore, this patch replaces
> mutex_trylock() with mutex_lock_killable().
> 
> Replacing mutex_trylock() with mutex_lock_killable() should be safe, for
> if somebody erroneously called __alloc_pages_may_oom() with oom_lock already
> held, it would livelock anyway, because did_some_progress would be set to 1
> despite the mutex_trylock() failure.

So, I have some patches in my own/personal/secret tree. The patches have not
yet been published to the mailing list (well, not all of them; I'll do it
a bit later). We are moving towards the "printk() just appends the messages
to the logbuf" idea:

https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred


	-ss

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/page_alloc.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440..6c43d8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3037,12 +3037,16 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * Give the OOM killer enough CPU time for sending SIGKILL.
> +	 * Do not return without a short sleep unless TIF_MEMDIE is set, for
> +	 * currently tsk_is_oom_victim(current) == true does not make
> +	 * gfp_pfmemalloc_allowed() == true via TIF_MEMDIE until
> +	 * mark_oom_victim(current) is called.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-08 13:27     ` Michal Hocko
@ 2016-12-09 14:23       ` Tetsuo Handa
  2016-12-09 14:46         ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-09 14:23 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > If the OOM killer is invoked when many threads are looping inside the
> > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > threads.
> > > 
> > > Hmm, the only way I can see this would happen is when the task which
> > > actually manages to take the lock is not invoking the OOM killer for
> > > whatever reason. Is this what happens in your case? Are you able to
> > > trigger this reliably?
> > 
> > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > somebody called oom_kill_process() and reached
> > 
> >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > 
> > line but did not reach
> > 
> >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > 
> > line within tolerable delay.
> 
> I would be really interested in that. This can happen only if
> find_lock_task_mm fails. This would mean that either we are selecting a
> child without mm or the selected victim has no mm anymore. Both cases
> should be ephemeral because oom_badness will rule those tasks on the
> next round. So the primary question here is why no other task has hit
> out_of_memory.

This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).

>                Have you tried to instrument the kernel and see whether
> GFP_NOFS contexts simply preempted any other attempt to get there?
> I would find it quite unlikely but not impossible. If that is the case
> we should really think how to move forward. One way is to make the oom
> path fully synchronous as suggested below. Other is to tweak GFP_NOFS
> some more and do not take the lock while we are evaluating that. This
> sounds quite messy though.

Do you mean "tweak GFP_NOFS" as in something like the patch below?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3036,6 +3036,17 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	if (!(gfp_mask & (__GFP_FS | __GFP_NOFAIL))) {
+		if ((current->flags & PF_DUMPCORE) ||
+		    (order > PAGE_ALLOC_COSTLY_ORDER) ||
+		    (ac->high_zoneidx < ZONE_NORMAL) ||
+		    (pm_suspended_storage()) ||
+		    (gfp_mask & __GFP_THISNODE))
+			return NULL;
+		*did_some_progress = 1;
+		return NULL;
+	}
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.

Then, serial-20161209-gfp.txt in http://I-love.SAKURA.ne.jp/tmp/20161209.tar.xz is
the console log with the above patch applied. Threads kept spinning without
invoking the OOM killer; the patch did not avoid the lockup.

[  879.772089] Killed process 14529 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  884.746246] Killed process 14530 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  885.162475] Killed process 14531 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  885.399802] Killed process 14532 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  889.497044] a.out: page allocation stalls for 10001ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.507193] a.out: page allocation stalls for 10016ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.560741] systemd-journal: page allocation stalls for 10020ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  889.590231] a.out: page allocation stalls for 10079ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.600207] a.out: page allocation stalls for 10091ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.607186] a.out: page allocation stalls for 10105ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.611057] a.out: page allocation stalls for 10001ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.646180] a.out: page allocation stalls for 10065ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  889.655083] tuned: page allocation stalls for 10001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
(...snipped...)
[ 1139.516867] a.out: page allocation stalls for 260007ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 1139.530790] a.out: page allocation stalls for 260034ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 1139.555816] a.out: page allocation stalls for 260038ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 1142.097226] NetworkManager: page allocation stalls for 210003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1142.747370] systemd-journal: page allocation stalls for 220003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1142.747443] page allocation stalls for 220003ms, order:0 [<ffffffff81226c20>] __do_fault+0x80/0x130
[ 1142.750326] irqbalance: page allocation stalls for 220001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1142.763366] postgres: page allocation stalls for 220003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1143.139489] master: page allocation stalls for 220003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1143.292492] mysqld: page allocation stalls for 260001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1143.313282] mysqld: page allocation stalls for 260002ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1143.543551] mysqld: page allocation stalls for 250003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1143.726339] postgres: page allocation stalls for 260003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1147.408614] smbd: page allocation stalls for 220001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)

> 
> [...]
> 
> > > So, why don't you simply s@mutex_trylock@mutex_lock_killable@ then?
> > > The trylock is simply an optimistic heuristic to retry while the memory
> > > is being freed. Making this part sync might help for the case you are
> > > seeing.
> > 
> > May I? Something like below? With patch below, the OOM killer can send
> > SIGKILL smoothly and printk() can report smoothly (the frequency of
> > "** XXX printk messages dropped **" messages is significantly reduced).
> 
> Well, this has to be properly evaluated. The fact that
> __oom_reap_task_mm requires the oom_lock makes it more complicated. We
> definitely do not want to starve it. On the other hand the oom
> invocation path shouldn't stall for too long and even when we have
> hundreds of tasks blocked on the lock and blocking the oom reaper then
> the reaper should run _eventually_. It might take some time, but this is a
> glacial slow path, so it should be acceptable.
> 
> That being said, this should be OK. But please make sure to mention all
> these details in the changelog. Also make sure to document the actual
> failure mode as mentioned above.

stall-20161209-1.png and stall-20161209-2.png in 20161209.tar.xz are
screenshots, and serial-20161209-stall.txt is the console log without any patch
applied. We can see that console output is dropped to the point of being
unreadable, and that all CPUs are spinning without invoking the OOM killer.

[  130.084200] Killed process 2613 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  130.297981] Killed process 2614 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  130.509444] Killed process 2615 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  130.725497] Killed process 2616 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  140.886508] a.out: page allocation stalls for 10004ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  140.888637] a.out: page allocation stalls for 10006ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  140.890348] a.out: page allocation stalls for 10008ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
** 49 printk messages dropped ** [  140.892119]  [<ffffffff81293685>] __vfs_write+0xe5/0x140
** 45 printk messages dropped ** [  140.892994]  [<ffffffff81306f10>] ? iomap_write_end+0x80/0x80
** 93 printk messages dropped ** [  140.898500]  [<ffffffff811e802d>] __page_cache_alloc+0x15d/0x1a0
** 45 printk messages dropped ** [  140.900144]  [<ffffffff811f58e9>] warn_alloc+0x149/0x180
** 94 printk messages dropped ** [  140.900785] CPU: 1 PID: 3372 Comm: a.out Not tainted 4.9.0-rc8+ #70
** 89 printk messages dropped ** [  147.049875] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
** 96 printk messages dropped ** [  150.110000] Node 0 DMA32: 9*4kB (H) 4*8kB (UH) 8*16kB (UEH) 187*32kB (UMEH) 75*64kB (UEH) 108*128kB (UME) 49*256kB (UME) 12*512kB (UME) 1*1024kB (U) 0*2048kB 0*4096kB = 44516kB
** 303 printk messages dropped ** [  150.893480] lowmem_reserve[]: 0 0 0 0
** 148 printk messages dropped ** [  153.480652] Node 0 DMA free:6700kB min:440kB low:548kB high:656kB active_anon:9144kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 191 printk messages dropped ** [  160.110155] Node 0 DMA free:6700kB min:440kB low:548kB high:656kB active_anon:9144kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 1551 printk messages dropped ** [  178.654905] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
** 43 printk messages dropped ** [  179.057226]  ffffc90003377a08 ffffffff813c9d4d ffffffff81a29518 0000000000000001
** 95 printk messages dropped ** [  180.109388]  ffffc90002283a08 ffffffff813c9d4d ffffffff81a29518 0000000000000001
** 94 printk messages dropped ** [  180.889628] 0 pages hwpoisoned
[  180.895764] a.out: page allocation stalls for 50013ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
** 240 printk messages dropped ** [  183.318598] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
(...snipped...)
** 188 printk messages dropped ** [  452.747159] 0 pages HighMem/MovableOnly
** 44 printk messages dropped ** [  452.773748] 4366 total pagecache pages
** 48 printk messages dropped ** [  452.803376] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
** 537 printk messages dropped ** [  460.107887] lowmem_reserve[]: 0 0 0 0

> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2c6d5f6..ee0105b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3075,7 +3075,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  	 * Acquire the oom lock.  If that fails, somebody else is
> >  	 * making progress for us.
> >  	 */
> > -	if (!mutex_trylock(&oom_lock)) {
> > +	if (mutex_lock_killable(&oom_lock)) {
> >  		*did_some_progress = 1;
> >  		schedule_timeout_uninterruptible(1);
> >  		return NULL;
> 

nostall-20161209-1.png and nostall-20161209-2.png are screenshots, and
serial-20161209-nostall.txt is the console log with the mutex_lock_killable()
patch applied. We can see that far fewer console messages are dropped and that
only one CPU is spinning while invoking the OOM killer.

[  421.630240] Killed process 4568 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  421.643236] Killed process 4569 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  421.842463] Killed process 4570 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  421.899778] postgres: page allocation stalls for 11376ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  421.900569] Killed process 4571 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  421.900792] postgres: page allocation stalls for 185751ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  421.900920] systemd-logind: page allocation stalls for 162980ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  421.901027] master: page allocation stalls for 86144ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  421.912876] pickup: page allocation stalls for 18360ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  422.007323] Killed process 4572 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  422.011580] Killed process 4573 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  422.017043] Killed process 4574 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  422.027035] Killed process 4575 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB

So, I think serializing with mutex_lock_killable() is preferable for avoiding lockups
even if it might defer !__GFP_FS && !__GFP_NOFAIL allocations or the OOM reaper.



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-09 14:23       ` Tetsuo Handa
@ 2016-12-09 14:46         ` Michal Hocko
  2016-12-10 11:24           ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-09 14:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > threads.
> > > > 
> > > > Hmm, the only way I can see this would happen is when the task which
> > > > actually manages to take the lock is not invoking the OOM killer for
> > > > whatever reason. Is this what happens in your case? Are you able to
> > > > trigger this reliably?
> > > 
> > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > somebody called oom_kill_process() and reached
> > > 
> > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > 
> > > line but did not reach
> > > 
> > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > 
> > > line within tolerable delay.
> > 
> > I would be really interested in that. This can happen only if
> > find_lock_task_mm fails. This would mean that either we are selecting a
> > child without mm or the selected victim has no mm anymore. Both cases
> > should be ephemeral because oom_badness will rule those tasks on the
> > next round. So the primary question here is why no other task has hit
> > out_of_memory.
> 
> This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).

Care to explain how that livelock would look?

> >                Have you tried to instrument the kernel and see whether
> > GFP_NOFS contexts simply preempted any other attempt to get there?
> > I would find it quite unlikely but not impossible. If that is the case
> > we should really think how to move forward. One way is to make the oom
> > path fully synchronous as suggested below. Other is to tweak GFP_NOFS
> > some more and do not take the lock while we are evaluating that. This
> > sounds quite messy though.
> 
> Do you mean "tweak GFP_NOFS" as something like below patch?
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3036,6 +3036,17 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  
>  	*did_some_progress = 0;
>  
> +	if (!(gfp_mask & (__GFP_FS | __GFP_NOFAIL))) {
> +		if ((current->flags & PF_DUMPCORE) ||
> +		    (order > PAGE_ALLOC_COSTLY_ORDER) ||
> +		    (ac->high_zoneidx < ZONE_NORMAL) ||
> +		    (pm_suspended_storage()) ||
> +		    (gfp_mask & __GFP_THISNODE))
> +			return NULL;
> +		*did_some_progress = 1;
> +		return NULL;
> +	}
> +
>  	/*
>  	 * Acquire the oom lock.  If that fails, somebody else is
>  	 * making progress for us.
> 
> Then, serial-20161209-gfp.txt in http://I-love.SAKURA.ne.jp/tmp/20161209.tar.xz is
> the console log with the above patch applied. Threads kept spinning without
> invoking the OOM killer; the patch did not avoid the lockup.

OK, so the reason for the lockup must be something different. If we are
really {dead,live}locking on printk because of warn_alloc, then that
path should be tweaked instead. Something like the patch below should rule
this out:
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed65d7df72d5..c2ba51cec93d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 	struct va_format vaf;
 	va_list args;
+	static DEFINE_MUTEX(warn_lock);
 
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
 	    debug_guardpage_minorder() > 0)
 		return;
 
+	mutex_lock(&warn_lock);
+
 	/*
 	 * This documents exceptions given to allocations in certain
 	 * contexts that are allowed to allocate outside current's set
@@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	dump_stack();
 	if (!should_suppress_show_mem())
 		show_mem(filter);
+
+	mutex_unlock(&warn_lock);
 }
 
 static inline struct page *
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-09 14:46         ` Michal Hocko
@ 2016-12-10 11:24           ` Tetsuo Handa
  2016-12-12  9:07             ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-10 11:24 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > > threads.
> > > > > 
> > > > > Hmm, the only way I can see this would happen is when the task which
> > > > > actually manages to take the lock is not invoking the OOM killer for
> > > > > whatever reason. Is this what happens in your case? Are you able to
> > > > > trigger this reliably?
> > > > 
> > > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > somebody called oom_kill_process() and reached
> > > > 
> > > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > 
> > > > line but did not reach
> > > > 
> > > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > 
> > > > line within tolerable delay.
> > > 
> > > I would be really interested in that. This can happen only if
> > > find_lock_task_mm fails. This would mean that either we are selecting a
> > > child without mm or the selected victim has no mm anymore. Both cases
> > > should be ephemeral because oom_badness will rule those tasks on the
> > > next round. So the primary question here is why no other task has hit
> > > out_of_memory.
> > 
> > This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).
> 
> Care to explain how that livelock would look?

Two types of threads (Thread-1, which holds oom_lock, and Thread-2, which does
not) are doing memory allocation. Since oom_lock is a mutex, there can be only
one instance of Thread-1, but there can be multiple instances of Thread-2.

(1) Thread-1 enters out_of_memory() because it is holding oom_lock.
(2) Thread-1 enters printk() due to

    pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);

    in oom_kill_process().

(3) vprintk_func() is mapped to vprintk_default() because Thread-1 is not
    inside NMI handler.

(4) In vprintk_emit(), in_sched == false because loglevel for pr_err()
    is not LOGLEVEL_SCHED.

(5) Thread-1 calls log_store() via log_output() from vprintk_emit().

(6) Thread-1 calls console_trylock() because in_sched == false.

(7) Thread-1 acquires console_sem via down_trylock_console_sem().

(8) In console_trylock(), console_may_schedule is set to true because
    Thread-1 is in sleepable context.

(9) Thread-1 calls console_unlock() because console_trylock() succeeded.

(10) In console_unlock(), pending data stored by log_store() are printed
     to consoles. Since there may be slow consoles, cond_resched() is called
     if possible. And since console_may_schedule == true because Thread-1 is
     in a sleepable context, Thread-1 may be scheduled out inside
     console_unlock().

(11) Thread-2 tries to acquire oom_lock but fails because Thread-1 is
     holding oom_lock.

(12) Thread-2 enters warn_alloc() because it is waiting for Thread-1 to
     return from oom_kill_process().

(13) Thread-2 enters printk() due to

     warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

     in __alloc_pages_slowpath().

(14) vprintk_func() is mapped to vprintk_default() because Thread-2 is not
     inside an NMI handler.

(15) In vprintk_emit(), in_sched == false because the loglevel for pr_err()
     is not LOGLEVEL_SCHED.

(16) Thread-2 calls log_store() via log_output() from vprintk_emit().

(17) Thread-2 calls console_trylock() because in_sched == false.

(18) Thread-2 fails to acquire console_sem via down_trylock_console_sem().

(19) Thread-2 returns from vprintk_emit().

(20) Thread-2 leaves warn_alloc().

(21) When Thread-1 is woken up, it finds new data appended by Thread-2.

(22) Thread-1 remains inside console_unlock() with oom_lock still held
     because there is data which should be printed to consoles.

(23) Thread-2 keeps failing to acquire oom_lock, periodically appending
     new data via warn_alloc().

(24) The user-visible result is that Thread-1 is unable to return from

     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);

     in oom_kill_process().

The introduction of uncontrolled

  warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

in __alloc_pages_slowpath() increased the likelihood of Thread-1 remaining
inside console_unlock(). Although Sergey is working on this problem by
offloading printing to consoles, we might still see "** XXX printk messages
dropped **" messages if we let Thread-2 call printk() in an uncontrolled
manner, because

  /*
   * Give the killed process a good chance to exit before trying
   * to allocate memory again.
   */
  schedule_timeout_killable(1);

which is called after Thread-1 returns from oom_kill_process(), allows
Thread-2 and other threads to consume CPU time for a long duration before
the OOM reaper can start reaping by taking oom_lock.

> 
> > >                Have you tried to instrument the kernel and see whether
> > > GFP_NOFS contexts simply preempted any other attempt to get there?
> > > I would find it quite unlikely but not impossible. If that is the case
> > > we should really think how to move forward. One way is to make the oom
> > > path fully synchronous as suggested below. Other is to tweak GFP_NOFS
> > > some more and do not take the lock while we are evaluating that. This
> > > sounds quite messy though.
> > 
> > Do you mean "tweak GFP_NOFS" as something like below patch?
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3036,6 +3036,17 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  
> >  	*did_some_progress = 0;
> >  
> > +	if (!(gfp_mask & (__GFP_FS | __GFP_NOFAIL))) {
> > +		if ((current->flags & PF_DUMPCORE) ||
> > +		    (order > PAGE_ALLOC_COSTLY_ORDER) ||
> > +		    (ac->high_zoneidx < ZONE_NORMAL) ||
> > +		    (pm_suspended_storage()) ||
> > +		    (gfp_mask & __GFP_THISNODE))
> > +			return NULL;
> > +		*did_some_progress = 1;
> > +		return NULL;
> > +	}
> > +
> >  	/*
> >  	 * Acquire the oom lock.  If that fails, somebody else is
> >  	 * making progress for us.
> > 
> > Then, serial-20161209-gfp.txt in http://I-love.SAKURA.ne.jp/tmp/20161209.tar.xz is
> > the console log with the above patch applied. Threads kept spinning without
> > invoking the OOM killer; the patch did not avoid the lockup.
> 
> OK, so the reason for the lockup must be something different. If we are
> really {dead,live}locking on printk because of warn_alloc, then that
> path should be tweaked instead. Something like the patch below should rule
> this out:

Last year I proposed disabling preemption at
http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp
but it was not accepted. A "while (1);" loop in userspace corresponds to the
pointless "direct reclaim and warn_alloc()" loop in kernel space. This time,
I'm proposing serialization via oom_lock and replacing warn_alloc() with
kmallocwd so that printk() does not flood.

> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed65d7df72d5..c2ba51cec93d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	unsigned int filter = SHOW_MEM_FILTER_NODES;
>  	struct va_format vaf;
>  	va_list args;
> +	static DEFINE_MUTEX(warn_lock);
>  
>  	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
>  	    debug_guardpage_minorder() > 0)
>  		return;
>  

if (gfp_mask & __GFP_DIRECT_RECLAIM)

> +	mutex_lock(&warn_lock);
> +
>  	/*
>  	 * This documents exceptions given to allocations in certain
>  	 * contexts that are allowed to allocate outside current's set
> @@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	dump_stack();
>  	if (!should_suppress_show_mem())
>  		show_mem(filter);
> +

if (gfp_mask & __GFP_DIRECT_RECLAIM)

> +	mutex_unlock(&warn_lock);
>  }
>  
>  static inline struct page *

and I think "s/warn_lock/oom_lock/" is needed, because out_of_memory() might
call show_mem() concurrently.

I think this warn_alloc() is too much noise. When something goes
wrong, multiple instances of Thread-2 tend to call warn_alloc()
concurrently, and we do not need to report similar memory information
repeatedly.



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-10 11:24           ` Tetsuo Handa
@ 2016-12-12  9:07             ` Michal Hocko
  2016-12-12 11:49               ` Petr Mladek
  2016-12-12 12:12               ` Tetsuo Handa
  0 siblings, 2 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-12  9:07 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Petr Mladek, Sergey Senozhatsky

On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > > > threads.
> > > > > > 
> > > > > > Hmm, the only way I can see this would happen is when the task which
> > > > > > actually manages to take the lock is not invoking the OOM killer for
> > > > > > whatever reason. Is this what happens in your case? Are you able to
> > > > > > trigger this reliably?
> > > > > 
> > > > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > > somebody called oom_kill_process() and reached
> > > > > 
> > > > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > > 
> > > > > line but did not reach
> > > > > 
> > > > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > > 
> > > > > line within tolerable delay.
> > > > 
> > > > I would be really interested in that. This can happen only if
> > > > find_lock_task_mm fails. This would mean that either we are selecting a
> > > > child without mm or the selected victim has no mm anymore. Both cases
> > > > should be ephemeral because oom_badness will rule those tasks on the
> > > > next round. So the primary question here is why no other task has hit
> > > > out_of_memory.
> > > 
> > > This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).
> > 
> > Care to explain how would that livelock look like?
> 
> Two types of threads (Thread-1 which is holding oom_lock, Thread-2 which is not
> holding oom_lock) are doing memory allocation. Since oom_lock is a mutex, there
> can be only 1 instance for Thread-1. But there can be multiple instances for
> Thread-2.
> 
> (1) Thread-1 enters out_of_memory() because it is holding oom_lock.
> (2) Thread-1 enters printk() due to
> 
>     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> 
>     in oom_kill_process().
> 
> (3) vprintk_func() is mapped to vprintk_default() because Thread-1 is not
>     inside NMI handler.
> 
> (4) In vprintk_emit(), in_sched == false because loglevel for pr_err()
>     is not LOGLEVEL_SCHED.
> 
> (5) Thread-1 calls log_store() via log_output() from vprintk_emit().
> 
> (6) Thread-1 calls console_trylock() because in_sched == false.
> 
> (7) Thread-1 acquires console_sem via down_trylock_console_sem().
> 
> (8) In console_trylock(), console_may_schedule is set to true because
>     Thread-1 is in sleepable context.
> 
> (9) Thread-1 calls console_unlock() because console_trylock() succeeded.
> 
> (9) In console_unlock(), pending data stored by log_store() are printed
>     to consoles. Since there may be slow consoles, cond_resched() is called
>     if possible. And since console_may_schedule == true because Thread-1 is
>     in sleepable context, Thread-1 may be scheduled at console_unlock().
> 
> (10) Thread-2 tries to acquire oom_lock but it fails because Thread-1 is
>      holding oom_lock.
> 
> (11) Thread-2 enters warn_alloc() because it is waiting for Thread-1 to
>      return from oom_kill_process().
> 
> (12) Thread-2 enters printk() due to
> 
>      warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> 
>      in __alloc_pages_slowpath().
> 
> (13) vprintk_func() is mapped to vprintk_default() because Thread-2 is not
>      inside NMI handler.
> 
> (14) In vprintk_emit(), in_sched == false because loglevel for pr_err()
>      is not LOGLEVEL_SCHED.
> 
> (15) Thread-2 calls log_store() via log_output() from vprintk_emit().
> 
> (16) Thread-2 calls console_trylock() because in_sched == false.
> 
> (17) Thread-2 fails to acquire console_sem via down_trylock_console_sem().
> 
> (18) Thread-2 returns from vprintk_emit().
> 
> (19) Thread-2 leaves warn_alloc().
> 
> (20) When Thread-1 is waken up, it finds new data appended by Thread-2.
> 
> (21) Thread-1 remains inside console_unlock() with oom_lock still held
>      because there is data which should be printed to consoles.
> 
> (22) Thread-2 remains failing to acquire oom_lock, periodically appending
>      new data via warn_alloc(), and failing to acquire oom_lock.
> 
> (23) The user visible result is that Thread-1 is unable to return from
> 
>      pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> 
>      in oom_kill_process().

OK, I see. This is not a new problem though, and people are trying to
solve it in printk proper. CCed some people; I do not have links
to those threads handy. And if this is really the problem here, then we
definitely shouldn't put hacks into the page allocator path to handle
it, because the sources of a printk flood might be arbitrary.

> The introduction of uncontrolled
> 
>   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> 
> in __alloc_pages_slowpath() increased the possibility for Thread-1 to remain
> inside console_unlock(). Although Sergey is working on this problem by
> offloading printing to consoles, we might still see "** XXX printk messages
> dropped **" messages if we let Thread-2 call printk() uncontrolledly, for
> 
>   /*
>    * Give the killed process a good chance to exit before trying
>    * to allocate memory again.
>    */
>   schedule_timeout_killable(1);
> 
> which is called after Thread-1 returned from oom_kill_process() allows
> Thread-2 and other threads to consume long duration before the OOM reaper
> can start reaping by taking oom_lock.
> 
[...]
> > OK, so the reason of the lock up must be something different. If we are
> > really {dead,live}locking on the printk because of warn_alloc then that
> > path should be tweaked instead. Something like below should rule this
> > out:
> 
> Last year I proposed disabling preemption at
> http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp
> but it was not accepted. "while (1);" in userspace corresponds with
> pointless "direct reclaim and warn_alloc()" in kernel space. This time,
> I'm proposing serialization by oom_lock and replace warn_alloc() with kmallocwd
> in order to make printk() not to flood.

The way you keep trying to push your kmallocwd on any occasion is
quite annoying, to be honest. If that approach were so much better,
then I am pretty sure you wouldn't have such a problem getting it
merged. warn_alloc is a simple and straightforward approach. If it can
cause floods of messages then we should tune it, not replace it with a
big hammer.

> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index ed65d7df72d5..c2ba51cec93d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  	unsigned int filter = SHOW_MEM_FILTER_NODES;
> >  	struct va_format vaf;
> >  	va_list args;
> > +	static DEFINE_MUTEX(warn_lock);
> >  
> >  	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
> >  	    debug_guardpage_minorder() > 0)
> >  		return;
> >  
> 
> if (gfp_mask & __GFP_DIRECT_RECLAIM)

Why?

> > +	mutex_lock(&warn_lock);
> > +
> >  	/*
> >  	 * This documents exceptions given to allocations in certain
> >  	 * contexts that are allowed to allocate outside current's set
> > @@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  	dump_stack();
> >  	if (!should_suppress_show_mem())
> >  		show_mem(filter);
> > +
> 
> if (gfp_mask & __GFP_DIRECT_RECLAIM)
> 
> > +	mutex_unlock(&warn_lock);
> >  }
> >  
> >  static inline struct page *
> 
> and I think "s/warn_lock/oom_lock/" because out_of_memory() might
> call show_mem() concurrently.

I would rather not mix the two. Even if both use show_mem, there is
no reason to abuse the oom_lock.

Maybe I've missed that but you haven't responded to the question whether
the warn_lock actually resolves the problem you are seeing.

> I think this warn_alloc() is too much noise. When something went
> wrong, multiple instances of Thread-2 tend to call warn_alloc()
> concurrently. We don't need to report similar memory information.

That is why we have ratelimiting. It just needs better tuning rather
than being thrown away.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12  9:07             ` Michal Hocko
@ 2016-12-12 11:49               ` Petr Mladek
  2016-12-12 13:00                 ` Michal Hocko
  2016-12-13  1:06                 ` Sergey Senozhatsky
  2016-12-12 12:12               ` Tetsuo Handa
  1 sibling, 2 replies; 96+ messages in thread
From: Petr Mladek @ 2016-12-12 11:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, Sergey Senozhatsky

On Mon 2016-12-12 10:07:03, Michal Hocko wrote:
> On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > > > > threads.
> > > > > > > 
> > > > > > > Hmm, the only way I can see this would happen is when the task which
> > > > > > > actually manages to take the lock is not invoking the OOM killer for
> > > > > > > whatever reason. Is this what happens in your case? Are you able to
> > > > > > > trigger this reliably?
> > > > > > 
> > > > > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > > > somebody called oom_kill_process() and reached
> > > > > > 
> > > > > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > > > 
> > > > > > line but did not reach
> > > > > > 
> > > > > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > > > 
> > > > > > line within tolerable delay.
> > > > > 
> > > > > I would be really interested in that. This can happen only if
> > > > > find_lock_task_mm fails. This would mean that either we are selecting a
> > > > > child without mm or the selected victim has no mm anymore. Both cases
> > > > > should be ephemeral because oom_badness will rule those tasks on the
> > > > > next round. So the primary question here is why no other task has hit
> > > > > out_of_memory.
> > > > 
> > > > This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).
> > > 
> > > Care to explain how would that livelock look like?
> > 
> > Two types of threads (Thread-1 which is holding oom_lock, Thread-2 which is not
> > holding oom_lock) are doing memory allocation. Since oom_lock is a mutex, there
> > can be only 1 instance for Thread-1. But there can be multiple instances for
> > Thread-2.
> > 
> > (1) Thread-1 enters out_of_memory() because it is holding oom_lock.
> > (2) Thread-1 enters printk() due to
> > 
> >     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> > 
> >     in oom_kill_process().
> > 
> > (3) vprintk_func() is mapped to vprintk_default() because Thread-1 is not
> >     inside NMI handler.
> > 
> > (4) In vprintk_emit(), in_sched == false because loglevel for pr_err()
> >     is not LOGLEVEL_SCHED.
> > 
> > (5) Thread-1 calls log_store() via log_output() from vprintk_emit().
> > 
> > (6) Thread-1 calls console_trylock() because in_sched == false.
> > 
> > (7) Thread-1 acquires console_sem via down_trylock_console_sem().
> > 
> > (8) In console_trylock(), console_may_schedule is set to true because
> >     Thread-1 is in sleepable context.
> > 
> > (9) Thread-1 calls console_unlock() because console_trylock() succeeded.
> > 
> > (9) In console_unlock(), pending data stored by log_store() are printed
> >     to consoles. Since there may be slow consoles, cond_resched() is called
> >     if possible. And since console_may_schedule == true because Thread-1 is
> >     in sleepable context, Thread-1 may be scheduled at console_unlock().
> > 
> > (10) Thread-2 tries to acquire oom_lock but it fails because Thread-1 is
> >      holding oom_lock.
> > 
> > (11) Thread-2 enters warn_alloc() because it is waiting for Thread-1 to
> >      return from oom_kill_process().
> > 
> > (12) Thread-2 enters printk() due to
> > 
> >      warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> > 
> >      in __alloc_pages_slowpath().
> > 
> > (13) vprintk_func() is mapped to vprintk_default() because Thread-2 is not
> >      inside NMI handler.
> > 
> > (14) In vprintk_emit(), in_sched == false because loglevel for pr_err()
> >      is not LOGLEVEL_SCHED.
> > 
> > (15) Thread-2 calls log_store() via log_output() from vprintk_emit().
> > 
> > (16) Thread-2 calls console_trylock() because in_sched == false.
> > 
> > (17) Thread-2 fails to acquire console_sem via down_trylock_console_sem().
> > 
> > (18) Thread-2 returns from vprintk_emit().
> > 
> > (19) Thread-2 leaves warn_alloc().
> > 
> > (20) When Thread-1 is waken up, it finds new data appended by Thread-2.
> > 
> > (21) Thread-1 remains inside console_unlock() with oom_lock still held
> >      because there is data which should be printed to consoles.
> > 
> > (22) Thread-2 remains failing to acquire oom_lock, periodically appending
> >      new data via warn_alloc(), and failing to acquire oom_lock.
> > 
> > (23) The user visible result is that Thread-1 is unable to return from
> > 
> >      pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> > 
> >      in oom_kill_process().
> 
> OK, I see. This is not a new problem though, and people are trying to
> solve it in printk proper. CCed some people; I do not have links
> to those threads handy. And if this is really the problem here, then we
> definitely shouldn't put hacks into the page allocator path to handle
> it, because the sources of a printk flood might be arbitrary.

Yup, this is exactly the type of problem that we want to solve
with async printk.


> > The introduction of uncontrolled
> > 
> >   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

I am just curious why there would be so many messages.
If I understand it correctly, this warning is printed
once every 10 seconds. Or am I wrong?

Well, you might want to consider using

		stall_timeout *= 2;

instead of adding the constant 10 * HZ.

Of course, even better would be some global throttling of
this message.


Best Regards,
Petr

PS: I am not an mm expert and did not read this whole thread. Just
ignore this if I missed the point. Anyway, it sounds weird to linearize
all allocation requests in an OOM situation. It is much harder to
unblock high-order requests than low-order ones.


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12  9:07             ` Michal Hocko
  2016-12-12 11:49               ` Petr Mladek
@ 2016-12-12 12:12               ` Tetsuo Handa
  2016-12-12 12:55                 ` Michal Hocko
  1 sibling, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-12 12:12 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index ed65d7df72d5..c2ba51cec93d 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> > >  	unsigned int filter = SHOW_MEM_FILTER_NODES;
> > >  	struct va_format vaf;
> > >  	va_list args;
> > > +	static DEFINE_MUTEX(warn_lock);
> > >  
> > >  	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
> > >  	    debug_guardpage_minorder() > 0)
> > >  		return;
> > >  
> > 
> > if (gfp_mask & __GFP_DIRECT_RECLAIM)
> 
> Why?

Because warn_alloc() is also called by !__GFP_DIRECT_RECLAIM allocation
requests when allocation fails. We are not allowed to sleep in that case.

> 
> > > +	mutex_lock(&warn_lock);
> > > +
> > >  	/*
> > >  	 * This documents exceptions given to allocations in certain
> > >  	 * contexts that are allowed to allocate outside current's set
> > > @@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> > >  	dump_stack();
> > >  	if (!should_suppress_show_mem())
> > >  		show_mem(filter);
> > > +
> > 
> > if (gfp_mask & __GFP_DIRECT_RECLAIM)
> > 
> > > +	mutex_unlock(&warn_lock);
> > >  }
> > >  
> > >  static inline struct page *
> > 
> > and I think "s/warn_lock/oom_lock/" because out_of_memory() might
> > call show_mem() concurrently.
> 
> I would rather not mix the two. Even if both use show_mem then there is
> no reason to abuse the oom_lock.
> 
> Maybe I've missed that but you haven't responded to the question whether
> the warn_lock actually resolves the problem you are seeing.

I haven't tried warn_lock, but is warn_lock in warn_alloc() better than
serializing oom_lock in __alloc_pages_may_oom() ? I think we don't need to
waste CPU cycles before the OOM killer sends SIGKILL.

> 
> > I think this warn_alloc() is too much noise. When something went
> > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > concurrently. We don't need to report similar memory information.
> 
> That is why we have ratelimitting. It is needs a better tunning then
> just let's do it.

I think that calling show_mem() once per series of warn_alloc() callers
is sufficient. Since the amount of output from dump_stack() and from
show_mem() is nearly equal, we can save nearly 50% of the output if we
manage to avoid duplicate show_mem() calls.

> > > OK, so the reason of the lock up must be something different. If we are
> > > really {dead,live}locking on the printk because of warn_alloc then that
> > > path should be tweaked instead. Something like below should rule this
> > > out:
> > 
> > Last year I proposed disabling preemption at
> > http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp
> > but it was not accepted. "while (1);" in userspace corresponds with
> > pointless "direct reclaim and warn_alloc()" in kernel space. This time,
> > I'm proposing serialization by oom_lock and replace warn_alloc() with kmallocwd
> > in order to make printk() not to flood.
> 
> The way how you are trying to push your kmallocwd on any occasion is
> quite annoying to be honest. If that approach would be so much better
> than I am pretty sure you wouldn't have such a problem to have it
> merged. warn_alloc is a simple and straightforward approach. If it can
> cause floods of messages then we should tune it not replace by a big
> hammer.

I wrote kmallocwd ( https://lkml.org/lkml/2016/11/6/7 )
with the following precautions in mind.

 (1) Can trigger even if the allocating tasks got stuck before reaching
     warn_alloc(), as shown by the kswapd vs. shrink_inactive_list() example.
     Will trigger even if new bugs are unexpectedly added in the future.

 (2) Do not printk() too much at once. There are enterprise servers which
     cannot print to the serial console faster than 9600bps. By waiting as
     needed, we can reduce the risk of hitting stall warnings and dropping
     messages. Although there is currently no API which waits until a
     specified amount has been printed to the console, kmallocwd could
     call such an API once it is added.

 (3) Report memory information only once per series of reports.
     Printing memory information for each thread generates too much
     output.

 (4) Report kswapd threads which might be relevant to memory allocation
     stalls.

 (5) Report workqueue status if debugging is enabled, because in many
     stall cases workqueues are observed to be unable to make progress.

 (6) Allow administrators to capture a vmcore (i.e. panic if a stall is
     detected) without adding sysctl tunables for triggering the panic,
     since administrators can install a trigger that calls panic() using
     SystemTap. One sysctl tunable controlling the timeout when kmallocwd
     is enabled is sufficient.

 (7) Allow technical staff at support centers to analyze a vmcore based
     on the last minutes of memory allocation behavior.

 (8) Allow kernel developers to implement and call functions such as
     /proc/*stat which are currently mostly available only through the
     file interface.

Maybe more, but there is no need to enumerate them in this thread.
How many of these precautions can be achieved by tuning warn_alloc()?
printk() tries to solve the unbounded delay problem by using (I guess)
a dedicated kernel thread. I don't think we can achieve these precautions
without centralized state tracking which can sleep and synchronize as
needed.

Very few people are responding to discussions regarding near-OOM
situations. I beg you to join the discussion.


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 12:12               ` Tetsuo Handa
@ 2016-12-12 12:55                 ` Michal Hocko
  2016-12-12 13:19                   ` Michal Hocko
  2016-12-12 14:59                   ` Tetsuo Handa
  0 siblings, 2 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-12 12:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index ed65d7df72d5..c2ba51cec93d 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> > > >  	unsigned int filter = SHOW_MEM_FILTER_NODES;
> > > >  	struct va_format vaf;
> > > >  	va_list args;
> > > > +	static DEFINE_MUTEX(warn_lock);
> > > >  
> > > >  	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
> > > >  	    debug_guardpage_minorder() > 0)
> > > >  		return;
> > > >  
> > > 
> > > if (gfp_mask & __GFP_DIRECT_RECLAIM)
> > 
> > Why?
> 
> Because warn_alloc() is also called by !__GFP_DIRECT_RECLAIM allocation
> requests when allocation failed. We are not allowed to sleep in that case.

Dohh, right. I had forgotten that warn_alloc is also called in the
nopage path. Sorry about that! We can make the lock non-sleepable...

> > 
> > > > +	mutex_lock(&warn_lock);
> > > > +
> > > >  	/*
> > > >  	 * This documents exceptions given to allocations in certain
> > > >  	 * contexts that are allowed to allocate outside current's set
> > > > @@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> > > >  	dump_stack();
> > > >  	if (!should_suppress_show_mem())
> > > >  		show_mem(filter);
> > > > +
> > > 
> > > if (gfp_mask & __GFP_DIRECT_RECLAIM)
> > > 
> > > > +	mutex_unlock(&warn_lock);
> > > >  }
> > > >  
> > > >  static inline struct page *
> > > 
> > > and I think "s/warn_lock/oom_lock/" because out_of_memory() might
> > > call show_mem() concurrently.
> > 
> > I would rather not mix the two. Even if both use show_mem then there is
> > no reason to abuse the oom_lock.
> > 
> > Maybe I've missed that but you haven't responded to the question whether
> > the warn_lock actually resolves the problem you are seeing.
> 
> I haven't tried warn_lock, but is warn_lock in warn_alloc() better than
> serializing oom_lock in __alloc_pages_may_oom() ? I think we don't need to
> waste CPU cycles before the OOM killer sends SIGKILL.

Yes, I find a separate lock better because there is no real reason to
abuse an unrelated lock.

> > > I think this warn_alloc() is too much noise. When something went
> > > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > > concurrently. We don't need to report similar memory information.
> > 
> > That is why we have ratelimitting. It is needs a better tunning then
> > just let's do it.
> 
> I think that calling show_mem() once per a series of warn_alloc() threads is
> sufficient. Since the amount of output by dump_stack() and that by show_mem()
> are nearly equals, we can save nearly 50% of output if we manage to avoid
> the same show_mem() calls.

I do not mind such an update. Again, that is what we have the
ratelimiting for. The fact that it doesn't throttle properly means that
we should tune its parameters.

> > > > OK, so the reason of the lock up must be something different. If we are
> > > > really {dead,live}locking on the printk because of warn_alloc then that
> > > > path should be tweaked instead. Something like below should rule this
> > > > out:
> > > 
> > > Last year I proposed disabling preemption at
> > > http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp
> > > but it was not accepted. "while (1);" in userspace corresponds with
> > > pointless "direct reclaim and warn_alloc()" in kernel space. This time,
> > > I'm proposing serialization by oom_lock and replace warn_alloc() with kmallocwd
> > > in order to make printk() not to flood.
> > 
> > The way how you are trying to push your kmallocwd on any occasion is
> > quite annoying to be honest. If that approach would be so much better
> > than I am pretty sure you wouldn't have such a problem to have it
> > merged. warn_alloc is a simple and straightforward approach. If it can
> > cause floods of messages then we should tune it not replace by a big
> > hammer.
> 
> I wrote kmallocwd ( https://lkml.org/lkml/2016/11/6/7 )
> with the following precautions in mind.
> 

Skipping your points about kmallocwd, which is (for the (N+1)th time)
not related to this thread and which belongs in the changelog of your
patch.

[...]

> Maybe more, but no need to enumerate in this thread.
> How many of these precautions can be achieved by tuning warn_alloc() ?
> printk() tries to solve unbounded delay problem by using (I guess) a
> dedicated kernel thread. I don't think we can achieve these precautions
> without a centralized state tracking which can sleep and synchronize as
> needed.
> 
> Quite few people are responding to discussions regarding almost
> OOM situation. I beg for your joining to discussions.

I have already stated my position. I do not think that the code this
patch introduces is really justified by the advantages it provides over
a simple warn_alloc approach. Additional debugging information might be
nice but is not necessary in 99% of cases. If there are deficiencies in
warn_alloc (which I agree there are if there are thousands of contexts
hitting the path) then let's try to address them.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 11:49               ` Petr Mladek
@ 2016-12-12 13:00                 ` Michal Hocko
  2016-12-12 14:05                   ` Tetsuo Handa
  2016-12-13  1:06                 ` Sergey Senozhatsky
  1 sibling, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-12 13:00 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, linux-mm, Sergey Senozhatsky

On Mon 12-12-16 12:49:03, Petr Mladek wrote:
> On Mon 2016-12-12 10:07:03, Michal Hocko wrote:
> > On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
[...]
> > > The introduction of uncontrolled
> > > 
> > >   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> 
> I am just curious that there would be so many messages.
> If I get it correctly, this warning is printed
> once every 10 second. Or am I wrong?

Yes, it is once per 10s per allocation context. Tetsuo's test case is
generating hundreds of such allocation paths which are hitting the
warn_alloc path, so they can meet there and generate a lot of output.
Now we have __ratelimit here, which should help but most probably needs
better tuning.

I am also considering using a per-warn_alloc lock, which would also help
make the output nicer (no interleaving between parallel callers).

> Well, you might want to consider using
> 
> 		stall_timeout *= 2;
> 
> instead of adding the constant 10 * HZ.

This wouldn't help in the above situation.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 12:55                 ` Michal Hocko
@ 2016-12-12 13:19                   ` Michal Hocko
  2016-12-13 12:06                     ` Tetsuo Handa
  2016-12-12 14:59                   ` Tetsuo Handa
  1 sibling, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-12 13:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Mon 12-12-16 13:55:35, Michal Hocko wrote:
> On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > Michal Hocko wrote:
[...]
> > > > I think this warn_alloc() is too much noise. When something went
> > > > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > > > concurrently. We don't need to report similar memory information.
> > > 
> > > That is why we have ratelimitting. It is needs a better tunning then
> > > just let's do it.
> > 
> > I think that calling show_mem() once per a series of warn_alloc() threads is
> > sufficient. Since the amount of output by dump_stack() and that by show_mem()
> > are nearly equals, we can save nearly 50% of output if we manage to avoid
> > the same show_mem() calls.
> 
> I do not mind such an update. Again, that is what we have the
> ratelimitting for. The fact that it doesn't throttle properly means that
> we should tune its parameters.

What about the following? Does this help?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c52268786027..54348e5a5377 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3003,9 +3003,12 @@ static inline bool should_suppress_show_mem(void)
 	return ret;
 }
 
-static DEFINE_RATELIMIT_STATE(nopage_rs,
-		DEFAULT_RATELIMIT_INTERVAL,
-		DEFAULT_RATELIMIT_BURST);
+/*
+ * Do not swamp logs with allocation failures details
+ * Once per second should be more than enough to get an
+ * overview what is going on
+ */
+static DEFINE_RATELIMIT_STATE(nopage_rs, HZ, 1);
 
 void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 {
@@ -3013,7 +3016,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	struct va_format vaf;
 	va_list args;
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
+	if ((gfp_mask & __GFP_NOWARN) ||
 	    debug_guardpage_minorder() > 0)
 		return;
 
@@ -3040,7 +3043,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	pr_cont(", mode:%#x(%pGg)\n", gfp_mask, &gfp_mask);
 
 	dump_stack();
-	if (!should_suppress_show_mem())
+	if (!should_suppress_show_mem() || __ratelimit(&nopage_rs))
 		show_mem(filter);
 }
 
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 13:00                 ` Michal Hocko
@ 2016-12-12 14:05                   ` Tetsuo Handa
  0 siblings, 0 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-12 14:05 UTC (permalink / raw)
  To: mhocko, pmladek; +Cc: linux-mm, sergey.senozhatsky

Michal Hocko wrote:
> On Mon 12-12-16 12:49:03, Petr Mladek wrote:
> > On Mon 2016-12-12 10:07:03, Michal Hocko wrote:
> > > On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
> [...]
> > > > The introduction of uncontrolled
> > > > 
> > > >   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> > 
> > I am just curious why there would be so many messages.
> > If I get it correctly, this warning is printed
> > once every 10 seconds. Or am I wrong?
> 
> Yes it is once per 10s per allocation context. Tetsuo's test case is
> generating hundreds of such allocation paths which are hitting the
> warn_alloc path. So they can meet there and generate a lot of output.

Excuse me, but most processes in this testcase
( http://lkml.kernel.org/r/201612080029.IBD55588.OSOFOtHVMLQFFJ@I-love.SAKURA.ne.jp )
are blocked on locks. I guess at most a few dozen threads are in allocation paths.

It would be possible to keep more threads inside the allocation paths with
different test cases, but that is not so interesting on a system with only
4 CPUs. It might be interesting with 256 CPUs or more...


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 12:55                 ` Michal Hocko
  2016-12-12 13:19                   ` Michal Hocko
@ 2016-12-12 14:59                   ` Tetsuo Handa
  2016-12-12 15:55                     ` Michal Hocko
  1 sibling, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-12 14:59 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > > I would rather not mix the two. Even if both use show_mem then there is
> > > no reason to abuse the oom_lock.
> > > 
> > > Maybe I've missed that but you haven't responded to the question whether
> > > the warn_lock actually resolves the problem you are seeing.
> > 
> > I haven't tried warn_lock, but is warn_lock in warn_alloc() better than
> > serializing oom_lock in __alloc_pages_may_oom() ? I think we don't need to
> > waste CPU cycles before the OOM killer sends SIGKILL.
> 
> Yes, I find a separate lock better because there is no real reason to
> abuse an unrelated lock.

Using a separate lock for warn_alloc() is fine with me. I can still consider
serialization on oom_lock independently of warn_alloc(). But

> > Maybe more, but no need to enumerate in this thread.
> > How many of these precautions can be achieved by tuning warn_alloc() ?
> > printk() tries to solve unbounded delay problem by using (I guess) a
> > dedicated kernel thread. I don't think we can achieve these precautions
> > without a centralized state tracking which can sleep and synchronize as
> > needed.
> > 
> > Quite few people are responding to discussions regarding almost-OOM
> > situations. I beg you to join these discussions.
> 
> I have already stated my position. I do not think that the code this
> patch introduces is really justified for the advantages it provides over
> a simple warn_alloc approach. Additional debugging information might be
> nice but not necessary in 99% of cases. If there are deficiencies in
> warn_alloc (which I agree there are if there are thousands of contexts
> hitting the path) then let's try to address them.

I'm not happy with keeping kmallocwd out-of-tree.

http://I-love.SAKURA.ne.jp/tmp/serial-20161212.txt.xz is a console log
which I've just captured using stock 4.9 kernel (as a preparation step for
trying http://lkml.kernel.org/r/20161212131910.GC3185@dhcp22.suse.cz ) using
http://lkml.kernel.org/r/201612080029.IBD55588.OSOFOtHVMLQFFJ@I-love.SAKURA.ne.jp .
Only a warn_alloc() for a GFP_NOIO allocation request was reported (uptime > 148).
Guessing from

  INFO: task kswapd0:60 blocked for more than 60 seconds.

message, I hit the kswapd vs. shrink_inactive_list() trap. But there are
no other hints, which would have been reported if kmallocwd were available.
This is one of the unsolvable deficiencies of warn_alloc() (or any
synchronous watchdog).

It is administrators who decide whether to utilize a debugging capability
with state tracking. Let's give administrators a choice and a chance.

Although you think most users won't need kmallocwd, you have no objection
to an asynchronous watchdog, do you?


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 14:59                   ` Tetsuo Handa
@ 2016-12-12 15:55                     ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-12 15:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Mon 12-12-16 23:59:55, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > > > I would rather not mix the two. Even if both use show_mem then there is
> > > > no reason to abuse the oom_lock.
> > > > 
> > > > Maybe I've missed that but you haven't responded to the question whether
> > > > the warn_lock actually resolves the problem you are seeing.
> > > 
> > > I haven't tried warn_lock, but is warn_lock in warn_alloc() better than
> > > serializing oom_lock in __alloc_pages_may_oom() ? I think we don't need to
> > > waste CPU cycles before the OOM killer sends SIGKILL.
> > 
> > Yes, I find a separate lock better because there is no real reason to
> > abuse an unrelated lock.
> 
> Using a separate lock for warn_alloc() is fine with me. I can still consider
> serialization on oom_lock independently of warn_alloc().

Could you try the ratelimit update as well? Maybe it will be sufficient
on its own.

> But
> 
> > > Maybe more, but no need to enumerate in this thread.
> > > How many of these precautions can be achieved by tuning warn_alloc() ?
> > > printk() tries to solve unbounded delay problem by using (I guess) a
> > > dedicated kernel thread. I don't think we can achieve these precautions
> > > without a centralized state tracking which can sleep and synchronize as
> > > needed.
> > > 
> > > Quite few people are responding to discussions regarding almost
> > > OOM situation. I beg for your joining to discussions.
> > 
> > I have already stated my position. I do not think that the code this
> > patch introduces is really justified for the advantages it provides over
> > a simple warn_alloc approach. Additional debugging information might be
> > nice but not necessary in 99% of cases. If there are deficiencies in
> > warn_alloc (which I agree there are if there are thousands of contexts
> > hitting the path) then let's try to address them.
> 
> I'm not happy with keeping kmallocwd out-of-tree.

I completely fail to see why you are still bringing this up. I will repeat
it for the last time and won't reply to any further note about kmallocwd
here or anywhere else where it is not directly discussed. If you think
your code is valuable, document that in the patch description and post
your patch.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 11:49               ` Petr Mladek
  2016-12-12 13:00                 ` Michal Hocko
@ 2016-12-13  1:06                 ` Sergey Senozhatsky
  1 sibling, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-13  1:06 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Michal Hocko, Tetsuo Handa, linux-mm, Sergey Senozhatsky

On (12/12/16 12:49), Petr Mladek wrote:
[..]
> > OK, I see. This is not a new problem though and people are trying to
> > solve it in the printk proper. CCed some people, I do not have links
> > to those threads handy. And if this is really the problem here then we
> > definitely shouldn't put hacks into the page allocator path to handle
> > it, because the sources of the printk flood might be arbitrary.
> 
> Yup, this is exactly the type of the problem that we want to solve
> by the async printk.

yes, I think async printk will help here.

> > > The introduction of uncontrolled
> > > 
> > >   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> 
> I am just curious that there would be so many messages.
> If I get it correctly, this warning is printed
> once every 10 second. Or am I wrong?
> 
> Well, you might want to consider using
> 
> 		stall_timeout *= 2;
> 
> instead of adding the constant 10 * HZ.
> 
> Of course, a better would be some global throttling of
> this message.

yeah. rate limiting is still a good thing to have.

somewhat unrelated, but somehow related. just some thoughts.

with async printk, in some cases, I suspect (and I haven't thought
about it long enough), message rate limiting can be even more
necessary, to some extent, than with the current printk. the thing
is that in the current scheme a CPU that does printk-s *sometimes*
goes to console_unlock() and spins there printing the messages that
it appended to the logbuf, which naturally throttles that CPU so
it can't execute more printk-s for a while. with async printk that
CPU is detached from the console_unlock() printing loop, so the CPU
is free to append new messages to the logbuf as fast as it wants to.
it should not cause any lockups or anything, but we can lose some
messages.

	-ss


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-12 13:19                   ` Michal Hocko
@ 2016-12-13 12:06                     ` Tetsuo Handa
  2016-12-13 17:06                       ` Michal Hocko
  2016-12-14  9:37                       ` Petr Mladek
  0 siblings, 2 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-13 12:06 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> On Mon 12-12-16 13:55:35, Michal Hocko wrote:
> > On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> [...]
> > > > > I think this warn_alloc() is too much noise. When something went
> > > > > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > > > > concurrently. We don't need to report similar memory information.
> > > > 
> > > > > That is why we have ratelimiting. It needs better tuning rather than
> > > > > just "let's do it".
> > > 
> > > I think that calling show_mem() once per a series of warn_alloc() threads is
> > > sufficient. Since the amount of output by dump_stack() and that by show_mem()
> > > are nearly equals, we can save nearly 50% of output if we manage to avoid
> > > the same show_mem() calls.
> > 
> > I do not mind such an update. Again, that is what we have the
> > ratelimitting for. The fact that it doesn't throttle properly means that
> > we should tune its parameters.
> 
> What about the following? Does this help?

I don't think it made much difference.

I noticed that one of the triggers which causes a lot of
"** XXX printk messages dropped **" is show_all_locks() added by
commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks"). When there are
a lot of threads blocked on fs locks, show_all_locks() on each blocked
thread periodically generates an incredible amount of messages. Therefore,
I temporarily set /proc/sys/kernel/hung_task_timeout_secs to 0 to disable
hung task warnings for testing this patch.

http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
this patch applied. With hung task warnings disabled, the amount of messages
is significantly reduced.

Uptime > 400 covers testcases where the stresser was invoked via "taskset -c 0".
Since there are some "** XXX printk messages dropped **" messages, I can't
tell whether the OOM killer was able to make forward progress. But guessing
from the fact that there is no corresponding "Killed process" line for the
"Out of memory: " line at uptime = 450, and from how long PID 14622 stalled,
I think it is OK to say that the system got stuck because the OOM killer was
not able to make forward progress.

----------
[  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
[  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
[  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
[  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
[  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
(...snipped...)
[  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
[  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
----------

Although it is fine to make warn_alloc() less verbose, this is not
a problem which can be avoided by simply reducing printk(). Unless
we give enough CPU time to the OOM killer and OOM victims, it is
trivial to lock up the system.


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-13 12:06                     ` Tetsuo Handa
@ 2016-12-13 17:06                       ` Michal Hocko
  2016-12-14 11:37                         ` Tetsuo Handa
  2016-12-15  1:11                         ` Sergey Senozhatsky
  2016-12-14  9:37                       ` Petr Mladek
  1 sibling, 2 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-13 17:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Tue 13-12-16 21:06:57, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 12-12-16 13:55:35, Michal Hocko wrote:
> > > On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > [...]
> > > > > > I think this warn_alloc() is too much noise. When something went
> > > > > > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > > > > > concurrently. We don't need to report similar memory information.
> > > > > 
> > > > > That is why we have ratelimiting. It needs better tuning rather than
> > > > > just "let's do it".
> > > > 
> > > > I think that calling show_mem() once per a series of warn_alloc() threads is
> > > > sufficient. Since the amount of output by dump_stack() and that by show_mem()
> > > > are nearly equals, we can save nearly 50% of output if we manage to avoid
> > > > the same show_mem() calls.
> > > 
> > > I do not mind such an update. Again, that is what we have the
> > > ratelimitting for. The fact that it doesn't throttle properly means that
> > > we should tune its parameters.
> > 
> > What about the following? Does this help?
> 
> I don't think it made much difference.

Because I am an idiot. The condition is wrong.
	if (!should_suppress_show_mem() || __ratelimit(&nopage_rs))
		show_mem(filter);

It should read
	if (!should_suppress_show_mem() && __ratelimit(&nopage_rs))
		show_mem(filter);

so there was no throttling at all. :/ Sorry about that!
 
> I noticed that one of triggers which cause a lot of
> "** XXX printk messages dropped **" is show_all_locks() added by
> commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks"). When there are
> a lot of threads being blocked on fs locks, show_all_locks() on each blocked
> thread generates incredible amount of messages periodically. Therefore,
> I temporarily set /proc/sys/kernel/hung_task_timeout_secs to 0 to disable
> hung task warnings for testing this patch.
> 
> http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
> this patch applied. Due to hung task warnings disabled, amount of messages
> are significantly reduced.
> 
> Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> Since there are some "** XXX printk messages dropped **" messages, I can't
> tell whether the OOM killer was able to make forward progress. But guessing
>  from the result that there is no corresponding "Killed process" line for
> "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> I think it is OK to say that the system got stuck because the OOM killer was
> not able to make forward progress.

The oom situation certainly didn't get resolved. I would be really
curious whether we can rule printk out of the picture, though. I
am still not sure we can rule out some obscure OOM killer bug at this
stage.

What if we lower the loglevel as much as possible? Only seeing KERN_ERR
should be sufficient to show the few oom killer messages while suppressing
most of the other noise. Unfortunately, even messages with level >
loglevel get stored into the ringbuffer (as I've just learned), so
console_unlock() has to crawl through them just to drop them (meh), but
at least it doesn't have to go to the serial console drivers and spend
even more time there. An alternative would be to tweak printk to not
even store those messages. Something like the below:

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index f7a55e9ff2f7..197f2b9fb703 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1865,6 +1865,15 @@ asmlinkage int vprintk_emit(int facility, int level,
 				lflags |= LOG_CONT;
 			}
 
+			if (suppress_message_printing(kern_level)) {
+				logbuf_cpu = UINT_MAX;
+				raw_spin_unlock(&logbuf_lock);
+				lockdep_on();
+				local_irq_restore(flags);
+				return 0;
+			}
+
+
 			text_len -= 2;
 			text += 2;
 		}

So it would be really great if you could
	1) test with the fixed throttling
	2) loglevel=4 on the kernel command line
	3) try the above with the same loglevel

ideally 1) would be sufficient and that would make the most sense from
the warn_alloc point of view. If it is 2 or 3 then we are hitting a
more generic problem and I would be quite careful about hacking around it.
 
> ----------
> [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> (...snipped...)
> [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> ----------
> 
> Although it is fine to make warn_alloc() less verbose, this is not
> a problem which can be avoided by simply reducing printk(). Unless
> we give enough CPU time to the OOM killer and OOM victims, it is
> trivial to lockup the system.

This is simply hard if there are way too many tasks runnable...

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-13 12:06                     ` Tetsuo Handa
  2016-12-13 17:06                       ` Michal Hocko
@ 2016-12-14  9:37                       ` Petr Mladek
  2016-12-14 10:20                         ` Sergey Senozhatsky
                                           ` (2 more replies)
  1 sibling, 3 replies; 96+ messages in thread
From: Petr Mladek @ 2016-12-14  9:37 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, sergey.senozhatsky

On Tue 2016-12-13 21:06:57, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 12-12-16 13:55:35, Michal Hocko wrote:
> > > On Mon 12-12-16 21:12:06, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > [...]
> > > > > > I think this warn_alloc() is too much noise. When something went
> > > > > > wrong, multiple instances of Thread-2 tend to call warn_alloc()
> > > > > > concurrently. We don't need to report similar memory information.
> > > > > 
> > > > > That is why we have ratelimiting. It needs better tuning rather than
> > > > > just "let's do it".
> > > > 
> > > > I think that calling show_mem() once per a series of warn_alloc() threads is
> > > > sufficient. Since the amount of output by dump_stack() and that by show_mem()
> > > > are nearly equals, we can save nearly 50% of output if we manage to avoid
> > > > the same show_mem() calls.
> > > 
> > > I do not mind such an update. Again, that is what we have the
> > > ratelimitting for. The fact that it doesn't throttle properly means that
> > > we should tune its parameters.
> > 
> > What about the following? Does this help?
> 
> I don't think it made much difference.
> 
> I noticed that one of triggers which cause a lot of
> "** XXX printk messages dropped **" is show_all_locks() added by
> commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks"). When there are
> a lot of threads being blocked on fs locks, show_all_locks() on each blocked
> thread generates incredible amount of messages periodically. Therefore,
> I temporarily set /proc/sys/kernel/hung_task_timeout_secs to 0 to disable
> hung task warnings for testing this patch.
> 
> http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
> this patch applied. Due to hung task warnings disabled, amount of messages
> are significantly reduced.
> 
> Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> Since there are some "** XXX printk messages dropped **" messages, I can't
> tell whether the OOM killer was able to make forward progress. But guessing
>  from the result that there is no corresponding "Killed process" line for
> "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> I think it is OK to say that the system got stuck because the OOM killer was
> not able to make forward progress.

I am afraid that as long as you see "** XXX printk messages dropped
**" there is something that is able to keep warn_alloc() busy,
never leaving printk()/console_unlock(), and blocking OOM killer
progress.

> ----------
> [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> (...snipped...)
> [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> ----------
> 
> Although it is fine to make warn_alloc() less verbose, this is not
> a problem which can be avoided by simply reducing printk(). Unless
> we give enough CPU time to the OOM killer and OOM victims, it is
> trivial to lockup the system.

You could try to use printk_deferred() in warn_alloc(). It will not
handle the console. It will help to confirm that the blocked printk()
is the main problem.

Best Regards,
Petr


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14  9:37                       ` Petr Mladek
@ 2016-12-14 10:20                         ` Sergey Senozhatsky
  2016-12-14 11:01                           ` Petr Mladek
  2016-12-14 10:26                         ` Michal Hocko
  2016-12-14 11:37                         ` Tetsuo Handa
  2 siblings, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-14 10:20 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, mhocko, linux-mm, sergey.senozhatsky

On (12/14/16 10:37), Petr Mladek wrote:
[..]
> > ----------
> > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > (...snipped...)
> > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > ----------
> > 
> > Although it is fine to make warn_alloc() less verbose, this is not
> > a problem which can be avoided by simply reducing printk(). Unless
> > we give enough CPU time to the OOM killer and OOM victims, it is
> > trivial to lockup the system.
> 
> You could try to use printk_deferred() in warn_alloc(). It will not
> handle console. It will help to be sure that the blocked printk()
> is the main problem.

I thought about deferred printk, but I'm afraid that under the given
conditions it has a good chance of badly locking up the system.

	-ss


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14  9:37                       ` Petr Mladek
  2016-12-14 10:20                         ` Sergey Senozhatsky
@ 2016-12-14 10:26                         ` Michal Hocko
  2016-12-15  7:34                           ` Sergey Senozhatsky
  2016-12-14 11:37                         ` Tetsuo Handa
  2 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-14 10:26 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, linux-mm, sergey.senozhatsky

On Wed 14-12-16 10:37:06, Petr Mladek wrote:
> On Tue 2016-12-13 21:06:57, Tetsuo Handa wrote:
[...]
> > Although it is fine to make warn_alloc() less verbose, this is not
> > a problem which can be avoided by simply reducing printk(). Unless
> > we give enough CPU time to the OOM killer and OOM victims, it is
> > trivial to lockup the system.
> 
> You could try to use printk_deferred() in warn_alloc(). It will not
> handle console.

the problem is, however, _any_ printk under the oom_lock. So all of them
would have to be converted AFAIU.

> It will help to be sure that the blocked printk()
> is the main problem.

I think we should rather ratelimit those messages than tweak the way
printk is used. The source of the heavy printk might be completely
different, so this has to be addressed at the printk level.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 10:20                         ` Sergey Senozhatsky
@ 2016-12-14 11:01                           ` Petr Mladek
  2016-12-14 12:23                             ` Sergey Senozhatsky
  0 siblings, 1 reply; 96+ messages in thread
From: Petr Mladek @ 2016-12-14 11:01 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Tetsuo Handa, mhocko, linux-mm, sergey.senozhatsky

On Wed 2016-12-14 19:20:28, Sergey Senozhatsky wrote:
> On (12/14/16 10:37), Petr Mladek wrote:
> [..]
> > > ----------
> > > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > (...snipped...)
> > > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > ----------
> > > 
> > > Although it is fine to make warn_alloc() less verbose, this is not
> > > a problem which can be avoided by simply reducing printk(). Unless
> > > we give enough CPU time to the OOM killer and OOM victims, it is
> > > trivial to lockup the system.
> > 
> > You could try to use printk_deferred() in warn_alloc(). It will not
> > handle console. It will help to be sure that the blocked printk()
> > is the main problem.
> 
> I thought about deferred printk, but I'm afraid in the given
> conditions this has great chances to badly lockup the system.

I am just curious. Do you have any particular scenario in mind?

AFAIK, the current problem is the classic softlockup in
console_unlock(). Other CPUs are producing a flood of printk
messages and the victim is blocked in console_unlock() "forever".
I do not see any deadlock with logbuf_lock.

This is where async printk should help. And printk_deferred()
is the way to use async printk for a particular printk call.

Did I miss something, please?

Best regards,
Petr


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-13 17:06                       ` Michal Hocko
@ 2016-12-14 11:37                         ` Tetsuo Handa
  2016-12-14 12:42                           ` Michal Hocko
  2016-12-15  1:11                         ` Sergey Senozhatsky
  1 sibling, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-14 11:37 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> On Tue 13-12-16 21:06:57, Tetsuo Handa wrote:
> > http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
> this patch applied. With hung task warnings disabled, the amount of messages
> is significantly reduced.
> > 
> > Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> > Since there are some "** XXX printk messages dropped **" messages, I can't
> > tell whether the OOM killer was able to make forward progress. But guessing
> >  from the result that there is no corresponding "Killed process" line for
> > "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> > I think it is OK to say that the system got stuck because the OOM killer was
> > not able to make forward progress.
> 
> The oom situation certainly didn't get resolved. I would be really
> curious whether we can rule out the printk out of the picture, though. I
> am still not sure we can rule out some obscure OOM killer bug at this
> stage.
> 
> What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> should be sufficient to see the few oom killer messages while suppressing
> most of the other noise. Unfortunately, even messages with level >
> loglevel get stored into the ringbuffer (as I've just learned) so
> console_unlock() has to crawl through them just to drop them (Meh) but
> at least it doesn't have to go to the serial console drivers and spend
> even more time there. An alternative would be to tweak printk to not
> even store those messages. Something like the below

Changing the loglevel is not an option for me. Under OOM, syslog cannot work.
Only messages sent to the serial console / netconsole are available for
understanding what went wrong. And serial consoles may be very slow.
We need to try to avoid uncontrolled printk().

> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index f7a55e9ff2f7..197f2b9fb703 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1865,6 +1865,15 @@ asmlinkage int vprintk_emit(int facility, int level,
>  				lflags |= LOG_CONT;
>  			}
>  
> +			if (suppress_message_printing(kern_level)) {
> +				logbuf_cpu = UINT_MAX;
> +				raw_spin_unlock(&logbuf_lock);
> +				lockdep_on();
> +				local_irq_restore(flags);
> +				return 0;
> +			}
> +
> +
>  			text_len -= 2;
>  			text += 2;
>  		}
> 
> So it would be really great if you could
> 	1) test with the fixed throttling
> 	2) loglevel=4 on the kernel command line
> 	3) try the above with the same loglevel
> 
> ideally 1) would be sufficient and that would make the most sense from
> the warn_alloc point of view. If this is 2 or 3 then we are hitting a
> more generic problem and I would be quite careful to hack it around.

Thus, I don't think I can do these.

>  
> > ----------
> > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > (...snipped...)
> > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > ----------
> > 
> > Although it is fine to make warn_alloc() less verbose, this is not
> > a problem which can be avoided by simply reducing printk(). Unless
> > we give enough CPU time to the OOM killer and OOM victims, it is
> > trivial to lockup the system.
> 
> This is simply hard if there are way too many tasks runnable...

Runnable threads which do not involve page allocation do no harm.
Only runnable threads which are almost busy-looping in direct reclaim
are problematic. mutex_lock_killable(&oom_lock) is the simplest approach
for eliminating such threads and preventing them from calling
warn_alloc() uncontrollably.
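The proposed wait can be modeled in userspace (an illustrative sketch only, with pthreads standing in for the kernel's oom_lock and mutex_lock_killable()):

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-in for the kernel's oom_lock. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Model of the proposed wait, i.e.
 *     if (mutex_lock_killable(&oom_lock) == 0)
 *             mutex_unlock(&oom_lock);
 * The caller does no work of its own under the lock; it merely sleeps
 * until the current holder (the OOM killer) releases it, so the retry
 * happens only after the OOM killer had a chance to make progress.
 */
static void throttle_before_retry(void)
{
	if (pthread_mutex_lock(&oom_lock) == 0)
		pthread_mutex_unlock(&oom_lock);
}
```

The point of the pattern is that the lock is released immediately: it serializes the retry behind the lock holder without adding any new lock-ordering dependency.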


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14  9:37                       ` Petr Mladek
  2016-12-14 10:20                         ` Sergey Senozhatsky
  2016-12-14 10:26                         ` Michal Hocko
@ 2016-12-14 11:37                         ` Tetsuo Handa
  2016-12-14 12:36                           ` Petr Mladek
  2 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-14 11:37 UTC (permalink / raw)
  To: pmladek; +Cc: mhocko, linux-mm, sergey.senozhatsky

Petr Mladek wrote:
> On Tue 2016-12-13 21:06:57, Tetsuo Handa wrote:
> > Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> > Since there are some "** XXX printk messages dropped **" messages, I can't
> > tell whether the OOM killer was able to make forward progress. But guessing
> >  from the result that there is no corresponding "Killed process" line for
> > "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> > I think it is OK to say that the system got stuck because the OOM killer was
> > not able to make forward progress.
> 
> I am afraid that as long as you see "** XXX printk messages dropped
> **" then there is something that is able to keep warn_alloc() busy,
> never leave the printk()/console_unlock(), and block OOM killer
> progress.

Excuse me, but it is not warn_alloc() but functions that call printk()
which are kept busy with oom_lock held (e.g. oom_kill_process()).

----------
[ 1845.191495] MemAlloc: a.out(15607) flags=0x400040 switches=18863 seq=3 gfp=0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=604455
[ 1845.191498] a.out           R  running task    11824 15607  14625 0x00000080
[ 1845.191498] Call Trace:
[ 1845.191500]  ? __schedule+0x23f/0xba0
[ 1845.191501]  preempt_schedule_common+0x1f/0x32
[ 1845.191502]  _cond_resched+0x1d/0x30
[ 1845.191503]  console_unlock+0x257/0x620
[ 1845.191504]  vprintk_emit+0x33a/0x520
[ 1845.191505]  vprintk_default+0x1a/0x20
[ 1845.191506]  printk+0x58/0x6f
[ 1845.191507]  show_mem+0xb7/0xf0
[ 1845.191508]  dump_header+0xa0/0x3de
[ 1845.191509]  ? trace_hardirqs_on+0xd/0x10
[ 1845.191510]  oom_kill_process+0x226/0x500
[ 1845.191511]  out_of_memory+0x140/0x5a0
[ 1845.191512]  ? out_of_memory+0x210/0x5a0
[ 1845.191513]  __alloc_pages_nodemask+0x1077/0x10e0
[ 1845.191514]  cache_grow_begin+0xcf/0x630
[ 1845.191515]  ? ____cache_alloc_node+0x1bf/0x240
[ 1845.191515]  fallback_alloc+0x1e5/0x290
[ 1845.191516]  ____cache_alloc_node+0x235/0x240
[ 1845.191534]  ? kmem_zone_alloc+0x91/0x120 [xfs]
[ 1845.191535]  kmem_cache_alloc+0x26c/0x3e0
[ 1845.191551]  kmem_zone_alloc+0x91/0x120 [xfs]
[ 1845.191567]  xfs_trans_alloc+0x68/0x130 [xfs]
[ 1845.191584]  xfs_iomap_write_allocate+0x209/0x390 [xfs]
[ 1845.191596]  ? xfs_bmbt_get_all+0x13/0x20 [xfs]
[ 1845.191611]  ? xfs_map_blocks+0xf6/0x4d0 [xfs]
[ 1845.191612]  ? rcu_read_lock_sched_held+0x91/0xa0
[ 1845.191625]  xfs_map_blocks+0x211/0x4d0 [xfs]
[ 1845.191639]  xfs_do_writepage+0x1e0/0x870 [xfs]
[ 1845.191640]  write_cache_pages+0x24a/0x680
[ 1845.191653]  ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1845.191666]  xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1845.191667]  do_writepages+0x1c/0x30
[ 1845.191668]  __filemap_fdatawrite_range+0xc1/0x100
[ 1845.191669]  filemap_write_and_wait_range+0x28/0x60
[ 1845.191692]  xfs_file_fsync+0x86/0x310 [xfs]
[ 1845.191694]  vfs_fsync_range+0x38/0xa0
[ 1845.191696]  ? return_from_SYSCALL_64+0x2d/0x7a
[ 1845.191697]  do_fsync+0x38/0x60
[ 1845.191698]  SyS_fsync+0xb/0x10
[ 1845.191699]  do_syscall_64+0x67/0x1f0
[ 1845.191700]  entry_SYSCALL64_slow_path+0x25/0x25
----------

> 
> > ----------
> > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > (...snipped...)
> > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > ----------
> > 
> > Although it is fine to make warn_alloc() less verbose, this is not
> > a problem which can be avoided by simply reducing printk(). Unless
> > we give enough CPU time to the OOM killer and OOM victims, it is
> > trivial to lockup the system.
> 
> You could try to use printk_deferred() in warn_alloc(). It will not
> handle console. It will help to be sure that the blocked printk()
> is the main problem.

If we can map all printk() calls inside oom_kill_process() to printk_deferred(),
we can avoid cond_resched() inside console_unlock() with oom_lock held.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 11:01                           ` Petr Mladek
@ 2016-12-14 12:23                             ` Sergey Senozhatsky
  2016-12-14 12:47                               ` Petr Mladek
  0 siblings, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-14 12:23 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm, sergey.senozhatsky

On (12/14/16 12:01), Petr Mladek wrote:
> On Wed 2016-12-14 19:20:28, Sergey Senozhatsky wrote:
> > On (12/14/16 10:37), Petr Mladek wrote:
> > [..]
> > > > ----------
> > > > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > > > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > > > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > > > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > > (...snipped...)
> > > > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > > ----------
> > > > 
> > > > Although it is fine to make warn_alloc() less verbose, this is not
> > > > a problem which can be avoided by simply reducing printk(). Unless
> > > > we give enough CPU time to the OOM killer and OOM victims, it is
> > > > trivial to lockup the system.
> > > 
> > > You could try to use printk_deferred() in warn_alloc(). It will not
> > > handle console. It will help to be sure that the blocked printk()
> > > is the main problem.
> > 
> > I thought about deferred printk, but I'm afraid in the given
> > conditions this has great chances to badly lockup the system.
> 
> I am just curious. Do you have any particular scenario in mind?
> 
> AFAIK, the current problem is the classic softlockup in
> console_unlock(). Other CPUs are producing a flood of printk
> messages and the victim is blocked in console_unlock() "forever".
> I do not see any deadlock with logbuf_lock.

well, printk_deferred moves console_unlock() to IRQ context. so
we still have a classic lockup: other CPUs can still add messages
to logbuf, except that now we are doing the printing from IRQ (assuming
that the IRQ work acquired the console sem). a lockup in IRQ is worse
than a softlockup. (well, just saying)


static void wake_up_klogd_work_func(struct irq_work *irq_work)
{
	if (console_trylock())
		console_unlock();
}

static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
	.func = wake_up_klogd_work_func,
	.flags = IRQ_WORK_LAZY,
};



> This is where async printk should help. And printk_deferred()
> is the way to use async printk for a particular printk call.

yes, with the difference that async printk does not work from IRQ.
I'm a bit lost here, sorry. do you mean the async-printk/deferred
patch set from my tree or the current printk_deferred()?

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 11:37                         ` Tetsuo Handa
@ 2016-12-14 12:36                           ` Petr Mladek
  2016-12-14 12:44                             ` Michal Hocko
  2016-12-14 12:50                             ` Sergey Senozhatsky
  0 siblings, 2 replies; 96+ messages in thread
From: Petr Mladek @ 2016-12-14 12:36 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, sergey.senozhatsky

On Wed 2016-12-14 20:37:51, Tetsuo Handa wrote:
> Petr Mladek wrote:
> > On Tue 2016-12-13 21:06:57, Tetsuo Handa wrote:
> > > Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> > > Since there are some "** XXX printk messages dropped **" messages, I can't
> > > tell whether the OOM killer was able to make forward progress. But guessing
> > >  from the result that there is no corresponding "Killed process" line for
> > > "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> > > I think it is OK to say that the system got stuck because the OOM killer was
> > > not able to make forward progress.
> > 
> > I am afraid that as long as you see "** XXX printk messages dropped
> > **" then there is something that is able to keep warn_alloc() busy,
> > never leave the printk()/console_unlock(), and block OOM killer
> > progress.
> 
> Excuse me, but it is not warn_alloc() but functions that call printk()
> which are kept busy with oom_lock held (e.g. oom_kill_process()).

No, they are keeping each other busy. If I understand it correctly,
this is a livelock:

First, OOM killer stalls inside console_unlock() because
other processes produce new messages faster than it is able to
push to console.

Second, the other processes stall because they are waiting for
the OOM killer to get some free memory.

Now, the blocked processes try to report the situation and produce
that many messages. But there are also other producers, like the hung
task detector, which sees the problem from outside and tries to report
it as well.


There are basically two solutions for this situation:

1. Fix printk() so that it does not block forever. This will
   get solved by the async printk patchset[*]. In the meantime,
   a particular sensitive location might be worked around
   by using printk_deferred() instead of printk()[**]

2. Reduce the amount of messages. It is insane to report
   the same problem many times so that the same messages
   fill the entire log buffer. Note that the allocator
   is not the only sinner here.

In fact, both solutions make sense together.


[*] The async printk patchset is flying around in many
    modifications for years. I am more optimistic after
    the discussions on the last Kernel Summit. Anyway,
    it will not be in mainline before 4.12.

[**] printk_deferred() only puts messages into the log
     buffer. It does not call
     console_trylock()/console_unlock(). Therefore,
     it is always "fast".
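The split described in [**] can be modeled in userspace (an illustrative sketch, not the kernel implementation; the names and ring-buffer layout are made up): storing a message is a fast, bounded operation, while flushing to the console is the slow part that can be done from a different context later:

```c
#include <assert.h>

#define LOG_CAP 8
static const char *logbuf[LOG_CAP];
static int log_head, log_tail;

/* printk_deferred() analogue: only stores; never touches the console. */
static int log_store(const char *msg)
{
	if ((log_head + 1) % LOG_CAP == log_tail)
		return -1;		/* buffer full: message dropped */
	logbuf[log_head] = msg;
	log_head = (log_head + 1) % LOG_CAP;
	return 0;
}

/* console_unlock() analogue: the slow flush, run from some other context. */
static int log_flush(void)
{
	int flushed = 0;

	while (log_tail != log_head) {
		/* here the real code would write logbuf[log_tail]
		 * out to the (potentially very slow) console */
		log_tail = (log_tail + 1) % LOG_CAP;
		flushed++;
	}
	return flushed;
}
```

This is why the deferred variant is always "fast" for the caller: its cost is one buffer store, and the unbounded console work lands on whoever runs the flush.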

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 11:37                         ` Tetsuo Handa
@ 2016-12-14 12:42                           ` Michal Hocko
  2016-12-14 16:36                             ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-14 12:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Wed 14-12-16 20:37:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 13-12-16 21:06:57, Tetsuo Handa wrote:
> > > http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
> > > this patch applied. With hung task warnings disabled, the amount of messages
> > > is significantly reduced.
> > > 
> > > Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> > > Since there are some "** XXX printk messages dropped **" messages, I can't
> > > tell whether the OOM killer was able to make forward progress. But guessing
> > >  from the result that there is no corresponding "Killed process" line for
> > > "Out of memory: " line at uptime = 450 and the duration of PID 14622 stalled,
> > > I think it is OK to say that the system got stuck because the OOM killer was
> > > not able to make forward progress.
> > 
> > The oom situation certainly didn't get resolved. I would be really
> > curious whether we can rule out the printk out of the picture, though. I
> > am still not sure we can rule out some obscure OOM killer bug at this
> > stage.
> > 
> > What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> > should be sufficient to see the few oom killer messages while suppressing
> > most of the other noise. Unfortunately, even messages with level >
> > loglevel get stored into the ringbuffer (as I've just learned) so
> > console_unlock() has to crawl through them just to drop them (Meh) but
> > at least it doesn't have to go to the serial console drivers and spend
> > even more time there. An alternative would be to tweak printk to not
> > even store those messages. Something like the below
> 
> Changing the loglevel is not an option for me. Under OOM, syslog cannot work.
> Only messages sent to the serial console / netconsole are available for
> understanding what went wrong. And serial consoles may be very slow.
> We need to try to avoid uncontrolled printk().

That is definitely true. I just wanted the above for the sake of testing
and ruling out a different problem, because currently it is not clear to
me that this is the printk livelock issue. The evidence is quite
convincing, but not 100% conclusive. So...

> > So it would be really great if you could
> > 	1) test with the fixed throttling
> > 	2) loglevel=4 on the kernel command line
> > 	3) try the above with the same loglevel
> > 
> > ideally 1) would be sufficient and that would make the most sense from
> > the warn_alloc point of view. If this is 2 or 3 then we are hitting a
> > more generic problem and I would be quite careful to hack it around.
> 
> Thus, I don't think I can do these.

I think this would be really valuable.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 12:36                           ` Petr Mladek
@ 2016-12-14 12:44                             ` Michal Hocko
  2016-12-14 13:36                               ` Tetsuo Handa
  2016-12-14 12:50                             ` Sergey Senozhatsky
  1 sibling, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-14 12:44 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, linux-mm, sergey.senozhatsky

On Wed 14-12-16 13:36:44, Petr Mladek wrote:
[...]
> There are basically two solutions for this situation:
> 
> 1. Fix printk() so that it does not block forever. This will
>    get solved by the async printk patchset[*]. In the meantime,
>    a particular sensitive location might be worked around
>    by using printk_deferred() instead of printk()[**]

Absolutely!

> 2. Reduce the amount of messages. It is insane to report
>    the same problem many times so that the same messages
>    fill the entire log buffer. Note that the allocator
>    is not the only sinner here.

sure and the ratelimit patch should help in that direction.
show_mem for each allocation stall is really way too much.
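The interval/burst throttling that a ratelimit provides can be sketched in userspace (an illustrative model with a caller-supplied clock; the real kernel interface is struct ratelimit_state and ___ratelimit(), which this does not reproduce exactly):

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

struct ratelimit {
	time_t begin;		/* start of the current interval */
	int interval_sec;	/* window length in seconds */
	int burst;		/* messages allowed per window */
	int printed;		/* messages emitted in this window */
};

/* Return true if a message may be emitted at time 'now'. */
static bool ratelimit_ok(struct ratelimit *rs, time_t now)
{
	if (now - rs->begin >= rs->interval_sec) {
		rs->begin = now;	/* new window: reset the budget */
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;			/* over budget: suppress */
}
```

With e.g. a 5-second window and a burst of 2, a flood of stall reports collapses to at most two messages per window, however many threads are stalling.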

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 12:23                             ` Sergey Senozhatsky
@ 2016-12-14 12:47                               ` Petr Mladek
  0 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2016-12-14 12:47 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm

On Wed 2016-12-14 21:23:13, Sergey Senozhatsky wrote:
> On (12/14/16 12:01), Petr Mladek wrote:
> > On Wed 2016-12-14 19:20:28, Sergey Senozhatsky wrote:
> > > On (12/14/16 10:37), Petr Mladek wrote:
> > > [..]
> > > > > ----------
> > > > > [  450.767693] Out of memory: Kill process 14642 (a.out) score 999 or sacrifice child
> > > > > [  450.769974] Killed process 14642 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  450.776538] oom_reaper: reaped process 14642 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  450.781170] Out of memory: Kill process 14643 (a.out) score 999 or sacrifice child
> > > > > [  450.783469] Killed process 14643 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  450.787912] oom_reaper: reaped process 14643 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  450.792630] Out of memory: Kill process 14644 (a.out) score 999 or sacrifice child
> > > > > [  450.964031] a.out: page allocation stalls for 10014ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > > > [  450.964033] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > > > (...snipped...)
> > > > > [  740.984902] a.out: page allocation stalls for 300003ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > > > > [  740.984905] CPU: 0 PID: 14622 Comm: a.out Tainted: G        W       4.9.0+ #99
> > > > > ----------
> > > > > 
> > > > > Although it is fine to make warn_alloc() less verbose, this is not
> > > > > a problem which can be avoided by simply reducing printk(). Unless
> > > > > we give enough CPU time to the OOM killer and OOM victims, it is
> > > > > trivial to lockup the system.
> > > > 
> > > > You could try to use printk_deferred() in warn_alloc(). It will not
> > > > handle console. It will help to be sure that the blocked printk()
> > > > is the main problem.
> > > 
> > > I thought about deferred printk, but I'm afraid in the given
> > > conditions this has great chances to badly lockup the system.
> > 
> > I am just curious. Do you have any particular scenario in mind?
> > 
> > AFAIK, the current problem is the classic softlockup in
> > console_unlock(). Other CPUs are producing a flood of printk
> > messages and the victim is blocked in console_unlock() "forever".
> > I do not see any deadlock with logbuf_lock.
> 
> well, printk_deferred moves console_unlock() to IRQ context. so
> we still have a classic lockup: other CPUs can still add messages
> to logbuf, except that now we are doing the printing from IRQ (assuming
> that the IRQ work acquired the console sem). a lockup in IRQ is worse
> than a softlockup. (well, just saying)

You are right. The current printk_deferred() will not solve anything.
It defers the work to IRQ context. If the IRQ interrupts the OOM killer,
we are in the same livelock situation.

The only solution is the async printk patchset, which allows deferring
the console flushing to a kthread.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 12:36                           ` Petr Mladek
  2016-12-14 12:44                             ` Michal Hocko
@ 2016-12-14 12:50                             ` Sergey Senozhatsky
  1 sibling, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-14 12:50 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, mhocko, linux-mm, sergey.senozhatsky

On (12/14/16 13:36), Petr Mladek wrote:
[..]
> [*] The async printk patchset is flying around in many
>     modifications for years. I am more optimistic after
>     the discussions on the last Kernel Summit. Anyway,
>     it will not be in mainline before 4.12.
> 
> [**] printk_deferred() only puts messages into the log
>      buffer. It does not call
>      console_trylock()/console_unlock(). Therefore,
>      it is always "fast".

a small addition,

as a side effect, printk_deferred() guarantees that we will attempt
console_unlock() from IRQ: the CPU's pending bit stays set until we
run the irq work list on that CPU, and the per-CPU irq work stays
queued in that CPU's irq work list.

so, yes, printk_deferred() adds messages to logbuf, but in
exchange it says:
    "I promise I will try to do console_unlock() from IRQ".

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 12:44                             ` Michal Hocko
@ 2016-12-14 13:36                               ` Tetsuo Handa
  2016-12-14 13:52                                 ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-14 13:36 UTC (permalink / raw)
  To: mhocko, pmladek; +Cc: linux-mm, sergey.senozhatsky

Michal Hocko wrote:
> On Wed 14-12-16 13:36:44, Petr Mladek wrote:
> [...]
> > There are basically two solutions for this situation:
> > 
> > 1. Fix printk() so that it does not block forever. This will
> >    get solved by the async printk patchset[*]. In the meantime,
> >    a particular sensitive location might be worked around
> >    by using printk_deferred() instead of printk()[**]
> 
> Absolutely!
> 
> > 2. Reduce the amount of messages. It is insane to report
> >    the same problem many times so that the same messages
> >    fill the entire log buffer. Note that the allocator
> >    is not the only sinner here.
> 
> sure and the ratelimit patch should help in that direction.
> show_mem for each allocation stall is really way too much.

dump_stack() from warn_alloc() for each allocation stall is also too much.
For a synchronous watchdog like warn_alloc(), each thread's backtrace
never changes for a given allocation request, because it is always taken
from the same location (i.e. __alloc_pages_slowpath()). A backtrace might
be useful for each thread's first allocation stall report for that
allocation request, but subsequent ones are noise unless the first one
was corrupted/dropped, for they only say that the allocation retry loop
did not get stuck inside e.g. shrink_inactive_list(). Maybe we don't need
to call warn_alloc() for each allocation stall; call warn_alloc() only
once and then use a one-liner report.
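The "backtrace only once per allocation request" idea above amounts to a per-request flag carried across retries (a hypothetical sketch; the struct and helper names are invented here, not taken from any posted patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-allocation-request state carried across retries. */
struct alloc_stall_state {
	bool stack_dumped;
	unsigned int stall_reports;
};

/*
 * First stall report for this allocation request: dump the full
 * backtrace. Subsequent reports: one-liner only, since the backtrace
 * cannot change (the stall is always reported from the same retry loop).
 */
static bool should_dump_stack(struct alloc_stall_state *st)
{
	st->stall_reports++;
	if (st->stack_dumped)
		return false;
	st->stack_dumped = true;
	return true;
}
```

The caller would then emit dump_stack() only when the helper returns true, and otherwise print a single line noting the stall duration.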


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 13:36                               ` Tetsuo Handa
@ 2016-12-14 13:52                                 ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-14 13:52 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: pmladek, linux-mm, sergey.senozhatsky

On Wed 14-12-16 22:36:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 14-12-16 13:36:44, Petr Mladek wrote:
> > [...]
> > > > There are basically two solutions for this situation:
> > > 
> > > 1. Fix printk() so that it does not block forever. This will
> > >    get solved by the async printk patchset[*]. In the meantime,
> > >    a particular sensitive location might be worked around
> > >    by using printk_deferred() instead of printk()[**]
> > 
> > Absolutely!
> > 
> > > 2. Reduce the amount of messages. It is insane to report
> > >    the same problem many times so that the same messages
> > >    fill the entire log buffer. Note that the allocator
> > >    is not the only sinner here.
> > 
> > Sure, and the ratelimit patch should help in that direction.
> > show_mem() for each allocation stall is really way too much.
> 
> dump_stack() from warn_alloc() for each allocation stall is also too much.
> With a synchronous watchdog like warn_alloc(), a thread's backtrace never
> changes for a given allocation request, because warn_alloc() is always
> called from the same location (i.e. __alloc_pages_slowpath()). A backtrace
> might be useful for a thread's first stall report for that allocation
> request, but subsequent ones are just noise unless the first one was
> corrupted/dropped,

Well, the problem is when the ringbuffer overflows and then we lose
older data - and the stack as well. But I agree that dumping it for each
allocation is a lot of noise. We can be more clever than that, but that
is more complicated I guess. A global ratelimit will not work, and we
most probably do not want a per-task ratelimit either, because that
sounds like too much.

> for they only say that the allocation retry loop did not get stuck inside
> e.g. shrink_inactive_list(). Maybe we don't need to call warn_alloc() for
> each allocation stall; call warn_alloc() only once and then use a one-liner
> report.

The thing is that we want occasional show_mem because we want to see how
the situation with the memory counters evolves over time.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 12:42                           ` Michal Hocko
@ 2016-12-14 16:36                             ` Tetsuo Handa
  2016-12-14 18:18                               ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-14 16:36 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> On Wed 14-12-16 20:37:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 13-12-16 21:06:57, Tetsuo Handa wrote:
> > > > http://I-love.SAKURA.ne.jp/tmp/serial-20161213.txt.xz is a console log with
> > > > this patch applied. With hung task warnings disabled, the amount of messages
> > > > is significantly reduced.
> > > > 
> > > > Uptime > 400 are testcases where the stresser was invoked via "taskset -c 0".
> > > > Since there are some "** XXX printk messages dropped **" messages, I can't
> > > > tell whether the OOM killer was able to make forward progress. But guessing
> > > > from the fact that there is no corresponding "Killed process" line for the
> > > > "Out of memory: " line at uptime = 450, and from how long PID 14622 stalled,
> > > > I think it is OK to say that the system got stuck because the OOM killer was
> > > > not able to make forward progress.
> > > 
> > > The oom situation certainly didn't get resolved. I would be really
> > > curious whether we can rule out the printk out of the picture, though. I
> > > am still not sure we can rule out some obscure OOM killer bug at this
> > > stage.
> > > 
> > > What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> > > should be sufficient to show the few oom killer messages while suppressing
> > > most of the other noise. Unfortunately, even messages with level >
> > > loglevel get stored into the ringbuffer (as I've just learned), so
> > > console_unlock() has to crawl through them just to drop them (Meh), but
> > > at least it doesn't have to go to the serial console drivers and spend
> > > even more time there. An alternative would be to tweak printk to not
> > > even store those messages. Something like the below
> > 
> > Changing the loglevel is not an option for me. Under OOM, syslog cannot work;
> > only messages sent to a serial console / netconsole are available for
> > understanding what went wrong. And serial consoles may be very slow.
> > We need to try to avoid uncontrolled printk().
> 
> That is definitely true. I just wanted the above for the sake of testing
> and ruling out a different problem, because currently it is not clear to
> me that this is the printk livelock issue. The evidence is quite
> convincing, but not 100% conclusive. So...
> 
> > > So it would be really great if you could
> > > 	1) test with the fixed throttling
> > > 	2) loglevel=4 on the kernel command line
> > > 	3) try the above with the same loglevel
> > > 
> > > ideally 1) would be sufficient and that would make the most sense from
> > > the warn_alloc point of view. If this is 2 or 3 then we are hitting a
> > > more generic problem and I would be quite careful to hack it around.
> > 
> > Thus, I don't think I can do these.
> 
> I think this would be really valuable.

OK. I tried 1) and 2). I didn't try 3) because printk() did not work as expected.

Regarding 1), it did not help. I can still see "** XXX printk messages dropped **"
( http://I-love.SAKURA.ne.jp/tmp/serial-20161215-1.txt.xz ).

Regarding 2), I can't tell whether it helped
( http://I-love.SAKURA.ne.jp/tmp/serial-20161215-2.txt.xz ).
I can no longer see "** XXX printk messages dropped **", but allocations sometimes
stalled. In most cases, the "Out of memory: " and "Killed process" lines are printed
within 0.1 second, but sometimes it took a few seconds, and less often longer than a
minute. There was one big stall which lasted for minutes. I changed the loglevel to 7
and checked memory information. It seems the watermark was low enough to call
out_of_memory().

[  371.077952] Out of memory: Kill process 5092 (a.out) score 999 or sacrifice child
[  371.080486] Killed process 5092 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  371.087651] Out of memory: Kill process 5093 (a.out) score 999 or sacrifice child
[  371.090130] Killed process 5093 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  371.096977] Out of memory: Kill process 5094 (a.out) score 999 or sacrifice child
[  371.099452] Killed process 5094 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  609.565043] sysrq: SysRq : Show Memory
[  617.645805] sysrq: SysRq : Changing Loglevel
[  617.647667] sysrq: Loglevel set to 7
[  619.493984] sysrq: SysRq : Show Memory
[  619.495721] Mem-Info:
[  619.497065] active_anon:356034 inactive_anon:2961 isolated_anon:0
[  619.497065]  active_file:57 inactive_file:133 isolated_file:32
[  619.497065]  unevictable:0 dirty:14 writeback:0 unstable:0
[  619.497065]  slab_reclaimable:3654 slab_unreclaimable:29434
[  619.497065]  mapped:718 shmem:4209 pagetables:9032 bounce:0
[  619.497065]  free:12922 free_pcp:89 free_cma:0
[  619.508579] Node 0 active_anon:1424136kB inactive_anon:11844kB active_file:228kB inactive_file:532kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2872kB dirty:56kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 1161216kB anon_thp: 16836kB writeback_tmp:0kB unstable:0kB pages_scanned:1347 all_unreclaimable? yes
[  619.516992] Node 0 DMA free:7120kB min:412kB low:512kB high:612kB active_anon:8752kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  619.525582] lowmem_reserve[]: 0 1677 1677 1677
[  619.527519] Node 0 DMA32 free:44568kB min:44640kB low:55800kB high:66960kB active_anon:1415384kB inactive_anon:11844kB active_file:228kB inactive_file:532kB unevictable:0kB writepending:56kB present:2080640kB managed:1717740kB mlocked:0kB slab_reclaimable:14616kB slab_unreclaimable:117704kB kernel_stack:18816kB pagetables:36128kB bounce:0kB free_pcp:356kB local_pcp:0kB free_cma:0kB
[  619.536967] lowmem_reserve[]: 0 0 0 0
[  619.538808] Node 0 DMA: 0*4kB 0*8kB 1*16kB (M) 0*32kB 3*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7120kB
[  619.542971] Node 0 DMA32: 2*4kB (UH) 248*8kB (MEH) 59*16kB (UMEH) 135*32kB (UMEH) 47*64kB (UMEH) 8*128kB (UEH) 4*256kB (UEH) 31*512kB (M) 16*1024kB (UM) 0*2048kB 0*4096kB = 44568kB
[  619.548827] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  619.551524] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  619.554147] 4441 total pagecache pages
[  619.555838] 0 pages in swap cache
[  619.557421] Swap cache stats: add 0, delete 0, find 0/0
[  619.559359] Free swap  = 0kB
[  619.560827] Total swap = 0kB
[  619.562312] 524157 pages RAM
[  619.563779] 0 pages HighMem/MovableOnly
[  619.565418] 90746 pages reserved
[  619.566897] 0 pages hwpoisoned
[  624.638061] a.out: page allocation stalls for 140001ms, order:0[  624.646725] a.out: 
[  624.646727] page allocation stalls for 140026ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  624.646731] CPU: 0 PID: 5167 Comm: a.out Tainted: G        W       4.9.0+ #102
[  624.646732] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  624.646733]  ffff880060dab930
[  624.646734]  ffffffff8134b0af ffffffff8198ce88 0000000000000001 ffff880060dab9b8
[  624.646735]  ffffffff8115489b 0342004a5fa93740 ffffffff8198ce88 ffff880060dab958
[  624.646737]  ffff880000000010 ffff880060dab9c8 ffff880060dab978Call Trace:
[  624.646744]  [<ffffffff8134b0af>] dump_stack+0x67/0x98
[  624.646747]  [<ffffffff8115489b>] warn_alloc+0x12b/0x170
[  624.646748]  [<ffffffff8115526b>] __alloc_pages_nodemask+0x91b/0xf20
[  624.646751]  [<ffffffff811a71e6>] alloc_pages_current+0x96/0x190
[  624.646754]  [<ffffffff811488f2>] __page_cache_alloc+0x142/0x180
[  624.646755]  [<ffffffff81149208>] ? find_get_entry+0x198/0x270
[  624.646756]  [<ffffffff81149070>] ? page_cache_prev_hole+0x50/0x50
[  624.646758]  [<ffffffff8114949b>] pagecache_get_page+0x8b/0x2a0
[  624.646759]  [<ffffffff8114a92e>] grab_cache_page_write_begin+0x1e/0x40
[  624.646761]  [<ffffffff81244adb>] iomap_write_begin+0x4b/0x100
[  624.646762]  [<ffffffff81244d60>] iomap_write_actor+0xb0/0x190
[  624.646764]  [<ffffffff812cb28b>] ? xfs_trans_commit+0xb/0x10
[  624.646765]  [<ffffffff81244cb0>] ? iomap_write_end+0x70/0x70
[  624.646766]  [<ffffffff812453ae>] iomap_apply+0xae/0x130
[  624.646767]  [<ffffffff81245493>] iomap_file_buffered_write+0x63/0xa0
[  624.646768]  [<ffffffff81244cb0>] ? iomap_write_end+0x70/0x70
[  624.646770]  [<ffffffff812b03af>] xfs_file_buffered_aio_write+0xcf/0x1f0
[  624.646772]  [<ffffffff812b0555>] xfs_file_write_iter+0x85/0x120
[  624.646773]  [<ffffffff811dc770>] __vfs_write+0xe0/0x140
[  624.646774]  [<ffffffff811dd440>] vfs_write+0xb0/0x1b0
[  624.646776]  [<ffffffff81002240>] ? syscall_trace_enter+0x1b0/0x240
[  624.646778]  [<ffffffff811de8e3>] SyS_write+0x53/0xc0
[  624.646781]  [<ffffffff81367963>] ? __this_cpu_preempt_check+0x13/0x20
[  624.646781]  [<ffffffff81002511>] do_syscall_64+0x61/0x1d0
[  624.646784]  [<ffffffff816b9d64>] entry_SYSCALL64_slow_path+0x25/0x25
[  624.646786] Mem-Info:
[  624.646788] active_anon:356034 inactive_anon:2961 isolated_anon:0
[  624.646788]  active_file:57 inactive_file:133 isolated_file:32
[  624.646788]  unevictable:0 dirty:14 writeback:0 unstable:0
[  624.646788]  slab_reclaimable:3654 slab_unreclaimable:29434
[  624.646788]  mapped:718 shmem:4209 pagetables:9032 bounce:0
[  624.646788]  free:12922 free_pcp:89 free_cma:0
[  624.646791] Node 0 active_anon:1424136kB inactive_anon:11844kB active_file:228kB inactive_file:532kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2872kB dirty:56kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 1161216kB anon_thp: 16836kB writeback_tmp:0kB unstable:0kB pages_scanned:1347 all_unreclaimable? yes
[  624.646792] Node 0 
[  624.646794] DMA free:7120kB min:412kB low:512kB high:612kB active_anon:8752kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]:
[  624.646795]  0 1677 1677 1677Node 0 
[  624.646799] DMA32 free:44568kB min:44640kB low:55800kB high:66960kB active_anon:1415384kB inactive_anon:11844kB active_file:228kB inactive_file:532kB unevictable:0kB writepending:56kB present:2080640kB managed:1717740kB mlocked:0kB slab_reclaimable:14616kB slab_unreclaimable:117704kB kernel_stack:18816kB pagetables:36128kB bounce:0kB free_pcp:356kB local_pcp:120kB free_cma:0kB
lowmem_reserve[]:
[  624.646800]  0 0 0 0Node 0 
[  624.646801] DMA: 0*4kB 0*8kB 1*16kB (M) 0*32kB 3*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7120kB
Node 0 
[  624.646810] DMA32: 2*4kB (UH) 248*8kB (MEH) 59*16kB (UMEH) 135*32kB (UMEH) 47*64kB (UMEH) 8*128kB (UEH) 4*256kB (UEH) 31*512kB (M) 16*1024kB (UM) 0*2048kB 0*4096kB = 44568kB
[  624.646819] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  624.646820] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  624.646821] 4441 total pagecache pages
[  624.646822] 0 pages in swap cache
[  624.646822] Swap cache stats: add 0, delete 0, find 0/0
[  624.646823] Free swap  = 0kB
[  624.646823] Total swap = 0kB
[  624.646824] 524157 pages RAM
[  624.646825] 0 pages HighMem/MovableOnly
[  624.646825] 90746 pages reserved
[  624.646825] 0 pages hwpoisoned

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 16:36                             ` Tetsuo Handa
@ 2016-12-14 18:18                               ` Michal Hocko
  2016-12-15 10:21                                 ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-14 18:18 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, pmladek, sergey.senozhatsky

On Thu 15-12-16 01:36:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 14-12-16 20:37:07, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
[...]
> > > > So it would be really great if you could
> > > > 	1) test with the fixed throttling
> > > > 	2) loglevel=4 on the kernel command line
> > > > 	3) try the above with the same loglevel
> > > > 
> > > > ideally 1) would be sufficient and that would make the most sense from
> > > > the warn_alloc point of view. If this is 2 or 3 then we are hitting a
> > > > more generic problem and I would be quite careful to hack it around.
> > > 
> > > Thus, I don't think I can do these.
> > 
> > I think this would be really valuable.
> 
> OK. I tried 1) and 2). I didn't try 3) because printk() did not work as expected.
> 
> Regarding 1), it did not help. I can still see "** XXX printk messages dropped **"
> ( http://I-love.SAKURA.ne.jp/tmp/serial-20161215-1.txt.xz ).

So we still manage to swamp the logbuffer. The question is whether you
can still see the lockup. This is not obvious from the output to me.

> Regarding 2), I can't tell whether it helped
> ( http://I-love.SAKURA.ne.jp/tmp/serial-20161215-2.txt.xz ).
> I can no longer see "** XXX printk messages dropped **", but allocations sometimes
> stalled. In most cases, the "Out of memory: " and "Killed process" lines are printed
> within 0.1 second, but sometimes it took a few seconds, and less often longer than a
> minute. There was one big stall which lasted for minutes. I changed the loglevel to 7
> and checked memory information. It seems the watermark was low enough to call
> out_of_memory().

Isn't that what your test case essentially does though? Keep the system
in OOM continually? Some stalls are to be expected I guess, the main
question is whether there is a point with no progress at all.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-13 17:06                       ` Michal Hocko
  2016-12-14 11:37                         ` Tetsuo Handa
@ 2016-12-15  1:11                         ` Sergey Senozhatsky
  2016-12-15  6:35                           ` Michal Hocko
  1 sibling, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-15  1:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, pmladek, sergey.senozhatsky,
	sergey.senozhatsky.work

On (12/13/16 18:06), Michal Hocko wrote:
[..]
> What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> should be sufficient to show the few oom killer messages while suppressing
> most of the other noise. Unfortunately, even messages with level >
> loglevel get stored into the ringbuffer (as I've just learned), so
> console_unlock() has to crawl through them just to drop them (Meh), but
> at least it doesn't have to go to the serial console drivers and spend
> even more time there. An alternative would be to tweak printk to not
> even store those messages. Something like the below
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index f7a55e9ff2f7..197f2b9fb703 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1865,6 +1865,15 @@ asmlinkage int vprintk_emit(int facility, int level,
>  				lflags |= LOG_CONT;
>  			}
>  
> +			if (suppress_message_printing(kern_level)) {

aren't we supposed to check level here:
				suppress_message_printing(level)?

kern_level is '0' away from actual level:

	kern_level = printk_get_level(text)
	switch (kern_level)
	case '0' ... '7':
		level = kern_level - '0';

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-15  1:11                         ` Sergey Senozhatsky
@ 2016-12-15  6:35                           ` Michal Hocko
  2016-12-15 10:16                             ` Petr Mladek
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-15  6:35 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Tetsuo Handa, linux-mm, pmladek, sergey.senozhatsky

On Thu 15-12-16 10:11:42, Sergey Senozhatsky wrote:
> On (12/13/16 18:06), Michal Hocko wrote:
> [..]
> > What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> > should be sufficient to show the few oom killer messages while suppressing
> > most of the other noise. Unfortunately, even messages with level >
> > loglevel get stored into the ringbuffer (as I've just learned), so
> > console_unlock() has to crawl through them just to drop them (Meh), but
> > at least it doesn't have to go to the serial console drivers and spend
> > even more time there. An alternative would be to tweak printk to not
> > even store those messages. Something like the below
> > 
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index f7a55e9ff2f7..197f2b9fb703 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -1865,6 +1865,15 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  				lflags |= LOG_CONT;
> >  			}
> >  
> > +			if (suppress_message_printing(kern_level)) {
> 
> aren't we supposed to check level here:
> 				suppress_message_printing(level)?
> 
> kern_level is '0' away from actual level:
> 
> 	kern_level = printk_get_level(text)
> 	switch (kern_level)
> 	case '0' ... '7':
> 		level = kern_level - '0';

Yes, you are right. The patch would also be broken for KERN_CONT, so I think
it doesn't make much sense to pursue it for testing.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 10:26                         ` Michal Hocko
@ 2016-12-15  7:34                           ` Sergey Senozhatsky
  0 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-15  7:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Petr Mladek, Tetsuo Handa, linux-mm, sergey.senozhatsky

On (12/14/16 11:26), Michal Hocko wrote:
> On Wed 14-12-16 10:37:06, Petr Mladek wrote:
> > On Tue 2016-12-13 21:06:57, Tetsuo Handa wrote:
> [...]
> > > Although it is fine to make warn_alloc() less verbose, this is not
> > > a problem which can be avoided by simply reducing printk(). Unless
> > > we give enough CPU time to the OOM killer and OOM victims, it is
> > > trivial to lock up the system.
> > 
> > You could try to use printk_deferred() in warn_alloc(). It will not
> > handle the console, though.
> 
> the problem is, however, _any_ printk under the oom_lock. So all of them
> would have to be converted AFAIU.
> 
> > It will help to be sure that the blocked printk()
> > is the main problem.
> 
> I think we should rather ratelimit those messages than tweak the way how
> the printk is used. The source of the heavy printk might be completely
> different so this has to be addressed at the printk level.

yes, rate limiting seems to be the only right thing to do. if not for
lockup avoidance (async printk can help here), then for logbuf overflow
and lost messages avoidance (async printk can't prevent this from
happening).

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-15  6:35                           ` Michal Hocko
@ 2016-12-15 10:16                             ` Petr Mladek
  0 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2016-12-15 10:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Tetsuo Handa, linux-mm, sergey.senozhatsky

On Thu 2016-12-15 07:35:52, Michal Hocko wrote:
> On Thu 15-12-16 10:11:42, Sergey Senozhatsky wrote:
> > On (12/13/16 18:06), Michal Hocko wrote:
> > [..]
> > > What if we lower the loglevel as much as possible? Only seeing KERN_ERR
> > > should be sufficient to show the few oom killer messages while suppressing
> > > most of the other noise. Unfortunately, even messages with level >
> > > loglevel get stored into the ringbuffer (as I've just learned), so
> > > console_unlock() has to crawl through them just to drop them (Meh), but
> > > at least it doesn't have to go to the serial console drivers and spend
> > > even more time there. An alternative would be to tweak printk to not
> > > even store those messages. Something like the below
> > > 
> > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > index f7a55e9ff2f7..197f2b9fb703 100644
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -1865,6 +1865,15 @@ asmlinkage int vprintk_emit(int facility, int level,
> > >  				lflags |= LOG_CONT;
> > >  			}
> > >  
> > > +			if (suppress_message_printing(kern_level)) {
> > 
> > aren't we supposed to check level here:
> > 				suppress_message_printing(level)?
> > 
> > kern_level is '0' away from actual level:
> > 
> > 	kern_level = printk_get_level(text)
> > 	switch (kern_level)
> > 	case '0' ... '7':
> > 		level = kern_level - '0';
> 
> Yes, you are right. The patch would also be broken for KERN_CONT, so I think
> it doesn't make much sense to pursue it for testing.

It should help to do the check later, when the "level" variable has
its final value:

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b3c454b733da..97f2737c3380 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1774,6 +1774,14 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (level == LOGLEVEL_DEFAULT)
 		level = default_message_loglevel;
 
+	if (suppress_message_printing(level)) {
+		logbuf_cpu = UINT_MAX;
+		raw_spin_unlock(&logbuf_lock);
+		lockdep_on();
+		local_irq_restore(flags);
+		return 0;
+	}
+
 	if (dict)
 		lflags |= LOG_PREFIX|LOG_NEWLINE;
 

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-14 18:18                               ` Michal Hocko
@ 2016-12-15 10:21                                 ` Tetsuo Handa
  2016-12-19 11:25                                   ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-15 10:21 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Michal Hocko wrote:
> > Regarding 1), it did not help. I can still see "** XXX printk messages dropped **"
> > ( http://I-love.SAKURA.ne.jp/tmp/serial-20161215-1.txt.xz ).
> 
> So we still manage to swamp the logbuffer. The question is whether you
> can still see the lockup. This is not obvious from the output to me.

I couldn't check whether oom_lock was released (which would have been reported
in kmallocwd's oom_count= field), but I think I can say the system locked up.
The last "Killed process" line is at uptime = 118, and the stalls that started
around uptime = 112 lasted for 100 seconds. No OOM killer messages were found
until I issued SysRq-b at uptime = 464.

--------------------
[  118.572525] Out of memory: Kill process 9485 (a.out) score 999 or sacrifice child
[  118.574882] Killed process 9485 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  118.584444] oom_reaper: reaped process 9485 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  118.590450] a.out invoked oom-killer: gfp_mask=0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK), nodemask=0, order=0, oom_score_adj=1000
[  118.910441] a.out cpuset=/ mems_allowed=0
(...snipped...)
[  122.418304] a.out: page allocation stalls for 10024ms, order:0, mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
(...snipped...)
[  203.482124] nmbd: page allocation stalls for 90001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  212.797150] systemd-journal: page allocation stalls for 100004ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  214.456261] kworker/1:2: page allocation stalls for 100003ms, order:0, mode:0x2400000(GFP_NOIO)
[  222.794883] vmtoolsd: page allocation stalls for 110001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  222.795740] systemd-journal: page allocation stalls for 110001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  223.485251] nmbd: page allocation stalls for 110001ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
** 88 printk messages dropped ** [  302.797171] vmtoolsd: page allocation stalls for 190003ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  354.317184] a.out: page allocation stalls for 20116ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
** 72 printk messages dropped ** [  394.275022] a.out: page allocation stalls for 60080ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
** 2081 printk messages dropped ** [  424.298603] a.out: page allocation stalls for 90046ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
(...snipped...)
** 119 printk messages dropped ** [  464.324536]  [<ffffffff8115489b>] warn_alloc+0x12b/0x170
** 56 printk messages dropped ** [  464.330865] CPU: 0 PID: 10356 Comm: a.out Tainted: G        W       4.9.0+ #102
--------------------



I think that the oom_lock stall problem is essentially independent of the
printk() calls from warn_alloc(). I can trigger lockups even if I use a
one-liner stall report per second, like the one below.

--------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..dc7f6be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3657,10 +3657,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
+		static DEFINE_RATELIMIT_STATE(stall_rs, HZ, 1);
+
+		if (__ratelimit(&stall_rs)) {
+			pr_warn("%s(%u): page allocation stalls for %ums, order:%u mode:%#x(%pGg)\n",
+				current->comm, current->pid, jiffies_to_msecs(jiffies - alloc_start),
+				order, gfp_mask, &gfp_mask);
+			stall_timeout += 10 * HZ;
+		}
 	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
--------------------

Console log from http://I-love.SAKURA.ne.jp/tmp/serial-20161215-3.txt.xz :

--------------------
[  601.337474] Out of memory: Kill process 15498 (a.out) score 716 or sacrifice child
[  601.342349] Killed process 15498 (a.out) total-vm:2166868kB, anon-rss:1159344kB, file-rss:12kB, shmem-rss:0kB
[  601.575590] oom_reaper: reaped process 15498 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  641.271223] oom_kill_process: 132 callbacks suppressed
[  641.280260] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[  641.300796] a.out cpuset=/ mems_allowed=0
[  641.310305] CPU: 0 PID: 16548 Comm: a.out Not tainted 4.9.0+ #78
[  641.320346] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  641.335472]  ffffc900069a7ac8 ffffffff8138730d ffffc900069a7c90 ffff88006fcc0040
[  641.347456]  ffffc900069a7b68 ffffffff8128125b 0000000000000000 0000000000000000
[  641.358075]  0000000000000206 ffffffffffffff10 ffffffff8175974b 0000000000000010
[  641.368537] Call Trace:
[  641.373574]  [<ffffffff8138730d>] dump_stack+0x85/0xc8
[  641.570618]  [<ffffffff8128125b>] dump_header+0x82/0x275
[  641.581733]  [<ffffffff8175974b>] ? _raw_spin_unlock_irqrestore+0x3b/0x60
[  641.594710]  [<ffffffff811e2e29>] oom_kill_process+0x219/0x400
[  641.606663]  [<ffffffff811e334e>] out_of_memory+0x13e/0x580
[  641.616883]  [<ffffffff811e341e>] ? out_of_memory+0x20e/0x580
[  641.626734]  [<ffffffff8128208b>] __alloc_pages_slowpath+0x93f/0x9db
[  641.636861]  [<ffffffff811e9966>] __alloc_pages_nodemask+0x456/0x4e0
[  641.644836]  [<ffffffff81249a0e>] alloc_pages_vma+0xbe/0x2d0
[  641.648426]  [<ffffffff812214fc>] handle_mm_fault+0xdfc/0x1010
[  641.652060]  [<ffffffff8122075b>] ? handle_mm_fault+0x5b/0x1010
[  641.655846]  [<ffffffff810783a5>] ? __do_page_fault+0x175/0x530
[  641.659547]  [<ffffffff8107847a>] __do_page_fault+0x24a/0x530
[  641.663145]  [<ffffffff81078790>] do_page_fault+0x30/0x80
[  641.666539]  [<ffffffff8175b598>] page_fault+0x28/0x30
[  641.669944] Mem-Info:
[  650.149133] active_anon:304795 inactive_anon:13357 isolated_anon:0
[  650.149133]  active_file:422 inactive_file:668 isolated_file:37
[  650.149133]  unevictable:0 dirty:0 writeback:0 unstable:0
[  650.149133]  slab_reclaimable:9296 slab_unreclaimable:32181
[  650.149133]  mapped:2489 shmem:13874 pagetables:9351 bounce:0
[  650.149133]  free:12820 free_pcp:60 free_cma:0
[  651.713701] a.out(16970): page allocation stalls for 10004ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  652.714719] __alloc_pages_slowpath: 27287 callbacks suppressed
[  652.714722] a.out(17089): page allocation stalls for 10829ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  653.715782] __alloc_pages_slowpath: 59740 callbacks suppressed
[  653.715785] a.out(16619): page allocation stalls for 11930ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  654.716880] __alloc_pages_slowpath: 58342 callbacks suppressed
[  654.716883] qmgr(2570): page allocation stalls for 12036ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  655.717483] __alloc_pages_slowpath: 57454 callbacks suppressed
[  655.717486] a.out(16860): page allocation stalls for 13965ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  656.718605] __alloc_pages_slowpath: 57881 callbacks suppressed
[  656.718608] a.out(16596): page allocation stalls for 14928ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  657.719694] __alloc_pages_slowpath: 52753 callbacks suppressed
[  657.719696] a.out(16960): page allocation stalls for 15266ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  658.720810] __alloc_pages_slowpath: 57183 callbacks suppressed
[  658.720813] a.out(17435): page allocation stalls for 16999ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  659.721904] __alloc_pages_slowpath: 58473 callbacks suppressed
[  659.721907] systemd-journal(375): page allocation stalls for 17036ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  660.722496] __alloc_pages_slowpath: 57207 callbacks suppressed
[  660.722499] a.out(17232): page allocation stalls for 18907ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  661.723612] __alloc_pages_slowpath: 55834 callbacks suppressed
[  661.723615] a.out(16819): page allocation stalls for 19933ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  662.750823] __alloc_pages_slowpath: 40948 callbacks suppressed
[  662.750826] kworker/3:3(11291): page allocation stalls for 20085ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  663.751839] __alloc_pages_slowpath: 59668 callbacks suppressed
[  663.751842] a.out(17055): page allocation stalls for 22030ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  664.753591] __alloc_pages_slowpath: 59260 callbacks suppressed
[  664.753593] master(2528): page allocation stalls for 17258ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  665.754543] __alloc_pages_slowpath: 59829 callbacks suppressed
[  665.754546] a.out(17113): page allocation stalls for 23924ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  666.755642] __alloc_pages_slowpath: 57192 callbacks suppressed
[  666.755645] postgres(2888): page allocation stalls for 11047ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  667.756734] __alloc_pages_slowpath: 61894 callbacks suppressed
[  667.756737] a.out(16608): page allocation stalls for 25858ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  668.757861] __alloc_pages_slowpath: 65951 callbacks suppressed
[  668.757863] a.out(17212): page allocation stalls for 26891ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  669.758474] __alloc_pages_slowpath: 66800 callbacks suppressed
[  669.758477] a.out(16920): page allocation stalls for 27908ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  670.759554] __alloc_pages_slowpath: 69374 callbacks suppressed
[  670.759557] qmgr(2570): page allocation stalls for 28079ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  671.760661] __alloc_pages_slowpath: 64171 callbacks suppressed
[  671.760664] crond(495): page allocation stalls for 29050ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  672.761787] __alloc_pages_slowpath: 55733 callbacks suppressed
[  672.761790] smbd(3561): page allocation stalls for 15833ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  673.762858] __alloc_pages_slowpath: 53271 callbacks suppressed
[  673.762861] mysqld(13418): page allocation stalls for 16925ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  674.763473] __alloc_pages_slowpath: 53489 callbacks suppressed
[  674.763476] systemd(1): page allocation stalls for 32088ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  675.764578] __alloc_pages_slowpath: 52748 callbacks suppressed
[  675.764580] a.out(16854): page allocation stalls for 34003ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  676.765699] __alloc_pages_slowpath: 55054 callbacks suppressed
[  676.765702] a.out(16713): page allocation stalls for 34107ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  677.766782] __alloc_pages_slowpath: 59519 callbacks suppressed
[  677.766785] a.out(17096): page allocation stalls for 35108ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  678.767901] __alloc_pages_slowpath: 59092 callbacks suppressed
[  678.767904] a.out(17223): page allocation stalls for 36968ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  679.768499] __alloc_pages_slowpath: 58356 callbacks suppressed
[  679.768502] a.out(16979): page allocation stalls for 37938ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  680.769611] __alloc_pages_slowpath: 59518 callbacks suppressed
[  680.769614] auditd(420): page allocation stalls for 10422ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  681.770568] __alloc_pages_slowpath: 59785 callbacks suppressed
[  681.770571] a.out(16754): page allocation stalls for 39522ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  682.771814] __alloc_pages_slowpath: 56695 callbacks suppressed
[  682.771817] a.out(17162): page allocation stalls for 40981ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  683.773081] __alloc_pages_slowpath: 59588 callbacks suppressed
[  683.773084] mysqld(13418): page allocation stalls for 26935ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
** 36364 printk messages dropped ** [  904.221613]  [<ffffffffa0297836>] kmem_alloc+0x96/0x120 [xfs]
[  904.221614]  [<ffffffff81254639>] ? kfree+0x1f9/0x330
** 15 printk messages dropped ** [  904.221733] a.out           R  running task    12312 17361  16548 0x00000080
[  904.221734]  0000000000000000 ffff88003ee3a140 ffff88004d808040 ffff88003eed8040
** 18 printk messages dropped ** [  904.221797]  [<ffffffffa0297b86>] ? kmem_zone_alloc+0x96/0x120 [xfs]
** 15 printk messages dropped ** [  904.221977]  [<ffffffffa025cbcb>] xfs_vm_writepages+0x6b/0xb0 [xfs]
[  904.221978]  [<ffffffff811ef141>] do_writepages+0x21/0x40
** 19 printk messages dropped ** [  904.222046]  [<ffffffff81282020>] __alloc_pages_slowpath+0x8d4/0x9db
[  904.222048]  [<ffffffff811e9966>] __alloc_pages_nodemask+0x456/0x4e0
** 14 printk messages dropped ** [  904.222163]  [<ffffffff810f723f>] ? up_read+0x1f/0x40
[  904.222179]  [<ffffffffa025ee09>] xfs_map_blocks+0x309/0x550 [xfs]
** 16 printk messages dropped ** [  904.222265]  ffff8800753da218 ffffc900121474b0 ffffffff817524d8 ffffffff813a55b6
[  904.222266]  0000000000000000 ffff8800753da218 0000000000000292 ffff88003eeda540
[  904.222266] Call Trace:
** 24 printk messages dropped ** [  904.222419]  [<ffffffffa025f293>] xfs_do_writepage+0x243/0x940 [xfs]
[  904.222421]  [<ffffffff811ecf7b>] write_cache_pages+0x2cb/0x6b0
[  904.222435]  [<ffffffffa025f050>] ? xfs_map_blocks+0x550/0x550 [xfs]
** 25 printk messages dropped ** [  904.222503]  [<ffffffff81252e9a>] new_slab+0x4ca/0x6a0
[  904.222504]  [<ffffffff81255091>] ___slab_alloc+0x3a1/0x620
** 16 printk messages dropped ** [  904.222704]  [<ffffffffa025cba8>] ? xfs_vm_writepages+0x48/0xb0 [xfs]
[  904.222717]  [<ffffffffa025cbcb>] xfs_vm_writepages+0x6b/0xb0 [xfs]
** 15 printk messages dropped ** [  904.222753]  [<ffffffff813a55b6>] ? debug_object_activate+0x166/0x210
** 13 printk messages dropped ** [  904.222812]  [<ffffffffa0297b86>] ? kmem_zone_alloc+0x96/0x120 [xfs]
[  904.222813]  [<ffffffff812555f8>] kmem_cache_alloc+0x2e8/0x370
** 15 printk messages dropped ** [  904.222994]  [<ffffffff811ef141>] do_writepages+0x21/0x40
[  904.222994]  [<ffffffff811deff6>] __filemap_fdatawrite_range+0xc6/0x100
** 16 printk messages dropped ** [  904.223062]  [<ffffffff81127910>] ? lock_timer_base+0xa0/0xa0
[  904.223063]  [<ffffffff8175840a>] schedule_timeout_uninterruptible+0x2a/0x30
** 21 printk messages dropped ** [  904.223239]  [<ffffffffa025cba8>] ? xfs_vm_writepages+0x48/0xb0 [xfs]
[  904.223253]  [<ffffffffa025cbcb>] xfs_vm_writepages+0x6b/0xb0 [xfs]
** 15 printk messages dropped ** [  904.223288]  [<ffffffff813a55b6>] ? debug_object_activate+0x166/0x210
** 10 printk messages dropped ** [  904.223299]  [<ffffffff812f9ce0>] iomap_write_begin+0x50/0xd0
** 7 printk messages dropped ** [  904.223334]  [<ffffffffa0271890>] xfs_file_write_iter+0x90/0x130 [xfs]
[  904.223335]  [<ffffffff812870b5>] __vfs_write+0xe5/0x140
** 16 printk messages dropped ** [  904.223354]  [<ffffffff81282020>] __alloc_pages_slowpath+0x8d4/0x9db
[  904.223355]  [<ffffffff811e9966>] __alloc_pages_nodemask+0x456/0x4e0
** 1 printk messages dropped ** [  904.223357]  [<ffffffff81247507>] alloc_pages_current+0x97/0x1b0
[  904.223377]  [<ffffffffa02c5acb>] xfs_buf_allocate_memory+0x160/0x29b [xfs]
[  904.223393]  [<ffffffffa026895e>] xfs_buf_get_map+0x2be/0x480 [xfs]
[  904.223407]  [<ffffffffa026a1fc>] xfs_buf_read_map+0x2c/0x400 [xfs]
[  904.223426]  [<ffffffffa02b2e41>] xfs_trans_read_buf_map+0x201/0x810 [xfs]
[  904.223440]  [<ffffffffa021b4f8>] xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
[  904.223453]  [<ffffffffa021b5c2>] xfs_btree_lookup_get_block+0x82/0xf0 [xfs]
[  904.223467]  [<ffffffffa0221cdb>] xfs_btree_lookup+0xbb/0x700 [xfs]
[  904.223468]  [<ffffffff81255567>] ? kmem_cache_alloc+0x257/0x370
[  904.223479]  [<ffffffffa01f322b>] xfs_alloc_lookup_eq+0x1b/0x20 [xfs]
(...snipped...)
[  904.275264] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 16534 9227 15482 12337 11293
[  904.275266] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 19 15486
[  904.275268] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=6s workers=2 manager: 9235
[  904.275270] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 12339 284
[  904.275272] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 16524 35
[  904.275301] pool 128: cpus=0-63 flags=0x4 nice=0 hung=13s workers=3 idle: 6 57
[  905.018641] __alloc_pages_slowpath: 63439 callbacks suppressed
[  905.018644] a.out(17124): page allocation stalls for 262560ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  906.019716] __alloc_pages_slowpath: 88258 callbacks suppressed
[  906.019719] a.out(17062): page allocation stalls for 263854ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  907.020846] __alloc_pages_slowpath: 85003 callbacks suppressed
[  907.020848] systemd-logind(493): page allocation stalls for 245959ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  908.021420] __alloc_pages_slowpath: 87775 callbacks suppressed
[  908.021423] auditd(420): page allocation stalls for 237674ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  909.022556] __alloc_pages_slowpath: 83301 callbacks suppressed
[  909.022559] dhclient(864): page allocation stalls for 18275ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  910.023673] __alloc_pages_slowpath: 76616 callbacks suppressed
[  910.023676] a.out(17000): page allocation stalls for 268270ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  911.024731] __alloc_pages_slowpath: 82055 callbacks suppressed
[  911.024751] postgres(2883): page allocation stalls for 20263ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  912.025845] __alloc_pages_slowpath: 88979 callbacks suppressed
[  912.025848] a.out(16615): page allocation stalls for 270240ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  913.026457] __alloc_pages_slowpath: 87008 callbacks suppressed
[  913.026460] a.out(17032): page allocation stalls for 271305ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  914.027564] __alloc_pages_slowpath: 86030 callbacks suppressed
[  914.027567] tuned(3478): page allocation stalls for 23289ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  915.028668] __alloc_pages_slowpath: 84050 callbacks suppressed
[  915.028671] crond(495): page allocation stalls for 272318ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  916.029762] __alloc_pages_slowpath: 81937 callbacks suppressed
[  916.029765] a.out(17363): page allocation stalls for 274239ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  917.030882] __alloc_pages_slowpath: 76745 callbacks suppressed
[  917.030885] smbd(3526): page allocation stalls for 240888ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
(...snipped...)
[ 1018.148501] __alloc_pages_slowpath: 76879 callbacks suppressed
[ 1018.148503] postgres(2883): page allocation stalls for 127387ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1019.149587] __alloc_pages_slowpath: 87666 callbacks suppressed
[ 1019.149590] a.out(16891): page allocation stalls for 377428ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 1020.152735] __alloc_pages_slowpath: 86402 callbacks suppressed
[ 1020.152737] kworker/1:7(16532): page allocation stalls for 75028ms, order:0 mode:0x2400000(GFP_NOIO)
[ 1020.872273] sysrq: SysRq : Terminate All Tasks
[ 1022.822657] systemd-journald[375]: /dev/kmsg buffer overrun, some messages lost.
[ 1023.161592] systemd-journald[375]: Received SIGTERM.
[ 1024.438089] audit: type=1305 audit(1481767291.051:348): audit_pid=0 old=420 auid=4294967295 ses=4294967295 res=1
[ 1025.616706] audit: type=2404 audit(1481767292.230:349): pid=1055 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=19:e2:36:ac:65:24:ca:d6:dd:ff:6a:aa:76:25:73:f3 direction=? s
pid=1055 suid=0  exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
[ 1025.616841] audit: type=2404 audit(1481767292.230:350): pid=1055 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=09:0b:6a:93:3e:e3:59:e1:79:8a:6e:2e:a9:05:59:94 direction=? spid=1055 suid=0  exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
[ 1025.616898] audit: type=2404 audit(1481767292.230:351): pid=1055 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=95:0b:7d:ce:9e:bd:01:8c:d9:0e:be:7c:f3:b7:96:0d direction=? spid=1055 suid=0  exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
[ 1025.784671] audit: type=1104 audit(1481767292.398:352): pid=1083 uid=0 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_securetty,pam_unix acct="kumaneko" exe="/usr/bin/login" hostname=? addr=? terminal=tty1 res=success'
[ 1026.242786] audit: type=1325 audit(1481767292.856:353): table=nat family=2 entries=52
[ 1026.244646] audit: type=1300 audit(1481767292.856:353): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=17e0bc0 items=0 ppid=491 pid=17592 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-multi" key=(null)
[ 1026.244670] audit: type=1327 audit(1481767292.856:353): proctitle=2F7362696E2F69707461626C6573002D7732002D74006E6174002D46
[ 1026.292520] audit: type=1325 audit(1481767292.906:354): table=nat family=2 entries=35
[ 1026.293085] audit: type=1300 audit(1481767292.906:354): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=1208a60 items=0 ppid=491 pid=17599 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-multi" key=(null)
[ 1033.283379] Ebtables v2.0 unregistered
[ 1060.441909] Node 0 active_anon:1044024kB inactive_anon:56280kB active_file:13396kB inactive_file:51668kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:10516kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 544768kB anon_thp: 57748kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
[ 1060.441912] Node 0 DMA free:6700kB min:440kB low:548kB high:656kB active_anon:9168kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1060.441914] lowmem_reserve[]: 0 1566 1566 1566
[ 1060.441916] Node 0 DMA32 free:274100kB min:44612kB low:55764kB high:66916kB active_anon:1034856kB inactive_anon:56280kB active_file:13396kB inactive_file:51668kB unevictable:0kB writepending:0kB present:2080640kB managed:1604544kB mlocked:0kB slab_reclaimable:37076kB slab_unreclaimable:87724kB kernel_stack:2944kB pagetables:3528kB bounce:0kB free_pcp:2360kB local_pcp:728kB free_cma:0kB
[ 1060.441917] lowmem_reserve[]: 0 0 0 0
[ 1060.441922] Node 0 DMA: 1*4kB (U) 1*8kB (U) 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 6700kB
[ 1060.441927] Node 0 DMA32: 385*4kB (UME) 486*8kB (UME) 1556*16kB (UMH) 1044*32kB (UMEH) 603*64kB (UME) 324*128kB (UMEH) 59*256kB (UMEH) 11*512kB (UME) 21*1024kB (UME) 31*2048kB (ME) 6*4096kB (M) = 274100kB
[ 1060.441930] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1060.441932] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1060.441932] 30702 total pagecache pages
[ 1060.441933] 0 pages in swap cache
[ 1060.441934] Swap cache stats: add 0, delete 0, find 0/0
[ 1060.441934] Free swap  = 0kB
[ 1060.441934] Total swap = 0kB
[ 1060.441935] 524157 pages RAM
[ 1060.441935] 0 pages HighMem/MovableOnly
[ 1060.441935] 119045 pages reserved
[ 1060.441936] 0 pages cma reserved
[ 1060.441936] 0 pages hwpoisoned
[ 1060.441937] Out of memory: Kill process 16549 (a.out) score 999 or sacrifice child
[ 1060.609985] audit_printk_skb: 381 callbacks suppressed
--------------------

show_mem() from dump_header() started at uptime = 641.
Something preempted show_mem(), and its output was suspended at uptime = 650.
The one-liner stall reports started at uptime = 651.

However, the uptime counter was obviously incrementing more slowly than the
real-time clock. I pressed SysRq-t at uptime = 683, after waiting for a few
minutes from uptime = 651. Then the uptime counter jumped to 904 (probably
catching up to the real-time clock) and output from SysRq-t started
(although hopelessly dropped).

I pressed SysRq-e at uptime = 1020. SysRq-e took about half a minute to
complete. Then the uptime counter jumped to 1060 (probably the real-time
clock) and the output from show_mem() resumed. Finally, the "Out of memory:"
line at uptime = 1060, which was expected to be printed by uptime = 642, was
printed. So oom_lock was held for at least 5 minutes.



I don't know why the uptime counter slowed down.
Since stall_timeout is updated only once per (slowed-down) second due to
the ratelimit, "__alloc_pages_slowpath: XXXXX callbacks suppressed"
represents the total number of attempts that reached that point in each
second. We can see that XXXXX is between 60000 and 80000. With
CONFIG_HZ = 1000 and 4 CPUs, that is 15 to 20 attempts per jiffy on each
CPU. (This stressor generates 1024 processes, but most of them are simply
blocked on fs locks, so I assume this estimate is not bogus.) Thus, I think
15 to 20 threads running on each CPU are eating that CPU's time (although
there might be some overhead from threads other than these 15 to 20).

Thus, my guess is that something deferred the OOM killer, and the pointless
direct reclaim loops caused by "!mutex_trylock(&oom_lock)" failures (or some
overhead outside the direct-reclaim loops themselves) worsened that deferral
by consuming almost all CPU time.

This stall persisted with only two kernel messages per second. I wonder
whether we have room for tuning warn_alloc(), unless the trigger is
identified and fixed. Maybe this is because I'm using VMware Player, but I
don't have a native machine to test on. I would appreciate it if someone
could test on a native machine or under KVM. My environment is "4 CPUs, 2GB
RAM, /dev/sda1 for / partition formatted as XFS, no swap partition or file"
on VMware Player on Windows using a SATA disk, and the stressor is
http://lkml.kernel.org/r/201612080029.IBD55588.OSOFOtHVMLQFFJ@I-love.SAKURA.ne.jp .



> Isn't that what your test case essentially does though? Keep the system
> in OOM continually? Some stalls are to be expected I guess, the main
> question is whether there is a point with no progress at all.

No. The purpose of running this test case, which keeps the system in a
near-OOM situation, is to find and report problems which occur when the
system is almost OOM (but that discussion should go to the kmallocwd
thread). Lockups with oom_lock held (the subject of this thread) are an
obstacle for me when testing the near-OOM situation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-15 10:21                                 ` Tetsuo Handa
@ 2016-12-19 11:25                                   ` Tetsuo Handa
  2016-12-19 12:27                                     ` Sergey Senozhatsky
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-19 11:25 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, pmladek, sergey.senozhatsky

Tetsuo Handa wrote:
> I think that the oom_lock stall problem is essentially independent of
> printk() from warn_alloc(). I can trigger lockups even if I use a
> one-liner stall report once per second like below.
> 
> --------------------
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440..dc7f6be 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3657,10 +3657,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  
>  	/* Make sure we know about allocations which stall for too long */
>  	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask,
> -			"page allocation stalls for %ums, order:%u",
> -			jiffies_to_msecs(jiffies-alloc_start), order);
> -		stall_timeout += 10 * HZ;
> +		static DEFINE_RATELIMIT_STATE(stall_rs, HZ, 1);
> +
> +		if (__ratelimit(&stall_rs)) {
> +			pr_warn("%s(%u): page allocation stalls for %ums, order:%u mode:%#x(%pGg)\n",
> +				current->comm, current->pid, jiffies_to_msecs(jiffies - alloc_start),
> +				order, gfp_mask, &gfp_mask);
> +			stall_timeout += 10 * HZ;
> +		}
>  	}
>  
>  	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> --------------------
>
> This stall persisted with only two kernel messages per second. I wonder
> whether we have room for tuning warn_alloc(), unless the trigger is
> identified and fixed.

I retested using netconsole in order to record wall-clock time without
delays. It seems to me that once the system crosses some threshold, even
ratelimiting to two kernel messages per second does not help flush the
printk buffer. CPU time for flushing the printk buffer is clearly
insufficient, because direct reclaimers waiting for oom_lock continued
almost-busy looping.

The first OOM killer invocation was at 16:14:24, and a lot of OOM messages
were queued for printk(). As of 16:17:00, flushing was lagging by 40
seconds (156 seconds of wall-clock time had elapsed, but only 116 seconds
of printk time).

I pressed SysRq-H at 16:17:05, and its message was printed at 16:20:03.
The delay was growing (219 seconds of wall-clock time since the first OOM
kill, but only 161 seconds of printk time).

Then I waited for a while to see whether ratelimiting to two kernel
messages per second would help flush the printk buffer. It did not,
because only two kernel messages were printed every second or two.

I pressed SysRq-E at 16:23:15, and flushing became as fast as possible
because some threads that terminated immediately helped resolve the OOM
situation and spared the direct reclaimers from burning CPU time in
pointless direct reclaim loops. As of 16:23:18, the whole printk buffer
had been flushed and the delay was completely gone (534 seconds of
wall-clock time since the first OOM kill, and 531 seconds of printk time).

Complete log is http://I-love.SAKURA.ne.jp/tmp/netconsole-20161219.txt.xz .
----------------------------------------
2016-12-19 16:14:24 192.168.186.128:6666 [   61.383922] kworker/0:0 invoked oom-killer: gfp_mask=0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=0, order=0, oom_score_adj=0
(...snipped...)
2016-12-19 16:17:00 192.168.186.128:6666 [  177.731850] __alloc_pages_slowpath: 86652 callbacks suppressed
2016-12-19 16:17:08 192.168.186.128:6666 [  177.731852] a.out(4225): page allocation stalls for 47748ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
(...snipped...)
2016-12-19 16:20:03 192.168.186.128:6666 [  222.241385] sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) show-all-locks(d) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) show-blocked-tasks(w) dump-ftrace-buffer(z)
(...snipped...)
2016-12-19 16:23:01 192.168.186.128:6666 [  279.848855] __alloc_pages_slowpath: 90002 callbacks suppressed
2016-12-19 16:23:02 192.168.186.128:6666 [  279.848857] a.out(4195): page allocation stalls for 150144ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:04 192.168.186.128:6666 [  280.849909] __alloc_pages_slowpath: 90381 callbacks suppressed
2016-12-19 16:23:05 192.168.186.128:6666 [  280.849913] a.out(4242): page allocation stalls for 151122ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:05 192.168.186.128:6666 [  281.850976] __alloc_pages_slowpath: 90292 callbacks suppressed
2016-12-19 16:23:08 192.168.186.128:6666 [  281.850979] a.out(4329): page allocation stalls for 151818ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:10 192.168.186.128:6666 [  282.852029] __alloc_pages_slowpath: 89988 callbacks suppressed
2016-12-19 16:23:10 192.168.186.128:6666 [  282.852032] a.out(3981): page allocation stalls for 152873ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:12 192.168.186.128:6666 [  283.852585] __alloc_pages_slowpath: 90468 callbacks suppressed
2016-12-19 16:23:13 192.168.186.128:6666 [  283.852589] a.out(3854): page allocation stalls for 154167ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
2016-12-19 16:23:15 192.168.186.128:6666 [  284.853651] __alloc_pages_slowpath: 90091 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  284.853654] a.out(4011): page allocation stalls for 154625ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:15 192.168.186.128:6666 [  285.854709] __alloc_pages_slowpath: 90481 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  285.854712] mysqld(2467): page allocation stalls for 155635ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:15 192.168.186.128:6666 [  286.855765] __alloc_pages_slowpath: 90363 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  286.855768] mysqld(2467): page allocation stalls for 156636ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:15 192.168.186.128:6666 [  287.856871] __alloc_pages_slowpath: 90794 callbacks suppressed
(...snipped...)
2016-12-19 16:23:15 192.168.186.128:6666 [  330.899814] __alloc_pages_slowpath: 85298 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  330.899817] a.out(3937): page allocation stalls for 201204ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
2016-12-19 16:23:15 192.168.186.128:6666 [  331.900876] __alloc_pages_slowpath: 88338 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  331.900879] tuned(2414): page allocation stalls for 201869ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:15 192.168.186.128:6666 [  332.901929] __alloc_pages_slowpath: 86052 callbacks suppressed
2016-12-19 16:23:15 192.168.186.128:6666 [  332.901932] a.out(4252): page allocation stalls for 203197ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:16 192.168.186.128:6666 [  333.902994] __alloc_pages_slowpath: 88699 callbacks suppressed
2016-12-19 16:23:16 192.168.186.128:6666 [  333.902997] mysqld(2464): page allocation stalls for 203684ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:16 192.168.186.128:6666 [  334.904057] __alloc_pages_slowpath: 88382 callbacks suppressed
2016-12-19 16:23:16 192.168.186.128:6666 [  334.904059] systemd(1): page allocation stalls for 165774ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
(...snipped...)
2016-12-19 16:23:16 192.168.186.128:6666 [  425.006012] __alloc_pages_slowpath: 88655 callbacks suppressed
2016-12-19 16:23:16 192.168.186.128:6666 [  425.006014] systemd(1): page allocation stalls for 255876ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:16 192.168.186.128:6666 [  426.007042] __alloc_pages_slowpath: 86528 callbacks suppressed
2016-12-19 16:23:16 192.168.186.128:6666 [  426.007046] a.out(4062): page allocation stalls for 296298ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:16 192.168.186.128:6666 [  427.007621] __alloc_pages_slowpath: 82527 callbacks suppressed
2016-12-19 16:23:17 192.168.186.128:6666 [  427.007624] postgres(2416): page allocation stalls for 296946ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:17 192.168.186.128:6666 [  428.008697] __alloc_pages_slowpath: 86985 callbacks suppressed
2016-12-19 16:23:17 192.168.186.128:6666 [  428.008700] a.out(3841): page allocation stalls for 298293ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
(...snipped...)
2016-12-19 16:23:17 192.168.186.128:6666 [  522.135752] __alloc_pages_slowpath: 90737 callbacks suppressed
2016-12-19 16:23:17 192.168.186.128:6666 [  522.135755] a.out(4563): page allocation stalls for 392427ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:17 192.168.186.128:6666 [  523.136816] __alloc_pages_slowpath: 90260 callbacks suppressed
2016-12-19 16:23:17 192.168.186.128:6666 [  523.136819] a.out(4443): page allocation stalls for 393420ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:17 192.168.186.128:6666 [  524.137890] __alloc_pages_slowpath: 86161 callbacks suppressed
2016-12-19 16:23:18 192.168.186.128:6666 [  524.137892] a.out(4048): page allocation stalls for 394432ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:18 192.168.186.128:6666 [  525.138932] __alloc_pages_slowpath: 90421 callbacks suppressed
2016-12-19 16:23:18 192.168.186.128:6666 [  525.138934] a.out(4051): page allocation stalls for 395440ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
(...snipped...)
2016-12-19 16:23:18 192.168.186.128:6666 [  589.260793] __alloc_pages_slowpath: 90260 callbacks suppressed
2016-12-19 16:23:18 192.168.186.128:6666 [  589.260797] a.out(3772): page allocation stalls for 459277ms, order:0 mode:0x26042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:18 192.168.186.128:6666 [  590.261875] __alloc_pages_slowpath: 89716 callbacks suppressed
2016-12-19 16:23:18 192.168.186.128:6666 [  590.261878] kworker/0:5(4629): page allocation stalls for 459918ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
2016-12-19 16:23:18 192.168.186.128:6666 [  591.262660] __alloc_pages_slowpath: 89555 callbacks suppressed
2016-12-19 16:23:18 192.168.186.128:6666 [  591.262663] postgres(2415): page allocation stalls for 255434ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
2016-12-19 16:23:18 192.168.186.128:6666 [  592.094337] sysrq: SysRq : Terminate All Tasks
2016-12-19 16:23:18 192.168.186.128:6666 [  592.900136] systemd-journald[377]: Received SIGTERM.
----------------------------------------

So, I'd like to check whether async printk() can prevent the system from reaching
the threshold. Though, I guess async printk() won't help for preemption outside
printk() (i.e. CONFIG_PREEMPT=y and/or longer sleep by schedule_timeout_killable(1)
after returning from oom_kill_process()).

Sergey, will you share your async printk() patches?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
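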

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-19 11:25                                   ` Tetsuo Handa
@ 2016-12-19 12:27                                     ` Sergey Senozhatsky
  2016-12-20 15:39                                       ` Sergey Senozhatsky
  0 siblings, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-19 12:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, pmladek, sergey.senozhatsky

On (12/19/16 20:25), Tetsuo Handa wrote:
[..]
> So, I'd like to check whether async printk() can prevent the system from reaching
> the threshold. Though, I guess async printk() won't help for preemption outside
> printk() (i.e. CONFIG_PREEMPT=y and/or longer sleep by schedule_timeout_killable(1)
> after returning from oom_kill_process()).
> 
> Sergey, will you share your async printk() patches?

Hello,

I don't yet have a rebased version, since printk has changed a lot
again during this merge window; the rebase is in progress now.

The latest publicly available version is against linux-next 20161202:

https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred


I'll finish re-basing the patch set tomorrow.

	-ss



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-19 12:27                                     ` Sergey Senozhatsky
@ 2016-12-20 15:39                                       ` Sergey Senozhatsky
  2016-12-22 10:27                                         ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-20 15:39 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Tetsuo Handa, mhocko, linux-mm, pmladek

On (12/19/16 21:27), Sergey Senozhatsky wrote:
[..]
> 
> I'll finish re-basing the patch set tomorrow.
> 

pushed

https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred

Not tested yet; I will test and send out the patch set tomorrow.

	-ss



* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-20 15:39                                       ` Sergey Senozhatsky
@ 2016-12-22 10:27                                         ` Tetsuo Handa
  2016-12-22 10:53                                           ` Petr Mladek
                                                             ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-22 10:27 UTC (permalink / raw)
  To: sergey.senozhatsky; +Cc: mhocko, linux-mm, pmladek

Sergey Senozhatsky wrote:
> On (12/19/16 21:27), Sergey Senozhatsky wrote:
> [..]
> >
> > I'll finish re-basing the patch set tomorrow.
> >
> 
> pushed
> 
> https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred
> 
> not tested. will test and send out the patch set tomorrow.
> 
>      -ss

Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
it turned out that your patch set does not solve this problem.

I was assuming that writing to consoles from printk() would be offloaded to a
dedicated kernel thread, but your patch set does not do that. As a result,
whoever called out_of_memory() is still preempted by other threads consuming
CPU time, due to the cond_resched() in console_unlock(), as demonstrated by
the patch below.

----------------------------------------
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2099,6 +2099,9 @@ static inline int can_use_console(void)
 	return cpu_online(raw_smp_processor_id()) || have_callable_console();
 }
 
+extern bool oom_lock_resched;
+extern struct mutex oom_lock;
+
 /**
  * console_unlock - unlock the console system
  *
@@ -2211,8 +2214,11 @@ void console_unlock(void)
 		start_critical_timings();
 		printk_safe_exit(flags);
 
-		if (do_cond_resched)
+		if (do_cond_resched) {
+			oom_lock_resched = (__mutex_owner(&oom_lock) == current);
 			cond_resched();
+			oom_lock_resched = false;
+		}
 	}
 	console_locked = 0;
 
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3523,6 +3523,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }
 
+bool oom_lock_resched;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3694,10 +3696,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
+		static DEFINE_RATELIMIT_STATE(stall_rs, HZ, 1);
+
+		if (__ratelimit(&stall_rs)) {
+			pr_warn("%s(%u): page allocation stalls for %ums, order:%u mode:%#x(%pGg) cond_resched_with_oom_lock=%u\n",
+				current->comm, current->pid, jiffies_to_msecs(jiffies - alloc_start),
+				order, gfp_mask, &gfp_mask, oom_lock_resched);
+			stall_timeout += 10 * HZ;
+		}
 	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
----------------------------------------

----------------------------------------
[  103.425129] mysqld invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
[  103.508812] mysqld cpuset=/ mems_allowed=0
[  103.514111] CPU: 2 PID: 2300 Comm: mysqld Not tainted 4.9.0+ #100
[  103.517436] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  103.522379] Call Trace:
[  103.527731]  dump_stack+0x85/0xc9
[  103.532552]  dump_header+0x82/0x275
[  103.534901]  ? trace_hardirqs_on_caller+0xf9/0x1c0
[  103.537726]  ? trace_hardirqs_on+0xd/0x10
[  103.540217]  oom_kill_process+0x219/0x400
[  103.542729]  out_of_memory+0x13e/0x580
[  103.545162]  ? out_of_memory+0x20e/0x580
[  103.547603]  __alloc_pages_slowpath+0x7d4/0x8e6
[  103.550178]  ? get_page_from_freelist+0x15a/0xdc0
[  103.552808]  __alloc_pages_nodemask+0x456/0x4e0
[  103.555355]  alloc_pages_current+0x97/0x1b0
[  103.557750]  ? find_get_entry+0x5/0x300
[  103.559996]  __page_cache_alloc+0x15d/0x1a0
[  103.564859]  ? pagecache_get_page+0x2c/0x2b0
[  103.567258]  filemap_fault+0x48e/0x6d0
[  103.569444]  ? filemap_fault+0x339/0x6d0
[  103.571698]  xfs_filemap_fault+0x71/0x1e0 [xfs]
[  103.574125]  __do_fault+0x21/0xa0
[  103.576075]  ? _raw_spin_unlock+0x27/0x40
[  103.578273]  handle_mm_fault+0xee9/0x1180
[  103.580437]  ? handle_mm_fault+0x5e/0x1180
[  103.582634]  __do_page_fault+0x24a/0x530
[  103.584710]  do_page_fault+0x30/0x80
[  103.586682]  page_fault+0x28/0x30
[  103.588497] RIP: 0033:0x7f3b9d66d5d0
[  103.590410] RSP: 002b:00007f3b7ffc9c88 EFLAGS: 00010246
[  103.592857] RAX: 0000000000000001 RBX: 0000560ff6f1afa0 RCX: 00007f3b9d668a82
[  103.596038] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000560ff6f162e0
[  103.599490] RBP: 00007f3b7ffc9d80 R08: 0000560ff6f162e0 R09: 0000000000000001
[  103.602675] R10: 00007f3b7ffc9ce0 R11: 0000000000000000 R12: 0000000000000003
[  103.605821] R13: 00007f3b7ffc9ce0 R14: 000000000000006e R15: 0000560ff6f16200
[  103.609007] Mem-Info:
[  103.621042] __alloc_pages_slowpath: 39544 callbacks suppressed
[  103.621046] irqbalance(483): page allocation stalls for 14949ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  104.621663] __alloc_pages_slowpath: 41790 callbacks suppressed
[  104.621679] systemd(1): page allocation stalls for 15951ms, order:0 mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) cond_resched_with_oom_lock=1
[  105.622679] __alloc_pages_slowpath: 42194 callbacks suppressed
[  105.622683] postgres(2210): page allocation stalls for 16864ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  106.623754] __alloc_pages_slowpath: 39712 callbacks suppressed
[  106.623758] postgres(1190): page allocation stalls for 13337ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  107.624849] __alloc_pages_slowpath: 35250 callbacks suppressed
[  107.624853] crond(503): page allocation stalls for 18950ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  108.625924] __alloc_pages_slowpath: 29191 callbacks suppressed
[  108.625928] master(2162): page allocation stalls for 16786ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  109.626959] __alloc_pages_slowpath: 46005 callbacks suppressed
[  109.626963] postgres(2212): page allocation stalls for 16623ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  110.628011] __alloc_pages_slowpath: 53990 callbacks suppressed
[  110.628015] ksmtuned(507): page allocation stalls for 21856ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  111.628588] __alloc_pages_slowpath: 49833 callbacks suppressed
[  111.628592] in:imjournal(885): page allocation stalls for 22956ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  112.629667] __alloc_pages_slowpath: 48069 callbacks suppressed
[  112.629671] crond(503): page allocation stalls for 23955ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  113.630720] __alloc_pages_slowpath: 50438 callbacks suppressed
[  113.630724] master(2162): page allocation stalls for 21791ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  114.631785] __alloc_pages_slowpath: 50191 callbacks suppressed
[  114.631789] systemd(1): page allocation stalls for 25961ms, order:0 mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) cond_resched_with_oom_lock=1
[  115.632884] __alloc_pages_slowpath: 47058 callbacks suppressed
[  115.632888] postgres(2211): page allocation stalls for 26874ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  116.633937] __alloc_pages_slowpath: 43322 callbacks suppressed
[  116.633940] postgres(1190): page allocation stalls for 23347ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  117.634987] __alloc_pages_slowpath: 41755 callbacks suppressed
[  117.634991] irqbalance(483): page allocation stalls for 28963ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  118.634021] active_anon:313193 inactive_anon:3457 isolated_anon:0
[  118.634021]  active_file:256 inactive_file:337 isolated_file:32
[  118.634021]  unevictable:0 dirty:9 writeback:0 unstable:0
[  118.634021]  slab_reclaimable:8896 slab_unreclaimable:31623
[  118.634021]  mapped:1988 shmem:3527 pagetables:8669 bounce:0
[  118.634021]  free:12797 free_pcp:214 free_cma:0
[  118.636046] __alloc_pages_slowpath: 40420 callbacks suppressed
[  118.636050] gmain(615): page allocation stalls for 29979ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=0
[  119.636632] __alloc_pages_slowpath: 59369 callbacks suppressed
[  119.636636] in:imjournal(885): page allocation stalls for 30964ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  120.637687] __alloc_pages_slowpath: 55694 callbacks suppressed
[  120.637690] ksmtuned(507): page allocation stalls for 31866ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  121.638769] __alloc_pages_slowpath: 51519 callbacks suppressed
[  121.638772] postgres(2212): page allocation stalls for 28635ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  122.639817] __alloc_pages_slowpath: 50601 callbacks suppressed
[  122.639820] crond(503): page allocation stalls for 33965ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  123.640913] __alloc_pages_slowpath: 51169 callbacks suppressed
[  123.640916] lpqd(2332): page allocation stalls for 28749ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  124.641949] __alloc_pages_slowpath: 54501 callbacks suppressed
[  124.641953] gmain(615): page allocation stalls for 35985ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  125.643026] __alloc_pages_slowpath: 54196 callbacks suppressed
[  125.643030] auditd(435): page allocation stalls for 26271ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  126.643593] __alloc_pages_slowpath: 54450 callbacks suppressed
[  126.643597] systemd-hostnam(571): page allocation stalls for 19483ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  127.644663] __alloc_pages_slowpath: 53238 callbacks suppressed
[  127.644666] tuned(2193): page allocation stalls for 39011ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  128.645733] __alloc_pages_slowpath: 53215 callbacks suppressed
[  128.645737] systemd-hostnam(571): page allocation stalls for 21485ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  129.646815] __alloc_pages_slowpath: 60025 callbacks suppressed
[  129.646818] in:imjournal(885): page allocation stalls for 40974ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  130.647865] __alloc_pages_slowpath: 58585 callbacks suppressed
[  130.647869] postgres(2212): page allocation stalls for 37644ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  131.648927] __alloc_pages_slowpath: 56303 callbacks suppressed
[  131.648930] systemd-logind(501): page allocation stalls for 11659ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  132.649994] __alloc_pages_slowpath: 52335 callbacks suppressed
[  132.649998] ksmtuned(507): page allocation stalls for 43878ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  133.651052] __alloc_pages_slowpath: 51745 callbacks suppressed
[  133.651056] lpqd(2332): page allocation stalls for 38759ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  134.651623] __alloc_pages_slowpath: 52909 callbacks suppressed
[  134.651626] tuned(2193): page allocation stalls for 46018ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  135.652717] __alloc_pages_slowpath: 52653 callbacks suppressed
[  135.652721] systemd-journal(383): page allocation stalls for 47032ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  136.653780] __alloc_pages_slowpath: 52888 callbacks suppressed
[  136.653784] systemd-journal(383): page allocation stalls for 48033ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  137.654858] __alloc_pages_slowpath: 54692 callbacks suppressed
[  137.654862] lpqd(2332): page allocation stalls for 42763ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  138.655883] __alloc_pages_slowpath: 55939 callbacks suppressed
[  138.655886] postgres(1190): page allocation stalls for 45369ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  139.656972] __alloc_pages_slowpath: 62010 callbacks suppressed
[  139.656975] auditd(435): page allocation stalls for 40285ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  140.658038] __alloc_pages_slowpath: 64428 callbacks suppressed
[  140.658042] tuned(2193): page allocation stalls for 52024ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  141.658599] __alloc_pages_slowpath: 65734 callbacks suppressed
[  141.658603] NetworkManager(537): page allocation stalls for 52973ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  142.659654] __alloc_pages_slowpath: 67122 callbacks suppressed
[  142.659657] systemd-hostnam(571): page allocation stalls for 35499ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  143.660751] __alloc_pages_slowpath: 64515 callbacks suppressed
[  143.660755] pickup(2169): page allocation stalls for 11784ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  144.661796] __alloc_pages_slowpath: 62998 callbacks suppressed
[  144.661799] in:imjournal(885): page allocation stalls for 55989ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  145.662860] __alloc_pages_slowpath: 58703 callbacks suppressed
[  145.662863] NetworkManager(537): page allocation stalls for 56977ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  146.663959] __alloc_pages_slowpath: 60013 callbacks suppressed
[  146.663962] systemd-logind(501): page allocation stalls for 26674ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  147.664996] __alloc_pages_slowpath: 57511 callbacks suppressed
[  147.664999] ksmtuned(507): page allocation stalls for 58893ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  148.666057] __alloc_pages_slowpath: 55971 callbacks suppressed
[  148.666061] master(2162): page allocation stalls for 56826ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  149.666631] __alloc_pages_slowpath: 62040 callbacks suppressed
[  149.666634] postgres(1190): page allocation stalls for 56380ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  150.667694] __alloc_pages_slowpath: 64495 callbacks suppressed
[  150.667698] systemd-journal(383): page allocation stalls for 62047ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  151.668761] __alloc_pages_slowpath: 65855 callbacks suppressed
[  151.668765] tuned(2193): page allocation stalls for 63035ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  152.669854] __alloc_pages_slowpath: 67093 callbacks suppressed
[  152.669857] dbus-daemon(492): page allocation stalls for 63895ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  153.670902] __alloc_pages_slowpath: 66886 callbacks suppressed
[  153.670906] mysqld(2238): page allocation stalls for 64927ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  154.671963] __alloc_pages_slowpath: 67141 callbacks suppressed
[  154.671966] gmain(615): page allocation stalls for 66015ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  155.673055] __alloc_pages_slowpath: 67300 callbacks suppressed
[  155.673058] NetworkManager(537): page allocation stalls for 66987ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  156.673615] __alloc_pages_slowpath: 67163 callbacks suppressed
[  156.673619] postgres(1190): page allocation stalls for 63387ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  157.674661] __alloc_pages_slowpath: 67216 callbacks suppressed
[  157.674665] postgres(1190): page allocation stalls for 64388ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  158.675760] __alloc_pages_slowpath: 64137 callbacks suppressed
[  158.675763] auditd(435): page allocation stalls for 59304ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  159.676802] __alloc_pages_slowpath: 63812 callbacks suppressed
[  159.676805] mysqld(2240): page allocation stalls for 70645ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  160.678024] __alloc_pages_slowpath: 63074 callbacks suppressed
[  160.678028] systemd-logind(501): page allocation stalls for 40688ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  161.678962] __alloc_pages_slowpath: 64160 callbacks suppressed
[  161.678966] mysqld(2238): page allocation stalls for 72935ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  162.680008] __alloc_pages_slowpath: 63952 callbacks suppressed
[  162.680012] gmain(615): page allocation stalls for 74023ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  163.680565] __alloc_pages_slowpath: 65331 callbacks suppressed
[  163.680568] systemd-hostnam(571): page allocation stalls for 56520ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  164.681997] __alloc_pages_slowpath: 67274 callbacks suppressed
[  164.682005] dbus-daemon(492): page allocation stalls for 75907ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  165.682703] __alloc_pages_slowpath: 67081 callbacks suppressed
[  165.682707] irqbalance(483): page allocation stalls for 77011ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  166.683767] __alloc_pages_slowpath: 67006 callbacks suppressed
[  166.683770] postgres(1190): page allocation stalls for 73397ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  167.684838] __alloc_pages_slowpath: 63439 callbacks suppressed
[  167.684842] postgres(2210): page allocation stalls for 78926ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  168.685904] __alloc_pages_slowpath: 63998 callbacks suppressed
[  168.685907] lpqd(2332): page allocation stalls for 73794ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  169.686974] __alloc_pages_slowpath: 64142 callbacks suppressed
[  169.686978] ksmtuned(507): page allocation stalls for 80915ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  170.687783] __alloc_pages_slowpath: 63814 callbacks suppressed
[  170.687787] irqbalance(483): page allocation stalls for 82016ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  171.688605] __alloc_pages_slowpath: 64212 callbacks suppressed
[  171.688608] postgres(2210): page allocation stalls for 82930ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  172.689672] __alloc_pages_slowpath: 61346 callbacks suppressed
[  172.689676] systemd(1): page allocation stalls for 84019ms, order:0 mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) cond_resched_with_oom_lock=1
[  173.690749] __alloc_pages_slowpath: 65162 callbacks suppressed
[  173.690753] NetworkManager(537): page allocation stalls for 85005ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  174.691815] __alloc_pages_slowpath: 67004 callbacks suppressed
[  174.691818] irqbalance(483): page allocation stalls for 86020ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  175.692873] __alloc_pages_slowpath: 67376 callbacks suppressed
[  175.692877] gmain(615): page allocation stalls for 87036ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  176.693970] __alloc_pages_slowpath: 67345 callbacks suppressed
[  176.693973] mysqld(2238): page allocation stalls for 87950ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  177.695007] __alloc_pages_slowpath: 67219 callbacks suppressed
[  177.695010] systemd-logind(501): page allocation stalls for 57705ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  178.635236] Node 0 active_anon:1252772kB inactive_anon:13828kB active_file:1024kB inactive_file:1348kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:7952kB dirty:36kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 892928kB anon_thp: 14108kB writeback_tmp:0kB unstable:0kB pages_scanned:3921 all_unreclaimable? yes
[  178.695575] __alloc_pages_slowpath: 66566 callbacks suppressed
[  178.695579] ksmtuned(507): page allocation stalls for 89924ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  179.283906] Node 0 DMA free:6700kB min:440kB low:548kB high:656kB active_anon:9144kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  179.299970] lowmem_reserve[]: 0 1565 1565 1565
[  179.482583] Node 0 DMA32 free:44488kB min:44612kB low:55764kB high:66916kB active_anon:1243628kB inactive_anon:13828kB active_file:1024kB inactive_file:1348kB unevictable:0kB writepending:36kB present:2080640kB managed:1603468kB mlocked:0kB slab_reclaimable:35584kB slab_unreclaimable:126460kB kernel_stack:17936kB pagetables:34648kB bounce:0kB free_pcp:856kB local_pcp:116kB free_cma:0kB
[  179.696680] __alloc_pages_slowpath: 58011 callbacks suppressed
[  179.696688] a.out(2743): page allocation stalls for 91086ms, order:0 mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) cond_resched_with_oom_lock=1
[  180.548518] lowmem_reserve[]: 0 0 0 0
[  180.697713] __alloc_pages_slowpath: 64542 callbacks suppressed
[  180.697717] gmain(615): page allocation stalls for 92041ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  181.698801] __alloc_pages_slowpath: 66996 callbacks suppressed
[  181.698805] lpqd(2332): page allocation stalls for 86807ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  182.699853] __alloc_pages_slowpath: 67094 callbacks suppressed
[  182.699857] in:imjournal(885): page allocation stalls for 94027ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  183.700923] __alloc_pages_slowpath: 67039 callbacks suppressed
[  183.700926] systemd(1): page allocation stalls for 95030ms, order:0 mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) cond_resched_with_oom_lock=1
[  184.702000] __alloc_pages_slowpath: 67067 callbacks suppressed
[  184.702004] NetworkManager(537): page allocation stalls for 96016ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  185.703062] __alloc_pages_slowpath: 66792 callbacks suppressed
[  185.703066] systemd-hostnam(571): page allocation stalls for 78542ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  186.689089] Node 0 DMA: 1*4kB (U) 1*8kB (M) 0*16kB 1*32kB (U) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 6700kB
[  186.703605] __alloc_pages_slowpath: 67093 callbacks suppressed
[  186.703608] systemd(1): page allocation stalls for 98033ms, order:0 mode:0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) cond_resched_with_oom_lock=1
[  187.704669] __alloc_pages_slowpath: 66243 callbacks suppressed
[  187.704672] crond(503): page allocation stalls for 99030ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  188.705740] __alloc_pages_slowpath: 67397 callbacks suppressed
[  188.705744] postgres(2211): page allocation stalls for 99947ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  189.706854] __alloc_pages_slowpath: 66839 callbacks suppressed
[  189.706874] a.out(2653): page allocation stalls for 100985ms, order:0 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) cond_resched_with_oom_lock=1
[  190.707883] __alloc_pages_slowpath: 66947 callbacks suppressed
[  190.707886] pickup(2169): page allocation stalls for 58831ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  191.662483] Node 0 DMA32: 420*4kB (U) 315*8kB (UM) 172*16kB (UE) 335*32kB (UE) 99*64kB (UE) 26*128kB (UME) 9*256kB (UME) 11*512kB (UME) 9*1024kB (ME) 0*2048kB 0*4096kB = 44488kB
[  191.708840] __alloc_pages_slowpath: 66618 callbacks suppressed
[  191.708848] crond(503): page allocation stalls for 103034ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  192.710030] __alloc_pages_slowpath: 66453 callbacks suppressed
[  192.710034] NetworkManager(537): page allocation stalls for 104024ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  193.710566] __alloc_pages_slowpath: 66903 callbacks suppressed
[  193.710570] NetworkManager(537): page allocation stalls for 105025ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  194.712985] __alloc_pages_slowpath: 66683 callbacks suppressed
[  194.712988] tuned(2193): page allocation stalls for 106079ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  195.713716] __alloc_pages_slowpath: 67736 callbacks suppressed
[  195.713719] postgres(2210): page allocation stalls for 106955ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  196.714811] __alloc_pages_slowpath: 67054 callbacks suppressed
[  196.714816] systemd-logind(501): page allocation stalls for 76725ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  197.715840] __alloc_pages_slowpath: 67059 callbacks suppressed
[  197.715843] tuned(2193): page allocation stalls for 109082ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  198.716913] __alloc_pages_slowpath: 66687 callbacks suppressed
[  198.716917] auditd(435): page allocation stalls for 99345ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  199.718012] __alloc_pages_slowpath: 67443 callbacks suppressed
[  199.718016] postgres(2211): page allocation stalls for 110959ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  200.719049] __alloc_pages_slowpath: 67103 callbacks suppressed
[  200.719053] postgres(2210): page allocation stalls for 111960ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  201.719617] __alloc_pages_slowpath: 65347 callbacks suppressed
[  201.719620] crond(503): page allocation stalls for 113045ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  202.720681] __alloc_pages_slowpath: 67782 callbacks suppressed
[  202.720685] systemd-logind(501): page allocation stalls for 82731ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  203.721761] __alloc_pages_slowpath: 67320 callbacks suppressed
[  203.721765] crond(503): page allocation stalls for 115047ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  204.722809] __alloc_pages_slowpath: 67440 callbacks suppressed
[  204.722812] lpqd(2332): page allocation stalls for 109831ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  205.723878] __alloc_pages_slowpath: 67259 callbacks suppressed
[  205.723882] dbus-daemon(492): page allocation stalls for 116949ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  206.724974] __alloc_pages_slowpath: 66817 callbacks suppressed
[  206.724978] postgres(2210): page allocation stalls for 117966ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  207.726015] __alloc_pages_slowpath: 66699 callbacks suppressed
[  207.726019] auditd(435): page allocation stalls for 108354ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  208.726611] __alloc_pages_slowpath: 67461 callbacks suppressed
[  208.726614] systemd-logind(501): page allocation stalls for 88737ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  209.727663] __alloc_pages_slowpath: 67552 callbacks suppressed
[  209.727667] NetworkManager(537): page allocation stalls for 121042ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  210.728724] __alloc_pages_slowpath: 67554 callbacks suppressed
[  210.728727] master(2162): page allocation stalls for 118889ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  211.729782] __alloc_pages_slowpath: 66883 callbacks suppressed
[  211.729786] systemd-hostnam(571): page allocation stalls for 104569ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  212.730856] __alloc_pages_slowpath: 67060 callbacks suppressed
[  212.730859] tuned(2193): page allocation stalls for 124097ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  213.731957] __alloc_pages_slowpath: 67212 callbacks suppressed
[  213.731960] postgres(2211): page allocation stalls for 124973ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  214.732979] __alloc_pages_slowpath: 66906 callbacks suppressed
[  214.732983] NetworkManager(537): page allocation stalls for 126047ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  215.733541] __alloc_pages_slowpath: 67186 callbacks suppressed
[  215.733544] postgres(2210): page allocation stalls for 126975ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  216.734628] __alloc_pages_slowpath: 67116 callbacks suppressed
[  216.734631] systemd-logind(501): page allocation stalls for 96745ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  217.735688] __alloc_pages_slowpath: 67045 callbacks suppressed
[  217.735692] systemd-logind(501): page allocation stalls for 97746ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  218.736777] __alloc_pages_slowpath: 64296 callbacks suppressed
[  218.736781] NetworkManager(537): page allocation stalls for 130051ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=1
[  219.704510] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  219.711197] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  219.716162] 4152 total pagecache pages
[  219.724755] 0 pages in swap cache
[  219.727428] Swap cache stats: add 0, delete 0, find 0/0
[  219.730665] Free swap  = 0kB
[  219.733323] Total swap = 0kB
[  219.739283] 524157 pages RAM
[  219.741645] 0 pages HighMem/MovableOnly
[  219.744292] 119314 pages reserved
[  219.745254] __alloc_pages_slowpath: 64179 callbacks suppressed
[  219.745257] mysqld(2234): page allocation stalls for 130829ms, order:0 mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) cond_resched_with_oom_lock=0
[  219.759890] 0 pages cma reserved
[  219.762306] 0 pages hwpoisoned
[  219.773275] Out of memory: Kill process 2706 (a.out) score 997 or sacrifice child
[  219.777233] Killed process 2706 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  219.827687] oom_reaper: reaped process 2706 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------------------------------------

Pretending that out_of_memory() runs in NMI context is ridiculous.
Running out_of_memory() with preemption disabled
( http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp )
was not accepted.
Adding exceptions like

-	do_cond_resched = console_may_schedule;
+	do_cond_resched = console_may_schedule && __mutex_owner(&oom_lock) != current;

would not be smart.

Now, what options are left other than replacing !mutex_trylock(&oom_lock)
with mutex_lock_killable(&oom_lock), which also stops wasting CPU time?
Are we waiting for console output to be offloaded from printk()?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 10:27                                         ` Tetsuo Handa
@ 2016-12-22 10:53                                           ` Petr Mladek
  2016-12-22 13:40                                             ` Sergey Senozhatsky
  2016-12-22 13:33                                           ` Tetsuo Handa
  2016-12-22 13:42                                           ` Sergey Senozhatsky
  2 siblings, 1 reply; 96+ messages in thread
From: Petr Mladek @ 2016-12-22 10:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: sergey.senozhatsky, mhocko, linux-mm

On Thu 2016-12-22 19:27:17, Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > On (12/19/16 21:27), Sergey Senozhatsky wrote:
> > [..]
> > >
> > > I'll finish re-basing the patch set tomorrow.
> > >
> > 
> > pushed
> > 
> > https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred
> > 
> > not tested. will test and send out the patch set tomorrow.
> > 
> >      -ss
> 
> Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> it turned out that your patch set does not solve this problem.
>
> I was assuming that sending to consoles from printk() is offloaded to a kernel
> thread dedicated for that purpose, but your patch set does not do it. As a result,
> somebody who called out_of_memory() is still preempted by other threads consuming
> CPU time due to cond_resched() from console_unlock() as demonstrated by below patch.

Ah, it was a misunderstanding. The "printk_safe" patchset allows
calling printk() from inside some areas guarded by logbuf_lock. In other
words, it allows printing errors from inside the printk() code. It does
not solve the soft-/live-locks.

We need the async printk patchset here. It will allow offloading the
console handling to a kthread. AFAIK, Sergey wanted to rebase it
on top of the printk_safe patchset. I am not sure when he wants or
will have time to do so, though.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 10:27                                         ` Tetsuo Handa
  2016-12-22 10:53                                           ` Petr Mladek
@ 2016-12-22 13:33                                           ` Tetsuo Handa
  2016-12-22 19:24                                             ` Michal Hocko
  2016-12-22 13:42                                           ` Sergey Senozhatsky
  2 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-22 13:33 UTC (permalink / raw)
  To: sergey.senozhatsky; +Cc: mhocko, linux-mm, pmladek

Tetsuo Handa wrote:
> Now, what options are left other than replacing !mutex_trylock(&oom_lock)
> with mutex_lock_killable(&oom_lock) which also stops wasting CPU time?
> Are we waiting for offloading sending to consoles?

 From http://lkml.kernel.org/r/20161222115057.GH6048@dhcp22.suse.cz :
> > Although I don't know whether we agree with mutex_lock_killable(&oom_lock)
> > change, I think this patch alone can go as a cleanup.
> 
> No, we don't agree on that part. As this is a printk issue I do not want
> to workaround it in the oom related code. That is just ridiculous. The
> very same issue would be possible due to other continous source of log
> messages.

I don't think so. A lockup caused by printk() is printk's problem, but printk
is not the only source of lockups. If CONFIG_PREEMPT=y, it is possible that
a thread which holds oom_lock sleeps for an unbounded period depending on
scheduling priority. Do you then call such latency the scheduler's problem?
The mutex_lock_killable(&oom_lock) change helps cope with whatever delays the
OOM killer/reaper might encounter.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 10:53                                           ` Petr Mladek
@ 2016-12-22 13:40                                             ` Sergey Senozhatsky
  0 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-22 13:40 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tetsuo Handa, sergey.senozhatsky, mhocko, linux-mm

On (12/22/16 11:53), Petr Mladek wrote:
> > Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> > recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> > as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> > it turned out that your patch set does not solve this problem.
> >
> > I was assuming that sending to consoles from printk() is offloaded to a kernel
> > thread dedicated for that purpose, but your patch set does not do it. As a result,
> > somebody who called out_of_memory() is still preempted by other threads consuming
> > CPU time due to cond_resched() from console_unlock() as demonstrated by below patch.
> 
> Ah, it was a misunderstanding. The "printk_safe" patchset allows to
> call printk() from inside some areas guarded by logbuf_lock. By other
> words, it allows to print errors from inside printk() code. I does
> not solve the soft-/live-locks.

indeed.

> We need the async printk patchset here. It will allow to offload the
> console handling to the kthread. AFAIK, Sergey wanted to rebase it
> on top of the printk_safe patchset. I am not sure when he want or
> will have time to do so, though.

sure. this is still the case. and in fact my tree here
https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred

contains both patch sets: 9 patches in total (rebased against linux-next
20161221).

first 7 patches are printk-safe, the last 2 -- async printk.

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 10:27                                         ` Tetsuo Handa
  2016-12-22 10:53                                           ` Petr Mladek
  2016-12-22 13:33                                           ` Tetsuo Handa
@ 2016-12-22 13:42                                           ` Sergey Senozhatsky
  2016-12-22 14:01                                             ` Tetsuo Handa
  2 siblings, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-22 13:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: sergey.senozhatsky, mhocko, linux-mm, pmladek

On (12/22/16 19:27), Tetsuo Handa wrote:
> Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> it turned out that your patch set does not solve this problem.
> 
> I was assuming that sending to consoles from printk() is offloaded to a kernel
> thread dedicated for that purpose, but your patch set does not do it.

sorry, seems that I didn't deliver the information properly.

https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred

there are 2 patch sets. the first one is printk-safe. the second one
is async printk.

9 patches in total (as of now).

can you access it?

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 13:42                                           ` Sergey Senozhatsky
@ 2016-12-22 14:01                                             ` Tetsuo Handa
  2016-12-22 14:09                                               ` Sergey Senozhatsky
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-22 14:01 UTC (permalink / raw)
  To: sergey.senozhatsky; +Cc: mhocko, linux-mm, pmladek

Sergey Senozhatsky wrote:
> On (12/22/16 19:27), Tetsuo Handa wrote:
> > Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> > recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> > as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> > it turned out that your patch set does not solve this problem.
> > 
> > I was assuming that sending to consoles from printk() is offloaded to a kernel
> > thread dedicated for that purpose, but your patch set does not do it.
> 
> sorry, seems that I didn't deliver the information properly.
> 
> https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred
> 
> there are 2 patch sets. the first one is printk-safe. the second one
> is async printk.
> 
> 9 patches in total (as of now).
> 
> can you access it?

"404 The page you're looking for could not be found."

Anonymous access not supported?


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 14:01                                             ` Tetsuo Handa
@ 2016-12-22 14:09                                               ` Sergey Senozhatsky
  2016-12-22 14:30                                                 ` Sergey Senozhatsky
  2016-12-26 10:54                                                 ` Tetsuo Handa
  0 siblings, 2 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-22 14:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: sergey.senozhatsky, mhocko, linux-mm, pmladek

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

On (12/22/16 23:01), Tetsuo Handa wrote:
> > On (12/22/16 19:27), Tetsuo Handa wrote:
> > > Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> > > recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> > > as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> > > it turned out that your patch set does not solve this problem.
> > > 
> > > I was assuming that sending to consoles from printk() is offloaded to a kernel
> > > thread dedicated for that purpose, but your patch set does not do it.
> > 
> > sorry, seems that I didn't deliver the information properly.
> > 
> > https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred
> > 
> > there are 2 patch sets. the first one is printk-safe. the second one
> > is async printk.
> > 
> > 9 patches in total (as of now).
> > 
> > can you access it?
> 
> "404 The page you're looking for could not be found."
> 
> Anonymous access not supported?

oops... hm, dunno, it says

: Visibility Level (?)
:
: Public
: The project can be cloned without any authentication.

I'll switch to github then, maybe.

attached 9 patches.

NOTE: not the final version.


	-ss

[-- Attachment #2: 0001-printk-use-vprintk_func-in-vprintk.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 14:09                                               ` Sergey Senozhatsky
@ 2016-12-22 14:30                                                 ` Sergey Senozhatsky
  2016-12-26 10:54                                                 ` Tetsuo Handa
  1 sibling, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-22 14:30 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Tetsuo Handa, mhocko, linux-mm, pmladek

On (12/22/16 23:09), Sergey Senozhatsky wrote:
> > "404 The page you're looking for could not be found."
> > 
> > Anonymous access not supported?

https://github.com/sergey-senozhatsky/linux-next-ss/commits/printk-safe-deferred

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 13:33                                           ` Tetsuo Handa
@ 2016-12-22 19:24                                             ` Michal Hocko
  2016-12-24  6:25                                               ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-22 19:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: sergey.senozhatsky, linux-mm, pmladek

On Thu 22-12-16 22:33:40, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Now, what options are left other than replacing !mutex_trylock(&oom_lock)
> > with mutex_lock_killable(&oom_lock) which also stops wasting CPU time?
> > Are we waiting for offloading sending to consoles?
> 
>  From http://lkml.kernel.org/r/20161222115057.GH6048@dhcp22.suse.cz :
> > > Although I don't know whether we agree with mutex_lock_killable(&oom_lock)
> > > change, I think this patch alone can go as a cleanup.
> > 
> > No, we don't agree on that part. As this is a printk issue I do not want
> > to workaround it in the oom related code. That is just ridiculous. The
> > very same issue would be possible due to other continous source of log
> > messages.
> 
> I don't think so. Lockup caused by printk() is printk's problem. But printk
> is not the only source of lockup. If CONFIG_PREEMPT=y, it is possible that
> a thread which held oom_lock can sleep for unbounded period depending on
> scheduling priority.

Unless there is some runaway realtime process, the holder of the oom
lock shouldn't be preempted for an _unbounded_ amount of time. It might
take quite some time, though. And that is not limited to the OOM killer:
any important part of the system (IO flushers and whatnot) would suffer
from the same issue.

> Then, you call such latency as scheduler's problem?
> mutex_lock_killable(&oom_lock) change helps coping with whatever delays
> OOM killer/reaper might encounter.

It helps _your_ particular insane workload. I believe you can construct
many others which would cause a similar problem and where the above
suggestion wouldn't help a bit. Until I see that this is easily
triggerable on a reasonably configured system, I am not convinced
we should add more non-trivial changes to the OOM killer path.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 19:24                                             ` Michal Hocko
@ 2016-12-24  6:25                                               ` Tetsuo Handa
  2016-12-26 11:49                                                 ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-24  6:25 UTC (permalink / raw)
  To: mhocko, torvalds, akpm; +Cc: sergey.senozhatsky, linux-mm, pmladek

Linus and Andrew, may I have your opinion about the Linux kernel's memory
management subsystem? Currently, the kernel can lock up under OOM if it is given
more stress than Michal Hocko considers "sane". Should we just throw our hands
up when given stress like sleep-with-oom_lock2.c shown below?

---------- sleep-with-oom_lock2.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>
#include <poll.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { 0 };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size = 0;
	unsigned int i;
	int fd;
	signal(SIGCLD, SIG_IGN);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	for (size = 512; size <= 1024 * 256; size <<= 1) 
		buf = realloc(buf, size);
	if (!buf)
		exit(1);
	pipe(pipe_fd);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			char c;
			close(pipe_fd[1]);
			read(pipe_fd[0], &c, 1);
			/*
			 * Wait for a bit after idle-priority started
			 * invoking the OOM killer.
			 */
			poll(NULL, 0, 1000);
			/* Try to consume as much CPU time as possible. */
			for (i = 0; i < 1024 * 256; i += 4096)
				buf[i] = 0;
			_exit(0);
		}
	fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	close(pipe_fd[1]);
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
---------- sleep-with-oom_lock2.c end ----------



Michal Hocko wrote:
> On Thu 22-12-16 22:33:40, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > Now, what options are left other than replacing !mutex_trylock(&oom_lock)
> > > with mutex_lock_killable(&oom_lock) which also stops wasting CPU time?
> > > Are we waiting for offloading sending to consoles?
> > 
> >  From http://lkml.kernel.org/r/20161222115057.GH6048@dhcp22.suse.cz :
> > > > Although I don't know whether we agree with mutex_lock_killable(&oom_lock)
> > > > change, I think this patch alone can go as a cleanup.
> > > 
> > > No, we don't agree on that part. As this is a printk issue I do not want
> > > to workaround it in the oom related code. That is just ridiculous. The
> > > very same issue would be possible due to other continous source of log
> > > messages.
> > 
> > I don't think so. Lockup caused by printk() is printk's problem. But printk
> > is not the only source of lockup. If CONFIG_PREEMPT=y, it is possible that
> > a thread which held oom_lock can sleep for unbounded period depending on
> > scheduling priority.
> 
> Unless there is some runaway realtime process then the holder of the oom
> lock shouldn't be preempted for the _unbounded_ amount of time. It might
> take quite some time, though. But that is not reduced to the OOM killer.
> Any important part of the system (IO flushers and what not) would suffer
> from the same issue.

I fail to understand why you assume a "realtime process".
This lockup is still triggerable using only "normal" and "idle" priority processes.

Below are results where "printk() lockup with oom_lock held" was solved by
applying http://lkml.kernel.org/r/20161222140930.GF413@tigerII.localdomain .



sleep-with-oom_lock1.c shown below is a reproducer which did not recover
within an acceptable period.

---------- sleep-with-oom_lock1.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { 0 };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size = 0;
	unsigned int i;
	int fd;
	pipe(pipe_fd);
	signal(SIGCLD, SIG_IGN);
	if (fork() == 0) {
		prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
		while (1)
			pause();
	}
	close(pipe_fd[1]);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			char c;
			/* Wait until the first-victim is OOM-killed. */
			read(pipe_fd[0], &c, 1);
			/* Try to consume as much CPU time as possible. */
			while(1);
			_exit(0);
		}
	close(pipe_fd[0]);
	fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
---------- sleep-with-oom_lock1.c end ----------

Complete log is http://I-love.SAKURA.ne.jp/tmp/serial-20161224-1.txt.xz
----------
[  426.927853] idle-priority invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[  426.927855] idle-priority cpuset=/ mems_allowed=0
(...snipped...)
[  426.928017] Out of memory: Kill process 4360 (idle-priority) score 660 or sacrifice child
[  426.929886] Killed process 4362 (normal-priority) total-vm:4164kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  436.962756] normal-priority: page allocation stalls for 10015ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  436.962762] CPU: 0 PID: 5203 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[  447.123293] normal-priority: page allocation stalls for 20134ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  447.123299] CPU: 0 PID: 5176 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 1037.019523] normal-priority: page allocation stalls for 610074ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1037.019529] CPU: 0 PID: 5203 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 1050.710795] normal-priority: page allocation stalls for 623723ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1050.710801] CPU: 0 PID: 5176 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 1051.133604] systemd-logind: page allocation stalls for 510002ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 1051.133611] CPU: 0 PID: 668 Comm: systemd-logind Not tainted 4.9.0-next-20161222+ #480
[ 1051.133612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1051.133613] Call Trace:
[ 1051.133618]  dump_stack+0x85/0xc9
[ 1051.133621]  warn_alloc+0xf8/0x190
[ 1051.133624]  __alloc_pages_slowpath+0x4a8/0x8a0
[ 1051.133625]  __alloc_pages_nodemask+0x456/0x4e0
[ 1051.133627]  alloc_pages_current+0x97/0x1b0
[ 1051.133630]  ? find_get_entry+0x5/0x300
[ 1051.133631]  __page_cache_alloc+0x15d/0x1a0
[ 1051.133633]  ? pagecache_get_page+0x2c/0x2b0
[ 1051.133634]  filemap_fault+0x48e/0x6d0
[ 1051.133636]  ? filemap_fault+0x339/0x6d0
----------

The last OOM killer invocation was uptime = 426 and
I gave up waiting and pressed SysRq-b at uptime = 1051.

You might complain that it is not fair to use 'wasting CPU time by "while(1);"
in userspace' as a reason to push this patch. I agree that we can't cope with it
if CPU time is wasted in userspace.

But sleep-with-oom_lock2.c shown above is a similar reproducer which did not
recover within an acceptable period. This time, nobody is wasting CPU time in userspace.

Complete log is http://I-love.SAKURA.ne.jp/tmp/serial-20161224-2.txt.xz
----------
[ 1061.428002] idle-priority invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 1061.428004] idle-priority cpuset=/ mems_allowed=0
(...snipped...)
[ 1061.428147] Out of memory: Kill process 10553 (idle-priority) score 640 or sacrifice child
[ 1061.429857] Killed process 11527 (normal-priority) total-vm:4556kB, anon-rss:304kB, file-rss:4kB, shmem-rss:0kB
[ 1062.300349] warn_alloc: 117 callbacks suppressed
[ 1062.300351] normal-priority: page allocation stalls for 190508ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[ 1062.300355] CPU: 0 PID: 10858 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 1080.345125] normal-priority: page allocation stalls for 20165ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[ 1080.345131] CPU: 0 PID: 11564 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 2202.150829] normal-priority: page allocation stalls for 1330359ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[ 2202.150835] CPU: 0 PID: 10858 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
(...snipped...)
[ 2300.897797] normal-priority: page allocation stalls for 1240719ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[ 2300.897804] CPU: 0 PID: 11564 Comm: normal-priority Not tainted 4.9.0-next-20161222+ #480
[ 2300.897804] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 2300.897805] Call Trace:
[ 2300.897811]  dump_stack+0x85/0xc9
[ 2300.897814]  warn_alloc+0xf8/0x190
[ 2300.897817]  __alloc_pages_slowpath+0x4a8/0x8a0
[ 2300.897819]  __alloc_pages_nodemask+0x456/0x4e0
[ 2300.897820]  ? lock_page_memcg+0x5/0xf0
[ 2300.897823]  alloc_pages_vma+0xbe/0x2d0
[ 2300.897826]  ? sched_clock_cpu+0x84/0xb0
[ 2300.897829]  wp_page_copy+0x83/0x6f0
[ 2300.897830]  do_wp_page+0xa0/0x5c0
[ 2300.897831]  handle_mm_fault+0x929/0x1180
[ 2300.897832]  ? handle_mm_fault+0x5e/0x1180
[ 2300.897835]  __do_page_fault+0x24a/0x530
[ 2300.897837]  do_page_fault+0x30/0x80
[ 2300.897840]  page_fault+0x28/0x30
[ 2300.897841] RIP: 0033:0x4009c0
[ 2300.897842] RSP: 002b:00007fff48aade40 EFLAGS: 00010287
[ 2300.897843] RAX: 0000000000001000 RBX: 000000000000000e RCX: 00007fc7e74bcde0
[ 2300.897844] RDX: 00000000000003e8 RSI: 0000000000000000 RDI: 0000000000000000
[ 2300.897844] RBP: 00007fc7e795f010 R08: 00007fc7e79a0740 R09: 0000000000000000
[ 2300.897845] R10: 00007fff48aadbc0 R11: 0000000000000246 R12: 0000000000080000
[ 2300.897846] R13: 00007fff48aadfe0 R14: 0000000000000000 R15: 0000000000000000
----------

The last OOM killer invocation was at uptime = 1061, and
I gave up waiting and pressed SysRq-b at uptime = 2300.

See? The runaway occurs inside kernel space, due to almost-busy-looping
direct reclaim while a thread with idle priority holds oom_lock.

My assertion is that we need to make sure that the OOM killer/reaper are given
enough CPU time so that they can perform memory reclaim operations and release
oom_lock. We cannot solve the CPU time consumption in the sleep-with-oom_lock1.c
case, but we can solve it in the sleep-with-oom_lock2.c case.

I think it is a waste of CPU time to let all threads try direct reclaim, which
also burdens them with consistent __GFP_NOFS/__GFP_NOIO usage that might involve
dependencies on other threads. But changing that is not easy.

Thus, I'm proposing to save CPU time by waiting for the OOM killer/reaper when
direct reclaim did not help.

> 
> > Then, you call such latency as scheduler's problem?
> > mutex_lock_killable(&oom_lock) change helps coping with whatever delays
> > OOM killer/reaper might encounter.
> 
> It helps _your_ particular insane workload. I believe you can construct
> many others which which would cause a similar problem and the above
> suggestion wouldn't help a bit. Until I can see this is easily
> triggerable on a reasonably configured system then I am not convinced
> we should add more non trivial changes to the oom killer path.

I'm not using root privileges, realtime priority, or CONFIG_PREEMPT=y.
Why don't you care about the worst situations / corner cases?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-22 14:09                                               ` Sergey Senozhatsky
  2016-12-22 14:30                                                 ` Sergey Senozhatsky
@ 2016-12-26 10:54                                                 ` Tetsuo Handa
  2016-12-26 11:34                                                     ` Sergey Senozhatsky
  2016-12-26 11:41                                                   ` Sergey Senozhatsky
  1 sibling, 2 replies; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-26 10:54 UTC (permalink / raw)
  To: sergey.senozhatsky; +Cc: mhocko, linux-mm, pmladek

Sergey Senozhatsky wrote:
> On (12/22/16 23:01), Tetsuo Handa wrote:
> > > On (12/22/16 19:27), Tetsuo Handa wrote:
> > > > Thank you. I tried "[PATCHv6 0/7] printk: use printk_safe to handle printk()
> > > > recursive calls" at https://lkml.org/lkml/2016/12/21/232 on top of linux.git
> > > > as of commit 52bce91165e5f2db "splice: reinstate SIGPIPE/EPIPE handling", but
> > > > it turned out that your patch set does not solve this problem.
> > > > 
> > > > I was assuming that sending to consoles from printk() is offloaded to a kernel
> > > > thread dedicated for that purpose, but your patch set does not do it.
> > > 
> > > sorry, seems that I didn't deliver the information properly.
> > > 
> > > https://gitlab.com/senozhatsky/linux-next-ss/commits/printk-safe-deferred
> > > 
> > > there are 2 patch sets. the first one is printk-safe. the second one
> > > is async printk.
> > > 
> > > 9 patches in total (as of now).
> > > 
> > > can you access it?
> > 
> > "404 The page you're looking for could not be found."
> > 
> > Anonymous access not supported?
> 
> oops... hm, dunno, it says
> 
> : Visibility Level (?)
> :
> : Public
> : The project can be cloned without any authentication.
> 
> I'll switch to github then may be.
> 
> attached 9 patches.
> 
> NOTE: not the final version.
> 
> 
> 	-ss

I tried these 9 patches. Generally OK.

Although the "schedule_timeout_killable() lockup with oom_lock held" problem
remains, the async-printk patches help avoid the "printk() lockup with
oom_lock held" problem. Thank you.

Three comments from me.

(1) Messages from e.g. SysRq-b are not flushed to consoles before reboot.
    The "SysRq : Resetting" line is needed as a note that I gave up waiting.

(2) Should messages from e.g. SysRq-t be sent to consoles synchronously?
    The "echo t > /proc/sysrq-trigger" case can use asynchronous printing.
    But since the ALT-SysRq-T sequence from the keyboard may be used when the
    scheduler is not responding, it might be better to use synchronous printing.
    (Or define a magic key sequence to toggle synchronous/asynchronous?)

(3) I got the warning below. (Though it is not reproducible.)
    If fb_flashcursor() called console_trylock(), is console_may_schedule set to 1?

----------------------------------------
[  OK  [  255.862188] audit: type=1131 audit(1482733112.662:148): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-setup-dev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
] Stopped Create Static Device Nodes in /dev.

[  255.871468] BUG: sleeping function called from invalid context at kernel/printk/printk.c:2325
[  255.871469] in_atomic(): 1, irqs_disabled(): 1, pid: 10079, name: plymouthd
[  255.871469] 6 locks held by plymouthd/10079:
[  255.871470]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff817413e2>] ldsem_down_read+0x32/0x40
[  255.871472]  #1:  (&tty->atomic_write_lock){+.+.+.}, at: [<ffffffff81424309>] tty_write_lock+0x19/0x50
[  255.871474]  #2:  (&tty->termios_rwsem){++++..}, at: [<ffffffff81429d59>] n_tty_write+0x99/0x470
[  255.871475]  #3:  (&ldata->output_lock){+.+...}, at: [<ffffffff81429df0>] n_tty_write+0x130/0x470
[  255.871477]  #4:  (console_lock){+.+.+.}, at: [<ffffffff8110616e>] console_unlock+0x33e/0x6b0
[  255.871479]  #5:  (printing_lock){......}, at: [<ffffffff8143baf5>] vt_console_print+0x75/0x3d0
[  255.871481] irq event stamp: 15244
[  255.871481] hardirqs last  enabled at (15243): [<ffffffff81105011>] __down_trylock_console_sem+0x91/0xa0
[  255.871482] hardirqs last disabled at (15244): [<ffffffff81105ea4>] console_unlock+0x74/0x6b0
[  255.871482] softirqs last  enabled at (14968): [<ffffffff81096394>] __do_softirq+0x344/0x580
[  255.871482] softirqs last disabled at (14963): [<ffffffff810968d3>] irq_exit+0xe3/0x120
[  255.871483] CPU: 0 PID: 10079 Comm: plymouthd Not tainted 4.9.0-next-20161224+ #12
[  255.871483] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  255.871484] Call Trace:
[  255.871484]  dump_stack+0x85/0xc9
[  255.871485]  ___might_sleep+0x14a/0x250
[  255.871485]  console_conditional_schedule+0x22/0x30
[  255.871485]  fbcon_redraw.isra.24+0xa3/0x1d0
[  255.871486]  ? fbcon_cursor+0x151/0x1c0
[  255.871486]  fbcon_scroll+0x11d/0xcb0
[  255.871487]  con_scroll+0x160/0x170
[  255.871487]  lf+0x9c/0xb0
[  255.871487]  vt_console_print+0x2b7/0x3d0
[  255.871488]  console_unlock+0x457/0x6b0
[  255.871488]  do_con_write.part.19+0x737/0x9e0
[  255.871489]  ? mark_held_locks+0x71/0x90
[  255.871489]  con_write+0x57/0x60
[  255.871489]  n_tty_write+0x1bf/0x470
[  255.871490]  ? prepare_to_wait_event+0x110/0x110
[  255.871490]  tty_write+0x157/0x2d0
[  255.871491]  ? n_tty_open+0xd0/0xd0
[  255.871491]  __vfs_write+0x32/0x140
[  255.871491]  ? trace_hardirqs_on+0xd/0x10
[  255.871492]  ? __audit_syscall_entry+0xaa/0xf0
[  255.871492]  vfs_write+0xc2/0x1f0
[  255.871493]  ? syscall_trace_enter+0x1cb/0x3e0
[  255.871493]  SyS_write+0x53/0xc0
[  255.871493]  do_syscall_64+0x67/0x1f0
[  255.871494]  entry_SYSCALL64_slow_path+0x25/0x25
[  255.871494] RIP: 0033:0x7fb74cf8fc60
[  255.871495] RSP: 002b:00007ffcaab3fe88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  255.871495] RAX: ffffffffffffffda RBX: 000055d3acaf7160 RCX: 00007fb74cf8fc60
[  255.871496] RDX: 000000000000003f RSI: 000055d3acafd090 RDI: 0000000000000009
[  255.871496] RBP: 000055d3acafc440 R08: 0000000000000070 R09: 0000000000000000
[  255.871497] R10: 000000000000003f R11: 0000000000000246 R12: 000055d3acafc330
[  255.871497] R13: 000000000000003f R14: 00007ffcaab3ffb0 R15: 0000000000000000
         Stopping Create Static Device Nodes in /dev...

----------------------------------------

# ./scripts/faddr2line vmlinux console_unlock+0x74/0x6b0
console_unlock+0x74/0x6b0:
console_unlock at kernel/printk/printk.c:2228
# ./scripts/faddr2line vmlinux console_unlock+0x457/0x6b0
console_unlock+0x457/0x6b0:
call_console_drivers at kernel/printk/printk.c:1613
 (inlined by) console_unlock at kernel/printk/printk.c:2277
# ./scripts/faddr2line vmlinux vt_console_print+0x2b7/0x3d0
vt_console_print+0x2b7/0x3d0:
cr at drivers/tty/vt/vt.c:1137
 (inlined by) vt_console_print at drivers/tty/vt/vt.c:2598
# ./scripts/faddr2line vmlinux lf+0x9c/0xb0
lf+0x9c/0xb0:
lf at drivers/tty/vt/vt.c:1112
# ./scripts/faddr2line vmlinux con_scroll+0x160/0x170
con_scroll+0x160/0x170:
con_scroll at drivers/tty/vt/vt.c:327 (discriminator 1)
# ./scripts/faddr2line vmlinux fbcon_scroll+0x11d/0xcb0
fbcon_scroll+0x11d/0xcb0:
fbcon_scroll at drivers/video/console/fbcon.c:1898
# ./scripts/faddr2line vmlinux fbcon_cursor+0x151/0x1c0
fbcon_cursor+0x151/0x1c0:
fbcon_cursor at drivers/video/console/fbcon.c:1331
# ./scripts/faddr2line vmlinux fbcon_redraw.isra.24+0xa3/0x1d0
fbcon_redraw.isra.24+0xa3/0x1d0:
fbcon_redraw at drivers/video/console/fbcon.c:1756
# ./scripts/faddr2line vmlinux console_conditional_schedule+0x22/0x30
console_conditional_schedule+0x22/0x30:
console_conditional_schedule at kernel/printk/printk.c:2325


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 10:54                                                 ` Tetsuo Handa
  2016-12-26 11:34                                                     ` Sergey Senozhatsky
@ 2016-12-26 11:34                                                     ` Sergey Senozhatsky
  1 sibling, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-26 11:34 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky, mhocko, linux-mm, pmladek,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel,
	sergey.senozhatsky.work

Cc Greg, Jiri,

On (12/26/16 19:54), Tetsuo Handa wrote:
[..]
> 
> (3) I got below warning. (Though not reproducible.)
>     If fb_flashcursor() called console_trylock(), console_may_schedule is set to 1?

hmmm... it takes an atomic/spin `printing_lock' lock in vt_console_print(),
then calls console_conditional_schedule() from lf() while under that spin_lock.
`console_may_schedule' in console_conditional_schedule() still keeps the
value from console_trylock(), which was ok (console_may_schedule permits
rescheduling). but preemption got changed under console_trylock(), by
that spin_lock.

console_trylock() used to always forbid rescheduling; but that got changed
about a year ago.

the other thing is... do we really need to call console_conditional_schedule()
from fbcon_*()? console_unlock() does cond_resched() after every line it
prints. wouldn't that be enough?

so maybe we can drop some of the console_conditional_schedule()
call sites in fbcon. or update the console_conditional_schedule()
function to always check the current preemption state, not the
one we saw in console_trylock().

(not tested)

---

 kernel/printk/printk.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8b2696420abb..ad4a02cf9f15 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2075,6 +2075,24 @@ static int console_cpu_notify(unsigned int cpu)
 	return 0;
 }
 
+static int get_console_may_schedule(void)
+{
+	/*
+	 * When PREEMPT_COUNT disabled we can't reliably detect if it's
+	 * safe to schedule (e.g. calling printk while holding a spin_lock),
+	 * because preempt_disable()/preempt_enable() are just barriers there
+	 * and preempt_count() is always 0.
+	 *
+	 * RCU read sections have a separate preemption counter when
+	 * PREEMPT_RCU enabled thus we must take extra care and check
+	 * rcu_preempt_depth(), otherwise RCU read sections modify
+	 * preempt_count().
+	 */
+	return !oops_in_progress &&
+		preemptible() &&
+		!rcu_preempt_depth();
+}
+
 /**
  * console_lock - lock the console system for exclusive use.
  *
@@ -2112,20 +2130,7 @@ int console_trylock(void)
 		return 0;
 	}
 	console_locked = 1;
-	/*
-	 * When PREEMPT_COUNT disabled we can't reliably detect if it's
-	 * safe to schedule (e.g. calling printk while holding a spin_lock),
-	 * because preempt_disable()/preempt_enable() are just barriers there
-	 * and preempt_count() is always 0.
-	 *
-	 * RCU read sections have a separate preemption counter when
-	 * PREEMPT_RCU enabled thus we must take extra care and check
-	 * rcu_preempt_depth(), otherwise RCU read sections modify
-	 * preempt_count().
-	 */
-	console_may_schedule = !oops_in_progress &&
-			preemptible() &&
-			!rcu_preempt_depth();
+	console_may_schedule = get_console_may_schedule();
 	return 1;
 }
 EXPORT_SYMBOL(console_trylock);
@@ -2316,7 +2321,7 @@ EXPORT_SYMBOL(console_unlock);
  */
 void __sched console_conditional_schedule(void)
 {
-	if (console_may_schedule)
+	if (get_console_may_schedule())
 		cond_resched();
 }
 EXPORT_SYMBOL(console_conditional_schedule);


---


	-ss

> ----------------------------------------
> [  OK  [  255.862188] audit: type=1131 audit(1482733112.662:148): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-setup-dev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> ] Stopped Create Static Device Nodes in /dev.
> 
> [  255.871468] BUG: sleeping function called from invalid context at kernel/printk/printk.c:2325
> [  255.871469] in_atomic(): 1, irqs_disabled(): 1, pid: 10079, name: plymouthd
> [  255.871469] 6 locks held by plymouthd/10079:
> [  255.871470]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff817413e2>] ldsem_down_read+0x32/0x40
> [  255.871472]  #1:  (&tty->atomic_write_lock){+.+.+.}, at: [<ffffffff81424309>] tty_write_lock+0x19/0x50
> [  255.871474]  #2:  (&tty->termios_rwsem){++++..}, at: [<ffffffff81429d59>] n_tty_write+0x99/0x470
> [  255.871475]  #3:  (&ldata->output_lock){+.+...}, at: [<ffffffff81429df0>] n_tty_write+0x130/0x470
> [  255.871477]  #4:  (console_lock){+.+.+.}, at: [<ffffffff8110616e>] console_unlock+0x33e/0x6b0
> [  255.871479]  #5:  (printing_lock){......}, at: [<ffffffff8143baf5>] vt_console_print+0x75/0x3d0
> [  255.871481] irq event stamp: 15244
> [  255.871481] hardirqs last  enabled at (15243): [<ffffffff81105011>] __down_trylock_console_sem+0x91/0xa0
> [  255.871482] hardirqs last disabled at (15244): [<ffffffff81105ea4>] console_unlock+0x74/0x6b0
> [  255.871482] softirqs last  enabled at (14968): [<ffffffff81096394>] __do_softirq+0x344/0x580
> [  255.871482] softirqs last disabled at (14963): [<ffffffff810968d3>] irq_exit+0xe3/0x120
> [  255.871483] CPU: 0 PID: 10079 Comm: plymouthd Not tainted 4.9.0-next-20161224+ #12
> [  255.871483] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  255.871484] Call Trace:
> [  255.871484]  dump_stack+0x85/0xc9
> [  255.871485]  ___might_sleep+0x14a/0x250
> [  255.871485]  console_conditional_schedule+0x22/0x30
> [  255.871485]  fbcon_redraw.isra.24+0xa3/0x1d0
> [  255.871486]  ? fbcon_cursor+0x151/0x1c0
> [  255.871486]  fbcon_scroll+0x11d/0xcb0
> [  255.871487]  con_scroll+0x160/0x170
> [  255.871487]  lf+0x9c/0xb0
> [  255.871487]  vt_console_print+0x2b7/0x3d0
> [  255.871488]  console_unlock+0x457/0x6b0
> [  255.871488]  do_con_write.part.19+0x737/0x9e0
> [  255.871489]  ? mark_held_locks+0x71/0x90
> [  255.871489]  con_write+0x57/0x60
> [  255.871489]  n_tty_write+0x1bf/0x470
> [  255.871490]  ? prepare_to_wait_event+0x110/0x110
> [  255.871490]  tty_write+0x157/0x2d0
> [  255.871491]  ? n_tty_open+0xd0/0xd0
> [  255.871491]  __vfs_write+0x32/0x140
> [  255.871491]  ? trace_hardirqs_on+0xd/0x10
> [  255.871492]  ? __audit_syscall_entry+0xaa/0xf0
> [  255.871492]  vfs_write+0xc2/0x1f0
> [  255.871493]  ? syscall_trace_enter+0x1cb/0x3e0
> [  255.871493]  SyS_write+0x53/0xc0
> [  255.871493]  do_syscall_64+0x67/0x1f0
> [  255.871494]  entry_SYSCALL64_slow_path+0x25/0x25
> [  255.871494] RIP: 0033:0x7fb74cf8fc60
> [  255.871495] RSP: 002b:00007ffcaab3fe88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  255.871495] RAX: ffffffffffffffda RBX: 000055d3acaf7160 RCX: 00007fb74cf8fc60
> [  255.871496] RDX: 000000000000003f RSI: 000055d3acafd090 RDI: 0000000000000009
> [  255.871496] RBP: 000055d3acafc440 R08: 0000000000000070 R09: 0000000000000000
> [  255.871497] R10: 000000000000003f R11: 0000000000000246 R12: 000055d3acafc330
> [  255.871497] R13: 000000000000003f R14: 00007ffcaab3ffb0 R15: 0000000000000000
>          Stopping Create Static Device Nodes in /dev...
> 
> ----------------------------------------
> 
> # ./scripts/faddr2line vmlinux console_unlock+0x74/0x6b0
> console_unlock+0x74/0x6b0:
> console_unlock at kernel/printk/printk.c:2228
> # ./scripts/faddr2line vmlinux console_unlock+0x457/0x6b0
> console_unlock+0x457/0x6b0:
> call_console_drivers at kernel/printk/printk.c:1613
>  (inlined by) console_unlock at kernel/printk/printk.c:2277
> # ./scripts/faddr2line vmlinux vt_console_print+0x2b7/0x3d0
> vt_console_print+0x2b7/0x3d0:
> cr at drivers/tty/vt/vt.c:1137
>  (inlined by) vt_console_print at drivers/tty/vt/vt.c:2598
> # ./scripts/faddr2line vmlinux lf+0x9c/0xb0
> lf+0x9c/0xb0:
> lf at drivers/tty/vt/vt.c:1112
> # ./scripts/faddr2line vmlinux con_scroll+0x160/0x170
> con_scroll+0x160/0x170:
> con_scroll at drivers/tty/vt/vt.c:327 (discriminator 1)
> # ./scripts/faddr2line vmlinux fbcon_scroll+0x11d/0xcb0
> fbcon_scroll+0x11d/0xcb0:
> fbcon_scroll at drivers/video/console/fbcon.c:1898
> # ./scripts/faddr2line vmlinux fbcon_cursor+0x151/0x1c0
> fbcon_cursor+0x151/0x1c0:
> fbcon_cursor at drivers/video/console/fbcon.c:1331
> # ./scripts/faddr2line vmlinux fbcon_redraw.isra.24+0xa3/0x1d0
> fbcon_redraw.isra.24+0xa3/0x1d0:
> fbcon_redraw at drivers/video/console/fbcon.c:1756
> # ./scripts/faddr2line vmlinux console_conditional_schedule+0x22/0x30
> console_conditional_schedule+0x22/0x30:
> console_conditional_schedule at kernel/printk/printk.c:2325
> 

^ permalink raw reply related	[flat|nested] 96+ messages in thread


> [  255.871469] 6 locks held by plymouthd/10079:
> [  255.871470]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff817413e2>] ldsem_down_read+0x32/0x40
> [  255.871472]  #1:  (&tty->atomic_write_lock){+.+.+.}, at: [<ffffffff81424309>] tty_write_lock+0x19/0x50
> [  255.871474]  #2:  (&tty->termios_rwsem){++++..}, at: [<ffffffff81429d59>] n_tty_write+0x99/0x470
> [  255.871475]  #3:  (&ldata->output_lock){+.+...}, at: [<ffffffff81429df0>] n_tty_write+0x130/0x470
> [  255.871477]  #4:  (console_lock){+.+.+.}, at: [<ffffffff8110616e>] console_unlock+0x33e/0x6b0
> [  255.871479]  #5:  (printing_lock){......}, at: [<ffffffff8143baf5>] vt_console_print+0x75/0x3d0
> [  255.871481] irq event stamp: 15244
> [  255.871481] hardirqs last  enabled at (15243): [<ffffffff81105011>] __down_trylock_console_sem+0x91/0xa0
> [  255.871482] hardirqs last disabled at (15244): [<ffffffff81105ea4>] console_unlock+0x74/0x6b0
> [  255.871482] softirqs last  enabled at (14968): [<ffffffff81096394>] __do_softirq+0x344/0x580
> [  255.871482] softirqs last disabled at (14963): [<ffffffff810968d3>] irq_exit+0xe3/0x120
> [  255.871483] CPU: 0 PID: 10079 Comm: plymouthd Not tainted 4.9.0-next-20161224+ #12
> [  255.871483] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  255.871484] Call Trace:
> [  255.871484]  dump_stack+0x85/0xc9
> [  255.871485]  ___might_sleep+0x14a/0x250
> [  255.871485]  console_conditional_schedule+0x22/0x30
> [  255.871485]  fbcon_redraw.isra.24+0xa3/0x1d0
> [  255.871486]  ? fbcon_cursor+0x151/0x1c0
> [  255.871486]  fbcon_scroll+0x11d/0xcb0
> [  255.871487]  con_scroll+0x160/0x170
> [  255.871487]  lf+0x9c/0xb0
> [  255.871487]  vt_console_print+0x2b7/0x3d0
> [  255.871488]  console_unlock+0x457/0x6b0
> [  255.871488]  do_con_write.part.19+0x737/0x9e0
> [  255.871489]  ? mark_held_locks+0x71/0x90
> [  255.871489]  con_write+0x57/0x60
> [  255.871489]  n_tty_write+0x1bf/0x470
> [  255.871490]  ? prepare_to_wait_event+0x110/0x110
> [  255.871490]  tty_write+0x157/0x2d0
> [  255.871491]  ? n_tty_open+0xd0/0xd0
> [  255.871491]  __vfs_write+0x32/0x140
> [  255.871491]  ? trace_hardirqs_on+0xd/0x10
> [  255.871492]  ? __audit_syscall_entry+0xaa/0xf0
> [  255.871492]  vfs_write+0xc2/0x1f0
> [  255.871493]  ? syscall_trace_enter+0x1cb/0x3e0
> [  255.871493]  SyS_write+0x53/0xc0
> [  255.871493]  do_syscall_64+0x67/0x1f0
> [  255.871494]  entry_SYSCALL64_slow_path+0x25/0x25
> [  255.871494] RIP: 0033:0x7fb74cf8fc60
> [  255.871495] RSP: 002b:00007ffcaab3fe88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  255.871495] RAX: ffffffffffffffda RBX: 000055d3acaf7160 RCX: 00007fb74cf8fc60
> [  255.871496] RDX: 000000000000003f RSI: 000055d3acafd090 RDI: 0000000000000009
> [  255.871496] RBP: 000055d3acafc440 R08: 0000000000000070 R09: 0000000000000000
> [  255.871497] R10: 000000000000003f R11: 0000000000000246 R12: 000055d3acafc330
> [  255.871497] R13: 000000000000003f R14: 00007ffcaab3ffb0 R15: 0000000000000000
>          Stopping Create Static Device Nodes in /dev...
> 
> ----------------------------------------
> 
> # ./scripts/faddr2line vmlinux console_unlock+0x74/0x6b0
> console_unlock+0x74/0x6b0:
> console_unlock at kernel/printk/printk.c:2228
> # ./scripts/faddr2line vmlinux console_unlock+0x457/0x6b0
> console_unlock+0x457/0x6b0:
> call_console_drivers at kernel/printk/printk.c:1613
>  (inlined by) console_unlock at kernel/printk/printk.c:2277
> # ./scripts/faddr2line vmlinux vt_console_print+0x2b7/0x3d0
> vt_console_print+0x2b7/0x3d0:
> cr at drivers/tty/vt/vt.c:1137
>  (inlined by) vt_console_print at drivers/tty/vt/vt.c:2598
> # ./scripts/faddr2line vmlinux lf+0x9c/0xb0
> lf+0x9c/0xb0:
> lf at drivers/tty/vt/vt.c:1112
> # ./scripts/faddr2line vmlinux con_scroll+0x160/0x170
> con_scroll+0x160/0x170:
> con_scroll at drivers/tty/vt/vt.c:327 (discriminator 1)
> # ./scripts/faddr2line vmlinux fbcon_scroll+0x11d/0xcb0
> fbcon_scroll+0x11d/0xcb0:
> fbcon_scroll at drivers/video/console/fbcon.c:1898
> # ./scripts/faddr2line vmlinux fbcon_cursor+0x151/0x1c0
> fbcon_cursor+0x151/0x1c0:
> fbcon_cursor at drivers/video/console/fbcon.c:1331
> # ./scripts/faddr2line vmlinux fbcon_redraw.isra.24+0xa3/0x1d0
> fbcon_redraw.isra.24+0xa3/0x1d0:
> fbcon_redraw at drivers/video/console/fbcon.c:1756
> # ./scripts/faddr2line vmlinux console_conditional_schedule+0x22/0x30
> console_conditional_schedule+0x22/0x30:
> console_conditional_schedule at kernel/printk/printk.c:2325
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 10:54                                                 ` Tetsuo Handa
  2016-12-26 11:34                                                     ` Sergey Senozhatsky
@ 2016-12-26 11:41                                                   ` Sergey Senozhatsky
  2017-01-13 14:03                                                     ` Petr Mladek
  1 sibling, 1 reply; 96+ messages in thread
From: Sergey Senozhatsky @ 2016-12-26 11:41 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: sergey.senozhatsky, mhocko, linux-mm, pmladek

On (12/26/16 19:54), Tetsuo Handa wrote:
> I tried these 9 patches. Generally OK.
> 
> Although there is still "schedule_timeout_killable() lockup with oom_lock held"
> problem, async-printk patches help avoiding "printk() lockup with oom_lock held"
> problem. Thank you.
> 
> Three comments from me.
> 
> (1) Messages from e.g. SysRq-b is not waited for sent to consoles.
>     "SysRq : Resetting" line is needed as a note that I gave up waiting.
> 
> (2) Messages from e.g. SysRq-t should be sent to consoles synchronously?
>     "echo t > /proc/sysrq-trigger" case can use asynchronous printing.
>     But since ALT-SysRq-T sequence from keyboard may be used when scheduler
>     is not responding, it might be better to use synchronous printing.
>     (Or define a magic key sequence to toggle synchronous/asynchronous?)

it's really hard to tell if a message comes from sysrq or from
somewhere else. the current approach is to switch to *always* sync
printk once we see the first LOGLEVEL_EMERG message. so you can add
printk(LOGLEVEL_EMERG "sysrq-t\n"); for example, and printk will
switch to sync mode. sync mode might be a bit dangerous though,
since we printk from IRQ.

	-ss


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-24  6:25                                               ` Tetsuo Handa
@ 2016-12-26 11:49                                                 ` Michal Hocko
  2016-12-27 10:39                                                   ` Tetsuo Handa
  0 siblings, 1 reply; 96+ messages in thread
From: Michal Hocko @ 2016-12-26 11:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: torvalds, akpm, sergey.senozhatsky, linux-mm, pmladek

On Sat 24-12-16 15:25:43, Tetsuo Handa wrote:
[...]
> Michal Hocko wrote:
> > On Thu 22-12-16 22:33:40, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > Now, what options are left other than replacing !mutex_trylock(&oom_lock)
> > > > with mutex_lock_killable(&oom_lock) which also stops wasting CPU time?
> > > > Are we waiting for offloading sending to consoles?
> > > 
> > >  From http://lkml.kernel.org/r/20161222115057.GH6048@dhcp22.suse.cz :
> > > > > Although I don't know whether we agree with mutex_lock_killable(&oom_lock)
> > > > > change, I think this patch alone can go as a cleanup.
> > > > 
> > > > No, we don't agree on that part. As this is a printk issue I do not want
> > > > to workaround it in the oom related code. That is just ridiculous. The
> > > > very same issue would be possible due to other continuous source of log
> > > > messages.
> > > 
> > > I don't think so. Lockup caused by printk() is printk's problem. But printk
> > > is not the only source of lockup. If CONFIG_PREEMPT=y, it is possible that
> > > a thread which held oom_lock can sleep for unbounded period depending on
> > > scheduling priority.
> > 
> > Unless there is some runaway realtime process then the holder of the oom
> > lock shouldn't be preempted for the _unbounded_ amount of time. It might
> > take quite some time, though. But that is not reduced to the OOM killer.
> > Any important part of the system (IO flushers and what not) would suffer
> > from the same issue.
> 
> I fail to understand why you assume "realtime process".

Because then a standard process should get its time slice eventually. It
can take some time, especially with many cpu hogs....

> This lockup is still triggerable using "normal process" and "idle process".

if you have too many of them then you are just out of luck and
everything will take ages.

[...]

> See? The runaway is occurring inside kernel space due to almost-busy looping
> direct reclaim against a thread with idle priority with oom_lock held.
> 
> My assertion is that we need to make sure that the OOM killer/reaper are given
> enough CPU time so that they can perform memory reclaim operation and release
> oom_lock. We can't solve CPU time consumption by sleep-with-oom_lock1.c case
> but we can solve CPU time consumption by sleep-with-oom_lock2.c case.

What I am trying to tell you is that it is really hard to do something
about these situations in general. It is not all that hard to construct
workloads which will constantly preempt the sync oom path and we can do
hardly anything about that. OOM handling is quite complex and takes
considerable amount of time as long as we want to have some
deterministic behavior (unless that deterministic thing is to
immediately reboot which is not something everybody would like to see).

> I think it is waste of CPU time to let all threads try direct reclaim
> which also bothers them with consistent __GFP_NOFS/__GFP_NOIO usage which
> might involve dependency to other threads. But changing it is not easy.

Exactly.

> Thus, I'm proposing to save CPU time if waiting for the OOM killer/reaper
> when direct reclaim did not help.

Which will just move problem somewhere else I am afraid. Now you will
have hundreds of tasks bouncing on the global mutex. That never turned
out to be a good thing in the past and I am worried that it will just
bite us from a different side. What is worse it might hit us in cases
which do actually happen in the real life.

I am not saying that the current code works perfectly when we are
hitting the direct reclaim close to the OOM but improving that requires
much more than slapping a global lock there.
 
> > > Then, you call such latency as scheduler's problem?
> > > mutex_lock_killable(&oom_lock) change helps coping with whatever delays
> > > OOM killer/reaper might encounter.
> > 
> > It helps _your_ particular insane workload. I believe you can construct
> > many others which would cause a similar problem and the above
> > suggestion wouldn't help a bit. Until I can see this is easily
> > triggerable on a reasonably configured system then I am not convinced
> > we should add more non trivial changes to the oom killer path.
> 
> I'm not using root privileges nor realtime priority nor CONFIG_PREEMPT=y.
> > Why don't you care about the worst situation / corner cases?

I do care about them! I just do not want to put random hacks which might
seem to work on this _particular_ workload while it brings risks for
others. Look, those corner cases you are simulating are _interesting_ to
see how robust we are but they are no way close to what really happens
in the real life out there - we call those situations DoS from any
practical POV. Admins usually do everything to prevent from them by
configuring their systems and limiting untrusted users as much as
possible.

So please try to step back, try to understand that there is a difference
between interesting and matters_in_the_real_life and do not try to
_design_ the code on _corner cases_ because that might be more harmful
then useful.

Just try to remember how you were pushing really hard for oom timeouts
one year back because the OOM killer was suboptimal and could lockup. It
took some redesign and many changes to fix that. The result is
imho a better, more predictable and robust code which wouldn't be the
case if we just went your way to have a fix quickly...

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 11:49                                                 ` Michal Hocko
@ 2016-12-27 10:39                                                   ` Tetsuo Handa
  2016-12-27 10:57                                                     ` Michal Hocko
  0 siblings, 1 reply; 96+ messages in thread
From: Tetsuo Handa @ 2016-12-27 10:39 UTC (permalink / raw)
  To: mhocko; +Cc: torvalds, akpm, sergey.senozhatsky, linux-mm, pmladek

Michal Hocko wrote:
> On Sat 24-12-16 15:25:43, Tetsuo Handa wrote:
> [...]
> > Michal Hocko wrote:
> > > On Thu 22-12-16 22:33:40, Tetsuo Handa wrote:
> > > > Tetsuo Handa wrote:
> > > > > Now, what options are left other than replacing !mutex_trylock(&oom_lock)
> > > > > with mutex_lock_killable(&oom_lock) which also stops wasting CPU time?
> > > > > Are we waiting for offloading sending to consoles?
> > > > 
> > > >  From http://lkml.kernel.org/r/20161222115057.GH6048@dhcp22.suse.cz :
> > > > > > Although I don't know whether we agree with mutex_lock_killable(&oom_lock)
> > > > > > change, I think this patch alone can go as a cleanup.
> > > > > 
> > > > > No, we don't agree on that part. As this is a printk issue I do not want
> > > > > to workaround it in the oom related code. That is just ridiculous. The
> > > > > very same issue would be possible due to other continuous source of log
> > > > > messages.
> > > > 
> > > > I don't think so. Lockup caused by printk() is printk's problem. But printk
> > > > is not the only source of lockup. If CONFIG_PREEMPT=y, it is possible that
> > > > a thread which held oom_lock can sleep for unbounded period depending on
> > > > scheduling priority.
> > > 
> > > Unless there is some runaway realtime process then the holder of the oom
> > > lock shouldn't be preempted for the _unbounded_ amount of time. It might
> > > take quite some time, though. But that is not reduced to the OOM killer.
> > > Any important part of the system (IO flushers and what not) would suffer
> > > from the same issue.
> > 
> > I fail to understand why you assume "realtime process".
> 
> Because then a standard process should get its time slice eventually. It
> can take some time, especially with many cpu hogs....

An "idle process" failed to get its time slice for more than 20 minutes
when we wanted it to sleep for only 1 millisecond. That is too long to
call "eventually" in real life.

> 
> > This lockup is still triggerable using "normal process" and "idle process".
> 
> if you have too many of them then you are just out of luck and
> everything will take ages.

If the "normal processes" had waited for oom_lock (or had backed off with
incrementally longer waits), the "idle process" would have been able to
wake up immediately.

> 
> [...]
> 
> > See? The runaway is occurring inside kernel space due to almost-busy looping
> > direct reclaim against a thread with idle priority with oom_lock held.
> > 
> > My assertion is that we need to make sure that the OOM killer/reaper are given
> > enough CPU time so that they can perform memory reclaim operation and release
> > oom_lock. We can't solve CPU time consumption by sleep-with-oom_lock1.c case
> > but we can solve CPU time consumption by sleep-with-oom_lock2.c case.
> 
> What I am trying to tell you is that it is really hard to do something
> about these situations in general. It is not all that hard to construct
> workloads which will constantly preempt the sync oom path and we can do
> hardly anything about that. OOM handling is quite complex and takes
> considerable amount of time as long as we want to have some
> deterministic behavior (unless that deterministic thing is to
> immediately reboot which is not something everybody would like to see).
> 
> > I think it is waste of CPU time to let all threads try direct reclaim
> > which also bothers them with consistent __GFP_NOFS/__GFP_NOIO usage which
> > might involve dependency to other threads. But changing it is not easy.
> 
> Exactly.
> 
> > Thus, I'm proposing to save CPU time if waiting for the OOM killer/reaper
> > when direct reclaim did not help.
> 
> Which will just move problem somewhere else I am afraid. Now you will
> have hundreds of tasks bouncing on the global mutex. That never turned
> out to be a good thing in the past and I am worried that it will just
> bite us from a different side. What is worse it might hit us in cases
> which do actually happen in the real life.
> 
> I am not saying that the current code works perfectly when we are
> hitting the direct reclaim close to the OOM but improving that requires
> much more than slapping a global lock there.

So, we finally agreed that there are problems when we are hitting the direct
reclaim close to the OOM. Good.

>  
> > > > Then, you call such latency as scheduler's problem?
> > > > mutex_lock_killable(&oom_lock) change helps coping with whatever delays
> > > > OOM killer/reaper might encounter.
> > > 
> > > It helps _your_ particular insane workload. I believe you can construct
> > > many others which would cause a similar problem and the above
> > > suggestion wouldn't help a bit. Until I can see this is easily
> > > triggerable on a reasonably configured system then I am not convinced
> > > we should add more non trivial changes to the oom killer path.
> > 
> > I'm not using root privileges nor realtime priority nor CONFIG_PREEMPT=y.
> > Why don't you care about the worst situation / corner cases?
> 
> I do care about them! I just do not want to put random hacks which might
> seem to work on this _particular_ workload while it brings risks for
> others. Look, those corner cases you are simulating are _interesting_ to
> see how robust we are but they are no way close to what really happens
> in the real life out there - we call those situations DoS from any
> practical POV. Admins usually do everything to prevent from them by
> configuring their systems and limiting untrusted users as much as
> possible.

I wonder why you introduce the "untrusted users" concept. In my experience,
there are no "untrusted users". All users who use their systems are trusted
and innocent, but they _by chance_ hit problems when close to (or already
at) the OOM.

> 
> So please try to step back, try to understand that there is a difference
> between interesting and matters_in_the_real_life and do not try to
> _design_ the code on _corner cases_ because that might be more harmful
> then useful.
> 

The reason I continue testing corner cases is that you don't accept a
catch-all reporting mechanism. Therefore, I test harder and harder so that
we can live without a catch-all reporting mechanism. But you also say that
making changes to handle corner cases is bad. That leaves me with an
annoying dilemma.

Based on many reproducers I showed you, problems are categorized to
4 patterns shown below.

  (1) You are aware of bugs, you think they are problems, but you don't
      have solutions.

  (2) You are aware of bugs, you know we can hit these bugs, but you don't
      think they are problems.

  (3) You are aware of bugs, but you don't think we can hit these bugs.

  (4) You are not aware of bugs.

And asynchronous watchdog can catch all patterns which will occur in
the real life. Asynchronous watchdog is far safer than putting random
hacks into allocator path.

My suggestion is to let the kernel honestly report that something might
have gone wrong under the "Somebody else will make progress for me"
approach. I tolerate your "Somebody else will make progress for me"
approach as long as we allow the kernel to report problems honestly.

> Just try to remember how you were pushing really hard for oom timeouts
> one year back because the OOM killer was suboptimal and could lockup. It
> took some redesign and many changes to fix that. The result is
> imho a better, more predictable and robust code which wouldn't be the
> case if we just went your way to have a fix quickly...

I agree that the result is good for users who can update kernels. But that
change was too large to backport. Any approach which was not ready in time
for customers' deadline for deciding which kernels to use for 10 years is
useless to them. The lack of a catch-all reporting/triggering mechanism is
unfortunate for both customers and troubleshooting staff at support centers.

Improving the direct reclaim close to the OOM requires a lot of effort.
We might add new bugs during that effort. So, where is the valid reason
that we cannot have an asynchronous watchdog like kmallocwd? Please do
explain in the kmallocwd thread. You have never persuaded me about keeping
kmallocwd
out of tree.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-27 10:39                                                   ` Tetsuo Handa
@ 2016-12-27 10:57                                                     ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2016-12-27 10:57 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: torvalds, akpm, sergey.senozhatsky, linux-mm, pmladek

On Tue 27-12-16 19:39:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 24-12-16 15:25:43, Tetsuo Handa wrote:
[...]
> > > Thus, I'm proposing to save CPU time if waiting for the OOM killer/reaper
> > > when direct reclaim did not help.
> > 
> > Which will just move problem somewhere else I am afraid. Now you will
> > have hundreds of tasks bouncing on the global mutex. That never turned
> > out to be a good thing in the past and I am worried that it will just
> > bite us from a different side. What is worse it might hit us in cases
> > which do actually happen in the real life.
> > 
> > I am not saying that the current code works perfectly when we are
> > hitting the direct reclaim close to the OOM but improving that requires
> > much more than slapping a global lock there.
> 
> So, we finally agreed that there are problems when we are hitting the direct
> reclaim close to the OOM. Good.

There has never been a disagreement here. The point we seem to disagree
on is how much the issues you are seeing matter. I do not consider them
top priority because they do not happen often enough in real life.
 
> > > > > Then, you call such latency as scheduler's problem?
> > > > > mutex_lock_killable(&oom_lock) change helps coping with whatever delays
> > > > > OOM killer/reaper might encounter.
> > > > 
> > > > It helps _your_ particular insane workload. I believe you can construct
> > > > many others which would cause a similar problem and the above
> > > > suggestion wouldn't help a bit. Until I can see this is easily
> > > > triggerable on a reasonably configured system then I am not convinced
> > > > we should add more non trivial changes to the oom killer path.
> > > 
> > > I'm not using root privileges nor realtime priority nor CONFIG_PREEMPT=y.
> > > Why don't you care about the worst situation / corner cases?
> > 
> > I do care about them! I just do not want to put random hacks which might
> > seem to work on this _particular_ workload while it brings risks for
> > others. Look, those corner cases you are simulating are _interesting_ to
> > see how robust we are but they are no way close to what really happens
> > in the real life out there - we call those situations DoS from any
> > practical POV. Admins usually do everything to prevent from them by
> > configuring their systems and limiting untrusted users as much as
> > possible.
> 
> I wonder why you introduce "untrusted users" concept. From my experience,
> there was no "untrusted users". All users who use their systems are trusted
> and innocent, but they _by chance_ hit problems when close to (or already)
> the OOM.

my experience is that innocent users are nowhere close to what you are
simulating. And we tend to handle most OOMs just fine in my experience.
 
[...]

> > Just try to remember how you were pushing really hard for oom timeouts
> > one year back because the OOM killer was suboptimal and could lockup. It
> > took some redesign and many changes to fix that. The result is
> > imho a better, more predictable and robust code which wouldn't be the
> > case if we just went your way to have a fix quickly...
> 
> I agree that the result is good for users who can update kernels. But that
> change was too large to backport. Any approach which was not ready in time
> for customers' deadline for deciding which kernels to use for 10 years is
> useless to them. The lack of a catch-all reporting/triggering mechanism is
> unfortunate for both customers and troubleshooting staff at support centers.

Then implement whatever you find appropriate on those old kernels and
deal with the follow-up reports. That is the fair deal you have to cope
with when using and supporting old kernels.
 
> Improving the direct reclaim close to the OOM requires a lot of effort.
> We might add new bugs during that effort. So, where is valid reason that
> we can not have asynchronous watchdog like kmallocwd? Please do explain
> at kmallocwd thread. You have never persuaded me about keeping kmallocwd
> out of tree.

I am not going to repeat my arguments over again. I haven't nacked that
patch and it seems there is no great interest in it so do not try to
claim that it is me who is blocking this feature. I just do not think it
is worth it.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 11:34                                                     ` Sergey Senozhatsky
  (?)
@ 2017-01-12 13:10                                                       ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-12 13:10 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, mhocko, linux-mm, Greg Kroah-Hartman, Jiri Slaby,
	linux-fbdev, linux-kernel, sergey.senozhatsky.work

On Mon 2016-12-26 20:34:07, Sergey Senozhatsky wrote:
> Cc Greg, Jiri,
> 
> On (12/26/16 19:54), Tetsuo Handa wrote:
> [..]
> > 
> > (3) I got below warning. (Though not reproducible.)
> >     If fb_flashcursor() called console_trylock(), console_may_schedule is set to 1?
> 
> hmmm... it takes an atomic/spin `printing_lock' lock in vt_console_print(),
> then call console_conditional_schedule() from lf(), being under spin_lock.
> `console_may_schedule' in console_conditional_schedule() still keeps the
> value from console_trylock(), which was ok (console_may_schedule permits
> rescheduling). but preemption got changed under console_trylock(), by
> that spin_lock.
> 
> console_trylock() used to always forbid rescheduling; but it got changed
> like a yaer ago.
> 
> the other thing is... do we really need to console_conditional_schedule()
> from fbcon_*()? console_unlock() does cond_resched() after every line it
> prints. wouldn't that be enough?
> 
> so may be we can drop some of console_conditional_schedule()
> call sites in fbcon. or update console_conditional_schedule()
> function to always return the current preemption value, not the
> one we saw in console_trylock().
> 
> (not tested)
> 
> ---
> 
>  kernel/printk/printk.c | 35 ++++++++++++++++++++---------------
>  1 file changed, 20 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 8b2696420abb..ad4a02cf9f15 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2075,6 +2075,24 @@ static int console_cpu_notify(unsigned int cpu)
>  	return 0;
>  }
>  
> +static int get_console_may_schedule(void)
> +{
> +	/*
> +	 * When PREEMPT_COUNT disabled we can't reliably detect if it's
> +	 * safe to schedule (e.g. calling printk while holding a spin_lock),
> +	 * because preempt_disable()/preempt_enable() are just barriers there
> +	 * and preempt_count() is always 0.
> +	 *
> +	 * RCU read sections have a separate preemption counter when
> +	 * PREEMPT_RCU enabled thus we must take extra care and check
> +	 * rcu_preempt_depth(), otherwise RCU read sections modify
> +	 * preempt_count().
> +	 */
> +	return !oops_in_progress &&
> +		preemptible() &&
> +		!rcu_preempt_depth();
> +}
> +
>  /**
>   * console_lock - lock the console system for exclusive use.
>   *
> @@ -2316,7 +2321,7 @@ EXPORT_SYMBOL(console_unlock);
>   */
>  void __sched console_conditional_schedule(void)
>  {
> -	if (console_may_schedule)
> +	if (get_console_may_schedule())

Note that console_may_schedule should be zero when
the console drivers are called. See the following lines in
console_unlock():

	/*
	 * Console drivers are called under logbuf_lock, so
	 * @console_may_schedule should be cleared before; however, we may
	 * end up dumping a lot of lines, for example, if called from
	 * console registration path, and should invoke cond_resched()
	 * between lines if allowable.  Not doing so can cause a very long
	 * scheduling stall on a slow console leading to RCU stall and
	 * softlockup warnings which exacerbate the issue with more
	 * messages practically incapacitating the system.
	 */
	do_cond_resched = console_may_schedule;
	console_may_schedule = 0;

IMHO, this is the problem described by Tetsuo in the other mail.
We do not execute the above lines when the console semaphore is
re-taken and we run the main cycle again:

	/*
	 * Someone could have filled up the buffer again, so re-check if there's
	 * something to flush. In case we cannot trylock the console_sem again,
	 * there's a new owner and the console_unlock() from them will do the
	 * flush, no worries.
	 */
	raw_spin_lock(&logbuf_lock);
	retry = console_seq != log_next_seq;
	raw_spin_unlock_irqrestore(&logbuf_lock, flags);

	if (retry && console_trylock())
		goto again;


Well, simply moving the again: label is not correct either.
The global variable is explicitly set in some functions:

	console_lock[2094]             console_may_schedule = 1;
	console_unblank[2339]          console_may_schedule = 0;
	console_flush_on_panic[2361]   console_may_schedule = 0;

But console_trylock() will set it according to the real context
before console_unlock() runs.


Hmm, the enforced values have been there for ages (even in the
initial git commit). It was always 0 also in console_trylock() until
commit 6b97a20d3a7909daa06625d ("printk: set may_schedule for some
of console_trylock() callers").

It might make sense to completely remove the global
@console_may_schedule variable and always decide
by the context. It is slightly suboptimal. But it
simplifies the code and should be sane in all situations.

Sergey, if you agree with the above paragraph, do you want to prepare
the patch or should I do so?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 11:34                                                     ` Sergey Senozhatsky
  (?)
@ 2017-01-12 14:18                                                       ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-12 14:18 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, mhocko, linux-mm, Greg Kroah-Hartman, Jiri Slaby,
	linux-fbdev, linux-kernel, sergey.senozhatsky.work

On Mon 2016-12-26 20:34:07, Sergey Senozhatsky wrote:
> console_trylock() used to always forbid rescheduling; but it got changed
> about a year ago.
> 
> the other thing is... do we really need to call console_conditional_schedule()
> from fbcon_*()? console_unlock() does cond_resched() after every line it
> prints. wouldn't that be enough?
> 
> so maybe we can drop some of the console_conditional_schedule()
> call sites in fbcon. or update console_conditional_schedule()
> to always check the current preemption state, not the
> one we saw in console_trylock().

I was curious if it makes sense to remove
console_conditional_schedule() completely.

In practice, it never allows rescheduling when the console driver
is called via console_unlock(). This has been the case since 2006
and commit 78944e549d36673eb62 ("vt: printk: Fix framebuffer console
triggering might_sleep assertion"), which added

	console_may_schedule = 0;

into console_unlock() before the console drivers are called.


On the other hand, it seems that the rescheduling was always
enabled when some console operations were called via
tty_operations. For example:

struct tty_operations con_ops

  con_ops->con_write()
  -> do_con_write()  #calls console_lock()
   -> do_con_trol()
    -> fbcon_scroll()
     -> fbcon_redraw_move()
      -> console_conditional_schedule()

, where console_lock() sets console_may_schedule = 1;


A complete console scroll/redraw might take a while. The rescheduling
would make sense => IMHO, we should keep console_conditional_schedule()
or some alternative in the console drivers as well.

But I am afraid that we cannot use the automatic detection.
We are not able to detect preemption when CONFIG_PREEMPT_COUNT
is disabled, but we still would like to enable rescheduling
when called from the tty code (guarded by console_lock()).


As a result, we should keep console_may_schedule as a global
variable, and we cannot put the automatic detection into
console_conditional_schedule(). Instead, we need to fix the
handling of the global variable in console_unlock().

I am going to prepare a patch for this.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-12 14:18                                                       ` Petr Mladek
  (?)
@ 2017-01-13  2:28                                                         ` Sergey Senozhatsky
  -1 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2017-01-13  2:28 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel,
	sergey.senozhatsky.work

On (01/12/17 15:18), Petr Mladek wrote:
> On Mon 2016-12-26 20:34:07, Sergey Senozhatsky wrote:
> > console_trylock() used to always forbid rescheduling; but it got changed
> > about a year ago.
> > 
> > the other thing is... do we really need to call console_conditional_schedule()
> > from fbcon_*()? console_unlock() does cond_resched() after every line it
> > prints. wouldn't that be enough?
> > 
> > so maybe we can drop some of the console_conditional_schedule()
> > call sites in fbcon. or update console_conditional_schedule()
> > to always check the current preemption state, not the
> > one we saw in console_trylock().
> 
> I was curious if it makes sense to remove
> console_conditional_schedule() completely.

I was looking at this option at some point as well.

> In practice, it never allows rescheduling when the console driver
> is called via console_unlock(). It is since 2006 and the commit
> 78944e549d36673eb62 ("vt: printk: Fix framebuffer console
> triggering might_sleep assertion"). This commit added
> that
> 
> 	console_may_schedule = 0;
>
> into console_unlock() before the console drivers are called.
> 
> 
> On the other hand, it seems that the rescheduling was always
> enabled when some console operations were called via
> tty_operations. For example:
> 
> struct tty_operations con_ops
> 
>   con_ops->con_write()
>   -> do_con_write()  #calls console_lock()
>    -> do_con_trol()
>     -> fbcon_scroll()
>      -> fbcon_redraw_move()
>       -> console_conditional_schedule()
> 
> , where console_lock() sets console_may_schedule = 1;
> 
> 
> A complete console scroll/redraw might take a while. The rescheduling
> would make sense => IMHO, we should keep console_conditional_schedule()
> or some alternative in the console drivers as well.
> 
> But I am afraid that we could not use the automatic detection.
> We are not able to detect preemption when CONFIG_PREEMPT_COUNT

can one actually have a preemptible kernel with !CONFIG_PREEMPT_COUNT?
how? it's not even possible to change CONFIG_PREEMPT_COUNT in menuconfig.
the option is automatically selected by PREEMPT. and if PREEMPT is not
selected then _cond_resched() is just "{ rcu_all_qs(); return 0; }"

...
> We cannot put the automatic detection into console_conditional_schedule().

why can't we?


> I am going to prepare a patch for this.

I'm on it.

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread


* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-12 13:10                                                       ` Petr Mladek
  (?)
@ 2017-01-13  2:52                                                         ` Sergey Senozhatsky
  -1 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2017-01-13  2:52 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel,
	sergey.senozhatsky.work

On (01/12/17 14:10), Petr Mladek wrote:
[..]
> >  /**
> >   * console_lock - lock the console system for exclusive use.
> >   *
> > @@ -2316,7 +2321,7 @@ EXPORT_SYMBOL(console_unlock);
> >   */
> >  void __sched console_conditional_schedule(void)
> >  {
> > -	if (console_may_schedule)
> > +	if (get_console_may_schedule())
> 
> Note that console_may_schedule should be zero when
> the console drivers are called. See the following lines in
> console_unlock():
> 
> 	/*
> 	 * Console drivers are called under logbuf_lock, so
> 	 * @console_may_schedule should be cleared before; however, we may
> 	 * end up dumping a lot of lines, for example, if called from
> 	 * console registration path, and should invoke cond_resched()
> 	 * between lines if allowable.  Not doing so can cause a very long
> 	 * scheduling stall on a slow console leading to RCU stall and
> 	 * softlockup warnings which exacerbate the issue with more
> 	 * messages practically incapacitating the system.
> 	 */
> 	do_cond_resched = console_may_schedule;
> 	console_may_schedule = 0;



console drivers are never-ever-ever getting called under logbuf lock.
never. with disabled local IRQs - yes. under logbuf lock - no. that
would soft lockup systems in really bad ways, otherwise.

the reason why we set console_may_schedule to zero in
console_unlock() is.... VT. and lf() function in particular.

commit 78944e549d36673eb6265a2411574e79c28e23dc
Author: Antonino A. Daplas XXXX
Date:   Sat Aug 5 12:14:16 2006 -0700

    [PATCH] vt: printk: Fix framebuffer console triggering might_sleep assertion
    
    Reported by: Dave Jones
    
    Whilst printk'ing to both console and serial console, I got this...
    (2.6.18rc1)
    
    BUG: sleeping function called from invalid context at kernel/sched.c:4438
    in_atomic():0, irqs_disabled():1
    
    Call Trace:
     [<ffffffff80271db8>] show_trace+0xaa/0x23d
     [<ffffffff80271f60>] dump_stack+0x15/0x17
     [<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4
     [<ffffffff8029232e>] __cond_resched+0x15/0x55
     [<ffffffff80267eb8>] cond_resched+0x3b/0x42
     [<ffffffff80268c64>] console_conditional_schedule+0x12/0x14
     [<ffffffff80368159>] fbcon_redraw+0xf6/0x160
     [<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52
     [<ffffffff803a43c4>] scrup+0x6b/0xd6
     [<ffffffff803a4453>] lf+0x24/0x44
     [<ffffffff803a7ff8>] vt_console_print+0x166/0x23d
     [<ffffffff80295528>] __call_console_drivers+0x65/0x76
     [<ffffffff80295597>] _call_console_drivers+0x5e/0x62
     [<ffffffff80217e3f>] release_console_sem+0x14b/0x232
     [<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6
     [<ffffffff80251e3f>] run_workqueue+0xa8/0xfb
     [<ffffffff8024e5e0>] worker_thread+0xef/0x122
     [<ffffffff8023660f>] kthread+0x100/0x136
     [<ffffffff8026419e>] child_rip+0x8/0x12


and we really don't want to cond_resched() when we are in panic.
that's why console_flush_on_panic() sets it to zero explicitly.

console_trylock() checks oops_in_progress, so re-taking the semaphore
when we are in

	panic()
	 console_flush_on_panic()
          console_unlock()
           console_trylock()

should be OK. as well as doing get_console_conditional_schedule() somewhere
in console driver code.


I still don't understand why you guys think we can't simply do
get_console_conditional_schedule() and get the actual value.


[..]

> Sergey, if you agree with the above paragraph. Do you want to prepare
> the patch or should I do so?

I'm on it.

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13  2:52                                                         ` Sergey Senozhatsky
  (?)
@ 2017-01-13  3:53                                                           ` Sergey Senozhatsky
  -1 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2017-01-13  3:53 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel

On (01/13/17 11:52), Sergey Senozhatsky wrote:
[..]
> and we really don't want to cond_resched() when we are in panic.
> that's why console_flush_on_panic() sets it to zero explicitly.
> 
> console_trylock() checks oops_in_progress, so re-taking the semaphore
> when we are in
> 
> 	panic()
> 	 console_flush_on_panic()
>           console_unlock()
>            console_trylock()
> 
> should be OK. as well as doing get_console_conditional_schedule() somewhere
> in console driver code.

d'oh... no, this is false. console_flush_on_panic() is called after we
bust_spinlocks(0), BUT with local IRQs disabled. so console_trylock()
would still set console_may_schedule to 0.

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13  2:28                                                         ` Sergey Senozhatsky
  (?)
@ 2017-01-13 11:03                                                           ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-13 11:03 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel

On Fri 2017-01-13 11:28:43, Sergey Senozhatsky wrote:
> On (01/12/17 15:18), Petr Mladek wrote:
> > On Mon 2016-12-26 20:34:07, Sergey Senozhatsky wrote:
> > > console_trylock() used to always forbid rescheduling; but it got changed
> > > like a year ago.
> > > 
> > > the other thing is... do we really need to console_conditional_schedule()
> > > from fbcon_*()? console_unlock() does cond_resched() after every line it
> > > prints. wouldn't that be enough?
> > > 
> > > so may be we can drop some of console_conditional_schedule()
> > > call sites in fbcon. or update console_conditional_schedule()
> > > function to always return the current preemption value, not the
> > > one we saw in console_trylock().
> > 
> > I was curious if it makes sense to remove
> > console_conditional_schedule() completely.
> 
> I was looking at this option at some point as well.
> 
> > In practice, it never allows rescheduling when the console driver
> > is called via console_unlock(). It is since 2006 and the commit
> > 78944e549d36673eb62 ("vt: printk: Fix framebuffer console
> > triggering might_sleep assertion"). This commit added
> > that
> > 
> > 	console_may_schedule = 0;
> >
> > into console_unlock() before the console drivers are called.
> > 
> > 
> > On the other hand, it seems that the rescheduling was always
> > enabled when some console operations were called via
> > tty_operations. For example:
> > 
> > struct tty_operations con_ops
> > 
> >   con_ops->con_write()
> >   -> do_con_write()  #calls console_lock()
> >    -> do_con_trol()
> >     -> fbcon_scroll()
> >      -> fbcon_redraw_move()
> >       -> console_conditional_schedule()
> > 
> > , where console_lock() sets console_may_schedule = 1;
> > 
> > 
> > A complete console scroll/redraw might take a while. The rescheduling
> > would make sense => IMHO, we should keep console_conditional_schedule()
> > or some alternative in the console drivers as well.
> > 
> > But I am afraid that we could not use the automatic detection.
> > We are not able to detect preemption when CONFIG_PREEMPT_COUNT
> 
> can one actually have a preemptible kernel with !CONFIG_PREEMPT_COUNT?
> how? it's not even possible to change CONFIG_PREEMPT_COUNT in menuconfig.
> the option is automatically selected by PREEMPT. and if PREEMPT is not
> selected then _cond_resched() is just "{ rcu_all_qs(); return 0; }"

CONFIG_PREEMPT_COUNT is always enabled in a preemptive kernel. But
we do not care about the preemptible kernel. It reschedules
automatically anywhere in preemptible context.

The problem is the non-preemptive kernel. It is able to reschedule
only when someone explicitly calls cond_resched() or schedule().
In this case, we are able to detect the preemptible context
automatically only with CONFIG_PREEMPT_COUNT enabled.
We must not call cond_resched() if we are not sure.

> ...
> > We cannot put the automatic detection into console_conditional_schedule().
> 
> why can't we?

Because it would never call cond_resched() in a non-preemptive kernel
with CONFIG_PREEMPT_COUNT disabled. IMHO, we want to call it,
for example, when we scroll the entire screen from tty_operations.

Or do I miss anything?


> > I am going to prepare a patch for this.
> 
> I'm on it.

Uff, I already have one and am very close to sending it.

Sigh, I do not want to race over who will prepare and send the patch.
I just do not feel comfortable in the reviewer-only role.
I feel like I am just searching for problems in other people's patches
and annoying them with my complaints. I know that it is important
but I also want to produce something.

Also I feel that I still need to improve my coding skills.
And I need some training.

Finally, I would not have started writing my patch if yours needed
only small updates. But my investigation pushed me in a very
different direction from your proposal. It looked ugly to push
all the coding to your side.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13  2:52                                                         ` Sergey Senozhatsky
  (?)
@ 2017-01-13 11:14                                                           ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-13 11:14 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel

On Fri 2017-01-13 11:52:55, Sergey Senozhatsky wrote:
> On (01/12/17 14:10), Petr Mladek wrote:
> [..]
> > >  /**
> > >   * console_lock - lock the console system for exclusive use.
> > >   *
> > > @@ -2316,7 +2321,7 @@ EXPORT_SYMBOL(console_unlock);
> > >   */
> > >  void __sched console_conditional_schedule(void)
> > >  {
> > > -	if (console_may_schedule)
> > > +	if (get_console_may_schedule())
> > 
> > Note that console_may_schedule should be zero when
> > the console drivers are called. See the following lines in
> > console_unlock():
> > 
> > 	/*
> > 	 * Console drivers are called under logbuf_lock, so
> > 	 * @console_may_schedule should be cleared before; however, we may
> > 	 * end up dumping a lot of lines, for example, if called from
> > 	 * console registration path, and should invoke cond_resched()
> > 	 * between lines if allowable.  Not doing so can cause a very long
> > 	 * scheduling stall on a slow console leading to RCU stall and
> > 	 * softlockup warnings which exacerbate the issue with more
> > 	 * messages practically incapacitating the system.
> > 	 */
> > 	do_cond_resched = console_may_schedule;
> > 	console_may_schedule = 0;
> 
> 
> 
> console drivers are never-ever-ever getting called under logbuf lock.
> never. with disabled local IRQs - yes. under logbuf lock - no. that
> would soft lockup systems in really bad ways, otherwise.

Sure. It is just a misleading comment that someone wrote. I have
already fixed this in my patch.


> the reason why we set console_may_schedule to zero in
> console_unlock() is.... VT. and lf() function in particular.
> 
> commit 78944e549d36673eb6265a2411574e79c28e23dc
> Author: Antonino A. Daplas XXXX
> Date:   Sat Aug 5 12:14:16 2006 -0700
> 
>     [PATCH] vt: printk: Fix framebuffer console triggering might_sleep assertion
>     
>     Reported by: Dave Jones
>     
>     Whilst printk'ing to both console and serial console, I got this...
>     (2.6.18rc1)
>     
>     BUG: sleeping function called from invalid context at kernel/sched.c:4438
>     in_atomic():0, irqs_disabled():1

This is basically the same problem that Tetsuo has. This commit added
the line

	console_may_schedule = 0;

Tetsuo found that we did not clear it when going back
via the "again:" goto target.


> and we really don't want to cond_resched() when we are in panic.
> that's why console_flush_on_panic() sets it to zero explicitly.

This actually works even with the bug. console_flush_on_panic()
is called with interrupts disabled in panic(). Therefore
console_trylock would disable cond_resched.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13  3:53                                                           ` Sergey Senozhatsky
@ 2017-01-13 11:15                                                             ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-13 11:15 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel

On Fri 2017-01-13 12:53:07, Sergey Senozhatsky wrote:
> On (01/13/17 11:52), Sergey Senozhatsky wrote:
> [..]
> > and we really don't want to cond_resched() when we are in panic.
> > that's why console_flush_on_panic() sets it to zero explicitly.
> > 
> > console_trylock() checks oops_in_progress, so re-taking the semaphore
> > when we are in
> > 
> > 	panic()
> > 	 console_flush_on_panic()
> >           console_unlock()
> >            console_trylock()
> > 
> > should be OK. as well as doing get_console_conditional_schedule() somewhere
> > in console driver code.
> 
> d'oh... no, this is false. console_flush_on_panic() is called after we
> bust_spinlocks(0), BUT with local IRQs disabled. so console_trylock()
> would still set console_may_schedule to 0.

Ah, you found it yourself.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13 11:03                                                           ` Petr Mladek
  (?)
@ 2017-01-13 11:50                                                             ` Sergey Senozhatsky
  -1 siblings, 0 replies; 96+ messages in thread
From: Sergey Senozhatsky @ 2017-01-13 11:50 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Tetsuo Handa, mhocko,
	linux-mm, Greg Kroah-Hartman, Jiri Slaby, linux-fbdev,
	linux-kernel

On (01/13/17 12:03), Petr Mladek wrote:
[..]
> > why can't we?
> 
> > Because it would never call cond_resched() in a non-preemptive kernel
> with CONFIG_PREEMPT_COUNT disabled. IMHO, we want to call it,
> for example, when we scroll the entire screen from tty_operations.
> 
> Or do I miss anything?

so... basically. it has never called cond_resched() there. right?
why is this suddenly a problem now?

	-ss

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2017-01-13 11:50                                                             ` Sergey Senozhatsky
  (?)
@ 2017-01-13 12:15                                                               ` Petr Mladek
  -1 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-13 12:15 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Tetsuo Handa, mhocko, linux-mm,
	Greg Kroah-Hartman, Jiri Slaby, linux-fbdev, linux-kernel

On Fri 2017-01-13 20:50:24, Sergey Senozhatsky wrote:
> On (01/13/17 12:03), Petr Mladek wrote:
> [..]
> > > why can't we?
> > 
> > Because it would never call cond_resched() in a non-preemptive kernel
> > with CONFIG_PREEMPT_COUNT disabled. IMHO, we want to call it,
> > for example, when we scroll the entire screen from tty_operations.
> > 
> > Or do I miss anything?
> 
> so... basically. it has never called cond_resched() there. right?
> why is this suddenly a problem now?

But it did call cond_resched() when the very same code was invoked
from tty operations under console_lock(), which forced
console_may_schedule = 1.

It will never call cond_resched() from the tty operations
when CONFIG_PREEMPT_COUNT is disabled and we try to detect
the preemption automatically.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
  2016-12-26 11:41                                                   ` Sergey Senozhatsky
@ 2017-01-13 14:03                                                     ` Petr Mladek
  0 siblings, 0 replies; 96+ messages in thread
From: Petr Mladek @ 2017-01-13 14:03 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Tetsuo Handa, mhocko, linux-mm

On Mon 2016-12-26 20:41:06, Sergey Senozhatsky wrote:
> On (12/26/16 19:54), Tetsuo Handa wrote:
> > I tried these 9 patches. Generally OK.
> > 
> > Although there is still "schedule_timeout_killable() lockup with oom_lock held"
> > problem, async-printk patches help avoiding "printk() lockup with oom_lock held"
> > problem. Thank you.
> > 
> > Three comments from me.
> > 
> > (1) Messages from e.g. SysRq-b is not waited for sent to consoles.
> >     "SysRq : Resetting" line is needed as a note that I gave up waiting.
> > 
> > (2) Messages from e.g. SysRq-t should be sent to consoles synchronously?
> >     "echo t > /proc/sysrq-trigger" case can use asynchronous printing.
> >     But since ALT-SysRq-T sequence from keyboard may be used when scheduler
> >     is not responding, it might be better to use synchronous printing.
> >     (Or define a magic key sequence to toggle synchronous/asynchronous?)
> 
> it's really hard to tell if the message comes from sysrq or from
> somewhere else.

Yes, but we have the opposite problem now. We usually do not see any
sysrq message on the console with async printk.

> the current approach -- switch to *always* sync printk
> once we see the first LOGLEVEL_EMERG message. so you can add
> printk(LOGLEVEL_EMERG "sysrq-t\n"); for example, and printk will
> switch to sync mode. sync mode might be a bit dangerous though,
> since we printk from IRQ.

Sysrq forces all messages to the console by manipulating the
console_loglevel on purpose, see:

void __handle_sysrq(int key, bool check_mask)
{
	struct sysrq_key_op *op_p;
	int orig_log_level;
	int i;

	rcu_sysrq_start();
	rcu_read_lock();
	/*
	 * Raise the apparent loglevel to maximum so that the sysrq header
	 * is shown to provide the user with positive feedback.  We do not
	 * simply emit this at KERN_EMERG as that would change message
	 * routing in the consumers of /proc/kmsg.
	 */
	orig_log_level = console_loglevel;
	console_loglevel = CONSOLE_LOGLEVEL_DEFAULT;
	pr_info("SysRq : ");

The loglevel forcing seems to have been present since the initial
commit to git.

The comment explaining why KERN_EMERG is not a good idea was added
by the commit fb144adc517d9ebe8fd ("sysrq: add commentary on why we
use the console loglevel over using KERN_EMERG").

Also it seems that all messages are flushed with interrupts disabled
on purpose. See the commit message for the rcu calls in the commit
722773afd83209d4088d ("sysrq,rcu: suppress RCU stall warnings while
sysrq runs").


Therefore, it would make sense to switch to the synchronous
mode in this section.

The question is whether we want to switch back to asynchronous mode
when sysrq is finished. It is not easy to do race-free. One solution
would be to force synchronous mode via the printk_context per-CPU
variable, similar to the way we force printk_safe mode.

Alternatively, we could try to flush the console before restoring
the console_loglevel:

	if (console_trylock())
		console_unlock();
	console_loglevel = orig_log_level;


Of course, the best solution would be to store the desired console
loglevel with each message in the logbuf. But this is not easy because
it would break the ABI for external tools, like crashdump, crash, ...

Best Regards,
Petr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2017-01-13 14:03 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-06 10:33 [PATCH] mm/page_alloc: Wait for oom_lock before retrying Tetsuo Handa
2016-12-07  8:15 ` Michal Hocko
2016-12-07 15:29   ` Tetsuo Handa
2016-12-08  8:20     ` Vlastimil Babka
2016-12-08 11:00       ` Tetsuo Handa
2016-12-08 13:32         ` Michal Hocko
2016-12-08 16:18         ` Sergey Senozhatsky
2016-12-08 13:27     ` Michal Hocko
2016-12-09 14:23       ` Tetsuo Handa
2016-12-09 14:46         ` Michal Hocko
2016-12-10 11:24           ` Tetsuo Handa
2016-12-12  9:07             ` Michal Hocko
2016-12-12 11:49               ` Petr Mladek
2016-12-12 13:00                 ` Michal Hocko
2016-12-12 14:05                   ` Tetsuo Handa
2016-12-13  1:06                 ` Sergey Senozhatsky
2016-12-12 12:12               ` Tetsuo Handa
2016-12-12 12:55                 ` Michal Hocko
2016-12-12 13:19                   ` Michal Hocko
2016-12-13 12:06                     ` Tetsuo Handa
2016-12-13 17:06                       ` Michal Hocko
2016-12-14 11:37                         ` Tetsuo Handa
2016-12-14 12:42                           ` Michal Hocko
2016-12-14 16:36                             ` Tetsuo Handa
2016-12-14 18:18                               ` Michal Hocko
2016-12-15 10:21                                 ` Tetsuo Handa
2016-12-19 11:25                                   ` Tetsuo Handa
2016-12-19 12:27                                     ` Sergey Senozhatsky
2016-12-20 15:39                                       ` Sergey Senozhatsky
2016-12-22 10:27                                         ` Tetsuo Handa
2016-12-22 10:53                                           ` Petr Mladek
2016-12-22 13:40                                             ` Sergey Senozhatsky
2016-12-22 13:33                                           ` Tetsuo Handa
2016-12-22 19:24                                             ` Michal Hocko
2016-12-24  6:25                                               ` Tetsuo Handa
2016-12-26 11:49                                                 ` Michal Hocko
2016-12-27 10:39                                                   ` Tetsuo Handa
2016-12-27 10:57                                                     ` Michal Hocko
2016-12-22 13:42                                           ` Sergey Senozhatsky
2016-12-22 14:01                                             ` Tetsuo Handa
2016-12-22 14:09                                               ` Sergey Senozhatsky
2016-12-22 14:30                                                 ` Sergey Senozhatsky
2016-12-26 10:54                                                 ` Tetsuo Handa
2016-12-26 11:34                                                   ` Sergey Senozhatsky
2017-01-12 13:10                                                     ` Petr Mladek
2017-01-13  2:52                                                       ` Sergey Senozhatsky
2017-01-13  3:53                                                         ` Sergey Senozhatsky
2017-01-13 11:15                                                           ` Petr Mladek
2017-01-13 11:14                                                         ` Petr Mladek
2017-01-12 14:18                                                     ` Petr Mladek
2017-01-13  2:28                                                       ` Sergey Senozhatsky
2017-01-13 11:03                                                         ` Petr Mladek
2017-01-13 11:50                                                           ` Sergey Senozhatsky
2017-01-13 12:15                                                             ` Petr Mladek
2016-12-26 11:41                                                   ` Sergey Senozhatsky
2017-01-13 14:03                                                     ` Petr Mladek
2016-12-15  1:11                         ` Sergey Senozhatsky
2016-12-15  6:35                           ` Michal Hocko
2016-12-15 10:16                             ` Petr Mladek
2016-12-14  9:37                       ` Petr Mladek
2016-12-14 10:20                         ` Sergey Senozhatsky
2016-12-14 11:01                           ` Petr Mladek
2016-12-14 12:23                             ` Sergey Senozhatsky
2016-12-14 12:47                               ` Petr Mladek
2016-12-14 10:26                         ` Michal Hocko
2016-12-15  7:34                           ` Sergey Senozhatsky
2016-12-14 11:37                         ` Tetsuo Handa
2016-12-14 12:36                           ` Petr Mladek
2016-12-14 12:44                             ` Michal Hocko
2016-12-14 13:36                               ` Tetsuo Handa
2016-12-14 13:52                                 ` Michal Hocko
2016-12-14 12:50                             ` Sergey Senozhatsky
2016-12-12 14:59                   ` Tetsuo Handa
2016-12-12 15:55                     ` Michal Hocko
