linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
@ 2017-06-01 11:43 Tetsuo Handa
  2017-06-01 11:59 ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-01 11:43 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: Tetsuo Handa, Cong Wang, Dave Hansen, Johannes Weiner,
	Mel Gorman, Michal Hocko, Vlastimil Babka

Cong Wang has reported a lockup when running LTP memcg_stress test [1].
Although the cause of allocation stall is unknown (no OOM killer messages
found within partially written output), that test put much stress with
uncontrolled concurrent warn_alloc() calls enough to trigger NMI watchdog
on soft lockup at cpu_relax() loop in dump_stack().

(Immaterial lines removed.)
----------
[16213.477525] Call Trace:
[16213.487314]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66
[16213.497723]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16213.505627] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [cleanup:7598]
[16213.505713] CPU: 5 PID: 7598 Comm: cleanup Not tainted 4.9.23.el7.twitter.x86_64 #1
[16213.505714] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013
[16213.505722] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>] dump_stack+0x29/0x66
[16213.505795] Call Trace:
[16213.505799]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16213.505804]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16213.505807]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16213.505810]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16213.505814]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16213.505822]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16213.505826]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16213.505828]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16213.505830]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16213.505832]  [<ffffffff863b434d>] ? list_del+0xd/0x30
[16213.505833]  [<ffffffff86257e58>] ? ep_poll+0x308/0x320
[16213.505835]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16213.505837]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16213.505839]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16213.505877] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65 8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2 74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41 1a a2
[16214.250659] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [scribed:3905]
[16214.250765] CPU: 17 PID: 3905 Comm: scribed Tainted: G             L 4.9.23.el7.twitter.x86_64 #1
[16214.250767] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013
[16214.250776] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>] dump_stack+0x29/0x66
[16214.250840] Call Trace:
[16214.250843]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16214.250846]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16214.250849]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16214.250853]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16214.250855]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16214.250857]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16214.250859]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16214.250860]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16214.250863]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16214.250865]  [<ffffffff860c25c1>] ? pick_next_task_fair+0x471/0x4a0
[16214.250869]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16214.250871]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16214.250873]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16214.250918] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65 8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2 74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41 1a a2
[16215.157526]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16215.177523]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16215.197540]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16215.217331]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16215.237374]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16215.257136]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16215.276950]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16215.287555]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16215.307538]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16215.327165]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16215.346964]  [<ffffffff86700aa2>] page_fault+0x22/0x30
----------

Both the printk() flooding problem caused by uncontrolled concurrent
warn_alloc() calls and the OOM killer being unable to send SIGKILL problem
caused by preemption were already pointed out by me [2], but these problems
are left unfixed because Michal does not like serialization from allocation
path because he is worrying about unexpected side effects [3].

But it's time to at least workaround this problem because the former
problem is now reported by other than me. We have a choice (spinlock or
mutex or [4]) here for serializing warn_alloc() messages.

If we can assume that that NMI context never calls __alloc_pages_nodemask()
without __GFP_NOWARN, we could use a spinlock so that all memory allocation
stall/failure messages do not get mixed. But since the lockup above was
triggered by busy waiting at dump_stack(), we might still hit lockups if
we do busy waiting at warn_alloc(). Thus, spinlock is not a safe choice.

If we can give up serializing !__GFP_DIRECT_RECLAIM memory allocation
failure messages, we could use a mutex. Although avoid mixing both memory
allocation stall/failure messages and the OOM killer messages would be
nice, oom_lock mutex should not be used for this purpose, for waiting for
oom_lock mutex at warn_alloc() can prevent anybody from calling
out_of_memory() from __alloc_pages_may_oom() (i.e. after all lockups).

Therefore, this patch uses a mutex dedicated for warn_alloc() like
suggested in [3]. I believe that the safest fix is [4] which offloads
reporting of allocation stalls to the khungtaskd kernel thread so that
we can eliminate a lot of noisy duplicates, avoid ratelimit filtering
which randomly drops important traces needed for debugging, and allow
triggering more useful actions.

[1] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[2] http://lkml.kernel.org/r/1481020439-5867-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/20161212125535.GA3185@dhcp22.suse.cz
[4] http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 682ecac..4f98ff6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3185,10 +3185,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	va_list args;
 	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_MUTEX(warn_alloc_lock);
 
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
 		return;
 
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_lock(&warn_alloc_lock);
 	pr_warn("%s: ", current->comm);
 
 	va_start(args, fmt);
@@ -3207,6 +3210,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	dump_stack();
 	warn_alloc_show_mem(gfp_mask, nodemask);
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_unlock(&warn_alloc_lock);
 }
 
 static inline struct page *
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-01 11:43 [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable Tetsuo Handa
@ 2017-06-01 11:59 ` Michal Hocko
  2017-06-01 13:11   ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-01 11:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, Cong Wang, Dave Hansen, Johannes Weiner,
	Mel Gorman, Vlastimil Babka

On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> Cong Wang has reported a lockup when running LTP memcg_stress test [1].

This seems to be on an old and not pristine kernel. Does it happen also
on the vanilla up-to-date kernel?

[...]
> Therefore, this patch uses a mutex dedicated for warn_alloc() like
> suggested in [3].

As I've said previously. We have rate limiting and if that doesn't work
out well, let's tune it. The lock should be the last resort to go with.
We already throttle show_mem, maybe we can throttle dump_stack as well,
although it sounds a bit strange that this adds so much to the picture.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-01 11:59 ` Michal Hocko
@ 2017-06-01 13:11   ` Tetsuo Handa
  2017-06-01 13:28     ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-01 13:11 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
>
> This seems to be on an old and not pristine kernel. Does it happen also
> on the vanilla up-to-date kernel?

4.9 is not an old kernel! It might be close to the kernel version which
enterprise distributions would choose for their next long term supported
version.

And please stop saying "can you reproduce your problem with latest
linux-next (or at least latest linux)?" Not everybody can use the vanilla
up-to-date kernel!

What I'm pushing via kmallocwd patch is to prepare for overlooked problems
so that enterprise distributors can collect information and identify what
changes are needed to be backported.

As long as you ignore problems not happened with latest linux-next (or
at least latest linux), enterprise distribution users can do nothing.

>
> [...]
> > Therefore, this patch uses a mutex dedicated for warn_alloc() like
> > suggested in [3].
>
> As I've said previously. We have rate limiting and if that doesn't work
> out well, let's tune it. The lock should be the last resort to go with.
> We already throttle show_mem, maybe we can throttle dump_stack as well,
> although it sounds a bit strange that this adds so much to the picture.

Ratelimiting never works well. It randomly drops information which is
useful for debugging. Uncontrolled concurrent dump_stack() causes lockups.
And restricting dump_stack() drops more information.

What we should do is to yield CPU time to operations which might do useful
things (let threads not doing memory allocation; e.g. let printk kernel
threads to flush pending buffer, let console drivers write the output to
consoles, let watchdog kernel threads report what is happening).

When memory allocation request is stalling, serialization via waiting
for a lock does help.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-01 13:11   ` Tetsuo Handa
@ 2017-06-01 13:28     ` Michal Hocko
  2017-06-01 22:10       ` Andrew Morton
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-01 13:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> >
> > This seems to be on an old and not pristine kernel. Does it happen also
> > on the vanilla up-to-date kernel?
> 
> 4.9 is not an old kernel! It might be close to the kernel version which
> enterprise distributions would choose for their next long term supported
> version.
> 
> And please stop saying "can you reproduce your problem with latest
> linux-next (or at least latest linux)?" Not everybody can use the vanilla
> up-to-date kernel!

The changelog mentioned that the source of stalls is not clear so this
might be out-of-tree patches doing something wrong and dump_stack
showing up just because it is called often. This wouldn't be the first
time I have seen something like that. I am not really keen on adding
heavy lifting for something that is not clearly debugged and based on
hand waving and speculations.

> What I'm pushing via kmallocwd patch is to prepare for overlooked problems
> so that enterprise distributors can collect information and identify what
> changes are needed to be backported.
> 
> As long as you ignore problems not happened with latest linux-next (or
> at least latest linux), enterprise distribution users can do nothing.
> 
> >
> > [...]
> > > Therefore, this patch uses a mutex dedicated for warn_alloc() like
> > > suggested in [3].
> >
> > As I've said previously. We have rate limiting and if that doesn't work
> > out well, let's tune it. The lock should be the last resort to go with.
> > We already throttle show_mem, maybe we can throttle dump_stack as well,
> > although it sounds a bit strange that this adds so much to the picture.
> 
> Ratelimiting never works well. It randomly drops information which is
> useful for debugging. Uncontrolled concurrent dump_stack() causes lockups.
> And restricting dump_stack() drops more information.

As long as the dump_stack can be a source of the stalls, which I am not
so sure about, then we should rate limit it.

> What we should do is to yield CPU time to operations which might do useful
> things (let threads not doing memory allocation; e.g. let printk kernel
> threads to flush pending buffer, let console drivers write the output to
> consoles, let watchdog kernel threads report what is happening).

yes we call that preemptive kernel...

> When memory allocation request is stalling, serialization via waiting
> for a lock does help.

Which will mean that those unlucky ones which stall will stall even more
because they will wait on a lock with potentially many others. While
this certainly is a throttling mechanism it is also a big hammer.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-01 13:28     ` Michal Hocko
@ 2017-06-01 22:10       ` Andrew Morton
  2017-06-02  7:18         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2017-06-01 22:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka

On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > >
> > > This seems to be on an old and not pristine kernel. Does it happen also
> > > on the vanilla up-to-date kernel?
> > 
> > 4.9 is not an old kernel! It might be close to the kernel version which
> > enterprise distributions would choose for their next long term supported
> > version.
> > 
> > And please stop saying "can you reproduce your problem with latest
> > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > up-to-date kernel!
> 
> The changelog mentioned that the source of stalls is not clear so this
> might be out-of-tree patches doing something wrong and dump_stack
> showing up just because it is called often. This wouldn't be the first
> time I have seen something like that. I am not really keen on adding
> heavy lifting for something that is not clearly debugged and based on
> hand waving and speculations.

I'm thinking we should serialize warn_alloc anyway, to prevent the
output from concurrent calls getting all jumbled together?

I'm not sure I buy the "this isn't a mainline kernel" thing. 
warn_alloc() obviously isn't very robust, but we'd prefer that it be
robust to peculiar situations, wild-n-wacky kernel patches, etc.  It's
a low-level thing and it should Just Work.

I do think ratelimiting will be OK - if the kernel is producing such a
vast stream of warn_alloc() output then nobody is going to be reading
it all anyway.  Probably just the first one is enough for operators to
understand what's going wrong.

So...  I think both.  ratelimit *and* serialize.  Perhaps a simple but
suitable way of doing that is simply to disallow concurrent warn_allocs:

	/* comment goes here */
	if (test_and_set_bit(0, &foo))
		return;
	...
	clear_bit(0, &foo);

or whatever?

(And if we do decide to go with "mm,page_alloc: Serialize warn_alloc()
if schedulable", please do add code comments explaining what's going on)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-01 22:10       ` Andrew Morton
@ 2017-06-02  7:18         ` Michal Hocko
  2017-06-02 11:13           ` Tetsuo Handa
                             ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Michal Hocko @ 2017-06-02  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka

On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > >
> > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > on the vanilla up-to-date kernel?
> > > 
> > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > enterprise distributions would choose for their next long term supported
> > > version.
> > > 
> > > And please stop saying "can you reproduce your problem with latest
> > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > up-to-date kernel!
> > 
> > The changelog mentioned that the source of stalls is not clear so this
> > might be out-of-tree patches doing something wrong and dump_stack
> > showing up just because it is called often. This wouldn't be the first
> > time I have seen something like that. I am not really keen on adding
> > heavy lifting for something that is not clearly debugged and based on
> > hand waving and speculations.
> 
> I'm thinking we should serialize warn_alloc anyway, to prevent the
> output from concurrent calls getting all jumbled together?

dump_stack already serializes concurrent calls.

> I'm not sure I buy the "this isn't a mainline kernel" thing. 

The changelog doesn't really explain what is going on and only
speculates that the excessive warn_alloc is the cause. The kernel is 
4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
of 4.9. So I would really _like_ to see whether this is reproducible
with the upstream kernel. Especially when this is a LTP test.

> warn_alloc() obviously isn't very robust, but we'd prefer that it be
> robust to peculiar situations, wild-n-wacky kernel patches, etc.  It's
> a low-level thing and it should Just Work.

Yes I would agree and if we have an evidence that warn_alloc is really
the problem then I am all for fixing it. There is no such evidence yet.
Note that dump_stack serialization might be unfair because there is no
queuing. Is it possible that this is the problem? If yes we should
rather fix that because that is arguably even more low-level routine than
warn_alloc.

That being said. I strongly believe that this patch is not properly
justified, issue fully understood and as such a disagree with adding a
new lock on those grounds.

Until the above is resolved
Nacked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02  7:18         ` Michal Hocko
@ 2017-06-02 11:13           ` Tetsuo Handa
  2017-06-02 12:15             ` Michal Hocko
  2017-06-02 21:57             ` Cong Wang
  2017-06-02 16:59           ` Cong Wang
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-02 11:13 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
>> Michal Hocko wrote:
>>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
>>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1].
>>>
>>> This seems to be on an old and not pristine kernel. Does it happen also
>>> on the vanilla up-to-date kernel?
>>
>> 4.9 is not an old kernel! It might be close to the kernel version which
>> enterprise distributions would choose for their next long term supported
>> version.
>>
>> And please stop saying "can you reproduce your problem with latest
>> linux-next (or at least latest linux)?" Not everybody can use the vanilla
>> up-to-date kernel!
>
> The changelog mentioned that the source of stalls is not clear so this
> might be out-of-tree patches doing something wrong and dump_stack
> showing up just because it is called often. This wouldn't be the first
> time I have seen something like that. I am not really keen on adding
> heavy lifting for something that is not clearly debugged and based on
> hand waving and speculations.

You are asking users to prove that the problem is indeed in the MM subsystem,
but you are thinking that kmallocwd which helps users to check whether the
problem is indeed in the MM subsystem is not worth merging into mainline.
As a result, we have to try things based on what you think handwaving and
speculations. This is a catch-22. If you don't want handwaving/speculations,
please please do provide a mechanism for checking (a) and (b) shown later.

>
>> What I'm pushing via kmallocwd patch is to prepare for overlooked problems
>> so that enterprise distributors can collect information and identify what
>> changes are needed to be backported.
>>
>> As long as you ignore problems not happened with latest linux-next (or
>> at least latest linux), enterprise distribution users can do nothing.
>>
>>>
>>> [...]
>>>> Therefore, this patch uses a mutex dedicated for warn_alloc() like
>>>> suggested in [3].
>>>
>>> As I've said previously. We have rate limiting and if that doesn't work
>>> out well, let's tune it. The lock should be the last resort to go with.
>>> We already throttle show_mem, maybe we can throttle dump_stack as well,
>>> although it sounds a bit strange that this adds so much to the picture.
>>
>> Ratelimiting never works well. It randomly drops information which is
>> useful for debugging. Uncontrolled concurrent dump_stack() causes lockups.
>> And restricting dump_stack() drops more information.
>
> As long as the dump_stack can be a source of the stalls, which I am not
> so sure about, then we should rate limit it.

I'm to some degree suspecting (a) shown later, but I can't prove it due to
lack of a mechanism for debugging. Looking at timestamps of output, the delay
between lines is varying; sometimes multiple lines are printed within one
microsecond, sometimes two lines are printed with ten or twenty milliseconds.
This fluctuation state resembles what I can observe when I put the system under
overstressed condition (i.e. out of CPU time for making progress). But again,
I can't prove it because I'm also suspecting (b) shown later due to use of
memcg and multiple nodes.

Since the server used in that report is Dell Inc. PowerEdge C6220/03C9JJ,
I estimate that the total CPUs installed is 12 cores * 2 slots = 24 CPUs.
(I can confirm that at least 21 CPUs are recognized from "CPU: 20" output.)
Since Cong was trying to run memcg stress test with 150 memcg groups, I
estimate that there were 150 threads running. This means that the system
might have been put under memory pressure where total number of threads
looping inside the page allocator dominates total number of available CPUs.
Since Cong assigned 0.5GB memory limit on each memcg group on a server
which has 64GB of memory, I estimate that the system might experience
non-memcg OOM due to 150 * 0.5G > 64G.

Then, this situation resembles my testcase where I put the system under
memory pressure where total number of threads looping inside the page
allocator (about one hundred or so) dominates total number of available
CPUs (i.e. 4) on a system with only 2GB or 4GB RAM (and no swap partition).
What I can observe is that all CPU times are spent for pointless busy loop
inside __alloc_pages_slowpath() and makes the OOM killer unable to send
SIGKILL, and the system stalls to the level where printk() flooding happens
due to uncontrolled concurrent warn_alloc().

And your response is always "Your system is already DOSed. I don't make
changes for such system at all." and the situation remained unchanged
until this moment (i.e. Cong's report). It is possible that

  (a) Cong is unintentionally making the system under DOS like
      my non-memcg testcase while Cong just wanted to test memcg

but it is also possible that

  (b) Cong is reporting an unnoticed bug in the MM subsystem

as well as

  (c) Cong is reporting a bug which does not exist in the latest
      linux-next kernel

and you are suspecting only (c) without providing a mechanism for
checking (a) and (b). kmallocwd helps users to check (a) and (b)
whereas printk() flooding due to uncontrolled concurrent warn_alloc()
prevents users from checking (a) and (b). This is really bad.

>
>> What we should do is to yield CPU time to operations which might do useful
>> things (let threads not doing memory allocation; e.g. let printk kernel
>> threads to flush pending buffer, let console drivers write the output to
>> consoles, let watchdog kernel threads report what is happening).
>
> yes we call that preemptive kernel...
>

And the page allocator is not preemptive. It does not yield enough CPU
time for other threads to do potentially useful things, allowing (a) to
happen.

>> When memory allocation request is stalling, serialization via waiting
>> for a lock does help.
>
> Which will mean that those unlucky ones which stall will stall even more
> because they will wait on a lock with potentially many others. While
> this certainly is a throttling mechanism it is also a big hammer.

According to my testing, the cause of stalls with flooding of printk() from
warn_alloc() is exactly the lack of enough CPU time because the page
allocator continues busy looping when memory allocation is stalling.



Andrew Morton wrote:
> I'm thinking we should serialize warn_alloc anyway, to prevent the
> output from concurrent calls getting all jumbled together?

Yes. According to my testing, serializing warn_alloc() can not yield
enough CPU time because warn_alloc() is called only once per 10 seconds.
Serializing

-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {

in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls.

>
> I'm not sure I buy the "this isn't a mainline kernel" thing. 
> warn_alloc() obviously isn't very robust, but we'd prefer that it be
> robust to peculiar situations, wild-n-wacky kernel patches, etc.  It's
> a low-level thing and it should Just Work.

Michal is clear that he won't make warn_alloc() robust, please see
http://lkml.kernel.org/r/20161019115525.GH7517@dhcp22.suse.cz .
And since kmallocwd is stalling, I'm trying to make warn_alloc() robust.

And I'm trying to yield CPU time as well as avoid getting all messages
jumbled by serialization using a lock, and Michal does not like locking.
We are deadlocked. :-(

>
> I do think ratelimiting will be OK - if the kernel is producing such a
> vast stream of warn_alloc() output then nobody is going to be reading
> it all anyway.  Probably just the first one is enough for operators to
> understand what's going wrong.

I don't think ratelimiting is OK. Printing only current thread's traces
helps little because the memory allocation stall problem interacts with
other threads. Without reporting all potentially relevant threads, we
won't know more than "something went wrong". That's where kmallocwd
becomes useful.

>
> So...  I think both.  ratelimit *and* serialize.  Perhaps a simple but
> suitable way of doing that is simply to disallow concurrent warn_allocs:
>
> 	/* comment goes here */
> 	if (test_and_set_bit(0, &foo))
> 		return;
> 	...
> 	clear_bit(0, &foo);
>
> or whatever?
>
> (And if we do decide to go with "mm,page_alloc: Serialize warn_alloc()
> if schedulable", please do add code comments explaining what's going on)

I think "no ratelimit" for __GFP_DIRECT_RECLAIM allocation requests
(unless we do decide to go with state tracking like kmallocwd does).

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3187,11 +3187,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 				      DEFAULT_RATELIMIT_BURST);
 	static DEFINE_MUTEX(warn_alloc_lock);
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+	if (gfp_mask & __GFP_NOWARN)
 		return;
 
 	if (gfp_mask & __GFP_DIRECT_RECLAIM)
 		mutex_lock(&warn_alloc_lock);
+	else if (!__ratelimit(&nopage_rs))
+		return;
 	pr_warn("%s: ", current->comm);
 
 	va_start(args, fmt);



Michal Hocko wrote:
> On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > output from concurrent calls getting all jumbled together?
> 
> dump_stack already serializes concurrent calls.
> 
> > I'm not sure I buy the "this isn't a mainline kernel" thing. 
> 
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is 
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is a LTP test.

You are misunderstanding. This patch is intended for help users understand
what was going on, for flooded/ratelimited/truncated/dropped logs prevents
users from knowing what was going on. You are not allowing users to explain
what is going on, by not allowing users to know what is going on.

> 
> > warn_alloc() obviously isn't very robust, but we'd prefer that it be
> > robust to peculiar situations, wild-n-wacky kernel patches, etc.  It's
> > a low-level thing and it should Just Work.
> 
> Yes I would agree and if we have an evidence that warn_alloc is really
> the problem then I am all for fixing it. There is no such evidence yet.

I don't have access to a system with 24 CPUs. Thus, I use a system with
4 CPUs / 2GB RAM with linux-next-20170602 kernel.

Testcase:
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
        struct sched_param sp = { 0 };
        cpu_set_t cpu = { { 1 } };
        static int pipe_fd[2] = { EOF, EOF };
        char *buf = NULL;
        unsigned long size = 0;
        unsigned int i;
        int fd;
        pipe(pipe_fd);
        signal(SIGCLD, SIG_IGN);
        if (fork() == 0) {
                prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
                while (1)
                        pause();
        }
        close(pipe_fd[1]);
        sched_setaffinity(0, sizeof(cpu), &cpu);
        prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
        for (i = 0; i < 1024; i++)
                if (fork() == 0) {
                        char c;
                        /* Wait until the first-victim is OOM-killed. */
                        read(pipe_fd[0], &c, 1);
                        /* Try to consume as much CPU time as possible. */
                        while(1);
                        _exit(0);
                }
        close(pipe_fd[0]);
        fd = open("/dev/zero", O_RDONLY);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sched_setscheduler(0, SCHED_IDLE, &sp);
        prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
        read(fd, buf, size); /* Will cause OOM due to overcommit */
        kill(-1, SIGKILL);
        return 0; /* Not reached. */
}
----------------------------------------

Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-1.txt.xz :
----------------------------------------
[   46.896272] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
(...snipped...)
[   47.102375] Out of memory: Kill process 2210 (idle-priority) score 698 or sacrifice child
[   47.109067] Killed process 2211 (first-victim) total-vm:4168kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[   56.906104] postgres: page allocation stalls for 10003ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   56.911263] systemd-journal: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
(...snipped...)
[  282.940538] pickup: page allocation stalls for 150003ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
(...snipped...)
[  283.203997] sysrq: SysRq : Kill All Tasks
----------------------------------------
This result shows that the OOM killer cannot be invoked again because
userspace processes continued wasting CPU time after an OOM victim was
selected. Yes, I know you will shout that "wasting CPU time in userspace
in order to disturb the OOM killer is unfair". OK, here is another result.

Testcase:
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/prctl.h>

static void setproctitle(const char *name)
{
        prctl(PR_SET_NAME, (unsigned long) name, 0, 0, 0);
}

static cpu_set_t set = { { 1 } };
struct sched_param sp = { };
//static char filename[64] = { };
static char buffer[4096] = { };

static int file_io(void *unused)
{
        const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
        setproctitle("mmap-file");
        sched_setscheduler(0, SCHED_IDLE, &sp);
        while (write(fd, buffer, 1) > 0)
                sched_yield();
        close(fd);
        return 0;
}

/* down_read(&mm->mmap_sem) requester which is chosen as an OOM victim. */
static int reader(void *pid)
{
        int fd;
        char filename[32] = { };
        snprintf(filename, sizeof(filename), "/proc/%u/stat", *(pid_t *)pid);
        fd = open(filename, O_RDONLY);
        setproctitle("mmap-read");
        sched_setaffinity(0, sizeof(set), &set);
        //sched_setscheduler(0, SCHED_IDLE, &sp);
        while (pread(fd, buffer, sizeof(buffer), 0) > 0);
        close(fd);
        return 0;
}

/* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
static int writer(void *unused)
{
        const int fd = open("/proc/self/exe", O_RDONLY);
        setproctitle("mmap-write");
        sched_setaffinity(0, sizeof(set), &set);
        //sched_setscheduler(0, SCHED_IDLE, &sp);
        while (1) {
                void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
                //if (ptr && ptr != (void *) -1)
                munmap(ptr, 4096);
        }
        return 0;
}

int main(int argc, char *argv[])
{
#define MAX_PARALLEL 10
#define MAX_CONTENTION 10
        int i;
        int j;
        pid_t pids[MAX_PARALLEL];
        for (i = 0; i < MAX_PARALLEL; i++) {
                pids[i] = fork();
                if (pids[i] == EOF)
                        break;
                if (pids[i] == 0) {
                        sleep(1);
                        //for (i = 0; i < 2; i++)
                        //      clone(file_io, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
                        for (j = 0; j < MAX_CONTENTION; j++) {
                                char *cp = malloc(4096);
                                if (!cp)
                                        break;
                                clone(writer, cp + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
                        }
                        pause();
                        _exit(0);
                }
        }
        for (i = 0; i < MAX_PARALLEL; i++) {
                if (pids[i] == EOF)
                        break;
                for (j = 0; j < MAX_CONTENTION; j++) {
                        char *cp = malloc(4096);
                        if (!cp)
                                break;
                        clone(reader, cp + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, &pids[i]);
                }
        }
        { /* A dummy process for invoking the OOM killer. */
                char *buf = NULL;
                unsigned long i;
                unsigned long size = 0;
                setproctitle("mmap-mem");
                for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                        char *cp = realloc(buf, size);
                        if (!cp) {
                                size >>= 1;
                                break;
                        }
                        buf = cp;
                }
                sleep(2);
                //sched_setscheduler(0, SCHED_IDLE, &sp);
                fprintf(stderr, "Eating memory\n");
                for (i = 0; i < size; i += 4096)
                        buf[i] = '\0'; /* Will cause OOM due to overcommit */
                fprintf(stderr, "Exiting\n");
                return 0;
        }
}
----------------------------------------

Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-2.txt.xz :
----------------------------------------
[  123.771523] mmap-write invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  124.808940] mmap-write cpuset=/ mems_allowed=0
[  124.811595] CPU: 0 PID: 2852 Comm: mmap-write Not tainted 4.12.0-rc3-next-20170602 #99
[  124.815842] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  124.821171] Call Trace:
[  124.823106]  dump_stack+0x86/0xcf
[  124.825336]  dump_header+0x97/0x26d
[  124.827668]  ? trace_hardirqs_on+0xd/0x10
[  124.830222]  oom_kill_process+0x203/0x470
[  124.832778]  out_of_memory+0x138/0x580
[  124.835223]  __alloc_pages_slowpath+0x1100/0x11f0
[  124.838085]  __alloc_pages_nodemask+0x308/0x3c0
[  124.840850]  alloc_pages_current+0x6a/0xe0
[  124.843332]  __page_cache_alloc+0x119/0x150
[  124.845723]  filemap_fault+0x3dc/0x950
[  124.847932]  ? debug_lockdep_rcu_enabled+0x1d/0x20
[  124.850683]  ? xfs_filemap_fault+0x5b/0x180 [xfs]
[  124.853427]  ? down_read_nested+0x73/0xb0
[  124.855792]  xfs_filemap_fault+0x63/0x180 [xfs]
[  124.858327]  __do_fault+0x1e/0x140
[  124.860383]  __handle_mm_fault+0xb2c/0x1090
[  124.862760]  handle_mm_fault+0x190/0x350
[  124.865161]  __do_page_fault+0x266/0x520
[  124.867409]  do_page_fault+0x30/0x80
[  124.869846]  page_fault+0x28/0x30
[  124.871803] RIP: 0033:0x7fb997682dca
[  124.873875] RSP: 002b:0000000000777fe8 EFLAGS: 00010246
[  124.876601] RAX: 00007fb997b6e000 RBX: 0000000000000000 RCX: 00007fb997682dca
[  124.880077] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
[  124.883551] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  124.886933] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
[  124.890336] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000001
[  124.893853] Mem-Info:
[  126.408131] mmap-read: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[  126.408137] mmap-read cpuset=/ mems_allowed=0
(...snipped...)
[  350.182442] mmap-read: page allocation stalls for 230016ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[  350.182446] mmap-read cpuset=/ mems_allowed=0
(...snipped...)
[  350.836483] sysrq: SysRq : Kill All Tasks
(...snipped...)
[  389.308085] active_anon:1146 inactive_anon:2777 isolated_anon:0
[  389.308085]  active_file:479 inactive_file:508 isolated_file:0
[  389.308085]  unevictable:0 dirty:0 writeback:0 unstable:0
[  389.308085]  slab_reclaimable:9536 slab_unreclaimable:15265
[  389.308085]  mapped:629 shmem:3535 pagetables:34 bounce:0
[  389.308085]  free:356689 free_pcp:596 free_cma:0
[  389.308089] Node 0 active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2516kB dirty:0kB writeback:0kB shmem:14140kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  389.308090] Node 0 DMA free:15872kB min:440kB low:548kB high:656kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  389.308093] lowmem_reserve[]: 0 1561 1561 1561
[  389.308097] Node 0 DMA32 free:1410884kB min:44612kB low:55764kB high:66916kB active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB writepending:0kB present:2080640kB managed:1599404kB mlocked:0kB slab_reclaimable:38144kB slab_unreclaimable:61028kB kernel_stack:3808kB pagetables:136kB bounce:0kB free_pcp:2384kB local_pcp:696kB free_cma:0kB
[  389.308099] lowmem_reserve[]: 0 0 0 0
[  389.308103] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB
[  389.308121] Node 0 DMA32: 975*4kB (UME) 1199*8kB (UME) 1031*16kB (UME) 973*32kB (UME) 564*64kB (UME) 303*128kB (UME) 160*256kB (UME) 70*512kB (UME) 40*1024kB (UME) 21*2048kB (M) 272*4096kB (M) = 1410884kB
[  389.308142] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  389.308143] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  389.308143] 4522 total pagecache pages
[  389.308147] 0 pages in swap cache
[  389.308148] Swap cache stats: add 0, delete 0, find 0/0
[  389.308149] Free swap  = 0kB
[  389.308149] Total swap = 0kB
[  389.308150] 524157 pages RAM
[  389.308151] 0 pages HighMem/MovableOnly
[  389.308152] 120330 pages reserved
[  389.308153] 0 pages cma reserved
[  389.308153] 0 pages hwpoisoned
[  389.308155] Out of memory: Kill process 2649 (mmap-mem) score 783 or sacrifice child
----------------------------------------
This result shows that the OOM killer was not able to send SIGKILL until
I gave up waiting and pressed SysRq-i because __alloc_pages_slowpath() continued
wasting CPU time after the OOM killer tried to start printing memory information.
We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom().

> Note that dump_stack serialization might be unfair because there is no
> queuing. Is it possible that this is the problem? If yes we should
> rather fix that because that is arguably even more low-level routine than
> warn_alloc.

Wasting all CPU times in __alloc_pages_slowpath() enough to disallow
the OOM killer to send SIGKILL is also unfair. And you refuse to queue
allocating threads into oom_lock waiters at __alloc_pages_may_oom().

> 
> That being said. I strongly believe that this patch is not properly
> justified, issue fully understood and as such a disagree with adding a
> new lock on those grounds.
> 
> Until the above is resolved
> Nacked-by: Michal Hocko <mhocko@suse.com>

So, you can't agree with inspecting synchronously so that users can know
what is happening. Then, we have no choice other than use state tracking
and inspect asynchronously like kmallocwd does. Andrew, I'm so happy if we
can inspect asynchronously.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 11:13           ` Tetsuo Handa
@ 2017-06-02 12:15             ` Michal Hocko
  2017-06-02 17:13               ` Tetsuo Handa
  2017-06-02 21:57             ` Cong Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-02 12:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

On Fri 02-06-17 20:13:32, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> >> Michal Hocko wrote:
> >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> >>>
> >>> This seems to be on an old and not pristine kernel. Does it happen also
> >>> on the vanilla up-to-date kernel?
> >>
> >> 4.9 is not an old kernel! It might be close to the kernel version which
> >> enterprise distributions would choose for their next long term supported
> >> version.
> >>
> >> And please stop saying "can you reproduce your problem with latest
> >> linux-next (or at least latest linux)?" Not everybody can use the vanilla
> >> up-to-date kernel!
> >
> > The changelog mentioned that the source of stalls is not clear so this
> > might be out-of-tree patches doing something wrong and dump_stack
> > showing up just because it is called often. This wouldn't be the first
> > time I have seen something like that. I am not really keen on adding
> > heavy lifting for something that is not clearly debugged and based on
> > hand waving and speculations.
> 
> You are asking users to prove that the problem is indeed in the MM subsystem,
> but you are thinking that kmallocwd which helps users to check whether the
> problem is indeed in the MM subsystem is not worth merging into mainline.
> As a result, we have to try things based on what you think handwaving and
> speculations. This is a catch-22. If you don't want handwaving/speculations,
> please please do provide a mechanism for checking (a) and (b) shown later.

configure watchdog to bug on soft lockup, take a crash dump, see what
is going on there and you can draw a better picture of what is going on
here. Seriously I am fed up with all the "let's do the async thing
because it would tell much more" side discussions. You are trying to fix
a soft lockup which alone is not a deadly condition. If the system is
overwhelmed it can happen and if that is the case then we should care
whether it gets resolved or it is a permanent livelock situation. If yes
then we need to isolate which path is not preempting and why and place
the cond_resched there. The page allocator contains preemption points,
if we are lacking some for some pathological paths let's add them. For
some reason you seem to be focused only on the warn_alloc path, though,
while the real issue might be somewhere completely else.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02  7:18         ` Michal Hocko
  2017-06-02 11:13           ` Tetsuo Handa
@ 2017-06-02 16:59           ` Cong Wang
  2017-06-02 19:59           ` Andrew Morton
  2017-07-08  4:59           ` Tetsuo Handa
  3 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-06-02 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Fri, Jun 2, 2017 at 12:18 AM, Michal Hocko <mhocko@suse.com> wrote:
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is a LTP test.

Just FYI: our kernel 4.9.23.el7.twitter.x86_64 is almost same with
the upstream 4.9.23 release, with just _few_ non-mm patches
backported.

We do have test machines to test non-stable kernels but it
is slightly harder to build an upstream kernel on them, I mean
not as convenient as a kernel rpm...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 12:15             ` Michal Hocko
@ 2017-06-02 17:13               ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-02 17:13 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Fri 02-06-17 20:13:32, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > >> Michal Hocko wrote:
> > >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > >>>
> > >>> This seems to be on an old and not pristine kernel. Does it happen also
> > >>> on the vanilla up-to-date kernel?
> > >>
> > >> 4.9 is not an old kernel! It might be close to the kernel version which
> > >> enterprise distributions would choose for their next long term supported
> > >> version.
> > >>
> > >> And please stop saying "can you reproduce your problem with latest
> > >> linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > >> up-to-date kernel!
> > >
> > > The changelog mentioned that the source of stalls is not clear so this
> > > might be out-of-tree patches doing something wrong and dump_stack
> > > showing up just because it is called often. This wouldn't be the first
> > > time I have seen something like that. I am not really keen on adding
> > > heavy lifting for something that is not clearly debugged and based on
> > > hand waving and speculations.
> > 
> > You are asking users to prove that the problem is indeed in the MM subsystem,
> > but you are thinking that kmallocwd which helps users to check whether the
> > problem is indeed in the MM subsystem is not worth merging into mainline.
> > As a result, we have to try things based on what you think handwaving and
> > speculations. This is a catch-22. If you don't want handwaving/speculations,
> > please please do provide a mechanism for checking (a) and (b) shown later.
> 
> configure watchdog to bug on soft lockup, take a crash dump, see what
> is going on there and you can draw a better picture of what is going on
> here. Seriously I am fed up with all the "let's do the async thing
> because it would tell much more" side discussions. You are trying to fix
> a soft lockup which alone is not a deadly condition.

Wrong and unacceptable. I'm trying to collect information for not only the
cause of allocation stall reported by Cong Wang but also any allocation stalls
(e.g. infinite too_many_isolated() trap from shrink_inactive_list() and
infinite mempool_alloc() loop case).

I was indeed surprised to know how severe a race condition on a system with
many CPUs at http://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp .
Therefore, like you suspect, it is possible that dump_stack() serialization
might be unfair on some hardware, especially when a lot of allocating threads
are calling warn_alloc(). But if you read Cong's report, you will find that
allocation stalls began before the first detection of soft lockup due to
dump_stack() serialization. This suggests that allocation stalls already began
before the system enters into a state where a lot of allocating threads started
calling warn_alloc(). Then, I think that it is natural to suspect that the
first culprit is MM subsystem. If you believe that you can debug and explain
why the OOM killer was not invoked in Cong's case without comparing how situation
changes over time (i.e. with a crash dump only), please explain it in that thread.

>                                                      If the system is
> overwhelmed it can happen and if that is the case then we should care
> whether it gets resolved or it is a permanent livelock situation.

What I'm trying to check is exactly whether the stall gets resolved or
the stall is a permanent livelock situation, via allowing users to take
multiple snapshots of memory image and compare how situation changes over
time. Taking a crash dump via kernel panic cannot allow users to take
multiple snapshots of memory image.

>                                                                   If yes
> then we need to isolate which path is not preempting and why and place
> the cond_resched there. The page allocator contains preemption points,
> if we are lacking some for some pathological paths let's add them.

What I'm talking about is not the lack of cond_resched(). As far as we know,
cond_resched() is inserted whereever necessary. What I'm talking about is
that cond_resched() cannot yield enough CPU time.

Looping with calling cond_resched() can yield only e.g. 1% of CPU time
compared to looping without calling cond_resched(). But if e.g. 100 threads
are looping with only cond_resched(), some thread waiting for wake up from
schedule_timeout_killable(1) can delay a lot. If that some thread has idle
priority, the delay can become too long to wait.

>                                                                    For
> some reason you seem to be focused only on the warn_alloc path, though,
> while the real issue might be somewhere completely else.

Not only because uncontrolled concurrent warn_alloc() causes flooding of
printk() and prevents users from knowing what is going on, but also
warn_alloc() cannot be called when we hit e.g. infinite too_many_isolated()
loop case or infinite mempool_alloc() loop case which after all prevents
users from knowing what is going on.

You are not providing users means to skip warn_alloc() when flooding of
printk() is just a junk (i.e. the real issue is somewhere completely else)
or call warn_alloc() when all printk() messages are needed (i.e. the issue
is in the MM subsystem). Current implementation of warn_alloc() can act as
launcher of corrupted spam messages. Without comparing how situation changes
over time, via reliably saving all printk() messages which might potentially
be relevant, we can't judge whether warn_alloc() flooding is just a collateral
phenomenon of a real issue somewhere completely else.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02  7:18         ` Michal Hocko
  2017-06-02 11:13           ` Tetsuo Handa
  2017-06-02 16:59           ` Cong Wang
@ 2017-06-02 19:59           ` Andrew Morton
  2017-06-03  2:57             ` Tetsuo Handa
  2017-06-03  7:32             ` Michal Hocko
  2017-07-08  4:59           ` Tetsuo Handa
  3 siblings, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2017-06-02 19:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka

On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > > >
> > > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > > on the vanilla up-to-date kernel?
> > > > 
> > > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > > enterprise distributions would choose for their next long term supported
> > > > version.
> > > > 
> > > > And please stop saying "can you reproduce your problem with latest
> > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > > up-to-date kernel!
> > > 
> > > The changelog mentioned that the source of stalls is not clear so this
> > > might be out-of-tree patches doing something wrong and dump_stack
> > > showing up just because it is called often. This wouldn't be the first
> > > time I have seen something like that. I am not really keen on adding
> > > heavy lifting for something that is not clearly debugged and based on
> > > hand waving and speculations.
> > 
> > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > output from concurrent calls getting all jumbled together?
> 
> dump_stack already serializes concurrent calls.

Sure.  But warn_alloc() doesn't.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 11:13           ` Tetsuo Handa
  2017-06-02 12:15             ` Michal Hocko
@ 2017-06-02 21:57             ` Cong Wang
  2017-06-04  8:58               ` Tetsuo Handa
  1 sibling, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-06-02 21:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Fri, Jun 2, 2017 at 4:13 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Since the server used in that report is Dell Inc. PowerEdge C6220/03C9JJ,
> I estimate that the total CPUs installed is 12 cores * 2 slots = 24 CPUs.
> (I can confirm that at least 21 CPUs are recognized from "CPU: 20" output.)


Here is the lscpu output in case it is useful for you to debug:

$ lscpu -ae
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0       yes
1   0    0      1    1:1:1:0       yes
2   0    0      2    2:2:2:0       yes
3   0    0      3    3:3:3:0       yes
4   0    0      4    4:4:4:0       yes
5   0    0      5    5:5:5:0       yes
6   1    1      6    6:6:6:1       yes
7   1    1      7    7:7:7:1       yes
8   1    1      8    8:8:8:1       yes
9   1    1      9    9:9:9:1       yes
10  1    1      10   10:10:10:1    yes
11  1    1      11   11:11:11:1    yes
12  0    0      0    0:0:0:0       yes
13  0    0      1    1:1:1:0       yes
14  0    0      2    2:2:2:0       yes
15  0    0      3    3:3:3:0       yes
16  0    0      4    4:4:4:0       yes
17  0    0      5    5:5:5:0       yes
18  1    1      6    6:6:6:1       yes
19  1    1      7    7:7:7:1       yes
20  1    1      8    8:8:8:1       yes
21  1    1      9    9:9:9:1       yes
22  1    1      10   10:10:10:1    yes
23  1    1      11   11:11:11:1    yes


> Since Cong was trying to run memcg stress test with 150 memcg groups, I
> estimate that there were 150 threads running. This means that the system
> might have been put under memory pressure where total number of threads
> looping inside the page allocator dominates total number of available CPUs.
> Since Cong assigned 0.5GB memory limit on each memcg group on a server
> which has 64GB of memory, I estimate that the system might experience
> non-memcg OOM due to 150 * 0.5G > 64G.

Just FYI: it is not us who picks those numbers, they are in the LTP test
memcg_stress_test.sh.


>
> Then, this situation resembles my testcase where I put the system under
> memory pressure where total number of threads looping inside the page
> allocator (about one hundred or so) dominates total number of available
> CPUs (i.e. 4) on a system with only 2GB or 4GB RAM (and no swap partition).
> What I can observe is that all CPU times are spent for pointless busy loop
> inside __alloc_pages_slowpath() and makes the OOM killer unable to send
> SIGKILL, and the system stalls to the level where printk() flooding happens
> due to uncontrolled concurrent warn_alloc().
>
> And your response is always "Your system is already DOSed. I don't make
> changes for such system at all." and the situation remained unchanged
> until this moment (i.e. Cong's report). It is possible that
>
>   (a) Cong is unintentionally making the system under DOS like
>       my non-memcg testcase while Cong just wanted to test memcg
>
> but it is also possible that
>
>   (b) Cong is reporting an unnoticed bug in the MM subsystem
>

I suppose so when I report the warning, unless commit
63f53dea0c9866e93802d50a230c460a024 is a false alarm. ;)

If I understand that commit correctly, it warns that we spend too
much time on retrying and make no progress on the mm allocator
slow path, which clearly indicates some problem.

But I thought it is obvious we should OOM instead of hanging
somewhere in this situation? (My mm knowledge is very limited.)


> as well as
>
>   (c) Cong is reporting a bug which does not exist in the latest
>       linux-next kernel
>

As I already mentioned in my original report, I know there are at least
two similar warnings reported before:

https://lkml.org/lkml/2016/12/13/529
https://bugzilla.kernel.org/show_bug.cgi?id=192981

I don't see any fix, nor I see they are similar to mine.


> and you are suspecting only (c) without providing a mechanism for
> checking (a) and (b). kmallocwd helps users to check (a) and (b)
> whereas printk() flooding due to uncontrolled concurrent warn_alloc()
> prevents users from checking (a) and (b). This is really bad.
>
>>
>>> What we should do is to yield CPU time to operations which might do useful
>>> things (let threads not doing memory allocation; e.g. let printk kernel
>>> threads to flush pending buffer, let console drivers write the output to
>>> consoles, let watchdog kernel threads report what is happening).
>>
>> yes we call that preemptive kernel...
>>
>
> And the page allocator is not preemptive. It does not yield enough CPU
> time for other threads to do potentially useful things, allowing (a) to
> happen.
>
>>> When memory allocation request is stalling, serialization via waiting
>>> for a lock does help.
>>
>> Which will mean that those unlucky ones which stall will stall even more
>> because they will wait on a lock with potentially many others. While
>> this certainly is a throttling mechanism it is also a big hammer.
>
> According to my testing, the cause of stalls with flooding of printk() from
> warn_alloc() is exactly the lack of enough CPU time because the page
> allocator continues busy looping when memory allocation is stalling.
>

In the retry loop, warn_alloc() is only called after stall is detected, not
before, therefore waiting on the mutex does not contribute to at least
the first stall.

>
>
> Andrew Morton wrote:
>> I'm thinking we should serialize warn_alloc anyway, to prevent the
>> output from concurrent calls getting all jumbled together?
>
> Yes. According to my testing, serializing warn_alloc() can not yield
> enough CPU time because warn_alloc() is called only once per 10 seconds.
> Serializing
>
> -       if (!mutex_trylock(&oom_lock)) {
> +       if (mutex_lock_killable(&oom_lock)) {
>
> in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls.
>

For this point, I am with you, it would be helpful to serialize them in
case we mix different warnings in dmesg. But you probably need to adjust
the timestamps in case waiting on the mutex contributes to the stall too?

[...]

> This result shows that the OOM killer was not able to send SIGKILL until
> I gave up waiting and pressed SysRq-i because __alloc_pages_slowpath() continued
> wasting CPU time after the OOM killer tried to start printing memory information.
> We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom().
>

Note, in my case OOM killer was probably not even invoked although
the log I captured is a complete one...


Thanks for looking into it!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 19:59           ` Andrew Morton
@ 2017-06-03  2:57             ` Tetsuo Handa
  2017-06-03  7:32             ` Michal Hocko
  1 sibling, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-03  2:57 UTC (permalink / raw)
  To: akpm, mhocko
  Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka,
	sergey.senozhatsky, pmladek, akataria, syeh, thellstrom,
	charmainel, brianp, sbhatewara

(Adding printk() and VMware folks.)

Andrew Morton wrote:
> On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote:
>
> > On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > > > >
> > > > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > > > on the vanilla up-to-date kernel?
> > > > >
> > > > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > > > enterprise distributions would choose for their next long term supported
> > > > > version.
> > > > >
> > > > > And please stop saying "can you reproduce your problem with latest
> > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > > > up-to-date kernel!
> > > >
> > > > The changelog mentioned that the source of stalls is not clear so this
> > > > might be out-of-tree patches doing something wrong and dump_stack
> > > > showing up just because it is called often. This wouldn't be the first
> > > > time I have seen something like that. I am not really keen on adding
> > > > heavy lifting for something that is not clearly debugged and based on
> > > > hand waving and speculations.
> > >
> > > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > > output from concurrent calls getting all jumbled together?
> >
> > dump_stack already serializes concurrent calls.

I don't think offloading serialization to dump_stack() is a polite behavior
when the caller can do serialization. Not only it wastes a lot of CPU time
but also just passes the DOS stress through to lower layers. printk() needs
CPU time to write to console.

If printk() flooding starts, I can observe that

   (1) timestamp in printk() output does not get incremented timely

   (2) printk() output is not written to virtual serial console timely

which makes me give up observing whether situation is changing over time.

Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-2.txt.xz :
----------------------------------------
[  123.771523] mmap-write invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  124.808940] mmap-write cpuset=/ mems_allowed=0
[  124.811595] CPU: 0 PID: 2852 Comm: mmap-write Not tainted 4.12.0-rc3-next-20170602 #99
[  124.815842] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  124.821171] Call Trace:
[  124.823106]  dump_stack+0x86/0xcf
[  124.825336]  dump_header+0x97/0x26d
[  124.827668]  ? trace_hardirqs_on+0xd/0x10
[  124.830222]  oom_kill_process+0x203/0x470
[  124.832778]  out_of_memory+0x138/0x580
[  124.835223]  __alloc_pages_slowpath+0x1100/0x11f0
[  124.838085]  __alloc_pages_nodemask+0x308/0x3c0
[  124.840850]  alloc_pages_current+0x6a/0xe0
[  124.843332]  __page_cache_alloc+0x119/0x150
[  124.845723]  filemap_fault+0x3dc/0x950
[  124.847932]  ? debug_lockdep_rcu_enabled+0x1d/0x20
[  124.850683]  ? xfs_filemap_fault+0x5b/0x180 [xfs]
[  124.853427]  ? down_read_nested+0x73/0xb0
[  124.855792]  xfs_filemap_fault+0x63/0x180 [xfs]
[  124.858327]  __do_fault+0x1e/0x140
[  124.860383]  __handle_mm_fault+0xb2c/0x1090
[  124.862760]  handle_mm_fault+0x190/0x350
[  124.865161]  __do_page_fault+0x266/0x520
[  124.867409]  do_page_fault+0x30/0x80
[  124.869846]  page_fault+0x28/0x30
[  124.871803] RIP: 0033:0x7fb997682dca
[  124.873875] RSP: 002b:0000000000777fe8 EFLAGS: 00010246
[  124.876601] RAX: 00007fb997b6e000 RBX: 0000000000000000 RCX: 00007fb997682dca
[  124.880077] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
[  124.883551] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  124.886933] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
[  124.890336] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000001
[  124.893853] Mem-Info:
[  126.408131] mmap-read: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[  126.408137] mmap-read cpuset=/ mems_allowed=0
(...snipped...)
[  350.182442] mmap-read: page allocation stalls for 230016ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[  350.182446] mmap-read cpuset=/ mems_allowed=0
[  350.182450] CPU: 0 PID: 2749 Comm: mmap-read Not tainted 4.12.0-rc3-next-20170602 #99
[  350.182451] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  350.182451] Call Trace:
[  350.182455]  dump_stack+0x86/0xcf
[  350.182457]  warn_alloc+0x114/0x1c0
[  350.182466]  __alloc_pages_slowpath+0xbb8/0x11f0
[  350.182481]  __alloc_pages_nodemask+0x308/0x3c0
[  350.182491]  alloc_pages_current+0x6a/0xe0
[  350.182495]  __page_cache_alloc+0x119/0x150
[  350.182498]  filemap_fault+0x3dc/0x950
[  350.182501]  ? debug_lockdep_rcu_enabled+0x1d/0x20
[  350.182525]  ? xfs_filemap_fault+0x5b/0x180 [xfs]
[  350.182528]  ? down_read_nested+0x73/0xb0
[  350.182550]  xfs_filemap_fault+0x63/0x180 [xfs]
[  350.182554]  __do_fault+0x1e/0x140
[  350.182557]  __handle_mm_fault+0xb2c/0x1090
[  350.182566]  handle_mm_fault+0x190/0x350
[  350.182569]  __do_page_fault+0x266/0x520
[  350.182575]  do_page_fault+0x30/0x80
[  350.182578]  page_fault+0x28/0x30
[  350.182580] RIP: 0033:0x400bfb
[  350.182581] RSP: 002b:00000000007cf570 EFLAGS: 00010207
[  350.182582] RAX: 000000000000010d RBX: 0000000000000003 RCX: 00007fb997678443
[  350.182583] RDX: 0000000000001000 RSI: 00000000006021a0 RDI: 0000000000000003
[  350.182583] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
[  350.182584] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd42a62438
[  350.182585] R13: 00007ffd42a62540 R14: 0000000000000000 R15: 0000000000000000
[  350.836483] sysrq: SysRq : Kill All Tasks
[  350.876112] cleanupd (2180) used greatest stack depth: 10240 bytes left
[  351.219055] audit: type=1131 audit(1496388867.600:97): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.219060] audit: type=1131 audit(1496388867.603:98): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.219064] audit: type=1131 audit(1496388867.607:99): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.229083] audit: type=1131 audit(1496388867.610:100): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=abrtd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.229092] audit: type=1131 audit(1496388867.613:101): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dbus comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.229095] audit: type=1131 audit(1496388867.617:102): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=polkit comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.229636] audit: type=1131 audit(1496388867.621:103): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=avahi-daemon comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.233058] audit: type=1131 audit(1496388867.624:104): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.236596] audit: type=1131 audit(1496388867.628:105): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=atd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  351.239144] audit: type=1131 audit(1496388867.630:106): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=crond comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  389.308085] active_anon:1146 inactive_anon:2777 isolated_anon:0
[  389.308085]  active_file:479 inactive_file:508 isolated_file:0
[  389.308085]  unevictable:0 dirty:0 writeback:0 unstable:0
[  389.308085]  slab_reclaimable:9536 slab_unreclaimable:15265
[  389.308085]  mapped:629 shmem:3535 pagetables:34 bounce:0
[  389.308085]  free:356689 free_pcp:596 free_cma:0
[  389.308089] Node 0 active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2516kB dirty:0kB writeback:0kB shmem:14140kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  389.308090] Node 0 DMA free:15872kB min:440kB low:548kB high:656kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  389.308093] lowmem_reserve[]: 0 1561 1561 1561
[  389.308097] Node 0 DMA32 free:1410884kB min:44612kB low:55764kB high:66916kB active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB writepending:0kB present:2080640kB managed:1599404kB mlocked:0kB slab_reclaimable:38144kB slab_unreclaimable:61028kB kernel_stack:3808kB pagetables:136kB bounce:0kB free_pcp:2384kB local_pcp:696kB free_cma:0kB
[  389.308099] lowmem_reserve[]: 0 0 0 0
[  389.308103] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB
[  389.308121] Node 0 DMA32: 975*4kB (UME) 1199*8kB (UME) 1031*16kB (UME) 973*32kB (UME) 564*64kB (UME) 303*128kB (UME) 160*256kB (UME) 70*512kB (UME) 40*1024kB (UME) 21*2048kB (M) 272*4096kB (M) = 1410884kB
[  389.308142] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  389.308143] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  389.308143] 4522 total pagecache pages
[  389.308147] 0 pages in swap cache
[  389.308148] Swap cache stats: add 0, delete 0, find 0/0
[  389.308149] Free swap  = 0kB
[  389.308149] Total swap = 0kB
[  389.308150] 524157 pages RAM
[  389.308151] 0 pages HighMem/MovableOnly
[  389.308152] 120330 pages reserved
[  389.308153] 0 pages cma reserved
[  389.308153] 0 pages hwpoisoned
[  389.308155] Out of memory: Kill process 2649 (mmap-mem) score 783 or sacrifice child
----------------------------------------
Notice the timestamp jump between [  351.239144] and [  389.308085].
There was no such silence between the two lines. Rather, increment of
timestamp obviously started delaying after printk() flooding started
(e.g. regarding line-A and line-B where timestamp delta is only a few
microseconds, line-B is written to console after magnitudes of many
seconds of silence after line-A was written to console). And the delay
of timestamp in the log (i.e. about 39 seconds in above example) is
recovered (i.e. timestamp reflects actual uptime) only after I pressed
SysRq-i in order to allow the rest of the system to use CPU time.

I don't know why such delaying occurs (problem in printk() or VMware ?) but
what I can say is that threads spinning inside __alloc_pages_slowpath() is
giving too much stress to printk() and the rest of the system.

Now, similar CPU time wasting situation is reported by Cong as soft lockup
on a physical hardware.

You say "Your system is already DOSed. You should configure your system not
to allow giving such stress to MM subsystem." and I say back "printk() is
already DOSed. You should configure MM subsystem not to allow giving such
stress to printk()."

>
> Sure.  But warn_alloc() doesn't.
>

I agree with Andrew. Callers of dump_stack() must do, if the callers are
in schedulable context, a sensible flow-control so that lower layers are
not annoyed .

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 19:59           ` Andrew Morton
  2017-06-03  2:57             ` Tetsuo Handa
@ 2017-06-03  7:32             ` Michal Hocko
  2017-06-03  8:36               ` Tetsuo Handa
  2017-06-03 13:21               ` Tetsuo Handa
  1 sibling, 2 replies; 43+ messages in thread
From: Michal Hocko @ 2017-06-03  7:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka

On Fri 02-06-17 12:59:44, Andrew Morton wrote:
> On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > > 
> > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > > > >
> > > > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > > > on the vanilla up-to-date kernel?
> > > > > 
> > > > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > > > enterprise distributions would choose for their next long term supported
> > > > > version.
> > > > > 
> > > > > And please stop saying "can you reproduce your problem with latest
> > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > > > up-to-date kernel!
> > > > 
> > > > The changelog mentioned that the source of stalls is not clear so this
> > > > might be out-of-tree patches doing something wrong and dump_stack
> > > > showing up just because it is called often. This wouldn't be the first
> > > > time I have seen something like that. I am not really keen on adding
> > > > heavy lifting for something that is not clearly debugged and based on
> > > > hand waving and speculations.
> > > 
> > > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > > output from concurrent calls getting all jumbled together?
> > 
> > dump_stack already serializes concurrent calls.
> 
> Sure.  But warn_alloc() doesn't.

I really do not see why that would be much better, really. warn_alloc is
more or less one line + dump_stack + warn_alloc_show_mem. Single line
shouldn't be a big deal even though this is a continuation line
actually. dump_stack already contains its own synchronization and the
meminfo stuff is ratelimited to one per second. So why do we exactly
wantt to put yet another lock on top? Just to stick them together? Well
is this worth a new lock dependency between memory allocation and the
whole printk stack or dump_stack? Maybe yes but this needs a much deeper
consideration.

Tetsuo is arguing that the locking will throttle warn_alloc callers and
that can help other processes to move on. I would call it papering over
a real issue which might be somewhere else and that is why I push back so
hard. The initial report is far from complete and seeing 30+ seconds
stalls without any indication that this is just a repeating stall after
10s and 20s suggests that we got stuck somewhere in the reclaim path.

Moreover let's assume that the unfair locking in dump_stack has caused
the stall. How would an warn_alloc lock help when there are other
sources of dump_stack all over the kernel?

Seriously, this whole discussion is based on hand waving. Like for
any other patches, the real issue should be debugged, explained and
discussed based on known facts, not speculations. As things stand now,
my NACK still holds. I am not going to waste my time repeating same
points all over again.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-03  7:32             ` Michal Hocko
@ 2017-06-03  8:36               ` Tetsuo Handa
  2017-06-05  7:10                 ` Sergey Senozhatsky
  2017-06-03 13:21               ` Tetsuo Handa
  1 sibling, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-03  8:36 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka,
	sergey.senozhatsky, pmladek

Michal Hocko wrote:
> On Fri 02-06-17 12:59:44, Andrew Morton wrote:
> > On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote:
> >
> > > On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > > > > >
> > > > > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > > > > on the vanilla up-to-date kernel?
> > > > > >
> > > > > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > > > > enterprise distributions would choose for their next long term supported
> > > > > > version.
> > > > > >
> > > > > > And please stop saying "can you reproduce your problem with latest
> > > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > > > > up-to-date kernel!
> > > > >
> > > > > The changelog mentioned that the source of stalls is not clear so this
> > > > > might be out-of-tree patches doing something wrong and dump_stack
> > > > > showing up just because it is called often. This wouldn't be the first
> > > > > time I have seen something like that. I am not really keen on adding
> > > > > heavy lifting for something that is not clearly debugged and based on
> > > > > hand waving and speculations.
> > > >
> > > > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > > > output from concurrent calls getting all jumbled together?
> > >
> > > dump_stack already serializes concurrent calls.
> >
> > Sure.  But warn_alloc() doesn't.
>
> I really do not see why that would be much better, really. warn_alloc is
> more or less one line + dump_stack + warn_alloc_show_mem. Single line
> shouldn't be a big deal even though this is a continuation line
> actually. dump_stack already contains its own synchronization and the
> meminfo stuff is ratelimited to one per second. So why do we exactly
> wantt to put yet another lock on top? Just to stick them together? Well
> is this worth a new lock dependency between memory allocation and the
> whole printk stack or dump_stack? Maybe yes but this needs a much deeper
> consideration.

You are completely ignoring the fact that writing to consoles needs CPU time.
My proposal is intended for not only grouping relevant lines together but also
giving logbuf readers (currently a thread which is inside console_unlock(),
which might be offloaded to a dedicated kernel thread in near future) CPU time
for writing to consoles.

>
> Tetsuo is arguing that the locking will throttle warn_alloc callers and
> that can help other processes to move on. I would call it papering over
> a real issue which might be somewhere else and that is why I push back so
> hard. The initial report is far from complete and seeing 30+ seconds
> stalls without any indication that this is just a repeating stall after
> 10s and 20s suggests that we got stuck somewhere in the reclaim path.

That timestamp jump is caused by the fact that log_buf writers are consuming
more CPU times than log_buf readers can consume. If I leave that situation
more, printk() just starts printing "** %u printk messages dropped ** " line.

There is nothing more to reclaim, allocating threads are looping with
cond_resched() and schedule_timeout_uninterruptible(1) (which effectively becomes
no-op when there are many other threads doing the same thing) only, logbuf
reader cannot use enough CPU time, and the OOM killer remains oom_lock held
(notice that this timestamp jump is between "invoked oom-killer: " line and
"Out of memory: Kill process " line) which prevents reclaiming memory.

>
> Moreover let's assume that the unfair locking in dump_stack has caused
> the stall. How would an warn_alloc lock help when there are other
> sources of dump_stack all over the kernel?

__alloc_pages_slowpath() is insane as a caller of dump_stack().

Basically __alloc_pages_slowpath() allows doing

  while (1) {
    cond_resched();
    dump_stack();
  }

because all stalling treads can call warn_alloc(). Even though we ratelimit
dump_stack() at both time_after() test and __ratelimit() test like

  while (1) {
    cond_resched();
    if (time_after(jiffies, alloc_start + stall_timeout)) {
      if (!(gfp_mask & __GFP_NOWARN) && __ratelimit(&nopage_rs)) {
        dump_stack();
      }
      stall_timeout += 10 * HZ;
    }
  }

ratelimited threads are still doing

  while (1) {
    cond_resched();
  }

which still obviously remains the source of starving CPU time for
writing to consoles.

This problem won't be solved even if logbuf reader is offloaded to
a kernel thread dedicated for printk().

>
> Seriously, this whole discussion is based on hand waving. Like for
> any other patches, the real issue should be debugged, explained and
> discussed based on known facts, not speculations. As things stand now,
> my NACK still holds. I am not going to waste my time repeating same
> points all over again.

It is not a hand waving. Doing unconstrained printk() loops (with
cond_resched() only) inside kernel is seriously broken. We have to be
careful not to allow CPU time consumption by logbuf writers (e.g.
warn_alloc() from __alloc_pages_slowpath()) because logbuf reader needs
CPU time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-03  7:32             ` Michal Hocko
  2017-06-03  8:36               ` Tetsuo Handa
@ 2017-06-03 13:21               ` Tetsuo Handa
  1 sibling, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-03 13:21 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka,
	sergey.senozhatsky, pmladek

Michal Hocko wrote:
> I really do not see why that would be much better, really. warn_alloc is
> more or less one line + dump_stack + warn_alloc_show_mem. Single line
> shouldn't be a big deal even though this is a continuation line
> actually. dump_stack already contains its own synchronization and the
> meminfo stuff is ratelimited to one per second. So why do we exactly
> wantt to put yet another lock on top? Just to stick them together? Well
> is this worth a new lock dependency between memory allocation and the
> whole printk stack or dump_stack? Maybe yes but this needs a much deeper
> consideration.

If you don't want to add a new lock dependency between memory allocation
and the whole printk stack or dump_stack, I'm glad to reuse existing lock
dependency between memory allocation and the whole printk stack or dump_stack
named oom_lock mutex.

Although oom_lock is used for serializing invocation of the OOM killer so that
we don't send SIGKILL more than needed, things done by the OOM killer inside
dump_header() with oom_lock held includes what warn_alloc() does (i.e. call
printk(), dump_stack() and show_mem()).

If we reuse oom_lock for serializing warn_alloc() calls, we can also solve a
problem that the OOM killer cannot send SIGKILL due to preempted by other
threads spinning inside __alloc_pages_slowpath() waiting for the OOM killer
to make progress. Although you said we should not abuse an unrelated lock at
http://lkml.kernel.org/r/20161212125535.GA3185@dhcp22.suse.cz , out_of_memory()
and warn_alloc() are closely related because they are called when the system
cannot allocate memory.

We need to allow users to reliably save printk() output, and serialization for
yielding enough CPU time (by reusing oom_lock mutex or adding warn_alloc_lock
mutex) is one of prerequisites for saving printk() output.

"The fact that it doesn't throttle properly means that we should tune its
parameters." does not work here. We need to avoid wasting CPU time via
effectively "while(1) cond_resched();" situation in __alloc_pages_slowpath().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02 21:57             ` Cong Wang
@ 2017-06-04  8:58               ` Tetsuo Handa
  2017-06-04 15:05                 ` Michal Hocko
  2017-06-05 18:25                 ` Cong Wang
  0 siblings, 2 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-04  8:58 UTC (permalink / raw)
  To: xiyou.wangcong
  Cc: mhocko, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

Cong Wang wrote:
> Just FYI: it is not us who picks those numbers, they are in the LTP test
> memcg_stress_test.sh.

I see. No problem as long as 0.5GB per a memcg and 150 memcg groups
parts are correct.

> > but it is also possible that
> >
> >   (b) Cong is reporting an unnoticed bug in the MM subsystem
> >
>
> I suppose so when I report the warning, unless commit
> 63f53dea0c9866e93802d50a230c460a024 is a false alarm. ;)

63f53dea0c9866e9 is really a half-baked troublesome alarm. :-(

It has failed to alarm for more than 30+ minutes (won't be fixed),
it is failing to alarm forever (not yet fixed), it has failed to
alarm __GFP_NOWARN allocations (fixed in 4.12-rc1), it is still
failing to print all possibly relevant threads (not yet fixed) etc.

I can't expect that users find more than "something went wrong" from
this alarm. You might have found a memcg related MM problem, but you
are a victim who failed to find more than "the kernel got soft lockup
due to parallel dump_stack() calls by this alarm".

>
> If I understand that commit correctly, it warns that we spend too
> much time on retrying and make no progress on the mm allocator
> slow path, which clearly indicates some problem.

Yes, but it is far from reliable, and it won't become reliable.

>
> But I thought it is obvious we should OOM instead of hanging
> somewhere in this situation? (My mm knowledge is very limited.)

Yes, I think that your system should have invoked the OOM killer,
though my MM knowledge is very limited as well.

>
>
> > as well as
> >
> >   (c) Cong is reporting a bug which does not exist in the latest
> >       linux-next kernel
> >
>
> As I already mentioned in my original report, I know there are at least
> two similar warnings reported before:
>
> https://lkml.org/lkml/2016/12/13/529
> https://bugzilla.kernel.org/show_bug.cgi?id=192981
>
> I don't see any fix, nor I see they are similar to mine.

No means for analyzing, no plan for fixing the problems.

> >>> When memory allocation request is stalling, serialization via waiting
> >>> for a lock does help.
> >>
> >> Which will mean that those unlucky ones which stall will stall even more
> >> because they will wait on a lock with potentially many others. While
> >> this certainly is a throttling mechanism it is also a big hammer.
> >
> > According to my testing, the cause of stalls with flooding of printk() from
> > warn_alloc() is exactly the lack of enough CPU time because the page
> > allocator continues busy looping when memory allocation is stalling.
> >
>
> In the retry loop, warn_alloc() is only called after stall is detected, not
> before, therefore waiting on the mutex does not contribute to at least
> the first stall.

Right, but waiting on the mutex inside warn_alloc() cannot yield enough
CPU time for allowing log_buf readers to write to consoles.

> > Andrew Morton wrote:
> >> I'm thinking we should serialize warn_alloc anyway, to prevent the
> >> output from concurrent calls getting all jumbled together?
> >
> > Yes. According to my testing, serializing warn_alloc() can not yield
> > enough CPU time because warn_alloc() is called only once per 10 seconds.
> > Serializing
> >
> > -       if (!mutex_trylock(&oom_lock)) {
> > +       if (mutex_lock_killable(&oom_lock)) {
> >
> > in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls.
> >
>
> For this point, I am with you, it would be helpful to serialize them in
> case we mix different warnings in dmesg. But you probably need to adjust
> the timestamps in case waiting on the mutex contributes to the stall too?

As long as Michal refuses serialization, we won't get helpful output.
You can retry with my kmallocwd patch shown bottom. An example output is
at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .

Of course, kmallocwd can gather only basic information. You might need to
gather more information by e.g. enabling tracepoints after analyzing basic
information.

>
> [...]
>
> > This result shows that the OOM killer was not able to send SIGKILL until
> > I gave up waiting and pressed SysRq-i because __alloc_pages_slowpath() continued
> > wasting CPU time after the OOM killer tried to start printing memory information.
> > We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom().
> >
>
> Note, in my case OOM killer was probably not even invoked although
> the log I captured is a complete one...

Since you said

  The log I sent is partial, but that is already all what we captured,
  I can't find more in kern.log due to log rotation.

you meant "the log I captured is an incomplete one", don't you?

Under memory pressure, we can't expect printk() output (as well as tracepoint
output) to be flushed to log files on a storage because writing to a file
involves memory allocation. Therefore, you can try serial console or netconsole
for saving printk() output more reliably than log files.

Some hardware uses horribly slow / unreliable serial device. In that case
netconsole would be better. If you use netconsole, you can use a utility
I wrote for receiving netconsole messages available at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/ .
This utility can concatenate pr_cont() output heavily used in MM subsystem
with one timestamp per a line.

Below is kmallocwd patch backpoated for 4.9.30 kernel from
http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.



 include/linux/gfp.h   |    8 +
 include/linux/oom.h   |    4
 include/linux/sched.h |   19 ++++
 kernel/fork.c         |    4
 kernel/hung_task.c    |  216 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c       |   10 ++
 lib/Kconfig.debug     |   24 +++++
 mm/mempool.c          |    9 +-
 mm/oom_kill.c         |    3
 mm/page_alloc.c       |   82 ++++++++++++++++++
 10 files changed, 377 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de..cd4253f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -460,6 +460,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	return __alloc_pages_node(nid, gfp_mask, order);
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+extern void start_memalloc_timer(const gfp_t gfp_mask, const int order);
+extern void stop_memalloc_timer(const gfp_t gfp_mask);
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b4e36e9..8487d1b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -79,8 +79,12 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern unsigned int out_of_memory_count;
+extern bool memalloc_maybe_stalling(void);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_memalloc_task_warning_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f425eb3..5d48ecb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1472,6 +1472,22 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct memalloc_info {
+	/* Is current thread doing (nested) memory allocation? */
+	u8 in_flight;
+	/* Watchdog kernel thread is about to report this task? */
+	bool report;
+	/* Index used for memalloc_in_flight[] counter. */
+	u8 idx;
+	/* For progress monitoring. */
+	unsigned int sequence;
+	/* Started time in jiffies as of in_flight == 1. */
+	unsigned long start;
+	/* Requested order and gfp flags as of in_flight == 1. */
+	unsigned int order;
+	gfp_t gfp;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1960,6 +1976,9 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	struct memalloc_info memalloc;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 59faac4..8c2aef2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1624,6 +1624,10 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io_avg	= 0;
 #endif
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	p->memalloc.sequence = 0;
+#endif
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 2b59c82..b6ce9a3 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,8 @@
 #include <linux/export.h>
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
+#include <linux/oom.h>
+#include <linux/console.h>
 #include <trace/events/sched.h>
 
 /*
@@ -141,6 +143,8 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 	get_task_struct(g);
 	get_task_struct(t);
 	rcu_read_unlock();
+	if (console_trylock())
+		console_unlock();
 	cond_resched();
 	rcu_read_lock();
 	can_cont = pid_alive(g) && pid_alive(t);
@@ -193,6 +197,200 @@ static long hung_timeout_jiffies(unsigned long last_checked,
 		MAX_SCHEDULE_TIMEOUT;
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_warning_secs =
+	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+
+/* Filled by is_stalling_task(), used by only khungtaskd kernel thread. */
+static struct memalloc_info memalloc;
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task:   A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+			     const unsigned long expire)
+{
+	const struct memalloc_info *m = &task->memalloc;
+
+	if (likely(!m->in_flight || !time_after_eq(expire, m->start)))
+		return false;
+	/*
+	 * start_memalloc_timer() guarantees that ->in_flight is updated after
+	 * ->start is stored.
+	 */
+	smp_rmb();
+	memalloc.sequence = m->sequence;
+	memalloc.start = m->start;
+	memalloc.order = m->order;
+	memalloc.gfp = m->gfp;
+	return time_after_eq(expire, memalloc.start);
+}
+
+/*
+ * check_memalloc_stalling_tasks - Check for memory allocation stalls.
+ *
+ * @timeout: Timeout in jiffies.
+ *
+ * Returns number of stalling tasks.
+ *
+ * This function is marked as "noinline" in order to allow inserting dynamic
+ * probes (e.g. printing more information as needed using SystemTap, calling
+ * panic() if this function returned non 0 value).
+ */
+static noinline int check_memalloc_stalling_tasks(unsigned long timeout)
+{
+	enum {
+		MEMALLOC_TYPE_STALLING,       /* Report as stalling task. */
+		MEMALLOC_TYPE_DYING,          /* Report as dying task. */
+		MEMALLOC_TYPE_EXITING,        /* Report as exiting task.*/
+		MEMALLOC_TYPE_OOM_VICTIM,     /* Report as OOM victim. */
+		MEMALLOC_TYPE_UNCONDITIONAL,  /* Report unconditionally. */
+	};
+	char buf[256];
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long expire;
+	unsigned int sigkill_pending = 0;
+	unsigned int exiting_tasks = 0;
+	unsigned int memdie_pending = 0;
+	unsigned int stalling_tasks = 0;
+
+	cond_resched();
+	now = jiffies;
+	/*
+	 * Report tasks that stalled for more than half of timeout duration
+	 * because such tasks might be correlated with tasks that already
+	 * stalled for full timeout duration.
+	 */
+	expire = now - timeout * (HZ / 2);
+	/* Count stalling tasks, dying and victim tasks. */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		bool report = false;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			report = true;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			report = true;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			report = true;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			report = true;
+			stalling_tasks++;
+		}
+		if (p->flags & PF_KSWAPD)
+			report = true;
+		p->memalloc.report = report;
+	}
+	rcu_read_unlock();
+	if (!stalling_tasks)
+		return 0;
+	cond_resched();
+	/* Report stalling tasks, dying and victim tasks. */
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	cond_resched();
+	sigkill_pending = 0;
+	exiting_tasks = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	rcu_read_lock();
+ restart_report:
+	for_each_process_thread(g, p) {
+		u8 type;
+
+		if (likely(!p->memalloc.report))
+			continue;
+		p->memalloc.report = false;
+		/* Recheck in case state changed meanwhile. */
+		type = 0;
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= (1 << MEMALLOC_TYPE_OOM_VICTIM);
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= (1 << MEMALLOC_TYPE_DYING);
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= (1 << MEMALLOC_TYPE_EXITING);
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= (1 << MEMALLOC_TYPE_STALLING);
+			stalling_tasks++;
+			snprintf(buf, sizeof(buf),
+				 " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu",
+				 memalloc.sequence, memalloc.gfp,
+				 &memalloc.gfp,
+				 memalloc.order, now - memalloc.start);
+		} else {
+			buf[0] = '\0';
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= (1 << MEMALLOC_TYPE_UNCONDITIONAL);
+		if (unlikely(!type))
+			continue;
+		/*
+		 * Victim tasks get pending SIGKILL removed before arriving at
+		 * do_exit(). Therefore, print " exiting" instead for " dying".
+		 */
+		pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n",
+			p->comm, p->pid, p->flags, p->nvcsw + p->nivcsw, buf,
+			(p->state & TASK_UNINTERRUPTIBLE) ?
+			" uninterruptible" : "",
+			(type & (1 << MEMALLOC_TYPE_EXITING)) ?
+			" exiting" : "",
+			(type & (1 << MEMALLOC_TYPE_DYING)) ? " dying" : "",
+			(type & (1 << MEMALLOC_TYPE_OOM_VICTIM)) ?
+			" victim" : "");
+		sched_show_task(p);
+		/*
+		 * Since there could be thousands of tasks to report, we always
+		 * call cond_resched() after each report, in order to avoid RCU
+		 * stalls.
+		 *
+		 * Since not yet reported tasks are marked as
+		 * p->memalloc.report == T, this loop can restart even if
+		 * "g" or "p" went away.
+		 *
+		 * TODO: Try to wait for a while (e.g. sleep until usage of
+		 * printk() buffer becomes less than 75%) in order to avoid
+		 * dropping messages.
+		 */
+		if (!rcu_lock_break(g, p))
+			goto restart_report;
+	}
+	rcu_read_unlock();
+	cond_resched();
+	/* Show memory information. (SysRq-m) */
+	show_mem(0);
+	/* Show workqueue state. */
+	show_workqueue_state();
+	/* Show lock information. (SysRq-d) */
+	debug_show_all_locks();
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	return stalling_tasks;
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
 /*
  * Process updating of timeout sysctl
  */
@@ -227,12 +425,28 @@ void reset_hung_task_detector(void)
 static int watchdog(void *dummy)
 {
 	unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	unsigned long stall_last_checked = hung_last_checked;
+#endif
 
 	set_user_nice(current, 0);
 
 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;
 		long t = hung_timeout_jiffies(hung_last_checked, timeout);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+		unsigned long timeout2 = sysctl_memalloc_task_warning_secs;
+		long t2 = hung_timeout_jiffies(stall_last_checked, timeout2);
+
+		if (t2 <= 0) {
+			if (memalloc_maybe_stalling())
+				check_memalloc_stalling_tasks(timeout2);
+			stall_last_checked = jiffies;
+			continue;
+		}
+#else
+		long t2 = t;
+#endif
 
 		if (t <= 0) {
 			if (!atomic_xchg(&reset_hung_task, 0))
@@ -240,7 +454,7 @@ static int watchdog(void *dummy)
 			hung_last_checked = jiffies;
 			continue;
 		}
-		schedule_timeout_interruptible(t);
+		schedule_timeout_interruptible(min(t, t2));
 	}
 
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c1095cd..d8ee12a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1083,6 +1083,16 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	{
+		.procname	= "memalloc_task_warning_secs",
+		.data		= &sysctl_memalloc_task_warning_secs,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dohung_task_timeout_secs,
+		.extra2		= &hung_task_timeout_max,
+	},
+#endif
 #endif
 #ifdef CONFIG_RT_MUTEXES
 	{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a6c8db1..54e8da0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -880,6 +880,30 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config DETECT_MEMALLOC_STALL_TASK
+	bool "Detect tasks stalling inside memory allocator"
+	default n
+	depends on DETECT_HUNG_TASK
+	help
+	  This option emits warning messages and traces when memory
+	  allocation requests are stalling, in order to catch unexplained
+	  hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+	int "Default timeout for stalling task detection (in seconds)"
+	depends on DETECT_MEMALLOC_STALL_TASK
+	default 60
+	help
+	  This option controls the default timeout (in seconds) used
+	  to determine when a task has become non-responsive and should
+	  be considered stalling inside memory allocator.
+
+	  It can be adjusted at runtime via the kernel.memalloc_task_warning_secs
+	  sysctl or by writing a value to
+	  /proc/sys/kernel/memalloc_task_warning_secs.
+
+	  A timeout of 0 disables the check. The default is 60 seconds.
+
 endmenu # "Debug lockups and hangs"
 
 config PANIC_ON_OOPS
diff --git a/mm/mempool.c b/mm/mempool.c
index 47a659d..8b449af 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -324,11 +324,14 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+	start_memalloc_timer(gfp_temp, -1);
 repeat_alloc:
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
-	if (likely(element != NULL))
+	if (likely(element != NULL)) {
+		stop_memalloc_timer(gfp_temp);
 		return element;
+	}
 
 	spin_lock_irqsave(&pool->lock, flags);
 	if (likely(pool->curr_nr)) {
@@ -341,6 +344,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 		 * for debugging.
 		 */
 		kmemleak_update_trace(element);
+		stop_memalloc_timer(gfp_temp);
 		return element;
 	}
 
@@ -350,13 +354,16 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 */
 	if (gfp_temp != gfp_mask) {
 		spin_unlock_irqrestore(&pool->lock, flags);
+		stop_memalloc_timer(gfp_temp);
 		gfp_temp = gfp_mask;
+		start_memalloc_timer(gfp_temp, -1);
 		goto repeat_alloc;
 	}
 
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
+		stop_memalloc_timer(gfp_temp);
 		return NULL;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d..ab46d06 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+unsigned int out_of_memory_count;
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -986,6 +988,7 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	out_of_memory_count++;
 	if (oom_killer_disabled)
 		return false;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b06fb3..de24961 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3507,8 +3507,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	enum compact_result compact_result;
 	int compaction_retries;
 	int no_progress_loops;
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+#endif
 	unsigned int cpuset_mems_cookie;
 
 	/*
@@ -3682,6 +3684,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask,
@@ -3689,6 +3692,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 			jiffies_to_msecs(jiffies-alloc_start), order);
 		stall_timeout += 10 * HZ;
 	}
+#endif
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
@@ -3741,6 +3745,76 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+
+static DEFINE_PER_CPU_ALIGNED(int, memalloc_in_flight[2]);
+static u8 memalloc_active_index; /* Either 0 or 1. */
+
+/* Called periodically with sysctl_memalloc_task_warning_secs interval. */
+bool memalloc_maybe_stalling(void)
+{
+	int cpu;
+	int sum = 0;
+	const u8 idx = memalloc_active_index ^ 1;
+
+	for_each_online_cpu(cpu)
+		sum += per_cpu(memalloc_in_flight[idx], cpu);
+	if (sum)
+		return true;
+	memalloc_active_index ^= 1;
+	return false;
+}
+
+void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	/* We don't check for stalls for !__GFP_DIRECT_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+	/* Record the beginning of memory allocation request. */
+	if (!m->in_flight) {
+		m->sequence++;
+		m->start = jiffies;
+		m->order = order;
+		m->gfp = gfp_mask;
+		m->idx = memalloc_active_index;
+		/*
+		 * is_stalling_task() depends on ->in_flight being updated
+		 * after ->start is stored.
+		 */
+		smp_wmb();
+		this_cpu_inc(memalloc_in_flight[m->idx]);
+	}
+	m->in_flight++;
+}
+
+void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !--m->in_flight)
+		this_cpu_dec(memalloc_in_flight[m->idx]);
+}
+
+static void memalloc_counter_fold(int cpu)
+{
+	int counter;
+	u8 idx;
+
+	for (idx = 0; idx < 2; idx++) {
+		counter = per_cpu(memalloc_in_flight[idx], cpu);
+		if (!counter)
+			continue;
+		this_cpu_add(memalloc_in_flight[idx], counter);
+		per_cpu(memalloc_in_flight[idx], cpu) = 0;
+	}
+}
+
+#else
+#define memalloc_counter_fold(cpu) do { } while (0)
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3825,7 +3899,9 @@ struct page *
 	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
 
+	start_memalloc_timer(alloc_mask, order);
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+	stop_memalloc_timer(alloc_mask);
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
@@ -6551,6 +6627,12 @@ static int page_alloc_cpu_notify(struct notifier_block *self,
 		 * race with what we are doing.
 		 */
 		cpu_vm_stats_fold(cpu);
+
+		/*
+		 * Zero the in-flight counters of the dead processor so that
+		 * memalloc_maybe_stalling() needs to check only online processors.
+		 */
+		memalloc_counter_fold(cpu);
 	}
 	return NOTIFY_OK;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-04  8:58               ` Tetsuo Handa
@ 2017-06-04 15:05                 ` Michal Hocko
  2017-06-04 21:43                   ` Tetsuo Handa
  2017-06-05 18:25                 ` Cong Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-04 15:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

On Sun 04-06-17 17:58:49, Tetsuo Handa wrote:
[...]
> > As I already mentioned in my original report, I know there are at least
> > two similar warnings reported before:
> >
> > https://lkml.org/lkml/2016/12/13/529
> > https://bugzilla.kernel.org/show_bug.cgi?id=192981
> >
> > I don't see any fix, nor I see they are similar to mine.
> 
> No means for analyzing, no plan for fixing the problems.

Stop this bullshit Tetsuo! Seriously, you are getting over the line!
Nobody said we do not care. In order to do something about that we need
to get further and relevant information. The first and the most
important one is whether this is reproducible with the _clean_ vanilla
kernel. If yes then reproduction steps including the system dependent
ones would help us as well. If we know that we can start building
a more comprehensive picture of what is going on. Unlike you I do not
want to jump into "this must be print" conclusion.

But stop this unjustified claims.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-04 15:05                 ` Michal Hocko
@ 2017-06-04 21:43                   ` Tetsuo Handa
  2017-06-05  5:37                     ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-04 21:43 UTC (permalink / raw)
  To: mhocko
  Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Sun 04-06-17 17:58:49, Tetsuo Handa wrote:
> [...]
> > > As I already mentioned in my original report, I know there are at least
> > > two similar warnings reported before:
> > >
> > > https://lkml.org/lkml/2016/12/13/529
> > > https://bugzilla.kernel.org/show_bug.cgi?id=192981
> > >
> > > I don't see any fix, nor I see they are similar to mine.
> > 
> > No means for analyzing, no plan for fixing the problems.
> 
> Stop this bullshit Tetsuo! Seriously, you are getting over the line!
> Nobody said we do not care. In order to do something about that we need
> to get further and relevant information.

What I'm asking for is the method for getting further and relevant
information. And I get no positive feedback nor usable alternatives.

>                                          The first and the most
> important one is whether this is reproducible with the _clean_ vanilla
> kernel.

At this point, distribution kernel users won't get any help from community,
nor distribution kernel users won't be able to help community.

Even more, you are asking that whether this is reproducible with the clean
_latest_ (linux-next.git or at least linux.git) vanilla kernel. Therefore,
only quite few kernel developers can involve this problem, for not everybody
is good at establishing environments / steps for reproducing this problem.
It makes getting feedback even more difficult.

According to your LSFMM session ( https://lwn.net/Articles/718212/ ),
you are worrying about out of reviewers. But it seems to me that your
orientation keeps the gap between developers and users wider; only
experienced developers like you know almost all things, all others will
know almost nothing.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-04 21:43                   ` Tetsuo Handa
@ 2017-06-05  5:37                     ` Michal Hocko
  2017-06-05 18:15                       ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-05  5:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

On Mon 05-06-17 06:43:04, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> >                                          The first and the most
> > important one is whether this is reproducible with the _clean_ vanilla
> > kernel.
> 
> At this point, distribution kernel users won't get any help from community,
> nor distribution kernel users won't be able to help community.

Running a distribution kernel is at risk that obscure bugs (like this
one) will be asked to be reproduced on the vanilla kernel. I work to
support a distribution kernel as well and I can tell you that I always
do my best reproducing or at least pinpointing the issue before
reporting it upstream. People working on the upstream kernel are quite
busy and _demanding_ a support for something that should come from their
vendor is a bit to much.

> Even more, you are asking that whether this is reproducible with the clean
> _latest_ (linux-next.git or at least linux.git) vanilla kernel. Therefore,
> only quite few kernel developers can involve this problem, for not everybody
> is good at establishing environments / steps for reproducing this problem.
> It makes getting feedback even more difficult.

Come on. This is clearly an artificial test executed pretty much
intentionally with a known setup.
 
> According to your LSFMM session ( https://lwn.net/Articles/718212/ ),
> you are worrying about out of reviewers. But it seems to me that your
> orientation keeps the gap between developers and users wider; only
> experienced developers like you know almost all things, all others will
> know almost nothing.

I am really tired of your constant accusations. I think I have a pretty
good track record in trying to help with as many bugs reported as
possible. You keep hammering one particular part of the kernel and
consider it alpha and omega of everything. I disagree because I would
rather concentrate on something that actually matters in day to day user
workloads. I have already tried to explain that to you several times but
you just seem to ignore that so I will not waste my time more...
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-03  8:36               ` Tetsuo Handa
@ 2017-06-05  7:10                 ` Sergey Senozhatsky
  2017-06-05  9:36                   ` Sergey Senozhatsky
  0 siblings, 1 reply; 43+ messages in thread
From: Sergey Senozhatsky @ 2017-06-05  7:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka, sergey.senozhatsky, pmladek

Hello,

On (06/03/17 17:36), Tetsuo Handa wrote:
[..]
> > Tetsuo is arguing that the locking will throttle warn_alloc callers and
> > that can help other processes to move on. I would call it papering over
> > a real issue which might be somewhere else and that is why I push back so
> > hard. The initial report is far from complete and seeing 30+ seconds
> > stalls without any indication that this is just a repeating stall after
> > 10s and 20s suggests that we got stuck somewhere in the reclaim path.
> 
> That timestamp jump is caused by the fact that log_buf writers are consuming
> more CPU times than log_buf readers can consume. If I leave that situation
> more, printk() just starts printing "** %u printk messages dropped ** " line.

hhmm... sorry, not sure I see how printk() would affect timer ticks. unless
you do printing from timer IRQs, or always in deferred printk() mode, which
runs from timer IRQ... timestamps are assigned at the moment we add a new
message to the logbuf, not when we print it. so slow serial console really
should not affect it. unless I'm missing something.

	vprintk_emit()
	{
		logbuf_lock_irqsave(flags);

		log_output(facility, level, lflags, dict, dictlen, text, text_len);
		{
			log_store()
			{
				msg->ts_nsec = local_clock();
						^^^^^^^^^^^^^
			}
		}

		logbuf_unlock_irqrestore(flags);

		if (console_trylock())
			console_unlock();
	}


I don't think vprintk_emit() was spinning on logbuf_lock_irqsave(),
you would have seen spinlock lockup reports otherwise. in console_unlock()
logbuf lock is acquired only to pick the first pending messages and,
basically, do memcpy() to a static buffer. we don't call "slow console
drivers" with the logbuf lock taken. so other CPUs are free/welcome to
append new messages to the logbuf in the meantime (and read accurate
local_clock()).

so if you see spikes in messages' timestamps it's most likely because
there was something between printk() calls that kept the CPU busy.

/* or you had a ton of printk calls from other CPUs with noisy loglevels
   that were suppressed later in console_unlock(), see later. */

... well, serial consoles can be slow, sure.


> Notice the timestamp jump between [  351.239144] and [  389.308085].

do you have a restrictive console loglevel and a ton of messages that
were simply filtered out by console loglevel check? we still store those
messages to the logbuf (for dmesg, etc.) and process them in console_unlock(),
but don't print to the serial console. so, in other words,

logbuf:

	timestamp T0	message M0		-- visible loglevel
	timestamp T1	message M1  		-- suppressed loglevel
	....
	timestamp T100	message M101		-- suppressed loglevel
	timestamp T101	message M102		-- visible loglevel

on the serial console you'll see

	T0	M0
	T101	M102

which might look like a spike in timestamps (while there weren't any).
just a thought.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-05  7:10                 ` Sergey Senozhatsky
@ 2017-06-05  9:36                   ` Sergey Senozhatsky
  2017-06-05 15:02                     ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Sergey Senozhatsky @ 2017-06-05  9:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka, sergey.senozhatsky, pmladek, Sergey Senozhatsky

On (06/05/17 16:10), Sergey Senozhatsky wrote:
[..]
> > Notice the timestamp jump between [  351.239144] and [  389.308085].
> 
> do you have a restrictive console loglevel and a ton of messages that
> were simply filtered out by console loglevel check? we still store those
> messages to the logbuf (for dmesg, etc.) and process them in console_unlock(),
> but don't print to the serial console. so, in other words,
> 
> logbuf:
> 
> 	timestamp T0	message M0		-- visible loglevel
> 	timestamp T1	message M1  		-- suppressed loglevel
> 	....
> 	timestamp T100	message M101		-- suppressed loglevel
> 	timestamp T101	message M102		-- visible loglevel
> 
> on the serial console you'll see
> 
> 	T0	M0
> 	T101	M102
> 
> which might look like a spike in timestamps (while there weren't any).
> just a thought.

does it make any difference if you disable preemption in console_unlock()?
something like below... just curious...

---

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a1aecf44ab07..25fe408cb994 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2204,6 +2204,8 @@ void console_unlock(void)
 		return;
 	}
 
+	preempt_disable();
+
 	for (;;) {
 		struct printk_log *msg;
 		size_t ext_len = 0;
@@ -2260,9 +2262,6 @@ void console_unlock(void)
 		call_console_drivers(ext_text, ext_len, text, len);
 		start_critical_timings();
 		printk_safe_exit_irqrestore(flags);
-
-		if (do_cond_resched)
-			cond_resched();
 	}
 	console_locked = 0;
 
@@ -2274,6 +2273,8 @@ void console_unlock(void)
 
 	up_console_sem();
 
+	preempt_enable();
+
 	/*
 	 * Someone could have filled up the buffer again, so re-check if there's
 	 * something to flush. In case we cannot trylock the console_sem again,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-05  9:36                   ` Sergey Senozhatsky
@ 2017-06-05 15:02                     ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-05 15:02 UTC (permalink / raw)
  To: sergey.senozhatsky.work
  Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes,
	mgorman, vbabka, sergey.senozhatsky, pmladek

Thank you for explanation, Sergey.

Sergey Senozhatsky wrote:
> Hello,
>
> On (06/03/17 17:36), Tetsuo Handa wrote:
> [..]
> > > Tetsuo is arguing that the locking will throttle warn_alloc callers and
> > > that can help other processes to move on. I would call it papering over
> > > a real issue which might be somewhere else and that is why I push back so
> > > hard. The initial report is far from complete and seeing 30+ seconds
> > > stalls without any indication that this is just a repeating stall after
> > > 10s and 20s suggests that we got stuck somewhere in the reclaim path.
> >
> > That timestamp jump is caused by the fact that log_buf writers are consuming
> > more CPU times than log_buf readers can consume. If I leave that situation
> > more, printk() just starts printing "** %u printk messages dropped ** " line.
>
> hhmm... sorry, not sure I see how printk() would affect timer ticks. unless
> you do printing from timer IRQs, or always in deferred printk() mode, which
> runs from timer IRQ... timestamps are assigned at the moment we add a new
> message to the logbuf, not when we print it. so slow serial console really
> should not affect it. unless I'm missing something.

All printk() are from warn_alloc(). I retested using stop watch, and confirmed
that console is printing pending output at full speed during the timestamp jump.
Thus, it seems that this timestamp jump was caused by simply log_buf reader had
been busy, and the OOM killer processing resumed after all pending output was
consumed by log_buf reader.

> I don't think vprintk_emit() was spinning on logbuf_lock_irqsave(),
> you would have seen spinlock lockup reports otherwise. in console_unlock()
> logbuf lock is acquired only to pick the first pending messages and,
> basically, do memcpy() to a static buffer. we don't call "slow console
> drivers" with the logbuf lock taken. so other CPUs are free/welcome to
> append new messages to the logbuf in the meantime (and read accurate
> local_clock()).

Yes. The local_clock() value seems to be correct.

>
> so if you see spikes in messages' timestamps it's most likely because
> there was something between printk() calls that kept the CPU busy.
>

Flooding of warn_alloc() from __alloc_pages_slowpath() kept log_buf
reader busy enough to block resuming processing of the OOM killer.
If warn_alloc() refrained from flooding, log_buf reader will be able to
consume pending output more quickly, and we won't observe slowdown nor
timestamp jump.

> does it make any difference if you disable preemption in console_unlock()?
> something like below... just curious...

Yes, this change reduces stalls a lot. But I don't think changing printk()
side for this problem is correct.

>
> ---
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index a1aecf44ab07..25fe408cb994 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2204,6 +2204,8 @@ void console_unlock(void)
>          return;
>      }
> 
> +    preempt_disable();
> +
>      for (;;) {
>          struct printk_log *msg;
>          size_t ext_len = 0;
> @@ -2260,9 +2262,6 @@ void console_unlock(void)
>          call_console_drivers(ext_text, ext_len, text, len);
>          start_critical_timings();
>          printk_safe_exit_irqrestore(flags);
> -
> -        if (do_cond_resched)
> -            cond_resched();
>      }
>      console_locked = 0;
> 
> @@ -2274,6 +2273,8 @@ void console_unlock(void)
> 
>      up_console_sem();
> 
> +    preempt_enable();
> +
>      /*
>       * Someone could have filled up the buffer again, so re-check if there's
>       * something to flush. In case we cannot trylock the console_sem again,

This change is a subset of enclosing whole oom_kill_process() steps
with preempt_disable()/preempt_enable(), which was already rejected.

Regarding the OOM killer preempted by console_unlock() from printk()
problem, it will be mitigated by offloading to the printk kernel thread.
But offloading solves only the OOM killer preempted by console_unlock()
case. The OOM killer can still be preempted by schedule_timeout_killable(1).

Also, Cong Wang's case was (presumably) unable to invoke the OOM killer case.
When the OOM killer can make forward progress, accelerating log_buf readers
instead of throttling log_buf writers might make sense. But when the OOM killer
cannot make forward progress, we need to make sure that log_buf writer (i.e.
warn_alloc()) is slower than log_buf reader (i.e. printk kernel thread).

So, coming back to warn_alloc(). I don't think that current ratelimit
choice is enough to give log_buf reader enough CPU time. Users won't be
happy with randomly filtered, otherwise flooded/mixed warn_alloc() output.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-05  5:37                     ` Michal Hocko
@ 2017-06-05 18:15                       ` Cong Wang
  2017-06-06  9:17                         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-06-05 18:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Andrew Morton, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Sun, Jun 4, 2017 at 10:37 PM, Michal Hocko <mhocko@suse.com> wrote:
> Running a distribution kernel is at risk that obscure bugs (like this
> one) will be asked to be reproduced on the vanilla kernel. I work to
> support a distribution kernel as well and I can tell you that I always
> do my best reproducing or at least pinpointing the issue before
> reporting it upstream. People working on the upstream kernel are quite
> busy and _demanding_ a support for something that should come from their
> vendor is a bit to much.

I understand that. As I already explained, our kernel has _zero_ code that
is not in upstream, it is just 4.9.23 plus some non-mm backports from latest.

So my question is, is there any fix you believe that is relevant in linux-next
but not in 4.9.23? We definitely can try to backport it too. I have checked
the changelog since 4.9 and don't find anything obviously relevant.

Meanwhile, I will try to run this LTP test repeatly to see if there is any luck.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-04  8:58               ` Tetsuo Handa
  2017-06-04 15:05                 ` Michal Hocko
@ 2017-06-05 18:25                 ` Cong Wang
  2017-06-22 10:35                   ` Tetsuo Handa
  1 sibling, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-06-05 18:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> You can retry with my kmallocwd patch shown bottom. An example output is
> at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
>
> Of course, kmallocwd can gather only basic information. You might need to
> gather more information by e.g. enabling tracepoints after analyzing basic
> information.

Sure, since it is a debugging patch we definitely can try it.


> Since you said
>
>   The log I sent is partial, but that is already all what we captured,
>   I can't find more in kern.log due to log rotation.
>
> you meant "the log I captured is an incomplete one", don't you?

Right, sorry for my stupid typo.

>
> Below is kmallocwd patch backpoated for 4.9.30 kernel from
> http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.

Will do. But can't guarantee that we can reproduce it. ;)

Thanks a lot!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-05 18:15                       ` Cong Wang
@ 2017-06-06  9:17                         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-06-06  9:17 UTC (permalink / raw)
  To: Cong Wang
  Cc: Tetsuo Handa, Andrew Morton, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Mon 05-06-17 11:15:19, Cong Wang wrote:
> On Sun, Jun 4, 2017 at 10:37 PM, Michal Hocko <mhocko@suse.com> wrote:
> > Running a distribution kernel is at risk that obscure bugs (like this
> > one) will be asked to be reproduced on the vanilla kernel. I work to
> > support a distribution kernel as well and I can tell you that I always
> > do my best reproducing or at least pinpointing the issue before
> > reporting it upstream. People working on the upstream kernel are quite
> > busy and _demanding_ a support for something that should come from their
> > vendor is a bit to much.
> 
> I understand that. As I already explained, our kernel has _zero_ code that
> is not in upstream, it is just 4.9.23 plus some non-mm backports from latest.

Yes I understand that. And as this is quite an obscure bug I think it
would be safer to either run with a clean 4.9 stable or the current
linus tree. The later would be preferable because there are some changes
in the mm proper which might or might not be related (e.g. aa187507ef8bb
which throttles a rather talkative warn_alloc_show_mem).

> So my question is, is there any fix you believe that is relevant in linux-next
> but not in 4.9.23? We definitely can try to backport it too. I have checked
> the changelog since 4.9 and don't find anything obviously relevant.
>
> Meanwhile, I will try to run this LTP test repeatly to see if there is any luck.

Thanks!

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-05 18:25                 ` Cong Wang
@ 2017-06-22 10:35                   ` Tetsuo Handa
  2017-06-22 22:53                     ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-22 10:35 UTC (permalink / raw)
  To: xiyou.wangcong
  Cc: mhocko, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

Cong Wang wrote:
> On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > You can retry with my kmallocwd patch shown bottom. An example output is
> > at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
> >
> > Of course, kmallocwd can gather only basic information. You might need to
> > gather more information by e.g. enabling tracepoints after analyzing basic
> > information.
> 
> Sure, since it is a debugging patch we definitely can try it.
> 
> >
> > Below is kmallocwd patch backpoated for 4.9.30 kernel from
> > http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> > Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.
> 
> Will do. But can't guarantee that we can reproduce it. ;)
> 

Did you get a chance to try reproducing it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-22 10:35                   ` Tetsuo Handa
@ 2017-06-22 22:53                     ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-06-22 22:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen,
	Johannes Weiner, Mel Gorman, vbabka

On Thu, Jun 22, 2017 at 3:35 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Cong Wang wrote:
>> On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> > You can retry with my kmallocwd patch shown bottom. An example output is
>> > at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
>> >
>> > Of course, kmallocwd can gather only basic information. You might need to
>> > gather more information by e.g. enabling tracepoints after analyzing basic
>> > information.
>>
>> Sure, since it is a debugging patch we definitely can try it.
>>
>> >
>> > Below is kmallocwd patch backpoated for 4.9.30 kernel from
>> > http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
>> > Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.
>>
>> Will do. But can't guarantee that we can reproduce it. ;)
>>
>
> Did you get a chance to try reproducing it?

Not yet, I plan to apply your patch to our next kernel release but it
doesn't happen yet. ;) I will let you know if I have any update.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-06-02  7:18         ` Michal Hocko
                             ` (2 preceding siblings ...)
  2017-06-02 19:59           ` Andrew Morton
@ 2017-07-08  4:59           ` Tetsuo Handa
  2017-07-10 13:21             ` Michal Hocko
  3 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-08  4:59 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > > > >
> > > > > This seems to be on an old and not pristine kernel. Does it happen also
> > > > > on the vanilla up-to-date kernel?
> > > > 
> > > > 4.9 is not an old kernel! It might be close to the kernel version which
> > > > enterprise distributions would choose for their next long term supported
> > > > version.
> > > > 
> > > > And please stop saying "can you reproduce your problem with latest
> > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > > > up-to-date kernel!
> > > 
> > > The changelog mentioned that the source of stalls is not clear so this
> > > might be out-of-tree patches doing something wrong and dump_stack
> > > showing up just because it is called often. This wouldn't be the first
> > > time I have seen something like that. I am not really keen on adding
> > > heavy lifting for something that is not clearly debugged and based on
> > > hand waving and speculations.
> > 
> > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > output from concurrent calls getting all jumbled together?
> 
> dump_stack already serializes concurrent calls.
> 
> > I'm not sure I buy the "this isn't a mainline kernel" thing. 
> 
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is 
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is a LTP test.
> 
> > warn_alloc() obviously isn't very robust, but we'd prefer that it be
> > robust to peculiar situations, wild-n-wacky kernel patches, etc.  It's
> > a low-level thing and it should Just Work.
> 
> Yes I would agree and if we have an evidence that warn_alloc is really
> the problem then I am all for fixing it. There is no such evidence yet.
> Note that dump_stack serialization might be unfair because there is no
> queuing. Is it possible that this is the problem? If yes we should
> rather fix that because that is arguably even more low-level routine than
> warn_alloc.
> 
> That being said. I strongly believe that this patch is not properly
> justified, issue fully understood and as such a disagree with adding a
> new lock on those grounds.
> 
> Until the above is resolved
> Nacked-by: Michal Hocko <mhocko@suse.com>

Apart from what happened to Cong Wang's case, I'm really bothered by jumbled
messages caused by concurrent warn_alloc() calls. My test case is using
linux-next without any out-of-tree patches. Thus, adding a new lock on those
grounds should be acceptable.

Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > You are rejecting serialization under OOM without giving a chance to test
> > side effects of serialization under OOM at linux-next.git. I call such attitude
> > "speculation" which you never accept.
> 
> No I am rejecting abusing the lock for purpose it is not aimed for.

Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
jumbled, there should be no reason you reject this lock.

If you don't like locks, can you instead accept below one?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..3ac382c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3900,9 +3900,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		static bool wait;
+
+		while (cmpxchg(&wait, false, true))
+			schedule_timeout_uninterruptible(1);
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
+		wait = false;
 		stall_timeout += 10 * HZ;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-08  4:59           ` Tetsuo Handa
@ 2017-07-10 13:21             ` Michal Hocko
  2017-07-10 13:54               ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-07-10 13:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
[...]
> Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> Michal Hocko wrote:
> > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > You are rejecting serialization under OOM without giving a chance to test
> > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > "speculation" which you never accept.
> > 
> > No I am rejecting abusing the lock for purpose it is not aimed for.
> 
> Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> jumbled, there should be no reason you reject this lock.
> 
> If you don't like locks, can you instead accept below one?

No, seriously! Just think about what you are proposing. You are stalling
and now you will stall _random_ tasks even more. Some of them for
unbound amount of time because of inherent unfairness of cmpxchg.

If there is a _real_ problem it should be debugged and fixed. If this
is a limitation of what printk can handle then we should think how to
throttle it even more (e.g. does it make much sense to dump_stack when
it hasn't changed since the last time?). If this is about dump_stack
taking too long then we should look into it but we definitely should add
a more on top.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb..3ac382c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3900,9 +3900,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  
>  	/* Make sure we know about allocations which stall for too long */
>  	if (time_after(jiffies, alloc_start + stall_timeout)) {
> +		static bool wait;
> +
> +		while (cmpxchg(&wait, false, true))
> +			schedule_timeout_uninterruptible(1);
>  		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
>  			"page allocation stalls for %ums, order:%u",
>  			jiffies_to_msecs(jiffies-alloc_start), order);
> +		wait = false;
>  		stall_timeout += 10 * HZ;
>  	}
>  

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-10 13:21             ` Michal Hocko
@ 2017-07-10 13:54               ` Tetsuo Handa
  2017-07-10 14:14                 ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-10 13:54 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
> [...]
> > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> > Michal Hocko wrote:
> > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > > You are rejecting serialization under OOM without giving a chance to test
> > > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > > "speculation" which you never accept.
> > > 
> > > No I am rejecting abusing the lock for purpose it is not aimed for.
> > 
> > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> > jumbled, there should be no reason you reject this lock.
> > 
> > If you don't like locks, can you instead accept below one?
> 
> No, seriously! Just think about what you are proposing. You are stalling
> and now you will stall _random_ tasks even more. Some of them for
> unbound amount of time because of inherent unfairness of cmpxchg.

The cause of stall when oom_lock is already held is that threads which failed to
hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is
not sufficient when there are multiple threads doing the same thing, for direct
reclaim/compaction consumes a lot of CPU time.

What makes this situation worse is, since warn_alloc() periodically appends to
printk() buffer, the thread inside the OOM killer with oom_lock held can stall
forever due to cond_resched() from console_unlock() from printk().

Below change significantly reduces possibility of falling into printk() v.s. oom_lock
lockup problem, for the thread inside the OOM killer with oom_lock held no longer
blocks inside printk(). Though there still remains possibility of sleeping for
unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL) {
+		preempt_disable();
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
+		preempt_enable_no_resched();
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.

I wish we could agree with applying this patch until printk-kthread can
work reliably...

> 
> If there is a _real_ problem it should be debugged and fixed. If this
> is a limitation of what printk can handle then we should think how to
> throttle it even more (e.g. does it make much sense to dump_stack when
> it hasn't changed since the last time?). If this is about dump_stack
> taking too long then we should look into it but we definitely should add
> a more on top.

The real problem is lack of CPU time for reclaiming memory when allocating
threads failed to hold oom_lock. And you are refusing to allow allocating
threads give CPU time to the thread holding oom_lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-10 13:54               ` Tetsuo Handa
@ 2017-07-10 14:14                 ` Michal Hocko
  2017-07-11 13:10                   ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-07-10 14:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka

On Mon 10-07-17 22:54:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
> > [...]
> > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> > > Michal Hocko wrote:
> > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > > > You are rejecting serialization under OOM without giving a chance to test
> > > > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > > > "speculation" which you never accept.
> > > > 
> > > > No I am rejecting abusing the lock for purpose it is not aimed for.
> > > 
> > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> > > jumbled, there should be no reason you reject this lock.
> > > 
> > > If you don't like locks, can you instead accept below one?
> > 
> > No, seriously! Just think about what you are proposing. You are stalling
> > and now you will stall _random_ tasks even more. Some of them for
> > unbound amount of time because of inherent unfairness of cmpxchg.
> 
> The cause of stall when oom_lock is already held is that threads which failed to
> hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is
> not sufficient when there are multiple threads doing the same thing, for direct
> reclaim/compaction consumes a lot of CPU time.
> 
> What makes this situation worse is, since warn_alloc() periodically appends to
> printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> forever due to cond_resched() from console_unlock() from printk().

warn_alloc is just yet-another-user of printk. We might have many
others...
 
> Below change significantly reduces possibility of falling into printk() v.s. oom_lock
> lockup problem, for the thread inside the OOM killer with oom_lock held no longer
> blocks inside printk(). Though there still remains possibility of sleeping for
> unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.

This just papers over the real problem.

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
>  		panic("Out of memory and no killable processes...\n");
>  	}
>  	if (oc->chosen && oc->chosen != (void *)-1UL) {
> +		preempt_disable();
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				 "Memory cgroup out of memory");
> +		preempt_enable_no_resched();
>  		/*
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
> 
> I wish we could agree with applying this patch until printk-kthread can
> work reliably...

And now you have introduced soft lockups most probably because
oom_kill_process can take some time... Or maybe even sleeping while
atomic warnings if some code path needs to sleep for whatever reason.
The real fix is make sure that printk doesn't take an arbitrary amount of
time.

You are trying to hammer this particular path but you should realize
that as long as printk can take an unbound amount of time then there are
many other land mines which need fixing. It is simply not feasible to go
after each and ever one of them and try to tweak them around. So please
stop proposing these random hacks and rather try to work with prink guys
to find solution for this long term printk limitation. OOM killer is a
good usecase to give this a priority.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-10 14:14                 ` Michal Hocko
@ 2017-07-11 13:10                   ` Tetsuo Handa
  2017-07-11 13:49                     ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-11 13:10 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Mon 10-07-17 22:54:37, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
> > > [...]
> > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> > > > Michal Hocko wrote:
> > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > > > > You are rejecting serialization under OOM without giving a chance to test
> > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > > > > "speculation" which you never accept.
> > > > > 
> > > > > No I am rejecting abusing the lock for purpose it is not aimed for.
> > > > 
> > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> > > > jumbled, there should be no reason you reject this lock.
> > > > 
> > > > If you don't like locks, can you instead accept below one?
> > > 
> > > No, seriously! Just think about what you are proposing. You are stalling
> > > and now you will stall _random_ tasks even more. Some of them for
> > > unbound amount of time because of inherent unfairness of cmpxchg.
> > 
> > The cause of stall when oom_lock is already held is that threads which failed to
> > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is
> > not sufficient when there are multiple threads doing the same thing, for direct
> > reclaim/compaction consumes a lot of CPU time.
> > 
> > What makes this situation worse is, since warn_alloc() periodically appends to
> > printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> > forever due to cond_resched() from console_unlock() from printk().
> 
> warn_alloc is just yet-another-user of printk. We might have many
> others...

warn_alloc() is different from other users of printk() that printk() is called
as long as oom_lock is already held by somebody else processing console_unlock().

>  
> > Below change significantly reduces possibility of falling into printk() v.s. oom_lock
> > lockup problem, for the thread inside the OOM killer with oom_lock held no longer
> > blocks inside printk(). Though there still remains possibility of sleeping for
> > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.
> 
> This just papers over the real problem.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
> >  		panic("Out of memory and no killable processes...\n");
> >  	}
> >  	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > +		preempt_disable();
> >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> >  				 "Memory cgroup out of memory");
> > +		preempt_enable_no_resched();
> >  		/*
> >  		 * Give the killed process a good chance to exit before trying
> >  		 * to allocate memory again.
> > 
> > I wish we could agree with applying this patch until printk-kthread can
> > work reliably...
> 
> And now you have introduced soft lockups most probably because
> oom_kill_process can take some time... Or maybe even sleeping while
> atomic warnings if some code path needs to sleep for whatever reason.
> The real fix is make sure that printk doesn't take an arbitrary amount of
> time.

The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations
directly/indirectly (because it will cause recursion deadlock). Thus, even if
some code path needs to sleep for some reason, that code path is not permitted to
wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can
propose scattering preempt_disable()/preempt_enable_no_resched() around printk()
rather than whole oom_kill_process(). You will just reject it as you have rejected
in the past.

Using a reproducer and a patch to disable warn_alloc() at
http://lkml.kernel.org/r/201707082230.ECB51545.JtFFFVHOOSMLOQ@I-love.SAKURA.ne.jp ,
you will find that calling cond_resched() (from console_unlock() from printk())
can cause a delay of nearly one minute, and it can cause a delay of nearly 5 minutes
to complete one out_of_memory() call. Notice that the reproducer is using not so
insane number of threads (only 10 children). (It will become a DoS if 100 children
or 1024 children.)

----------------------------------------
[  589.570344] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[  602.792064] idle-priority cpuset=/ mems_allowed=0
[  602.794185] CPU: 0 PID: 9833 Comm: idle-priority Not tainted 4.12.0-next-20170711+ #628
[  602.796870] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  602.800102] Call Trace:
[  602.801685]  dump_stack+0x67/0x9e
[  602.803444]  dump_header+0x9d/0x3fa
[  602.805227]  ? trace_hardirqs_on+0xd/0x10
[  602.807106]  oom_kill_process+0x226/0x650
[  602.809002]  out_of_memory+0x136/0x560
[  602.810822]  ? out_of_memory+0x206/0x560
[  602.812688]  __alloc_pages_nodemask+0xcd2/0xe50
[  602.814676]  alloc_pages_vma+0x76/0x1a0
[  602.816519]  __handle_mm_fault+0xdff/0x1180
[  602.818416]  ? native_sched_clock+0x36/0xa0
[  602.820322]  handle_mm_fault+0x186/0x360
[  602.822170]  ? handle_mm_fault+0x44/0x360
[  602.824009]  __do_page_fault+0x1da/0x510
[  602.825808]  do_page_fault+0x21/0x70
[  602.827501]  page_fault+0x22/0x30
[  602.829121] RIP: 0033:0x4008b8
[  602.830657] RSP: 002b:00007ffc7f1b2070 EFLAGS: 00010206
[  602.832602] RAX: 00000000d2d29000 RBX: 0000000100000000 RCX: 00007f2a8e4debd0
[  602.834911] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004
[  602.837212] RBP: 00007f288e5ea010 R08: 0000000000000000 R09: 0000000000021000
[  602.839481] R10: 00007ffc7f1b1df0 R11: 0000000000000246 R12: 0000000000000006
[  602.841712] R13: 00007f288e5ea010 R14: 0000000000000000 R15: 0000000000000000
[  602.843964] Mem-Info:
[  660.641313] active_anon:871821 inactive_anon:4422 isolated_anon:0
[  660.641313]  active_file:0 inactive_file:1 isolated_file:0
[  660.641313]  unevictable:0 dirty:0 writeback:0 unstable:0
[  660.641313]  slab_reclaimable:0 slab_unreclaimable:0
[  660.641313]  mapped:555 shmem:6257 pagetables:3184 bounce:0
[  660.641313]  free:21377 free_pcp:188 free_cma:0
[  660.735954] Node 0 active_anon:3487284kB inactive_anon:17688kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2220kB dirty:0kB writeback:0kB shmem:25028kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3198976kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  681.835092] Node 0 DMA free:14780kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  702.374011] lowmem_reserve[]: 0 2688 3624 3624
[  707.583447] Node 0 DMA32 free:53556kB min:49908kB low:62384kB high:74860kB active_anon:2698264kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  729.520428] lowmem_reserve[]: 0 0 936 936
[  734.700444] Node 0 Normal free:17204kB min:17384kB low:21728kB high:26072kB active_anon:787928kB inactive_anon:17688kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:3072kB pagetables:12736kB bounce:0kB free_pcp:752kB local_pcp:120kB free_cma:0kB
[  760.567618] lowmem_reserve[]: 0 0 0 0
[  765.855466] Node 0 DMA: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14780kB
[  780.375573] Node 0 DMA32: 7*4kB (UM) 5*8kB (UM) 7*16kB (UM) 6*32kB (U) 5*64kB (UM) 1*128kB (M) 4*256kB (M) 5*512kB (UM) 2*1024kB (M) 1*2048kB (U) 11*4096kB (UM) = 53556kB
[  795.846001] Node 0 Normal: 103*4kB (UM) 21*8kB (UM) 90*16kB (UME) 101*32kB (ME) 43*64kB (UME) 16*128kB (UE) 6*256kB (E) 1*512kB (U) 5*1024kB (UM) 0*2048kB 0*4096kB = 17220kB
[  812.472579] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  821.375279] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  830.344875] 6258 total pagecache pages
[  836.012842] 0 pages in swap cache
[  841.786993] Swap cache stats: add 0, delete 0, find 0/0
[  848.324422] Free swap  = 0kB
[  853.617092] Total swap = 0kB
[  858.757287] 1048445 pages RAM
[  864.079999] 0 pages HighMem/MovableOnly
[  869.764325] 116531 pages reserved
[  875.162601] 0 pages hwpoisoned
[  880.350179] Out of memory: Kill process 9833 (idle-priority) score 928 or sacrifice child
[  888.751868] Killed process 9835 (normal-priority) total-vm:4360kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB
----------------------------------------

> 
> You are trying to hammer this particular path but you should realize
> that as long as printk can take an unbound amount of time then there are
> many other land mines which need fixing. It is simply not feasible to go
> after each and ever one of them and try to tweak them around. So please
> stop proposing these random hacks and rather try to work with prink guys
> to find solution for this long term printk limitation. OOM killer is a
> good usecase to give this a priority.

Whatever approach we use for printk() not to take unbound amount of time
(e.g. just enqueue to log_buf using per a thread flag), we might still take
unbound amount of time if we allow cond_sched() (or whatever sleep some
code path might need to use) with the oom_lock held. After all, the OOM killer
is ignoring scheduling priority problem regardless of printk() lockup problem.

I don't have objection about making sure that printk() doesn't take an arbitrary
amount of time. But the real fix is make sure that out_of_memory() doesn't take
an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless
there is cooperation from other allocating threads which failed to hold oom_lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-11 13:10                   ` Tetsuo Handa
@ 2017-07-11 13:49                     ` Michal Hocko
  2017-07-11 14:58                       ` Petr Mladek
  2017-07-11 22:06                       ` Tetsuo Handa
  0 siblings, 2 replies; 43+ messages in thread
From: Michal Hocko @ 2017-07-11 13:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
> > > > [...]
> > > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> > > > > Michal Hocko wrote:
> > > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > > > > > You are rejecting serialization under OOM without giving a chance to test
> > > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > > > > > "speculation" which you never accept.
> > > > > > 
> > > > > > No I am rejecting abusing the lock for purpose it is not aimed for.
> > > > > 
> > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> > > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> > > > > jumbled, there should be no reason you reject this lock.
> > > > > 
> > > > > If you don't like locks, can you instead accept below one?
> > > > 
> > > > No, seriously! Just think about what you are proposing. You are stalling
> > > > and now you will stall _random_ tasks even more. Some of them for
> > > > unbound amount of time because of inherent unfairness of cmpxchg.
> > > 
> > > The cause of stall when oom_lock is already held is that threads which failed to
> > > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is
> > > not sufficient when there are multiple threads doing the same thing, for direct
> > > reclaim/compaction consumes a lot of CPU time.
> > > 
> > > What makes this situation worse is, since warn_alloc() periodically appends to
> > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> > > forever due to cond_resched() from console_unlock() from printk().
> > 
> > warn_alloc is just yet-another-user of printk. We might have many
> > others...
> 
> warn_alloc() is different from other users of printk() that printk() is called
> as long as oom_lock is already held by somebody else processing console_unlock().

So what exactly prevents any other caller of printk interfering while
the oom is ongoing?

> 
> >  
> > > Below change significantly reduces possibility of falling into printk() v.s. oom_lock
> > > lockup problem, for the thread inside the OOM killer with oom_lock held no longer
> > > blocks inside printk(). Though there still remains possibility of sleeping for
> > > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.
> > 
> > This just papers over the real problem.
> > 
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
> > >  		panic("Out of memory and no killable processes...\n");
> > >  	}
> > >  	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		preempt_disable();
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				 "Memory cgroup out of memory");
> > > +		preempt_enable_no_resched();
> > >  		/*
> > >  		 * Give the killed process a good chance to exit before trying
> > >  		 * to allocate memory again.
> > > 
> > > I wish we could agree with applying this patch until printk-kthread can
> > > work reliably...
> > 
> > And now you have introduced soft lockups most probably because
> > oom_kill_process can take some time... Or maybe even sleeping while
> > atomic warnings if some code path needs to sleep for whatever reason.
> > The real fix is make sure that printk doesn't take an arbitrary amount of
> > time.
> 
> The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations
> directly/indirectly (because it will cause recursion deadlock). Thus, even if
> some code path needs to sleep for some reason, that code path is not permitted to
> wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can
> propose scattering preempt_disable()/preempt_enable_no_resched() around printk()
> rather than whole oom_kill_process(). You will just reject it as you have rejected
> in the past.

because you are trying to address a problem at a wrong layer. If there
is absolutely no way around it and printk is unfixable then we really
need a printk variant which will make sure that no excessive waiting
will be involved. Then we can replace all printk in the oom path with
this special printk.
 
[...]

> > You are trying to hammer this particular path but you should realize
> > that as long as printk can take an unbound amount of time then there are
> > many other land mines which need fixing. It is simply not feasible to go
> > after each and ever one of them and try to tweak them around. So please
> > stop proposing these random hacks and rather try to work with prink guys
> > to find solution for this long term printk limitation. OOM killer is a
> > good usecase to give this a priority.
> 
> Whatever approach we use for printk() not to take unbound amount of time
> (e.g. just enqueue to log_buf using per a thread flag), we might still take
> unbound amount of time if we allow cond_sched() (or whatever sleep some
> code path might need to use) with the oom_lock held. After all, the OOM killer
> is ignoring scheduling priority problem regardless of printk() lockup problem.
> 
> I don't have objection about making sure that printk() doesn't take an arbitrary
> amount of time. But the real fix is make sure that out_of_memory() doesn't take
> an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless
> there is cooperation from other allocating threads which failed to hold oom_lock.

As I've said out_of_memory is an expensive operation and as such it has
to be preemptible. Addressing this would require quite some work.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-11 13:49                     ` Michal Hocko
@ 2017-07-11 14:58                       ` Petr Mladek
  2017-07-11 22:06                       ` Tetsuo Handa
  1 sibling, 0 replies; 43+ messages in thread
From: Petr Mladek @ 2017-07-11 14:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, linux-mm, xiyou.wangcong, dave.hansen,
	hannes, mgorman, vbabka, sergey.senozhatsky.work

On Tue 2017-07-11 15:49:00, Michal Hocko wrote:
> On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote:
> > > > What makes this situation worse is, since warn_alloc() periodically appends to
> > > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> > > > forever due to cond_resched() from console_unlock() from printk().
> > > 
> > > warn_alloc is just yet-another-user of printk. We might have many
> > > others...

> because you are trying to address a problem at a wrong layer. If there
> is absolutely no way around it and printk is unfixable then we really
> need a printk variant which will make sure that no excessive waiting
> will be involved. Then we can replace all printk in the oom path with
> this special printk.

The last theory about printk offloading suggests that printk() should
always try to push some messages to the console when the console lock is
available. Otherwise, the messages might not appear at all because
the offloading is never 100% reliable, especially when the system is
in troubles.

In each case, this live-lock is another reason to risk the printk
offload at some stage.

Of course, we could make the throttling more aggressive. But it
is another complex problem. Only printk() knows how much it is
stressed and how much throttling is needed. On the other hand,
it might be hard to know what information is repeating and
who need to be throttled. It is a question if it should be
solved by providing more printk_throttle() variants or
by some magic inside normal printk().

Best Regards,
Petr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-11 13:49                     ` Michal Hocko
  2017-07-11 14:58                       ` Petr Mladek
@ 2017-07-11 22:06                       ` Tetsuo Handa
  2017-07-12  8:54                         ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-11 22:06 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
> > > > > > Michal Hocko wrote:
> > > > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > > > > > > > You are rejecting serialization under OOM without giving a chance to test
> > > > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude
> > > > > > > > "speculation" which you never accept.
> > > > > > > 
> > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for.
> > > > > > 
> > > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable?
> > > > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting
> > > > > > jumbled, there should be no reason you reject this lock.
> > > > > > 
> > > > > > If you don't like locks, can you instead accept below one?
> > > > > 
> > > > > No, seriously! Just think about what you are proposing. You are stalling
> > > > > and now you will stall _random_ tasks even more. Some of them for
> > > > > unbound amount of time because of inherent unfairness of cmpxchg.
> > > > 
> > > > The cause of stall when oom_lock is already held is that threads which failed to
> > > > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is
> > > > not sufficient when there are multiple threads doing the same thing, for direct
> > > > reclaim/compaction consumes a lot of CPU time.
> > > > 
> > > > What makes this situation worse is, since warn_alloc() periodically appends to
> > > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> > > > forever due to cond_resched() from console_unlock() from printk().
> > > 
> > > warn_alloc is just yet-another-user of printk. We might have many
> > > others...
> > 
> > warn_alloc() is different from other users of printk() that printk() is called
> > as long as oom_lock is already held by somebody else processing console_unlock().
> 
> So what exactly prevents any other caller of printk interfering while
> the oom is ongoing?

Other callers of printk() are not doing silly things like "while(1) printk();".
They don't call printk() until something completes (e.g. some operation returned
an error code) or they do throttling. Only watchdog calls printk() without waiting
for something to complete (because watchdog is there in order to warn that something
might be wrong). But watchdog is calling printk() carefully not to cause flooding
(e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls
rcu_lock_break()). As far as I can observe, only warn_alloc() for watchdog trivially
causes flooding and lockups.

> 
> > 
> > >  
> > > > Below change significantly reduces possibility of falling into printk() v.s. oom_lock
> > > > lockup problem, for the thread inside the OOM killer with oom_lock held no longer
> > > > blocks inside printk(). Though there still remains possibility of sleeping for
> > > > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.
> > > 
> > > This just papers over the real problem.
> > > 
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
> > > >  		panic("Out of memory and no killable processes...\n");
> > > >  	}
> > > >  	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > > +		preempt_disable();
> > > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > > >  				 "Memory cgroup out of memory");
> > > > +		preempt_enable_no_resched();
> > > >  		/*
> > > >  		 * Give the killed process a good chance to exit before trying
> > > >  		 * to allocate memory again.
> > > > 
> > > > I wish we could agree with applying this patch until printk-kthread can
> > > > work reliably...
> > > 
> > > And now you have introduced soft lockups most probably because
> > > oom_kill_process can take some time... Or maybe even sleeping while
> > > atomic warnings if some code path needs to sleep for whatever reason.
> > > The real fix is make sure that printk doesn't take an arbitrary amount of
> > > time.
> > 
> > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations
> > directly/indirectly (because it will cause recursion deadlock). Thus, even if
> > some code path needs to sleep for some reason, that code path is not permitted to
> > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can
> > propose scattering preempt_disable()/preempt_enable_no_resched() around printk()
> > rather than whole oom_kill_process(). You will just reject it as you have rejected
> > in the past.
> 
> because you are trying to address a problem at a wrong layer. If there
> is absolutely no way around it and printk is unfixable then we really
> need a printk variant which will make sure that no excessive waiting
> will be involved. Then we can replace all printk in the oom path with
> this special printk.

Writing data faster than readers can read is wrong, especially when
writers deprive readers of CPU time to read.

>  
> [...]
> 
> > > You are trying to hammer this particular path but you should realize
> > > that as long as printk can take an unbound amount of time then there are
> > > many other land mines which need fixing. It is simply not feasible to go
> > > after each and ever one of them and try to tweak them around. So please
> > > stop proposing these random hacks and rather try to work with prink guys
> > > to find solution for this long term printk limitation. OOM killer is a
> > > good usecase to give this a priority.
> > 
> > Whatever approach we use for printk() not to take unbound amount of time
> > (e.g. just enqueue to log_buf using per a thread flag), we might still take
> > unbound amount of time if we allow cond_sched() (or whatever sleep some
> > code path might need to use) with the oom_lock held. After all, the OOM killer
> > is ignoring scheduling priority problem regardless of printk() lockup problem.
> > 
> > I don't have objection about making sure that printk() doesn't take an arbitrary
> > amount of time. But the real fix is make sure that out_of_memory() doesn't take
> > an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless
> > there is cooperation from other allocating threads which failed to hold oom_lock.
> 
> As I've said out_of_memory is an expensive operation and as such it has
> to be preemptible. Addressing this would require quite some work.

But calling out_of_memory() with SCHED_IDLE priority makes overall allocations
far more expensive. If you want to keep out_of_memory() preemptible, you should
make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to
a dedicated kernel thread like oom_reaper will do it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-11 22:06                       ` Tetsuo Handa
@ 2017-07-12  8:54                         ` Michal Hocko
  2017-07-12 12:23                           ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-07-12  8:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

On Wed 12-07-17 07:06:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
[...]
> > > > warn_alloc is just yet-another-user of printk. We might have many
> > > > others...
> > > 
> > > warn_alloc() is different from other users of printk() that printk() is called
> > > as long as oom_lock is already held by somebody else processing console_unlock().
> > 
> > So what exactly prevents any other caller of printk interfering while
> > the oom is ongoing?
> 
> Other callers of printk() are not doing silly things like "while(1) printk();".

They can still print a lot. There have been reports of one printk source
pushing an unrelated context to print way too much.

> They don't call printk() until something completes (e.g. some operation returned
> an error code) or they do throttling. Only watchdog calls printk() without waiting
> for something to complete (because watchdog is there in order to warn that something
> might be wrong). But watchdog is calling printk() carefully not to cause flooding
> (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls
> rcu_lock_break()).

Look at hard/soft lockup detector and how it can cause flood of printks.

> As far as I can observe, only warn_alloc() for watchdog trivially
> causes flooding and lockups.

warn_alloc prints a single line + dump_stack for each stalling allocation and
show_mem once per second. That doesn't sound overly crazy to me.
Sure we can have many stalling tasks under certain conditions (most of
them quite unrealistic) and then we can print a lot. I do not see an
easy way out of it without losing information about stalls and I guess
we want to know about them otherwise we will have much harder time to
debug stalls.

Sure we can tune this a bit and e.g. do not dump stacks of tasks which
have already printed their backtrace as it couldn't have changed.  But
this doesn't change anything in principle.

[...]
> > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations
> > > directly/indirectly (because it will cause recursion deadlock). Thus, even if
> > > some code path needs to sleep for some reason, that code path is not permitted to
> > > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can
> > > propose scattering preempt_disable()/preempt_enable_no_resched() around printk()
> > > rather than whole oom_kill_process(). You will just reject it as you have rejected
> > > in the past.
> > 
> > because you are trying to address a problem at a wrong layer. If there
> > is absolutely no way around it and printk is unfixable then we really
> > need a printk variant which will make sure that no excessive waiting
> > will be involved. Then we can replace all printk in the oom path with
> > this special printk.
> 
> Writing data faster than readers can read is wrong, especially when
> writers deprive readers of CPU time to read.

Yes this is not good but only printk knows the congestion.

[...]
> > As I've said out_of_memory is an expensive operation and as such it has
> > to be preemptible. Addressing this would require quite some work.
> 
> But calling out_of_memory() with SCHED_IDLE priority makes overall allocations
> far more expensive. If you want to keep out_of_memory() preemptible, you should
> make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to
> a dedicated kernel thread like oom_reaper will do it.

You do realize that the whole page allocator is not priority aware and
a low priority task can starve a higher priority task already in the
reclaim path. Is this ideal? Absolutely no but let's be realistic, this
has never been a priority and it would require a lot of heavy lifting.
The OOM is the most cold path in the whole allocation stack and focusing
solely on it while claiming something take a minute or two longer is
just not going to attract a lot of attention.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-12  8:54                         ` Michal Hocko
@ 2017-07-12 12:23                           ` Tetsuo Handa
  2017-07-12 12:41                             ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-12 12:23 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Wed 12-07-17 07:06:11, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> [...]
> > > > > warn_alloc is just yet-another-user of printk. We might have many
> > > > > others...
> > > >
> > > > warn_alloc() is different from other users of printk() that printk() is called
> > > > as long as oom_lock is already held by somebody else processing console_unlock().
> > >
> > > So what exactly prevents any other caller of printk interfering while
> > > the oom is ongoing?
> >
> > Other callers of printk() are not doing silly things like "while(1) printk();".
>
> They can still print a lot. There have been reports of one printk source
> pushing an unrelated context to print way too much.

Which source is that?

Legitimate printk() users might do

  for (i = 0; i < 1000; i++)
    printk();

but they do not do

  while (1)
    for (i = 0; i < 1000; i++)
      printk();

.

>
> > They don't call printk() until something completes (e.g. some operation returned
> > an error code) or they do throttling. Only watchdog calls printk() without waiting
> > for something to complete (because watchdog is there in order to warn that something
> > might be wrong). But watchdog is calling printk() carefully not to cause flooding
> > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls
> > rcu_lock_break()).
>
> Look at hard/soft lockup detector and how it can cause flood of printks.

Lockup detector is legitimate because it is there to warn that somebody is
continuously consuming CPU time. Lockup detector might do

  for (i = 0; i < 1000; i++)
    printk();

but does not do

  while (1)
    for (i = 0; i < 1000; i++)
      printk();

because lockup detector waits for enough interval.

  while (1) {
    for (i = 0; i < 1000; i++)
      printk();
    schedule_timeout_killable(HZ * 60);
  }

>
> > As far as I can observe, only warn_alloc() for watchdog trivially
> > causes flooding and lockups.
>
> warn_alloc prints a single line + dump_stack for each stalling allocation and
> show_mem once per second. That doesn't sound overly crazy to me.
> Sure we can have many stalling tasks under certain conditions (most of
> them quite unrealistic) and then we can print a lot. I do not see an
> easy way out of it without losing information about stalls and I guess
> we want to know about them otherwise we will have much harder time to
> debug stalls.

Printing just one line per every second can lead to lockup, for
the condition to escape the "for (;;)" loop in console_unlock() is

                if (console_seq == log_next_seq)
                        break;

when cond_resched() in that loop slept for more than one second due to
SCHED_IDLE priority.

Currently preempt_disable()/preempt_enable_no_resched() (or equivalent)
is the only available countermeasure for minimizing interference like

    for (i = 0; i < 1000; i++)
      printk();

. If prink() allows per printk context (shown below) flag which allows printk()
users to force printk() not to try to print immediately (i.e. declare that
use deferred printing (maybe offloaded to the printk-kthread)), lockups by
cond_resched() from console_unlock() from printk() from out_of_memory() will be
avoided.

----------
static unsigned long printk_context(void)
{
    /*
     * Assume that we can use lower 2 bits for flags, as with
     * __mutex_owner() does.
     */
    unsigned long context = (unsigned long) current;

    /* Both bits set means processing NMI context. */
    if (in_nmi())
        context |= 3;
    /* Only next-LSB set means processing hard IRQ context. */
    else if (in_irq())
        context |= 2;
    /* Only LSB set means processing soft IRQ context. */
    else if (in_serving_softirq())
        context |= 1;
    /*
     * Neither bits set means processing task context,
     * though still might be non sleepable context.
     */
    return context;
}
----------

Of course given that such flag is introduced and you accept setting/clearing
such flag inside out_of_memory()...

>
> Sure we can tune this a bit and e.g. do not dump stacks of tasks which
> have already printed their backtrace as it couldn't have changed.  But
> this doesn't change anything in principle.
>
> [...]
> > > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations
> > > > directly/indirectly (because it will cause recursion deadlock). Thus, even if
> > > > some code path needs to sleep for some reason, that code path is not permitted to
> > > > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can
> > > > propose scattering preempt_disable()/preempt_enable_no_resched() around printk()
> > > > rather than whole oom_kill_process(). You will just reject it as you have rejected
> > > > in the past.
> > >
> > > because you are trying to address a problem at a wrong layer. If there
> > > is absolutely no way around it and printk is unfixable then we really
> > > need a printk variant which will make sure that no excessive waiting
> > > will be involved. Then we can replace all printk in the oom path with
> > > this special printk.
> >
> > Writing data faster than readers can read is wrong, especially when
> > writers deprive readers of CPU time to read.
>
> Yes this is not good but only printk knows the congestion.

warn_alloc() callers can wait for oom_lock, for somebody which is holding
oom_lock should be making progress for warn_alloc() callers.

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */

What I'm pointing out is that this comment becomes false if there are
multiple threads doing the same thing.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..5cd845a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3259,11 +3259,21 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
     *did_some_progress = 0;
 
     /*
-     * Acquire the oom lock.  If that fails, somebody else is
-     * making progress for us.
+     * Acquire the oom lock. If that fails, retry after giving the owner of
+     * the oom lock a chance to make progress, by taking a short sleep.
+     * But when retrying, skip direct reclaim/compaction in order to give
+     * the owner of the oom lock as much CPU time as possible, for the
+     * owner of the oom lock might be a SCHED_IDLE priority thread.
+     * When 10+ !SCHED_IDLE priority threads do direct reclaim/compaction
+     * on a CPU which the owner of the oom lock (a SCHED_IDLE priority
+     * thread) can run, it is possible that all CPU time yielded by the
+     * short sleep is wasted for direct reclaim/compaction and the owner of
+     * the oom lock fails to make progress for 60+ seconds due to lack of
+     * CPU time (i.e. falls into an OOM livelock situation where the OOM
+     * killer cannot send SIGKILL).
      */
     if (!mutex_trylock(&oom_lock)) {
-        *did_some_progress = 1;
+        *did_some_progress = -1;
         schedule_timeout_uninterruptible(1);
         return NULL;
     }
@@ -3770,6 +3780,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
     unsigned long alloc_start = jiffies;
     unsigned int stall_timeout = 10 * HZ;
     unsigned int cpuset_mems_cookie;
+    bool skip_direct_reclaim = false;
 
     /*
      * In the slowpath, we sanity check order to avoid ever trying to
@@ -3906,6 +3917,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
         stall_timeout += 10 * HZ;
     }
 
+    if (skip_direct_reclaim)
+        goto wait_for_oom_killer;
+
     /* Avoid recursion of direct reclaim */
     if (current->flags & PF_MEMALLOC)
         goto nopage;
@@ -3955,6 +3969,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
         goto retry_cpuset;
 
     /* Reclaim has failed us, start killing things */
+wait_for_oom_killer:
     page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
     if (page)
         goto got_pg;
@@ -3968,6 +3983,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
     /* Retry as long as the OOM killer is making progress */
     if (did_some_progress) {
         no_progress_loops = 0;
+        skip_direct_reclaim = did_some_progress == -1;
         goto retry;
     }
 
----------

>
> [...]
> > > As I've said out_of_memory is an expensive operation and as such it has
> > > to be preemptible. Addressing this would require quite some work.
> >
> > But calling out_of_memory() with SCHED_IDLE priority makes overall allocations
> > far more expensive. If you want to keep out_of_memory() preemptible, you should
> > make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to
> > a dedicated kernel thread like oom_reaper will do it.
>
> You do realize that the whole page allocator is not priority aware and
> a low priority task can starve a higher priority task already in the
> reclaim path. Is this ideal? Absolutely no but let's be realistic, this
> has never been a priority and it would require a lot of heavy lifting.

Then why do you complain

  No, seriously! Just think about what you are proposing. You are stalling
  and now you will stall _random_ tasks even more.

? We cannot make the whole page allocator priority aware within a few years.
Then, avoiding possibility of lockups for now makes sense until you fix it.

> The OOM is the most cold path in the whole allocation stack and focusing
> solely on it while claiming something take a minute or two longer is
> just not going to attract a lot of attention.

Who said that "a minute or two"? We can make the stalls effectively "forever", for
delay caused by

  while (1)
    cond_resched();

interference in __alloc_pages_slowpath() (this is what I demonstrated by
disabling warn_alloc()) gets longer as number of threads doing that busy loop
increases.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-12 12:23                           ` Tetsuo Handa
@ 2017-07-12 12:41                             ` Michal Hocko
  2017-07-14 12:30                               ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-07-12 12:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

On Wed 12-07-17 21:23:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 12-07-17 07:06:11, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > [...]
> > > > > > warn_alloc is just yet-another-user of printk. We might have many
> > > > > > others...
> > > > >
> > > > > warn_alloc() is different from other users of printk() that printk() is called
> > > > > as long as oom_lock is already held by somebody else processing console_unlock().
> > > >
> > > > So what exactly prevents any other caller of printk interfering while
> > > > the oom is ongoing?
> > >
> > > Other callers of printk() are not doing silly things like "while(1) printk();".
> >
> > They can still print a lot. There have been reports of one printk source
> > pushing an unrelated context to print way too much.
> 
> Which source is that?
> 
> Legitimate printk() users might do
> 
>   for (i = 0; i < 1000; i++)
>     printk();
> 
> but they do not do
> 
>   while (1)
>     for (i = 0; i < 1000; i++)
>       printk();
> 
> .
> 
> >
> > > They don't call printk() until something completes (e.g. some operation returned
> > > an error code) or they do throttling. Only watchdog calls printk() without waiting
> > > for something to complete (because watchdog is there in order to warn that something
> > > might be wrong). But watchdog is calling printk() carefully not to cause flooding
> > > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls
> > > rcu_lock_break()).
> >
> > Look at hard/soft lockup detector and how it can cause flood of printks.
> 
> Lockup detector is legitimate because it is there to warn that somebody is
> continuously consuming CPU time. Lockup detector might do

Sigh. What I've tried to convey is that the lockup detector can print _a
lot_ (just consider a large machine with hundreds of CPUs and trying to
dump stack trace on each of them....) and that might mimic a herd of
printks from allocation stalls...
[...]
> > warn_alloc prints a single line + dump_stack for each stalling allocation and
> > show_mem once per second. That doesn't sound overly crazy to me.
> > Sure we can have many stalling tasks under certain conditions (most of
> > them quite unrealistic) and then we can print a lot. I do not see an
> > easy way out of it without losing information about stalls and I guess
> > we want to know about them otherwise we will have much harder time to
> > debug stalls.
> 
> Printing just one line per every second can lead to lockup, for
> the condition to escape the "for (;;)" loop in console_unlock() is
> 
>                 if (console_seq == log_next_seq)
>                         break;

Then something is really broken in that condition, don't you think?
Peter has already mentioned that offloading to a different context seems
like the way to go here.

> when cond_resched() in that loop slept for more than one second due to
> SCHED_IDLE priority.
> 
> Currently preempt_disable()/preempt_enable_no_resched() (or equivalent)
> is the only available countermeasure for minimizing interference like
> 
>     for (i = 0; i < 1000; i++)
>       printk();
> 
> . If prink() allows per printk context (shown below) flag which allows printk()
> users to force printk() not to try to print immediately (i.e. declare that
> use deferred printing (maybe offloaded to the printk-kthread)), lockups by
> cond_resched() from console_unlock() from printk() from out_of_memory() will be
> avoided.

As I've said earlier, if there is no other way to make printk work without all
these nasty side effected then I would be OK to add a printk context
specific calls into the oom killer.

Removing the rest because this is again getting largely tangent. The
primary problem you are seeing is that we stumble over printk here.
Unless I can see a sound argument this is not the case it doesn't make
any sense to discuss allocator changes.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-12 12:41                             ` Michal Hocko
@ 2017-07-14 12:30                               ` Tetsuo Handa
  2017-07-14 12:48                                 ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-07-14 12:30 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Wed 12-07-17 21:23:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 12-07-17 07:06:11, Tetsuo Handa wrote:
> > > > They don't call printk() until something completes (e.g. some operation returned
> > > > an error code) or they do throttling. Only watchdog calls printk() without waiting
> > > > for something to complete (because watchdog is there in order to warn that something
> > > > might be wrong). But watchdog is calling printk() carefully not to cause flooding
> > > > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls
> > > > rcu_lock_break()).
> > >
> > > Look at hard/soft lockup detector and how it can cause flood of printks.
> > 
> > Lockup detector is legitimate because it is there to warn that somebody is
> > continuously consuming CPU time. Lockup detector might do
> 
> Sigh. What I've tried to convey is that the lockup detector can print _a
> lot_ (just consider a large machine with hundreds of CPUs and trying to
> dump stack trace on each of them....) and that might mimic a herd of
> printks from allocation stalls...

So, you are brave enough to offload writing to consoles to a schedulable context
when the system is experiencing hard/soft lockups (which means that there is no
guarantee that printk kthread can start writing to consoles from a schedulable
context). Good luck. Watchdog messages from interrupt context are urgent.

But OOM killer messages and warn_alloc() (for stall reporting) messages are
both from schedulable context and these messages can be offloaded to printk kthread.
I'm not objecting offloading OOM killer messages and warn_alloc() messages.

> [...]
> > > warn_alloc prints a single line + dump_stack for each stalling allocation and
> > > show_mem once per second. That doesn't sound overly crazy to me.
> > > Sure we can have many stalling tasks under certain conditions (most of
> > > them quite unrealistic) and then we can print a lot. I do not see an
> > > easy way out of it without losing information about stalls and I guess
> > > we want to know about them otherwise we will have much harder time to
> > > debug stalls.
> > 
> > Printing just one line per every second can lead to lockup, for
> > the condition to escape the "for (;;)" loop in console_unlock() is
> > 
> >                 if (console_seq == log_next_seq)
> >                         break;
> 
> Then something is really broken in that condition, don't you think?
> Peter has already mentioned that offloading to a different context seems
> like the way to go here.
> 
> > when cond_resched() in that loop slept for more than one second due to
> > SCHED_IDLE priority.
> > 
> > Currently preempt_disable()/preempt_enable_no_resched() (or equivalent)
> > is the only available countermeasure for minimizing interference like
> > 
> >     for (i = 0; i < 1000; i++)
> >       printk();
> > 
> > . If prink() allows per printk context (shown below) flag which allows printk()
> > users to force printk() not to try to print immediately (i.e. declare that
> > use deferred printing (maybe offloaded to the printk-kthread)), lockups by
> > cond_resched() from console_unlock() from printk() from out_of_memory() will be
> > avoided.
> 
> As I've said earlier, if there is no other way to make printk work without all
> these nasty side effected then I would be OK to add a printk context
> specific calls into the oom killer.
> 
> Removing the rest because this is again getting largely tangent. The
> primary problem you are seeing is that we stumble over printk here.
> Unless I can see a sound argument this is not the case it doesn't make
> any sense to discuss allocator changes.

You are still ignoring my point. I agree that we stumble over printk(), but
printk() is nothing but one of locations we stumble.

Look at schedule_timeout_killable(1) in out_of_memory() which is called with
oom_lock still held. I'm reporting that even printk() is offloaded to printk
kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep
for more than 12 minutes (which is intended to sleep for only one millisecond).
(I gave up waiting and pressed SysRq-i. I can't imagine how long it would have
continued sleeping inside schedule_timeout_killable(1) with oom_lock held.)

Without cooperation from other allocating threads which failed to hold oom_lock,
it is dangerous to keep out_of_memory() preemptible/schedulable context.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	unsigned long size = 65536;
	char *buf = malloc(size);
	unsigned long i = 0;
	signal(SIGCLD, SIG_IGN);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	pipe(pipe_fd);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			if (i)
				close(pipe_fd[1]);
			/* Wait until first child gets SIGKILL. */
			read(pipe_fd[0], &i, 1);
			/* Join the direct reclaim storm. */
			for (i = 0; i < size; i += 4096)
				buf[i] = 0;
			_exit(0);
		}
	for (size = 512; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	close(pipe_fd[1]);
	for (i = 0; i < size; i += 4096)
		buf[i] = 0; /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
---------- Reproducer end ----------

---------- Simple printk() offloading patch start ----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2d..529fc36 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -651,6 +651,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
+	unsigned			offload_printk:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fc47863..06afcd7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -48,6 +48,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/task_stack.h>
+#include <linux/kthread.h>
 
 #include <linux/uaccess.h>
 #include <asm/sections.h>
@@ -2120,6 +2121,20 @@ static int have_callable_console(void)
 	return 0;
 }
 
+static atomic_t printk_reaper_pending = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(printk_reaper_wait);
+
+static int printk_reaper(void *unused)
+{
+	while (true) {
+		wait_event(printk_reaper_wait, atomic_read(&printk_reaper_pending));
+		console_lock();
+		console_unlock();
+		atomic_dec(&printk_reaper_pending);
+	}
+	return 0;
+}
+
 /*
  * Can we actually use the console at this time on this cpu?
  *
@@ -2129,6 +2144,11 @@ static int have_callable_console(void)
  */
 static inline int can_use_console(void)
 {
+	if (current->offload_printk && in_task()) {
+		atomic_inc(&printk_reaper_pending);
+		wake_up(&printk_reaper_wait);
+		return 0;
+	}
 	return cpu_online(raw_smp_processor_id()) || have_callable_console();
 }
 
@@ -2678,6 +2698,7 @@ static int __init printk_late_init(void)
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "printk:online",
 					console_cpu_notify, NULL);
 	WARN_ON(ret < 0);
+	kthread_run(printk_reaper, NULL, "printk_reaper");
 	return 0;
 }
 late_initcall(printk_late_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..66356f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3754,6 +3754,19 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }
 
+static void memalloc_print(unsigned long unused);
+static atomic_t memalloc_in_flight = ATOMIC_INIT(0);
+static DEFINE_TIMER(memalloc_timer, memalloc_print, 0, 0);
+static void memalloc_print(unsigned long unused)
+{
+	const int in_flight = atomic_read(&memalloc_in_flight);
+
+	if (in_flight < 10)
+		return;
+	pr_warn("MemAlloc: %d in-flight.\n", in_flight);
+	mod_timer(&memalloc_timer, jiffies + HZ);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3770,6 +3783,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	const bool offload_printk = current->offload_printk;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3790,6 +3804,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
+	if (can_direct_reclaim)
+		atomic_inc(&memalloc_in_flight);
+	current->offload_printk = 1;
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;
@@ -3898,6 +3915,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (!can_direct_reclaim)
 		goto nopage;
 
+	if (!timer_pending(&memalloc_timer))
+		mod_timer(&memalloc_timer, jiffies + HZ);
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
@@ -4020,6 +4039,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
+	current->offload_printk = offload_printk;
+	if (can_direct_reclaim)
+		atomic_dec(&memalloc_in_flight);
 	return page;
 }
 
---------- Simple printk() offloading patch end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170714.txt.xz .
It is pointless to complain about 927 threads doing allocations. The primary
problem is scheduling priority rather than printk(). I would be able to reproduce
this problem with much fewer threads if I were permitted to mix realtime priority
threads; I put my reproducer under constraint not to use root privileges.
----------
[  856.066292] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[  856.066298] idle-priority cpuset=/ mems_allowed=0
[  856.066306] CPU: 0 PID: 6183 Comm: idle-priority Not tainted 4.12.0-next-20170714+ #632
[  856.066307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  856.066317] Call Trace:
[  856.066325]  dump_stack+0x67/0x9e
[  856.066331]  dump_header+0x9d/0x3fa
[  856.066337]  ? trace_hardirqs_on+0xd/0x10
[  856.066343]  oom_kill_process+0x226/0x650
[  856.066350]  out_of_memory+0x136/0x560
[  856.066351]  ? out_of_memory+0x206/0x560
[  856.066357]  __alloc_pages_nodemask+0xec1/0xf50
[  856.066375]  alloc_pages_vma+0x76/0x1a0
[  856.066381]  __handle_mm_fault+0xddd/0x1160
[  856.066384]  ? native_sched_clock+0x36/0xa0
[  856.066395]  handle_mm_fault+0x186/0x360
[  856.066396]  ? handle_mm_fault+0x44/0x360
[  856.066401]  __do_page_fault+0x1da/0x510
[  856.066409]  do_page_fault+0x21/0x70
[  856.066413]  page_fault+0x22/0x30
[  856.066416] RIP: 0033:0x4008c0
[  856.066417] RSP: 002b:00007ffc83235ba0 EFLAGS: 00010206
[  856.066419] RAX: 00000000c7b4e000 RBX: 0000000100000000 RCX: 00007f1b4da57bd0
[  856.066420] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004
[  856.066420] RBP: 00007f194db63010 R08: 0000000000000000 R09: 0000000000021000
[  856.066421] R10: 00007ffc83235920 R11: 0000000000000246 R12: 0000000000000006
[  856.066422] R13: 00007f194db63010 R14: 0000000000000000 R15: 0000000000000000
[  856.066433] Mem-Info:
[  856.066437] active_anon:843412 inactive_anon:3676 isolated_anon:0
[  856.066437]  active_file:0 inactive_file:209 isolated_file:0
[  856.066437]  unevictable:0 dirty:0 writeback:0 unstable:0
[  856.066437]  slab_reclaimable:0 slab_unreclaimable:0
[  856.066437]  mapped:1877 shmem:4208 pagetables:10553 bounce:0
[  856.066437]  free:21334 free_pcp:0 free_cma:0
[  856.066439] Node 0 active_anon:3373648kB inactive_anon:14704kB active_file:0kB inactive_file:836kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7508kB dirty:0kB writeback:0kB shmem:16832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2762752kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  856.066484] Node 0 DMA free:14776kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  856.066487] lowmem_reserve[]: 0 2688 3624 3624
[  856.066492] Node 0 DMA32 free:53464kB min:49908kB low:62384kB high:74860kB active_anon:2696692kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752816kB mlocked:0kB kernel_stack:48kB pagetables:164kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  856.066495] lowmem_reserve[]: 0 0 936 936
[  856.066499] Node 0 Normal free:17096kB min:17384kB low:21728kB high:26072kB active_anon:675816kB inactive_anon:14704kB active_file:0kB inactive_file:944kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:17632kB pagetables:42044kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  856.066502] lowmem_reserve[]: 0 0 0 0
[  856.066506] Node 0 DMA: 2*4kB (UM) 2*8kB (UM) 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (ME) = 14776kB
[  856.066525] Node 0 DMA32: 22*4kB (UM) 70*8kB (UM) 73*16kB (UM) 78*32kB (UM) 62*64kB (UM) 41*128kB (UM) 38*256kB (UM) 15*512kB (U) 20*1024kB (UM) 1*2048kB (U) 0*4096kB = 53464kB
[  856.066553] Node 0 Normal: 89*4kB (UMH) 96*8kB (UH) 84*16kB (UMEH) 99*32kB (UMEH) 35*64kB (UMEH) 34*128kB (UMEH) 21*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17604kB
[  856.066569] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  856.066570] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  856.066571] 4450 total pagecache pages
[  856.066573] 0 pages in swap cache
[  856.066574] Swap cache stats: add 0, delete 0, find 0/0
[  856.066575] Free swap  = 0kB
[  856.066576] Total swap = 0kB
[  856.066577] 1048445 pages RAM
[  856.066578] 0 pages HighMem/MovableOnly
[  856.066579] 116548 pages reserved
[  856.066580] 0 pages hwpoisoned
[  856.066581] Out of memory: Kill process 6183 (idle-priority) score 879 or sacrifice child
[  856.068245] Killed process 6301 (normal-priority) total-vm:4360kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
[  856.928087] MemAlloc: 916 in-flight.
[  857.952174] MemAlloc: 921 in-flight.
[  858.976265] MemAlloc: 923 in-flight.
[  859.417641] warn_alloc: 364 callbacks suppressed
(...snipped...)
[ 1155.865901] normal-priority: page allocation stalls for 300124ms, order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[ 1155.865906] normal-priority cpuset=/ mems_allowed=0
[ 1155.865910] CPU: 0 PID: 7193 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632
[ 1155.865911] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1155.865911] Call Trace:
[ 1155.865915]  dump_stack+0x67/0x9e
[ 1155.865918]  warn_alloc+0x10f/0x1b0
[ 1155.865924]  ? wake_all_kswapds+0x56/0x96
[ 1155.865929]  __alloc_pages_nodemask+0xb2c/0xf50
[ 1155.865945]  alloc_pages_vma+0x76/0x1a0
[ 1155.865950]  __handle_mm_fault+0xddd/0x1160
[ 1155.865952]  ? native_sched_clock+0x36/0xa0
[ 1155.865962]  handle_mm_fault+0x186/0x360
[ 1155.865964]  ? handle_mm_fault+0x44/0x360
[ 1155.865968]  __do_page_fault+0x1da/0x510
[ 1155.865975]  do_page_fault+0x21/0x70
[ 1155.865979]  page_fault+0x22/0x30
(...snipped...)
[ 1175.931414] normal-priority: page allocation stalls for 320133ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[ 1175.931420] normal-priority cpuset=/ mems_allowed=0
[ 1175.931425] CPU: 0 PID: 6697 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632
[ 1175.931426] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1175.931427] Call Trace:
[ 1175.931432]  dump_stack+0x67/0x9e
[ 1175.931436]  warn_alloc+0x10f/0x1b0
[ 1175.931442]  ? wake_all_kswapds+0x56/0x96
[ 1175.931447]  __alloc_pages_nodemask+0xb2c/0xf50
[ 1175.931465]  alloc_pages_current+0x65/0xb0
[ 1175.931469]  __page_cache_alloc+0x10b/0x140
[ 1175.931474]  filemap_fault+0x3df/0x6a0
[ 1175.931475]  ? filemap_fault+0x2ab/0x6a0
[ 1175.931482]  xfs_filemap_fault+0x34/0x50
[ 1175.931485]  __do_fault+0x19/0x120
[ 1175.931489]  __handle_mm_fault+0xa5c/0x1160
[ 1175.931492]  ? native_sched_clock+0x36/0xa0
[ 1175.931502]  handle_mm_fault+0x186/0x360
[ 1175.931504]  ? handle_mm_fault+0x44/0x360
[ 1175.931508]  __do_page_fault+0x1da/0x510
[ 1175.931516]  do_page_fault+0x21/0x70
[ 1175.931520]  page_fault+0x22/0x30
(...snipped...)
[ 1494.107705] idle-priority   R  running task    13272  6183   2806 0x00000080
[ 1494.109560] Call Trace:
[ 1494.110510]  __schedule+0x256/0x8e0
[ 1494.111659]  ? _raw_spin_unlock_irqrestore+0x31/0x50
[ 1494.113078]  schedule+0x38/0x90
[ 1494.114204]  schedule_timeout+0x19a/0x330
[ 1494.115424]  ? call_timer_fn+0x120/0x120
[ 1494.116639]  ? oom_kill_process+0x50c/0x650
[ 1494.117909]  schedule_timeout_killable+0x25/0x30
[ 1494.119259]  out_of_memory+0x140/0x560
[ 1494.120485]  ? out_of_memory+0x206/0x560
[ 1494.121693]  __alloc_pages_nodemask+0xec1/0xf50
[ 1494.123058]  alloc_pages_vma+0x76/0x1a0
[ 1494.124306]  __handle_mm_fault+0xddd/0x1160
[ 1494.125572]  ? native_sched_clock+0x36/0xa0
[ 1494.126848]  handle_mm_fault+0x186/0x360
[ 1494.128103]  ? handle_mm_fault+0x44/0x360
[ 1494.129284]  __do_page_fault+0x1da/0x510
[ 1494.130575]  do_page_fault+0x21/0x70
[ 1494.131738]  page_fault+0x22/0x30
(...snipped...)
[ 1571.807967] MemAlloc: 927 in-flight.
[ 1572.832053] MemAlloc: 927 in-flight.
[ 1573.856064] MemAlloc: 927 in-flight.
[ 1574.880238] MemAlloc: 927 in-flight.
[ 1575.904575] MemAlloc: 927 in-flight.
[ 1576.060103] warn_alloc: 462 callbacks suppressed
[ 1576.060139] normal-priority: page allocation stalls for 720318ms, order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[ 1576.060149] normal-priority cpuset=/ mems_allowed=0
[ 1576.060158] CPU: 0 PID: 7193 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632
[ 1576.060160] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1576.060162] Call Trace:
[ 1576.060170]  dump_stack+0x67/0x9e
[ 1576.060177]  warn_alloc+0x10f/0x1b0
[ 1576.060184]  ? wake_all_kswapds+0x56/0x96
[ 1576.060189]  __alloc_pages_nodemask+0xb2c/0xf50
[ 1576.060207]  alloc_pages_vma+0x76/0x1a0
[ 1576.060214]  __handle_mm_fault+0xddd/0x1160
[ 1576.060218]  ? native_sched_clock+0x36/0xa0
[ 1576.060229]  handle_mm_fault+0x186/0x360
[ 1576.060231]  ? handle_mm_fault+0x44/0x360
[ 1576.060236]  __do_page_fault+0x1da/0x510
[ 1576.060244]  do_page_fault+0x21/0x70
[ 1576.060249]  page_fault+0x22/0x30
[ 1576.060252] RIP: 0033:0x400923
[ 1576.060254] RSP: 002b:00007ffc83235ba0 EFLAGS: 00010287
[ 1576.060257] RAX: 0000000000003000 RBX: 0000000000000000 RCX: 00007f1b4da57c30
[ 1576.060259] RDX: 0000000000000001 RSI: 00007ffc83235bb0 RDI: 0000000000000003
[ 1576.060260] RBP: 0000000000000000 R08: 00007f1b4df40740 R09: 0000000000000000
[ 1576.060262] R10: 00007ffc83235920 R11: 0000000000000246 R12: 0000000000400954
[ 1576.060263] R13: 000000000212d010 R14: 0000000000000000 R15: 0000000000000000
[ 1576.060275] warn_alloc_show_mem: 9 callbacks suppressed
[ 1576.060277] Mem-Info:
[ 1576.060281] active_anon:843659 inactive_anon:3676 isolated_anon:0
[ 1576.060281]  active_file:0 inactive_file:0 isolated_file:0
[ 1576.060281]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1576.060281]  slab_reclaimable:0 slab_unreclaimable:0
[ 1576.060281]  mapped:1853 shmem:4208 pagetables:10537 bounce:0
[ 1576.060281]  free:21425 free_pcp:60 free_cma:0
[ 1576.060284] Node 0 active_anon:3374636kB inactive_anon:14704kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7412kB dirty:0kB writeback:0kB shmem:16832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2762752kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 1576.060286] Node 0 DMA free:14776kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1576.060290] lowmem_reserve[]: 0 2688 3624 3624
[ 1576.060299] Node 0 DMA32 free:53544kB min:49908kB low:62384kB high:74860kB active_anon:2696692kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752816kB mlocked:0kB kernel_stack:32kB pagetables:164kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1576.060303] lowmem_reserve[]: 0 0 936 936
[ 1576.060311] Node 0 Normal free:17380kB min:17384kB low:21728kB high:26072kB active_anon:676852kB inactive_anon:14704kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:17584kB pagetables:41980kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB
[ 1576.060315] lowmem_reserve[]: 0 0 0 0
[ 1576.060323] Node 0 DMA: 2*4kB (UM) 2*8kB (UM) 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (ME) = 14776kB
[ 1576.060357] Node 0 DMA32: 22*4kB (UM) 70*8kB (UM) 72*16kB (UM) 77*32kB (UM) 62*64kB (UM) 40*128kB (UM) 39*256kB (UM) 15*512kB (U) 20*1024kB (UM) 1*2048kB (U) 0*4096kB = 53544kB
[ 1576.060391] Node 0 Normal: 85*4kB (UMH) 128*8kB (UMH) 97*16kB (UMEH) 92*32kB (UMEH) 32*64kB (UEH) 32*128kB (UMEH) 21*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[ 1576.060421] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1576.060423] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1576.060425] 4210 total pagecache pages
[ 1576.060428] 0 pages in swap cache
[ 1576.060430] Swap cache stats: add 0, delete 0, find 0/0
[ 1576.060432] Free swap  = 0kB
[ 1576.060433] Total swap = 0kB
[ 1576.060435] 1048445 pages RAM
[ 1576.060437] 0 pages HighMem/MovableOnly
[ 1576.060438] 116548 pages reserved
[ 1576.060440] 0 pages hwpoisoned
(...snipped...)
[ 1576.168438] normal-priority: page allocation stalls for 720370ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[ 1576.168443] normal-priority cpuset=/ mems_allowed=0
[ 1576.168473] CPU: 0 PID: 6697 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632
[ 1576.168474] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1576.168475] Call Trace:
[ 1576.168479]  dump_stack+0x67/0x9e
[ 1576.168483]  warn_alloc+0x10f/0x1b0
[ 1576.168489]  ? wake_all_kswapds+0x56/0x96
[ 1576.168493]  __alloc_pages_nodemask+0xb2c/0xf50
[ 1576.168511]  alloc_pages_current+0x65/0xb0
[ 1576.168515]  __page_cache_alloc+0x10b/0x140
[ 1576.168519]  filemap_fault+0x3df/0x6a0
[ 1576.168521]  ? filemap_fault+0x2ab/0x6a0
[ 1576.168528]  xfs_filemap_fault+0x34/0x50
[ 1576.168531]  __do_fault+0x19/0x120
[ 1576.168535]  __handle_mm_fault+0xa5c/0x1160
[ 1576.168538]  ? native_sched_clock+0x36/0xa0
[ 1576.168549]  handle_mm_fault+0x186/0x360
[ 1576.168550]  ? handle_mm_fault+0x44/0x360
[ 1576.168554]  __do_page_fault+0x1da/0x510
[ 1576.168562]  do_page_fault+0x21/0x70
[ 1576.168566]  page_fault+0x22/0x30
[ 1576.168568] RIP: 0033:0x7f1b4dd39ced
[ 1576.168569] RSP: 002b:00007ffc832358e0 EFLAGS: 00010202
[ 1576.168571] RAX: 0000000000000204 RBX: 00007f1b4d972d18 RCX: 00000000000003f3
[ 1576.168572] RDX: 00007f1b4d971280 RSI: 0000000000000001 RDI: 00007f1b4df4f658
[ 1576.168573] RBP: 0000000000000001 R08: 000000000000003e R09: 00000000003bbe1a
[ 1576.168574] R10: 00007f1b4df524a8 R11: 00007ffc83235a20 R12: 0000000000000003
[ 1576.168575] R13: 000000000eef86be R14: 00007f1b4df4fb50 R15: 0000000000000000
[ 1576.928411] MemAlloc: 927 in-flight.
[ 1577.952567] MemAlloc: 927 in-flight.
[ 1578.976001] MemAlloc: 927 in-flight.
[ 1580.000154] MemAlloc: 927 in-flight.
[ 1580.003790] sysrq: SysRq : Kill All Tasks
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-14 12:30                               ` Tetsuo Handa
@ 2017-07-14 12:48                                 ` Michal Hocko
  2017-08-09  6:14                                   ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-07-14 12:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

On Fri 14-07-17 21:30:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > As I've said earlier, if there is no other way to make printk work without all
> > these nasty side effected then I would be OK to add a printk context
> > specific calls into the oom killer.
> > 
> > Removing the rest because this is again getting largely tangent. The
> > primary problem you are seeing is that we stumble over printk here.
> > Unless I can see a sound argument this is not the case it doesn't make
> > any sense to discuss allocator changes.
> 
> You are still ignoring my point. I agree that we stumble over printk(), but
> printk() is nothing but one of locations we stumble.

I am not ignoring it. You just mix too many things together to have a
meaningful conversation...
 
> Look at schedule_timeout_killable(1) in out_of_memory() which is called with
> oom_lock still held. I'm reporting that even printk() is offloaded to printk
> kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep
> for more than 12 minutes (which is intended to sleep for only one millisecond).
> (I gave up waiting and pressed SysRq-i. I can't imagine how long it would have
> continued sleeping inside schedule_timeout_killable(1) with oom_lock held.)
> 
> Without cooperation from other allocating threads which failed to hold oom_lock,
> it is dangerous to keep out_of_memory() preemptible/schedulable context.

I have already tried to explain that this is something that the whole
reclaim path suffers from the priority inversions problem because it has
never been designed to handle that. You are just poking to one
particular path of the reclaim stack and missing the whole forest for a
tree. How the hack is this any different from a reclaim path stumbling
over a lock down inside the filesystem and stalling basically everybody
from making a reasonable progress? Is this a problem? Of course it is,
theoretically. In practice not all that much to go and reimplement the
whole stack. At least I haven't seen any real life reports complaining
about this.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-14 12:48                                 ` Michal Hocko
@ 2017-08-09  6:14                                   ` Tetsuo Handa
  2017-08-09 13:01                                     ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-08-09  6:14 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Fri 14-07-17 21:30:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > As I've said earlier, if there is no other way to make printk work without all
> > > these nasty side effected then I would be OK to add a printk context
> > > specific calls into the oom killer.
> > > 
> > > Removing the rest because this is again getting largely tangent. The
> > > primary problem you are seeing is that we stumble over printk here.
> > > Unless I can see a sound argument this is not the case it doesn't make
> > > any sense to discuss allocator changes.
> > 
> > You are still ignoring my point. I agree that we stumble over printk(), but
> > printk() is nothing but one of locations we stumble.
> 
> I am not ignoring it. You just mix too many things together to have a
> meaningful conversation...
>  
> > Look at schedule_timeout_killable(1) in out_of_memory() which is called with
> > oom_lock still held. I'm reporting that even printk() is offloaded to printk
> > kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep
> > for more than 12 minutes (which is intended to sleep for only one millisecond).
> > (I gave up waiting and pressed SysRq-i. I can't imagine how long it would have
> > continued sleeping inside schedule_timeout_killable(1) with oom_lock held.)
> > 
> > Without cooperation from other allocating threads which failed to hold oom_lock,
> > it is dangerous to keep out_of_memory() preemptible/schedulable context.
> 
> I have already tried to explain that this is something that the whole
> reclaim path suffers from the priority inversions problem because it has
> never been designed to handle that. You are just poking to one
> particular path of the reclaim stack and missing the whole forest for a
> tree. How the hack is this any different from a reclaim path stumbling
> over a lock down inside the filesystem and stalling basically everybody
> from making a reasonable progress? Is this a problem? Of course it is,
> theoretically. In practice not all that much to go and reimplement the
> whole stack. At least I haven't seen any real life reports complaining
> about this.

I'm failing to test your "mm, oom: fix oom_reaper fallouts" patches using
http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
because it fails to invoke the OOM killer for unknown reason. I analyzed it using
kmallocwd and confirmed that two dozens of concurrent allocating threads is
sufficient for hitting this warn_alloc() v.s. printk() lockup.
Since printk offloading is not yet available, serialization is the the only choice
we can mitigate this problem for now. How long will we have to waste more?

----------
[  645.993827] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[  645.996694] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=7541
[  645.996695] vmtoolsd        R  running task    11960  2221      1 0x00000080
[  645.996699] Call Trace:
[  645.996708]  ? console_unlock+0x373/0x4a0
[  645.996709]  ? vprintk_emit+0x211/0x2f0
[  645.996714]  ? vprintk_emit+0x21a/0x2f0
[  645.996720]  ? vprintk_default+0x1a/0x20
[  645.996722]  ? vprintk_func+0x22/0x60
[  645.996724]  ? printk+0x53/0x6a
[  645.996731]  ? dump_stack_print_info+0xab/0xb0
[  645.996736]  ? dump_stack+0x5e/0x9e
[  645.996739]  ? dump_header+0x9d/0x3fa
[  645.996744]  ? trace_hardirqs_on+0xd/0x10
[  645.996751]  ? oom_kill_process+0x226/0x650
[  645.996757]  ? out_of_memory+0x13d/0x570
[  645.996758]  ? out_of_memory+0x20d/0x570
[  645.996763]  ? __alloc_pages_nodemask+0xbc8/0xed0
[  645.996780]  ? alloc_pages_current+0x65/0xb0
[  645.996784]  ? __page_cache_alloc+0x10b/0x140
[  645.996789]  ? filemap_fault+0x3df/0x6a0
[  645.996790]  ? filemap_fault+0x2ab/0x6a0
[  645.996797]  ? xfs_filemap_fault+0x34/0x50
[  645.996799]  ? __do_fault+0x19/0x120
[  645.996803]  ? __handle_mm_fault+0xa99/0x1260
[  645.996814]  ? handle_mm_fault+0x1b2/0x350
[  645.996816]  ? handle_mm_fault+0x46/0x350
[  645.996820]  ? __do_page_fault+0x1da/0x510
[  645.996828]  ? do_page_fault+0x21/0x70
[  645.996832]  ? page_fault+0x22/0x30
(...snipped...)
[  645.998748] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[ 1472.484590] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[ 1472.487341] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=834032
[ 1472.487342] vmtoolsd        R  running task    11960  2221      1 0x00000080
[ 1472.487346] Call Trace:
[ 1472.487353]  ? console_unlock+0x373/0x4a0
[ 1472.487355]  ? vprintk_emit+0x211/0x2f0
[ 1472.487360]  ? vprintk_emit+0x21a/0x2f0
[ 1472.487367]  ? vprintk_default+0x1a/0x20
[ 1472.487369]  ? vprintk_func+0x22/0x60
[ 1472.487370]  ? printk+0x53/0x6a
[ 1472.487377]  ? dump_stack_print_info+0xab/0xb0
[ 1472.487381]  ? dump_stack+0x5e/0x9e
[ 1472.487384]  ? dump_header+0x9d/0x3fa
[ 1472.487389]  ? trace_hardirqs_on+0xd/0x10
[ 1472.487396]  ? oom_kill_process+0x226/0x650
[ 1472.487402]  ? out_of_memory+0x13d/0x570
[ 1472.487403]  ? out_of_memory+0x20d/0x570
[ 1472.487408]  ? __alloc_pages_nodemask+0xbc8/0xed0
[ 1472.487426]  ? alloc_pages_current+0x65/0xb0
[ 1472.487429]  ? __page_cache_alloc+0x10b/0x140
[ 1472.487434]  ? filemap_fault+0x3df/0x6a0
[ 1472.487435]  ? filemap_fault+0x2ab/0x6a0
[ 1472.487441]  ? xfs_filemap_fault+0x34/0x50
[ 1472.487444]  ? __do_fault+0x19/0x120
[ 1472.487447]  ? __handle_mm_fault+0xa99/0x1260
[ 1472.487459]  ? handle_mm_fault+0x1b2/0x350
[ 1472.487460]  ? handle_mm_fault+0x46/0x350
[ 1472.487465]  ? __do_page_fault+0x1da/0x510
[ 1472.487472]  ? do_page_fault+0x21/0x70
[ 1472.487476]  ? page_fault+0x22/0x30
(...snipped...)
[ 1472.489975] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-08-09  6:14                                   ` Tetsuo Handa
@ 2017-08-09 13:01                                     ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-08-09 13:01 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
	vbabka, sergey.senozhatsky.work, pmladek, penguin-kernel

Tetsuo Handa wrote:
> I'm failing to test your "mm, oom: fix oom_reaper fallouts" patches using
> http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
> because it fails to invoke the OOM killer for unknown reason. I analyzed it using
> kmallocwd and confirmed that two dozens of concurrent allocating threads is
> sufficient for hitting this warn_alloc() v.s. printk() lockup.
> Since printk offloading is not yet available, serialization is the the only choice
> we can mitigate this problem for now. How long will we have to waste more?

Above explanation is not appropriate.

As a part of testing your "mm, oom: fix oom_reaper fallouts" patches, I'm trying
to use various stress patterns including changing number of threads. For some
unknown reason (though it is not caused by your patches), sometimes it takes
too much long time (order of minutes) to invoke the OOM killer despite warn_alloc()
is periodically printed or not printed at all. In order to find out what stage is
taking so long time, I'm using kmallocwd with 1 second timeout, for SysRq-t etc. are
useless for tracking how long threads are waiting at specific location because they
do not have in-flight allocation information while they take many seconds for showing
too much noise (including idle threads simply sleeping). And due to kmallocwd's 1 second
timeout setting (which I want to use for finding out at what stage allocating threads
are waiting for so long), printk() from out_of_memory() with oom_lock held is trapped
by periodical printk() by kmallocwd, and caused lockup.

So, this is not direct warn_alloc() v.s. printk() lockup. But what the kmallocwd
tries to warn is similar to warn_alloc(). I can't automate testing of "mm, oom:
fix oom_reaper fallouts" patches because I sometimes need to use SysRq-f etc. to
unstuck the test.

> 
> ----------
> [  645.993827] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [  645.996694] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=7541
> [  645.996695] vmtoolsd        R  running task    11960  2221      1 0x00000080
> [  645.996699] Call Trace:
> [  645.996708]  ? console_unlock+0x373/0x4a0
> [  645.996709]  ? vprintk_emit+0x211/0x2f0
> [  645.996714]  ? vprintk_emit+0x21a/0x2f0
> [  645.996720]  ? vprintk_default+0x1a/0x20
> [  645.996722]  ? vprintk_func+0x22/0x60
> [  645.996724]  ? printk+0x53/0x6a
> [  645.996731]  ? dump_stack_print_info+0xab/0xb0
> [  645.996736]  ? dump_stack+0x5e/0x9e
> [  645.996739]  ? dump_header+0x9d/0x3fa
> [  645.996744]  ? trace_hardirqs_on+0xd/0x10
> [  645.996751]  ? oom_kill_process+0x226/0x650
> [  645.996757]  ? out_of_memory+0x13d/0x570
> [  645.996758]  ? out_of_memory+0x20d/0x570
> [  645.996763]  ? __alloc_pages_nodemask+0xbc8/0xed0
> [  645.996780]  ? alloc_pages_current+0x65/0xb0
> [  645.996784]  ? __page_cache_alloc+0x10b/0x140
> [  645.996789]  ? filemap_fault+0x3df/0x6a0
> [  645.996790]  ? filemap_fault+0x2ab/0x6a0
> [  645.996797]  ? xfs_filemap_fault+0x34/0x50
> [  645.996799]  ? __do_fault+0x19/0x120
> [  645.996803]  ? __handle_mm_fault+0xa99/0x1260
> [  645.996814]  ? handle_mm_fault+0x1b2/0x350
> [  645.996816]  ? handle_mm_fault+0x46/0x350
> [  645.996820]  ? __do_page_fault+0x1da/0x510
> [  645.996828]  ? do_page_fault+0x21/0x70
> [  645.996832]  ? page_fault+0x22/0x30
> (...snipped...)
> [  645.998748] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [ 1472.484590] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [ 1472.487341] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=834032
> [ 1472.487342] vmtoolsd        R  running task    11960  2221      1 0x00000080
> [ 1472.487346] Call Trace:
> [ 1472.487353]  ? console_unlock+0x373/0x4a0
> [ 1472.487355]  ? vprintk_emit+0x211/0x2f0
> [ 1472.487360]  ? vprintk_emit+0x21a/0x2f0
> [ 1472.487367]  ? vprintk_default+0x1a/0x20
> [ 1472.487369]  ? vprintk_func+0x22/0x60
> [ 1472.487370]  ? printk+0x53/0x6a
> [ 1472.487377]  ? dump_stack_print_info+0xab/0xb0
> [ 1472.487381]  ? dump_stack+0x5e/0x9e
> [ 1472.487384]  ? dump_header+0x9d/0x3fa
> [ 1472.487389]  ? trace_hardirqs_on+0xd/0x10
> [ 1472.487396]  ? oom_kill_process+0x226/0x650
> [ 1472.487402]  ? out_of_memory+0x13d/0x570
> [ 1472.487403]  ? out_of_memory+0x20d/0x570
> [ 1472.487408]  ? __alloc_pages_nodemask+0xbc8/0xed0
> [ 1472.487426]  ? alloc_pages_current+0x65/0xb0
> [ 1472.487429]  ? __page_cache_alloc+0x10b/0x140
> [ 1472.487434]  ? filemap_fault+0x3df/0x6a0
> [ 1472.487435]  ? filemap_fault+0x2ab/0x6a0
> [ 1472.487441]  ? xfs_filemap_fault+0x34/0x50
> [ 1472.487444]  ? __do_fault+0x19/0x120
> [ 1472.487447]  ? __handle_mm_fault+0xa99/0x1260
> [ 1472.487459]  ? handle_mm_fault+0x1b2/0x350
> [ 1472.487460]  ? handle_mm_fault+0x46/0x350
> [ 1472.487465]  ? __do_page_fault+0x1da/0x510
> [ 1472.487472]  ? do_page_fault+0x21/0x70
> [ 1472.487476]  ? page_fault+0x22/0x30
> (...snipped...)
> [ 1472.489975] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
> ----------
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-08-09 13:02 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-01 11:43 [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable Tetsuo Handa
2017-06-01 11:59 ` Michal Hocko
2017-06-01 13:11   ` Tetsuo Handa
2017-06-01 13:28     ` Michal Hocko
2017-06-01 22:10       ` Andrew Morton
2017-06-02  7:18         ` Michal Hocko
2017-06-02 11:13           ` Tetsuo Handa
2017-06-02 12:15             ` Michal Hocko
2017-06-02 17:13               ` Tetsuo Handa
2017-06-02 21:57             ` Cong Wang
2017-06-04  8:58               ` Tetsuo Handa
2017-06-04 15:05                 ` Michal Hocko
2017-06-04 21:43                   ` Tetsuo Handa
2017-06-05  5:37                     ` Michal Hocko
2017-06-05 18:15                       ` Cong Wang
2017-06-06  9:17                         ` Michal Hocko
2017-06-05 18:25                 ` Cong Wang
2017-06-22 10:35                   ` Tetsuo Handa
2017-06-22 22:53                     ` Cong Wang
2017-06-02 16:59           ` Cong Wang
2017-06-02 19:59           ` Andrew Morton
2017-06-03  2:57             ` Tetsuo Handa
2017-06-03  7:32             ` Michal Hocko
2017-06-03  8:36               ` Tetsuo Handa
2017-06-05  7:10                 ` Sergey Senozhatsky
2017-06-05  9:36                   ` Sergey Senozhatsky
2017-06-05 15:02                     ` Tetsuo Handa
2017-06-03 13:21               ` Tetsuo Handa
2017-07-08  4:59           ` Tetsuo Handa
2017-07-10 13:21             ` Michal Hocko
2017-07-10 13:54               ` Tetsuo Handa
2017-07-10 14:14                 ` Michal Hocko
2017-07-11 13:10                   ` Tetsuo Handa
2017-07-11 13:49                     ` Michal Hocko
2017-07-11 14:58                       ` Petr Mladek
2017-07-11 22:06                       ` Tetsuo Handa
2017-07-12  8:54                         ` Michal Hocko
2017-07-12 12:23                           ` Tetsuo Handa
2017-07-12 12:41                             ` Michal Hocko
2017-07-14 12:30                               ` Tetsuo Handa
2017-07-14 12:48                                 ` Michal Hocko
2017-08-09  6:14                                   ` Tetsuo Handa
2017-08-09 13:01                                     ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).