* How to avoid printk() delay caused by cond_resched() ?
@ 2016-03-02 12:01 Tetsuo Handa
  2016-03-02 13:38 ` Petr Mladek
  2016-03-02 14:04 ` Sergey Senozhatsky
  0 siblings, 2 replies; 11+ messages in thread
From: Tetsuo Handa @ 2016-03-02 12:01 UTC (permalink / raw)
  To: sergey.senozhatsky, pmladek
  Cc: jack, tj, kyle, davej, calvinowens, akpm, linux-mm

I have a question about "printk: set may_schedule for some of
console_trylock() callers" in linux-next.git.

I'm trying to dump information about all threads which might be relevant
to a stall inside the memory allocator. But it seems to me that, since
this patch allows calling cond_resched() from printk() when it is safe
to do so, it is now possible for the thread which invoked the OOM killer
to sleep for minutes with the oom_lock mutex held while my dump is in
progress. I want to release the oom_lock mutex as soon as possible so
that other threads can call out_of_memory() to get TIF_MEMDIE and exit
their allocations.

So, how can I prevent printk() triggered by out_of_memory() from sleeping
for minutes with the oom_lock mutex held? Guard it with preempt_disable() /
preempt_enable()? Guard it with rcu_read_lock() / rcu_read_unlock()?
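
For concreteness, the kind of guard I have in mind (a rough, untested
sketch only; the exact call site and whether console_trylock() really
honors preempt_count() this way still need checking):

----------
	mutex_lock(&oom_lock);
	/*
	 * With preemption disabled, console_trylock() should not set
	 * console_may_schedule, so console_unlock() will not call
	 * cond_resched() while oom_lock is held.
	 */
	preempt_disable();
	out_of_memory(oc);	/* dump_header() etc. call printk() */
	preempt_enable();
	mutex_unlock(&oom_lock);
----------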

----------
[  460.893958] tgid=11161 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=1000
[  463.892897] tgid=11161 cpuset=/ mems_allowed=0
[  463.894724] CPU: 1 PID: 12346 Comm: tgid=11161 Not tainted 4.5.0-rc6-next-20160302+ #318
[  463.897026] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  463.899841]  0000000000000286 00000000733ef955 ffff8800778a79b0 ffffffff813a2ded
[  463.902164]  0000000000000000 ffff8800778a7be0 ffff8800778a7a50 ffffffff811c24d0
[  463.904474]  0000000000000206 ffffffff81810c30 ffff8800778a79f0 ffffffff810bf839
[  463.906801] Call Trace:
[  463.908101]  [<ffffffff813a2ded>] dump_stack+0x85/0xc8
[  463.909921]  [<ffffffff811c24d0>] dump_header+0x5b/0x3b0
[  463.911759]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  463.913833]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
[  463.915751]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
[  463.918024]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
[  463.919890]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
[  463.921784]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
[  463.923788]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
[  463.925729]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
[  463.927682]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
[  463.929547]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
[  463.931409]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
[  463.933255]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
[  463.934974]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
[  463.936791]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
[  463.938617]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
[  463.940380]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
[  463.942074]  [<ffffffff817113e8>] page_fault+0x28/0x30
[  463.943750] Mem-Info:
(...snipped...)
[  554.754959] MemAlloc: tgid=11161(12346) flags=0x400040 switches=865 seq=169 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=81602
[  554.754962] tgid=11161      R  running task        0 12346  11056 0x00000080
[  554.754965]  ffff8800778a7838 ffff880034510000 ffff8800778b0000 ffff8800778a8000
[  554.754966]  0000000000000091 ffffffff82a86fbc 0000000000000004 0000000000000000
[  554.754967]  ffff8800778a7850 ffffffff8170a5dd 0000000000000296 ffff8800778a7860
[  554.754968] Call Trace:
[  554.754974]  [<ffffffff8170a5dd>] preempt_schedule_common+0x1f/0x42
[  554.754975]  [<ffffffff8170a617>] _cond_resched+0x17/0x20
[  554.754978]  [<ffffffff810d17b9>] console_unlock+0x509/0x5c0
[  554.754979]  [<ffffffff810d1b93>] vprintk_emit+0x323/0x540
[  554.754981]  [<ffffffff810d1f0a>] vprintk_default+0x1a/0x20
[  554.754983]  [<ffffffff8114069e>] printk+0x58/0x6f
[  554.754986]  [<ffffffff813aaede>] show_mem+0x1e/0xe0
[  554.754988]  [<ffffffff811c24ec>] dump_header+0x77/0x3b0
[  554.754990]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  554.754991]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
[  554.754993]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
[  554.754995]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
[  554.754996]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
[  554.754997]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
[  554.755000]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
[  554.755001]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
[  554.755003]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
[  554.755004]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
[  554.755006]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
[  554.755007]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
[  554.755009]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
[  554.755010]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
[  554.755012]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
[  554.755013]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
[  554.755015]  [<ffffffff817113e8>] page_fault+0x28/0x30
----------

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 12:01 How to avoid printk() delay caused by cond_resched() ? Tetsuo Handa
@ 2016-03-02 13:38 ` Petr Mladek
  2016-03-02 14:11   ` Tetsuo Handa
  2016-03-02 14:34   ` Sergey Senozhatsky
  2016-03-02 14:04 ` Sergey Senozhatsky
  1 sibling, 2 replies; 11+ messages in thread
From: Petr Mladek @ 2016-03-02 13:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky, jack, tj, kyle, davej, calvinowens, akpm,
	linux-mm, mhocko

On Wed 2016-03-02 21:01:03, Tetsuo Handa wrote:
> I have a question about "printk: set may_schedule for some of
> console_trylock() callers" in linux-next.git.
> 
> I'm trying to dump information of all threads which might be relevant
> to stalling inside memory allocator. But it seems to me that since this
> patch changed to allow calling cond_resched() from printk() if it is
> safe to do so, it is now possible that the thread which invoked the OOM
> killer can sleep for minutes with the oom_lock mutex held when my dump is
> in progress. I want to release oom_lock mutex as soon as possible so
> that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> their allocations.
> 
> So, how can I prevent printk() triggered by out_of_memory() from sleeping
> for minutes with oom_lock mutex held? Guard it with preempt_disable() /
> preempt_enable() ? Guard it with rcu_read_lock() / rcu_read_unlock() ? 
>
> ----------
> [  460.893958] tgid=11161 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=1000
> [  463.892897] tgid=11161 cpuset=/ mems_allowed=0
> [  463.894724] CPU: 1 PID: 12346 Comm: tgid=11161 Not tainted 4.5.0-rc6-next-20160302+ #318
> [  463.897026] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  463.899841]  0000000000000286 00000000733ef955 ffff8800778a79b0 ffffffff813a2ded
> [  463.902164]  0000000000000000 ffff8800778a7be0 ffff8800778a7a50 ffffffff811c24d0
> [  463.904474]  0000000000000206 ffffffff81810c30 ffff8800778a79f0 ffffffff810bf839
> [  463.906801] Call Trace:
> [  463.908101]  [<ffffffff813a2ded>] dump_stack+0x85/0xc8
> [  463.909921]  [<ffffffff811c24d0>] dump_header+0x5b/0x3b0
> [  463.911759]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  463.913833]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
> [  463.915751]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
> [  463.918024]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
> [  463.919890]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
> [  463.921784]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
> [  463.923788]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
> [  463.925729]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
> [  463.927682]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
> [  463.929547]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
> [  463.931409]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
> [  463.933255]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
> [  463.934974]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
> [  463.936791]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
> [  463.938617]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
> [  463.940380]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
> [  463.942074]  [<ffffffff817113e8>] page_fault+0x28/0x30
> [  463.943750] Mem-Info:
> (...snipped...)
> [  554.754959] MemAlloc: tgid=11161(12346) flags=0x400040 switches=865 seq=169 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=81602
> [  554.754962] tgid=11161      R  running task        0 12346  11056 0x00000080
> [  554.754965]  ffff8800778a7838 ffff880034510000 ffff8800778b0000 ffff8800778a8000
> [  554.754966]  0000000000000091 ffffffff82a86fbc 0000000000000004 0000000000000000
> [  554.754967]  ffff8800778a7850 ffffffff8170a5dd 0000000000000296 ffff8800778a7860
> [  554.754968] Call Trace:
> [  554.754974]  [<ffffffff8170a5dd>] preempt_schedule_common+0x1f/0x42
> [  554.754975]  [<ffffffff8170a617>] _cond_resched+0x17/0x20
> [  554.754978]  [<ffffffff810d17b9>] console_unlock+0x509/0x5c0
> [  554.754979]  [<ffffffff810d1b93>] vprintk_emit+0x323/0x540
> [  554.754981]  [<ffffffff810d1f0a>] vprintk_default+0x1a/0x20
> [  554.754983]  [<ffffffff8114069e>] printk+0x58/0x6f
> [  554.754986]  [<ffffffff813aaede>] show_mem+0x1e/0xe0
> [  554.754988]  [<ffffffff811c24ec>] dump_header+0x77/0x3b0
> [  554.754990]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  554.754991]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
> [  554.754993]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
> [  554.754995]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
> [  554.754996]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
> [  554.754997]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
> [  554.755000]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
> [  554.755001]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
> [  554.755003]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
> [  554.755004]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
> [  554.755006]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
> [  554.755007]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
> [  554.755009]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
> [  554.755010]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
> [  554.755012]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
> [  554.755013]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
> [  554.755015]  [<ffffffff817113e8>] page_fault+0x28/0x30
> ----------
> 
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y

preempt_disable() / preempt_enable() would do the job.
The question is where to put it. If you are concerned about
the delay, you might want to disable preemption around
the whole locked area, so that it works reasonably also
in a preemptive kernel.

I am looking forward to having the console printing offloaded
into workqueues. Then printk() will become a consistently
"fast" operation and will cause fewer surprises like this.

Best Regards,
Petr



* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 12:01 How to avoid printk() delay caused by cond_resched() ? Tetsuo Handa
  2016-03-02 13:38 ` Petr Mladek
@ 2016-03-02 14:04 ` Sergey Senozhatsky
  2016-03-02 15:02   ` Tetsuo Handa
  1 sibling, 1 reply; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-03-02 14:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky, pmladek, jack, tj, kyle, davej, calvinowens,
	akpm, linux-mm

Hello Tetsuo,

On (03/02/16 21:01), Tetsuo Handa wrote:
> I'm trying to dump information of all threads which might be relevant
> to stalling inside memory allocator. But it seems to me that since this
> patch changed to allow calling cond_resched() from printk() if it is
> safe to do so, it is now possible that the thread which invoked the OOM
> killer can sleep for minutes with the oom_lock mutex held when my dump is
> in progress. I want to release oom_lock mutex as soon as possible so
> that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> their allocations.

correct, otherwise chances are that you will softlockup or RCU-stall
the system -- console_unlock() does not stop until the printk logbuf
has data in it (which can be appended concurrently from other CPUs).
this fact was the motivation for cond_resched() in console_unlock() in
the first place, and for the later patch that permits some of the
printk callers to do the same.

> So, how can I prevent printk() triggered by out_of_memory() from sleeping
> for minutes with oom_lock mutex held? Guard it with preempt_disable() /
> preempt_enable() ? Guard it with rcu_read_lock() / rcu_read_unlock() ? 

correct, both preempt_disable() and rcu_read_lock() will certainly forbid
cond_resched() in printk()->console_unlock(), but they will also increase
the softlockup and/or rcu stall probability, depending on how many threads
you have and how many stack traces you need to print. can we do dirty
things to the frequency of cond_resched()? for example, do cond_resched()
not after every printk(), but every LOG_LINE_MAX chars.


 kernel/printk/printk.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9917f69..5467b1e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2209,6 +2209,7 @@ void console_unlock(void)
 	unsigned long flags;
 	bool wake_klogd = false;
 	bool do_cond_resched, retry;
+	int num_printed;
 
 	if (console_suspended) {
 		up_console_sem();
@@ -2229,6 +2230,7 @@ void console_unlock(void)
 	console_may_schedule = 0;
 
 again:
+	num_printed = 0;
 	/*
 	 * We released the console_sem lock, so we need to recheck if
 	 * cpu is online and (if not) is there at least one CON_ANYTIME
@@ -2310,8 +2312,11 @@ skip:
 		start_critical_timings();
 		local_irq_restore(flags);
 
-		if (do_cond_resched)
+		num_printed += len + ext_len;
+		if (do_cond_resched && num_printed > LOG_LINE_MAX) {
 			cond_resched();
+			num_printed = 0;
+		}
 	}
 	console_locked = 0;
 

---

several questions,
do you use some sort of oom-kill reproducer? ... the problem with printk
is that you never know which one of the CPUs will do the printing job at
the end; any guess there is highly unreliable. but, assuming that you have
executed that oom-kill reproducer many times before the patch in question,
do you have any rough numbers to compare how many seconds it used to take
to dump_stack all of the tasks?



hm, a bit puzzled: how much does OOM add to the cond_resched() latency here?
is it because at least some of the tasks we switch_to can do direct reclaim,
and thus we have bigger "pauses" between call_console_drivers() iterations?
in that case, even after offloading the printk job to system_wq we will
still suffer, in some sense. the oom mutex is going to be released sooner,
that's true, but console_unlock() will still cond_resched() after every
call_console_drivers(), so the logs can potentially be printed with a huge
delay. we probably cond_resched() in console_unlock() too often.


I haven't noticed anything like that during my tests.

	-ss



> ----------
> [  460.893958] tgid=11161 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=1000
> [  463.892897] tgid=11161 cpuset=/ mems_allowed=0
> [  463.894724] CPU: 1 PID: 12346 Comm: tgid=11161 Not tainted 4.5.0-rc6-next-20160302+ #318
> [  463.897026] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  463.899841]  0000000000000286 00000000733ef955 ffff8800778a79b0 ffffffff813a2ded
> [  463.902164]  0000000000000000 ffff8800778a7be0 ffff8800778a7a50 ffffffff811c24d0
> [  463.904474]  0000000000000206 ffffffff81810c30 ffff8800778a79f0 ffffffff810bf839
> [  463.906801] Call Trace:
> [  463.908101]  [<ffffffff813a2ded>] dump_stack+0x85/0xc8
> [  463.909921]  [<ffffffff811c24d0>] dump_header+0x5b/0x3b0
> [  463.911759]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  463.913833]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
> [  463.915751]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
> [  463.918024]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
> [  463.919890]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
> [  463.921784]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
> [  463.923788]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
> [  463.925729]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
> [  463.927682]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
> [  463.929547]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
> [  463.931409]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
> [  463.933255]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
> [  463.934974]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
> [  463.936791]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
> [  463.938617]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
> [  463.940380]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
> [  463.942074]  [<ffffffff817113e8>] page_fault+0x28/0x30
> [  463.943750] Mem-Info:
> (...snipped...)
> [  554.754959] MemAlloc: tgid=11161(12346) flags=0x400040 switches=865 seq=169 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=81602
> [  554.754962] tgid=11161      R  running task        0 12346  11056 0x00000080
> [  554.754965]  ffff8800778a7838 ffff880034510000 ffff8800778b0000 ffff8800778a8000
> [  554.754966]  0000000000000091 ffffffff82a86fbc 0000000000000004 0000000000000000
> [  554.754967]  ffff8800778a7850 ffffffff8170a5dd 0000000000000296 ffff8800778a7860
> [  554.754968] Call Trace:
> [  554.754974]  [<ffffffff8170a5dd>] preempt_schedule_common+0x1f/0x42
> [  554.754975]  [<ffffffff8170a617>] _cond_resched+0x17/0x20
> [  554.754978]  [<ffffffff810d17b9>] console_unlock+0x509/0x5c0
> [  554.754979]  [<ffffffff810d1b93>] vprintk_emit+0x323/0x540
> [  554.754981]  [<ffffffff810d1f0a>] vprintk_default+0x1a/0x20
> [  554.754983]  [<ffffffff8114069e>] printk+0x58/0x6f
> [  554.754986]  [<ffffffff813aaede>] show_mem+0x1e/0xe0
> [  554.754988]  [<ffffffff811c24ec>] dump_header+0x77/0x3b0
> [  554.754990]  [<ffffffff810bf839>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  554.754991]  [<ffffffff810bf90d>] ? trace_hardirqs_on+0xd/0x10
> [  554.754993]  [<ffffffff81146f2d>] oom_kill_process+0x37d/0x570
> [  554.754995]  [<ffffffff81147366>] out_of_memory+0x1f6/0x5a0
> [  554.754996]  [<ffffffff81147424>] ? out_of_memory+0x2b4/0x5a0
> [  554.754997]  [<ffffffff8114d041>] __alloc_pages_nodemask+0xc91/0xeb0
> [  554.755000]  [<ffffffff81196726>] alloc_pages_current+0x96/0x1b0
> [  554.755001]  [<ffffffff8114188d>] __page_cache_alloc+0x12d/0x160
> [  554.755003]  [<ffffffff8114537a>] filemap_fault+0x48a/0x6a0
> [  554.755004]  [<ffffffff81145247>] ? filemap_fault+0x357/0x6a0
> [  554.755006]  [<ffffffff812b9e09>] xfs_filemap_fault+0x39/0x60
> [  554.755007]  [<ffffffff8116f19d>] __do_fault+0x6d/0x150
> [  554.755009]  [<ffffffff81175b1d>] handle_mm_fault+0xecd/0x1800
> [  554.755010]  [<ffffffff81174ca3>] ? handle_mm_fault+0x53/0x1800
> [  554.755012]  [<ffffffff8105a796>] __do_page_fault+0x1e6/0x520
> [  554.755013]  [<ffffffff8105ab00>] do_page_fault+0x30/0x80
> [  554.755015]  [<ffffffff817113e8>] page_fault+0x28/0x30
> ----------
> 
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y
> 



* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 13:38 ` Petr Mladek
@ 2016-03-02 14:11   ` Tetsuo Handa
  2016-03-02 15:21     ` Petr Mladek
  2016-03-02 14:34   ` Sergey Senozhatsky
  1 sibling, 1 reply; 11+ messages in thread
From: Tetsuo Handa @ 2016-03-02 14:11 UTC (permalink / raw)
  To: pmladek
  Cc: sergey.senozhatsky, jack, tj, kyle, davej, calvinowens, akpm,
	linux-mm, mhocko

Petr Mladek wrote:
> On Wed 2016-03-02 21:01:03, Tetsuo Handa wrote:
> > I have a question about "printk: set may_schedule for some of
> > console_trylock() callers" in linux-next.git.
> > 
> > I'm trying to dump information of all threads which might be relevant
> > to stalling inside memory allocator. But it seems to me that since this
> > patch changed to allow calling cond_resched() from printk() if it is
> > safe to do so, it is now possible that the thread which invoked the OOM
> > killer can sleep for minutes with the oom_lock mutex held when my dump is
> > in progress. I want to release oom_lock mutex as soon as possible so
> > that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> > their allocations.
> > 
> > So, how can I prevent printk() triggered by out_of_memory() from sleeping
> > for minutes with oom_lock mutex held? Guard it with preempt_disable() /
> > preempt_enable() ? Guard it with rcu_read_lock() / rcu_read_unlock() ? 
> >
> 
> preempt_disable() / preempt_enable() would do the job.

I see. Thank you.

> The question is where to put it. If you are concerned about
> the delay, you might want to disable preemption around
> the whole locked area, so that it works reasonable also
> in the preemptive kernel.
> 

We had a similar problem in the past. I'll again propose
http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp .

> I am looking forward to have the console printing offloaded
> into the workqueues. Then printk() will become consistently
> "fast" operation and will cause less surprises like this.
> 

That's good news. I was wishing for a dedicated kernel thread which
does the printk() output. While at it, I'd also ask for an API which
waits for the printk buffer to be flushed (something like below), so that
a watchdog thread which might dump thousands of threads from a sleepable
context (like my dump) can avoid "** XXX printk messages dropped **"
messages.

----------
diff --git a/include/linux/console.h b/include/linux/console.h
index ea731af..11e936c 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ extern int unregister_console(struct console *);
 extern struct console *console_drivers;
 extern void console_lock(void);
 extern int console_trylock(void);
+extern void wait_console_flushed(unsigned long timeout);
 extern void console_unlock(void);
 extern void console_conditional_schedule(void);
 extern void console_unblank(void);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9917f69..2eb60df 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -121,6 +121,15 @@ static int __down_trylock_console_sem(unsigned long ip)
 	up(&console_sem);\
 } while (0)
 
+static int __down_timeout_console_sem(unsigned long timeout, unsigned long ip)
+{
+	if (down_timeout(&console_sem, timeout))
+		return 1;
+	mutex_acquire(&console_lock_dep_map, 0, 1, ip);
+	return 0;
+}
+#define down_timeout_console_sem(timeout) __down_timeout_console_sem((timeout), _RET_IP_)
+
 /*
  * This is used for debugging the mess that is the VT code by
  * keeping track if we have the console semaphore held. It's
@@ -2125,6 +2134,21 @@ int console_trylock(void)
 }
 EXPORT_SYMBOL(console_trylock);
 
+void wait_console_flushed(unsigned long timeout)
+{
+	might_sleep();
+
+	if (down_timeout_console_sem(timeout))
+		return;
+	if (console_suspended) {
+		up_console_sem();
+		return;
+	}
+	console_locked = 1;
+	console_may_schedule = 1;
+	console_unlock();
+}
+
 int is_console_locked(void)
 {
 	return console_locked;
----------

> Best Regards,
> Petr
> 



* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 13:38 ` Petr Mladek
  2016-03-02 14:11   ` Tetsuo Handa
@ 2016-03-02 14:34   ` Sergey Senozhatsky
  2016-03-02 14:56     ` Petr Mladek
  2016-03-02 16:04     ` Jan Kara
  1 sibling, 2 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-03-02 14:34 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Tetsuo Handa, sergey.senozhatsky, jack, tj, kyle, davej,
	calvinowens, akpm, linux-mm, mhocko

On (03/02/16 14:38), Petr Mladek wrote:
[..]
> > 
> > CONFIG_PREEMPT_NONE=y
> > # CONFIG_PREEMPT_VOLUNTARY is not set
> > # CONFIG_PREEMPT is not set
> > CONFIG_PREEMPT_COUNT=y
> 
> preempt_disable() / preempt_enable() would do the job.
> The question is where to put it. If you are concerned about
> the delay, you might want to disable preemption around
> the whole locked area, so that it works reasonable also
> in the preemptive kernel.

another question is why cond_resched() is suddenly so expensive.
my guess is that it's because of OOM, so we switch to tasks that
potentially do direct reclaim, etc. if so, then even offloaded printk
will take a significant amount of time to print the logs to the consoles,
just because it does cond_resched() after every call_console_drivers().


> I am looking forward to have the console printing offloaded
> into the workqueues. Then printk() will become consistently
> "fast" operation and will cause less surprises like this.

I'm all for it. I need this rework badly. If Jan is too busy at
the moment, which I surely can understand, then I'll be happy to
help (i.e., pick up the patches; please, don't get me wrong).

	-ss



* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 14:34   ` Sergey Senozhatsky
@ 2016-03-02 14:56     ` Petr Mladek
  2016-03-02 16:04     ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Petr Mladek @ 2016-03-02 14:56 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, jack, tj, kyle, davej, calvinowens, akpm, linux-mm, mhocko

On Wed 2016-03-02 23:34:15, Sergey Senozhatsky wrote:
> On (03/02/16 14:38), Petr Mladek wrote:
> [..]
> > > 
> > > CONFIG_PREEMPT_NONE=y
> > > # CONFIG_PREEMPT_VOLUNTARY is not set
> > > # CONFIG_PREEMPT is not set
> > > CONFIG_PREEMPT_COUNT=y
> > 
> > preempt_disable() / preempt_enable() would do the job.
> > The question is where to put it. If you are concerned about
> > the delay, you might want to disable preemption around
> > the whole locked area, so that it works reasonable also
> > in the preemptive kernel.
> 
> another question is why cond_resched() is suddenly so expensive?

My guess is that nobody called cond_resched() in this OOM path on
non-preemptible kernel before.

> my guess is because of OOM, so we switch to tasks that potentially
> do direct reclaims, etc. if so, then even offloaded printk will take
> a significant amount of time to print the logs to the consoles; just
> because it does cond_resched() after every call_console_drivers().

IMHO, calling cond_resched() is just an offer that the process
is ready to be rescheduled at this point. It will reschedule only
if the process is over its dedicated time slice or if a higher
priority task has appeared in the run queue. IMHO, it is perfectly
fine to call it often.
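
Roughly, on a !CONFIG_PREEMPT kernel with CONFIG_PREEMPT_COUNT the
helper reduces to something like this (condensed from
kernel/sched/core.c; exact details vary by kernel version):

----------
int __sched _cond_resched(void)
{
	/* cheap check: TIF_NEED_RESCHED set and preempt_count() == 0 */
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
	return 0;
}
----------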

Best Regards,
Petr



* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 14:04 ` Sergey Senozhatsky
@ 2016-03-02 15:02   ` Tetsuo Handa
  2016-03-02 15:29     ` Sergey Senozhatsky
  0 siblings, 1 reply; 11+ messages in thread
From: Tetsuo Handa @ 2016-03-02 15:02 UTC (permalink / raw)
  To: sergey.senozhatsky
  Cc: pmladek, jack, tj, kyle, davej, calvinowens, akpm, linux-mm

Sergey Senozhatsky wrote:
> Hello Tetsuo,
> 
> On (03/02/16 21:01), Tetsuo Handa wrote:
> > I'm trying to dump information of all threads which might be relevant
> > to stalling inside memory allocator. But it seems to me that since this
> > patch changed to allow calling cond_resched() from printk() if it is
> > safe to do so, it is now possible that the thread which invoked the OOM
> > killer can sleep for minutes with the oom_lock mutex held when my dump is
> > in progress. I want to release oom_lock mutex as soon as possible so
> > that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> > their allocations.
> 
> correct, otherwise chances are that you will softlockup or RCU stall
> the system -- console_unlock() does not stop until the printk logbuf
> has data in it (which can be appended concurrently from other CPUs).

Did you mean:

  console_unlock() does not stop until the printk logbuf has no more data
  in it (but data can be appended to the logbuf concurrently from other CPUs).

> this fact was the motivation for cond_resched() in console_unlock() in
> the first place and for the later patch that permit to some of printk
> callers to do the same.
> 
> > So, how can I prevent printk() triggered by out_of_memory() from sleeping
> > for minutes with oom_lock mutex held? Guard it with preempt_disable() /
> > preempt_enable() ? Guard it with rcu_read_lock() / rcu_read_unlock() ? 
> 
> correct, both preempt_disable() and rcu_read_lock() will certainly forbid
> cond_reshed() in printk()->console_unlock(), but it will also increase the
> softlockup or/and rcu stall probability, depending on how many threads you
> have and how many stack traces you need to print. can we do dirty things to
> the frequency of cond_resched()? for example, do cond_resched() not after
> every printk(), but every LOG_LINE_MAX chars.
> 

I think the caller of printk() should handle it. If an API for waiting for
the logbuf to be flushed exists, I'm happy to use it.

> several questions,
> do you use some sort of oom-kill reproducer? ... the problem with printk
> is that you never know which once of the CPUs will do the printing job at
> the end; any guess there is highly unreliable. but, assuming that you have
> executed that oom-kill reproducer many times before the patch in question
> do you have any rough numbers to compare how many seconds it used to take
> to dump_stack all of the tasks?

Yes, I'm using a stress tester which creates thousands of threads
(shown below). My dump (shown at the bottom) is intended for finding
bugs in corner cases, and is written to wait a bit after each thread
so that it does not hold the RCU lock for too long even if there
are thousands of threads to dump.
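The restart trick the dump below relies on can be shown in isolation: each pending task carries a flag, so the walk can drop the RCU lock, sleep, and start over from the head without reporting anything twice. A minimal userspace model (no real RCU; the names are made up):

```c
#include <assert.h>

#define NITEMS 8
static int pending[NITEMS];	/* plays the role of p->memalloc.type */

/* Walk all items; "restart_after" simulates a restart forced because a
 * task went away while the lock was dropped. Already-reported items
 * have their flag cleared, so the restarted walk skips them. */
static int report_all(int restart_after)
{
	int reported = 0, i;
restart:
	for (i = 0; i < NITEMS; i++) {
		if (!pending[i])
			continue;
		pending[i] = 0;
		reported++;		/* stands in for sched_show_task() */
		if (reported == restart_after)
			goto restart;	/* pid_alive() failed: start over */
	}
	return reported;
}
```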

(This is just an example. There are many different versions.)
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <sched.h>
#include <sys/prctl.h>

static char stop = 0;
static int fd = EOF;

static void sigcld_handler(int unused)
{
	stop = 1;
}

static int memory_eater(void *unused)
{
	char *buf = NULL;
	unsigned long size = 0;
	char c;
	read(fd, &c, 1);
	while (1) {
		char *tmp = realloc(buf, size + 4096);
		if (!tmp)
			break;
		buf = tmp;
		buf[size] = 0;
		size += 4096;
	}
	pause();
	return 0;
}

int main(int argc, char *argv[])
{
	int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size;
	int i;
	if (chdir("/tmp"))
		return 1;
	if (pipe(pipe_fd))
		return 1;
	fd = pipe_fd[0];
	signal(SIGCLD, sigcld_handler);
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			char *stack = malloc(4096);
			char from[128] = { };
			char to[128] = { };
			const pid_t pid = getpid();
			unsigned char prev = 0;
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			close(pipe_fd[1]);
			snprintf(from, sizeof(from), "tgid=%u", pid);
			prctl(PR_SET_NAME, (unsigned long) from, 0, 0, 0);
			srand(pid);
			sleep(2);
			snprintf(from, sizeof(from), "file.%u-0", pid);
			fd = open(from, O_WRONLY | O_CREAT, 0600);
			if (fd == EOF)
				_exit(1);
			if (clone(memory_eater, stack + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
				_exit(1);
			while (1) {
				const unsigned char next = rand();
				snprintf(from, sizeof(from), "file.%u-%u", pid, prev);
				snprintf(to, sizeof(to), "file.%u-%u", pid, next);
				prev = next;
				rename(from, to);
				write(fd, "", 1);
			}
			_exit(0);
		}
	}
	close(pipe_fd[0]);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(4);
	close(pipe_fd[1]);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096) {
		buf[i] = 0;
		if (stop)
			break;
	}
	pause();
	return 0;
}
----------------------------------------

(This is today's version. I'm trying to somehow avoid
"** XXX printk messages dropped **" messages.)
----------------------------------------
diff --git a/include/linux/console.h b/include/linux/console.h
index ea731af..11e936c 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ extern int unregister_console(struct console *);
 extern struct console *console_drivers;
 extern void console_lock(void);
 extern int console_trylock(void);
+extern void wait_console_flushed(unsigned long timeout);
 extern void console_unlock(void);
 extern void console_conditional_schedule(void);
 extern void console_unblank(void);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..5647c5a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -113,6 +113,16 @@ static inline bool task_will_free_mem(struct task_struct *task)
 		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
 }
 
+extern unsigned int out_of_memory_count;
+extern unsigned int no_out_of_memory_count;
+
+static inline void set_memalloc_location(const u8 location)
+{
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	current->memalloc.known_locations = location;
+#endif
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0b44fbc..ae21ab2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1393,6 +1393,31 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct memalloc_info {
+	/* For locking and progress monitoring. */
+	unsigned int sequence;
+	/*
+	 * 0: not doing __GFP_RECLAIM allocation.
+	 * 1: doing non-recursive __GFP_RECLAIM allocation.
+	 * 2: doing recursive __GFP_RECLAIM allocation.
+	 */
+	u8 valid;
+	/*
+	 * bit 0: Will be reported as OOM victim.
+	 * bit 1: Will be reported as dying task.
+	 * bit 2: Will be reported as stalling task.
+	 * bit 3: Will be reported as exiting task.
+	 * bit 7: Will be reported unconditionally.
+	 */
+	u8 type;
+	u8 known_locations; /* to reduce amount of traces */
+	/* Started time in jiffies as of valid == 1. */
+	unsigned long start;
+	/* Requested order and gfp flags as of valid == 1. */
+	unsigned int order;
+	gfp_t gfp;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1855,6 +1880,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct list_head oom_reaper_list;
 #endif
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	struct memalloc_info memalloc;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..7f2c230 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -9,6 +9,9 @@ extern int sysctl_hung_task_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 					 void __user *buffer,
 					 size_t *lenp, loff_t *ppos);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+extern unsigned long sysctl_memalloc_task_timeout_secs;
+#endif
 #else
 /* Avoid need for ifdefs elsewhere in the code */
 enum { sysctl_hung_task_timeout_secs = 0 };
diff --git a/kernel/fork.c b/kernel/fork.c
index d277e83..e7789ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1425,6 +1425,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->sequential_io_avg	= 0;
 #endif
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	p->memalloc.sequence = 0;
+#endif
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d234022..13ad212 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,8 @@
 #include <linux/export.h>
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
+#include <linux/oom.h> /* out_of_memory_count */
+#include <linux/console.h> /* wait_console_flushed() */
 #include <trace/events/sched.h>
 
 /*
@@ -72,6 +74,248 @@ static struct notifier_block panic_block = {
 	.notifier_call = hung_task_panic,
 };
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_timeout_secs =
+	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+	struct task_struct *g, *p;
+	long t;
+	unsigned long delta;
+
+	/* timeout of 0 will disable the watchdog */
+	if (!timeout)
+		return MAX_SCHEDULE_TIMEOUT;
+	/* At least wait for timeout duration. */
+	t = last_checked - jiffies + timeout * HZ;
+	if (t > 0)
+		return t;
+	/* Calculate how long to wait more. */
+	t = timeout * HZ;
+	delta = t - jiffies;
+
+	/*
+	 * We might see outdated values in "struct memalloc_info" here.
+	 * We will recheck later using is_stalling_task().
+	 */
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (likely(!p->memalloc.valid))
+			continue;
+		t = min_t(long, t, p->memalloc.start + delta);
+		if (unlikely(t <= 0))
+			goto stalling;
+	}
+ stalling:
+	rcu_read_unlock();
+	preempt_enable();
+	return t;
+}
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task:   A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+			     const unsigned long expire)
+{
+	const struct memalloc_info *m = &task->memalloc;
+
+	/*
+	 * If start_memalloc_timer() is updating "struct memalloc_info" now,
+	 * we can ignore it because timeout jiffies cannot be expired as soon
+	 * as updating it completes.
+	 */
+	if (!m->valid || (m->sequence & 1))
+		return false;
+	smp_rmb(); /* Block start_memalloc_timer(). */
+	memalloc.start = m->start;
+	memalloc.order = m->order;
+	memalloc.gfp = m->gfp;
+	smp_rmb(); /* Unblock start_memalloc_timer(). */
+	memalloc.sequence = m->sequence;
+	/*
+	 * If start_memalloc_timer() started updating it while we read it,
+	 * we can ignore it for the same reason.
+	 */
+	if (!m->valid || (memalloc.sequence & 1))
+		return false;
+	/* This is a valid "struct memalloc_info". Check for timeout. */
+	return time_after_eq(expire, memalloc.start);
+}
+
+/* Check for memory allocation stalls. */
+static void check_memalloc_stalling_tasks(unsigned long timeout)
+{
+	char buf[256];
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long expire;
+	unsigned int sigkill_pending;
+	unsigned int exiting_tasks;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+	cond_resched();
+	now = jiffies;
+	/*
+	 * Report tasks that stalled for more than half of timeout duration
+	 * because such tasks might be correlated with tasks that already
+	 * stalled for full timeout duration.
+	 */
+	expire = now - timeout * (HZ / 2);
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	exiting_tasks = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		u8 type = 0;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		p->memalloc.type = type;
+	}
+	rcu_read_unlock();
+	preempt_enable_no_resched();
+	if (!stalling_tasks)
+		return;
+	wait_console_flushed(HZ);
+	cond_resched();
+	/* Report stalling tasks, dying and victim tasks. */
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u/%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, out_of_memory_count,
+		no_out_of_memory_count);
+	cond_resched();
+	sigkill_pending = 0;
+	exiting_tasks = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+ restart_report:
+	for_each_process_thread(g, p) {
+		bool can_cont;
+		u8 type;
+
+		if (likely(!p->memalloc.type))
+			continue;
+		p->memalloc.type = 0;
+		/* Recheck in case state changed meanwhile. */
+		type = 0;
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+			snprintf(buf, sizeof(buf),
+				 " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu",
+				 memalloc.sequence >> 1, memalloc.gfp, &memalloc.gfp,
+				 memalloc.order, now - memalloc.start);
+		} else {
+			buf[0] = '\0';
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		if (unlikely(!type))
+			continue;
+		/*
+		 * Victim tasks get pending SIGKILL removed before arriving at
+		 * do_exit(). Therefore, print " exiting" instead of " dying".
+		 */
+		pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n", p->comm,
+			p->pid, p->flags, p->nvcsw + p->nivcsw, buf,
+			(p->state & TASK_UNINTERRUPTIBLE) ?
+			" uninterruptible" : "",
+			(type & 8) ? " exiting" : "",
+			(type & 2) ? " dying" : "",
+			(type & 1) ? " victim" : "");
+		switch (p->memalloc.known_locations) {
+		case 1:
+			pr_warn("Call Trace: looping inside shrink_inactive_list()\n");
+			break;
+		case 2:
+			pr_warn("Call Trace: waiting for oom_lock\n");
+			break;
+		default:
+			sched_show_task(p);
+		}
+		/*
+		 * Since there could be thousands of tasks to report, we always
+		 * call cond_resched() after each report, in order to avoid RCU
+		 * stalls.
+		 *
+		 * Since not yet reported tasks have p->memalloc.type > 0, we
+		 * can simply restart this loop in case "g" or "p" went away.
+		 */
+		get_task_struct(g);
+		get_task_struct(p);
+		rcu_read_unlock();
+		preempt_enable_no_resched();
+		wait_console_flushed(HZ / 10);
+		cond_resched();
+		preempt_disable();
+		rcu_read_lock();
+		can_cont = pid_alive(g) && pid_alive(p);
+		put_task_struct(p);
+		put_task_struct(g);
+		if (!can_cont)
+			goto restart_report;
+	}
+	rcu_read_unlock();
+	preempt_enable_no_resched();
+	cond_resched();
+	/* Show memory information. (SysRq-m) */
+	show_mem(0);
+	/* Show workqueue state. */
+	show_workqueue_state();
+	/* Show lock information. (SysRq-d) */
+	debug_show_all_locks();
+	wait_console_flushed(HZ / 10);
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u/%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, out_of_memory_count,
+		no_out_of_memory_count);
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -227,20 +471,35 @@ EXPORT_SYMBOL_GPL(reset_hung_task_detector);
 static int watchdog(void *dummy)
 {
 	unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	unsigned long stall_last_checked = hung_last_checked;
+#endif
 
 	set_user_nice(current, 0);
 
 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;
 		long t = hung_timeout_jiffies(hung_last_checked, timeout);
-
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+		unsigned long timeout2 = sysctl_memalloc_task_timeout_secs;
+		long t2 = memalloc_timeout_jiffies(stall_last_checked,
+						   timeout2);
+
+		if (t2 <= 0) {
+			check_memalloc_stalling_tasks(timeout2);
+			stall_last_checked = jiffies;
+			continue;
+		}
+#else
+		long t2 = t;
+#endif
 		if (t <= 0) {
 			if (!atomic_xchg(&reset_hung_task, 0))
 				check_hung_uninterruptible_tasks(timeout);
 			hung_last_checked = jiffies;
 			continue;
 		}
-		schedule_timeout_interruptible(t);
+		schedule_timeout_interruptible(min(t, t2));
 	}
 
 	return 0;
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9917f69..2eb60df 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -121,6 +121,15 @@ static int __down_trylock_console_sem(unsigned long ip)
 	up(&console_sem);\
 } while (0)
 
+static int __down_timeout_console_sem(unsigned long timeout, unsigned long ip)
+{
+	if (down_timeout(&console_sem, timeout))
+		return 1;
+	mutex_acquire(&console_lock_dep_map, 0, 1, ip);
+	return 0;
+}
+#define down_timeout_console_sem(timeout) __down_timeout_console_sem((timeout), _RET_IP_)
+
 /*
  * This is used for debugging the mess that is the VT code by
  * keeping track if we have the console semaphore held. It's
@@ -2125,6 +2134,21 @@ int console_trylock(void)
 }
 EXPORT_SYMBOL(console_trylock);
 
+void wait_console_flushed(unsigned long timeout)
+{
+	might_sleep();
+
+	if (down_timeout_console_sem(timeout))
+		return;
+	if (console_suspended) {
+		up_console_sem();
+		return;
+	}
+	console_locked = 1;
+	console_may_schedule = 1;
+	console_unlock();
+}
+
 int is_console_locked(void)
 {
 	return console_locked;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 96ec234..8bc1c5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1073,6 +1073,16 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	{
+		.procname	= "memalloc_task_timeout_secs",
+		.data		= &sysctl_memalloc_task_timeout_secs,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dohung_task_timeout_secs,
+		.extra2		= &hung_task_timeout_max,
+	},
+#endif
 #endif
 #ifdef CONFIG_COMPAT
 	{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4a8bfe2..8bbc655 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -864,6 +864,30 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config DETECT_MEMALLOC_STALL_TASK
+	bool "Detect tasks stalling inside memory allocator"
+	default n
+	depends on DETECT_HUNG_TASK
+	help
+	  This option emits warning messages and traces when memory
+	  allocation requests are stalling, in order to catch unexplained
+	  hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+	int "Default timeout for stalling task detection (in seconds)"
+	depends on DETECT_MEMALLOC_STALL_TASK
+	default 10
+	help
+	  This option controls the default timeout (in seconds) used
+	  to determine when a task has become non-responsive and should
+	  be considered stalling inside memory allocator.
+
+	  It can be adjusted at runtime via the kernel.memalloc_task_timeout_secs
+	  sysctl or by writing a value to
+	  /proc/sys/kernel/memalloc_task_timeout_secs.
+
+	  A timeout of 0 disables the check. The default is 10 seconds.
+
 endmenu # "Debug lockups and hangs"
 
 config PANIC_ON_OOPS
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5d5eca9..bb5ea86 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+unsigned int out_of_memory_count;
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -855,6 +857,7 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	out_of_memory_count++;
 	if (oom_killer_disabled)
 		return false;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894..1529ccf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2806,6 +2806,8 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 		show_mem(filter);
 }
 
+unsigned int no_out_of_memory_count;
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -2826,7 +2828,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
+		set_memalloc_location(2);
 		schedule_timeout_uninterruptible(1);
+		set_memalloc_location(0);
 		return NULL;
 	}
 
@@ -2862,6 +2866,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			 * enter a quiescent state during suspend.
 			 */
 			*did_some_progress = !oom_killer_disabled;
+			no_out_of_memory_count++;
 			goto out;
 		}
 		if (pm_suspended_storage())
@@ -3399,6 +3404,37 @@ got_pg:
 	return page;
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_RECLAIM))
+		return;
+	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
+	if (!m->valid) {
+		m->sequence++;
+		smp_wmb(); /* Block is_stalling_task(). */
+		m->start = jiffies;
+		m->order = order;
+		m->gfp = gfp_mask;
+		smp_wmb(); /* Unblock is_stalling_task(). */
+		m->sequence++;
+	}
+	m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_RECLAIM)
+		current->memalloc.valid--;
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3466,7 +3502,9 @@ retry_cpuset:
 		alloc_mask = memalloc_noio_flags(gfp_mask);
 		ac.spread_dirty_pages = false;
 
+		start_memalloc_timer(alloc_mask, order);
 		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+		stop_memalloc_timer(alloc_mask);
 	}
 
 	if (kmemcheck_enabled && page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39e90e2..1231dcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1581,14 +1581,31 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	unsigned char counter = 0;
 
+	set_memalloc_location(1);
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current)) {
+			set_memalloc_location(0);
 			return SWAP_CLUSTER_MAX;
+		}
+		if (!++counter) {
+			if (file)
+				printk(KERN_WARNING "zone=%s NR_INACTIVE_FILE=%lu NR_ISOLATED_FILE=%lu\n",
+				       zone->name,
+				       zone_page_state(zone, NR_INACTIVE_FILE),
+				       zone_page_state(zone, NR_ISOLATED_FILE));
+			else
+				printk(KERN_WARNING "zone=%s NR_INACTIVE_ANON=%lu NR_ISOLATED_ANON=%lu\n",
+				       zone->name,
+				       zone_page_state(zone, NR_INACTIVE_ANON),
+				       zone_page_state(zone, NR_ISOLATED_ANON));
+		}
 	}
+	set_memalloc_location(0);
 
 	lru_add_drain();
 
----------------------------------------
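The lockless handshake between start_memalloc_timer() and is_stalling_task() in the patch above is a sequence-counter (seqcount-like) pattern: the writer makes the counter odd while updating and even when done, and the reader discards any snapshot taken while the counter was odd or changed under it. A stripped-down userspace model of that pattern (memory barriers elided because this model is single-threaded):

```c
#include <assert.h>

struct snapshot {
	unsigned int sequence;	/* odd while an update is in flight */
	unsigned long start;
};

static void writer_update(struct snapshot *s, unsigned long start)
{
	s->sequence++;		/* now odd: update in progress */
	s->start = start;
	s->sequence++;		/* now even: update complete */
}

/* Returns 1 and fills *out only if a consistent snapshot was taken. */
static int reader_snapshot(const struct snapshot *s, unsigned long *out)
{
	unsigned int seq = s->sequence;

	if (seq & 1)
		return 0;	/* writer is mid-update, give up */
	*out = s->start;
	if (s->sequence != seq)
		return 0;	/* writer raced with us, discard */
	return 1;
}
```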

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 14:11   ` Tetsuo Handa
@ 2016-03-02 15:21     ` Petr Mladek
  0 siblings, 0 replies; 11+ messages in thread
From: Petr Mladek @ 2016-03-02 15:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky, jack, tj, kyle, davej, calvinowens, akpm,
	linux-mm, mhocko

On Wed 2016-03-02 23:11:30, Tetsuo Handa wrote:
> That's good news. I was wishing there were a dedicated kernel
> thread which does the printk() output. While at it, I'd ask for an API
> which waits for the printk buffer to be flushed (something like below) so
> that a watchdog thread which might dump thousands of threads from a
> sleepable context (like my dump) can avoid "** XXX printk messages
> dropped **" messages.
>
> ----------
> diff --git a/include/linux/console.h b/include/linux/console.h
> index ea731af..11e936c 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -147,6 +147,7 @@ extern int unregister_console(struct console *);
>  extern struct console *console_drivers;
>  extern void console_lock(void);
>  extern int console_trylock(void);
> +extern void wait_console_flushed(unsigned long timeout);
>  extern void console_unlock(void);
>  extern void console_conditional_schedule(void);
>  extern void console_unblank(void);
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 9917f69..2eb60df 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -121,6 +121,15 @@ static int __down_trylock_console_sem(unsigned long ip)
>  	up(&console_sem);\
>  } while (0)
>  
> +static int __down_timeout_console_sem(unsigned long timeout, unsigned long ip)
> +{
> +	if (down_timeout(&console_sem, timeout))
> +		return 1;
> +	mutex_acquire(&console_lock_dep_map, 0, 1, ip);
> +	return 0;
> +}
> +#define down_timeout_console_sem(timeout) __down_timeout_console_sem((timeout), _RET_IP_)
> +
>  /*
>   * This is used for debugging the mess that is the VT code by
>   * keeping track if we have the console semaphore held. It's
> @@ -2125,6 +2134,21 @@ int console_trylock(void)
>  }
>  EXPORT_SYMBOL(console_trylock);
>  
> +void wait_console_flushed(unsigned long timeout)
> +{
> +	might_sleep();
> +
> +	if (down_timeout_console_sem(timeout))
> +		return;
> +	if (console_suspended) {
> +		up_console_sem();
> +		return;
> +	}
> +	console_locked = 1;
> +	console_may_schedule = 1;
> +	console_unlock();
> +}

This tries to take over the responsibility for printing to the
console. I would personally solve this with a wait queue:
console_unlock() could wake up all waiters once the logbuf is empty.
This will also work when the console output is offloaded to
a workqueue.
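That wait-queue idea could look roughly like the following userspace sketch, with a pthread condition variable standing in for a kernel wait queue (every name here is illustrative, not proposed API):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flushed = PTHREAD_COND_INITIALIZER;
static bool logbuf_empty = true;

/* What console_unlock() would do once it has drained the buffer:
 * mark it empty and wake everybody waiting for a flush. */
static void *drain_and_signal(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	logbuf_empty = true;
	pthread_cond_broadcast(&flushed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* The waiter never takes over printing; it just sleeps until the
 * printing side reports the buffer empty. */
static void wait_console_flushed_sketch(void)
{
	pthread_mutex_lock(&lock);
	while (!logbuf_empty)
		pthread_cond_wait(&flushed, &lock);
	pthread_mutex_unlock(&lock);
}
```

Unlike the wait_console_flushed() in the patch, the waiter here can never end up doing the console printing itself.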

But there still might be dropped messages if there is a flood
of them from another process. Note that even userspace could push
messages into the kernel ring buffer via /dev/kmsg. We need
to be careful against DOS attacks.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 15:02   ` Tetsuo Handa
@ 2016-03-02 15:29     ` Sergey Senozhatsky
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-03-02 15:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky, pmladek, jack, tj, kyle, davej, calvinowens,
	akpm, linux-mm

On (03/03/16 00:02), Tetsuo Handa wrote:
> > On (03/02/16 21:01), Tetsuo Handa wrote:
> > > I'm trying to dump information of all threads which might be relevant
> > > to stalling inside memory allocator. But it seems to me that since this
> > > patch changed to allow calling cond_resched() from printk() if it is
> > > safe to do so, it is now possible that the thread which invoked the OOM
> > > killer can sleep for minutes with the oom_lock mutex held when my dump is
> > > in progress. I want to release oom_lock mutex as soon as possible so
> > > that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> > > their allocations.
> > 
> > correct, otherwise chances are that you will softlockup or RCU stall
> > the system -- console_unlock() does not stop until the printk logbuf
> > has data in it (which can be appended concurrently from other CPUs).
> 
> Did you mean:
> 
>   console_unlock() does not stop until the printk logbuf has no more data
>   in it (but data can be appended to the logbuf concurrently from other CPUs).
> 

yes, I did. sorry for the misleading/wrong wording.

schematically:

CPU0                                    CPU1 - CPUX
printk
 preempt_disable
  console_lock                          printk only adds data to logbuf
                                        printk
  console_unlock()                      printk
   while (logbuf not empty)              ...
    call_console_drivers()              printk
  up console_sem
preempt_enable                          printk
                                         preempt_disable
                                          console_lock     #on CPU1
                                          console_unlock
                                           while (logbuf not empty)
                                            ...

and so on.

so that's why in some cases (when there is a concurrent task (or tasks)
doing printk() more or less frequently) doing something like:

= printk in softirq/IRQ
= or spin_lock; printk; spin_unlock
= or preempt_disable; printk; preempt_enable
= or rcu_read_lock; printk; rcu_read_unlock
= or local irq save; printk; local irq restore
= etc.

can lead to negative consequences -- console_unlock() only cares about
the data it can print. cond_resched() cannot fix all of the cases above;
Jan Kara's patch set can.

if you say that you can guarantee that printk will not loop for too long
(sorry, I haven't looked yet at the code you have attached) then yes,
preempt_disable or rcu read lock should do the trick for you.

	-ss

[..]
> > several questions,
> > do you use some sort of oom-kill reproducer? ... the problem with printk
> > is that you never know which one of the CPUs will do the printing job at
> > the end; any guess there is highly unreliable. but, assuming that you have
> > executed that oom-kill reproducer many times before the patch in question
> > do you have any rough numbers to compare how many seconds it used to take
> > to dump_stack all of the tasks?
> 
> Yes, I'm using a stress tester which creates thousands of threads
> (shown below). My dump (shown at the bottom) is intended for finding
> bugs in corner cases, and is written to wait a bit after each thread
> so that it does not hold the RCU lock for too long even if there
> are thousands of threads to dump.
> 
> (This is just an example. There are many different versions.)
> ----------------------------------------
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <sched.h>
> #include <sys/prctl.h>
> 
> static char stop = 0;
> static int fd = EOF;
> 
> static void sigcld_handler(int unused)
> {
> 	stop = 1;
> }
> 
> static int memory_eater(void *unused)
> {
> 	char *buf = NULL;
> 	unsigned long size = 0;
> 	char c;
> 	read(fd, &c, 1);
> 	while (1) {
> 		char *tmp = realloc(buf, size + 4096);
> 		if (!tmp)
> 			break;
> 		buf = tmp;
> 		buf[size] = 0;
> 		size += 4096;
> 	}
> 	pause();
> 	return 0;
> }
> 
> int main(int argc, char *argv[])
> {
> 	int pipe_fd[2] = { EOF, EOF };
> 	char *buf = NULL;
> 	unsigned long size;
> 	int i;
> 	if (chdir("/tmp"))
> 		return 1;
> 	if (pipe(pipe_fd))
> 		return 1;
> 	fd = pipe_fd[0];
> 	signal(SIGCLD, sigcld_handler);
> 	for (i = 0; i < 1024; i++) {
> 		if (fork() == 0) {
> 			char *stack = malloc(4096);
> 			char from[128] = { };
> 			char to[128] = { };
> 			const pid_t pid = getpid();
> 			unsigned char prev = 0;
> 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> 			write(fd, "1000", 4);
> 			close(fd);
> 			close(pipe_fd[1]);
> 			snprintf(from, sizeof(from), "tgid=%u", pid);
> 			prctl(PR_SET_NAME, (unsigned long) from, 0, 0, 0);
> 			srand(pid);
> 			sleep(2);
> 			snprintf(from, sizeof(from), "file.%u-0", pid);
> 			fd = open(from, O_WRONLY | O_CREAT, 0600);
> 			if (fd == EOF)
> 				_exit(1);
> 			if (clone(memory_eater, stack + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
> 				_exit(1);
> 			while (1) {
> 				const unsigned char next = rand();
> 				snprintf(from, sizeof(from), "file.%u-%u", pid, prev);
> 				snprintf(to, sizeof(to), "file.%u-%u", pid, next);
> 				prev = next;
> 				rename(from, to);
> 				write(fd, "", 1);
> 			}
> 			_exit(0);
> 		}
> 	}
> 	close(pipe_fd[0]);
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	sleep(4);
> 	close(pipe_fd[1]);
> 	/* Will cause OOM due to overcommit */
> 	for (i = 0; i < size; i += 4096) {
> 		buf[i] = 0;
> 		if (stop)
> 			break;
> 	}
> 	pause();
> 	return 0;
> }
> ----------------------------------------
> 
> (This is today's version. I'm trying to somehow avoid
> "** XXX printk messages dropped **" messages.)
> ----------------------------------------
> diff --git a/include/linux/console.h b/include/linux/console.h
> index ea731af..11e936c 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -147,6 +147,7 @@ extern int unregister_console(struct console *);
>  extern struct console *console_drivers;
>  extern void console_lock(void);
>  extern int console_trylock(void);
> +extern void wait_console_flushed(unsigned long timeout);
>  extern void console_unlock(void);
>  extern void console_conditional_schedule(void);
>  extern void console_unblank(void);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 45993b8..5647c5a 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -113,6 +113,16 @@ static inline bool task_will_free_mem(struct task_struct *task)
>  		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
>  }
>  
> +extern unsigned int out_of_memory_count;
> +extern unsigned int no_out_of_memory_count;
> +
> +static inline void set_memalloc_location(const u8 location)
> +{
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	current->memalloc.known_locations = location;
> +#endif
> +}
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0b44fbc..ae21ab2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1393,6 +1393,31 @@ struct tlbflush_unmap_batch {
>  	bool writable;
>  };
>  
> +struct memalloc_info {
> +	/* For locking and progress monitoring. */
> +	unsigned int sequence;
> +	/*
> +	 * 0: not doing __GFP_RECLAIM allocation.
> +	 * 1: doing non-recursive __GFP_RECLAIM allocation.
> +	 * 2: doing recursive __GFP_RECLAIM allocation.
> +	 */
> +	u8 valid;
> +	/*
> +	 * bit 0: Will be reported as OOM victim.
> +	 * bit 1: Will be reported as dying task.
> +	 * bit 2: Will be reported as stalling task.
> +	 * bit 3: Will be reported as exiting task.
> +	 * bit 7: Will be reported unconditionally.
> +	 */
> +	u8 type;
> +	u8 known_locations; /* to reduce amount of traces */
> +	/* Started time in jiffies as of valid == 1. */
> +	unsigned long start;
> +	/* Requested order and gfp flags as of valid == 1. */
> +	unsigned int order;
> +	gfp_t gfp;
> +};
> +
>  struct task_struct {
>  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
>  	void *stack;
> @@ -1855,6 +1880,9 @@ struct task_struct {
>  #ifdef CONFIG_MMU
>  	struct list_head oom_reaper_list;
>  #endif
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	struct memalloc_info memalloc;
> +#endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
>  /*
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 22db1e6..7f2c230 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -9,6 +9,9 @@ extern int sysctl_hung_task_warnings;
>  extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
>  					 void __user *buffer,
>  					 size_t *lenp, loff_t *ppos);
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +extern unsigned long sysctl_memalloc_task_timeout_secs;
> +#endif
>  #else
>  /* Avoid need for ifdefs elsewhere in the code */
>  enum { sysctl_hung_task_timeout_secs = 0 };
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d277e83..e7789ef 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1425,6 +1425,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	p->sequential_io_avg	= 0;
>  #endif
>  
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	p->memalloc.sequence = 0;
> +#endif
> +
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	retval = sched_fork(clone_flags, p);
>  	if (retval)
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index d234022..13ad212 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -16,6 +16,8 @@
>  #include <linux/export.h>
>  #include <linux/sysctl.h>
>  #include <linux/utsname.h>
> +#include <linux/oom.h> /* out_of_memory_count */
> +#include <linux/console.h> /* wait_console_flushed() */
>  #include <trace/events/sched.h>
>  
>  /*
> @@ -72,6 +74,248 @@ static struct notifier_block panic_block = {
>  	.notifier_call = hung_task_panic,
>  };
>  
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +/*
> + * Zero means infinite timeout - no checking done:
> + */
> +unsigned long __read_mostly sysctl_memalloc_task_timeout_secs =
> +	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
> +static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
> +
> +static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
> +{
> +	struct task_struct *g, *p;
> +	long t;
> +	unsigned long delta;
> +
> +	/* timeout of 0 will disable the watchdog */
> +	if (!timeout)
> +		return MAX_SCHEDULE_TIMEOUT;
> +	/* At least wait for timeout duration. */
> +	t = last_checked - jiffies + timeout * HZ;
> +	if (t > 0)
> +		return t;
> +	/* Calculate how long to wait more. */
> +	t = timeout * HZ;
> +	delta = t - jiffies;
> +
> +	/*
> +	 * We might see outdated values in "struct memalloc_info" here.
> +	 * We will recheck later using is_stalling_task().
> +	 */
> +	preempt_disable();
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		if (likely(!p->memalloc.valid))
> +			continue;
> +		t = min_t(long, t, p->memalloc.start + delta);
> +		if (unlikely(t <= 0))
> +			goto stalling;
> +	}
> + stalling:
> +	rcu_read_unlock();
> +	preempt_enable();
> +	return t;
> +}
> +
> +/**
> + * is_stalling_task - Check and copy a task's memalloc variable.
> + *
> + * @task:   A task to check.
> + * @expire: Timeout in jiffies.
> + *
> + * Returns true if a task is stalling, false otherwise.
> + */
> +static bool is_stalling_task(const struct task_struct *task,
> +			     const unsigned long expire)
> +{
> +	const struct memalloc_info *m = &task->memalloc;
> +
> +	/*
> +	 * If start_memalloc_timer() is updating "struct memalloc_info" now,
> +	 * we can ignore it because timeout jiffies cannot be expired as soon
> +	 * as updating it completes.
> +	 */
> +	if (!m->valid || (m->sequence & 1))
> +		return false;
> +	smp_rmb(); /* Block start_memalloc_timer(). */
> +	memalloc.start = m->start;
> +	memalloc.order = m->order;
> +	memalloc.gfp = m->gfp;
> +	smp_rmb(); /* Unblock start_memalloc_timer(). */
> +	memalloc.sequence = m->sequence;
> +	/*
> +	 * If start_memalloc_timer() started updating it while we read it,
> +	 * we can ignore it for the same reason.
> +	 */
> +	if (!m->valid || (memalloc.sequence & 1))
> +		return false;
> +	/* This is a valid "struct memalloc_info". Check for timeout. */
> +	return time_after_eq(expire, memalloc.start);
> +}
> +
> +/* Check for memory allocation stalls. */
> +static void check_memalloc_stalling_tasks(unsigned long timeout)
> +{
> +	char buf[256];
> +	struct task_struct *g, *p;
> +	unsigned long now;
> +	unsigned long expire;
> +	unsigned int sigkill_pending;
> +	unsigned int exiting_tasks;
> +	unsigned int memdie_pending;
> +	unsigned int stalling_tasks;
> +
> +	cond_resched();
> +	now = jiffies;
> +	/*
> +	 * Report tasks that stalled for more than half of timeout duration
> +	 * because such tasks might be correlated with tasks that already
> +	 * stalled for full timeout duration.
> +	 */
> +	expire = now - timeout * (HZ / 2);
> +	/* Count stalling tasks, dying and victim tasks. */
> +	sigkill_pending = 0;
> +	exiting_tasks = 0;
> +	memdie_pending = 0;
> +	stalling_tasks = 0;
> +	preempt_disable();
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		u8 type = 0;
> +
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> +			type |= 1;
> +			memdie_pending++;
> +		}
> +		if (fatal_signal_pending(p)) {
> +			type |= 2;
> +			sigkill_pending++;
> +		}
> +		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
> +			type |= 8;
> +			exiting_tasks++;
> +		}
> +		if (is_stalling_task(p, expire)) {
> +			type |= 4;
> +			stalling_tasks++;
> +		}
> +		if (p->flags & PF_KSWAPD)
> +			type |= 128;
> +		p->memalloc.type = type;
> +	}
> +	rcu_read_unlock();
> +	preempt_enable_no_resched();
> +	if (!stalling_tasks)
> +		return;
> +	wait_console_flushed(HZ);
> +	cond_resched();
> +	/* Report stalling tasks, dying and victim tasks. */
> +	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u/%u\n",
> +		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, out_of_memory_count,
> +		no_out_of_memory_count);
> +	cond_resched();
> +	sigkill_pending = 0;
> +	exiting_tasks = 0;
> +	memdie_pending = 0;
> +	stalling_tasks = 0;
> +	preempt_disable();
> +	rcu_read_lock();
> + restart_report:
> +	for_each_process_thread(g, p) {
> +		bool can_cont;
> +		u8 type;
> +
> +		if (likely(!p->memalloc.type))
> +			continue;
> +		p->memalloc.type = 0;
> +		/* Recheck in case state changed meanwhile. */
> +		type = 0;
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> +			type |= 1;
> +			memdie_pending++;
> +		}
> +		if (fatal_signal_pending(p)) {
> +			type |= 2;
> +			sigkill_pending++;
> +		}
> +		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
> +			type |= 8;
> +			exiting_tasks++;
> +		}
> +		if (is_stalling_task(p, expire)) {
> +			type |= 4;
> +			stalling_tasks++;
> +			snprintf(buf, sizeof(buf),
> +				 " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu",
> +				 memalloc.sequence >> 1, memalloc.gfp, &memalloc.gfp,
> +				 memalloc.order, now - memalloc.start);
> +		} else {
> +			buf[0] = '\0';
> +		}
> +		if (p->flags & PF_KSWAPD)
> +			type |= 128;
> +		if (unlikely(!type))
> +			continue;
> +		/*
> +		 * Victim tasks get pending SIGKILL removed before arriving at
> +		 * do_exit(). Therefore, print " exiting" instead of " dying".
> +		 */
> +		pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n", p->comm,
> +			p->pid, p->flags, p->nvcsw + p->nivcsw, buf,
> +			(p->state & TASK_UNINTERRUPTIBLE) ?
> +			" uninterruptible" : "",
> +			(type & 8) ? " exiting" : "",
> +			(type & 2) ? " dying" : "",
> +			(type & 1) ? " victim" : "");
> +		switch (p->memalloc.known_locations) {
> +		case 1:
> +			pr_warn("Call Trace: looping inside shrink_inactive_list()\n");
> +			break;
> +		case 2:
> +			pr_warn("Call Trace: waiting for oom_lock\n");
> +			break;
> +		default:
> +			sched_show_task(p);
> +		}
> +		/*
> +		 * Since there could be thousands of tasks to report, we always
> +		 * call cond_resched() after each report, in order to avoid RCU
> +		 * stalls.
> +		 *
> +		 * Since not yet reported tasks have p->memalloc.type > 0, we
> +		 * can simply restart this loop in case "g" or "p" went away.
> +		 */
> +		get_task_struct(g);
> +		get_task_struct(p);
> +		rcu_read_unlock();
> +		preempt_enable_no_resched();
> +		wait_console_flushed(HZ / 10);
> +		cond_resched();
> +		preempt_disable();
> +		rcu_read_lock();
> +		can_cont = pid_alive(g) && pid_alive(p);
> +		put_task_struct(p);
> +		put_task_struct(g);
> +		if (!can_cont)
> +			goto restart_report;
> +	}
> +	rcu_read_unlock();
> +	preempt_enable_no_resched();
> +	cond_resched();
> +	/* Show memory information. (SysRq-m) */
> +	show_mem(0);
> +	/* Show workqueue state. */
> +	show_workqueue_state();
> +	/* Show lock information. (SysRq-d) */
> +	debug_show_all_locks();
> +	wait_console_flushed(HZ / 10);
> +	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u/%u\n",
> +		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, out_of_memory_count,
> +		no_out_of_memory_count);
> +}
> +#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
> +
>  static void check_hung_task(struct task_struct *t, unsigned long timeout)
>  {
>  	unsigned long switch_count = t->nvcsw + t->nivcsw;
> @@ -227,20 +471,35 @@ EXPORT_SYMBOL_GPL(reset_hung_task_detector);
>  static int watchdog(void *dummy)
>  {
>  	unsigned long hung_last_checked = jiffies;
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	unsigned long stall_last_checked = hung_last_checked;
> +#endif
>  
>  	set_user_nice(current, 0);
>  
>  	for ( ; ; ) {
>  		unsigned long timeout = sysctl_hung_task_timeout_secs;
>  		long t = hung_timeout_jiffies(hung_last_checked, timeout);
> -
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +		unsigned long timeout2 = sysctl_memalloc_task_timeout_secs;
> +		long t2 = memalloc_timeout_jiffies(stall_last_checked,
> +						   timeout2);
> +
> +		if (t2 <= 0) {
> +			check_memalloc_stalling_tasks(timeout2);
> +			stall_last_checked = jiffies;
> +			continue;
> +		}
> +#else
> +		long t2 = t;
> +#endif
>  		if (t <= 0) {
>  			if (!atomic_xchg(&reset_hung_task, 0))
>  				check_hung_uninterruptible_tasks(timeout);
>  			hung_last_checked = jiffies;
>  			continue;
>  		}
> -		schedule_timeout_interruptible(t);
> +		schedule_timeout_interruptible(min(t, t2));
>  	}
>  
>  	return 0;
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 9917f69..2eb60df 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -121,6 +121,15 @@ static int __down_trylock_console_sem(unsigned long ip)
>  	up(&console_sem);\
>  } while (0)
>  
> +static int __down_timeout_console_sem(unsigned long timeout, unsigned long ip)
> +{
> +	if (down_timeout(&console_sem, timeout))
> +		return 1;
> +	mutex_acquire(&console_lock_dep_map, 0, 1, ip);
> +	return 0;
> +}
> +#define down_timeout_console_sem(timeout) __down_timeout_console_sem((timeout), _RET_IP_)
> +
>  /*
>   * This is used for debugging the mess that is the VT code by
>   * keeping track if we have the console semaphore held. It's
> @@ -2125,6 +2134,21 @@ int console_trylock(void)
>  }
>  EXPORT_SYMBOL(console_trylock);
>  
> +void wait_console_flushed(unsigned long timeout)
> +{
> +	might_sleep();
> +
> +	if (down_timeout_console_sem(timeout))
> +		return;
> +	if (console_suspended) {
> +		up_console_sem();
> +		return;
> +	}
> +	console_locked = 1;
> +	console_may_schedule = 1;
> +	console_unlock();
> +}
> +
>  int is_console_locked(void)
>  {
>  	return console_locked;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 96ec234..8bc1c5b 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1073,6 +1073,16 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= &neg_one,
>  	},
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	{
> +		.procname	= "memalloc_task_timeout_secs",
> +		.data		= &sysctl_memalloc_task_timeout_secs,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dohung_task_timeout_secs,
> +		.extra2		= &hung_task_timeout_max,
> +	},
> +#endif
>  #endif
>  #ifdef CONFIG_COMPAT
>  	{
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4a8bfe2..8bbc655 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -864,6 +864,30 @@ config WQ_WATCHDOG
>  	  state.  This can be configured through kernel parameter
>  	  "workqueue.watchdog_thresh" and its sysfs counterpart.
>  
> +config DETECT_MEMALLOC_STALL_TASK
> +	bool "Detect tasks stalling inside memory allocator"
> +	default n
> +	depends on DETECT_HUNG_TASK
> +	help
> +	  This option emits warning messages and traces when memory
> +	  allocation requests are stalling, in order to catch unexplained
> +	  hangups/reboots caused by memory allocation stalls.
> +
> +config DEFAULT_MEMALLOC_TASK_TIMEOUT
> +	int "Default timeout for stalling task detection (in seconds)"
> +	depends on DETECT_MEMALLOC_STALL_TASK
> +	default 10
> +	help
> +	  This option controls the default timeout (in seconds) used
> +	  to determine when a task has become non-responsive and should
> +	  be considered stalling inside memory allocator.
> +
> +	  It can be adjusted at runtime via the kernel.memalloc_task_timeout_secs
> +	  sysctl or by writing a value to
> +	  /proc/sys/kernel/memalloc_task_timeout_secs.
> +
> +	  A timeout of 0 disables the check. The default is 10 seconds.
> +
>  endmenu # "Debug lockups and hangs"
>  
>  config PANIC_ON_OOPS
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5d5eca9..bb5ea86 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -44,6 +44,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/oom.h>
>  
> +unsigned int out_of_memory_count;
> +
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
>  int sysctl_oom_dump_tasks = 1;
> @@ -855,6 +857,7 @@ bool out_of_memory(struct oom_control *oc)
>  	unsigned int uninitialized_var(points);
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
> +	out_of_memory_count++;
>  	if (oom_killer_disabled)
>  		return false;
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..1529ccf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2806,6 +2806,8 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
>  		show_mem(filter);
>  }
>  
> +unsigned int no_out_of_memory_count;
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -2826,7 +2828,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (!mutex_trylock(&oom_lock)) {
>  		*did_some_progress = 1;
> +		set_memalloc_location(2);
>  		schedule_timeout_uninterruptible(1);
> +		set_memalloc_location(0);
>  		return NULL;
>  	}
>  
> @@ -2862,6 +2866,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  			 * enter a quiescent state during suspend.
>  			 */
>  			*did_some_progress = !oom_killer_disabled;
> +			no_out_of_memory_count++;
>  			goto out;
>  		}
>  		if (pm_suspended_storage())
> @@ -3399,6 +3404,37 @@ got_pg:
>  	return page;
>  }
>  
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
> +{
> +	struct memalloc_info *m = &current->memalloc;
> +
> +	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
> +	if (!(gfp_mask & __GFP_RECLAIM))
> +		return;
> +	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
> +	if (!m->valid) {
> +		m->sequence++;
> +		smp_wmb(); /* Block is_stalling_task(). */
> +		m->start = jiffies;
> +		m->order = order;
> +		m->gfp = gfp_mask;
> +		smp_wmb(); /* Unblock is_stalling_task(). */
> +		m->sequence++;
> +	}
> +	m->valid++;
> +}
> +
> +static void stop_memalloc_timer(const gfp_t gfp_mask)
> +{
> +	if (gfp_mask & __GFP_RECLAIM)
> +		current->memalloc.valid--;
> +}
> +#else
> +#define start_memalloc_timer(gfp_mask, order) do { } while (0)
> +#define stop_memalloc_timer(gfp_mask) do { } while (0)
> +#endif
> +
>  /*
>   * This is the 'heart' of the zoned buddy allocator.
>   */
> @@ -3466,7 +3502,9 @@ retry_cpuset:
>  		alloc_mask = memalloc_noio_flags(gfp_mask);
>  		ac.spread_dirty_pages = false;
>  
> +		start_memalloc_timer(alloc_mask, order);
>  		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> +		stop_memalloc_timer(alloc_mask);
>  	}
>  
>  	if (kmemcheck_enabled && page)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 39e90e2..1231dcf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1581,14 +1581,31 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	unsigned char counter = 0;
>  
> +	set_memalloc_location(1);
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/* We are about to die and free our memory. Return now. */
> -		if (fatal_signal_pending(current))
> +		if (fatal_signal_pending(current)) {
> +			set_memalloc_location(0);
>  			return SWAP_CLUSTER_MAX;
> +		}
> +		if (!++counter) {
> +			if (file)
> +				printk(KERN_WARNING "zone=%s NR_INACTIVE_FILE=%lu NR_ISOLATED_FILE=%lu\n",
> +				       zone->name,
> +				       zone_page_state(zone, NR_INACTIVE_FILE),
> +				       zone_page_state(zone, NR_ISOLATED_FILE));
> +			else
> +				printk(KERN_WARNING "zone=%s NR_INACTIVE_ANON=%lu NR_ISOLATED_ANON=%lu\n",
> +				       zone->name,
> +				       zone_page_state(zone, NR_INACTIVE_ANON),
> +				       zone_page_state(zone, NR_ISOLATED_ANON));
> +		}
>  	}
> +	set_memalloc_location(0);
>  
>  	lru_add_drain();
>  
> ----------------------------------------
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 14:34   ` Sergey Senozhatsky
  2016-03-02 14:56     ` Petr Mladek
@ 2016-03-02 16:04     ` Jan Kara
  2016-03-03  5:41       ` Sergey Senozhatsky
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2016-03-02 16:04 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Tetsuo Handa, jack, tj, kyle, davej, calvinowens,
	akpm, linux-mm, mhocko

On Wed 02-03-16 23:34:15, Sergey Senozhatsky wrote:
> > I am looking forward to have the console printing offloaded
> > into the workqueues. Then printk() will become consistently
> > "fast" operation and will cause less surprises like this.
> 
> I'm all for it. I need this rework badly. If Jan is too busy at
> the moment, which I surely can understand, then I'll be happy to
> help ("pick up the patches". please, don't get me wrong).

So I'm rather busy with other stuff currently so if you can pick up my
patches and finish them, it would be good. I think I have addressed all the
comments you had to the previous version, except for handling the case
where all the workers are too busy - maybe using a dedicated workqueue with
a rescuer worker instead of system_wq would solve this issue.
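
(Editor's note: the dedicated-workqueue idea can be sketched roughly as below. This is not compilable as-is and the names are illustrative, but alloc_workqueue(), WQ_MEM_RECLAIM and queue_work() are the real in-kernel APIs — WQ_MEM_RECLAIM is the flag that gives a workqueue its own rescuer thread, guaranteeing forward progress even when every kworker is stuck in reclaim.)

```c
/* Sketch only: a printk-owned workqueue whose rescuer thread keeps
 * console flushing alive even when all kworkers are blocked. */
static struct workqueue_struct *printk_wq;

static void printk_flush_work_fn(struct work_struct *work);
static DECLARE_WORK(printk_flush_work, printk_flush_work_fn);

static int __init printk_offload_init(void)
{
	/* WQ_MEM_RECLAIM creates a rescuer thread for this workqueue */
	printk_wq = alloc_workqueue("printk_offload",
				    WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
	return printk_wq ? 0 : -ENOMEM;
}

/* then, in vprintk_emit(), instead of flushing synchronously: */
/*	queue_work(printk_wq, &printk_flush_work); */
```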

I've sent the current version of patches I have to you including the patch
3/3 which I use for debugging and testing whether the async printing really
helps avoiding softlockups.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How to avoid printk() delay caused by cond_resched() ?
  2016-03-02 16:04     ` Jan Kara
@ 2016-03-03  5:41       ` Sergey Senozhatsky
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03  5:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Sergey Senozhatsky, Petr Mladek, Tetsuo Handa, jack, tj, kyle,
	davej, calvinowens, akpm, linux-mm, mhocko

On (03/02/16 17:04), Jan Kara wrote:
> On Wed 02-03-16 23:34:15, Sergey Senozhatsky wrote:
> > > I am looking forward to have the console printing offloaded
> > > into the workqueues. Then printk() will become consistently
> > > "fast" operation and will cause less surprises like this.
> > 
> > I'm all for it. I need this rework badly. If Jan is too busy at
> > the moment, which I surely can understand, then I'll be happy to
> > help ("pick up the patches". please, don't get me wrong).
> 
> So I'm rather busy with other stuff currently so if you can pick up my
> patches and finish them, it would be good. I think I have addressed all the
> comments you had to the previous version, except for handling the case
> where all the workers are too busy - maybe using a dedicated workqueue with
> a rescueue worker instead of system_wq would solve this issue.
> 
> I've sent the current version of patches I have to you including the patch
> 3/3 which I use for debugging and testing whether the async printing really
> helps avoiding softlockups.

great, thank you! will take a look.

and yes, I was thinking about using printk's own workqueue with the
rescuer bit set (Petr Mladek also proposed this).

	-ss

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-03-03  5:40 UTC | newest]

Thread overview: 11+ messages
2016-03-02 12:01 How to avoid printk() delay caused by cond_resched() ? Tetsuo Handa
2016-03-02 13:38 ` Petr Mladek
2016-03-02 14:11   ` Tetsuo Handa
2016-03-02 15:21     ` Petr Mladek
2016-03-02 14:34   ` Sergey Senozhatsky
2016-03-02 14:56     ` Petr Mladek
2016-03-02 16:04     ` Jan Kara
2016-03-03  5:41       ` Sergey Senozhatsky
2016-03-02 14:04 ` Sergey Senozhatsky
2016-03-02 15:02   ` Tetsuo Handa
2016-03-02 15:29     ` Sergey Senozhatsky
