* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
       [not found] <1503921210-4603-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
@ 2017-08-28 12:10 ` Michal Hocko
  2017-08-28 17:06   ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2017-08-28 12:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, tj, Mel Gorman, Vlastimil Babka

On Mon 28-08-17 20:53:30, Tetsuo Handa wrote:
> I noticed that a drain_local_pages_wq work item got stuck for minutes despite
> being on the WQ_MEM_RECLAIM mm_percpu_wq workqueue. Tejun Heo pointed out [1]:
> 
>   The rescuer helps if the worker pool that the workqueue is associated with
>   hangs. If we have other work items actively running, e.g. for reclaim
>   on the pool, the pool isn't stalled and rescuers won't be woken up. If
>   the work items need preferential execution, they should use WQ_HIGHPRI.
> 
> Since work items on mm_percpu_wq workqueue are expected to be executed
> as soon as possible, let's use WQ_HIGHPRI. Note that even with WQ_HIGHPRI,
> up to a few seconds of delay seems to be unavoidable.

I am not sure I understand how WQ_HIGHPRI actually helps. The work item
will get served by a thread with higher priority and from a different
pool than regular WQs. But what prevents the same issue as described
above when the highprio pool gets congested? In other words, what makes
WQ_HIGHPRI less prone to long stalls when we are under a low-memory
situation and new workers cannot be allocated?

> If we do want to make
> sure that work items on mm_percpu_wq workqueue are executed without delays,
> we need to consider using kthread_workers instead of workqueue. (Or maybe
> we can somehow share one kthread by constantly manipulating its cpumask?)

Hmm, that doesn't sound like a bad idea to me. We already have a rescuer
thread that basically sits idle all the time so having a dedicated
kernel thread will not be more expensive wrt. resources. So I think this
is a more reasonable approach than playing with WQ_HIGHPRI, which smells
more like an obscure workaround than a real fix to me.
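
A minimal sketch of that direction, assuming we simply queue the existing
work on a dedicated kthread_worker (all names below are made up for
illustration, this is not a concrete proposal):

----------
#include <linux/kthread.h>
#include <linux/err.h>

/*
 * Sketch only: a dedicated worker thread is not subject to workqueue
 * concurrency management, so queued items run as soon as the thread
 * gets CPU time.
 */
static struct kthread_worker *mm_kworker;
static struct kthread_work mm_kwork;

static void mm_kwork_fn(struct kthread_work *work)
{
        /* the per-cpu drain / vmstat handling would go here */
}

static int __init mm_kworker_init(void)
{
        mm_kworker = kthread_create_worker(0, "mm_kworker");
        if (IS_ERR(mm_kworker))
                return PTR_ERR(mm_kworker);
        kthread_init_work(&mm_kwork, mm_kwork_fn);
        return 0;
}

/* queueing side, instead of queueing on mm_percpu_wq */
static void mm_kwork_queue(void)
{
        kthread_queue_work(mm_kworker, &mm_kwork);
}
----------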

> [1] http://lkml.kernel.org/r/201707111951.IHA98084.OHQtVOFJMLOSFF@I-love.SAKURA.ne.jp
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/vmstat.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e7..cb7e198 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1923,7 +1923,8 @@ void __init init_mm_internals(void)
>  {
>  	int ret __maybe_unused;
>  
> -	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
> +	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
> +				       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
>  
>  #ifdef CONFIG_SMP
>  	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 12:10 ` [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq Michal Hocko
@ 2017-08-28 17:06   ` Tejun Heo
  2017-08-28 22:15     ` Tetsuo Handa
  2017-08-29 13:33     ` Michal Hocko
  0 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2017-08-28 17:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, akpm, linux-mm, Mel Gorman, Vlastimil Babka

Hello, Michal.

On Mon, Aug 28, 2017 at 02:10:56PM +0200, Michal Hocko wrote:
> I am not sure I understand how WQ_HIGHPRI actually helps. The work item
> will get served by a thread with higher priority and from a different
> pool than regular WQs. But what prevents the same issue as described
> above when the highprio pool gets congested? In other words, what makes
> WQ_HIGHPRI less prone to long stalls when we are under a low-memory
> situation and new workers cannot be allocated?

So, the problem wasn't a new worker not getting allocated due to memory
pressure.  The rescuer can handle that.  The problem is that the regular
worker pool is occupied with something which is constantly in runnable
state - most likely writeback / reclaim, so the workqueue doesn't
schedule the other work items.

Setting WQ_HIGHPRI works because the highpri worker pool isn't likely to be
contended that way, but it might not be the best solution.  The right
thing to do would be setting WQ_CPU_INTENSIVE on the work items which
can burn a lot of CPU cycles so that they don't get in the way of
other work items (workqueue should probably trigger a warning on these
work items too).

Tetsuo, can you please try to find which work items are occupying the
worker pool for an extended period of time under memory pressure and set
WQ_CPU_INTENSIVE on them?
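
For reference, WQ_CPU_INTENSIVE is set when the workqueue is allocated, so
marking the offending work items in practice means flagging their workqueue.
A rough sketch, with a placeholder name:

----------
#include <linux/workqueue.h>
#include <linux/errno.h>

static struct workqueue_struct *example_wq;

static int __init example_wq_init(void)
{
        /*
         * Work items of a WQ_CPU_INTENSIVE workqueue do not take part in
         * concurrency management, so a long-running item does not prevent
         * the per-cpu pool from starting other work items.
         */
        example_wq = alloc_workqueue("example_cpu_intensive_wq",
                                     WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);
        return example_wq ? 0 : -ENOMEM;
}
----------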

> > If we do want to make
> > sure that work items on mm_percpu_wq workqueue are executed without delays,
> > we need to consider using kthread_workers instead of workqueue. (Or maybe
> > we can somehow share one kthread by constantly manipulating its cpumask?)
> 
> Hmm, that doesn't sound like a bad idea to me. We already have a rescuer
> thread that basically sits idle all the time so having a dedicated
> kernel thread will not be more expensive wrt. resources. So I think this
> is a more reasonable approach than playing with WQ_HIGHPRI, which smells
> more like an obscure workaround than a real fix to me.

Well, there's one rescuer in the whole system and you'd need
nr_online_cpus kthreads if you wanna avoid constant cacheline
bouncing.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 17:06   ` Tejun Heo
@ 2017-08-28 22:15     ` Tetsuo Handa
  2017-08-28 23:02       ` Tejun Heo
  2017-08-29 13:33     ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-28 22:15 UTC (permalink / raw)
  To: tj, mhocko; +Cc: akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> On Mon, Aug 28, 2017 at 02:10:56PM +0200, Michal Hocko wrote:
> > I am not sure I understand how WQ_HIGHPRI actually helps. The work item
> > will get served by a thread with higher priority and from a different
> > pool than regular WQs. But what prevents the same issue as described
> > above when the highprio pool gets congested? In other words, what makes
> > WQ_HIGHPRI less prone to long stalls when we are under a low-memory
> > situation and new workers cannot be allocated?
> 
> So, the problem wasn't a new worker not getting allocated due to memory
> pressure.  The rescuer can handle that.  The problem is that the regular
> worker pool is occupied with something which is constantly in runnable
> state - most likely writeback / reclaim, so the workqueue doesn't
> schedule the other work items.
> 
> Setting WQ_HIGHPRI works because the highpri worker pool isn't likely to be
> contended that way, but it might not be the best solution.  The right
> thing to do would be setting WQ_CPU_INTENSIVE on the work items which
> can burn a lot of CPU cycles so that they don't get in the way of
> other work items (workqueue should probably trigger a warning on these
> work items too).
> 
> Tetsuo, can you please try to find which work items are occupying the
> worker pool for an extended period of time under memory pressure and set
> WQ_CPU_INTENSIVE on them?
> 

Isn't that any work item which does a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
allocation, since doing a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
burns a lot of CPU cycles under memory pressure? In other words, won't we end up
setting WQ_CPU_INTENSIVE on almost all workqueues?

----------
[  605.720125] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 208s!
[  605.736025] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.746669] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.755091] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 64s!
[  605.763390] Showing busy workqueues and worker pools:
[  605.769436] workqueue events: flags=0x0
[  605.772204]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  605.775548]     pending: console_callback{197431}, vmw_fb_dirty_flush [vmwgfx]{174896}, sysrq_reinject_alt_sysrq{174440}, push_to_pool{162245}
[  605.780761]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[  605.783603]     pending: e1000_watchdog [e1000]{207984}, check_corruption{166511}, rht_deferred_worker{28894}
[  605.787725]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.790682]     pending: vmpressure_work_fn{209065}, e1000_watchdog [e1000]{207615}
[  605.794271]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.797150]     pending: vmstat_shepherd{208067}
[  605.799610] workqueue events_long: flags=0x0
[  605.801951]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.805098]     pending: gc_worker [nf_conntrack]{208961}
[  605.807976] workqueue events_freezable: flags=0x4
[  605.810391]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.813151]     pending: vmballoon_work [vmw_balloon]{208085}
[  605.815851] workqueue events_power_efficient: flags=0x80
[  605.818382]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  605.821124]     pending: check_lifetime{64453}
[  605.823337]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.826091]     pending: neigh_periodic_work{199329}
[  605.828426]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.831068]     pending: fb_flashcursor{209042}, do_cache_clean{201882}
[  605.833902]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.836545]     pending: neigh_periodic_work{195234}
[  605.838838] workqueue events_freezable_power_: flags=0x84
[  605.841295]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.843824]     pending: disk_events_workfn{208625}
[  605.846084] workqueue mm_percpu_wq: flags=0x8
[  605.848145]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.850667]     pending: drain_local_pages_wq{209047} BAR(4561){209047}
[  605.853368] workqueue writeback: flags=0x4e
[  605.855382]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[  605.857793]     in-flight: 354:wb_workfn{182977}
[  605.860182] workqueue xfs-data/sda1: flags=0xc
[  605.862314]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=23/256 MAYDAY
[  605.865010]     in-flight: 3221:xfs_end_io [xfs]{209041}, 3212:xfs_end_io [xfs]{209158}, 29:xfs_end_io [xfs]{209200}, 3230:xfs_end_io [xfs]{209171}, 3229:xfs_end_io [xfs]{209099}, 50:xfs_end_io [xfs]{209099}, 3223:xfs_end_io [xfs]{209045}, 165:xfs_end_io [xfs]{209052}, 3215:xfs_end_io [xfs]{209046}
[  605.874362]     pending: xfs_end_io [xfs]{209011}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{208999}, xfs_end_io [xfs]{208977}, xfs_end_io [xfs]{208975}, xfs_end_io [xfs]{208970}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208950}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208946}, xfs_end_io [xfs]{30655}
[  605.886882]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256 MAYDAY
[  605.889661]     in-flight: 3225:xfs_end_io [xfs]{209249}, 236:xfs_end_io [xfs]{209163}, 23:xfs_end_io [xfs]{209151}, 3228:xfs_end_io [xfs]{209151}, 4380:xfs_end_io [xfs]{209259}, 3214:xfs_end_io [xfs]{209240}, 3220:xfs_end_io [xfs]{209212}, 3227:xfs_end_io [xfs]{209233}
[  605.898706]     pending: xfs_end_io [xfs]{209159}, xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209116}, xfs_end_io [xfs]{209110}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209092}, xfs_end_io [xfs]{209082}, xfs_end_io [xfs]{209061}, xfs_end_io [xfs]{209058}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209040}, xfs_end_io [xfs]{209021}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{30678}
[  605.917299]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=24/256 MAYDAY
[  605.920254]     in-flight: 375(RESCUER):xfs_end_io [xfs]{209194}, 42:xfs_end_io [xfs]{209278}, 3222:xfs_end_io [xfs]{209286}, 17:xfs_end_io [xfs]{209195}, 65:xfs_end_io [xfs]{209241}, 122:xfs_end_io [xfs]{209230}
[  605.927845]     pending: xfs_end_io [xfs]{209187}, xfs_end_io [xfs]{209154}, xfs_end_io [xfs]{209113}, xfs_end_io [xfs]{209088}, xfs_end_io [xfs]{209081}, xfs_end_io [xfs]{209071}, xfs_end_io [xfs]{209070}, xfs_end_io [xfs]{209067}, xfs_end_io [xfs]{209062}, xfs_end_io [xfs]{209053}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209047}, xfs_end_io [xfs]{209032}, xfs_end_io [xfs]{209027}, xfs_end_io [xfs]{209017}, xfs_end_io [xfs]{209016}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{209011}
[  605.944773]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  605.947616]     in-flight: 3218:xfs_end_io [xfs]{209268}, 3216:xfs_end_io [xfs]{209265}, 3:xfs_end_io [xfs]{209223}, 33:xfs_end_io [xfs]{209181}, 101:xfs_end_io [xfs]{209159}, 4381:xfs_end_io [xfs]{209294}, 3219:xfs_end_io [xfs]{209181}
[  605.956005]     pending: xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209141}, xfs_end_io [xfs]{209133}, xfs_end_io [xfs]{209057}, xfs_end_io [xfs]{209026}, xfs_end_io [xfs]{209025}
[  605.963018] workqueue xfs-sync/sda1: flags=0x4
[  605.965455]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.968299]     pending: xfs_log_worker [xfs]{202031}
[  605.970872] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=209s workers=8 manager: 3224
[  605.974252] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=209s workers=6 manager: 3213
[  605.977682] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=209s workers=9 manager: 47
[  605.981015] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=64s workers=10 manager: 3217
[  605.984382] pool 128: cpus=0-63 flags=0x4 nice=0 hung=183s workers=3 idle: 355 356
----------

> > > If we do want to make
> > > sure that work items on mm_percpu_wq workqueue are executed without delays,
> > > we need to consider using kthread_workers instead of workqueue. (Or maybe
> > > we can somehow share one kthread by constantly manipulating its cpumask?)
> > 
> > Hmm, that doesn't sound like a bad idea to me. We already have a rescuer
> > thread that basically sits idle all the time so having a dedicated
> > kernel thread will not be more expensive wrt. resources. So I think this
> > is a more reasonable approach than playing with WQ_HIGHPRI, which smells
> > more like an obscure workaround than a real fix to me.
> 
> Well, there's one rescuer in the whole system and you'd need
> nr_online_cpus kthreads if you wanna avoid constant cacheline
> bouncing.

Excuse me, but isn't it one rescuer kernel thread per WQ_MEM_RECLAIM workqueue?

My thought is to stop using a WQ_MEM_RECLAIM workqueue for mm_percpu_wq and use a
dedicated kernel thread like the oom_reaper. Since the handler function seems to
be called once per second for each online CPU, I thought switching the cpumask
NR_CPUS times per second is tolerable.
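
Roughly like the following sketch (all names are made up and error handling
is omitted):

----------
#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/cpu.h>
#include <linux/sched.h>

/*
 * Sketch: one kernel thread binds itself to each online CPU in turn and
 * runs the per-cpu handler there, instead of using per-cpu work items.
 */
static int mm_percpu_thread_fn(void *unused)
{
        while (!kthread_should_stop()) {
                int cpu;

                get_online_cpus();
                for_each_online_cpu(cpu) {
                        set_cpus_allowed_ptr(current, cpumask_of(cpu));
                        /* per-cpu handler, e.g. a vmstat refresh, goes here */
                }
                put_online_cpus();
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}

/* started once at init: kthread_run(mm_percpu_thread_fn, NULL, "mm_percpu"); */
----------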

Or, yet another approach would be to use split counters:

  Each CPU writes up-to-date values to its per-CPU counters.
  The aggregator kernel thread reads the up-to-date values from the per-CPU
  counters, calculates the diff between the up-to-date and the previous values,
  saves the up-to-date values as the previous values, and reflects the diff
  into the global counters.

if the cost of reading the per-CPU counters of all online CPUs is smaller than
the cost of switching the cpumask for each online CPU.
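
In code, the split-counter scheme would look roughly like this (again, all
names are made up):

----------
#include <linux/percpu.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>

/*
 * Sketch: each CPU only ever writes its own counter; the aggregator
 * thread periodically folds the per-cpu deltas into a global counter.
 */
static DEFINE_PER_CPU(unsigned long, my_events);
static DEFINE_PER_CPU(unsigned long, my_events_seen);
static atomic_long_t my_events_global = ATOMIC_LONG_INIT(0);

static inline void my_event_inc(void)          /* hot path, per-cpu only */
{
        this_cpu_inc(my_events);
}

static void my_events_fold(void)               /* run by the aggregator thread */
{
        int cpu;

        for_each_online_cpu(cpu) {
                unsigned long now = per_cpu(my_events, cpu);
                unsigned long diff = now - per_cpu(my_events_seen, cpu);

                per_cpu(my_events_seen, cpu) = now;
                atomic_long_add(diff, &my_events_global);
        }
}
----------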


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 22:15     ` Tetsuo Handa
@ 2017-08-28 23:02       ` Tejun Heo
  2017-08-28 23:09         ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-28 23:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hello, Tetsuo.

On Tue, Aug 29, 2017 at 07:15:05AM +0900, Tetsuo Handa wrote:
> Isn't that any work item which does a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> allocation, since doing a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> burns a lot of CPU cycles under memory pressure? In other words, won't we end up
> setting WQ_CPU_INTENSIVE on almost all workqueues?

Ah, you're right.  It's the workers getting stuck in direct reclaim.

> > Well, there's one rescuer in the whole system and you'd need
> > nr_online_cpus kthreads if you wanna avoid constant cacheline
> > bouncing.
> 
> Excuse me, but isn't it one rescuer kernel thread per WQ_MEM_RECLAIM workqueue?

I meant that it isn't per-cpu.  If you use a kthread for it, that
kthread would be constantly bouncing around.

> My thought is to stop using a WQ_MEM_RECLAIM workqueue for mm_percpu_wq and use a
> dedicated kernel thread like the oom_reaper. Since the handler function seems to
> be called once per second for each online CPU, I thought switching the cpumask
> NR_CPUS times per second is tolerable.

Hmm... all this is mostly because workqueue lost the "ignore
concurrency management" flag a while back while converting WQ_HIGHPRI
to mean high nice priority instead of the top of the queue w/o
concurrency management.  Resurrecting that shouldn't be too difficult.
I'll get back to you soon.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 23:02       ` Tejun Heo
@ 2017-08-28 23:09         ` Tejun Heo
  2017-08-29 11:14           ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-28 23:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hey, again.

On Mon, Aug 28, 2017 at 04:02:56PM -0700, Tejun Heo wrote:
> Hmm... all this is mostly because workqueue lost the "ignore
> concurrency management" flag a while back while converting WQ_HIGHPRI
> to mean high nice priority instead of the top of the queue w/o
> concurrency management.  Resurrecting that shouldn't be too difficult.
> I'll get back to you soon.

Can you please try this patch and see how the work item behaves w/
WQ_HIGHPRI set?  It disables concurrency mgmt for highpri work items
which makes sense anyway.

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0..14b6bce 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2021,7 +2021,7 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive = pwq->wq->flags & (WQ_CPU_INTENSIVE | WQ_HIGHPRI);
 	int work_color;
 	struct worker *collision;
 #ifdef CONFIG_LOCKDEP


-- 
tejun


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 23:09         ` Tejun Heo
@ 2017-08-29 11:14           ` Tetsuo Handa
  2017-08-29 14:38             ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-29 11:14 UTC (permalink / raw)
  To: tj; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> Can you please try this patch and see how the work item behaves w/
> WQ_HIGHPRI set?  It disables concurrency mgmt for highpri work items
> which makes sense anyway.

I tried with the diff below, but it did not help.

----------
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9d..54027fc 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -101,6 +101,7 @@ struct work_struct {
 	atomic_long_t data;
 	struct list_head entry;
 	work_func_t func;
+	unsigned long stamp;
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map lockdep_map;
 #endif
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5a2277f..173bd00 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1296,6 +1296,7 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 	struct worker_pool *pool = pwq->pool;
 
 	/* we own @work, set data and link */
+	work->stamp = jiffies;
 	set_work_pwq(work, pwq, extra_flags);
 	list_add_tail(&work->entry, head);
 	get_pwq(pwq);
@@ -2021,7 +2022,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive = pwq->wq->flags & (WQ_CPU_INTENSIVE | WQ_HIGHPRI);
 	int work_color;
 	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
@@ -4372,10 +4373,10 @@ static void pr_cont_work(bool comma, struct work_struct *work)
 
 		barr = container_of(work, struct wq_barrier, work);
 
-		pr_cont("%s BAR(%d)", comma ? "," : "",
-			task_pid_nr(barr->task));
+		pr_cont("%s BAR(%d){%lu}", comma ? "," : "",
+			task_pid_nr(barr->task), jiffies - work->stamp);
 	} else {
-		pr_cont("%s %pf", comma ? "," : "", work->func);
+		pr_cont("%s %pf{%lu}", comma ? "," : "", work->func, jiffies - work->stamp);
 	}
 }
 
@@ -4407,10 +4408,11 @@ static void show_pwq(struct pool_workqueue *pwq)
 			if (worker->current_pwq != pwq)
 				continue;
 
-			pr_cont("%s %d%s:%pf", comma ? "," : "",
+			pr_cont("%s %d%s:%pf{%lu}", comma ? "," : "",
 				task_pid_nr(worker->task),
 				worker == pwq->wq->rescuer ? "(RESCUER)" : "",
-				worker->current_func);
+				worker->current_func, worker->current_work ?
+				jiffies - worker->current_work->stamp : 0);
 			list_for_each_entry(work, &worker->scheduled, entry)
 				pr_cont_work(false, work);
 			comma = true;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e7..cb7e198 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1923,7 +1923,8 @@ void __init init_mm_internals(void)
 {
 	int ret __maybe_unused;
 
-	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
+	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
+				       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 
 #ifdef CONFIG_SMP
 	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",
----------

Unless the above diff for printing the delay is wrong, work items on
WQ_MEM_RECLAIM | WQ_HIGHPRI workqueues are delayed by other work items.

----------
[  654.670289] Showing busy workqueues and worker pools:
[  654.670320] workqueue events: flags=0x0
[  654.670664]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  654.670776]     pending: vmpressure_work_fn{5}
[  654.670870] workqueue events_power_efficient: flags=0x80
[  654.670992]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  654.671003]     pending: fb_flashcursor{160}
[  654.671032] workqueue events_freezable_power_: flags=0x84
[  654.671152]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  654.671162]     in-flight: 2100:disk_events_workfn{64432}
[  654.671259] workqueue writeback: flags=0x4e
[  654.671370]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  654.671453]     in-flight: 380:wb_workfn{1}
[  654.671461]     pending: wb_workfn{1}
[  654.672793] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444 idle: 2126
[  654.672815] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 41 257
[  654.673048] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=4 idle: 3305 378 379

[  838.554020] Showing busy workqueues and worker pools:
[  838.554127] workqueue events_power_efficient: flags=0x80
[  838.554282]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  838.554295]     in-flight: 2126:fb_flashcursor{52}
[  838.554304]     pending: fb_flashcursor{52}
[  838.554335] workqueue events_freezable_power_: flags=0x84
[  838.554467]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  838.554479]     in-flight: 2100:disk_events_workfn{248316}
[  838.554595] workqueue writeback: flags=0x4e
[  838.554599]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  838.554610]     in-flight: 380:wb_workfn{0} wb_workfn{0}
[  838.555989] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444
[  838.556012] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 41 257
[  838.556252] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=4 idle: 3305 378 379

[  897.501844] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 40s!
[  897.501935] Showing busy workqueues and worker pools:
[  897.501961] workqueue events: flags=0x0
[  897.502274]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  897.502367]     pending: vmw_fb_dirty_flush{58910}
[  897.502379]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  897.502390]     in-flight: 97:console_callback{58946} console_callback{58946}
[  897.502404]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  897.502414]     pending: vmpressure_work_fn{40434}, e1000_watchdog [e1000]{40192}, vmstat_shepherd{39743}
[  897.502496] workqueue events_long: flags=0x0
[  897.502643]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  897.502654]     pending: gc_worker [nf_conntrack]{40352}
[  897.502712] workqueue events_power_efficient: flags=0x80
[  897.502855]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  897.502864]     in-flight: 2126:fb_flashcursor{59001}
[  897.502872]     pending: fb_flashcursor{59001}
[  897.502882]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  897.502892]     pending: neigh_periodic_work{30209}, neigh_periodic_work{30209}
[  897.502926] workqueue events_freezable_power_: flags=0x84
[  897.503062]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  897.503072]     in-flight: 2100:disk_events_workfn{307265}
[  897.503107] workqueue mm_percpu_wq: flags=0x18
[  897.503291]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[  897.503301]     pending: vmstat_update{58752}
[  897.503374] workqueue writeback: flags=0x4e
[  897.503464]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  897.503516]     in-flight: 380:wb_workfn{0}
[  897.503524]     pending: wb_workfn{0}
[  897.504901] workqueue xfs-eofblocks/sda1: flags=0xc
[  897.505045]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  897.505057]     in-flight: 57:xfs_eofblocks_worker [xfs]{40451}
[  897.505127] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=40s workers=2 manager: 135
[  897.505160] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[  897.505179] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=59s workers=2 manager: 444
[  897.505200] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=43s workers=3 idle: 41 257
[  897.505478] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[  927.288644] Showing busy workqueues and worker pools:
[  927.288669] workqueue events: flags=0x0
[  927.288814]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  927.288827]     pending: vmw_fb_dirty_flush{88697}
[  927.288836]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  927.288847]     in-flight: 97:console_callback{88733} console_callback{88733}
[  927.288861]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  927.288872]     pending: vmpressure_work_fn{70221}, e1000_watchdog [e1000]{69979}, vmstat_shepherd{69530}
[  927.288936] workqueue events_long: flags=0x0
[  927.289082]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  927.289093]     pending: gc_worker [nf_conntrack]{70139}
[  927.289152] workqueue events_power_efficient: flags=0x80
[  927.289284]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  927.289295]     in-flight: 2126:fb_flashcursor{88787}
[  927.289303]     pending: fb_flashcursor{88787}
[  927.289312]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  927.289323]     pending: neigh_periodic_work{59995}, neigh_periodic_work{59995}
[  927.289356] workqueue events_freezable_power_: flags=0x84
[  927.289483]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  927.289493]     in-flight: 2100:disk_events_workfn{337051}
[  927.289529] workqueue mm_percpu_wq: flags=0x18
[  927.289681]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[  927.289693]     pending: vmstat_update{88538}
[  927.289766] workqueue writeback: flags=0x4e
[  927.289770]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  927.289780]     in-flight: 380:wb_workfn{1}
[  927.289788]     pending: wb_workfn{1}
[  927.291244] workqueue xfs-eofblocks/sda1: flags=0xc
[  927.291383]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  927.291395]     in-flight: 57:xfs_eofblocks_worker [xfs]{70237}
[  927.291449] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=70s workers=2 manager: 135
[  927.291466] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[  927.291486] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=88s workers=2 manager: 444
[  927.291507] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=73s workers=3 idle: 41 257
[  927.291776] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[  957.917850] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 100s!
[  957.917935] Showing busy workqueues and worker pools:
[  957.917959] workqueue events: flags=0x0
[  957.918109]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  957.918121]     pending: vmw_fb_dirty_flush{119326}
[  957.918131]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  957.918142]     in-flight: 97:console_callback{119362} console_callback{119362}
[  957.918156]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  957.918167]     pending: vmpressure_work_fn{100850}, e1000_watchdog [e1000]{100608}, vmstat_shepherd{100159}
[  957.918233] workqueue events_long: flags=0x0
[  957.918374]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  957.918384]     pending: gc_worker [nf_conntrack]{100768}
[  957.918442] workqueue events_power_efficient: flags=0x80
[  957.918599]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  957.918609]     in-flight: 2126:fb_flashcursor{119416}
[  957.918618]     pending: fb_flashcursor{119416}
[  957.918627]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  957.918637]     pending: neigh_periodic_work{90624}, neigh_periodic_work{90624}
[  957.918669] workqueue events_freezable_power_: flags=0x84
[  957.918809]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  957.918819]     in-flight: 2100:disk_events_workfn{367681}
[  957.918855] workqueue mm_percpu_wq: flags=0x18
[  957.919003]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[  957.919014]     pending: vmstat_update{119168}
[  957.919090] workqueue writeback: flags=0x4e
[  957.919093]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  957.919104]     in-flight: 380:wb_workfn{0} wb_workfn{0}
[  957.920562] workqueue xfs-eofblocks/sda1: flags=0xc
[  957.920705]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  957.920716]     in-flight: 57:xfs_eofblocks_worker [xfs]{100866}
[  957.920772] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=100s workers=2 manager: 135
[  957.920789] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[  957.920807] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=119s workers=2 manager: 444
[  957.920826] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=103s workers=3 idle: 41 257
[  957.921096] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[  988.125885] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 131s!
[  988.125973] Showing busy workqueues and worker pools:
[  988.125998] workqueue events: flags=0x0
[  988.126155]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  988.126168]     pending: vmw_fb_dirty_flush{149534}
[  988.126179]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  988.126190]     in-flight: 97:console_callback{149570} console_callback{149570}
[  988.126204]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  988.126215]     pending: vmpressure_work_fn{131058}, e1000_watchdog [e1000]{130816}, vmstat_shepherd{130367}
[  988.126283] workqueue events_long: flags=0x0
[  988.126430]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  988.126441]     pending: gc_worker [nf_conntrack]{130976}
[  988.126521] workqueue events_power_efficient: flags=0x80
[  988.126667]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  988.126678]     in-flight: 2126:fb_flashcursor{149625}
[  988.126686]     pending: fb_flashcursor{149625}
[  988.126695]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  988.126706]     pending: neigh_periodic_work{120833}, neigh_periodic_work{120833}
[  988.126738] workqueue events_freezable_power_: flags=0x84
[  988.126881]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  988.126891]     in-flight: 2100:disk_events_workfn{397889}
[  988.126929] workqueue mm_percpu_wq: flags=0x18
[  988.127086]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[  988.127096]     pending: vmstat_update{149376}
[  988.127211] workqueue writeback: flags=0x4e
[  988.127215]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=1/256
[  988.127226]     in-flight: 380:wb_workfn{0}
[  988.128675] workqueue xfs-eofblocks/sda1: flags=0xc
[  988.128825]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  988.128838]     in-flight: 57:xfs_eofblocks_worker [xfs]{131075}
[  988.128897] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=131s workers=2 manager: 135
[  988.128915] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[  988.128935] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=149s workers=2 manager: 444
[  988.128956] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=134s workers=3 idle: 41 257
[  988.129278] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1018.333984] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 161s!
[ 1018.339385] Showing busy workqueues and worker pools:
[ 1018.343135] workqueue events: flags=0x0
[ 1018.346272]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[ 1018.350632]     pending: vmpressure_work_fn{161283}, e1000_watchdog [e1000]{161041}, vmstat_shepherd{160592}
[ 1018.356151] workqueue events_long: flags=0x0
[ 1018.359581]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1018.363519]     pending: gc_worker [nf_conntrack]{161213}
[ 1018.367290] workqueue events_power_efficient: flags=0x80
[ 1018.371142]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1018.375057]     pending: neigh_periodic_work{151081}, neigh_periodic_work{151081}
[ 1018.379686] workqueue events_freezable_power_: flags=0x84
[ 1018.383621]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1018.387684]     in-flight: 2100:disk_events_workfn{428150}
[ 1018.391519] workqueue writeback: flags=0x4e
[ 1018.394708]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1018.398537]     in-flight: 380:wb_workfn{0}
[ 1018.401495]     pending: wb_workfn{0}
[ 1018.405654] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1018.409202]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1018.413075]     in-flight: 57:xfs_eofblocks_worker [xfs]{161359}
[ 1018.417061] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=161s workers=2 manager: 135
[ 1018.421598] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444 idle: 2126
[ 1018.426532] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=164s workers=3 idle: 41 257
[ 1018.431490] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1042.653684] Showing busy workqueues and worker pools:
[ 1042.653710] workqueue events: flags=0x0
[ 1042.653857]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1042.653869]     pending: vmpressure_work_fn{185586}, e1000_watchdog [e1000]{185344}, vmstat_shepherd{184895}, vmw_fb_dirty_flush{24218}
[ 1042.653935] workqueue events_long: flags=0x0
[ 1042.654060]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1042.654070]     pending: gc_worker [nf_conntrack]{185504}
[ 1042.654126] workqueue events_power_efficient: flags=0x80
[ 1042.654264]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 1042.654275]     in-flight: 2126:fb_flashcursor{176}
[ 1042.654283]     pending: fb_flashcursor{176}
[ 1042.654292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1042.654303]     pending: neigh_periodic_work{175360}, neigh_periodic_work{175360}
[ 1042.654334] workqueue events_freezable_power_: flags=0x84
[ 1042.654455]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1042.654466]     in-flight: 2100:disk_events_workfn{452416}
[ 1042.654563] workqueue writeback: flags=0x4e
[ 1042.654574]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1042.654585]     in-flight: 380:wb_workfn{0}
[ 1042.654592]     pending: wb_workfn{0}
[ 1042.655922] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1042.656048]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1042.656058]     in-flight: 57:xfs_eofblocks_worker [xfs]{185602}
[ 1042.656109] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=185s workers=2 manager: 135
[ 1042.656126] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444
[ 1042.656146] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=188s workers=3 idle: 41 257
[ 1042.656402] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1108.958848] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 251s!
[ 1108.958946] Showing busy workqueues and worker pools:
[ 1108.958971] workqueue events: flags=0x0
[ 1108.959121]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[ 1108.959133]     in-flight: 97:console_callback{66303} console_callback{66303}
[ 1108.959149]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1108.959159]     pending: vmpressure_work_fn{251891}, e1000_watchdog [e1000]{251649}, vmstat_shepherd{251200}, vmw_fb_dirty_flush{90523}
[ 1108.959227] workqueue events_long: flags=0x0
[ 1108.959372]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1108.959382]     pending: gc_worker [nf_conntrack]{251809}
[ 1108.959440] workqueue events_power_efficient: flags=0x80
[ 1108.959616]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 1108.959627]     in-flight: 2126:fb_flashcursor{66482}
[ 1108.959636]     pending: fb_flashcursor{66482}
[ 1108.959645]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1108.959655]     pending: neigh_periodic_work{241666}, neigh_periodic_work{241666}
[ 1108.959687] workqueue events_freezable_power_: flags=0x84
[ 1108.959825]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1108.959835]     in-flight: 2100:disk_events_workfn{518722}
[ 1108.959871] workqueue mm_percpu_wq: flags=0x18
[ 1108.960020]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[ 1108.960030]     pending: vmstat_update{66241}
[ 1108.960104] workqueue writeback: flags=0x4e
[ 1108.960107]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1108.960117]     in-flight: 380:wb_workfn{0} wb_workfn{0}
[ 1108.961605] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1108.961749]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1108.961761]     in-flight: 57:xfs_eofblocks_worker [xfs]{251908}
[ 1108.961818] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=251s workers=2 manager: 135
[ 1108.961834] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[ 1108.961853] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=66s workers=2 manager: 444
[ 1108.961874] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=254s workers=3 idle: 41 257
[ 1108.962150] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1230.033873] Showing busy workqueues and worker pools:
[ 1230.033899] workqueue events: flags=0x0
[ 1230.034066]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1230.034079]     pending: vmpressure_work_fn{372966}, e1000_watchdog [e1000]{372724}, vmstat_shepherd{372275}, vmw_fb_dirty_flush{211598}
[ 1230.034148] workqueue events_long: flags=0x0
[ 1230.034276]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1230.034286]     pending: gc_worker [nf_conntrack]{372884}
[ 1230.034342] workqueue events_power_efficient: flags=0x80
[ 1230.034466]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 1230.034477]     in-flight: 2126:fb_flashcursor{164}
[ 1230.034485]     pending: fb_flashcursor{164}
[ 1230.034494]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1230.034505]     pending: neigh_periodic_work{362740}, neigh_periodic_work{362740}
[ 1230.034536] workqueue events_freezable_power_: flags=0x84
[ 1230.034658]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1230.034670]     in-flight: 2100:disk_events_workfn{639796}
[ 1230.034723] workqueue mm_percpu_wq: flags=0x18
[ 1230.034857]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[ 1230.034867]     pending: vmstat_update{308}
[ 1230.034941] workqueue writeback: flags=0x4e
[ 1230.034945]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1230.034955]     in-flight: 380:wb_workfn{0}
[ 1230.034963]     pending: wb_workfn{0}
[ 1230.036617] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1230.036765]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1230.036777]     in-flight: 57:xfs_eofblocks_worker [xfs]{372980}
[ 1230.036832] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=372s workers=2 manager: 135
[ 1230.036850] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444
[ 1230.036870] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=376s workers=3 idle: 41 257
[ 1230.037119] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1290.209431] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 433s!
[ 1290.209513] Showing busy workqueues and worker pools:
[ 1290.209536] workqueue events: flags=0x0
[ 1290.209680]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[ 1290.209692]     in-flight: 97:console_callback{60175} console_callback{60175}
[ 1290.209708]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1290.209718]     pending: vmpressure_work_fn{433141}, e1000_watchdog [e1000]{432899}, vmstat_shepherd{432450}, vmw_fb_dirty_flush{271773}
[ 1290.209795] workqueue events_long: flags=0x0
[ 1290.209934]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1290.209943]     pending: gc_worker [nf_conntrack]{433059}
[ 1290.210001] workqueue events_power_efficient: flags=0x80
[ 1290.210171]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 1290.210182]     in-flight: 2126:fb_flashcursor{60339}
[ 1290.210191]     pending: fb_flashcursor{60339}
[ 1290.210201]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1290.210212]     pending: neigh_periodic_work{422915}, neigh_periodic_work{422915}
[ 1290.210244] workqueue events_freezable_power_: flags=0x84
[ 1290.210379]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1290.210390]     in-flight: 2100:disk_events_workfn{699971}
[ 1290.210426] workqueue mm_percpu_wq: flags=0x18
[ 1290.210574]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[ 1290.210584]     pending: vmstat_update{60483}
[ 1290.210777] workqueue writeback: flags=0x4e
[ 1290.210781]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1290.210792]     in-flight: 380:wb_workfn{0}
[ 1290.210800]     pending: wb_workfn{0}
[ 1290.212242] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1290.212388]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1290.212399]     in-flight: 57:xfs_eofblocks_worker [xfs]{433155}
[ 1290.212458] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=433s workers=2 manager: 135
[ 1290.212476] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
[ 1290.212495] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=60s workers=2 manager: 444
[ 1290.212515] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=436s workers=3 idle: 41 257
[ 1290.212826] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1320.414593] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 463s!
[ 1320.418262] Showing busy workqueues and worker pools:
[ 1320.420892] workqueue events: flags=0x0
[ 1320.423288]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1320.426296]     pending: vmpressure_work_fn{463359}, e1000_watchdog [e1000]{463117}, vmstat_shepherd{462668}, vmw_fb_dirty_flush{301991}
[ 1320.431527] workqueue events_long: flags=0x0
[ 1320.434086]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1320.437020]     pending: gc_worker [nf_conntrack]{463287}
[ 1320.439890] workqueue events_power_efficient: flags=0x80
[ 1320.442695]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1320.445606]     pending: neigh_periodic_work{453152}, neigh_periodic_work{453152}
[ 1320.448995] workqueue events_freezable_power_: flags=0x84
[ 1320.451872]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1320.454848]     in-flight: 2100:disk_events_workfn{730217}
[ 1320.457663] workqueue writeback: flags=0x4e
[ 1320.459955]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1320.462704]     in-flight: 380:wb_workfn{3}
[ 1320.464878]     pending: wb_workfn{5}
[ 1320.467959] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1320.470801]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1320.474955]     in-flight: 57:xfs_eofblocks_worker [xfs]{463421}
[ 1320.479104] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=463s workers=2 manager: 135
[ 1320.483981] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444 idle: 2126
[ 1320.489033] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=466s workers=3 idle: 41 257
[ 1320.494234] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

[ 1411.610809] Showing busy workqueues and worker pools:
[ 1411.610835] workqueue events: flags=0x0
[ 1411.610983]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 1411.610995]     pending: vmpressure_work_fn{554543}, e1000_watchdog [e1000]{554301}, vmstat_shepherd{553852}, vmw_fb_dirty_flush{393175}
[ 1411.611082] workqueue events_long: flags=0x0
[ 1411.611209]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1411.611221]     pending: gc_worker [nf_conntrack]{554462}
[ 1411.611283] workqueue events_power_efficient: flags=0x80
[ 1411.611409]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 1411.611420]     in-flight: 2126:fb_flashcursor{43}
[ 1411.611428]     pending: fb_flashcursor{43}
[ 1411.611437]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 1411.611447]     pending: neigh_periodic_work{544318}, neigh_periodic_work{544318}
[ 1411.611479] workqueue events_freezable_power_: flags=0x84
[ 1411.611599]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[ 1411.611609]     in-flight: 2100:disk_events_workfn{821374}
[ 1411.611727] workqueue writeback: flags=0x4e
[ 1411.611739]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1411.611749]     in-flight: 380:wb_workfn{0}
[ 1411.611756]     pending: wb_workfn{0}
[ 1411.613564] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1411.613697]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1411.613708]     in-flight: 57:xfs_eofblocks_worker [xfs]{554560}
[ 1411.613779] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=554s workers=2 manager: 135
[ 1411.613798] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 444
[ 1411.613817] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=557s workers=3 idle: 41 257
[ 1411.614071] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378
----------

The memory stressor I used is shown below.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			sleep(1);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
----------


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-28 17:06   ` Tejun Heo
  2017-08-28 22:15     ` Tetsuo Handa
@ 2017-08-29 13:33     ` Michal Hocko
  2017-08-29 14:33       ` Tejun Heo
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2017-08-29 13:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Tetsuo Handa, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Mon 28-08-17 10:06:11, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Aug 28, 2017 at 02:10:56PM +0200, Michal Hocko wrote:
> > I am not sure I understand how WQ_HIGHPRI actually helps. The work item
> > will get served by a thread with higher priority and from a different
> > pool than regular WQs. But what prevents the same issue as described
> > above when the highprio pool gets congested? In other words, what makes
> > WQ_HIGHPRI less prone to long stalls when we are under a low-memory
> > situation and new workers cannot be allocated?
> 
> So, the problem wasn't a new worker not getting allocated due to memory
> pressure.  The rescuer can handle that.  The problem is that the regular
> worker pool is occupied with something which is constantly in runnable
> state - most likely writeback / reclaim, so the workqueue doesn't
> schedule the other work items.

Hmm, we have this in should_reclaim_retry
			/*
			 * Memory allocation/reclaim might be called from a WQ
			 * context and the current implementation of the WQ
			 * concurrency control doesn't recognize that
			 * a particular WQ is congested if the worker thread is
			 * looping without ever sleeping. Therefore we have to
			 * do a short sleep here rather than calling
			 * cond_resched().
			 */
			if (current->flags & PF_WQ_WORKER)
				schedule_timeout_uninterruptible(1);

And I thought that would be sufficient for the WQ concurrency/congestion
handling to kick in for kworkers. Or do we need something more generic,
e.g. make cond_resched() special for kworkers?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 13:33     ` Michal Hocko
@ 2017-08-29 14:33       ` Tejun Heo
  2017-08-29 20:29         ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-29 14:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, akpm, linux-mm, Mel Gorman, Vlastimil Babka

Hello,

On Tue, Aug 29, 2017 at 03:33:25PM +0200, Michal Hocko wrote:
> Hmm, we have this in should_reclaim_retry
> 			/*
> 			 * Memory allocation/reclaim might be called from a WQ
> 			 * context and the current implementation of the WQ
> 			 * concurrency control doesn't recognize that
> 			 * a particular WQ is congested if the worker thread is
> 			 * looping without ever sleeping. Therefore we have to
> 			 * do a short sleep here rather than calling
> 			 * cond_resched().
> 			 */
> 			if (current->flags & PF_WQ_WORKER)
> 				schedule_timeout_uninterruptible(1);
> 
> And I thought that would be sufficient for the WQ concurrency/congestion
> handling to kick in for kworkers. Or do we need something more generic,
> e.g. make cond_resched() special for kworkers?

I actually think we're hitting a bug somewhere.  Tetsuo's trace with
the patch applied doesn't add up.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 11:14           ` Tetsuo Handa
@ 2017-08-29 14:38             ` Tejun Heo
  2017-08-29 21:41               ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-29 14:38 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hello, Tetsuo.

On Tue, Aug 29, 2017 at 08:14:49PM +0900, Tetsuo Handa wrote:
> [  897.503107] workqueue mm_percpu_wq: flags=0x18
> [  897.503291]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
> [  897.503301]     pending: vmstat_update{58752}

This is weird.  Assuming 1000HZ, the work item has been pending for
about a minute,

> [  897.505127] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=40s workers=2 manager: 135
> [  897.505160] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3311 2132
> [  897.505179] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=59s workers=2 manager: 444
> [  897.505200] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=43s workers=3 idle: 41 257
> [  897.505478] pool 256: cpus=0-127 flags=0x4 nice=0 hung=0s workers=3 idle: 3305 378

but there's no active worker on the pool and the rescuer hasn't been
kicked off.

> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> int main(int argc, char *argv[])
> {
> 	static char buffer[4096] = { };
> 	char *buf = NULL;
> 	unsigned long size;
> 	unsigned long i;
> 	for (i = 0; i < 1024; i++) {
> 		if (fork() == 0) {
> 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> 			write(fd, "1000", 4);
> 			close(fd);
> 			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
> 			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> 			sleep(1);
> 			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
> 			_exit(0);
> 		}
> 	}
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	sleep(2);
> 	/* Will cause OOM due to overcommit */
> 	for (i = 0; i < size; i += 4096)
> 		buf[i] = 0;
> 	return 0;
> }

I'll try to repro and find out what's going on.

Thanks!

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 14:33       ` Tejun Heo
@ 2017-08-29 20:29         ` Tetsuo Handa
  2017-08-30  6:40           ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-29 20:29 UTC (permalink / raw)
  To: tj, mhocko; +Cc: akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> Hello,
> 
> On Tue, Aug 29, 2017 at 03:33:25PM +0200, Michal Hocko wrote:
> > Hmm, we have this in should_reclaim_retry
> > 			/*
> > 			 * Memory allocation/reclaim might be called from a WQ
> > 			 * context and the current implementation of the WQ
> > 			 * concurrency control doesn't recognize that
> > 			 * a particular WQ is congested if the worker thread is
> > 			 * looping without ever sleeping. Therefore we have to
> > 			 * do a short sleep here rather than calling
> > 			 * cond_resched().
> > 			 */
> > 			if (current->flags & PF_WQ_WORKER)
> > 				schedule_timeout_uninterruptible(1);
> > 
> > And I thought that would be sufficient for the WQ concurrency/congestion
> > handling to kick in for kworkers. Or do we need something more generic,
> > e.g. make cond_resched() special for kworkers?
> 
> I actually think we're hitting a bug somewhere.  Tetsuo's trace with
> the patch applied doesn't add up.
> 
> Thanks.

If we are under memory pressure, __zone_watermark_ok() can return false.
If __zone_watermark_ok() == false, when is schedule_timeout_*() called explicitly?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 14:38             ` Tejun Heo
@ 2017-08-29 21:41               ` Tejun Heo
  2017-08-30 13:51                 ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-29 21:41 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hello,

I can't repro the problem.  The test program gets cleanly oom killed.
Hmm... the workqueue dumps you posted are really weird because there
are multiple work items stalling for really long times but only one
pool is reporting a hang and no rescuers are active.  I don't get
how the system can be in such a state.

Just in case, you're testing mainline, right?  I've updated your debug
patch slightly so that it doesn't skip seemingly idle pools.  Can you
please repro the problem with the patch applied?  Thanks.

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9d..54027fc 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -101,6 +101,7 @@ struct work_struct {
 	atomic_long_t data;
 	struct list_head entry;
 	work_func_t func;
+	unsigned long stamp;
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map lockdep_map;
 #endif
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0..006c19a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1296,6 +1296,7 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 	struct worker_pool *pool = pwq->pool;
 
 	/* we own @work, set data and link */
+	work->stamp = jiffies;
 	set_work_pwq(work, pwq, extra_flags);
 	list_add_tail(&work->entry, head);
 	get_pwq(pwq);
@@ -2021,7 +2022,7 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive = pwq->wq->flags & (WQ_CPU_INTENSIVE | WQ_HIGHPRI);
 	int work_color;
 	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
@@ -4338,10 +4339,10 @@ static void pr_cont_work(bool comma, struct work_struct *work)
 
 		barr = container_of(work, struct wq_barrier, work);
 
-		pr_cont("%s BAR(%d)", comma ? "," : "",
-			task_pid_nr(barr->task));
+		pr_cont("%s BAR(%d){%u}", comma ? "," : "",
+			task_pid_nr(barr->task), jiffies_to_msecs(jiffies - work->stamp));
 	} else {
-		pr_cont("%s %pf", comma ? "," : "", work->func);
+		pr_cont("%s %pf{%u}", comma ? "," : "", work->func, jiffies_to_msecs(jiffies - work->stamp));
 	}
 }
 
@@ -4373,10 +4374,11 @@ static void show_pwq(struct pool_workqueue *pwq)
 			if (worker->current_pwq != pwq)
 				continue;
 
-			pr_cont("%s %d%s:%pf", comma ? "," : "",
+			pr_cont("%s %d%s:%pf{%u}", comma ? "," : "",
 				task_pid_nr(worker->task),
 				worker == pwq->wq->rescuer ? "(RESCUER)" : "",
-				worker->current_func);
+				worker->current_func, worker->current_work ?
+				jiffies_to_msecs(jiffies - worker->current_work->stamp) : 0);
 			list_for_each_entry(work, &worker->scheduled, entry)
 				pr_cont_work(false, work);
 			comma = true;
@@ -4461,8 +4463,8 @@ void show_workqueue_state(void)
 		bool first = true;
 
 		spin_lock_irqsave(&pool->lock, flags);
-		if (pool->nr_workers == pool->nr_idle)
-			goto next_pool;
+		/*if (pool->nr_workers == pool->nr_idle)
+		  goto next_pool;*/
 
 		pr_info("pool %d:", pool->id);
 		pr_cont_pool_info(pool);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441b..c099ebf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1768,7 +1768,8 @@ void __init init_mm_internals(void)
 {
 	int ret __maybe_unused;
 
-	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
+	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
+				       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 
 #ifdef CONFIG_SMP
 	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 20:29         ` Tetsuo Handa
@ 2017-08-30  6:40           ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2017-08-30  6:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: tj, akpm, linux-mm, mgorman, vbabka

On Wed 30-08-17 05:29:26, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Aug 29, 2017 at 03:33:25PM +0200, Michal Hocko wrote:
> > > Hmm, we have this in should_reclaim_retry
> > > 			/*
> > > 			 * Memory allocation/reclaim might be called from a WQ
> > > 			 * context and the current implementation of the WQ
> > > 			 * concurrency control doesn't recognize that
> > > 			 * a particular WQ is congested if the worker thread is
> > > 			 * looping without ever sleeping. Therefore we have to
> > > 			 * do a short sleep here rather than calling
> > > 			 * cond_resched().
> > > 			 */
> > > 			if (current->flags & PF_WQ_WORKER)
> > > 				schedule_timeout_uninterruptible(1);
> > > 
> > > And I thought it would be sufficient for kworkers for the concurrency WQ
> > > congestion thingy to jump in. Or do we need something more generic, e.g.
> > > make cond_resched special for kworkers?
> > 
> > I actually think we're hitting a bug somewhere.  Tetsuo's trace with
> > the patch applied doesn't add up.
> > 
> > Thanks.
> 
> If we are under memory pressure, __zone_watermark_ok() can return false.
> If __zone_watermark_ok() == false, when is schedule_timeout_*() called explicitly?

If all zones fail with the watermark check then we should hit the oom
path and sleep there. We do not do so for all cases though. Maybe we
should be more consistent there but even if this was a flood of GFP_NOFS
requests from the WQ context then at least some of them should fail on
the oom lock and sleep and help to make at least some progress. Moreover
Tejun suggests that some pools are idle so this might be a completely
different issue. In any case we can make an explicit reschedule point
in should_reclaim_retry. I would be really surprised if it helped but
maybe this is better code in the end regardless.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 018468a3b6b1..c93660926f24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3699,6 +3699,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	int ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3762,25 +3763,27 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			break;
 		}
 	}
 
-	return false;
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+
+
+	return ret;
 }
 
 static inline bool
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-29 21:41               ` Tejun Heo
@ 2017-08-30 13:51                 ` Tetsuo Handa
  2017-08-31  1:46                   ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-30 13:51 UTC (permalink / raw)
  To: tj; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> I can't repro the problem.  The test program gets cleanly oom killed.
> Hmm... the workqueue dumps you posted are really weird because there
> are multiple work items stalling for really long times but only one
> pool is reporting a hang and no rescuers are active.  I don't get
> how the system can be in such a state.
> 
> Just in case, you're testing mainline, right?  I've updated your debug
> patch slightly so that it doesn't skip seemingly idle pools.  Can you
> please repro the problem with the patch applied?  Thanks.

Here are logs from the patch applied on top of linux-next-20170828.
Can you find some clue?

http://I-love.SAKURA.ne.jp/tmp/serial-20170830.txt.xz :

[  150.580362] Showing busy workqueues and worker pools:
[  150.580425] workqueue events_power_efficient: flags=0x80
[  150.580452]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  150.580456]     in-flight: 57:fb_flashcursor{53}
[  150.580486] workqueue mm_percpu_wq: flags=0x18
[  150.580513]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[  150.580516]     pending: drain_local_pages_wq{14139} BAR(1706){14139}
[  150.580558] workqueue writeback: flags=0x4e
[  150.580559]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  150.580562]     in-flight: 400:wb_workfn{0} wb_workfn{0}
[  150.581413] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 178 3
[  150.581417] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 4 98
[  150.581420] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=15s workers=4 idle: 81 2104 17 285
[  150.581424] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=14s workers=2 idle: 18 92
[  150.581426] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 84 102 23
[  150.581429] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 94 24
[  150.581432] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 163 85 29
[  150.581435] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 95
[  150.581437] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 132 86 2106 35
[  150.581440] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 36
[  150.581442] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 87 306 41
[  150.581445] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 42 93
[  150.581448] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 88 258 506 47
[  150.581451] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 48 96
[  150.581453] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 89 470 53
[  150.581456] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 54 99
[  150.581458] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=150s workers=0
[  150.581460] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=150s workers=0

[  355.958940] Showing busy workqueues and worker pools:
[  355.960900] workqueue events: flags=0x0
[  355.962508]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  355.964646]     in-flight: 163:rht_deferred_worker{17020} rht_deferred_worker{17020}
[  355.967284]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  355.969522]     in-flight: 57:vmw_fb_dirty_flush{10381} vmw_fb_dirty_flush{10381}
[  355.972442] workqueue events_freezable_power_: flags=0x84
[  355.974636]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256
[  355.976956]     in-flight: 258:disk_events_workfn{355}
[  355.979027] workqueue writeback: flags=0x4e
[  355.980775]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  355.982892]     in-flight: 400:wb_workfn{0} wb_workfn{0}
[  355.985847] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 178 2114 3
[  355.988698] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 98 4
[  355.991253] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=1s workers=4 idle: 81 2104 17 285
[  355.993997] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=1s workers=2 idle: 18 92
[  355.996596] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 84 102 23
[  355.999415] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 24 94
[  356.002380] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 2117 29 85
[  356.005322] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 95 30
[  356.008095] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 132 86 2106 35
[  356.010991] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 36
[  356.013592] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 87 306 41
[  356.016336] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 93 42
[  356.019163] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 88 506 47 2116 2115
[  356.022239] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 96 48
[  356.025010] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 89 470 53
[  356.027690] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 99 54
[  356.030419] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=276s workers=0
[  356.033156] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=276s workers=0

[  488.888894] Showing busy workqueues and worker pools:
[  488.888908] workqueue events: flags=0x0
[  488.888934]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  488.888937]     in-flight: 2120:rht_deferred_worker{0}
[  488.888991] workqueue events_power_efficient: flags=0x80
[  488.889011]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  488.889013]     in-flight: 178:fb_flashcursor{57}
[  488.889030] workqueue events_freezable_power_: flags=0x84
[  488.889048]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256
[  488.889050]     in-flight: 258:disk_events_workfn{1113}
[  488.889106] workqueue writeback: flags=0x4e
[  488.889114]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  488.889118]     in-flight: 400:wb_workfn{0}
[  488.889122]     pending: wb_workfn{0}
[  488.889934] workqueue xfs-eofblocks/sda1: flags=0xc
[  488.889953]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[  488.889956]     in-flight: 132:xfs_eofblocks_worker [xfs]{132266}
[  488.889991] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 57 2114 3
[  488.889997] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 98 4
[  488.890001] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=65s workers=5 idle: 17 2104 81 2118 285
[  488.890006] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=65s workers=2 idle: 92 18
[  488.890010] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 84 2119 102
[  488.890014] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 24 94
[  488.890018] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 163 2117 29 85
[  488.890023] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 95
[  488.890027] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 86 2106
[  488.890030] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 36
[  488.890034] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 306 87
[  488.890037] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 93 42
[  488.890041] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 88 506 47 2116 2115
[  488.890046] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 48 96
[  488.890050] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 89 2121 470
[  488.890054] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 99 54
[  488.890057] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=488s workers=0
[  488.890060] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=488s workers=0

[  782.785399] Showing busy workqueues and worker pools:
[  782.787507] workqueue mm_percpu_wq: flags=0x18
[  782.789366]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
[  782.791577]     pending: vmstat_update{69030}, drain_local_pages_wq{61669} BAR(63){61669}
[  782.794420] workqueue writeback: flags=0x4e
[  782.796202]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  782.798524]     in-flight: 400:wb_workfn{4446} wb_workfn{4446}
[  782.801737] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 178 57 2114
[  782.804596] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=6s workers=2 idle: 4 98
[  782.807263] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=69s workers=3 idle: 2104 17 81
[  782.810031] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=69s workers=2 idle: 18 92
[  782.812672] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=4s workers=5 idle: 2154 2119 84 2158 102
[  782.815738] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=4s workers=2 idle: 24 94
[  782.818558] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=9s workers=5 idle: 2117 2135 163 2143 29
[  782.821472] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=7s workers=2 idle: 95 30
[  782.824111] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=4s workers=6 idle: 86 2122 132 2163 2164 2106
[  782.827268] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=4s workers=2 idle: 36 97
[  782.830008] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=8s workers=9 idle: 2153 2146 306 2120 2144 2145 2140 2155 87
[  782.833435] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=4s workers=2 idle: 93 42
[  782.836114] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 258 2123 506 2160 88
[  782.839049] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 96 48
[  782.841727] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=9s workers=6 idle: 470 2161 2121 2137 89 2150
[  782.844850] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=4s workers=2 idle: 99 54
[  782.847604] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=782s workers=0
[  782.850438] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=782s workers=0

[ 1007.742112] Showing busy workqueues and worker pools:
[ 1007.744110] workqueue events: flags=0x0
[ 1007.745835]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=2/256
[ 1007.748067]     in-flight: 470:rht_deferred_worker{102996} rht_deferred_worker{102996}
[ 1007.750867] workqueue events_freezable_power_: flags=0x84
[ 1007.752933]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256
[ 1007.755118]     in-flight: 2123:disk_events_workfn{27947}
[ 1007.757094] workqueue mm_percpu_wq: flags=0x18
[ 1007.758878]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[ 1007.761177]     pending: drain_local_pages_wq{115542} BAR(2325){115542}
[ 1007.763543] workqueue writeback: flags=0x4e
[ 1007.765229]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1007.767355]     in-flight: 400:wb_workfn{23} wb_workfn{23}
[ 1007.770144] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1007.772130]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[ 1007.774388]     in-flight: 2122:xfs_eofblocks_worker [xfs]{150718}
[ 1007.776700] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 178 57 3323 2114
[ 1007.779630] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 4 98
[ 1007.782237] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=115s workers=4 idle: 2104 17 3322 81
[ 1007.785224] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=115s workers=2 idle: 92 18
[ 1007.788045] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=2 idle: 2154 2119
[ 1007.790817] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 24 94
[ 1007.793534] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 idle: 2117 2135
[ 1007.796327] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 95
[ 1007.799053] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 86 132
[ 1007.801779] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 36 97
[ 1007.804560] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2146 2153 306
[ 1007.807460] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 42 93
[ 1007.810222] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 258 506
[ 1007.812973] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 48 96
[ 1007.815723] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2161 2121
[ 1007.819226] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 99 54
[ 1007.822063] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=1007s workers=0
[ 1007.824581] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=1007s workers=0

[ 1106.385165] Showing busy workqueues and worker pools:
[ 1106.385182] workqueue events: flags=0x0
[ 1106.385218]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1106.385221]     pending: vmw_fb_dirty_flush{63}
[ 1106.385247] workqueue events_long: flags=0x0
[ 1106.385264]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1106.385266]     pending: gc_worker [nf_conntrack]{27}
[ 1106.385297] workqueue events_power_efficient: flags=0x80
[ 1106.385313]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1106.385315]     pending: fb_flashcursor{47}
[ 1106.385375] workqueue writeback: flags=0x4e
[ 1106.385376]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 1106.385379]     in-flight: 400:wb_workfn{7} wb_workfn{7}
[ 1106.386250] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 178 57 3323 2114
[ 1106.386256] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 98 4
[ 1106.386259] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2104 17 3322
[ 1106.386262] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 18 92
[ 1106.386265] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3327 2119 2154
[ 1106.386268] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 94 24
[ 1106.386270] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2117 3324 2135
[ 1106.386273] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 95
[ 1106.386275] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2122 86 132
[ 1106.386278] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 36
[ 1106.386280] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 2146 3332 2153
[ 1106.386283] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=3 idle: 42 3333 93
[ 1106.386286] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=8 idle: 506 3330 3329 258 3328 3325 3326 2123
[ 1106.386290] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 96 48
[ 1106.386293] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 2161 470 3331 2121
[ 1106.386296] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 54 99
[ 1106.386298] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=1106s workers=0
[ 1106.386300] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=1106s workers=0

http://I-love.SAKURA.ne.jp/tmp/serial-20170830-2.txt.xz :

[  277.367469] Showing busy workqueues and worker pools:
[  277.367545] workqueue events_freezable_power_: flags=0x84
[  277.367573]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  277.367577]     in-flight: 88:disk_events_workfn{15431}
[  277.367595] workqueue mm_percpu_wq: flags=0x18
[  277.367622]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[  277.367625]     pending: drain_local_pages_wq{17528} BAR(2405){17527}
[  277.367668] workqueue writeback: flags=0x4e
[  277.367669]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  277.367672]     in-flight: 400:wb_workfn{0}
[  277.367676]     pending: wb_workfn{0}
[  277.368519] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=12 idle: 2103 57 2105 2110 259 2104 3 2099 2106 2108 2107 2109
[  277.368531] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 96 4
[  277.368535] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=17s workers=3 idle: 84 102 17
[  277.368539] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=17s workers=2 idle: 94 18
[  277.368542] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 444 85 23
[  277.368546] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 93 24
[  277.368551] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 86 156 2100 29
[  277.368556] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 92
[  277.368560] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 87 240 35
[  277.368563] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 36 95
[  277.368566] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 2111 308 41
[  277.368571] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 42
[  277.368574] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 47 311 89
[  277.368578] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 99 48
[  277.368581] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 83 146 53
[  277.368585] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 54 98
[  277.368589] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=277s workers=0
[  277.368591] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=277s workers=0

[  488.757378] Showing busy workqueues and worker pools:
[  488.759373] workqueue events_freezable_power_: flags=0x84
[  488.761431]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256
[  488.763659]     in-flight: 3158:disk_events_workfn{37021}
[  488.765712] workqueue mm_percpu_wq: flags=0x18
[  488.767541]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[  488.769772]     pending: vmstat_update{101027}
[  488.771566]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[  488.773812]     pending: drain_local_pages_wq{108219} BAR(3074){108190}
[  488.776363] workqueue writeback: flags=0x4e
[  488.778077]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  488.780471]     in-flight: 400:wb_workfn{36001} wb_workfn{36001}
[  488.784158] workqueue xfs-eofblocks/sda1: flags=0xc
[  488.786568]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256
[  488.788857]     in-flight: 47:xfs_eofblocks_worker [xfs]{156022}
[  488.791150] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3159 2103 57
[  488.794047] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=6s workers=2 idle: 96 4
[  488.796746] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=108s workers=3 idle: 84 102 17
[  488.799533] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=108s workers=2 idle: 94 18
[  488.802291] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=101s workers=3 idle: 444 3152 85
[  488.805121] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=101s workers=2 idle: 24 93
[  488.807894] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 idle: 86 156
[  488.810602] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 30 92
[  488.813337] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 87 3154 240
[  488.816206] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 36 95
[  488.818942] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 2111 308 88
[  488.821804] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 97 42
[  488.824652] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 3150 3153 3155 3156 311
[  488.827793] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=10s workers=2 idle: 99 48
[  488.830590] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 3151 3157 146 83
[  488.833591] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 98 54
[  488.836396] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=488s workers=0
[  488.838876] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=488s workers=0

[  542.541098] Showing busy workqueues and worker pools:
[  542.541110] workqueue events: flags=0x0
[  542.541136]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=2/256
[  542.541139]     in-flight: 3151:vmw_fb_dirty_flush{16400} vmw_fb_dirty_flush{16400}
[  542.541146]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  542.541148]     in-flight: 84:console_callback{647}
[  542.541152]     pending: vmpressure_work_fn{539}
[  542.541265] workqueue writeback: flags=0x4e
[  542.541275]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  542.541278]     in-flight: 400:wb_workfn{0}
[  542.541282]     pending: wb_workfn{0}
[  542.542017] workqueue xfs-buf/sda1: flags=0xc
[  542.542036]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[  542.542039]     pending: xfs_buf_ioend_work [xfs]{300}
[  542.542133] workqueue xfs-eofblocks/sda1: flags=0xc
[  542.542151]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256
[  542.542153]     in-flight: 47:xfs_eofblocks_worker [xfs]{209776}
[  542.542184] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3159 2103 57
[  542.542189] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 96 4
[  542.542193] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 102 17
[  542.542196] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 18 94
[  542.542200] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 444 3152 85
[  542.542204] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 93 24
[  542.542221] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3160 156 86
[  542.542225] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 92
[  542.542229] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 87 3154 240
[  542.542233] pool 9: cpus=4 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 95 36
[  542.542236] pool 10: cpus=5 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 3158 2111 308 88
[  542.542241] pool 11: cpus=5 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 97 42
[  542.542244] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 3150 3155 3153 3156 311
[  542.542249] pool 13: cpus=6 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 99 48
[  542.542253] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 3161 83 3157 146
[  542.542257] pool 15: cpus=7 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 98 54
[  542.542260] pool 16: cpus=8 node=0 flags=0x4 nice=0 hung=542s workers=0
[  542.542262] pool 17: cpus=8 node=0 flags=0x4 nice=-20 hung=542s workers=0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-30 13:51                 ` Tetsuo Handa
@ 2017-08-31  1:46                   ` Tejun Heo
  2017-08-31 14:52                     ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-08-31  1:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hello,

On Wed, Aug 30, 2017 at 10:51:57PM +0900, Tetsuo Handa wrote:
> Here are logs from the patch applied on top of linux-next-20170828.
> Can you find some clue?
> 
> http://I-love.SAKURA.ne.jp/tmp/serial-20170830.txt.xz :
> 
> [  150.580362] Showing busy workqueues and worker pools:
> [  150.580425] workqueue events_power_efficient: flags=0x80
> [  150.580452]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
> [  150.580456]     in-flight: 57:fb_flashcursor{53}
> [  150.580486] workqueue mm_percpu_wq: flags=0x18
> [  150.580513]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
> [  150.580516]     pending: drain_local_pages_wq{14139} BAR(1706){14139}

So, there clearly are work items queued

> [  150.580558] workqueue writeback: flags=0x4e
> [  150.580559]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
> [  150.580562]     in-flight: 400:wb_workfn{0} wb_workfn{0}
> [  150.581413] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 178 3
> [  150.581417] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 4 98
> [  150.581420] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=15s workers=4 idle: 81 2104 17 285
> [  150.581424] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=14s workers=2 idle: 18 92

But all of the pool's workers are staying idle.  The only two
possibilities I can think of are

1. Concurrency management is completely broken for some reason.  One
   reason this could happen is if a work item changes the affinity of
   a per-cpu worker thread.  I don't think this is too likely here.

2. Somehow high memory pressure is preventing the workers from leaving
   idle.  I have no idea how this would happen, but it *could* be that
   there is somehow a memory allocation dependency in the worker wake-up
   path.  Can you strip your kernel config down to a bare minimum and
   see whether the problem still persists?  Alternatively, we can dump
   stack traces of the tasks after a stall detection and try to see if
   the kworkers are stuck somewhere (a rough sketch of such a dump
   follows below).
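
Hooked into show_workqueue_state() while pool->lock is held, it could look
something like this (an untested illustration, not a patch;
sched_show_task() prints a task's state and backtrace):

----------
/*
 * Untested illustration: walk a pool's idle workers and dump their
 * stacks so we can see where, if anywhere, the idle kworkers are stuck.
 * Expects to be called with pool->lock held, e.g. from the pool loop in
 * show_workqueue_state().
 */
static void show_pool_idle_stacks(struct worker_pool *pool)
{
	struct worker *worker;

	list_for_each_entry(worker, &pool->idle_list, entry) {
		pr_info("pool %d: idle kworker %d:\n",
			pool->id, task_pid_nr(worker->task));
		sched_show_task(worker->task);
	}
}
----------

SysRq-t gives roughly the same information for every task at once, just
with a lot more output to wade through.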

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-31  1:46                   ` Tejun Heo
@ 2017-08-31 14:52                     ` Tetsuo Handa
  2017-08-31 15:25                       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-31 14:52 UTC (permalink / raw)
  To: tj; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> 2. Somehow high memory pressure is preventing the workers from leaving
>    idle.  I have no idea how this would happen, but it *could* be that
>    there is somehow a memory allocation dependency in the worker wake-up
>    path.

As far as I have tested, it seems that I'm hitting this case.

I tried to enforce a long schedule_timeout_*() using the patch below (while
manually toggling console_loglevel between 9 and 7 with SysRq-9 and SysRq-7)
so that memory pressure is kept up without warning messages.

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3290,6 +3290,9 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	while (console_loglevel == 9)
+		schedule_timeout_uninterruptible(HZ);
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
@@ -4002,7 +4005,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto nopage;
 
 	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
+	if (0 && time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
----------

Even without passing WQ_HIGHPRI, the pending state for mm_percpu_wq was
eventually resolved if I pressed SysRq-9 and waited.

----------
[ 1588.487142] Showing busy workqueues and worker pools:
[ 1588.487142] workqueue mm_percpu_wq: flags=0x8
[ 1588.487142]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1588.487142]     pending: drain_local_pages_wq{27930} BAR(11781){27929}
[ 1588.487142] workqueue xfs-data/sda1: flags=0xc
[ 1588.487142]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 1588.487142]     in-flight: 10795:xfs_end_io{1}
[ 1588.487142] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=27s workers=2 idle: 10759 10756
[ 1588.487142] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=30s workers=2 idle: 4 370
[ 1588.487142] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=17 idle: 10795 10797 9576 10755 11174 11175 11176 11177 11178 11179 11180 11181 11182 11185 11184 11183 10798
[ 1588.487142] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 18 1966
[ 1588.487142] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 10766 10769 202 9545
[ 1588.487142] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=2 manager: 24 idle: 1910
[ 1588.487142] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 11048 idle: 11047
[ 1588.487142] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 385

[ 2003.653410] Showing busy workqueues and worker pools:
[ 2003.653410] workqueue events_freezable_power_: flags=0x84
[ 2003.653410]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 2003.653410]     in-flight: 202:disk_events_workfn{49675}
[ 2003.653410] workqueue mm_percpu_wq: flags=0x8
[ 2003.653410]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 2003.653410]     pending: drain_local_pages_wq{51051} BAR(12485){51050}
[ 2003.653410] workqueue writeback: flags=0x4e
[ 2003.653410]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 2003.653410]     in-flight: 11188:wb_workfn{59981} wb_workfn{59981}
[ 2003.653410] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=51s workers=7 idle: 12244 10759 12237 12236 12242 12240 10756
[ 2003.653410] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=52s workers=2 idle: 370 4
[ 2003.653410] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=3s workers=2 idle: 10795 9576
[ 2003.653410] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=18s workers=2 idle: 12243 1966
[ 2003.653410] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 10769 13270 10766
[ 2003.653410] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=47s workers=2 idle: 24 12232
[ 2003.653410] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=6 idle: 12238 13271 11047 12241 12235 11048
[ 2003.653410] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=54s workers=2 idle: 12239 30

[ 2685.996827] Showing busy workqueues and worker pools:
[ 2685.996827] workqueue mm_percpu_wq: flags=0x8
[ 2685.996827]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 2685.996827]     pending: drain_local_pages_wq{31500} BAR(13181){31500}
[ 2685.996827] workqueue writeback: flags=0x4e
[ 2685.996827]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 2685.996827]     in-flight: 11188:wb_workfn{415} wb_workfn{415}
[ 2685.996827] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=31s workers=5 idle: 13274 13287 12244 10759 13281
[ 2685.996827] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=38s workers=2 idle: 4 370
[ 2685.996827] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 10795 13288 13273 13280 13283
[ 2685.996827] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=2s workers=3 idle: 12243 13290 1966
[ 2685.996827] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 10769 13285 13284 202
[ 2685.996827] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=0s workers=3 idle: 24 13289 12232
[ 2685.996827] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 13286 12238 13271 13279
[ 2685.996827] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=0s workers=2 idle: 30 12239

[ 2909.490137] Showing busy workqueues and worker pools:
[ 2909.490137] workqueue events_power_efficient: flags=0x80
[ 2909.490137]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 2909.490137]     in-flight: 12238:fb_flashcursor{5}
[ 2909.490137] workqueue events_freezable_power_: flags=0x84
[ 2909.490137]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 2909.490137]     in-flight: 13284:disk_events_workfn{29237}
[ 2909.490137] workqueue mm_percpu_wq: flags=0x8
[ 2909.490137]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 2909.490137]     pending: drain_local_pages_wq{31651} BAR(13160){31651}
[ 2909.490137] workqueue writeback: flags=0x4e
[ 2909.490137]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 2909.490137]     in-flight: 11188:wb_workfn{33925} wb_workfn{33925}
[ 2909.490137] workqueue xfs-data/sda1: flags=0xc
[ 2909.490137]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 2909.490137]     in-flight: 13286:xfs_end_io{94642}
[ 2909.490137] workqueue xfs-eofblocks/sda1: flags=0xc
[ 2909.490137]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 2909.490137]     in-flight: 10795:xfs_eofblocks_worker{216309}
[ 2909.490137] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=31s workers=5 idle: 13274 13287 12244 10759 13281
[ 2909.490137] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=31s workers=2 idle: 4 370
[ 2909.490137] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=17s workers=5 idle: 13288 13273 13280 13283
[ 2909.490137] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=12s workers=3 idle: 12243 13290 1966
[ 2909.490137] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 10769 13285 202
[ 2909.490137] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=25s workers=3 idle: 13289 24 12232
[ 2909.490137] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 13271 13279
[ 2909.490137] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=26s workers=3 idle: 13291 12239 30

[ 2983.903524] Showing busy workqueues and worker pools:
[ 2983.903524] workqueue events_freezable_power_: flags=0x84
[ 2983.903524]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 2983.903524]     in-flight: 13284:disk_events_workfn{96988}
[ 2983.903524] workqueue mm_percpu_wq: flags=0x8
[ 2983.903524]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 2983.903524]     pending: drain_local_pages_wq{34208} BAR(38){34207}
[ 2983.903524] workqueue writeback: flags=0x4e
[ 2983.903524]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 2983.903524]     in-flight: 11188:wb_workfn{101685} wb_workfn{101685}
[ 2983.903524] workqueue xfs-data/sda1: flags=0xc
[ 2983.903524]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 2983.903524]     in-flight: 13286:xfs_end_io{162407}
[ 2983.903524] workqueue xfs-eofblocks/sda1: flags=0xc
[ 2983.903524]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 2983.903524]     in-flight: 10795:xfs_eofblocks_worker{284079}
[ 2983.903524] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=34s workers=4 idle: 13274 13287 12244 10759
[ 2983.903524] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=61s workers=2 idle: 4 370
[ 2983.903524] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=17s workers=4 idle: 13288 13273 13280
[ 2983.903524] pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=50s workers=2 idle: 12243 13290
[ 2983.903524] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 10769 13285
[ 2983.903524] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=67s workers=2 idle: 13289 24
[ 2983.903524] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 12238 13271
[ 2983.903524] pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=94s workers=3 idle: 13291 12239 30
----------

So, this pending state seems to be caused by many concurrent allocations from !PF_WQ_WORKER
threads consuming so much CPU time (they yield the CPU only via many cond_resched() calls and
a single schedule_timeout_uninterruptible(1)) that the schedule_timeout_uninterruptible(1) calls
by PF_WQ_WORKER threads are kept away for minutes at a time. It is a sort of memory allocation
dependency, observable as CPU time starvation that keeps the worker from waking up.

If !PF_WQ_WORKER threads were calling schedule_timeout_uninterruptible(1) more frequently,
PF_WQ_WORKER threads would get a chance to call schedule_timeout_uninterruptible(1)
sooner and flush the pending work items via the rescuer threads.

To solve this problem, we need to make sure that an unlimited number of allocating threads
do not run direct reclaim concurrently. But Michal hates serialization. Hmm.....
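
For the sake of discussion, one shape such a limit could take (purely an
illustrative assumption, not proposed code) is a counting semaphore around
direct reclaim, so that surplus allocators sleep instead of spinning
through cond_resched():

----------
#include <linux/semaphore.h>

/*
 * Illustrative sketch only: allow at most MAX_DIRECT_RECLAIMERS threads
 * into direct reclaim; the rest sleep on the semaphore, which bounds the
 * CPU burn and gives kworkers a chance to run.  kswapd and PF_MEMALLOC
 * contexts would have to be exempted.
 */
#define MAX_DIRECT_RECLAIMERS	16	/* arbitrary example value */

static struct semaphore direct_reclaim_sem =
	__SEMAPHORE_INITIALIZER(direct_reclaim_sem, MAX_DIRECT_RECLAIMERS);

static void direct_reclaim_begin(void)
{
	down(&direct_reclaim_sem);	/* sleeps once the limit is reached */
}

static void direct_reclaim_end(void)
{
	up(&direct_reclaim_sem);
}
----------

The direct reclaim path would be bracketed with these helpers; picking the
limit and the exemptions fairly is exactly where the objection to
serialization comes from.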


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-31 14:52                     ` Tetsuo Handa
@ 2017-08-31 15:25                       ` Michal Hocko
  2017-08-31 22:07                         ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2017-08-31 15:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: tj, akpm, linux-mm, mgorman, vbabka

On Thu 31-08-17 23:52:57, Tetsuo Handa wrote:
[...]
> So, this pending state seems to be caused by many concurrent allocations from !PF_WQ_WORKER
> threads consuming so much CPU time (they yield the CPU only via many cond_resched() calls and
> a single schedule_timeout_uninterruptible(1)) that the schedule_timeout_uninterruptible(1) calls
> by PF_WQ_WORKER threads are kept away for minutes at a time. It is a sort of memory allocation
> dependency, observable as CPU time starvation that keeps the worker from waking up.

I do not understand this. Why is cond_resched from the user context
insufficient to let runnable kworkers run?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-31 15:25                       ` Michal Hocko
@ 2017-08-31 22:07                         ` Tetsuo Handa
  2017-09-01 13:47                           ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2017-08-31 22:07 UTC (permalink / raw)
  To: mhocko; +Cc: tj, akpm, linux-mm, mgorman, vbabka

Michal Hocko wrote:
> On Thu 31-08-17 23:52:57, Tetsuo Handa wrote:
> [...]
> > So, this pending state seems to be caused by many concurrent allocations from !PF_WQ_WORKER
> > threads consuming so much CPU time (they yield the CPU only via many cond_resched() calls and
> > a single schedule_timeout_uninterruptible(1)) that the schedule_timeout_uninterruptible(1) calls
> > by PF_WQ_WORKER threads are kept away for minutes at a time. It is a sort of memory allocation
> > dependency, observable as CPU time starvation that keeps the worker from waking up.
> 
> I do not understand this. Why is cond_resched from the user context
> insufficient to let runnable kworkers run?

cond_resched() from !PF_WQ_WORKER threads is sufficient for PF_WQ_WORKER threads to run.
But cond_resched() is not sufficient for rescuer threads to start processing a pending work item.
An explicit sleep (e.g. schedule_timeout_*()) by PF_WQ_WORKER threads is needed for
rescuer threads to start processing a pending work item.
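
The underlying mechanism is that the workqueue concurrency hook in the
scheduler only runs when a worker blocks voluntarily.  Roughly (an
approximation of the relevant branch of __schedule() in kernel/sched/core.c
of this era, heavily trimmed and not verbatim):

----------
	/*
	 * Approximation only: this branch is taken when the previous task
	 * is going to sleep voluntarily.  cond_resched()/preemption leaves
	 * the task runnable, so this is never reached and the pool is never
	 * asked to wake up another worker to keep making progress.
	 */
	if (!preempt && prev->state) {
		if (prev->flags & PF_WQ_WORKER) {
			struct task_struct *to_wakeup;

			to_wakeup = wq_worker_sleeping(prev);
			if (to_wakeup)
				try_to_wake_up_local(to_wakeup, &rf);
		}
	}
----------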

Since schedule_timeout_*() from PF_WQ_WORKER threads is called from only a few locations
(i.e. from should_reclaim_retry(), __alloc_pages_may_oom() and out_of_memory()), it can
take many seconds for PF_WQ_WORKER threads to reach such a location when many threads (both
PF_WQ_WORKER and !PF_WQ_WORKER) are constantly handing the CPU to one another with
cond_resched() as the only switching point. I think that if the cond_resched() calls inside
the memory allocation path were schedule_timeout_*(), PF_WQ_WORKER threads would reach a real
sleep sooner and allow the rescuer threads to start processing pending work items faster.
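
A minimal sketch of the kind of helper that substitution could use (a
hypothetical name, mirroring the pattern already in should_reclaim_retry();
not a tested proposal):

----------
/*
 * Hypothetical helper, sketch only: a reschedule point for the
 * allocation/reclaim loops that really sleeps when running as a
 * workqueue worker, so the scheduler's PF_WQ_WORKER hook (and hence the
 * concurrency management) can kick in, while staying a cheap
 * cond_resched() for everybody else.
 */
static inline void alloc_resched(void)
{
	if (current->flags & PF_WQ_WORKER)
		schedule_timeout_uninterruptible(1);
	else
		cond_resched();
}
----------

Every cond_resched() in that path would then become alloc_resched(), at the
cost of up to one extra jiffy per iteration for kworkers.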


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-08-31 22:07                         ` Tetsuo Handa
@ 2017-09-01 13:47                           ` Tejun Heo
  2017-09-01 14:29                             ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-09-01 13:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Hello,

On Fri, Sep 01, 2017 at 07:07:25AM +0900, Tetsuo Handa wrote:
> cond_resched() from !PF_WQ_WORKER threads is sufficient for PF_WQ_WORKER threads to run.
> But cond_resched() is not sufficient for rescuer threads to start processing a pending work item.
> An explicit sleep (e.g. schedule_timeout_*()) by PF_WQ_WORKER threads is needed for
> rescuer threads to start processing a pending work item.

I'm not even sure this is the case.  Unless I'm mistaken, in your
workqueue dumps, the available workers couldn't even leave idle, which
means that they likely didn't get scheduled at all.  It looks like
genuine multi-minute starvation by competing direct reclaims.  What's
the load number like while these events are in progress?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.
  2017-09-01 13:47                           ` Tejun Heo
@ 2017-09-01 14:29                             ` Tetsuo Handa
  0 siblings, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2017-09-01 14:29 UTC (permalink / raw)
  To: tj; +Cc: mhocko, akpm, linux-mm, mgorman, vbabka

Tejun Heo wrote:
> On Fri, Sep 01, 2017 at 07:07:25AM +0900, Tetsuo Handa wrote:
> > cond_resched() from !PF_WQ_WORKER threads is sufficient for PF_WQ_WORKER threads to run.
> > But cond_resched() is not sufficient for rescuer threads to start processing a pending work item.
> > An explicit sleep (e.g. schedule_timeout_*()) by PF_WQ_WORKER threads is needed for
> > rescuer threads to start processing a pending work item.
> 
> I'm not even sure this is the case.  Unless I'm mistaken, in your
> workqueue dumps, the available workers couldn't even leave idle, which
> means that they likely didn't get scheduled at all.  It looks like
> genuine multi-minute starvation by competing direct reclaims.  What's
> the load number like while these events are in progress?

I don't know the load number because the system is unresponsive due to the
global OOM situation. All the information I can collect is via printk() from
SysRq. But I guess it is genuine multi-minute starvation by competing direct
reclaims, since I ran 1024 threads on 4 or 8 CPUs / 4GB RAM / no swap in order
to create a heavy memory pressure situation in which the WQ_MEM_RECLAIM
mm_percpu_wq work stays pending whenever I check with SysRq-t.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-09-01 14:29 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1503921210-4603-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
2017-08-28 12:10 ` [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq Michal Hocko
2017-08-28 17:06   ` Tejun Heo
2017-08-28 22:15     ` Tetsuo Handa
2017-08-28 23:02       ` Tejun Heo
2017-08-28 23:09         ` Tejun Heo
2017-08-29 11:14           ` Tetsuo Handa
2017-08-29 14:38             ` Tejun Heo
2017-08-29 21:41               ` Tejun Heo
2017-08-30 13:51                 ` Tetsuo Handa
2017-08-31  1:46                   ` Tejun Heo
2017-08-31 14:52                     ` Tetsuo Handa
2017-08-31 15:25                       ` Michal Hocko
2017-08-31 22:07                         ` Tetsuo Handa
2017-09-01 13:47                           ` Tejun Heo
2017-09-01 14:29                             ` Tetsuo Handa
2017-08-29 13:33     ` Michal Hocko
2017-08-29 14:33       ` Tejun Heo
2017-08-29 20:29         ` Tetsuo Handa
2017-08-30  6:40           ` Michal Hocko
