linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH rfc] workqueue: honour cond_resched() more effectively.
@ 2020-11-09  2:54 NeilBrown
  2020-11-09  7:50 ` Michal Hocko
  2020-11-09  8:00 ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-09  2:54 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Trond Myklebust, Michal Hocko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5128 bytes --]


If a worker task for a normal (bound, non-CPU-intensive) calls
cond_resched(), this allows other non-workqueue processes to run, but
does *not* allow other workqueue workers to run.  This is because
workqueue will only attempt to run one task at a time on each CPU,
unless the current task actually sleeps.

This is already a problem for should_reclaim_retry() in mm/page_alloc.c,
which inserts a small sleep just to convince workqueue to allow other
workers to run.

It can also be a problem for NFS when closing a very large file (e.g.
100 million pages in memory).  NFS can call the final iput() from a
workqueue, which can then take long enough to trigger a workqueue-lockup
warning, and long enough for performance problems to be observed.

I added a WARN_ON_ONCE() in cond_resched() to report when it is run from
a workqueue, and got about 20 hits during boot, many of system_wq (aka
"events") which suggests there is a real chance that worker are being
delayed unnecessarily be tasks which are written to politely relinquish
the CPU.

I think that once a worker calls cond_resched(), it should be treated as
though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-intensive
tasks need to call cond_resched().  This would allow other workers to be
scheduled.

The following patch achieves this I believe.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sched.h     |  7 ++++++-
 include/linux/workqueue.h |  2 ++
 kernel/sched/core.c       |  4 ++++
 kernel/workqueue.c        | 16 +++++++++++++++-
 mm/page_alloc.c           | 12 +-----------
 5 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..728870965df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1784,7 +1784,12 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
 #ifndef CONFIG_PREEMPTION
 extern int _cond_resched(void);
 #else
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+	if (current->flags & PF_WQ_WORKER)
+		workqueue_cond_resched();
+	return 0;
+}
 #endif
 
 #define cond_resched() ({			\
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 8b505d22fc0e..7bcc5717d80f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -626,6 +626,8 @@ static inline bool schedule_delayed_work(struct delayed_work *dwork,
 	return queue_delayed_work(system_wq, dwork, delay);
 }
 
+void workqueue_cond_resched(void);
+
 #ifndef CONFIG_SMP
 static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..5b2e38567a0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5620,6 +5620,8 @@ SYSCALL_DEFINE0(sched_yield)
 #ifndef CONFIG_PREEMPTION
 int __sched _cond_resched(void)
 {
+	if (current->flags & PF_WQ_WORKER)
+		workqueue_cond_resched();
 	if (should_resched(0)) {
 		preempt_schedule_common();
 		return 1;
@@ -5643,6 +5645,8 @@ int __cond_resched_lock(spinlock_t *lock)
 	int resched = should_resched(PREEMPT_LOCK_OFFSET);
 	int ret = 0;
 
+	if (current->flags & PF_WQ_WORKER)
+		workqueue_cond_resched();
 	lockdep_assert_held(lock);
 
 	if (spin_needbreak(lock) || resched) {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 891ccad5f271..fd2e9557b1a6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2296,7 +2296,7 @@ __acquires(&pool->lock)
 	spin_lock_irq(&pool->lock);
 
 	/* clear cpu intensive status */
-	if (unlikely(cpu_intensive))
+	if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
 		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
 	/* tag the worker for identification in schedule() */
@@ -5330,6 +5330,20 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	return ret;
 }
 
+void workqueue_cond_resched(void)
+{
+	struct worker *worker = current_wq_worker();
+
+	if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
+		struct worker_pool *pool = worker->pool;
+
+		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+		if (need_more_worker(pool))
+			wake_up_worker(pool);
+	}
+}
+EXPORT_SYMBOL(workqueue_cond_resched);
+
 #ifdef CONFIG_SYSFS
 /*
  * Workqueues with WQ_SYSFS flag set is visible to userland via
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..0d5720ecbfe7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4420,17 +4420,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	}
 
 out:
-	/*
-	 * Memory allocation/reclaim might be called from a WQ context and the
-	 * current implementation of the WQ concurrency control doesn't
-	 * recognize that a particular WQ is congested if the worker thread is
-	 * looping without ever sleeping. Therefore we have to do a short sleep
-	 * here rather than calling cond_resched().
-	 */
-	if (current->flags & PF_WQ_WORKER)
-		schedule_timeout_uninterruptible(1);
-	else
-		cond_resched();
+	cond_resched();
 	return ret;
 }
 
-- 
2.29.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09  2:54 [PATCH rfc] workqueue: honour cond_resched() more effectively NeilBrown
@ 2020-11-09  7:50 ` Michal Hocko
  2020-11-09  8:00 ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2020-11-09  7:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tejun Heo, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Trond Myklebust, linux-kernel

On Mon 09-11-20 13:54:59, Neil Brown wrote:
> 
> If a worker task for a normal (bound, non-CPU-intensive) calls
> cond_resched(), this allows other non-workqueue processes to run, but
> does *not* allow other workqueue workers to run.  This is because
> workqueue will only attempt to run one task at a time on each CPU,
> unless the current task actually sleeps.
> 
> This is already a problem for should_reclaim_retry() in mm/page_alloc.c,
> which inserts a small sleep just to convince workqueue to allow other
> workers to run.
> 
> It can also be a problem for NFS when closing a very large file (e.g.
> 100 million pages in memory).  NFS can call the final iput() from a
> workqueue, which can then take long enough to trigger a workqueue-lockup
> warning, and long enough for performance problems to be observed.
> 
> I added a WARN_ON_ONCE() in cond_resched() to report when it is run from
> a workqueue, and got about 20 hits during boot, many of system_wq (aka
> "events") which suggests there is a real chance that worker are being
> delayed unnecessarily be tasks which are written to politely relinquish
> the CPU.
> 
> I think that once a worker calls cond_resched(), it should be treated as
> though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-intensive
> tasks need to call cond_resched().  This would allow other workers to be
> scheduled.
> 
> The following patch achieves this I believe.

I cannot really judge the implementation because my understanding of the
WQ concurrency control is very superficial but I echo that the existing
behavior is really nonintuitive. It certainly burnt me for the oom
situations where the page allocator cannot make much progress to reclaim
memory and it has to retry really hard. Having to handle worker context
explicitly/differently is error prone and as your example of final iput
in NFS shows that the allocator is not the only path affected so having
a general solution is better.

That being said I would really love to see cond_resched to work
transparently.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09  2:54 [PATCH rfc] workqueue: honour cond_resched() more effectively NeilBrown
  2020-11-09  7:50 ` Michal Hocko
@ 2020-11-09  8:00 ` Peter Zijlstra
  2020-11-09 13:50   ` Trond Myklebust
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2020-11-09  8:00 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tejun Heo, Lai Jiangshan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Trond Myklebust, Michal Hocko, linux-kernel

On Mon, Nov 09, 2020 at 01:54:59PM +1100, NeilBrown wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4418f5cb8324..728870965df1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1784,7 +1784,12 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
>  #ifndef CONFIG_PREEMPTION
>  extern int _cond_resched(void);
>  #else
> -static inline int _cond_resched(void) { return 0; }
> +static inline int _cond_resched(void)
> +{
> +	if (current->flags & PF_WQ_WORKER)
> +		workqueue_cond_resched();
> +	return 0;
> +}
>  #endif
>  
>  #define cond_resched() ({			\

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..5b2e38567a0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5620,6 +5620,8 @@ SYSCALL_DEFINE0(sched_yield)
>  #ifndef CONFIG_PREEMPTION
>  int __sched _cond_resched(void)
>  {
> +	if (current->flags & PF_WQ_WORKER)
> +		workqueue_cond_resched();
>  	if (should_resched(0)) {
>  		preempt_schedule_common();
>  		return 1;


Much hate for this.. :/ cond_resched() should be a NOP on !PREEMPT and
you wreck that. Also, you call into that workqueue_cond_resched()
unconditionally, even when it wouldn't have rescheduled, which seems
very wrong too.

On top of all that, you're adding an extra load to the funcion :/

At some poine Paul tried to frob cond_resched() for RCU and ran into all
sorts of performance issues, I'm thinking this will too.


Going by your justification for all this:

> I think that once a worker calls cond_resched(), it should be treated as
> though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-intensive
> tasks need to call cond_resched().  This would allow other workers to be
> scheduled.

I'm thinking the real problem is that you're abusing workqueues. Just
don't stuff so much work into it that this becomes a problem. Or rather,
if you do, don't lie to it about it.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09  8:00 ` Peter Zijlstra
@ 2020-11-09 13:50   ` Trond Myklebust
  2020-11-09 14:01     ` tj
  2020-11-09 14:20     ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Trond Myklebust @ 2020-11-09 13:50 UTC (permalink / raw)
  To: peterz, neilb
  Cc: juri.lelli, mingo, jiangshanlai, tj, mhocko, linux-kernel,
	vincent.guittot

On Mon, 2020-11-09 at 09:00 +0100, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 01:54:59PM +1100, NeilBrown wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4418f5cb8324..728870965df1 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1784,7 +1784,12 @@ static inline int
> > test_tsk_need_resched(struct task_struct *tsk)
> >  #ifndef CONFIG_PREEMPTION
> >  extern int _cond_resched(void);
> >  #else
> > -static inline int _cond_resched(void) { return 0; }
> > +static inline int _cond_resched(void)
> > +{
> > +       if (current->flags & PF_WQ_WORKER)
> > +               workqueue_cond_resched();
> > +       return 0;
> > +}
> >  #endif
> >  
> >  #define cond_resched() ({                      \
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..5b2e38567a0c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5620,6 +5620,8 @@ SYSCALL_DEFINE0(sched_yield)
> >  #ifndef CONFIG_PREEMPTION
> >  int __sched _cond_resched(void)
> >  {
> > +       if (current->flags & PF_WQ_WORKER)
> > +               workqueue_cond_resched();
> >         if (should_resched(0)) {
> >                 preempt_schedule_common();
> >                 return 1;
> 
> 
> Much hate for this.. :/ cond_resched() should be a NOP on !PREEMPT
> and
> you wreck that. Also, you call into that workqueue_cond_resched()
> unconditionally, even when it wouldn't have rescheduled, which seems
> very wrong too.
> 
> On top of all that, you're adding an extra load to the funcion :/
> 
> At some poine Paul tried to frob cond_resched() for RCU and ran into
> all
> sorts of performance issues, I'm thinking this will too.
> 
> 
> Going by your justification for all this:
> 
> > I think that once a worker calls cond_resched(), it should be
> > treated as
> > though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-
> > intensive
> > tasks need to call cond_resched().  This would allow other workers
> > to be
> > scheduled.
> 
> I'm thinking the real problem is that you're abusing workqueues. Just
> don't stuff so much work into it that this becomes a problem. Or
> rather,
> if you do, don't lie to it about it.

If we can't use workqueues to call iput_final() on an inode, then what
is the point of having them at all?

Neil's use case is simply a file that has managed to accumulate a
seriously large page cache, and is therefore taking a long time to
complete the call to truncate_inode_pages_final(). Are you saying we
have to allocate a dedicated thread for every case where this happens?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 13:50   ` Trond Myklebust
@ 2020-11-09 14:01     ` tj
  2020-11-09 14:11       ` Trond Myklebust
  2020-11-09 14:20     ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: tj @ 2020-11-09 14:01 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: peterz, neilb, juri.lelli, mingo, jiangshanlai, mhocko,
	linux-kernel, vincent.guittot

Hello,

On Mon, Nov 09, 2020 at 01:50:40PM +0000, Trond Myklebust wrote:
> > I'm thinking the real problem is that you're abusing workqueues. Just
> > don't stuff so much work into it that this becomes a problem. Or
> > rather,
> > if you do, don't lie to it about it.
> 
> If we can't use workqueues to call iput_final() on an inode, then what
> is the point of having them at all?
> 
> Neil's use case is simply a file that has managed to accumulate a
> seriously large page cache, and is therefore taking a long time to
> complete the call to truncate_inode_pages_final(). Are you saying we
> have to allocate a dedicated thread for every case where this happens?

I think the right thing to do here is setting CPU_INTENSIVE or using an
unbound workqueue. Concurrency controlled per-cpu workqueue is unlikely to
be a good fit if the work can run long enough to need cond_resched(). Better
to let the scheduler handle it. Making workqueue warn against long-running
concurrency managed per-cpu work items would be great. I'll put that on my
todo list but if anyone is interested please be my guest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 14:01     ` tj
@ 2020-11-09 14:11       ` Trond Myklebust
  2020-11-09 16:10         ` tj
  0 siblings, 1 reply; 18+ messages in thread
From: Trond Myklebust @ 2020-11-09 14:11 UTC (permalink / raw)
  To: tj
  Cc: jiangshanlai, mhocko, peterz, juri.lelli, mingo, vincent.guittot,
	linux-kernel, neilb

On Mon, 2020-11-09 at 09:01 -0500, tj@kernel.org wrote:
> Hello,
> 
> On Mon, Nov 09, 2020 at 01:50:40PM +0000, Trond Myklebust wrote:
> > > I'm thinking the real problem is that you're abusing workqueues.
> > > Just
> > > don't stuff so much work into it that this becomes a problem. Or
> > > rather,
> > > if you do, don't lie to it about it.
> > 
> > If we can't use workqueues to call iput_final() on an inode, then
> > what
> > is the point of having them at all?
> > 
> > Neil's use case is simply a file that has managed to accumulate a
> > seriously large page cache, and is therefore taking a long time to
> > complete the call to truncate_inode_pages_final(). Are you saying
> > we
> > have to allocate a dedicated thread for every case where this
> > happens?
> 
> I think the right thing to do here is setting CPU_INTENSIVE or using
> an
> unbound workqueue. Concurrency controlled per-cpu workqueue is
> unlikely to
> be a good fit if the work can run long enough to need cond_resched().
> Better
> to let the scheduler handle it. Making workqueue warn against long-
> running
> concurrency managed per-cpu work items would be great. I'll put that
> on my
> todo list but if anyone is interested please be my guest.
> 

That means changing all filesystem code to use cpu-intensive queues. As
far as I can tell, they all use workqueues (most of them using the
standard system queue) for fput(), dput() and/or iput() calls.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 13:50   ` Trond Myklebust
  2020-11-09 14:01     ` tj
@ 2020-11-09 14:20     ` Peter Zijlstra
  2020-11-10  2:26       ` NeilBrown
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2020-11-09 14:20 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: neilb, juri.lelli, mingo, jiangshanlai, tj, mhocko, linux-kernel,
	vincent.guittot

On Mon, Nov 09, 2020 at 01:50:40PM +0000, Trond Myklebust wrote:
> On Mon, 2020-11-09 at 09:00 +0100, Peter Zijlstra wrote:

> > I'm thinking the real problem is that you're abusing workqueues. Just
> > don't stuff so much work into it that this becomes a problem. Or
> > rather,
> > if you do, don't lie to it about it.
> 
> If we can't use workqueues to call iput_final() on an inode, then what
> is the point of having them at all?

Running short stuff, apparently.

> Neil's use case is simply a file that has managed to accumulate a
> seriously large page cache, and is therefore taking a long time to
> complete the call to truncate_inode_pages_final(). Are you saying we
> have to allocate a dedicated thread for every case where this happens?

I'm not saying anything, but you're trying to wreck the scheduler
because of a workqueue 'feature'. The 'new' workqueues limit concurrency
by design, if you're then relying on concurrency for things, you're
using it wrong.

I really don't know what the right answer is here, but I thoroughly hate
the one proposed.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 14:11       ` Trond Myklebust
@ 2020-11-09 16:10         ` tj
  2020-11-17 22:16           ` NeilBrown
                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: tj @ 2020-11-09 16:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: jiangshanlai, mhocko, peterz, juri.lelli, mingo, vincent.guittot,
	linux-kernel, neilb

Hello,

On Mon, Nov 09, 2020 at 02:11:42PM +0000, Trond Myklebust wrote:
> That means changing all filesystem code to use cpu-intensive queues. As
> far as I can tell, they all use workqueues (most of them using the
> standard system queue) for fput(), dput() and/or iput() calls.

I suppose the assumption was that those operations couldn't possiby be
expensive enough to warrant other options, which doesn't seem to be the case
unfortunately. Switching the users to system_unbound_wq, which should be
pretty trivial, seems to be the straight forward solution.

I can definitely see benefits in making workqueue smarter about
concurrency-managed work items taking a long time. Given that nothing on
these types of workqueues can be latency sensitive and the problem being
reported is on the scale of tens of seconds, I think a more palatable
approach could be through watchdog mechanism rather than hooking into
cond_resched(). Something like:

* Run watchdog timer more frequently - e.g. 1/4 of threshold.

* If a work item is occupying the local concurrency for too long, set
  WORKER_CPU_INTENSIVE for the worker and, probably, generate a warning.

I still think this should generate a warning and thus can't replace
switching to unbound wq. The reason is that the concurrency limit isn't the
only problem. A kthread needing to run on one particular CPU for tens of
seconds just isn't great.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 14:20     ` Peter Zijlstra
@ 2020-11-10  2:26       ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-10  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Trond Myklebust
  Cc: juri.lelli, mingo, jiangshanlai, tj, mhocko, linux-kernel,
	vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 3088 bytes --]

On Mon, Nov 09 2020, Peter Zijlstra wrote:

> On Mon, Nov 09, 2020 at 01:50:40PM +0000, Trond Myklebust wrote:
>> On Mon, 2020-11-09 at 09:00 +0100, Peter Zijlstra wrote:
>
>> > I'm thinking the real problem is that you're abusing workqueues. Just
>> > don't stuff so much work into it that this becomes a problem. Or
>> > rather,
>> > if you do, don't lie to it about it.
>> 
>> If we can't use workqueues to call iput_final() on an inode, then what
>> is the point of having them at all?
>
> Running short stuff, apparently.

Also running stuff that sleeps.  If only does work in short bursts, and
sleeps between the works, it can run as long as it likes.
It is only sustained bursts that are currently not supported with
explicit code.

>
>> Neil's use case is simply a file that has managed to accumulate a
>> seriously large page cache, and is therefore taking a long time to
>> complete the call to truncate_inode_pages_final(). Are you saying we
>> have to allocate a dedicated thread for every case where this happens?
>
> I'm not saying anything, but you're trying to wreck the scheduler
> because of a workqueue 'feature'. The 'new' workqueues limit concurrency
> by design, if you're then relying on concurrency for things, you're
> using it wrong.
>
> I really don't know what the right answer is here, but I thoroughly hate
> the one proposed.

Oh good - plenty for room for improvement then :-)

I feel strongly that this should work transparently.  Expecting people
too choose the right option to handle cases that don't often some up in
testing is naive.
A warning whenever a bound,non-CPU-intensive worker calls cond_resched()
is trivial to implement and extremely noise.  As mentioned, I get twenty
just to boot.

One amusing example is rhashtable which schedule a worker to rehash a
table.  This is expected to be cpu-intensive because it calls
cond_resched(), but it is run with schedule_work() - clearly not
realizing that will block other scheduled work on that CPU.

An amusing example for the flip-side is crypto/cryptd.c which creates a
WQ_CPU_INTENSIVE workqueue (cryptd) but the cryptd_queue_worker() has
a comment "Only handle one request at a time to avoid hogging crypto
workqueue." !!! The whole point of WQ_CPU_INTENSIVE is that you cannot
hog the workqueue!!

Anyway, I digress....  warning on ever cond_resched() generates lots of
warnings, including some from printk.... so any work item that might
ever print a message needs to be CPU_INTENSIVE???
I don't think that scales.

Is there some way the scheduler can help?  Does the scheduler notice
"time to check on that CPU over there" and then:
 - if it is in user-space- force it to schedule
 - if it is in kernel-space (and preempt is disabled), then leave it
 alone
 ??

If so, could there be a third case - if it is a bound,non-cpu-intensive
worker, switch it to cpu-intensive???

I wonder how long workers typically run - do many run long enough that
the scheduler might want to ask them to take a break?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 16:10         ` tj
@ 2020-11-17 22:16           ` NeilBrown
       [not found]           ` <20201118025820.307-1-hdanton@sina.com>
  2020-11-19 23:23           ` NeilBrown
  2 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-17 22:16 UTC (permalink / raw)
  To: tj, Trond Myklebust
  Cc: jiangshanlai, mhocko, peterz, juri.lelli, mingo, vincent.guittot,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 18634 bytes --]

On Mon, Nov 09 2020, tj@kernel.org wrote:

> Hello,
>
> On Mon, Nov 09, 2020 at 02:11:42PM +0000, Trond Myklebust wrote:
>> That means changing all filesystem code to use cpu-intensive queues. As
>> far as I can tell, they all use workqueues (most of them using the
>> standard system queue) for fput(), dput() and/or iput() calls.
>
> I suppose the assumption was that those operations couldn't possiby be
> expensive enough to warrant other options, which doesn't seem to be the case
> unfortunately. Switching the users to system_unbound_wq, which should be
> pretty trivial, seems to be the straight forward solution.
>
> I can definitely see benefits in making workqueue smarter about
> concurrency-managed work items taking a long time. Given that nothing on
> these types of workqueues can be latency sensitive and the problem being
> reported is on the scale of tens of seconds, I think a more palatable
> approach could be through watchdog mechanism rather than hooking into
> cond_resched(). Something like:
>
> * Run watchdog timer more frequently - e.g. 1/4 of threshold.
>
> * If a work item is occupying the local concurrency for too long, set
>   WORKER_CPU_INTENSIVE for the worker and, probably, generate a warning.
>
> I still think this should generate a warning and thus can't replace
> switching to unbound wq. The reason is that the concurrency limit isn't the
> only problem. A kthread needing to run on one particular CPU for tens of
> seconds just isn't great.

I don't think that using a timer to trigger a warning is sufficient.
As justification I point to "might_sleep()".  This doesn't wait for
atomic code to *actually* sleep, but produces a warning when atomic code
calls other code that *might* sleep.  This means we catch problems
during development rather that when in production.

For the same reasons, I think we need a warning when a CM-worker calls
code that *might* consume a lot of CPU, only one when it actually *does*
consume a lot of CPU.
So I think that a warning from cond_resched() is the right thing to do.
The patch below does this - it borrows CONFIG_WQ_WATCHDOG to desire whether
to check for the warning.

So I don't think that moving to workitems to system_unbound_wq is always
appropriate.  Sometimes it certain is, as in patch below.  However in
some cases it is not at all clear at the time the workitem is submitted,
how much work will be down.
nfs has a work queue (nfsiod) which handle the reply to an async RPC.
In some cases there is a tiny amount of work to be done.  In rare cases
there is a lot.  Rather than trying to deduce in advance, it is much
easier if the worker can switch mode only when it finds that there is a
lot of work to do.
The patch below supports this with workqueue_prepare_use_cpu() which
switches to CPU_INTENSIVE mode.  It doesn't unbind from the current
CPU.  Maybe it should but I suspect that wouldn't be as easy to code,
and I'm not at all sure of the benefit.

So: I propose the patch below.  Thoughts?

Thanks,
NeilBrown



From: NeilBrown <neilb@suse.de>
Date: Mon, 9 Nov 2020 13:37:17 +1100
Subject: [PATCH] workqueue: warn when cond_resched called from
 concurrency-managed worker

Several workers in concurrency-managed work queues call cond_resched().
This suggest that they might consume a lot of CPU time and are willing
to allow other code to run concurrently.
This *does* allow non-workqueue code to run but does *not* allow other
concurrency-managed work items to run on the same CPU.

In general such workers should not be run from a concurrency-managed
workqueue however:
 1/ there is no mechanism to alert coders to the fact that they are
    using the wrong work queue, and
 2/ sometimes it is not clear, when a work item is created, whether it
    will end up in code that might consume at lot of CPU time.  So
    choosing the best workqueue it not always easy.

This patch addresses both these issues:
 1/ If a concurrency-managed worker calls cond_resched() a warning
    (WARN_ON_ONCE) is generated (if CONFIG_WQ_WATCHDOG is selected).
 2/ A new interface - workqueue_prepare_use_cpu() - is provided which
    may be called by low-level code which might be called in a workqueue
    and might consume CPU.  This switches the worker to CPU_INTENSIVE
    mode so that other work items can run on the same CPU.
    This means there is no need to chose the required behaviour when
    submitting a work item, as the code can switch behaviour when needed.

This patch also changes some code to avoid the warning:
 - in some cases, system_unbound_wq is used instead of system_wq,
   when the work item will normally call cond_resched()
 - in other cases, calls to workqueue_prepare_use_cpu() are inserted.

 - in slab.c, a cond_resched() call is replaced by
     if (tif_need_resched())
            break;
   so that the worker will relinquish CPU and try again later.

There are doubtless more changes needed, but these allow me to boot and
access NFS files without warnings.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/dcache.c                |  2 ++
 fs/nfs/delegation.c        |  5 +++++
 fs/nfs/write.c             |  6 ++++++
 include/linux/rhashtable.h |  4 ++--
 include/linux/sched.h      |  2 ++
 include/linux/workqueue.h  | 30 ++++++++++++++++++++++++++++++
 kernel/rcu/tree.c          |  2 +-
 kernel/workqueue.c         | 33 +++++++++++++++++++++++++++++++--
 lib/rhashtable.c           |  8 ++++----
 mm/page_alloc.c            | 14 ++++----------
 mm/slab.c                  |  3 ++-
 mm/truncate.c              |  3 +++
 net/sunrpc/clnt.c          |  2 +-
 security/keys/gc.c         |  8 ++++----
 security/keys/key.c        |  2 +-
 15 files changed, 98 insertions(+), 26 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ea0485861d93..46c83f712ad3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -686,6 +686,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 	struct inode *inode = dentry->d_inode;
 	struct dentry *parent = NULL;
 
+	workqueue_prepare_use_cpu();
+
 	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
 		goto slow_positive;
 
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 816e1427f17e..cbf4e586ee2c 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -573,6 +573,7 @@ static int nfs_server_return_marked_delegations(struct nfs_server *server,
 	struct nfs_delegation *place_holder_deleg = NULL;
 	int err = 0;
 
+	workqueue_prepare_use_cpu();
 restart:
 	/*
 	 * To avoid quadratic looping we hold a reference
@@ -1097,6 +1098,8 @@ static int nfs_server_reap_unclaimed_delegations(struct nfs_server *server,
 {
 	struct nfs_delegation *delegation;
 	struct inode *inode;
+
+	workqueue_prepare_use_cpu();
 restart:
 	rcu_read_lock();
 restart_locked:
@@ -1229,6 +1232,8 @@ static int nfs_server_reap_expired_delegations(struct nfs_server *server,
 	struct inode *inode;
 	const struct cred *cred;
 	nfs4_stateid stateid;
+
+	workqueue_prepare_use_cpu();
 restart:
 	rcu_read_lock();
 restart_locked:
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 639c34fec04a..e84df784acc6 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1034,6 +1034,8 @@ nfs_scan_commit_list(struct list_head *src, struct list_head *dst,
 	struct nfs_page *req, *tmp;
 	int ret = 0;
 
+	workqueue_prepare_use_cpu();
+
 restart:
 	list_for_each_entry_safe(req, tmp, src, wb_list) {
 		kref_get(&req->wb_kref);
@@ -1839,6 +1841,8 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	struct nfs_commit_info cinfo;
 	struct nfs_server *nfss;
 
+	workqueue_prepare_use_cpu();
+
 	while (!list_empty(&data->pages)) {
 		req = nfs_list_entry(data->pages.next);
 		nfs_list_remove_request(req);
@@ -1924,6 +1928,8 @@ static int __nfs_commit_inode(struct inode *inode, int how,
 	int may_wait = how & FLUSH_SYNC;
 	int ret, nscan;
 
+	workqueue_prepare_use_cpu();
+
 	nfs_init_cinfo_from_inode(&cinfo, inode);
 	nfs_commit_begin(cinfo.mds);
 	for (;;) {
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 68dab3e08aad..01ab43d4b9bb 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -788,7 +788,7 @@ static inline void *__rhashtable_insert_fast(
 	rht_assign_unlock(tbl, bkt, obj);
 
 	if (rht_grow_above_75(ht, tbl))
-		schedule_work(&ht->run_work);
+		queue_work(system_unbound_wq, &ht->run_work);
 
 	data = NULL;
 out:
@@ -1056,7 +1056,7 @@ static inline int __rhashtable_remove_fast_one(
 		atomic_dec(&ht->nelems);
 		if (unlikely(ht->p.automatic_shrinking &&
 			     rht_shrink_below_30(ht, tbl)))
-			schedule_work(&ht->run_work);
+			queue_work(system_unbound_wq, &ht->run_work);
 		err = 0;
 	}
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..3a3f1d9c0bb9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1837,6 +1837,7 @@ static inline int _cond_resched(void) { return 0; }
 
 #define cond_resched() ({			\
 	___might_sleep(__FILE__, __LINE__, 0);	\
+	WARN_ON_ONCE(workqueue_mustnt_use_cpu());	\
 	_cond_resched();			\
 })
 
@@ -1844,6 +1845,7 @@ extern int __cond_resched_lock(spinlock_t *lock);
 
 #define cond_resched_lock(lock) ({				\
 	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
+	WARN_ON_ONCE(workqueue_mustnt_use_cpu());			\
 	__cond_resched_lock(lock);				\
 })
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 26de0cae2a0a..6c6473ee02e1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -626,6 +626,36 @@ static inline bool schedule_delayed_work(struct delayed_work *dwork,
 	return queue_delayed_work(system_wq, dwork, delay);
 }
 
+
+/* Following are #define rather than 'static inline' because
+ * workqueue.h (this file) is included by sched.h, so it cannot
+ * use types and macros defined in that file.
+ *
+ * If workqueue_mustnt_use_cpu() returns false and triggers a warning,
+ * it means that a worker in a concurrency-managed workqueue is calling
+ * cond_resched().  This does NOT allow other workers to run, so it can
+ * block them unduly.  Such workers should either be run on some other
+ * workqueue, such as system_unbound_wq, or must call
+ * workqueue_prepare_use_cpu() so that the worker drops out of
+ * concurrency management.
+ */
+
+#define current_is_wq_worker() (current->flags & PF_WQ_WORKER)
+#ifdef CONFIG_WQ_WATCHDOG
+bool __workqueue_may_use_cpu(void);
+#define workqueue_mustnt_use_cpu()	\
+	(current_is_wq_worker() && !__workqueue_may_use_cpu())
+#else
+#define workqueue_mustnt_use_cpu() false
+#endif
+
+void __workqueue_prepare_use_cpu(void);
+#define workqueue_prepare_use_cpu()			\
+	do {						\
+		if (current_is_wq_worker())		\
+			__workqueue_prepare_use_cpu();	\
+	} while(0)
+
 #ifndef CONFIG_SMP
 static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
 {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 06895ef85d69..907707481d36 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3302,7 +3302,7 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
 			 * channels have been detached following by each
 			 * other.
 			 */
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_rcu_work(system_unbound_wq, &krwp->rcu_work);
 		}
 
 		// Repeat if any "free" corresponding channel is still busy.
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ab593b20fb94..3f5ee4468493 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2295,12 +2295,12 @@ __acquires(&pool->lock)
 	 * stop_machine. At the same time, report a quiescent RCU state so
 	 * the same condition doesn't freeze RCU.
 	 */
-	cond_resched();
+	_cond_resched();
 
 	raw_spin_lock_irq(&pool->lock);
 
 	/* clear cpu intensive status */
-	if (unlikely(cpu_intensive))
+	if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
 		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
 	/* tag the worker for identification in schedule() */
@@ -5345,6 +5345,35 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	return ret;
 }
 
+#ifdef CONFIG_WQ_WATCHDOG
+bool __workqueue_may_use_cpu(void)
+{
+	struct worker *worker = current_wq_worker();
+	if (!worker)
+		return true;
+	/* If any flag in WORKER_NOT_RUNNING is set, the worker is not
+	 * counted for concurrency control, so it can use as much CPU
+	 * as it likes.
+	 */
+	return worker->flags & WORKER_NOT_RUNNING;
+}
+EXPORT_SYMBOL(__workqueue_may_use_cpu);
+#endif
+
+void __workqueue_prepare_use_cpu(void)
+{
+	struct worker *worker = current_wq_worker();
+
+	if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
+		struct worker_pool *pool = worker->pool;
+
+		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+		if (need_more_worker(pool))
+			wake_up_worker(pool);
+	}
+}
+EXPORT_SYMBOL(__workqueue_prepare_use_cpu);
+
 #ifdef CONFIG_SYSFS
 /*
  * Workqueues with WQ_SYSFS flag set is visible to userland via
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c949c1e3b87c..1ef41411cda7 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -433,7 +433,7 @@ static void rht_deferred_worker(struct work_struct *work)
 	mutex_unlock(&ht->mutex);
 
 	if (err)
-		schedule_work(&ht->run_work);
+		queue_work(system_unbound_wq, &ht->run_work);
 }
 
 static int rhashtable_insert_rehash(struct rhashtable *ht,
@@ -468,7 +468,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 		if (err == -EEXIST)
 			err = 0;
 	} else
-		schedule_work(&ht->run_work);
+		queue_work(system_unbound_wq, &ht->run_work);
 
 	return err;
 
@@ -479,7 +479,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
 	/* Schedule async rehash to retry allocation in process context. */
 	if (err == -ENOMEM)
-		schedule_work(&ht->run_work);
+		queue_work(system_unbound_wq, &ht->run_work);
 
 	return err;
 }
@@ -579,7 +579,7 @@ static struct bucket_table *rhashtable_insert_one(
 
 	atomic_inc(&ht->nelems);
 	if (rht_grow_above_75(ht, tbl))
-		schedule_work(&ht->run_work);
+		queue_work(system_unbound_wq, &ht->run_work);
 
 	return NULL;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..69770135e8eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4557,17 +4557,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	}
 
 out:
-	/*
-	 * Memory allocation/reclaim might be called from a WQ context and the
-	 * current implementation of the WQ concurrency control doesn't
-	 * recognize that a particular WQ is congested if the worker thread is
-	 * looping without ever sleeping. Therefore we have to do a short sleep
-	 * here rather than calling cond_resched().
+	/* This might take a while - allow workqueue to schedule other
+	 * work in parallel.
 	 */
-	if (current->flags & PF_WQ_WORKER)
-		schedule_timeout_uninterruptible(1);
-	else
-		cond_resched();
+	workqueue_prepare_use_cpu();
+	cond_resched();
 	return ret;
 }
 
diff --git a/mm/slab.c b/mm/slab.c
index b1113561b98b..2e50603e1b26 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3994,7 +3994,8 @@ static void cache_reap(struct work_struct *w)
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
-		cond_resched();
+		if (tif_need_resched())
+			break;
 	}
 	check_irq_on();
 	mutex_unlock(&slab_mutex);
diff --git a/mm/truncate.c b/mm/truncate.c
index 18cec39a9f53..b333130a7629 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -324,6 +324,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	else
 		end = (lend + 1) >> PAGE_SHIFT;
 
+	workqueue_prepare_use_cpu();
 	pagevec_init(&pvec);
 	index = start;
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
@@ -538,6 +539,7 @@ unsigned long __invalidate_mapping_pages(struct address_space *mapping,
 	unsigned long count = 0;
 	int i;
 
+	workqueue_prepare_use_cpu();
 	pagevec_init(&pvec);
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
@@ -717,6 +719,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		goto out;
 
+	workqueue_prepare_use_cpu();
 	pagevec_init(&pvec);
 	index = start;
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 62e0b6c1e8cf..1ce66672481f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -906,7 +906,7 @@ rpc_free_client(struct rpc_clnt *clnt)
 	put_cred(clnt->cl_cred);
 
 	INIT_WORK(&clnt->cl_work, rpc_free_client_work);
-	schedule_work(&clnt->cl_work);
+	queue_work(system_unbound_wq, &clnt->cl_work);
 	return parent;
 }
 
diff --git a/security/keys/gc.c b/security/keys/gc.c
index 3c90807476eb..a4ad04b78833 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -57,7 +57,7 @@ void key_schedule_gc(time64_t gc_at)
 
 	if (gc_at <= now || test_bit(KEY_GC_REAP_KEYTYPE, &key_gc_flags)) {
 		kdebug("IMMEDIATE");
-		schedule_work(&key_gc_work);
+		queue_work(system_unbound_wq, &key_gc_work);
 	} else if (gc_at < key_gc_next_run) {
 		kdebug("DEFERRED");
 		key_gc_next_run = gc_at;
@@ -72,7 +72,7 @@ void key_schedule_gc(time64_t gc_at)
 void key_schedule_gc_links(void)
 {
 	set_bit(KEY_GC_KEY_EXPIRED, &key_gc_flags);
-	schedule_work(&key_gc_work);
+	queue_work(system_unbound_wq, &key_gc_work);
 }
 
 /*
@@ -106,7 +106,7 @@ void key_gc_keytype(struct key_type *ktype)
 	set_bit(KEY_GC_REAP_KEYTYPE, &key_gc_flags);
 
 	kdebug("schedule");
-	schedule_work(&key_gc_work);
+	queue_work(system_unbound_wq, &key_gc_work);
 
 	kdebug("sleep");
 	wait_on_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE,
@@ -319,7 +319,7 @@ static void key_garbage_collector(struct work_struct *work)
 	}
 
 	if (gc_state & KEY_GC_REAP_AGAIN)
-		schedule_work(&key_gc_work);
+		queue_work(system_unbound_wq, &key_gc_work);
 	kleave(" [end %x]", gc_state);
 	return;
 
diff --git a/security/keys/key.c b/security/keys/key.c
index e282c6179b21..f990e226d74a 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -647,7 +647,7 @@ void key_put(struct key *key)
 		key_check(key);
 
 		if (refcount_dec_and_test(&key->usage))
-			schedule_work(&key_gc_work);
+			queue_work(system_unbound_wq, &key_gc_work);
 	}
 }
 EXPORT_SYMBOL(key_put);
-- 
2.29.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
       [not found]           ` <20201118025820.307-1-hdanton@sina.com>
@ 2020-11-18  5:11             ` NeilBrown
       [not found]             ` <20201118055108.358-1-hdanton@sina.com>
  1 sibling, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-18  5:11 UTC (permalink / raw)
  To: Hillf Danton; +Cc: tj, Trond Myklebust, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 19180 bytes --]

On Wed, Nov 18 2020, Hillf Danton wrote:

> On Wed, 18 Nov 2020 09:16:09 +1100
>> From: NeilBrown <neilb@suse.de>
>> Date: Mon, 9 Nov 2020 13:37:17 +1100
>
> What is the brand of your wall clock? The gap between the Date tag
> above and 18 Nov is longer than a week.

I guess 'git commit --amend' doesn't update the Date: stamp.

>
>> Subject: [PATCH] workqueue: warn when cond_resched called from concurrency-managed worker
>> 
>> Several workers in concurrency-managed work queues call cond_resched().
>> This suggest that they might consume a lot of CPU time and are willing
>> to allow other code to run concurrently.
>> This *does* allow non-workqueue code to run but does *not* allow other
>> concurrency-managed work items to run on the same CPU.
>> 
>> In general such workers should not be run from a concurrency-managed
>> workqueue however:
>>  1/ there is no mechanism to alert coders to the fact that they are
>>     using the wrong work queue, and
>>  2/ sometimes it is not clear, when a work item is created, whether it
>>     will end up in code that might consume at lot of CPU time.  So
>>     choosing the best workqueue it not always easy.
>> 
>> This patch addresses both these issues:
>>  1/ If a concurrency-managed worker calls cond_resched() a warning
>>     (WARN_ON_ONCE) is generated (if CONFIG_WQ_WATCHDOG is selected).
>>  2/ A new interface - workqueue_prepare_use_cpu() - is provided which
>>     may be called by low-level code which might be called in a workqueue
>>     and might consume CPU.  This switches the worker to CPU_INTENSIVE
>>     mode so that other work items can run on the same CPU.
>>     This means there is no need to chose the required behaviour when
>>     submitting a work item, as the code can switch behaviour when needed.
>
> Hm...sounds like the cure can be prepared by splicing the new interface
> above to the cond_resched() in worker's context like
>
> static void foo_work_fn(struct work_struct *work)
> {
> +	bool use;
> ...
> +	use = workqueue_prepare_use_cpu();
> 	cond_resched();
> +	workqueue_end_use_cpu(use);
> ...
> }
>
> bool workqueue_prepare_use_cpu(void)
> {
> 	struct worker *worker = current_wq_worker();
>
> 	if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
> 		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
> 		return true;
> 	} else
> 		return false;
> }
>
> void workqueue_end_use_cpu(bool use)
> {
> 	if (use) {
> 		struct worker *worker = current_wq_worker();
>
> 		worker_clear_flags(worker, WORKER_CPU_INTENSIVE);
> 	}
> }
>
> And in a nutshell it looks like the below helper to avoid touching
> anything like cond_resched().
>
> void workqueue_cond_resched(void)
> {
> 	struct worker *worker = current_wq_worker();
>
> 	if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
> 		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
> 		cond_resched();
> 		worker_clear_flags(worker, WORKER_CPU_INTENSIVE);
> 	} else
> 		cond_resched();
> }

I don't think this is a good idea.
cond_resched() is expected to be called often.  Adding all this extra
work every time is excessive and unnecessary.

It might make sense to add a "workqueue_end_use_cpu()" call at the end
of functions that include "workqueue_prepare_use_cpu()" at the start.  I
don't think it is likely to make a useful difference, but I'm open to
the possibility if anyone can make a case for it.

Thanks,
NeilBrown


>
>> 
>> This patch also changes some code to avoid the warning:
>>  - in some cases, system_unbound_wq is used instead of system_wq,
>>    when the work item will normally call cond_resched()
>>  - in other cases, calls to workqueue_prepare_use_cpu() are inserted.
>> 
>>  - in slab.c, a cond_resched() call is replaced by
>>      if (tif_need_resched())
>>             break;
>>    so that the worker will relinquish CPU and try again later.
>> 
>> There are doubtless more changes needed, but these allow me to boot and
>> access NFS files without warnings.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> =2D--
>>  fs/dcache.c                |  2 ++
>>  fs/nfs/delegation.c        |  5 +++++
>>  fs/nfs/write.c             |  6 ++++++
>>  include/linux/rhashtable.h |  4 ++--
>>  include/linux/sched.h      |  2 ++
>>  include/linux/workqueue.h  | 30 ++++++++++++++++++++++++++++++
>>  kernel/rcu/tree.c          |  2 +-
>>  kernel/workqueue.c         | 33 +++++++++++++++++++++++++++++++--
>>  lib/rhashtable.c           |  8 ++++----
>>  mm/page_alloc.c            | 14 ++++----------
>>  mm/slab.c                  |  3 ++-
>>  mm/truncate.c              |  3 +++
>>  net/sunrpc/clnt.c          |  2 +-
>>  security/keys/gc.c         |  8 ++++----
>>  security/keys/key.c        |  2 +-
>>  15 files changed, 98 insertions(+), 26 deletions(-)
>> 
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index ea0485861d93..46c83f712ad3 100644
>> =2D-- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -686,6 +686,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>  	struct inode *inode =3D dentry->d_inode;
>>  	struct dentry *parent =3D NULL;
>> =20
>> +	workqueue_prepare_use_cpu();
>> +
>>  	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
>>  		goto slow_positive;
>> =20
>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> index 816e1427f17e..cbf4e586ee2c 100644
>> =2D-- a/fs/nfs/delegation.c
>> +++ b/fs/nfs/delegation.c
>> @@ -573,6 +573,7 @@ static int nfs_server_return_marked_delegations(struct =
>> nfs_server *server,
>>  	struct nfs_delegation *place_holder_deleg =3D NULL;
>>  	int err =3D 0;
>> =20
>> +	workqueue_prepare_use_cpu();
>>  restart:
>>  	/*
>>  	 * To avoid quadratic looping we hold a reference
>> @@ -1097,6 +1098,8 @@ static int nfs_server_reap_unclaimed_delegations(stru=
>> ct nfs_server *server,
>>  {
>>  	struct nfs_delegation *delegation;
>>  	struct inode *inode;
>> +
>> +	workqueue_prepare_use_cpu();
>>  restart:
>>  	rcu_read_lock();
>>  restart_locked:
>> @@ -1229,6 +1232,8 @@ static int nfs_server_reap_expired_delegations(struct=
>>  nfs_server *server,
>>  	struct inode *inode;
>>  	const struct cred *cred;
>>  	nfs4_stateid stateid;
>> +
>> +	workqueue_prepare_use_cpu();
>>  restart:
>>  	rcu_read_lock();
>>  restart_locked:
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index 639c34fec04a..e84df784acc6 100644
>> =2D-- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -1034,6 +1034,8 @@ nfs_scan_commit_list(struct list_head *src, struct li=
>> st_head *dst,
>>  	struct nfs_page *req, *tmp;
>>  	int ret =3D 0;
>> =20
>> +	workqueue_prepare_use_cpu();
>> +
>>  restart:
>>  	list_for_each_entry_safe(req, tmp, src, wb_list) {
>>  		kref_get(&req->wb_kref);
>> @@ -1839,6 +1841,8 @@ static void nfs_commit_release_pages(struct nfs_commi=
>> t_data *data)
>>  	struct nfs_commit_info cinfo;
>>  	struct nfs_server *nfss;
>> =20
>> +	workqueue_prepare_use_cpu();
>> +
>>  	while (!list_empty(&data->pages)) {
>>  		req =3D nfs_list_entry(data->pages.next);
>>  		nfs_list_remove_request(req);
>> @@ -1924,6 +1928,8 @@ static int __nfs_commit_inode(struct inode *inode, in=
>> t how,
>>  	int may_wait =3D how & FLUSH_SYNC;
>>  	int ret, nscan;
>> =20
>> +	workqueue_prepare_use_cpu();
>> +
>>  	nfs_init_cinfo_from_inode(&cinfo, inode);
>>  	nfs_commit_begin(cinfo.mds);
>>  	for (;;) {
>> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
>> index 68dab3e08aad..01ab43d4b9bb 100644
>> =2D-- a/include/linux/rhashtable.h
>> +++ b/include/linux/rhashtable.h
>> @@ -788,7 +788,7 @@ static inline void *__rhashtable_insert_fast(
>>  	rht_assign_unlock(tbl, bkt, obj);
>> =20
>>  	if (rht_grow_above_75(ht, tbl))
>> =2D		schedule_work(&ht->run_work);
>> +		queue_work(system_unbound_wq, &ht->run_work);
>> =20
>>  	data =3D NULL;
>>  out:
>> @@ -1056,7 +1056,7 @@ static inline int __rhashtable_remove_fast_one(
>>  		atomic_dec(&ht->nelems);
>>  		if (unlikely(ht->p.automatic_shrinking &&
>>  			     rht_shrink_below_30(ht, tbl)))
>> =2D			schedule_work(&ht->run_work);
>> +			queue_work(system_unbound_wq, &ht->run_work);
>>  		err =3D 0;
>>  	}
>> =20
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 063cd120b459..3a3f1d9c0bb9 100644
>> =2D-- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1837,6 +1837,7 @@ static inline int _cond_resched(void) { return 0; }
>> =20
>>  #define cond_resched() ({			\
>>  	___might_sleep(__FILE__, __LINE__, 0);	\
>> +	WARN_ON_ONCE(workqueue_mustnt_use_cpu());	\
>>  	_cond_resched();			\
>>  })
>> =20
>> @@ -1844,6 +1845,7 @@ extern int __cond_resched_lock(spinlock_t *lock);
>> =20
>>  #define cond_resched_lock(lock) ({				\
>>  	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
>> +	WARN_ON_ONCE(workqueue_mustnt_use_cpu());			\
>>  	__cond_resched_lock(lock);				\
>>  })
>> =20
>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>> index 26de0cae2a0a..6c6473ee02e1 100644
>> =2D-- a/include/linux/workqueue.h
>> +++ b/include/linux/workqueue.h
>> @@ -626,6 +626,36 @@ static inline bool schedule_delayed_work(struct delaye=
>> d_work *dwork,
>>  	return queue_delayed_work(system_wq, dwork, delay);
>>  }
>> =20
>> +
>> +/* Following are #define rather than 'static inline' because
>> + * workqueue.h (this file) is included by sched.h, so it cannot
>> + * use types and macros defined in that file.
>> + *
>> + * If workqueue_mustnt_use_cpu() returns false and triggers a warning,
>> + * it means that a worker in a concurrency-managed workqueue is calling
>> + * cond_resched().  This does NOT allow other workers to run, so it can
>> + * block them unduly.  Such workers should either be run on some other
>> + * workqueue, such as system_unbound_wq, or must call
>> + * workqueue_prepare_use_cpu() so that the worker drops out of
>> + * concurrency management.
>> + */
>> +
>> +#define current_is_wq_worker() (current->flags & PF_WQ_WORKER)
>> +#ifdef CONFIG_WQ_WATCHDOG
>> +bool __workqueue_may_use_cpu(void);
>> +#define workqueue_mustnt_use_cpu()	\
>> +	(current_is_wq_worker() && !__workqueue_may_use_cpu())
>> +#else
>> +#define workqueue_mustnt_use_cpu() false
>> +#endif
>> +
>> +void __workqueue_prepare_use_cpu(void);
>> +#define workqueue_prepare_use_cpu()			\
>> +	do {						\
>> +		if (current_is_wq_worker())		\
>> +			__workqueue_prepare_use_cpu();	\
>> +	} while(0)
>> +
>>  #ifndef CONFIG_SMP
>>  static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
>>  {
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 06895ef85d69..907707481d36 100644
>> =2D-- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -3302,7 +3302,7 @@ static inline bool queue_kfree_rcu_work(struct kfree_=
>> rcu_cpu *krcp)
>>  			 * channels have been detached following by each
>>  			 * other.
>>  			 */
>> =2D			queue_rcu_work(system_wq, &krwp->rcu_work);
>> +			queue_rcu_work(system_unbound_wq, &krwp->rcu_work);
>>  		}
>> =20
>>  		// Repeat if any "free" corresponding channel is still busy.
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index ab593b20fb94..3f5ee4468493 100644
>> =2D-- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -2295,12 +2295,12 @@ __acquires(&pool->lock)
>>  	 * stop_machine. At the same time, report a quiescent RCU state so
>>  	 * the same condition doesn't freeze RCU.
>>  	 */
>> =2D	cond_resched();
>> +	_cond_resched();
>> =20
>>  	raw_spin_lock_irq(&pool->lock);
>> =20
>>  	/* clear cpu intensive status */
>> =2D	if (unlikely(cpu_intensive))
>> +	if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
>>  		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
>> =20
>>  	/* tag the worker for identification in schedule() */
>> @@ -5345,6 +5345,35 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpum=
>> ask)
>>  	return ret;
>>  }
>> =20
>> +#ifdef CONFIG_WQ_WATCHDOG
>> +bool __workqueue_may_use_cpu(void)
>> +{
>> +	struct worker *worker =3D current_wq_worker();
>> +	if (!worker)
>> +		return true;
>> +	/* If any flag in WORKER_NOT_RUNNING is set, the worker is not
>> +	 * counted for concurrency control, so it can use as much CPU
>> +	 * as it likes.
>> +	 */
>> +	return worker->flags & WORKER_NOT_RUNNING;
>> +}
>> +EXPORT_SYMBOL(__workqueue_may_use_cpu);
>> +#endif
>> +
>> +void __workqueue_prepare_use_cpu(void)
>> +{
>> +	struct worker *worker =3D current_wq_worker();
>> +
>> +	if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
>> +		struct worker_pool *pool =3D worker->pool;
>> +
>> +		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
>> +		if (need_more_worker(pool))
>> +			wake_up_worker(pool);
>> +	}
>> +}
>> +EXPORT_SYMBOL(__workqueue_prepare_use_cpu);
>> +
>>  #ifdef CONFIG_SYSFS
>>  /*
>>   * Workqueues with WQ_SYSFS flag set is visible to userland via
>> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
>> index c949c1e3b87c..1ef41411cda7 100644
>> =2D-- a/lib/rhashtable.c
>> +++ b/lib/rhashtable.c
>> @@ -433,7 +433,7 @@ static void rht_deferred_worker(struct work_struct *wor=
>> k)
>>  	mutex_unlock(&ht->mutex);
>> =20
>>  	if (err)
>> =2D		schedule_work(&ht->run_work);
>> +		queue_work(system_unbound_wq, &ht->run_work);
>>  }
>> =20
>>  static int rhashtable_insert_rehash(struct rhashtable *ht,
>> @@ -468,7 +468,7 @@ static int rhashtable_insert_rehash(struct rhashtable *=
>> ht,
>>  		if (err =3D=3D -EEXIST)
>>  			err =3D 0;
>>  	} else
>> =2D		schedule_work(&ht->run_work);
>> +		queue_work(system_unbound_wq, &ht->run_work);
>> =20
>>  	return err;
>> =20
>> @@ -479,7 +479,7 @@ static int rhashtable_insert_rehash(struct rhashtable *=
>> ht,
>> =20
>>  	/* Schedule async rehash to retry allocation in process context. */
>>  	if (err =3D=3D -ENOMEM)
>> =2D		schedule_work(&ht->run_work);
>> +		queue_work(system_unbound_wq, &ht->run_work);
>> =20
>>  	return err;
>>  }
>> @@ -579,7 +579,7 @@ static struct bucket_table *rhashtable_insert_one(
>> =20
>>  	atomic_inc(&ht->nelems);
>>  	if (rht_grow_above_75(ht, tbl))
>> =2D		schedule_work(&ht->run_work);
>> +		queue_work(system_unbound_wq, &ht->run_work);
>> =20
>>  	return NULL;
>>  }
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 23f5066bd4a5..69770135e8eb 100644
>> =2D-- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4557,17 +4557,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>  	}
>> =20
>>  out:
>> =2D	/*
>> =2D	 * Memory allocation/reclaim might be called from a WQ context and the
>> =2D	 * current implementation of the WQ concurrency control doesn't
>> =2D	 * recognize that a particular WQ is congested if the worker thread is
>> =2D	 * looping without ever sleeping. Therefore we have to do a short sleep
>> =2D	 * here rather than calling cond_resched().
>> +	/* This might take a while - allow workqueue to schedule other
>> +	 * work in parallel.
>>  	 */
>> =2D	if (current->flags & PF_WQ_WORKER)
>> =2D		schedule_timeout_uninterruptible(1);
>> =2D	else
>> =2D		cond_resched();
>> +	workqueue_prepare_use_cpu();
>> +	cond_resched();
>>  	return ret;
>>  }
>> =20
>> diff --git a/mm/slab.c b/mm/slab.c
>> index b1113561b98b..2e50603e1b26 100644
>> =2D-- a/mm/slab.c
>> +++ b/mm/slab.c
>> @@ -3994,7 +3994,8 @@ static void cache_reap(struct work_struct *w)
>>  			STATS_ADD_REAPED(searchp, freed);
>>  		}
>>  next:
>> =2D		cond_resched();
>> +		if (tif_need_resched())
>> +			break;
>>  	}
>>  	check_irq_on();
>>  	mutex_unlock(&slab_mutex);
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 18cec39a9f53..b333130a7629 100644
>> =2D-- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -324,6 +324,7 @@ void truncate_inode_pages_range(struct address_space *m=
>> apping,
>>  	else
>>  		end =3D (lend + 1) >> PAGE_SHIFT;
>> =20
>> +	workqueue_prepare_use_cpu();
>>  	pagevec_init(&pvec);
>>  	index =3D start;
>>  	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
>> @@ -538,6 +539,7 @@ unsigned long __invalidate_mapping_pages(struct address=
>> _space *mapping,
>>  	unsigned long count =3D 0;
>>  	int i;
>> =20
>> +	workqueue_prepare_use_cpu();
>>  	pagevec_init(&pvec);
>>  	while (index <=3D end && pagevec_lookup_entries(&pvec, mapping, index,
>>  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
>> @@ -717,6 +719,7 @@ int invalidate_inode_pages2_range(struct address_space =
>> *mapping,
>>  	if (mapping->nrpages =3D=3D 0 && mapping->nrexceptional =3D=3D 0)
>>  		goto out;
>> =20
>> +	workqueue_prepare_use_cpu();
>>  	pagevec_init(&pvec);
>>  	index =3D start;
>>  	while (index <=3D end && pagevec_lookup_entries(&pvec, mapping, index,
>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>> index 62e0b6c1e8cf..1ce66672481f 100644
>> =2D-- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -906,7 +906,7 @@ rpc_free_client(struct rpc_clnt *clnt)
>>  	put_cred(clnt->cl_cred);
>> =20
>>  	INIT_WORK(&clnt->cl_work, rpc_free_client_work);
>> =2D	schedule_work(&clnt->cl_work);
>> +	queue_work(system_unbound_wq, &clnt->cl_work);
>>  	return parent;
>>  }
>> =20
>> diff --git a/security/keys/gc.c b/security/keys/gc.c
>> index 3c90807476eb..a4ad04b78833 100644
>> =2D-- a/security/keys/gc.c
>> +++ b/security/keys/gc.c
>> @@ -57,7 +57,7 @@ void key_schedule_gc(time64_t gc_at)
>> =20
>>  	if (gc_at <=3D now || test_bit(KEY_GC_REAP_KEYTYPE, &key_gc_flags)) {
>>  		kdebug("IMMEDIATE");
>> =2D		schedule_work(&key_gc_work);
>> +		queue_work(system_unbound_wq, &key_gc_work);
>>  	} else if (gc_at < key_gc_next_run) {
>>  		kdebug("DEFERRED");
>>  		key_gc_next_run =3D gc_at;
>> @@ -72,7 +72,7 @@ void key_schedule_gc(time64_t gc_at)
>>  void key_schedule_gc_links(void)
>>  {
>>  	set_bit(KEY_GC_KEY_EXPIRED, &key_gc_flags);
>> =2D	schedule_work(&key_gc_work);
>> +	queue_work(system_unbound_wq, &key_gc_work);
>>  }
>> =20
>>  /*
>> @@ -106,7 +106,7 @@ void key_gc_keytype(struct key_type *ktype)
>>  	set_bit(KEY_GC_REAP_KEYTYPE, &key_gc_flags);
>> =20
>>  	kdebug("schedule");
>> =2D	schedule_work(&key_gc_work);
>> +	queue_work(system_unbound_wq, &key_gc_work);
>> =20
>>  	kdebug("sleep");
>>  	wait_on_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE,
>> @@ -319,7 +319,7 @@ static void key_garbage_collector(struct work_struct *w=
>> ork)
>>  	}
>> =20
>>  	if (gc_state & KEY_GC_REAP_AGAIN)
>> =2D		schedule_work(&key_gc_work);
>> +		queue_work(system_unbound_wq, &key_gc_work);
>>  	kleave(" [end %x]", gc_state);
>>  	return;
>> =20
>> diff --git a/security/keys/key.c b/security/keys/key.c
>> index e282c6179b21..f990e226d74a 100644
>> =2D-- a/security/keys/key.c
>> +++ b/security/keys/key.c
>> @@ -647,7 +647,7 @@ void key_put(struct key *key)
>>  		key_check(key);
>> =20
>>  		if (refcount_dec_and_test(&key->usage))
>> =2D			schedule_work(&key_gc_work);
>> +			queue_work(system_unbound_wq, &key_gc_work);
>>  	}
>>  }
>>  EXPORT_SYMBOL(key_put);
>> =2D-=20
>> 2.29.2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
       [not found]             ` <20201118055108.358-1-hdanton@sina.com>
@ 2020-11-19 23:07               ` NeilBrown
  2020-12-02 20:20                 ` tj
       [not found]               ` <20201120025953.607-1-hdanton@sina.com>
  1 sibling, 1 reply; 18+ messages in thread
From: NeilBrown @ 2020-11-19 23:07 UTC (permalink / raw)
  To: Hillf Danton; +Cc: tj, Trond Myklebust, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]

On Wed, Nov 18 2020, Hillf Danton wrote:

> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>> On Wed, Nov 18 2020, Hillf Danton wrote:
>> ...
>> I don't think this is a good idea.
>
> Let me add a few more words.
>
>> cond_resched() is expected to be called often.  Adding all this extra
>
> They are those only invoked in concurrency-managed worker contexts and
> are thus supposed to be less often than thought; what is more the callers
> know what they are doing if a schedule() follows up, needless to say it
> is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE given
> 	WARN_ON_ONCE(workqueue_mustnt_use_cpu())
> added in cond_resched().

"supposed to be less often" is the central point here.
Because the facts are that they sometime happen with high frequency
despite what is "supposed" to happen.
Either the assumption that CM-workers don't call cond_resched() is
wrong, or the code that schedules such workers on CM-queues is wrong.

I much prefer the perspective that the assumption is wrong.  If that is
agreed then we need to handle that circumstance without making
cond_resched() more expensive.
Note that adding WARN_ON_ONCE() does not make it more expensive as it is
only enabled with KERNEL_DEBUG (and WQ_WATCHDOG, though the particular
config option could be changed). It isn't needed in production.

If the workqueue maintainers are unmovable in the position that a
CM-workitem must not use excessive CPU ever, and so must not call
cond_resched(), then I can take that back to the NFS maintainers and
negotiate different workqueue settings.  But as I've said, I think this
is requiring the decision to be made in a place that is not well
positioned to make it.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-09 16:10         ` tj
  2020-11-17 22:16           ` NeilBrown
       [not found]           ` <20201118025820.307-1-hdanton@sina.com>
@ 2020-11-19 23:23           ` NeilBrown
  2020-11-25 12:36             ` tj
  2 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2020-11-19 23:23 UTC (permalink / raw)
  To: tj, Trond Myklebust
  Cc: jiangshanlai, mhocko, peterz, juri.lelli, mingo, vincent.guittot,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On Mon, Nov 09 2020, tj@kernel.org wrote:

>                                                    Given that nothing on
> these types of workqueues can be latency sensitive

This caught my eye and it seems worth drilling in to.  There is no
mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
be saying there is an undocumented assumption that latency-sensitive
work items much not be scheduled on CM-workqueues.
Is that correct?

NFS writes are latency sensitive to a degree as increased latency per
request will hurt overall throughput.  Does this mean that handling
write-completion in a CM-wq is a poor choice?
Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
can be used to determine when WQ_HIGHPRI is appropriate?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
       [not found]               ` <20201120025953.607-1-hdanton@sina.com>
@ 2020-11-20  4:33                 ` NeilBrown
       [not found]                 ` <20201126100646.1790-1-hdanton@sina.com>
  1 sibling, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-20  4:33 UTC (permalink / raw)
  To: Hillf Danton; +Cc: TJ, Trond Myklebust, PeterZ, LKML

[-- Attachment #1: Type: text/plain, Size: 6234 bytes --]

On Fri, Nov 20 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 10:07:56 +1100 NeilBrown wrote:
>>On Wed, Nov 18 2020, Hillf Danton wrote:
>>> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>>>> On Wed, Nov 18 2020, Hillf Danton wrote:
>>>> ...
>>>> I don't think this is a good idea.
>>>
>>> Let me add a few more words.
>>>
>>>> cond_resched() is expected to be called often.  Adding all this extra
>>>
>>> They are those only invoked in concurrency-managed worker contexts and
>>> are thus supposed to be less often than thought; what is more the callers
>>> know what they are doing if a schedule() follows up, needless to say it
>>> is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE given
>>> 	WARN_ON_ONCE(workqueue_mustnt_use_cpu())
>>> added in cond_resched().
>>
>>"supposed to be less often" is the central point here.
>
> No, it is not in any shape, see below.
>
>>Because the facts are that they sometime happen with high frequency
>>despite what is "supposed" to happen.
>
> Feel free to point me to a couple of such workers. I want to see
> how high it is and why.

The patch should suggest some.
Any work item which calls iput() might find it self in iput_final() and
then truncate_inode_pages_range() which will call cond_resched() once
for every 16 or fewer pages.  If there are millions of pages ....

When a reply is received for an async NFS request (e.g. WRITE, but
several others), the processing happens in a workqueue (nfsiod), and this
will often call iput(), but rarely will that lead to iput_final().
Also, lots of non-workqueue code calls iput(), so adding code to an
inner-loop would cost everyone.

Any worker which allocates memory might find itself in
should_reclaim_retry() which calls cond_resched().  I don't know how
frequently this will fire.

The slab memory allocator uses a system_wq worker to reap a cache.  I
don't know exactly what that means but cache_reap() seems to need to
call cond_resched() periodically.  Maybe it should use be a
WQ_CPU_INTENSIVE workqueue, but there isn't a system_cpu_wq....
Using system_unbound_wq() as it is doing per-CPU work.

>
>>Either the assumption that CM-workers don't call cond_resched() is
>>wrong, or the code that schedules such workers on CM-queues is wrong.
>>
>>I much prefer the perspective that the assumption is wrong.  If that is
>>agreed then we need to handle that circumstance without making
>>cond_resched() more expensive.
>
> This is the central point I think; it is a mile in between what
> you are trying to fix and what you are adding in cond_resched().

My latest patch only adds a WARNING to cond_resched(), so that we can
find problem code before it becomes a problem.  I did previously try
adding more to cond_resched(), and PeterZ didn't like that at all.

I agree that fixing the problem cannot be in cond_resched().  I think
that finding the scope of the problem is best done by instrumenting
cond_resched() (when DEBUG_KERNEL is selected).

>
>>Note that adding WARN_ON_ONCE() does not make it more expensive as it is
>>only enabled with KERNEL_DEBUG (and WQ_WATCHDOG, though the particular
>>config option could be changed). It isn't needed in production.
>
> Because cond_resched() is not the right place from the beginning
> for debugging like this, something in workqueue's backyard by
> design.  It's been there for a while, in production or not.

I don't understand your reasoning.  I don't see why one subsystem cannot
provide debugging to help some other subsystem.  Many subsystems add
"might_sleep()", not to detect bugs in themselves but to detect bugs in
their callers.  Adding a WARNING to cond_resched() helps us find bugs in
code that calls cond_resched()...

>>
>>If the workqueue maintainers are unmovable in the position that a
>
> They are open to any good thoughts, yesterday and tomorrow.
>
>>CM-workitem must not use excessive CPU ever, and so must not call
>>cond_resched(), then I can take that back to the NFS maintainers and
>>negotiate different workqueue settings. 
>
> That sounds like an easy road to go without either touching
> cond_resched() or adding a couple of lines in workqueue.  But
> the rising question is why you are branching to a new direction
> overnight if you think your thoughts are fine to fix an issue
> you observed in wq's property.

I'm branching off because I'm getting push-back and so am trying to
explore the problem space.
My first idea was to add WQ_CPU_INTENSIVE to the nfsiod workqueue, but
Trond wondered what was special about NFS.  Many filesystems call iput
from a workqueue, so finding a solution that helps them all is best.
I then suggested getting cond_resched() to do something more useful when
called by a worker.  PeterZ didn't like the overhead.
Also, TJ seemed to be against auto-adjusting for cpu-intensive code,
preferring the right sort of workqueue to be chosen up front.

I'm not really well placed to assess the validity of any of these
objections, so I'm trying to respond to them without completely giving
in to any of them.  Hence the "new direction overnight".

As a "user" of workqueues I would much much rather there was only one,
and that it always did the right thing.  Maybe I would have to put up
with two, but we currently have
 system_wq, system_highpri_wq, system_long_wq,
 system_unbound_wq, system_freezable_wq, system_power_efficient_wq,
 system_freezable_power_efficient_wq
plus the ability to create your own.  It is an embarrassment of riches
and I really wonder how many people know how to choose the right one.

So I'm not very keen on "make sure you choose the right type of wq", but
if that really is best, I certainly want automatic help to know when
I've made a harmful choice.

>
>>But as I've said, I think this
>>is requiring the decision to be made in a place that is not well
>>positioned to make it.
>
> I say no to asking NFS to take a pill because WQ got a cold.

That's good to know!  I think that means that when a work item happens
to consume a lot of CPU, it needs to stop blocking other work items and
instead share the CPU with them.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-19 23:23           ` NeilBrown
@ 2020-11-25 12:36             ` tj
  2020-11-26 23:30               ` NeilBrown
  0 siblings, 1 reply; 18+ messages in thread
From: tj @ 2020-11-25 12:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, jiangshanlai, mhocko, peterz, juri.lelli, mingo,
	vincent.guittot, linux-kernel

Hello,

On Fri, Nov 20, 2020 at 10:23:44AM +1100, NeilBrown wrote:
> On Mon, Nov 09 2020, tj@kernel.org wrote:
> 
> >                                                    Given that nothing on
> > these types of workqueues can be latency sensitive
> 
> This caught my eye and it seems worth drilling in to.  There is no
> mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
> be saying there is an undocumented assumption that latency-sensitive
> work items much not be scheduled on CM-workqueues.
> Is that correct?

Yeah, correct. Because they're all sharing execution concurrency, the
latency consistency is likely a lot worse.

> NFS writes are latency sensitive to a degree as increased latency per
> request will hurt overall throughput.  Does this mean that handling
> write-completion in a CM-wq is a poor choice?
> Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
> can be used to determine when WQ_HIGHPRI is appropriate?

I don't think it'd need HIGHPRI but UNBOUND or CPU_INTENSIVE would make
sense. I think the rule of the thumb is along the line of if you're worried
about cpu consumption or latency, let the scheduler take care of it (ie. use
unbound workqueues).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-25 12:36             ` tj
@ 2020-11-26 23:30               ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-26 23:30 UTC (permalink / raw)
  To: tj
  Cc: Trond Myklebust, jiangshanlai, mhocko, peterz, juri.lelli, mingo,
	vincent.guittot, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2481 bytes --]

On Wed, Nov 25 2020, tj@kernel.org wrote:

> Hello,
>
> On Fri, Nov 20, 2020 at 10:23:44AM +1100, NeilBrown wrote:
>> On Mon, Nov 09 2020, tj@kernel.org wrote:
>> 
>> >                                                    Given that nothing on
>> > these types of workqueues can be latency sensitive
>> 
>> This caught my eye and it seems worth drilling in to.  There is no
>> mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
>> be saying there is an undocumented assumption that latency-sensitive
>> work items much not be scheduled on CM-workqueues.
>> Is that correct?
>
> Yeah, correct. Because they're all sharing execution concurrency, the
> latency consistency is likely a lot worse.
>
>> NFS writes are latency sensitive to a degree as increased latency per
>> request will hurt overall throughput.  Does this mean that handling
>> write-completion in a CM-wq is a poor choice?
>> Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
>> can be used to determine when WQ_HIGHPRI is appropriate?
>
> I don't think it'd need HIGHPRI but UNBOUND or CPU_INTENSIVE would make
> sense. I think the rule of the thumb is along the line of if you're worried
> about cpu consumption or latency, let the scheduler take care of it (ie. use
> unbound workqueues).

Thanks.
For nfsiod there are two contexts where it is used.

 In one context there is normally a thread waiting for the work item
 to complete.  It doesn't run the work in-line because the thread needs
 to abort if signaled, but the work needs to happen anyway so that the
 client and server remain in-sync.  In this case the fact that a
 application is waiting suggests that latency could be a problem.

 The other context is completing an async READ or WRITE.  I'm not sure
 if latency at this stage of the request will actually affect
 throughput, but we do need a WQ_MEM_RECLAIM wq for the WRITE at least.

 Keep both types of users on the same wq is simplest, so making it
  WQ_UNBOUND | WQ_MEM_RECLAIM
 is probably safest and would ensure that a cpu-intensive iput_final()
 doesn't interfere with other requests unduly.
 Quite a few other filesystems do use WQ_UNBOUND, often with
  WQ_MEM_RECLAIM, but it is not easy to do a like-for-like comparison.

 I might have a go at updating the workqueue documentation to provide
 some guidance on how to choose a workqueue and when certain flags are
 needed.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
       [not found]                 ` <20201126100646.1790-1-hdanton@sina.com>
@ 2020-11-26 23:44                   ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2020-11-26 23:44 UTC (permalink / raw)
  To: Hillf Danton; +Cc: TJ, Trond Myklebust, PeterZ, LKML

[-- Attachment #1: Type: text/plain, Size: 2547 bytes --]

On Thu, Nov 26 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 15:33:27 +1100 NeilBrown wrote:
>>
>>My first idea was to add WQ_CPU_INTENSIVE to the nfsiod workqueue, but
>>Trond wondered what was special about NFS.  Many filesystems call iput
>>from a workqueue, so finding a solution that helps them all is best.
>
> In terms of iput, I think we can splice WQ_CPU_INTENSIVE to
> WQ_MEM_RECLAIM.

I'm actually starting to think that WQ_CPU_INTENSIVE is a mistake.  If
you really have cpu-intensive work, you should be using WQ_UNBOUND.

It is possible that there might be work that is CPU intensive and which
must be run on a particular CPU - such as clearing out per-cpu lists of
recently freed slab allocations.  But I don't WQ_CPU_INTENSIVE is currently
used that way.

I cannot find *any* users of WQ_CPU_INTENSIVE which call cond_resched()
in the relevant work items.  And if the code doesn't call cond_resched()
(or similar), then it isn't really CPU-intensive.

>
>>I then suggested getting cond_resched() to do something more useful when
>>called by a worker.  PeterZ didn't like the overhead.
>>
>>Also, TJ seemed to be against auto-adjusting for cpu-intensive code,
>>preferring the right sort of workqueue to be chosen up front.
>
> Actually WQ_EVENTS_LONG sounds better than WQ_CPU_INTENSIVE, given that
> we have two events WQs with the same attr.

There is no WQ_EVENTS_LONG

>
> 	system_wq = alloc_workqueue("events", 0, 0);
> 	system_long_wq = alloc_workqueue("events_long", 0, 0);
>
> Then what are the boundaries we can draw in between WQ_MEM_RECLAIM,
> WQ_CPU_INTENSIVE and WQ_EVENTS_LONG?

I think system_long_wq is a design flaw.
Some code (mistakenly) schedules work on system_wq, calls
flush_workqueue(system_wq)) and expects that to complete reasonably quickly.
To ensure this can work, system_long_wq was created and work items that
might take a long time are encouraged to be run there.
Instead, the mistaken code should create its own work queue, schedule
work on that, and flush that queue.

Thanks,
NeilBrown

>
>
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4261,6 +4261,9 @@ struct workqueue_struct *alloc_workqueue
>  	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
>  		flags |= WQ_UNBOUND;
>  
> +	if (flags & (WQ_MEM_RECLAIM | WQ_UNBOUND) == WQ_MEM_RECLAIM)
> +		flags |= WQ_CPU_INTENSIVE;
> +
>  	/* allocate wq and format name */
>  	if (flags & WQ_UNBOUND)
>  		tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.
  2020-11-19 23:07               ` NeilBrown
@ 2020-12-02 20:20                 ` tj
  0 siblings, 0 replies; 18+ messages in thread
From: tj @ 2020-12-02 20:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: Hillf Danton, Trond Myklebust, peterz, linux-kernel, Lai Jiangshan

Hello, Neil.

(cc'ing Lai)

On Fri, Nov 20, 2020 at 10:07:56AM +1100, NeilBrown wrote:
> If the workqueue maintainers are unmovable in the position that a
> CM-workitem must not use excessive CPU ever, and so must not call
> cond_resched(), then I can take that back to the NFS maintainers and

Oh, that's not my position at all. I'm completely movable and am pretty sure
Lai would be too.

> negotiate different workqueue settings.  But as I've said, I think this
> is requiring the decision to be made in a place that is not well
> positioned to make it.

Very true. A lot of the current workqueue code is a result of gradual
evolution and has a lot of historical cruft. Things like CPU_INTENSIVE or
long_wq were introduced partly to facilitate existing users to CM workqueue
and definitely are silly interfaces as you mentioned in another reply.

I can see two directions:

1. Keeping things mostly as-is - leave the selection of the specific
   workqueue type to users. We can improve things by adding debug / sanity
   checks to surfaces issues more proactively, adding more documentation and
   cleaning up old cruft.

2. Make things properly dynamic. By default, let workqueue core decide what
   type of execution constraints are gonna be applied and dynamically adjust
   according to the behavior of specific workqueues or work items. e.g. it
   can start most work items CM per-cpu by default and then release them as
   unbound when its cpu consumption exceeds certain threshold.

#1 is easier. #2 is definitely more attractive long term but would need a
lot of careful work. I was suggesting towards #1 mostly because it's a safer
/ easier direction but if you wanna explore #2, please be my guest.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-12-02 20:21 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-09  2:54 [PATCH rfc] workqueue: honour cond_resched() more effectively NeilBrown
2020-11-09  7:50 ` Michal Hocko
2020-11-09  8:00 ` Peter Zijlstra
2020-11-09 13:50   ` Trond Myklebust
2020-11-09 14:01     ` tj
2020-11-09 14:11       ` Trond Myklebust
2020-11-09 16:10         ` tj
2020-11-17 22:16           ` NeilBrown
     [not found]           ` <20201118025820.307-1-hdanton@sina.com>
2020-11-18  5:11             ` NeilBrown
     [not found]             ` <20201118055108.358-1-hdanton@sina.com>
2020-11-19 23:07               ` NeilBrown
2020-12-02 20:20                 ` tj
     [not found]               ` <20201120025953.607-1-hdanton@sina.com>
2020-11-20  4:33                 ` NeilBrown
     [not found]                 ` <20201126100646.1790-1-hdanton@sina.com>
2020-11-26 23:44                   ` NeilBrown
2020-11-19 23:23           ` NeilBrown
2020-11-25 12:36             ` tj
2020-11-26 23:30               ` NeilBrown
2020-11-09 14:20     ` Peter Zijlstra
2020-11-10  2:26       ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).