* [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Dave Chinner @ 2010-09-07  7:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

Hi Tejun,

I've got a few concerns about the workqueue consolidation that has gone
into 2.6.36-rc and the way XFS has been using workqueues for
concurrency and deadlock avoidance in IO completion. To give you an
idea of the complex dependencies of the IO completion workqueues XFS
uses, I'll start by describing the major deadlock and latency
issues that they were crafted to avoid:

1. XFS used separate processing threads to prevent deadlocks between
data and log IO completion processing. The deadlock is as follows:

	- inode locked in transaction
	- transaction commit triggers writes to log buffer
	- log buffer write blocks because all log buffers are under
	  IO
	- log IO completion queued behind data IO completion
	- data IO completion blocks on inode lock held by
	  transaction blocked waiting for log IO.

This deadlock is avoided by placing log IO completion processing
in a separate workqueue so that it does not get blocked behind
data IO completion. XFS has used this separation of IO completion
processing since this deadlock was discovered in the late 90s on
Irix.
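
As a rough sketch of what that separation looks like (illustrative
only - the names below are stand-ins, not the actual xfs_buf.c
initialisation code):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>

    /*
     * Separate queues so that log IO completion never queues behind
     * data IO completion.
     */
    static struct workqueue_struct *data_io_wq;
    static struct workqueue_struct *log_io_wq;

    static int __init io_completion_wq_init(void)
    {
            data_io_wq = create_workqueue("xfsdatad");
            if (!data_io_wq)
                    return -ENOMEM;

            log_io_wq = create_workqueue("xfslogd");
            if (!log_io_wq) {
                    destroy_workqueue(data_io_wq);
                    return -ENOMEM;
            }
            return 0;
    }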

2. XFS used separate threads to avoid OOM deadlocks on unwritten
extent conversion. The deadlock is as follows:

	- data IO into unwritten extent completes
	- unwritten extent conversion starts a transaction
	- transaction requires memory allocation
	- data IO to complete cleaning of dirty pages (say issued by
	  kswapd) gets queued up behind unwritten extent conversion
	  processing
	- data IO completion stalls
	- system goes (B)OOM

XFS pushes unwritten extent conversion off into a separate
processing thread so that it doesn't block other data IO completion
needed to clean pages and hence avoids the OOM deadlock in these
cases.

3. Loop devices turn log IO into data IO on backing filesystem. This
leads to deadlocks because:

	- transaction on loop device holds inode locked, commit
	  blocks waiting for log IO. Log-IO-on-loop-device is turned
	  into data-IO-on-backing-device.
	- data-IO-on-loop-device completes, blocks taking inode lock
	  to update file size.
	- data-IO-on-backing-device for the log-IO-on-loop-device
	  gets queued behind blocked data-IO-on-loop-device
	  completion. Deadlocks loop device and IO completion
	  processing thread.

XFS has worked around this deadlock by using try-lock semantics for
the inode lock on data IO completion, and if it fails we back off by
sleeping for a jiffie and requeuing the work back to the tail of the
work queue. This works perfectly well for a dedicated set of
processing threads as the only impact is on XFS....
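
In code form, that workaround is roughly the following - a sketch of
the idea only; the ioend structure and helper functions here are made
up and are not the real xfs_aops.c code:

    #include <linux/delay.h>
    #include <linux/jiffies.h>
    #include <linux/kernel.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    struct io_end {                         /* stand-in for the XFS ioend */
            struct work_struct      io_work;
            /* inode pointer, new file size, etc. */
    };

    /* made-up helpers standing in for the real inode lock/size update code */
    bool ioend_trylock_inode(struct io_end *ioend);
    void ioend_update_isize(struct io_end *ioend);
    void ioend_unlock_inode(struct io_end *ioend);

    static struct workqueue_struct *data_io_wq;

    static void data_io_completion(struct work_struct *work)
    {
            struct io_end *ioend = container_of(work, struct io_end, io_work);

            if (!ioend_trylock_inode(ioend)) {
                    /*
                     * Inode lock is held by a blocked transaction: requeue
                     * to the tail of the completion queue and sleep for a
                     * jiffy (XFS's delay(1)) so we don't spin on the CPU.
                     */
                    queue_work(data_io_wq, &ioend->io_work);
                    msleep(jiffies_to_msecs(1));
                    return;
            }
            ioend_update_isize(ioend);
            ioend_unlock_inode(ioend);
    }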

4. XFS used separate threads to minimise log IO completion latency

Queuing log IO completion behind thousands of data and metadata IO
completions stalls the entire transaction subsystem until the log IO
completion is done. By having separate processing threads, log IO
completion processing is not delayed by having to first wait for
data/metadata IO completion processing. This delay can be
significant because XFS can have thousands of IOs in flight at a
time and IO completion processing backlog can extend to tens to
hundreds of thousands of objects that have to be processed every
second.

-----

So, with those descriptions out of the way, I've seen the following
problems in the past week or so:

1. I have had xfstests deadlock twice via #3, once on 2.6.36-rc2,
and once on 2.6.36-rc3. This is clearly a regression, but it is not
caused by any XFS changes since 2.6.35.  From what I can tell from
the backtraces I saw, it appears that the delaying of the
data IO completion processing by requeuing does not allow the
workqueue to move off the kworker thread. As a result, any work that
is still queued on that kworker queue appears to be starved, and
hence we never get the log workqueue processed that would allow data
IO completion processing to make progress.

2. I have circumstantial evidence that #4 is contributing to
several minute long livelocks. This is intertwined with memory
reclaim and lock contention, but fundamentally log IO completion
processing is being blocked for extremely long periods of time
waiting for a kworker thread to start processing them.  In this
case, I'm creating close to 100,000 inodes every second, and they
are getting written to disk. There is a burst of log IO every 3s or
so, so the log IO completion is getting queued behind at least tens
of thousands of inode IO completion work items. These work
completion items are generating lock contention which slows down
processing even further. The transaction subsystem stalls completely
while it waits for log IO completion to be processed. AFAICT, this
did not happen on 2.6.35.

This also seems to be correlated with memory starvation because we can't
free any memory until the log subsystem comes alive again and allows
all the pinned metadata and transaction structures to be freed (can
be tens to hundreds of megabytes of memory).

http://marc.info/?l=linux-kernel&m=128374586809180&w=2
http://marc.info/?l=linux-kernel&m=128380988716141&w=2

----

XFS has used workqueues for these "separate processing threads"
because they were a simple primitive that provided the separation and
isolation guarantees that XFS IO completion processing required.
That is, work deferred from one processing queue to another would
not block the original queue, and queues can be blocked
independently of the processing of other queues.

From what I can tell of the new kworker thread based implementation,
I cannot see how it provides the same work queue separation,
blocking and isolation guarantees. If we block during work
processing, then anything on the queue for that thread appears to be
blocked from processing until the work is unblocked.

Hence my main concern is that the new work queue implementation does
not provide the same semantics as the old workqueues, and as such
re-introduces a class of problems that will cause random hangs and
other bad behaviours on XFS filesystems under heavy load.

Hence, I'd like to know if my reading of the new workqueue code is
correct and:

	a) if not, understand why the workqueues are deadlocking;
	b) if so, understand what needs to be done to solve the
	deadlocks;
	c) understand how we can prioritise log IO completion
	processing over data, metadata and unwritten extent IO
	completion processing; and
	d) what can be done before 2.6.36 releases.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Tejun Heo @ 2010-09-07  9:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/07/2010 09:29 AM, Dave Chinner wrote:
> 1. I have had xfstests deadlock twice via #3, once on 2.6.36-rc2,
> and once on 2.6.36-rc3. This is clearly a regression, but it is not
> caused by any XFS changes since 2.6.35.  From what I can tell from
> the backtraces I saw, it appears that the delaying of the
> data IO completion processing by requeuing does not allow the
> workqueue to move off the kworker thread. As a result, any work that
> is still queued on that kworker queue appears to be starved, and
> hence we never get the log workqueue processed that would allow data
> IO completion processing to make progress.

This is puzzling.  Queueing order shouldn't have changed.  Maybe I
screwed up queueing order handling of delayed works.  Which workqueue
is this?  Or better, can you give me a small test case which
reproduces the problem?

> 2. I have circumstantial evidence that #4 is contributing to
> several minute long livelocks. This is intertwined with memory
> reclaim and lock contention, but fundamentally log IO completion
> processing is being blocked for extremely long periods of time
> waiting for a kworker thread to start processing them.  In this
> case, I'm creating close to 100,000 inodes every second, and they
> are getting written to disk. There is a burst of log IO every 3s or
> so, so the log IO completion is getting queued behind at least tens
> of thousands of inode IO completion work items. These work
> completion items are generating lock contention which slows down
> processing even further. The transaction subsystem stalls completely
> while it waits for log IO completion to be processed. AFAICT, this
> did not happen on 2.6.35.

Creating the workqueue for log completion w/ WQ_HIGHPRI should solve
this.
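
i.e. in the xfs init path, something along these lines (just a
sketch - the queue name and the max_active value should be whatever
xfs already uses):

    /* give log IO completion its own high priority queue */
    xfslogd_workqueue = alloc_workqueue("xfslogd", WQ_HIGHPRI, 1);
    if (!xfslogd_workqueue)
            return -ENOMEM;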

> XFS has used workqueues for these "separate processing threads"
> because they were a simple primitive that provided the separation and
> isolation guarantees that XFS IO completion processing required.
> That is, work deferred from one processing queue to another would
> not block the original queue, and queues can be blocked
> independently of the processing of other queues.

Semantically, that property is (or should be) preserved.  The
scheduling properties change though, and if the code has been depending on
more subtle aspects of work scheduling, it will definitely need to be
adjusted.

> From what I can tell of the new kworker thread based implementation,
> I cannot see how it provides the same work queue separation,
> blocking and isolation guarantees. If we block during work
> processing, then anything on the queue for that thread appears to be
> blocked from processing until the work is unblocked.

I fail to follow here.  Can you elaborate a bit?

> Hence my main concern is that the new work queue implementation does
> not provide the same semantics as the old workqueues, and as such
> re-introduces a class of problems that will cause random hangs and
> other bad behaviours on XFS filesystems under heavy load.

I don't think it has that level of fundamental design flaw.

> Hence, I'd like to know if my reading of the new workqueue code is
> correct and:

Probably not.

> 	a) if not, understand why the workqueues are deadlocking;

Yeah, let's track this one down.

> 	c) understand how we can prioritise log IO completion
> 	processing over data, metadata and unwritten extent IO
> 	completion processing; and

As I wrote above, WQ_HIGHPRI is there for you.

> 	d) what can be done before 2.6.36 releases.

To preserve the original behavior, create_workqueue() and friends
create workqueues with @max_active of 1, which is pretty silly and bad
for latency.  Aside from fixing the above problems, it would be nice
to find out better values for @max_active for xfs workqueues.  For
most users, using the pretty high default value is okay as they
usually have much stricter constraint elsewhere (like limited number
of work_struct), but last time I tried xfs allocated work_structs and
fired them as fast as it could, so it looked like it definitely needed
some kind of reasonable capping value.

Thanks.

-- 
tejun

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Dave Chinner @ 2010-09-07 10:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Tue, Sep 07, 2010 at 11:04:59AM +0200, Tejun Heo wrote:
> Hello,
> 
> On 09/07/2010 09:29 AM, Dave Chinner wrote:
> > 1. I have had xfstests deadlock twice via #3, once on 2.6.36-rc2,
> > and once on 2.6.36-rc3. This is clearly a regression, but it is not
> > caused by any XFS changes since 2.6.35.  From what I can tell from
> > the backtraces I saw, it appears that the delaying of the
> > data IO completion processing by requeuing does not allow the
> > workqueue to move off the kworker thread. As a result, any work that
> > is still queued on that kworker queue appears to be starved, and
> > hence we never get the log workqueue processed that would allow data
> > IO completion processing to make progress.
> 
> This is puzzling.  Queueing order shouldn't have changed.  Maybe I
> screwed up queueing order handling of delayed works.  Which workqueue
> is this?

The three workqueues are initialised in
fs/xfs/linux-2.6/xfs_buf.c::xfs_buf_init().

They do not use delayed works, the requeuing of interest here
occurs in .../xfs_aops.c::xfs_end_io via
.../xfs_aops.c:xfs_finish_ioend() onto the xfsdatad_workqueue.
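
Paraphrased (a sketch of the shape of the code rather than the
literal source - the field and constant names are approximations),
the queueing side looks like:

    #include <linux/workqueue.h>

    struct xfs_ioend_sketch {
            struct work_struct      io_work;
            unsigned int            io_type;        /* data/unwritten/... */
    };

    #define IO_TYPE_UNWRITTEN       1               /* approximation */

    extern struct workqueue_struct *xfsdatad_workqueue;
    extern struct workqueue_struct *xfsconvertd_workqueue;

    /* route the completion to the right queue with a plain queue_work() -
     * no delayed works involved */
    static void finish_ioend_sketch(struct xfs_ioend_sketch *ioend)
    {
            if (ioend->io_type == IO_TYPE_UNWRITTEN)
                    queue_work(xfsconvertd_workqueue, &ioend->io_work);
            else
                    queue_work(xfsdatad_workqueue, &ioend->io_work);
    }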

> Or better, can you give me a small test case which
> reproduces the problem?

I've seen it twice in about 100 xfstests runs in the past week.
I can't remember the test that tripped over it - 078 I think did
once, and it was a different test the first time - only some tests
use the loopback device. We've never had a reliable reproducer
because of the complexity of the race condition that leads to
the deadlock....

> > 2. I have circumstantial evidence that #4 is contributing to
> > several minute long livelocks. This is intertwined with memory
> > reclaim and lock contention, but fundamentally log IO completion
> > processing is being blocked for extremely long periods of time
> > waiting for a kworker thread to start processing them.  In this
> > case, I'm creating close to 100,000 inodes every second, and they
> > are getting written to disk. There is a burst of log IO every 3s or
> > so, so the log IO completion is getting queued behind at least tens
> > of thousands of inode IO completion work items. These work
> > completion items are generating lock contention which slows down
> > processing even further. The transaction subsystem stalls completely
> > while it waits for log IO completion to be processed. AFAICT, this
> > did not happen on 2.6.35.
> 
> Creating the workqueue for log completion w/ WQ_HIGHPRI should solve
> this.

So what you are saying is that we need to change the workqueue
creation interface to use alloc_workqueue() with some special set of
flags to make the workqueue behave as we want, and that each
workqueue will require a different configuration?  Where can I find
the interface documentation that describes how the different flags
affect the workqueue behaviour?

> > XFS has used workqueues for these "separate processing threads"
> > because they were a simple primitive that provided the separation and
> > isolation guarantees that XFS IO completion processing required.
> > That is, work deferred from one processing queue to another would
> > not block the original queue, and queues can be blocked
> > independently of the processing of other queues.
> 
> Semantically, that property is (or should be) preserved.  The
> scheduling properties change though, and if the code has been depending on
> more subtle aspects of work scheduling, it will definitely need to be
> adjusted.

Which means?

> > From what I can tell of the new kworker thread based implementation,
> > I cannot see how it provides the same work queue separation,
> > blocking and isolation guarantees. If we block during work
> > processing, then anything on the queue for that thread appears to be
> > blocked from processing until the work is unblocked.
> 
> I fail to follow here.  Can you elaborate a bit?

Here's what the work function does:

 -> run @work
	-> trylock returned EAGAIN
	-> queue_work(@work)
	-> delay(1); // to stop workqueue spinning chewing up CPU

So basically I'm seeing a kworker thread blocked in delay(1) - it
appears to be making progress by processing the same work item over and over
again with delay(1) calls between them. The queued log IO completion
is not being processed, even though it is sitting in a queue
waiting...

> > Hence my main concern is that the new work queue implementation does
> > not provide the same semantics as the old workqueues, and as such
> > re-introduces a class of problems that will cause random hangs and
> > other bad behaviours on XFS filesystems under heavy load.
> 
> I don't think it has that level of fundamental design flaw.
> 
> > Hence, I'd like to know if my reading of the new workqueue code is
> > correct and:
> 
> Probably not.
> 
> > 	a) if not, understand why the workqueues are deadlocking;
> 
> Yeah, let's track this one down.
> 
> > 	c) understand how we can prioritise log IO completion
> > 	processing over data, metadata and unwritten extent IO
> > 	completion processing; and
> 
> As I wrote above, WQ_HIGHPRI is there for you.
> 
> > 	d) what can be done before 2.6.36 releases.
> 
> To preserve the original behavior, create_workqueue() and friends
> create workqueues with @max_active of 1, which is pretty silly and bad
> for latency.  Aside from fixing the above problems, it would be nice
> to find out better values for @max_active for xfs workqueues.  For

Um, call me clueless, but WTF does max_active actually do? It's not
described anywhere, it's clamped to magic numbers ("I really like
512"), etc. AFAICT, it determines whether the work is queued as
delayed work or whether it is put on an active worklist straight
away. However, the lack of documentation describing the behaviour of
the workqueues and why I might want to set a value other than 1 or
the default makes it pretty hard to work out anything for sure...

> most users, using the pretty high default value is okay as they
> usually have much stricter constraint elsewhere (like limited number
> of work_struct), but last time I tried xfs allocated work_structs and
> fired them as fast as it could, so it looked like it definitely needed
> some kind of resasonable capping value.

What part of XFS fired work structures as fast as it could? Queuing
rates are determined completely by the IO completion rates...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Tejun Heo @ 2010-09-07 10:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/07/2010 12:01 PM, Dave Chinner wrote:
> The three workqueues are initialised in
> fs/xfs/linux-2.6/xfs_buf.c::xfs_buf_init().
>
> They do not use delayed works, the requeuing of interest here
> occurs in .../xfs_aops.c::xfs_end_io via
> .../xfs_aops.c:xfs_finish_ioend() onto the xfsdatad_workqueue

Oh, I was talking about cwq->delayed_works, which is a mechanism
used to enforce max_active among other things.

>> Or better, can you give me a small test case which
>> reproduces the problem?
> 
> I've seen it twice in about 100 xfstests runs in the past week.
> I can't remember the test that tripped over it - 078 I think did
> once, and it was a different test the first time - only some tests
> use the loopback device. We've never had a reliable reproducer
> because of the complexity of the race condition that leads to
> the deadlock....

I see.

>> Creating the workqueue for log completion w/ WQ_HIGHPRI should solve
>> this.
> 
> So what you are saying is that we need to change the workqueue
> creation interface to use alloc_workqueue() with some special set of
> flags to make the workqueue behave as we want, and that each
> workqueue will require a different configuration?  Where can I find
> the interface documentation that describes how the different flags
> affect the workqueue behaviour?

Heh, sorry about that.  I'm writing it now.  The plan is to audit all
the create_*workqueue() users and replace them with alloc_workqueue()
w/ appropriate parameters.  Most of them would be fine with the
default set of parameters but there are a few which would need some
adjustments.
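
The conversion itself is mechanical, something like the following
sketch - the actual flags and max_active for each xfs queue are
exactly the values we need to figure out:

    /* before: old interface, implicit max_active of 1 */
    xfslogd_workqueue = create_workqueue("xfslogd");

    /* after: explicit flags and concurrency level (example values only) */
    xfslogd_workqueue = alloc_workqueue("xfslogd", WQ_HIGHPRI, 1);
    xfsdatad_workqueue = alloc_workqueue("xfsdatad", 0, 4);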

>> I fail to follow here.  Can you elaborate a bit?
> 
> Here's what the work function does:
> 
>  -> run @work
> 	-> trylock returned EAGAIN
> 	-> queue_work(@work)
> 	-> delay(1); // to stop workqueue spinning chewing up CPU
>
> So basically I'm seeing a kworker thread blocked in delay(1) - it
> appears to be making progress by processing the same work item over and over
> again with delay(1) calls between them. The queued log IO completion
> is not being processed, even though it is sitting in a queue
> waiting...

Can you please help me a bit more?  Are you saying the following?

Work w0 starts execution on wq0.  w0 tries locking but fails.  Does
delay(1) and requeues itself on wq0 hoping another work w1 would be
queued on wq0 which will release the lock.  The requeueing should make
w0 queued and executed after w1, but instead w1 never gets executed
while w0 hogs the CPU constantly by re-executing itself.  Also, how
does delay(1) help with chewing up CPU?  Are you talking about
avoiding constant lock/unlock ops starving other lockers?  In such
case, wouldn't cpu_relax() make more sense?

>> To preserve the original behavior, create_workqueue() and friends
>> create workqueues with @max_active of 1, which is pretty silly and bad
>> for latency.  Aside from fixing the above problems, it would be nice
>> to find out better values for @max_active for xfs workqueues.  For
> 
> Um, call me clueless, but WTF does max_active actually do?

It regulates the maximum level of per-cpu concurrency, i.e. if a
workqueue has @max_active of 16, up to 16 works on the workqueue may
execute concurrently per-cpu.

> It's not described anywhere, it's clamped to magic numbers ("I
> really like 512"), etc.

Yeap, that's just a random safety value I chose.  In most cases, the
level of concurrency is limited by the number of work_struct, so the
default limit is there just to survive complete runaway cases.

>> most users, using the pretty high default value is okay as they
>> usually have much stricter constraint elsewhere (like limited number
>> of work_struct), but last time I tried xfs allocated work_structs and
>> fired them as fast as it could, so it looked like it definitely needed
>> some kind of reasonable capping value.
> 
> What part of XFS fired work structures as fast as it could? Queuing
> rates are determined completely by the IO completion rates...

I don't remember but once I increased maximum concurrency for every
workqueue (the limit was 128 or something) and xfs pretty quickly hit
the concurrency limit.  IIRC, there was a function which allocates
work_struct and schedules it.  I'll look through the emails.

Thanks.

-- 
tejun

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Tejun Heo @ 2010-09-07 12:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

On 09/07/2010 12:35 PM, Tejun Heo wrote:
> Can you please help me a bit more?  Are you saying the following?
> 
> Work w0 starts execution on wq0.  w0 tries locking but fails.  Does
> delay(1) and requeues itself on wq0 hoping another work w1 would be
> queued on wq0 which will release the lock.  The requeueing should make
> w0 queued and executed after w1, but instead w1 never gets executed
> while w0 hogs the CPU constantly by re-executing itself.  Also, how
> does delay(1) help with chewing up CPU?  Are you talking about
> avoiding constant lock/unlock ops starving other lockers?  In such
> case, wouldn't cpu_relax() make more sense?

Ooh, almost forgot.  There was an nr_active underflow bug in workqueue
code which could lead to malfunctioning max_active regulation and
problems during queue freezing, so you could be hitting that too.  I
sent out a pull request some time ago but it hasn't been pulled into
mainline yet.  Can you please pull from the following branch and add
WQ_HIGHPRI as discussed before and see whether the problem is still
reproducible?  And if the problem is reproducible, can you please
trigger sysrq thread dump and attach it?

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-linus

Thanks.

-- 
tejun

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
From: Dave Chinner @ 2010-09-07 12:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Tue, Sep 07, 2010 at 12:35:46PM +0200, Tejun Heo wrote:
> On 09/07/2010 12:01 PM, Dave Chinner wrote:
> >> Creating the workqueue for log completion w/ WQ_HIGHPRI should solve
> >> this.
> > 
> > So what you are saying is that we need to change the workqueue
> > creation interface to use alloc_workqueue() with some special set of
> > flags to make the workqueue behave as we want, and that each
> > workqueue will require a different configuration?  Where can I find
> > the interface documentation that describes how the different flags
> > affect the workqueue behaviour?
> 
> Heh, sorry about that.  I'm writing it now.  The plan is to audit all
> the create_*workqueue() users and replace them with alloc_workqueue()
> w/ appropriate parameters.  Most of them would be fine with the
> default set of parameters but there are a few which would need some
> adjustments.

Ok. Do you have an advance draft of the docco I can read?

> >> I fail to follow here.  Can you elaborate a bit?
> > 
> > Here's what the work function does:
> > 
> >  -> run @work
> > 	-> trylock returned EAGAIN
> > 	-> queue_work(@work)
> > 	-> delay(1); // to stop workqueue spinning chewing up CPU
> >
> > So basically I'm seeing a kworker thread blocked in delay(1) - it
> > appears to be making progress by processing the same work item over and over
> > again with delay(1) calls between them. The queued log IO completion
> > is not being processed, even though it is sitting in a queue
> > waiting...
> 
> Can you please help me a bit more?  Are you saying the following?
> 
> Work w0 starts execution on wq0.  w0 tries locking but fails.  Does
> delay(1) and requeues itself on wq0 hoping another work w1 would be
> queued on wq0 which will release the lock.  The requeueing should make
> w0 queued and executed after w1, but instead w1 never gets executed
> while w0 hogs the CPU constantly by re-executing itself.

Almost. What happens is that there is a queue of data IO
completions on q0, say w1...wN where wX is in the middle of the
queue. wX requires lock A, but lock A is held by a transaction
commit that is blocked by IO completion t1 on q1. The dependency
chain we then have is:

	wX on q0 -> lock A -> t1 on q1

To prevent wX from blocking q0, when lock A is not gained, we
requeue wX to the tail of q0 such that the queue is now wX+1..wN,wX.
This means that wX will not block completion processing of data IO.
If wX is the only work on q0, then to stop the work queue from
spinning (processing wX, queueing wX, processing wX....) there is a
delay(1) call to allow some time for other IO completions to occur
before trying to process wX again.

At some point, q1 is processed and t1 is run and lock A
released. Once this happens, wX will gain lock A and finish the
completion and be freed.

The issue I appear to be seeing is that while q0 is doing:

	wX -> requeue on q0 -> delay(1) -> wX -> requeue q0 -> wX

q1 which contains t1 is never getting processed, and hence the q0/wX
loop is never getting broken.

> Also, how
> does delay(1) help with chewing up CPU?  Are you talking about
> avoiding constant lock/unlock ops starving other lockers?  In such
> case, wouldn't cpu_relax() make more sense?

Basically delay(1) is used in many places in XFS as a "backoff and
retry after a short period of time" mechanism in places where
blocking would lead to deadlock or we need a state change to occur
before retrying the operation that would have deadlocked. If we
don't put a backoff in, then we simply burn CPU until the condition
clears.

In the case of the data IO completion workqueue processing, the CPU
burn occurred when the only item on the workqueue was the inode that
we could not lock.  Hence the backoff. It's not a great solution,
but it's the only one that could be used without changing everything
to use delayed works and hence suffer the associated structure bloat
for what is a rare corner case....
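
For reference, the delayed-work alternative would look something like
the sketch below; the cost is the timer embedded in every ioend just
to handle this rare retry case:

    #include <linux/workqueue.h>

    struct io_end_delayed {                 /* stand-in for the XFS ioend */
            struct delayed_work     io_work;        /* work_struct + timer_list */
            /* ... */
    };

    static struct workqueue_struct *data_io_wq;

    /* retry the completion in ~1 jiffy without blocking a kworker thread */
    static void requeue_ioend_later(struct io_end_delayed *ioend)
    {
            queue_delayed_work(data_io_wq, &ioend->io_work, 1);
    }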

As I said, this is fine if the only thing that is delayed is data IO
completion processing for XFS. When it is a generic kworker thread,
it has much greater impact, I think....

> >> To preserve the original behavior, create_workqueue() and friends
> >> create workqueues with @max_active of 1, which is pretty silly and bad
> >> for latency.  Aside from fixing the above problems, it would be nice
> >> to find out better values for @max_active for xfs workqueues.  For
> > 
> > Um, call me clueless, but WTF does max_active actually do?
> 
> It regulates the maximum level of per-cpu concurrency, i.e. if a
> workqueue has @max_active of 16, up to 16 works on the workqueue may
> execute concurrently per-cpu.
> 
> > It's not described anywhere, it's clamped to magic numbers ("I
> > really like 512"), etc.
> 
> Yeap, that's just a random safety value I chose.  In most cases, the
> level of concurrency is limited by the number of work_struct, so the
> default limit is there just to survive complete runaway cases.

Ok, makes sense now. I wish that was already in a comment in the code ;)

> >> most users, using the pretty high default value is okay as they
> >> usually have much stricter constraint elsewhere (like limited number
> >> of work_struct), but last time I tried xfs allocated work_structs and
> >> fired them as fast as it could, so it looked like it definitely needed
> >> some kind of reasonable capping value.
> > 
> > What part of XFS fired work structures as fast as it could? Queuing
> > rates are determined completely by the IO completion rates...
> 
> I don't remember but once I increased maximum concurrency for every
> workqueue (the limit was 128 or something) and xfs pretty quickly hit
> the concurrency limit.  IIRC, there was a function which allocates
> work_struct and schedules it.  I'll look through the emails.

How do you get concurrency requirements of 128 when you only have a
small number of CPUs?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-07 12:26         ` Tejun Heo
@ 2010-09-07 13:02           ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-07 13:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Tue, Sep 07, 2010 at 02:26:54PM +0200, Tejun Heo wrote:
> On 09/07/2010 12:35 PM, Tejun Heo wrote:
> > Can you please help me a bit more?  Are you saying the following?
> > 
> > Work w0 starts execution on wq0.  w0 tries locking but fails.  Does
> > delay(1) and requeues itself on wq0 hoping another work w1 would be
> > queued on wq0 which will release the lock.  The requeueing should make
> > w0 queued and executed after w1, but instead w1 never gets executed
> > while w0 hogs the CPU constantly by re-executing itself.  Also, how
> > does delay(1) help with chewing up CPU?  Are you talking about
> > avoiding constant lock/unlock ops starving other lockers?  In such
> > case, wouldn't cpu_relax() make more sense?
> 
> Ooh, almost forgot.  There was nr_active underflow bug in workqueue
> code which could lead to malfunctioning max_active regulation and
> problems during queue freezing, so you could be hitting that too.  I
> sent out pull request some time ago but hasn't been pulled into
> mainline yet.  Can you please pull from the following branch and add
> WQ_HIGHPRI as discussed before and see whether the problem is still
> reproducible?

I'm currently running with the WQ_HIGHPRI flag. I only change one
thing at a time so I can tell what caused the change in behaviour...

> And if the problem is reproducible, can you please
> trigger sysrq thread dump and attach it?

Well, most of the time the system is 100% unresponsive when the
livelock occurs, so I'll be lucky to get anything at all....

>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-linus

I'll try that next if the problem still persists.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-07 12:48         ` Dave Chinner
@ 2010-09-07 15:39           ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-07 15:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/07/2010 02:48 PM, Dave Chinner wrote:
> On Tue, Sep 07, 2010 at 12:35:46PM +0200, Tejun Heo wrote:
>> Heh, sorry about that.  I'm writing it now.  The plan is to audit all
>> the create_*workqueue() users and replace them with alloc_workqueue()
>> w/ appropriate parameters.  Most of them would be fine with the
>> default set of parameters but there are a few which would need some
>> adjustments.
> 
> Ok. Do you have an advance draft of the docco I can read?

Including a half-finished one at the end of this mail.

> Almost. What happens is that there is a queue of data IO
> completions on q0, say w1...wN where wX is in the middle of the
> queue. wX requires lock A, but lock A is held by a transaction
> commit that is blocked by IO completion t1 on q1. The dependency
> chain we then have is:
> 
> 	wX on q0 -> lock A -> t1 on q1
> 
> To prevent wX from blocking q0, when lock A is not gained, we
> requeue wX to the tail of q0 such that the queue is now wX+1..wN,wX.
> This means that wX will not block completion processing of data IO.
> If wX is the only work on q0, then to stop the work queue from
> spinning processing wX, queueing wX, processing wX.... there is a
> delay(1) call to allow some time for other IOs to complete
> before trying to process wX again.
> 
> At some point, q1 is processed and t1 is run and lock A
> released. Once this happens, wX will gain lock A and finish the
> completion and be freed.
> 
> The issue I appear to be seeing is that while q0 is doing:
> 
> 	wX -> requeue on q0 -> delay(1) -> wX -> requeue q0 -> wX
> 
> q1 which contains t1 is never getting processed, and hence the q0/wX
> loop is never getting broken.
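
For readers following along, a minimal sketch of the requeue-and-backoff
pattern being described; the structure and function names below are
hypothetical stand-ins, not actual XFS code:

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

static struct workqueue_struct *data_ioend_wq;	/* stands in for q0 */

struct ioend {				/* hypothetical completion item (wX) */
	struct work_struct	work;
	struct mutex		*ilock;	/* stands in for lock A */
};

static void ioend_worker(struct work_struct *work)
{
	struct ioend *io = container_of(work, struct ioend, work);

	if (!mutex_trylock(io->ilock)) {
		/*
		 * Lock A is held by the blocked transaction: requeue
		 * ourselves at the tail of q0 so wX+1..wN can run...
		 */
		queue_work(data_ioend_wq, &io->work);
		/* ...then back off for one tick (the delay(1)) so we
		 * don't spin if we are the only item on the queue. */
		schedule_timeout_uninterruptible(1);
		return;
	}

	/* got lock A: finish the completion here, then drop the lock */
	mutex_unlock(io->ilock);
}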

I see.  The use case itself shouldn't be problematic at all for cmwq
(sans bugs of course).  In the other reply, you said "the system is
100% unresponsive when the livelock occurs", which is kind of
puzzling.  It isn't really a livelock.  There's no busy loop.  sysrq
should work unless something else is going horribly wrong.  Are you
running with hangcheck timer enabled?  Anyways, please try with the
fixes from wq#for-linus branch pulled and gather as much information
as possible once such hang occurs.

>> Also, how does delay(1) help with chewing up CPU?  Are you talking
>> about avoiding constant lock/unlock ops starving other lockers?  In
>> such case, wouldn't cpu_relax() make more sense?
> 
> Basically delay(1) is used in many places in XFS as a "backoff and
> retry after a short period of time" mechanism in places where
> blocking would lead to deadlock or we need a state change to occur
> before retrying the operation that would have deadlocked. If we
> don't put a backoff in, then we simply burn CPU until the condition
> clears.
> 
> In the case of the data IO completion workqueue processing, the CPU
> burn occurred when the only item on the workqueue was the inode that
> we could not lock.  Hence the backoff. It's not a great solution,
> but it's the only one that could be used without changing everything
> to use delayed works and hence suffer the associated structure bloat
> for what is a rare corner case....

Hmm... The point where I'm confused is that *delay()'s are busy waits.
They burn CPU cycles.  I suppose you're referring to *sleep()'s,
right?

> As I said, this is fine if the only thing that is delayed is data IO
> completion processing for XFS. When it is a generic kworker thread,
> it has much greater impact, I think....

As I wrote above, the new implementation won't have a problem with such
usage, well, at least, not by design.  :-)

>> Yeap, that's just a random safety value I chose.  In most cases, the
>> level of concurrency is limited by the number of work_struct, so the
>> default limit is there just to survive complete runaway cases.
> 
> Ok, makes sense now. I wish that was already in a comment in the code ;)

Will add a reference to the documentation.

>> I don't remember but once I increased maximum concurrency for every
>> workqueue (the limit was 128 or something) and xfs pretty quickly hit
>> the concurrency limit.  IIRC, there was a function which allocates
>> work_struct and schedules it.  I'll look through the emails.
> 
> How do you get concurrency requirements of 128 when you only have a
> small number of CPUs?

Probably I have overloaded the term 'concurrency' too much.  In this
case, I meant the number of workers assigned to work items of the wq.
If you fire off N work items which sleep at the same time, cmwq will
eventually try to create N workers as each previous worker goes to
sleep so that the CPU doesn't sit idle while there are work items to
process as long as N < @wq->nr_active.
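
A tiny sketch of that behaviour (all names made up for illustration): if
each queued item sleeps, the gcwq may bring up another worker for the
next item instead of serialising everything behind the first one.

#include <linux/delay.h>
#include <linux/workqueue.h>

#define N_ITEMS	8

static struct work_struct items[N_ITEMS];

static void sleepy_work(struct work_struct *work)
{
	/* Pretend to block on IO; while this worker sleeps, the gcwq
	 * can spawn another worker to run the next queued item. */
	msleep(100);
}

static void fire_items(void)
{
	int i;

	for (i = 0; i < N_ITEMS; i++) {
		INIT_WORK(&items[i], sleepy_work);
		schedule_work(&items[i]);
	}
}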

Documentation follows.


Concurrency Managed Workqueue

September, 2010		Tejun Heo <tj@kernel.org>

CONTENTS

1. Overview
2. The Design
3. Workqueue Attributes
4. Examples and Pitfalls

1. Overview

There are many cases where an execution context is needed and there
already are several mechanisms for them.  The most commonly used one
is workqueue (wq).  In the original workqueue implementation, a multi
threaded (MT) wq had one thread per CPU and a single threaded (ST) wq
had one thread system-wide.  The kernel grew quite a number of wq
users and with the number of CPU cores continuously rising, some
systems saturated the default 32k PID space just booting up.

Although MT wq ended up spending a lot of resources, the level of
concurrency provided was unsatisfactory.  The limitation was common to
both ST and MT wq although it was less severe on MT ones.  Worker
pools of wq were separate from each other.  A MT wq provided one
execution context per CPU while a ST wq one for the whole system,
which led to various problems including proneness to deadlocks around
the single execution context and using ST wq when MT wq would fit
better to avoid creating a number of mostly idle threads.

The tension between the provided level of concurrency and resource
usage forced its users to make unnecessary tradeoffs like libata
choosing to use ST wq for polling PIOs and accepting an unnecessary
limitation that no two polling PIOs can progress at the same time.  As
MT wq don't provide much better concurrency, users which require
higher level of concurrency, like async or fscache, ended up having to
implement their own thread pool.

Concurrency Managed Workqueue (cmwq) extends wq with focus on the
following goals.

* Maintain compatibility with the original workqueue API while
  removing the above mentioned limitations and providing flexible
  level of concurrency on demand.

* Provide single unified worker pool per CPU which can be shared by
  all wq users.  The worker pool and level of concurrency should be
  regulated automatically so that the API users don't need to worry
  about such details.

* Use what's necessary and allocate resources lazily on demand while
  guaranteeing forward progress where necessary.


2. The Design

There's a single global cwq (gcwq) for each possible CPU and a pseudo
unbound CPU which actually serves out execution contexts.
cpu_workqueue's (cwq) of each wq are mostly simple frontends to the
associated gcwq.  When a work is queued, it's queued to the common
worklist of the target gcwq.  Each gcwq has its own pool of workers
which is used to process its unified worklist.

For any worker pool, managing the concurrency level (how many
execution contexts are active) is an important issue.  cmwq tries to
keep the concurrency at minimal but sufficient level.

For each gcwq bound to a specific CPU, concurrency management is
implemented by hooking into the scheduler.  The gcwq is notified
whenever a busy worker wakes up or sleeps and keeps track of the level
of concurrency.  Generally, works aren't supposed to be CPU cycle hogs
and maintaining just enough concurrency to prevent work processing
from stalling is optimal.  As long as there is one or more workers
running or ready to run on the CPU, no new worker is scheduled, but,
when the last running worker blocks, the gcwq immediately schedules a
new worker so that the CPU doesn't sit idle while there are pending
works.

This allows using a minimal number of workers without losing execution
bandwidth.  Keeping idle workers around costs nothing beyond the
memory space for kthreads, so cmwq holds onto idle ones for a while
before killing them.

For unbound wq, the above concurrency management doesn't apply and the
unbound gcwq tries to increase the number of execution contexts until
all work items are executing.  The responsibility of regulating
concurrency level is on the users.  There is also a flag to mark bound
wq to ignore the default concurrency management.  Please refer to the
alloc_workqueue() section for details.

The forward progress guarantee relies on workers being created when
more execution contexts are necessary, which in turn is guaranteed
through the use of emergency workers.  All wq which might be used in
memory allocation path are required to have an emergency worker which
is reserved for execution of each such wq so that memory allocation
for worker creation doesn't deadlock waiting for execution contexts to
free up.


3. Workqueue Attributes

alloc_workqueue() is the function to allocate a wq.  The original
create_*workqueue() functions are deprecated and scheduled for
removal.  alloc_workqueue() takes three arguments - @name, @flags and
@max_active.  @name is the name of the wq and also used as the name of
the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes.  @flags
and @max_active control how work items are assigned execution
resources, scheduled and executed.

@flags:

 WQ_NON_REENTRANT

	By default, a wq guarantees non-reentrance only on the same
	CPU.  A work may not be executed concurrently on the same CPU
	by multiple workers but is allowed to be executed concurrently
	on multiple CPUs.  This flag makes sure non-reentrance is
	enforced across all the CPUs.  Work items queued to a
	non-reentrant wq are guaranteed to be executed by a single
	worker system-wide at any given time.

 WQ_UNBOUND

	Work items queued to an unbound wq are served by a special
	gcwq which hosts workers which are not bound to any specific
	CPU.  This makes the wq behave as a simple execution context
	provider without concurrency management.  Unbound workers will
	be created as long as there are pending work items and
	resources are available.  This sacrifices locality but is
	useful for the following cases.

	* High fluctuation in the concurrency requirement is expected
          and using bound wq may end up creating large number of
          mostly unused workers across different CPUs as the issuer
          hops through different CPUs.

	* Long running CPU intensive workloads which can be better
          managed by the system scheduler.

 WQ_FREEZEABLE

	A freezeable wq participates in the freeze phase during
	suspend operations.  Work items on the wq are drained and no
	new work item starts execution until thawed.

 WQ_RESCUER

	All wq which might be used in the allocation path _MUST_ have
	this flag set.  This reserves one worker exclusively for the
	execution of this wq under memory pressure.

 WQ_HIGHPRI

	Work items of a highpri wq are queued at the head of the
	worklist of the target gcwq and start execution regardless of
	concurrency level.  In other words, highpri work items will
	always start execution as soon as execution resource is
	available.

	Ordering among highpri work items is preserved - a highpri
	work item queued after another highpri work item will start
	execution after the earlier highpri work item starts.

	Although highpri work items are not held back by other
	runnable work items, they still contribute to the concurrency
	level.  Highpri work items in runnable state will prevent
	non-highpri work items from starting execution.

	This flag is meaningless for unbound wq.

 WQ_CPU_INTENSIVE

	Work items of a CPU intensive wq do not contribute to the
	concurrency level.  Runnable CPU intensive work items will not
	prevent other work items from starting execution.  This is
	useful for per-cpu work items which are expected to hog CPU
	cycles so they are solely regulated by the system scheduler.

	Although CPU intensive work items don't contribute to the
	concurrency level, start of their executions is still
	regulated by the concurrency management and runnable
	non-CPU-intensive work items can delay execution of CPU
	intensive work items.

	This flag is meaningless for unbound wq.

 WQ_HIGHPRI | WQ_CPU_INTENSIVE

	This combination makes the wq avoid interaction with
	concurrency management completely and behave as simple per-CPU
	execution context provider.  Work items queued on a highpri
	CPU-intensive wq start execution as soon as resources are
	available and don't affect execution of other work items.

@max_active:

It determines the maximum number of execution contexts per CPU which
can be assigned to the work items of the wq.  For example, with
@max_active of 16, at most 16 work items can execute per CPU at any
given time.

Currently the maximum value for @max_active is 512 and the default
value used when 0 is specified is 256.  Both are chosen sufficiently
high such that the available concurrency level is not the limiting
factor while providing protection in runaway cases.

The maximum number of active work items of a wq is usually regulated
by the users of the wq, more specifically, by how many work items the
users may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on the strict execution ordering of ST wq.  The
combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
behavior.  Work items on such wq are always queued to the unbound gcwq
and only one work item can be active at any given time thus achieving
the same ordering property as ST wq.


4. Examples and Pitfalls
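
A few illustrative allocation calls matching the attributes described
above; the workqueue names and values are made up rather than taken
from an in-tree user, and error unwinding is omitted for brevity.

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *hp_wq, *rescue_wq, *ordered_wq;

static int example_alloc_wqs(void)
{
	/* Latency sensitive: work items jump to the head of the gcwq. */
	hp_wq = alloc_workqueue("example_hp", WQ_HIGHPRI, 0);

	/* Used on the memory allocation/reclaim path: reserve a rescuer. */
	rescue_wq = alloc_workqueue("example_rescue", WQ_RESCUER, 0);

	/* Strict ST-style ordering: unbound with at most one active item. */
	ordered_wq = alloc_workqueue("example_ordered", WQ_UNBOUND, 1);

	if (!hp_wq || !rescue_wq || !ordered_wq)
		return -ENOMEM;	/* freeing of partial allocations omitted */
	return 0;
}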

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-07 15:39           ` Tejun Heo
@ 2010-09-08  7:34             ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-08  7:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Tue, Sep 07, 2010 at 05:39:48PM +0200, Tejun Heo wrote:
> On 09/07/2010 02:48 PM, Dave Chinner wrote:
> > On Tue, Sep 07, 2010 at 12:35:46PM +0200, Tejun Heo wrote:
> > Almost. What happens is that there is a queue of data IO
> > completions on q0, say w1...wN where wX is in the middle of the
> > queue. wX requires lock A, but lock A is held by a transaction
> > commit that is blocked by IO completion t1 on q1. The dependency
> > chain we then have is:
> > 
> > 	wX on q0 -> lock A -> t1 on q1
> > 
> > To prevent wX from blocking q0, when lock A is not gained, we
> > requeue wX to the tail of q0 such that the queue is now wX+1..wN,wX.
> > This means that wX will not block completion processing of data IO.
> > If wX is the only work on q0, then to stop the work queue from
> > spinning processing wX, queueing wX, processing wX.... there is a
> > delay(1) call to allow some time for other IOs to complete
> > before trying to process wX again.
> > 
> > At some point, q1 is processed and t1 is run and lock A
> > released. Once this happens, wX will gain lock A and finish the
> > completion and be freed.
> > 
> > The issue I appear to be seeing is that while q0 is doing:
> > 
> > 	wX -> requeue on q0 -> delay(1) -> wX -> requeue q0 -> wX
> > 
> > q1 which contains t1 is never getting processed, and hence the q0/wX
> > loop is never getting broken.
> 
> I see.  The use case itself shouldn't be problematic at all for cmwq
> (sans bugs of course).  In the other reply, you said "the system is
> 100% unresponsive when the livelock occurs", which is kind of
> puzzling.  It isn't really a livelock.

Actually, it is. You don't need to burn CPU to livelock, you just
need a loop in the state machine that cannot be broken by internal
or external events to be considered livelocked.

However, this is not what I was calling the livelock problem - this
is what I was calling the deadlock problem because to all external
appearances the state machine is deadlocked on the inode lock....

The livelock case I described where the system is completely
unresponsive is the one I'm testing the WQ_HIGHPRI mod against.

FWIW, having considered the above case again, and seeing what the
WQ_HIGHPRI mod does in terms of queuing, I think that it may also
solve this deadlock as the log IO completion will always be queued
ahead of the data IO completion now.


> >> Also, how does delay(1) help with chewing up CPU?  Are you talking
> >> about avoiding constant lock/unlock ops starving other lockers?  In
> >> such case, wouldn't cpu_relax() make more sense?
> > 
> > Basically delay(1) is used in many places in XFS as a "backoff and
> > retry after a short period of time" mechanism in places where
> > blocking would lead to deadlock or we need a state change to occur
> > before retrying the operation that would have deadlocked. If we
> > don't put a backoff in, then we simply burn CPU until the condition
> > clears.
> > 
> > In the case of the data IO completion workqueue processing, the CPU
> > burn occurred when the only item on the workqueue was the inode that
> > we could not lock.  Hence the backoff. It's not a great solution,
> > but it's the only one that could be used without changing everything
> > to use delayed works and hence suffer the associated structure bloat
> > for what is a rare corner case....
> 
> Hmm... The point where I'm confused is that *delay()'s are busy waits.
> They burn CPU cycles.  I suppose you're referring to *sleep()'s,
> right?

fs/xfs/linux-2.6/time.h:

static inline void delay(long ticks)
{
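        /* a real sleep for @ticks jiffies, not a busy-wait like udelay() */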
        schedule_timeout_uninterruptible(ticks);
}

> >> I don't remember but once I increased maximum concurrency for every
> >> workqueue (the limit was 128 or something) and xfs pretty quickly hit
> >> the concurrency limit.  IIRC, there was a function which allocates
> >> work_struct and schedules it.  I'll look through the emails.
> > 
> > How do you get concurrency requirements of 128 when you only have a
> > small number of CPUs?
> 
> Probably I have overloaded the term 'concurrency' too much.  In this
> case, I meant the number of workers assigned to work items of the wq.
> If you fire off N work items which sleep at the same time, cmwq will
> eventually try to create N workers as each previous worker goes to
> sleep so that the CPU doesn't sit idle while there are work items to
> process as long as N < @wq->nr->active.

Ok, so if I queue N items on a single CPU when max_active == N, they
get spread across N worker threads on different CPUs? 

> Documentation follows.

I'll have a read of this tonight.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  7:34             ` Dave Chinner
@ 2010-09-08  8:20               ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-08  8:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/08/2010 09:34 AM, Dave Chinner wrote:
>> I see.  The use case itself shouldn't be problematic at all for cmwq
>> (sans bugs of course).  In the other reply, you said "the system is
>> 100% unresponsive when the livelock occurs", which is kind of
>> puzzling.  It isn't really a livelock.
> 
> Actually, it is. You don't need to burn CPU to livelock, you just
> need a loop in the state machine that cannot be broken by internal
> or external events to be considered livelocked.

Yeah, but for the system to be completely unresponsive even to sysrq,
the system needs to be live/dead locked in a pretty specific way.

> However, this is not what I was calling the livelock problem - this
> is what I was calling the deadlock problem because to all external
> appearances the state machine is deadlocked on the inode lock....
> 
> The livelock case I described where the system is completely
> unresponsive is the one I'm testing the WQ_HIGHPRI mod against.
> 
> FWIW, having considered the above case again, and seeing what the
> WQ_HIGHPRI mod does in terms of queuing, I think that it may also
> solve this deadlock as the log IO completion will always be queued
> ahead of the data IO completion now.

Cool, but please keep in mind that the nr_active underflow bug may end
up stalling or loosening ordering rules for a workqueue.  Linus has
pulled in the pending fixes today.

>> Hmm... The point where I'm confused is that *delay()'s are busy waits.
>> They burn CPU cycles.  I suppose you're referring to *sleep()'s,
>> right?
> 
> fs/xfs/linux-2.6/time.h:
> 
> static inline void delay(long ticks)
> {
>         schedule_timeout_uninterruptible(ticks);
> }

Heh yeah, there's my confusion.

>> Probably I have overloaded the term 'concurrency' too much.  In this
>> case, I meant the number of workers assigned to work items of the wq.
>> If you fire off N work items which sleep at the same time, cmwq will
>> eventually try to create N workers as each previous worker goes to
>> sleep so that the CPU doesn't sit idle while there are work items to
>> process as long as N < @wq->nr_active.
> 
> Ok, so if I queue N items on a single CPU when max_active == N, they
> get spread across N worker threads on different CPUs? 

They may if necessary to keep the workqueue progressing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-07 12:26         ` Tejun Heo
@ 2010-09-08  8:22           ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-08  8:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Tue, Sep 07, 2010 at 02:26:54PM +0200, Tejun Heo wrote:
> On 09/07/2010 12:35 PM, Tejun Heo wrote:
> > Can you please help me a bit more?  Are you saying the following?
> > 
> > Work w0 starts execution on wq0.  w0 tries locking but fails.  Does
> > delay(1) and requeues itself on wq0 hoping another work w1 would be
> > queued on wq0 which will release the lock.  The requeueing should make
> > w0 queued and executed after w1, but instead w1 never gets executed
> > while w0 hogs the CPU constantly by re-executing itself.  Also, how
> > does delay(1) help with chewing up CPU?  Are you talking about
> > avoiding constant lock/unlock ops starving other lockers?  In such
> > case, wouldn't cpu_relax() make more sense?
> 
> Ooh, almost forgot.  There was nr_active underflow bug in workqueue
> code which could lead to malfunctioning max_active regulation and
> problems during queue freezing, so you could be hitting that too.  I
> sent out pull request some time ago but hasn't been pulled into
> mainline yet.  Can you please pull from the following branch and add
> WQ_HIGHPRI as discussed before and see whether the problem is still
> reproducible?

Ok, it looks as if the WQ_HIGHPRI is all that was required to avoid
the log IO completion starvation livelocks. I haven't yet pulled
the tree below, but I've now created about a billion inodes without
seeing any evidence of the livelock occurring.

Hence it looks like I've been seeing two livelocks - one caused by
the VM that Mel's patches fix, and one caused by the workqueue
changeover that is fixed by the WQ_HIGHPRI change.

Thanks for your insights, Tejun - I'll push the workqueue change
through the XFS tree to Linus.
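
Roughly, such a conversion might look like the sketch below; this is
illustrative only, and the workqueue names, flags and @max_active
values in the final commit may well differ:

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *xfslogd_workqueue;	/* log IO completion */
static struct workqueue_struct *xfsdatad_workqueue;	/* data IO completion */

static int xfs_alloc_wq_sketch(void)
{
	/* Queue log IO completions at the head of the gcwq worklist so
	 * they are never stuck behind a long run of data completions. */
	xfslogd_workqueue = alloc_workqueue("xfslogd",
					    WQ_RESCUER | WQ_HIGHPRI, 1);
	if (!xfslogd_workqueue)
		return -ENOMEM;

	xfsdatad_workqueue = alloc_workqueue("xfsdatad", WQ_RESCUER, 1);
	if (!xfsdatad_workqueue) {
		destroy_workqueue(xfslogd_workqueue);
		return -ENOMEM;
	}
	return 0;
}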

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  8:20               ` Tejun Heo
@ 2010-09-08  8:28                 ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-08  8:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Wed, Sep 08, 2010 at 10:20:27AM +0200, Tejun Heo wrote:
> On 09/08/2010 09:34 AM, Dave Chinner wrote:
> >> Probably I have overloaded the term 'concurrency' too much.  In this
> >> case, I meant the number of workers assigned to work items of the wq.
> >> If you fire off N work items which sleep at the same time, cmwq will
> >> eventually try to create N workers as each previous worker goes to
> >> sleep so that the CPU doesn't sit idle while there are work items to
> >> process as long as N < @wq->nr->active.
> > 
> > Ok, so if I queue N items on a single CPU when max_active == N, they
> > get spread across N worker threads on different CPUs? 
> 
> They may if necessary to keep the workqueue progressing.

Ok, so the normal case is that they will all be processed local to the
CPU they were queued on, like the old workqueue code?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  8:28                 ` Dave Chinner
@ 2010-09-08  8:46                   ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-08  8:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/08/2010 10:28 AM, Dave Chinner wrote:
>> They may if necessary to keep the workqueue progressing.
> 
> Ok, so the normal case is that they will all be processed local to the
> CPU they were queued on, like the old workqueue code?

Bound workqueues always process works locally.  Please consider the
following scenario.

 w0, w1, w2 are queued to q0 on the same CPU.  w0 burns CPU for 5ms
 then sleeps for 10ms then burns CPU for 5ms again then finishes.  w1
 and w2 sleeps for 10ms.

The following is what happens with the original workqueue (ignoring
all other tasks and processing overhead).

 TIME IN MSECS	EVENT
 0		w0 burns CPU
 5		w0 sleeps
 15		w0 wakes and burns CPU
 20		w0 finishes, w1 starts and sleeps
 30		w1 finishes, w2 starts and sleeps
 40		w2 finishes

With cmwq if @max_active >= 3,

 TIME IN MSECS	EVENT
 0		w0 burns CPU
 5		w0 sleeps, w1 starts and sleeps, w2 starts and sleeps
 15		w0 wakes and burns CPU, w1 finishes, w2 finishes
 20		w0 finishes

IOW, cmwq assigns a new worker when there are more work items to
process but no work item is currently in progress on the CPU.  Please
note that this behavior is across *all* workqueues.  It doesn't matter
which work item belongs to which workqueue.
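
To see this behaviour first-hand, the scenario can be sketched as a
small test module along the following lines (illustrative only - the
queue name, target CPU and max_active value are arbitrary and not
taken from any real code):

    #include <linux/module.h>
    #include <linux/workqueue.h>
    #include <linux/delay.h>

    static struct workqueue_struct *test_wq;
    static struct work_struct w0, w1, w2;

    static void w0_fn(struct work_struct *work)
    {
        mdelay(5);      /* burn CPU for 5ms */
        msleep(10);     /* sleep for 10ms */
        mdelay(5);      /* burn CPU for 5ms again */
    }

    static void w_sleep_fn(struct work_struct *work)
    {
        msleep(10);     /* w1 and w2 just sleep for 10ms */
    }

    static int __init cmwq_test_init(void)
    {
        /* @max_active >= 3 so all three works may be in flight at once */
        test_wq = alloc_workqueue("cmwq-test", 0, 3);
        if (!test_wq)
            return -ENOMEM;

        INIT_WORK(&w0, w0_fn);
        INIT_WORK(&w1, w_sleep_fn);
        INIT_WORK(&w2, w_sleep_fn);

        /* queue all three on the same CPU */
        queue_work_on(0, test_wq, &w0);
        queue_work_on(0, test_wq, &w1);
        queue_work_on(0, test_wq, &w2);
        return 0;
    }
    module_init(cmwq_test_init);
    MODULE_LICENSE("GPL");

Watching the kworkers for that CPU while the module loads should show
additional workers picking up w1 and w2 as soon as w0 goes to sleep,
matching the timeline above.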

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  8:22           ` Dave Chinner
@ 2010-09-08  8:51             ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-08  8:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/08/2010 10:22 AM, Dave Chinner wrote:
> Ok, it looks as if the WQ_HIGHPRI is all that was required to avoid
> the log IO completion starvation livelocks. I haven't yet pulled
> the tree below, but I've now created about a billion inodes without
> seeing any evidence of the livelock occurring.
> 
> Hence it looks like I've been seeing two livelocks - one caused by
> the VM that Mel's patches fix, and one caused by the workqueue
> changeover that is fixed by the WQ_HIGHPRI change.
> 
> Thanks for your insights, Tejun - I'll push the workqueue change
> through the XFS tree to Linus.

Great, BTW, I have several questions regarding wq usage in xfs.

* Do you think @max_active > 1 could be useful for xfs?  If most works
  queued on the wq are gonna contend for the same (blocking) set of
  resources, it would just make more threads sleeping on those
  resources but otherwise it would help reducing execution latency a
  lot.

* xfs_mru_cache is a singlethread workqueue.  Do you specifically need
  singlethreadedness (strict ordering of works) or is it just to avoid
  creating dedicated per-cpu workers?  If the latter, there's no need
  to use singlethread one anymore.

* Are all four workqueues in xfs used during memory allocation?  With
  the new implementation, the reasons to have dedicated wqs are,

  - Forward progress guarantee in the memory allocation path.  Each
    workqueue w/ WQ_RESCUER has _one_ rescuer thread reserved for
    execution of works on the specific wq, which will be used under
    memory pressure to make forward progress.

  - A wq is a flush domain.  You can flush works on it as a group.

  - A wq is also an attribute domain.  If certain work items need to be
    handled differently (highpri, cpu intensive, execution ordering,
    etc...), they can be queued to a wq w/ those attributes specified.

  Maybe some of those workqueues can drop WQ_RESCUER or be merged or just
  use the system workqueue?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  8:51             ` Tejun Heo
@ 2010-09-08 10:05               ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-08 10:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Wed, Sep 08, 2010 at 10:51:28AM +0200, Tejun Heo wrote:
> Hello,
> 
> On 09/08/2010 10:22 AM, Dave Chinner wrote:
> > Ok, it looks as if the WQ_HIGHPRI is all that was required to avoid
> > the log IO completion starvation livelocks. I haven't yet pulled
> > the tree below, but I've now created about a billion inodes without
> > seeing any evidence of the livelock occurring.
> > 
> > Hence it looks like I've been seeing two livelocks - one caused by
> > the VM that Mel's patches fix, and one caused by the workqueue
> > changeover that is fixed by the WQ_HIGHPRI change.
> > 
> > Thanks for your insights, Tejun - I'll push the workqueue change
> > through the XFS tree to Linus.
> 
> Great, BTW, I have several questions regarding wq usage in xfs.
> 
> * Do you think @max_active > 1 could be useful for xfs?  If most works
>   queued on the wq are gonna contend for the same (blocking) set of
>   resources, it would just make more threads sleeping on those
>   resources but otherwise it would help reducing execution latency a
>   lot.

It may indeed help, but I can't really say much more than that right
now. I need a deeper understanding of the impact of increasing
max_active (I have a basic understanding now) before I could say for
certain.

> * xfs_mru_cache is a singlethread workqueue.  Do you specifically need
>   singlethreadedness (strict ordering of works) or is it just to avoid
>   creating dedicated per-cpu workers?  If the latter, there's no need
>   to use singlethread one anymore.

Didn't need per-cpu workers, so could probably drop it now.

> * Are all four workqueues in xfs used during memory allocation?  With
>   the new implementation, the reasons to have dedicated wqs are,

The xfsdatad, xfslogd and xfsconvertd are all in the memory reclaim
path. That is, they need to be able to run and make progress when
memory is low because if the IO does not complete, pages under IO
will never complete the transition from dirty to clean. Hence they
are not in the direct memory allocation path, but they are
definitely an important part of the memory reclaim path that
operates in low memory conditions.

>   - Forward progress guarantee in the memory allocation path.  Each
>     workqueue w/ WQ_RESCUER has _one_ rescuer thread reserved for
>     execution of works on the specific wq, which will be used under
>     memory pressure to make forward progress.

That, to me, says they all need a rescuer thread because they all
need to be able to make forward progress in OOM conditions.
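
In alloc_workqueue() terms that would presumably end up looking
something like the following - a sketch only, reusing the existing XFS
queue names; the flag combinations are what this discussion implies,
not a reviewed patch:

    /* all three are in the memory reclaim path, so each keeps a rescuer */
    xfslogd_workqueue = alloc_workqueue("xfslogd",
                                        WQ_RESCUER | WQ_HIGHPRI, 1);
    xfsdatad_workqueue = alloc_workqueue("xfsdatad", WQ_RESCUER, 1);
    xfsconvertd_workqueue = alloc_workqueue("xfsconvertd", WQ_RESCUER, 1);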

>   - A wq is a flush domain.  You can flush works on it as a group.

We do that for the above workqueues as well to ensure
correct sync(1), freeze and unmount behaviour (see
xfs_flush_buftarg()).
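
In workqueue terms that flush-domain usage is essentially the
following (a sketch - the real xfs_flush_buftarg() path does more than
just flush these queues):

    /*
     * Wait for all queued IO completion work to finish before the
     * filesystem is declared quiesced (sync/freeze/unmount).
     */
    flush_workqueue(xfsconvertd_workqueue);
    flush_workqueue(xfsdatad_workqueue);
    flush_workqueue(xfslogd_workqueue);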

>   - A wq is also an attribute domain.  If certain work items need to be
>     handled differently (highpri, cpu intensive, execution ordering,
>     etc...), they can be queued to a wq w/ those attributes specified.

And we already know that xfslogd_workqueue needs the WQ_HIGHPRI
flag....

>   Maybe some of those workqueues can drop WQ_RESCUER or be merged or just
>   use the system workqueue?

Maybe the mru wq can use the system wq, but I'm really opposed to
merging XFS wqs with system work queues simply from a debugging POV.
I've lost count of the number of times I've walked the IO completion
queues with a debugger or crash dump analyser to try to work out if
missing IO that wedged the filesystem got stuck on the completion
queue. If I want to be able to say "the IO was lost by a lower
layer", then I have to be able to confirm it is not stuck in a
completion queue. That's much harder if I don't know what the work
container objects on the queue are....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08  8:46                   ` Tejun Heo
@ 2010-09-08 10:12                     ` Dave Chinner
  -1 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2010-09-08 10:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, xfs, linux-fsdevel

On Wed, Sep 08, 2010 at 10:46:13AM +0200, Tejun Heo wrote:
> On 09/08/2010 10:28 AM, Dave Chinner wrote:
> >> They may if necessary to keep the workqueue progressing.
> > 
> > Ok, so the normal case is that they will all be processed local to the
> > CPU they were queued on, like the old workqueue code?
> 
> Bound workqueues always process works locally.  Please consider the
> following scenario.
> 
>  w0, w1, w2 are queued to q0 on the same CPU.  w0 burns CPU for 5ms
>  then sleeps for 10ms then burns CPU for 5ms again then finishes.  w1
>  and w2 sleeps for 10ms.
> 
> The following is what happens with the original workqueue (ignoring
> all other tasks and processing overhead).
> 
>  TIME IN MSECS	EVENT
>  0		w0 burns CPU
>  5		w0 sleeps
>  15		w0 wakes and burns CPU
>  20		w0 finishes, w1 starts and sleeps
>  30		w1 finishes, w2 starts and sleeps
>  40		w2 finishes
> 
> With cmwq if @max_active >= 3,
> 
>  TIME IN MSECS	EVENT
>  0		w0 burns CPU
>  5		w0 sleeps, w1 starts and sleeps, w2 starts and sleeps
>  15		w0 wakes and burns CPU, w1 finishes, w2 finishes
>  20		w0 finishes
> 
> IOW, cmwq assigns a new worker when there are more work items to
> process but no work item is currently in progress on the CPU.  Please
> note that this behavior is across *all* workqueues.  It doesn't matter
> which work item belongs to which workqueue.

Ok, so in this case if this was on CPU 1, I'd see kworker[1:0],
kworker[1:1] and kworker[1:2] threads all accumulate CPU time?  I'm
just trying to relate your example to behaviour I've seen to
check if I understand the example correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08 10:12                     ` Dave Chinner
@ 2010-09-08 10:28                       ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-08 10:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/08/2010 12:12 PM, Dave Chinner wrote:
> Ok, so in this case if this was on CPU 1, I'd see kworker[1:0],
> kworker[1:1] and kworker[1:2] threads all accumulate CPU time?  I'm
> just trying to relate your example to behaviour I've seen to
> check if I understand the example correctly.

Yes, you're right.  If all three works just burn CPU cycles for 5ms
then you'll only see one kworker w/ 15ms of accumulated CPU time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks
  2010-09-08 10:05               ` Dave Chinner
@ 2010-09-08 14:10                 ` Tejun Heo
  -1 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-08 14:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel

Hello,

On 09/08/2010 12:05 PM, Dave Chinner wrote:
>> * Do you think @max_active > 1 could be useful for xfs?  If most works
>>   queued on the wq are gonna contend for the same (blocking) set of
>>   resources, it would just make more threads sleeping on those
>>   resources but otherwise it would help reducing execution latency a
>>   lot.
> 
> It may indeed help, but I can't really say much more than that right
> now. I need a deeper understanding of the impact of increasing
> max_active (I have a basic understanding now) before I could say for
> certain.

Sure, things should be fine as they currently stand.  No need to hurry
anything.

>> * xfs_mru_cache is a singlethread workqueue.  Do you specifically need
>>   singlethreadedness (strict ordering of works) or is it just to avoid
>>   creating dedicated per-cpu workers?  If the latter, there's no need
>>   to use singlethread one anymore.
> 
> Didn't need per-cpu workers, so could probably drop it now.

I see.  I'll soon send out a patch to convert xfs to use
alloc_workqueue() instead and will drop the singlethread restriction
there.
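
For what it's worth, that conversion would presumably be on the order
of the following sketch (the queue pointer name is illustrative, and
whether max_active stays at 1 or the work simply moves to the system
workqueue is exactly the open question):

    /* before: dedicated single-threaded queue with its own kthread */
    mru_wq = create_singlethread_workqueue("xfs_mru_cache");

    /* after: no kthread is created up front; strict cross-CPU ordering
     * is lost, but per the discussion above it was never needed here */
    mru_wq = alloc_workqueue("xfs_mru_cache", 0, 1);

    /* or, if no dedicated queue is needed at all, use the system one */
    schedule_work(&some_work);      /* some_work is a placeholder */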

>>   Maybe some of those workqueues can drop WQ_RESCUER or be merged or just
>>   use the system workqueue?
> 
> Maybe the mru wq can use the system wq, but I'm really opposed to
> merging XFS wqs with system work queues simply from a debugging POV.
> I've lost count of the number of times I've walked the IO completion
> queues with a debugger or crash dump analyser to try to work out if
> missing IO that wedged the filesystem got stuck on the completion
> queue. If I want to be able to say "the IO was lost by a lower
> layer", then I have to be able to confirm it is not stuck in a
> completion queue. That's much harder if I don't know what the work
> container objects on the queue are....

Hmm... that's gonna be a bit more difficult with cmwq as all the works
are now queued on the shared worklist but you should still be able to
tell.  Maybe crash can be taught how to tell the associated workqueue
from a pending work.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2010-09-08 14:11 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-07  7:29 [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks Dave Chinner
2010-09-07  9:04 ` Tejun Heo
2010-09-07 10:01   ` Dave Chinner
2010-09-07 10:35     ` Tejun Heo
2010-09-07 12:26       ` Tejun Heo
2010-09-07 13:02         ` Dave Chinner
2010-09-08  8:22         ` Dave Chinner
2010-09-08  8:51           ` Tejun Heo
2010-09-08 10:05             ` Dave Chinner
2010-09-08 14:10               ` Tejun Heo
2010-09-07 12:48       ` Dave Chinner
2010-09-07 15:39         ` Tejun Heo
2010-09-08  7:34           ` Dave Chinner
2010-09-08  8:20             ` Tejun Heo
2010-09-08  8:28               ` Dave Chinner
2010-09-08  8:46                 ` Tejun Heo
2010-09-08 10:12                   ` Dave Chinner
2010-09-08 10:28                     ` Tejun Heo
