[PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
@ 2016-06-30 12:53 Brian Foster
  2016-06-30 22:44 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-06-30 12:53 UTC (permalink / raw)
  To: xfs

Newly allocated XFS metadata buffers are added to the LRU once the hold
count is released, which typically occurs after I/O completion. There is
no other mechanism at current that tracks the existence or I/O state of
a new buffer. Further, readahead I/O tends to be submitted
asynchronously by nature, which means the I/O can remain in flight and
actually complete long after the calling context is gone. This means
that file descriptors or any other holds on the filesystem can be
released, allowing the filesystem to be unmounted while I/O is still in
flight. When I/O completion occurs, core data structures may have been
freed, causing completion to run into invalid memory accesses and likely
to panic.

This problem is reproduced on XFS via directory readahead. A filesystem
is mounted, a directory is opened/closed and the filesystem immediately
unmounted. The open/close cycle triggers a directory readahead that if
delayed long enough, runs buffer I/O completion after the unmount has
completed.

To work around this problem, add readahead buffers to the LRU earlier
than other buffers (when the buffer is allocated, specifically). The
buffer hold count will ultimately remain until I/O completion, which
means any shrinker activity will skip the buffer until then. This makes
the buffer visible to xfs_wait_buftarg(), however, which ensures that an
unmount or quiesce waits for I/O completion appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This addresses the problem reproduced by the recently posted xfstests
test:

  http://thread.gmane.org/gmane.comp.file-systems.fstests/2740

This could probably be made more involved, i.e., to create another list
of buffers in flight or some such. This seems more simple/sane to me,
however, and survives my testing so far...

Brian

 fs/xfs/xfs_buf.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..3f03df9 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -590,8 +590,20 @@ xfs_buf_get_map(
 		return NULL;
 	}

+	/*
+	 * If the buffer found doesn't match the one allocated above, somebody
+	 * else beat us to insertion and we can toss the new one.
+	 *
+	 * If we did add the buffer and it happens to be readahead, add to the
+	 * LRU now rather than waiting until the hold is released. Otherwise,
+	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
+	 * nothing else prevents an unmount before I/O completion.
+	 */
 	if (bp != new_bp)
 		xfs_buf_free(new_bp);
+	else if (flags & XBF_READ_AHEAD &&
+		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
+		atomic_inc(&bp->b_hold);

 found:
 	if (!bp->b_addr) {
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-06-30 12:53 [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic Brian Foster
@ 2016-06-30 22:44 ` Dave Chinner
  2016-06-30 23:56   ` Brian Foster
  2016-07-01 22:30   ` Brian Foster
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2016-06-30 22:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> Newly allocated XFS metadata buffers are added to the LRU once the hold
> count is released, which typically occurs after I/O completion. There is
> no other mechanism at current that tracks the existence or I/O state of
> a new buffer. Further, readahead I/O tends to be submitted
> asynchronously by nature, which means the I/O can remain in flight and
> actually complete long after the calling context is gone. This means
> that file descriptors or any other holds on the filesystem can be
> released, allowing the filesystem to be unmounted while I/O is still in
> flight. When I/O completion occurs, core data structures may have been
> freed, causing completion to run into invalid memory accesses and likely
> to panic.
> 
> This problem is reproduced on XFS via directory readahead. A filesystem
> is mounted, a directory is opened/closed and the filesystem immediately
> unmounted. The open/close cycle triggers a directory readahead that if
> delayed long enough, runs buffer I/O completion after the unmount has
> completed.
> 
> To work around this problem, add readahead buffers to the LRU earlier
> than other buffers (when the buffer is allocated, specifically). The
> buffer hold count will ultimately remain until I/O completion, which
> means any shrinker activity will skip the buffer until then. This makes
> the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> unmount or quiesce waits for I/O completion appropriately.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This addresses the problem reproduced by the recently posted xfstests
> test:
> 
>   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> 
> This could probably be made more involved, i.e., to create another list
> of buffers in flight or some such. This seems more simple/sane to me,
> however, and survives my testing so far...
> 
> Brian
> 
>  fs/xfs/xfs_buf.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 4665ff6..3f03df9 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -590,8 +590,20 @@ xfs_buf_get_map(
>  		return NULL;
>  	}
>  
> +	/*
> +	 * If the buffer found doesn't match the one allocated above, somebody
> +	 * else beat us to insertion and we can toss the new one.
> +	 *
> +	 * If we did add the buffer and it happens to be readahead, add to the
> +	 * LRU now rather than waiting until the hold is released. Otherwise,
> +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> +	 * nothing else prevents an unmount before I/O completion.
> +	 */
>  	if (bp != new_bp)
>  		xfs_buf_free(new_bp);
> +	else if (flags & XBF_READ_AHEAD &&
> +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> +		atomic_inc(&bp->b_hold);

This doesn't sit right with me. The LRU is for "unused" objects, and
readahead objects are not unused until IO completes and nobody is
waiting on them.

As it is, this points out another problem with readahead buffers -
they aren't actually being cached properly because b_lru_ref == 0,
which means they are immediately reclaimed on IO completion rather
than being added to the LRU....

I also think that it's not sufficient to cover the generic case of
async IO that has no waiter. i.e. we could do get_buf, submit async
write, drop submitter reference, and now we have the same problem
but on a write.  i.e. this problem is and async IO issue, not a
readahead issue.

I think that it might be better to fix it by doing this:

	1. ensure async IO submission always has b_lru_ref set, and
	if it isn't, set it to 1. This ensures the buffer will be
	added to the LRU on completion if it isn't already there.

	2. keep a count of async buffer IO in progress. A per-cpu
	counter in the buftarg will be fine for this. Increment in
	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
	xfs_buf_iodone() once we've determined if the buffer needs
	adding to the LRU or not.

	3. make xfs_wait_buftarg() wait until the async IO count
	goes to zero before it gives up trying to release buffers on
	the LRU.

That will ensure readahead buffers are cached, and we capture both
async read and async write buffers in xfs_wait_buftarg().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-06-30 22:44 ` Dave Chinner
@ 2016-06-30 23:56   ` Brian Foster
  2016-07-01  4:33     ` Dave Chinner
  2016-07-01 22:30   ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-06-30 23:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > count is released, which typically occurs after I/O completion. There is
> > no other mechanism at current that tracks the existence or I/O state of
> > a new buffer. Further, readahead I/O tends to be submitted
> > asynchronously by nature, which means the I/O can remain in flight and
> > actually complete long after the calling context is gone. This means
> > that file descriptors or any other holds on the filesystem can be
> > released, allowing the filesystem to be unmounted while I/O is still in
> > flight. When I/O completion occurs, core data structures may have been
> > freed, causing completion to run into invalid memory accesses and likely
> > to panic.
> > 
> > This problem is reproduced on XFS via directory readahead. A filesystem
> > is mounted, a directory is opened/closed and the filesystem immediately
> > unmounted. The open/close cycle triggers a directory readahead that if
> > delayed long enough, runs buffer I/O completion after the unmount has
> > completed.
> > 
> > To work around this problem, add readahead buffers to the LRU earlier
> > than other buffers (when the buffer is allocated, specifically). The
> > buffer hold count will ultimately remain until I/O completion, which
> > means any shrinker activity will skip the buffer until then. This makes
> > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > unmount or quiesce waits for I/O completion appropriately.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This addresses the problem reproduced by the recently posted xfstests
> > test:
> > 
> >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > 
> > This could probably be made more involved, i.e., to create another list
> > of buffers in flight or some such. This seems more simple/sane to me,
> > however, and survives my testing so far...
> > 
> > Brian
> > 
> >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 4665ff6..3f03df9 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> >  		return NULL;
> >  	}
> >  
> > +	/*
> > +	 * If the buffer found doesn't match the one allocated above, somebody
> > +	 * else beat us to insertion and we can toss the new one.
> > +	 *
> > +	 * If we did add the buffer and it happens to be readahead, add to the
> > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > +	 * nothing else prevents an unmount before I/O completion.
> > +	 */
> >  	if (bp != new_bp)
> >  		xfs_buf_free(new_bp);
> > +	else if (flags & XBF_READ_AHEAD &&
> > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > +		atomic_inc(&bp->b_hold);
> 
> This doesn't sit right with me. The LRU is for "unused" objects, and
> readahead objects are not unused until IO completes and nobody is
> waiting on them.
> 

Sure, but don't buffers remain on the LRU once they are reused? My
impression is that this patch really doesn't change behavior in that
regard. IOWs, the buffer is going to end up on the LRU either way and
either end up disposed if it is never actually accessed or "used"
otherwise and the hold count is bumped. Further, isn't the hold count
mechanism precisely to handle "used" buffers on the LRU?

> As it is, this points out another problem with readahead buffers -
> they aren't actually being cached properly because b_lru_ref == 0,
> which means they are immediately reclaimed on IO completion rather
> than being added to the LRU....
> 

Not following... _xfs_buf_alloc() sets b_lru_ref to 1 and I don't see it
set/decremented unless we stale or dispose it. What am I missing?

> I also think that it's not sufficient to cover the generic case of
> async IO that has no waiter. i.e. we could do get_buf, submit async
> write, drop submitter reference, and now we have the same problem
> but on a write.  i.e. this problem is and async IO issue, not a
> readahead issue.
> 

Indeed. I thought about making the current patch check for ASYNC, but
opted to try and make it more isolated (I suppose ASYNC|READAHEAD would
have been moreso).

The other thought I had was to change where buffers are added to the LRU
altogether, but didn't want to jump right to that. Is there any issue
with populating the LRU with initially held buffers as such, or any
particular reason LRU addition was deferred to I/O completion in the
first place?

> I think that it might be better to fix it by doing this:
> 
> 	1. ensure async IO submission always has b_lru_ref set, and
> 	if it isn't, set it to 1. This ensures the buffer will be
> 	added to the LRU on completion if it isn't already there.
> 

See above. I'm not following why this is necessary (if it is, it seems
indicative of a bug).

> 	2. keep a count of async buffer IO in progress. A per-cpu
> 	counter in the buftarg will be fine for this. Increment in
> 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> 	xfs_buf_iodone() once we've determined if the buffer needs
> 	adding to the LRU or not.
> 
> 	3. make xfs_wait_buftarg() wait until the async IO count
> 	goes to zero before it gives up trying to release buffers on
> 	the LRU.
> 

This is along the lines of what I was thinking wrt to a list (explicit
I/O tracking one way or another). The count is more simple and does
cover the arbitrary read/write case, but is still more code and seems
slightly duplicative to me because in most cases the LRU wait already
handles this. I'm also not totally sure we need to handle the write
case. Are we susceptible to this issue anywhere that isn't already
protected by the sb write protection mechanism, for example?

OTOH, explicit tracking might provide a more generic way to deal with
uncached buffers as opposed to lock cycling.

Brian

> That will ensure readahead buffers are cached, and we capture both
> async read and async write buffers in xfs_wait_buftarg().
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-06-30 23:56   ` Brian Foster
@ 2016-07-01  4:33     ` Dave Chinner
  2016-07-01 12:53       ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-07-01  4:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jun 30, 2016 at 07:56:21PM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > count is released, which typically occurs after I/O completion. There is
> > > no other mechanism at current that tracks the existence or I/O state of
> > > a new buffer. Further, readahead I/O tends to be submitted
> > > asynchronously by nature, which means the I/O can remain in flight and
> > > actually complete long after the calling context is gone. This means
> > > that file descriptors or any other holds on the filesystem can be
> > > released, allowing the filesystem to be unmounted while I/O is still in
> > > flight. When I/O completion occurs, core data structures may have been
> > > freed, causing completion to run into invalid memory accesses and likely
> > > to panic.
> > > 
> > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > is mounted, a directory is opened/closed and the filesystem immediately
> > > unmounted. The open/close cycle triggers a directory readahead that if
> > > delayed long enough, runs buffer I/O completion after the unmount has
> > > completed.
> > > 
> > > To work around this problem, add readahead buffers to the LRU earlier
> > > than other buffers (when the buffer is allocated, specifically). The
> > > buffer hold count will ultimately remain until I/O completion, which
> > > means any shrinker activity will skip the buffer until then. This makes
> > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > unmount or quiesce waits for I/O completion appropriately.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > This addresses the problem reproduced by the recently posted xfstests
> > > test:
> > > 
> > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > 
> > > This could probably be made more involved, i.e., to create another list
> > > of buffers in flight or some such. This seems more simple/sane to me,
> > > however, and survives my testing so far...
> > > 
> > > Brian
> > > 
> > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..3f03df9 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > >  		return NULL;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > +	 * else beat us to insertion and we can toss the new one.
> > > +	 *
> > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > +	 * nothing else prevents an unmount before I/O completion.
> > > +	 */
> > >  	if (bp != new_bp)
> > >  		xfs_buf_free(new_bp);
> > > +	else if (flags & XBF_READ_AHEAD &&
> > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > +		atomic_inc(&bp->b_hold);
> > 
> > This doesn't sit right with me. The LRU is for "unused" objects, and
> > readahead objects are not unused until IO completes and nobody is
> > waiting on them.
> > 
> 
> Sure, but don't buffers remain on the LRU once they are reused?

Yes, but that's once the contents have been used.

> My
> impression is that this patch really doesn't change behavior in that
> regard.

Not really - but it makes the readahead buffers completely different
to the rest of the buffers on the LRU in that they don't contain
valid data.

For the same reason we also don't add buffers at creation time
because they don't hold valid data until after the code that has
created/read in the data has released it.....

> IOWs, the buffer is going to end up on the LRU either way and
> either end up disposed if it is never actually accessed or "used"
> otherwise and the hold count is bumped. Further, isn't the hold count
> mechanism precisely to handle "used" buffers on the LRU?

... hence we only add the buffer to the LRU once the initial hold
count reaches zero and the owner has indicated that the buffer
should be moved to the LRU by setting b_lru_ref. Once we've added
the buffer to the LRU, we add a reference count that persists untile
b_lru_ref is decremented to zero. At that point, we remove the
euffer from the LRU and release it. That (b_lru_ref == 0) then
triggers the buffer to be freed....

Note also that buffers that contain stale data (i.e. XBF_STALE) are
not added to the LRU and xfs_buf_stale() removes buffers from the
LRU, so shortcutting the above reclaim lifecycle in the case where a
user invalidates the contents of the buffer. The end result is the
same - there are no buffers with stale contents on the LRU.

readahead is a special case - there is no accessor to say "cache
this buffer for N references", but we have to keep it around for
some time so that readahead is effective. We don't want to add it
before we've done IO on it, and realistically we should only add it
to the LRU if there was no IO error. We've had to work around bugs
introduced by caching failed readahead buffers on the LRU in the
past....

> > As it is, this points out another problem with readahead buffers -
> > they aren't actually being cached properly because b_lru_ref == 0,
> > which means they are immediately reclaimed on IO completion rather
> > than being added to the LRU....
> > 
> 
> Not following... _xfs_buf_alloc() sets b_lru_ref to 1 and I don't see it
> set/decremented unless we stale or dispose it. What am I missing?

Sorry, I misread the cscope output I was looking at to quickly
remind me of how it all worked - xfs_buf_stale() sets it to zero,
not _xfs_buf_alloc().

> > I also think that it's not sufficient to cover the generic case of
> > async IO that has no waiter. i.e. we could do get_buf, submit async
> > write, drop submitter reference, and now we have the same problem
> > but on a write.  i.e. this problem is and async IO issue, not a
> > readahead issue.
> > 
> 
> Indeed. I thought about making the current patch check for ASYNC, but
> opted to try and make it more isolated (I suppose ASYNC|READAHEAD would
> have been moreso).
> 
> The other thought I had was to change where buffers are added to the LRU
> altogether, but didn't want to jump right to that. Is there any issue
> with populating the LRU with initially held buffers as such, or any
> particular reason LRU addition was deferred to I/O completion in the
> first place?

Yes, because then we have to deal with buffers that fail memory
allocation, read IO or are allocated just to invalidate a range of
disk blocks during a transaction (e.g. removing a remote attribute).
There are probably other cases where we don't want to put buffers we
allocate onto the LRU, but I can't think of any more right now.

> > I think that it might be better to fix it by doing this:
> > 
> > 	1. ensure async IO submission always has b_lru_ref set, and
> > 	if it isn't, set it to 1. This ensures the buffer will be
> > 	added to the LRU on completion if it isn't already there.
> > 
> 
> See above. I'm not following why this is necessary (if it is, it seems
> indicative of a bug).
> 
> > 	2. keep a count of async buffer IO in progress. A per-cpu
> > 	counter in the buftarg will be fine for this. Increment in
> > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > 	xfs_buf_iodone() once we've determined if the buffer needs
> > 	adding to the LRU or not.
> > 
> > 	3. make xfs_wait_buftarg() wait until the async IO count
> > 	goes to zero before it gives up trying to release buffers on
> > 	the LRU.
> > 
> 
> This is along the lines of what I was thinking wrt to a list (explicit
> I/O tracking one way or another). The count is more simple and does
> cover the arbitrary read/write case, but is still more code and seems
> slightly duplicative to me because in most cases the LRU wait already
> handles this.

Sure, but adding a list means you need global locks and when you
have lots of concurrent readahead going on that will be a problem.
readahead is supposed to have minimal overhead, so anything that
adds submission serialisation is not the best idea...  The counter
has almost no overhead for the async IO case, read or write, and
that's all we need to allow the IO to complete safely during
unmount.

> I'm also not totally sure we need to handle the write
> case. Are we susceptible to this issue anywhere that isn't already
> protected by the sb write protection mechanism, for example?

We can do metadata buffer writes outside the superblock write
protection context.  A good example of that is log recovery on
read-only filesystems....

> OTOH, explicit tracking might provide a more generic way to deal with
> uncached buffers as opposed to lock cycling.

Well, only if those uncached buffers are using async IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-01  4:33     ` Dave Chinner
@ 2016-07-01 12:53       ` Brian Foster
  2016-07-04  0:08         ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-01 12:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 01, 2016 at 02:33:31PM +1000, Dave Chinner wrote:
> On Thu, Jun 30, 2016 at 07:56:21PM -0400, Brian Foster wrote:
> > On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > > count is released, which typically occurs after I/O completion. There is
> > > > no other mechanism at current that tracks the existence or I/O state of
> > > > a new buffer. Further, readahead I/O tends to be submitted
> > > > asynchronously by nature, which means the I/O can remain in flight and
> > > > actually complete long after the calling context is gone. This means
> > > > that file descriptors or any other holds on the filesystem can be
> > > > released, allowing the filesystem to be unmounted while I/O is still in
> > > > flight. When I/O completion occurs, core data structures may have been
> > > > freed, causing completion to run into invalid memory accesses and likely
> > > > to panic.
> > > > 
> > > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > > is mounted, a directory is opened/closed and the filesystem immediately
> > > > unmounted. The open/close cycle triggers a directory readahead that if
> > > > delayed long enough, runs buffer I/O completion after the unmount has
> > > > completed.
> > > > 
> > > > To work around this problem, add readahead buffers to the LRU earlier
> > > > than other buffers (when the buffer is allocated, specifically). The
> > > > buffer hold count will ultimately remain until I/O completion, which
> > > > means any shrinker activity will skip the buffer until then. This makes
> > > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > > unmount or quiesce waits for I/O completion appropriately.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > This addresses the problem reproduced by the recently posted xfstests
> > > > test:
> > > > 
> > > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > > 
> > > > This could probably be made more involved, i.e., to create another list
> > > > of buffers in flight or some such. This seems more simple/sane to me,
> > > > however, and survives my testing so far...
> > > > 
> > > > Brian
> > > > 
> > > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 4665ff6..3f03df9 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > > >  		return NULL;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > > +	 * else beat us to insertion and we can toss the new one.
> > > > +	 *
> > > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > > +	 * nothing else prevents an unmount before I/O completion.
> > > > +	 */
> > > >  	if (bp != new_bp)
> > > >  		xfs_buf_free(new_bp);
> > > > +	else if (flags & XBF_READ_AHEAD &&
> > > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > > +		atomic_inc(&bp->b_hold);
> > > 
> > > This doesn't sit right with me. The LRU is for "unused" objects, and
> > > readahead objects are not unused until IO completes and nobody is
> > > waiting on them.
> > > 
> > 
> > Sure, but don't buffers remain on the LRU once they are reused?
> 
> Yes, but that's once the contents have been used.
> 
> > My
> > impression is that this patch really doesn't change behavior in that
> > regard.
> 
> Not really - but it makes the readahead buffers completely different
> to the rest of the buffers on the LRU in that they don't contain
> valid data.
> 

Yes, this does specially handle readahead buffers in terms of
implementation, which is unfortunate...

> For the same reason we also don't add buffers at creation time
> because they don't hold valid data until after the code that has
> created/read in the data has released it.....
> 

... but buffers are added to the LRU at I/O completion regardless of
whether they have valid data or not (i.e., even on I/O error).

> > IOWs, the buffer is going to end up on the LRU either way and
> > either end up disposed if it is never actually accessed or "used"
> > otherwise and the hold count is bumped. Further, isn't the hold count
> > mechanism precisely to handle "used" buffers on the LRU?
> 
> ... hence we only add the buffer to the LRU once the initial hold
> count reaches zero and the owner has indicated that the buffer
> should be moved to the LRU by setting b_lru_ref. Once we've added
> the buffer to the LRU, we add a reference count that persists untile
> b_lru_ref is decremented to zero. At that point, we remove the
> euffer from the LRU and release it. That (b_lru_ref == 0) then
> triggers the buffer to be freed....
> 
> Note also that buffers that contain stale data (i.e. XBF_STALE) are
> not added to the LRU and xfs_buf_stale() removes buffers from the
> LRU, so shortcutting the above reclaim lifecycle in the case where a
> user invalidates the contents of the buffer. The end result is the
> same - there are no buffers with stale contents on the LRU.
> 

Yes, but this seems arbitrary. The buffer may or may not be on the LRU
when it is marked stale.

> readahead is a special case - there is no accessor to say "cache
> this buffer for N references", but we have to keep it around for
> some time so that readahead is effective. We don't want to add it
> before we've done IO on it, and realistically we should only add it
> to the LRU if there was no IO error. We've had to work around bugs
> introduced by caching failed readahead buffers on the LRU in the
> past....
> 

Ok, that makes sense. At least, I get the sense that you have a view of
the LRU that is slightly different from what the code actually does
(e.g., buffers w/ I/O errors on the LRU when they shouldn't be), which
is why I'm trying to reconcile some of the reasoning here.

In other words, if the LRU was an unused only buffer list where used or
stale buffers were pulled off and later reinserted/reprioritized (in
terms of reclaim), then this would all make complete sense to me. In
actuality, we have an LRU list that has buffers with I/O errors, buffers
in use, buffers that might or might not be stale, etc.

> > > As it is, this points out another problem with readahead buffers -
> > > they aren't actually being cached properly because b_lru_ref == 0,
> > > which means they are immediately reclaimed on IO completion rather
> > > than being added to the LRU....
> > > 
> > 
> > Not following... _xfs_buf_alloc() sets b_lru_ref to 1 and I don't see it
> > set/decremented unless we stale or dispose it. What am I missing?
> 
> Sorry, I misread the cscope output I was looking at to quickly
> remind me of how it all worked - xfs_buf_stale() sets it to zero,
> not _xfs_buf_alloc().
> 
> > > I also think that it's not sufficient to cover the generic case of
> > > async IO that has no waiter. i.e. we could do get_buf, submit async
> > > write, drop submitter reference, and now we have the same problem
> > > but on a write.  i.e. this problem is and async IO issue, not a
> > > readahead issue.
> > > 
> > 
> > Indeed. I thought about making the current patch check for ASYNC, but
> > opted to try and make it more isolated (I suppose ASYNC|READAHEAD would
> > have been moreso).
> > 
> > The other thought I had was to change where buffers are added to the LRU
> > altogether, but didn't want to jump right to that. Is there any issue
> > with populating the LRU with initially held buffers as such, or any
> > particular reason LRU addition was deferred to I/O completion in the
> > first place?
> 
> Yes, because then we have to deal with buffers that fail memory
> allocation, read IO or are allocated just to invalidate a range of
> disk blocks during a transaction (e.g. removing a remote attribute).
> There are probably other cases where we don't want to put buffers we
> allocate onto the LRU, but I can't think of any more right now.
> 

I was thinking more along the lines of insertion on I/O submission
rather than allocation. I.e., similar the proposed #1 below, but to
actually insert to the list such that we never lose track of the buffer.
My understanding is that for an async buffer, LRU insertion is imminent
at this point as it is (unless perhaps it is looked up and marked stale
by somebody else before I/O completion, but I doubt that's common).

> > > I think that it might be better to fix it by doing this:
> > > 
> > > 	1. ensure async IO submission always has b_lru_ref set, and
> > > 	if it isn't, set it to 1. This ensures the buffer will be
> > > 	added to the LRU on completion if it isn't already there.
> > > 
> > 
> > See above. I'm not following why this is necessary (if it is, it seems
> > indicative of a bug).
> > 
> > > 	2. keep a count of async buffer IO in progress. A per-cpu
> > > 	counter in the buftarg will be fine for this. Increment in
> > > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > > 	xfs_buf_iodone() once we've determined if the buffer needs
> > > 	adding to the LRU or not.
> > > 
> > > 	3. make xfs_wait_buftarg() wait until the async IO count
> > > 	goes to zero before it gives up trying to release buffers on
> > > 	the LRU.
> > > 
> > 
> > This is along the lines of what I was thinking wrt to a list (explicit
> > I/O tracking one way or another). The count is more simple and does
> > cover the arbitrary read/write case, but is still more code and seems
> > slightly duplicative to me because in most cases the LRU wait already
> > handles this.
> 
> Sure, but adding a list means you need global locks and when you
> have lots of concurrent readahead going on that will be a problem.
> readahead is supposed to have minimal overhead, so anything that
> adds submission serialisation is not the best idea...  The counter
> has almost no overhead for the async IO case, read or write, and
> that's all we need to allow the IO to complete safely during
> unmount.
> 

Indeed, I'm not advocating a list over counters. Just pointing out that
either seem potentially unnecessary or more complexity than really
necessary.

I guess my initial approach was to make the fix as isolated as possible
because this code is hairy and it seems we have a lot of little hacks
around to deal with other corner cases such as this. Surely this adds
yet another one, but afaics, so does a counter mechanism. At the end of
the day, the only real issue we know it fixes is this readahead corner
case.

I think that if we're going to go as far as adding a list/counter
mechanism, we should try to fix this in a way that is more broadly
useful. By that I mean fix up the existing mechanism or add something
that allows us to start unwinding some of the preexisting hacks such as
cycling buffer locks to ensure I/O completion (perhaps alleviating the
need to hold locks across I/O in the first place), etc.

Perhaps the counter approach opens the door for that, I have to think
about it some more..

> > I'm also not totally sure we need to handle the write
> > case. Are we susceptible to this issue anywhere that isn't already
> > protected by the sb write protection mechanism, for example?
> 
> We can do metadata buffer writes outside the superblock write
> protection context.  A good example of that is log recovery on
> read-only filesystems....
> 

This waits for I/O to complete so is unaffected by this problem. Now
that I take a look at that, I'm not sure we'd be able to replace that
with an I/O counter wait since technically that would wait for all
in-flight I/O to complete. This context (xfs_buf_delwri_submit()) is
only looking to wait for the locally submitted I/O to complete. Ugh.

Brian

> > OTOH, explicit tracking might provide a more generic way to deal with
> > uncached buffers as opposed to lock cycling.
> 
> Well, only if those uncached buffers are using async IO....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-01 12:53       ` Brian Foster
@ 2016-07-04  0:08         ` Dave Chinner
  2016-07-05 13:42           ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-07-04  0:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Jul 01, 2016 at 08:53:03AM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 02:33:31PM +1000, Dave Chinner wrote:
> > On Thu, Jun 30, 2016 at 07:56:21PM -0400, Brian Foster wrote:
> > Note also that buffers that contain stale data (i.e. XBF_STALE) are
> > not added to the LRU and xfs_buf_stale() removes buffers from the
> > LRU, so shortcutting the above reclaim lifecycle in the case where a
> > user invalidates the contents of the buffer. The end result is the
> > same - there are no buffers with stale contents on the LRU.
> > 
> 
> Yes, but this seems arbitrary. The buffer may or may not be on the LRU
> when it is marked stale.

Yes, because (as I've already explained) we can mark buffers stale
without needing to fill them with data. But whether or not the
buffer is in the LRU when we mark it stale is irrelevant, because we
only have a requirement that stale buffers are removed from the LRU
so that they are torn down immediately on release.

This is actually very important because this is how we prevent
duplicate buffers from entering the cache when we free metadata.
Once we free a metadata buffer and finish committing the transaction,
the buffer must not be kept in the cache because the extent it
covers could be reallocated at any time, and in a different shape.
It could even be user data. In all cases, the buffer needs to be
removed from the cache immediately, and that's why xfs_buf_stale()
behaves like this.

> > readahead is a special case - there is no accessor to say "cache
> > this buffer for N references", but we have to keep it around for
> > some time so that readahead is effective. We don't want to add it
> > before we've done IO on it, and realistically we should only add it
> > to the LRU if there was no IO error. We've had to work around bugs
> > introduced by caching failed readahead buffers on the LRU in the
> > past....
> > 
> 
> Ok, that makes sense. At least, I get the sense that you have a view of
> the LRU that is slightly different from what the code actually does
> (e.g., buffers w/ I/O errors on the LRU when they shouldn't be), which
> is why I'm trying to reconcile some of the reasoning here.
> 
> In other words, if the LRU was an unused only buffer list where used or
> stale buffers were pulled off and later reinserted/reprioritized (in
> terms of reclaim), then this would all make complete sense to me. In
> actuality, we have an LRU list that has buffers with I/O errors, buffers
> in use, buffers that might or might not be stale, etc.

Nothing is as simple as it seems - clean design never remains that
way once reality intrudes. However, the general rule is that buffers are not added to the LRU until the initial
user has finished and release them. THis is done for many reasons.

Firstly, We do not add them a lookup time because we do not know if
the caller wants them to be added to the LRU.

Secondly, we also don't add at lookup time because buffer lookup is
the single hottest path in the XFS codebase. i.e.  xfs_buf_find()
and the struct xfs_buf are one of the few places where we've
actually optimised the code to minimise cache footprint and CPU
overhead.

Currently, all of th einformation that a buffer lookup
accesses is in the first cacheline of the struct xfs_buf, ensuring
that we only take one cache miss penalty per buffer we walk through
the rb-tree. If we now add a check to see if the buffer is on the
LRU into the lookup path, we are - at minimum - adding an extra
cacheline miss to the hot path. If we have to add the buffer to the
LRU, then we're adding another couple of cacheline misses and
another lock, all of which will have an impact on performance.

THird, we add/remove buffers to/from the LRU lazily in paths that
are not performance critical as a result. This is exactly the same
algorithms the VFS uses for the inode and dentry cache LRUs, and we
also use this for the XFS dquot LRU, too. The b_lru_ref count
replaces the single referenced flag that the inode/dentry caches use
so we can prioritise reclaim to match typical btree and directory
access patterns (i.e. reclaim data before leaf before node before
root) but otherwise the buffer LRU is no different to the VFS cache
algorithms....

Finally, and perhaps the most important, is behaviour under low memory
conditions. If we add the buffer to the LRU when we first look it up
in the cache, the shrinker can now see it and decrement the
b_lru_ref count while we are doing the first operation on that
buffer. Say this transaction generates sufficient memory pressure
that it causes the shrinker to completely decrement b_lru_ref - it
will then remove the buffer from the LRU *before we've even finished
with it*. This will cause the buffer to be freed immediately after
the caller releases it (because b_lru_ref is zero), and so the next
access requires the buffer to be read from disk again.  What we end
up with here is a situation where the buffer cache cannot maintian a
working set of buffers between operations because the memory
pressure within an operation causes the buffers to be removed from
the LRU before the operation completes.

So, yeah, there's lots of "assumed knowledge" in why the buffer
cache LRU is structured the way it is. That, unfortunately, is
pretty much typical of all the linux slab caches that are controlled
by shrinkers....

FWIW, readahead is often not complete before the followup read
comes in, which means there's a caller blocked on the buffer with an
active reference by the time IO completion runs xfs_buf_relse() to
unlock the buffer. In these cases, the readahead buffer does not go
onto the LRU until after the blocked caller has finished with the
buffer and called xfs_buf_rele() itself (i.e. after the transaction
subsystem is done with it).

> > > The other thought I had was to change where buffers are added
> > > to the LRU altogether, but didn't want to jump right to that.
> > > Is there any issue with populating the LRU with initially held
> > > buffers as such, or any particular reason LRU addition was
> > > deferred to I/O completion in the first place?
> > 
> > Yes, because then we have to deal with buffers that fail memory
> > allocation, read IO or are allocated just to invalidate a range
> > of disk blocks during a transaction (e.g. removing a remote
> > attribute).  There are probably other cases where we don't want
> > to put buffers we allocate onto the LRU, but I can't think of
> > any more right now.
> 
> I was thinking more along the lines of insertion on I/O submission
> rather than allocation. I.e., similar the proposed #1 below, but
> to actually insert to the list such that we never lose track of
> the buffer.

Why do we need a list?

It's a similar situation to inode_dio_wait(): we don't track direct
IO because of the per-IO overhead of doing so, especially as we only
need to implement a "drain barrier". i.e. it only needs a counter to
implement the barrier needed for truncate/hole punch operations.
That's exactly the same case here xfs_wait_buftarg() is a drain
barrier; we could simply count every single IO in flight and wait
for that to return to zero before walking the LRU, and it would
solve the problem for everything, regardless of the type of buffer
or whether it is in or going to be put in the LRU....

Also, remember that xfs_wait_buftarg() is run with the assumption
that all high level processing has been stopped, and we are only
waiting for remaining users to drop their locks and go away (e.g.
complete async IO from AIL pushing). It's not a general use case we
are working on here...

> I guess my initial approach was to make the fix as isolated as possible
> because this code is hairy and it seems we have a lot of little hacks
> around to deal with other corner cases such as this. Surely this adds
> yet another one, but afaics, so does a counter mechanism. At the end of
> the day, the only real issue we know it fixes is this readahead corner
> case.
> 
> I think that if we're going to go as far as adding a list/counter
> mechanism, we should try to fix this in a way that is more broadly
> useful. By that I mean fix up the existing mechanism or add something
> that allows us to start unwinding some of the preexisting hacks such as
> cycling buffer locks to ensure I/O completion (perhaps alleviating the
> need to hold locks across I/O in the first place), etc.
> 
> Perhaps the counter approach opens the door for that, I have to think
> about it some more..

I don't think we can get rid of lock cycling - the counter can only
exist within the IO context which is within the lock context. Hence
to guarantee that all users of a buffer have finished, lock cycling
is still necessary. i.e. lock cycling is a life cycle barrier, not
an IO barrier...

> > We can do metadata buffer writes outside the superblock write
> > protection context.  A good example of that is log recovery on
> > read-only filesystems....
> > 
> 
> This waits for I/O to complete so is unaffected by this problem.

Sure, but that's not what you were asking about. My point was that
there is no specific high level superblock protection against
async metadata IO being issued from non-VFS interfacing, internal
filesystem subsystems.

> Now
> that I take a look at that, I'm not sure we'd be able to replace that
> with an I/O counter wait since technically that would wait for all
> in-flight I/O to complete. This context (xfs_buf_delwri_submit()) is
> only looking to wait for the locally submitted I/O to complete. Ugh.

I'm not advocating doing that at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-04  0:08         ` Dave Chinner
@ 2016-07-05 13:42           ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2016-07-05 13:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jul 04, 2016 at 10:08:38AM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2016 at 08:53:03AM -0400, Brian Foster wrote:
> > On Fri, Jul 01, 2016 at 02:33:31PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 30, 2016 at 07:56:21PM -0400, Brian Foster wrote:
> > > Note also that buffers that contain stale data (i.e. XBF_STALE) are
> > > not added to the LRU and xfs_buf_stale() removes buffers from the
> > > LRU, so shortcutting the above reclaim lifecycle in the case where a
> > > user invalidates the contents of the buffer. The end result is the
> > > same - there are no buffers with stale contents on the LRU.
> > > 
> > 
> > Yes, but this seems arbitrary. The buffer may or may not be on the LRU
> > when it is marked stale.
> 
> Yes, because (as I've already explained) we can mark buffers stale
> without needing to fill them with data. But whether or not the
> buffer is in the LRU when we mark it stale is irrelevant, because we
> only have a requirement that stale buffers are removed from the LRU
> so that they are torn down immediately on release.
> 
> This is actually very important because this is how we prevent
> duplicate buffers from entering the cache when we free metadata.
> Once we free a metadata buffer and finish committing the transaction,
> the buffer must not be kept in the cache because the extent it
> covers could be reallocated at any time, and in a different shape.
> It could even be user data. In all cases, the buffer needs to be
> removed from the cache immediately, and that's why xfs_buf_stale()
> behaves like this.
> 

Yes, I understand. I think we're talking past eachother on this
particular train of thought. Your original point was that the LRU is for
"unused" buffers, and/or buffers aren't on the LRU until "their content
is used." The example given was that stale buffers are immediately
removed from the LRU and thus "buffers with stale data are never on the
LRU."

My response to that is that's an implementation detail of XBF_STALE
buffers and not necessarily true in other cases. We do actually end up
with buffers with "stale" (as opposed to XBF_STALE, perhaps "invalid" is
a better term) data in the event of I/O error. If a callsite allocates a
buffer, never issues I/O on it (i.e., it is !XBF_DONE) and releases it,
it ends up on the LRU. (I don't know that we do that anywhere given this
is a filesystem after all ;), but that's the natural lifecycle in that
particular case afaict).

So nothing about the LRU suggests to me that the contents must have
valid data. AFAICT, it's primarily memory management mechanism to allow
priority-ordered releasing of "unused" buffers, where an unused buffer
is defined as a buffer without any active references, regardless of what
the buffer contains.

I'm sympathetic to the argument that in the ideal case the LRU might not
contain buffers actively in use, but IMO that's beyond the scope of the
original patch. IOW, if we want to change that, we should approach it
separately.

> > > readahead is a special case - there is no accessor to say "cache
> > > this buffer for N references", but we have to keep it around for
> > > some time so that readahead is effective. We don't want to add it
> > > before we've done IO on it, and realistically we should only add it
> > > to the LRU if there was no IO error. We've had to work around bugs
> > > introduced by caching failed readahead buffers on the LRU in the
> > > past....
> > > 
> > 
> > Ok, that makes sense. At least, I get the sense that you have a view of
> > the LRU that is slightly different from what the code actually does
> > (e.g., buffers w/ I/O errors on the LRU when they shouldn't be), which
> > is why I'm trying to reconcile some of the reasoning here.
> > 
> > In other words, if the LRU was an unused only buffer list where used or
> > stale buffers were pulled off and later reinserted/reprioritized (in
> > terms of reclaim), then this would all make complete sense to me. In
> > actuality, we have an LRU list that has buffers with I/O errors, buffers
> > in use, buffers that might or might not be stale, etc.
> 
> Nothing is as simple as it seems - clean design never remains that
> way once reality intrudes. However, the general rule is that buffers are not added to the LRU until the initial
> user has finished and release them. THis is done for many reasons.
> 

Agreed on the first point. :) I think that's the only reason for the
original patch to exist. ;) (As noted previously, I agree that it is a
hack).

> Firstly, We do not add them a lookup time because we do not know if
> the caller wants them to be added to the LRU.
> 

We set b_lru_ref to 1 on buffer allocation. But it's true that we don't
know if it's going to be marked stale before it is released, of course.

> Secondly, we also don't add at lookup time because buffer lookup is
> the single hottest path in the XFS codebase. i.e.  xfs_buf_find()
> and the struct xfs_buf are one of the few places where we've
> actually optimised the code to minimise cache footprint and CPU
> overhead.
> 

Ok...

> Currently, all of th einformation that a buffer lookup
> accesses is in the first cacheline of the struct xfs_buf, ensuring
> that we only take one cache miss penalty per buffer we walk through
> the rb-tree. If we now add a check to see if the buffer is on the
> LRU into the lookup path, we are - at minimum - adding an extra
> cacheline miss to the hot path. If we have to add the buffer to the
> LRU, then we're adding another couple of cacheline misses and
> another lock, all of which will have an impact on performance.
> 

Performance is a fair point against adding all buffers to the LRU
immediately. I take that to mean we'd have to test such a change with
that in mind, but it sounds like this path is already optimized without
any more room in the first cacheline, thus probably not likely to be an
option.

I'm not sure this is as as critical applied solely to read ahead
buffers, however...

> THird, we add/remove buffers to/from the LRU lazily in paths that
> are not performance critical as a result. This is exactly the same
> algorithms the VFS uses for the inode and dentry cache LRUs, and we
> also use this for the XFS dquot LRU, too. The b_lru_ref count
> replaces the single referenced flag that the inode/dentry caches use
> so we can prioritise reclaim to match typical btree and directory
> access patterns (i.e. reclaim data before leaf before node before
> root) but otherwise the buffer LRU is no different to the VFS cache
> algorithms....
> 
> Finally, and perhaps the most important, is behaviour under low memory
> conditions. If we add the buffer to the LRU when we first look it up
> in the cache, the shrinker can now see it and decrement the
> b_lru_ref count while we are doing the first operation on that
> buffer. Say this transaction generates sufficient memory pressure
> that it causes the shrinker to completely decrement b_lru_ref - it
> will then remove the buffer from the LRU *before we've even finished
> with it*. This will cause the buffer to be freed immediately after
> the caller releases it (because b_lru_ref is zero), and so the next
> access requires the buffer to be read from disk again.  What we end
> up with here is a situation where the buffer cache cannot maintian a
> working set of buffers between operations because the memory
> pressure within an operation causes the buffers to be removed from
> the LRU before the operation completes.
> 

This makes sense, we probably don't want to subject active buffers to
deprioritization via the shrinker before their initial use. It sounds
like this could technically be a flaw for buffers once they're actually
reused, as well. I could see an argument for reused buffers remaining on
the LRU such that their b_lru_ref remains consistent (in terms of
priority) with associated buffers (e.g., btree node vs. leaf priority
ordering) in the event of shrinker activity, but it seems like there are
other ways for that to get out of sync just the same. It could also be
the case that such buffers tend to be "held" together.

Again, I'm not sure how important this is to readahead buffers since
they are released on I/O completion. If we're waiting on the readahead
before it completes then the buffer is not fully released, but that's a
matter of timing more than anything. E.g., faster storage could disrupt
that just as much as this particular change. Otherwise, this just means
the readahead buffer could be freed before we try to use it, which is
already the case today.

> So, yeah, there's lots of "assumed knowledge" in why the buffer
> cache LRU is structured the way it is. That, unfortunately, is
> pretty much typical of all the linux slab caches that are controlled
> by shrinkers....
> 
> FWIW, readahead is often not complete before the followup read
> comes in, which means there's a caller blocked on the buffer with an
> active reference by the time IO completion runs xfs_buf_relse() to
> unlock the buffer. In these cases, the readahead buffer does not go
> onto the LRU until after the blocked caller has finished with the
> buffer and called xfs_buf_rele() itself (i.e. after the transaction
> subsystem is done with it).
> 
> > > > The other thought I had was to change where buffers are added
> > > > to the LRU altogether, but didn't want to jump right to that.
> > > > Is there any issue with populating the LRU with initially held
> > > > buffers as such, or any particular reason LRU addition was
> > > > deferred to I/O completion in the first place?
> > > 
> > > Yes, because then we have to deal with buffers that fail memory
> > > allocation, read IO or are allocated just to invalidate a range
> > > of disk blocks during a transaction (e.g. removing a remote
> > > attribute).  There are probably other cases where we don't want
> > > to put buffers we allocate onto the LRU, but I can't think of
> > > any more right now.
> > 
> > I was thinking more along the lines of insertion on I/O submission
> > rather than allocation. I.e., similar the proposed #1 below, but
> > to actually insert to the list such that we never lose track of
> > the buffer.
> 
> Why do we need a list?
> 

I was referring to the LRU.

> It's a similar situation to inode_dio_wait(): we don't track direct
> IO because of the per-IO overhead of doing so, especially as we only
> need to implement a "drain barrier". i.e. it only needs a counter to
> implement the barrier needed for truncate/hole punch operations.
> That's exactly the same case here xfs_wait_buftarg() is a drain
> barrier; we could simply count every single IO in flight and wait
> for that to return to zero before walking the LRU, and it would
> solve the problem for everything, regardless of the type of buffer
> or whether it is in or going to be put in the LRU....
> 
> Also, remember that xfs_wait_buftarg() is run with the assumption
> that all high level processing has been stopped, and we are only
> waiting for remaining users to drop their locks and go away (e.g.
> complete async IO from AIL pushing). It's not a general use case we
> are working on here...
> 
> > I guess my initial approach was to make the fix as isolated as possible
> > because this code is hairy and it seems we have a lot of little hacks
> > around to deal with other corner cases such as this. Surely this adds
> > yet another one, but afaics, so does a counter mechanism. At the end of
> > the day, the only real issue we know it fixes is this readahead corner
> > case.
> > 
> > I think that if we're going to go as far as adding a list/counter
> > mechanism, we should try to fix this in a way that is more broadly
> > useful. By that I mean fix up the existing mechanism or add something
> > that allows us to start unwinding some of the preexisting hacks such as
> > cycling buffer locks to ensure I/O completion (perhaps alleviating the
> > need to hold locks across I/O in the first place), etc.
> > 
> > Perhaps the counter approach opens the door for that, I have to think
> > about it some more..
> 
> I don't think we can get rid of lock cycling - the counter can only
> exist within the IO context which is within the lock context. Hence
> to guarantee that all users of a buffer have finished, lock cycling
> is still necessary. i.e. lock cycling is a life cycle barrier, not
> an IO barrier...
> 

Ok, but we apparently use it as such in places like
xfs_buf_delwri_submit().

> > > We can do metadata buffer writes outside the superblock write
> > > protection context.  A good example of that is log recovery on
> > > read-only filesystems....
> > > 
> > 
> > This waits for I/O to complete so is unaffected by this problem.
> 
> Sure, but that's not what you were asking about. My point was that
> there is no specific high level superblock protection against
> async metadata IO being issued from non-VFS interfacing, internal
> filesystem subsystems.
> 

Ok, I wasn't being clear... the sb protection was just an example. The
fundamental question was whether we have to handle the write case due to
not otherwise having some kind of synchronization or protection against
tearing down infrastructure before I/Os complete. IOW, I don't care as
much about the sb write protection specifically as whether we can
identify any places where this is an actual, real problem that needs to
be solved.

(Again, I'm not against stepping back and taking a larger look at the
current buffer mechanisms to try and resolve such problems in theory,
but I would prefer to do that separately from this fix).

> > Now
> > that I take a look at that, I'm not sure we'd be able to replace that
> > with an I/O counter wait since technically that would wait for all
> > in-flight I/O to complete. This context (xfs_buf_delwri_submit()) is
> > only looking to wait for the locally submitted I/O to complete. Ugh.
> 
> I'm not advocating doing that at all.
> 

I know, I was. I was hoping that if we go down the path of a more
generic mechanism we might come up with something more useful beyond
this particular readahead problem.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-06-30 22:44 ` Dave Chinner
  2016-06-30 23:56   ` Brian Foster
@ 2016-07-01 22:30   ` Brian Foster
  2016-07-05 16:45     ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-01 22:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > count is released, which typically occurs after I/O completion. There is
> > no other mechanism at current that tracks the existence or I/O state of
> > a new buffer. Further, readahead I/O tends to be submitted
> > asynchronously by nature, which means the I/O can remain in flight and
> > actually complete long after the calling context is gone. This means
> > that file descriptors or any other holds on the filesystem can be
> > released, allowing the filesystem to be unmounted while I/O is still in
> > flight. When I/O completion occurs, core data structures may have been
> > freed, causing completion to run into invalid memory accesses and likely
> > to panic.
> > 
> > This problem is reproduced on XFS via directory readahead. A filesystem
> > is mounted, a directory is opened/closed and the filesystem immediately
> > unmounted. The open/close cycle triggers a directory readahead that if
> > delayed long enough, runs buffer I/O completion after the unmount has
> > completed.
> > 
> > To work around this problem, add readahead buffers to the LRU earlier
> > than other buffers (when the buffer is allocated, specifically). The
> > buffer hold count will ultimately remain until I/O completion, which
> > means any shrinker activity will skip the buffer until then. This makes
> > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > unmount or quiesce waits for I/O completion appropriately.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This addresses the problem reproduced by the recently posted xfstests
> > test:
> > 
> >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > 
> > This could probably be made more involved, i.e., to create another list
> > of buffers in flight or some such. This seems more simple/sane to me,
> > however, and survives my testing so far...
> > 
> > Brian
> > 
> >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 4665ff6..3f03df9 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> >  		return NULL;
> >  	}
> >  
> > +	/*
> > +	 * If the buffer found doesn't match the one allocated above, somebody
> > +	 * else beat us to insertion and we can toss the new one.
> > +	 *
> > +	 * If we did add the buffer and it happens to be readahead, add to the
> > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > +	 * nothing else prevents an unmount before I/O completion.
> > +	 */
> >  	if (bp != new_bp)
> >  		xfs_buf_free(new_bp);
> > +	else if (flags & XBF_READ_AHEAD &&
> > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > +		atomic_inc(&bp->b_hold);
> 
> This doesn't sit right with me. The LRU is for "unused" objects, and
> readahead objects are not unused until IO completes and nobody is
> waiting on them.
> 
> As it is, this points out another problem with readahead buffers -
> they aren't actually being cached properly because b_lru_ref == 0,
> which means they are immediately reclaimed on IO completion rather
> than being added to the LRU....
> 
> I also think that it's not sufficient to cover the generic case of
> async IO that has no waiter. i.e. we could do get_buf, submit async
> write, drop submitter reference, and now we have the same problem
> but on a write.  i.e. this problem is and async IO issue, not a
> readahead issue.
> 
> I think that it might be better to fix it by doing this:
> 
> 	1. ensure async IO submission always has b_lru_ref set, and
> 	if it isn't, set it to 1. This ensures the buffer will be
> 	added to the LRU on completion if it isn't already there.
> 
> 	2. keep a count of async buffer IO in progress. A per-cpu
> 	counter in the buftarg will be fine for this. Increment in
> 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> 	xfs_buf_iodone() once we've determined if the buffer needs
> 	adding to the LRU or not.
> 
> 	3. make xfs_wait_buftarg() wait until the async IO count
> 	goes to zero before it gives up trying to release buffers on
> 	the LRU.
> 

After playing with this a bit this afternoon, I don't think it is so
straightforward to maintain consistency between xfs_buf_submit() and
xfs_buf_rele(). Some buffers are actually never released (superblock,
log buffers). Other buffers can actually be submitted for I/O multiple
times before they are ultimately released (e.g., log recovery buffer
read -> delwri submission).

I have a semi-functional patch that holds more of a pure I/O count,
which means the count is decremented immediately in xfs_buf_ioend()
rather than deferred to release. One downside is that while this
technically still resolves the original problem, it's racy in that the
count is dropped before the buffer is added to the LRU. This still works
for the original problem because we also drain the ioend workqueue in
xfs_wait_buftarg(), but it's not correct because we allow for
non-deferred completion in the event of I/O errors (i.e.,
xfs_buf_ioend() called directly from xfs_buf_submit()).

Brian

> That will ensure readahead buffers are cached, and we capture both
> async read and async write buffers in xfs_wait_buftarg().
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-01 22:30   ` Brian Foster
@ 2016-07-05 16:45     ` Brian Foster
  2016-07-11  5:20       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-05 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > count is released, which typically occurs after I/O completion. There is
> > > no other mechanism at current that tracks the existence or I/O state of
> > > a new buffer. Further, readahead I/O tends to be submitted
> > > asynchronously by nature, which means the I/O can remain in flight and
> > > actually complete long after the calling context is gone. This means
> > > that file descriptors or any other holds on the filesystem can be
> > > released, allowing the filesystem to be unmounted while I/O is still in
> > > flight. When I/O completion occurs, core data structures may have been
> > > freed, causing completion to run into invalid memory accesses and likely
> > > to panic.
> > > 
> > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > is mounted, a directory is opened/closed and the filesystem immediately
> > > unmounted. The open/close cycle triggers a directory readahead that if
> > > delayed long enough, runs buffer I/O completion after the unmount has
> > > completed.
> > > 
> > > To work around this problem, add readahead buffers to the LRU earlier
> > > than other buffers (when the buffer is allocated, specifically). The
> > > buffer hold count will ultimately remain until I/O completion, which
> > > means any shrinker activity will skip the buffer until then. This makes
> > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > unmount or quiesce waits for I/O completion appropriately.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > This addresses the problem reproduced by the recently posted xfstests
> > > test:
> > > 
> > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > 
> > > This could probably be made more involved, i.e., to create another list
> > > of buffers in flight or some such. This seems more simple/sane to me,
> > > however, and survives my testing so far...
> > > 
> > > Brian
> > > 
> > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..3f03df9 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > >  		return NULL;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > +	 * else beat us to insertion and we can toss the new one.
> > > +	 *
> > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > +	 * nothing else prevents an unmount before I/O completion.
> > > +	 */
> > >  	if (bp != new_bp)
> > >  		xfs_buf_free(new_bp);
> > > +	else if (flags & XBF_READ_AHEAD &&
> > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > +		atomic_inc(&bp->b_hold);
> > 
> > This doesn't sit right with me. The LRU is for "unused" objects, and
> > readahead objects are not unused until IO completes and nobody is
> > waiting on them.
> > 
> > As it is, this points out another problem with readahead buffers -
> > they aren't actually being cached properly because b_lru_ref == 0,
> > which means they are immediately reclaimed on IO completion rather
> > than being added to the LRU....
> > 
> > I also think that it's not sufficient to cover the generic case of
> > async IO that has no waiter. i.e. we could do get_buf, submit async
> > write, drop submitter reference, and now we have the same problem
> > but on a write.  i.e. this problem is and async IO issue, not a
> > readahead issue.
> > 
> > I think that it might be better to fix it by doing this:
> > 
> > 	1. ensure async IO submission always has b_lru_ref set, and
> > 	if it isn't, set it to 1. This ensures the buffer will be
> > 	added to the LRU on completion if it isn't already there.
> > 
> > 	2. keep a count of async buffer IO in progress. A per-cpu
> > 	counter in the buftarg will be fine for this. Increment in
> > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > 	xfs_buf_iodone() once we've determined if the buffer needs
> > 	adding to the LRU or not.
> > 
> > 	3. make xfs_wait_buftarg() wait until the async IO count
> > 	goes to zero before it gives up trying to release buffers on
> > 	the LRU.
> > 
> 
> After playing with this a bit this afternoon, I don't think it is so
> straightforward to maintain consistency between xfs_buf_submit() and
> xfs_buf_rele(). Some buffers are actually never released (superblock,
> log buffers). Other buffers can actually be submitted for I/O multiple
> times before they are ultimately released (e.g., log recovery buffer
> read -> delwri submission).
> 

I think I can get around these problems by skipping all uncached I/O and
maintaining a per-buffer I/O count that is sunk into the global buftarg
count once the buffer is released. E.g., something like the following
patch. Not fully tested, but works on some quick tests...

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..8a04b66 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -177,6 +177,7 @@ _xfs_buf_alloc(
 	XB_SET_OWNER(bp);
 	bp->b_target = target;
 	bp->b_flags = flags;
+	atomic_set(&bp->b_io_count, 0);
 
 	/*
 	 * Set length and io_length to the same value initially.
@@ -865,6 +866,21 @@ xfs_buf_hold(
 	atomic_inc(&bp->b_hold);
 }
 
+static inline void
+xfs_buf_rele_iocount(
+	struct xfs_buf	*bp)
+{
+	int		val;
+
+	val = atomic_read(&bp->b_io_count);
+	if (!val)
+		return;
+
+	atomic_sub(val, &bp->b_io_count);
+	percpu_counter_add(&bp->b_target->bt_io_count, -val);
+	wake_up(&bp->b_target->bt_io_wait);
+}
+
 /*
  *	Releases a hold on the specified buffer.  If the
  *	the hold count is 1, calls xfs_buf_free.
@@ -880,8 +896,10 @@ xfs_buf_rele(
 	if (!pag) {
 		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
-		if (atomic_dec_and_test(&bp->b_hold))
+		if (atomic_dec_and_test(&bp->b_hold)) {
+			xfs_buf_rele_iocount(bp);
 			xfs_buf_free(bp);
+		}
 		return;
 	}
 
@@ -890,6 +908,9 @@ xfs_buf_rele(
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
 		spin_lock(&bp->b_lock);
+
+		xfs_buf_rele_iocount(bp);
+
 		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 			/*
 			 * If the buffer is added to the LRU take a new
@@ -1277,6 +1298,18 @@ _xfs_buf_ioapply(
 	rw |= REQ_META;
 
 	/*
+	 * XXX: uncached check indirectly filters out the sb buffer and log
+	 * buffers (possibly others?), that are held and never released to the
+	 * LRU
+	 */
+	if (bp->b_flags & XBF_ASYNC &&
+	    bp->b_bn != XFS_BUF_DADDR_NULL &&	/* uncached */
+	    atomic_read(&bp->b_lru_ref) && list_empty(&bp->b_lru)) {
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+		atomic_inc(&bp->b_io_count);
+	}
+
+	/*
 	 * Walk all the vectors issuing IO on them. Set up the initial offset
 	 * into the buffer and the desired IO size before we start -
 	 * _xfs_buf_ioapply_vec() will modify them appropriately for each
@@ -1533,6 +1566,8 @@ xfs_wait_buftarg(
 	 * ensure here that all reference counts have been dropped before we
 	 * start walking the LRU list.
 	 */
+	wait_event(btp->bt_io_wait,
+		   (percpu_counter_sum(&btp->bt_io_count) == 0));
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1664,8 @@ xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1730,10 @@ xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+	init_waitqueue_head(&btp->bt_io_wait);
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..2518d09 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -115,6 +115,10 @@ typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	/* XXX: would atomic_t suffice? */
+	struct percpu_counter	bt_io_count;
+	wait_queue_head_t	bt_io_wait;
 } xfs_buftarg_t;
 
 struct xfs_buf;
@@ -183,6 +187,7 @@ typedef struct xfs_buf {
 	unsigned int		b_page_count;	/* size of page array */
 	unsigned int		b_offset;	/* page offset in first page */
 	int			b_error;	/* error code on I/O */
+	atomic_t		b_io_count;
 
 	/*
 	 * async write failure retry count. Initialised to zero on the first

> I have a semi-functional patch that holds more of a pure I/O count,
> which means the count is decremented immediately in xfs_buf_ioend()
> rather than deferred to release. One downside is that while this
> technically still resolves the original problem, it's racy in that the
> count is dropped before the buffer is added to the LRU. This still works
> for the original problem because we also drain the ioend workqueue in
> xfs_wait_buftarg(), but it's not correct because we allow for
> non-deferred completion in the event of I/O errors (i.e.,
> xfs_buf_ioend() called directly from xfs_buf_submit()).
> 
> Brian
> 
> > That will ensure readahead buffers are cached, and we capture both
> > async read and async write buffers in xfs_wait_buftarg().
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-05 16:45     ` Brian Foster
@ 2016-07-11  5:20       ` Dave Chinner
  2016-07-11 13:52         ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-07-11  5:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Jul 05, 2016 at 12:45:52PM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> > On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > > count is released, which typically occurs after I/O completion. There is
> > > > no other mechanism at current that tracks the existence or I/O state of
> > > > a new buffer. Further, readahead I/O tends to be submitted
> > > > asynchronously by nature, which means the I/O can remain in flight and
> > > > actually complete long after the calling context is gone. This means
> > > > that file descriptors or any other holds on the filesystem can be
> > > > released, allowing the filesystem to be unmounted while I/O is still in
> > > > flight. When I/O completion occurs, core data structures may have been
> > > > freed, causing completion to run into invalid memory accesses and likely
> > > > to panic.
> > > > 
> > > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > > is mounted, a directory is opened/closed and the filesystem immediately
> > > > unmounted. The open/close cycle triggers a directory readahead that if
> > > > delayed long enough, runs buffer I/O completion after the unmount has
> > > > completed.
> > > > 
> > > > To work around this problem, add readahead buffers to the LRU earlier
> > > > than other buffers (when the buffer is allocated, specifically). The
> > > > buffer hold count will ultimately remain until I/O completion, which
> > > > means any shrinker activity will skip the buffer until then. This makes
> > > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > > unmount or quiesce waits for I/O completion appropriately.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > This addresses the problem reproduced by the recently posted xfstests
> > > > test:
> > > > 
> > > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > > 
> > > > This could probably be made more involved, i.e., to create another list
> > > > of buffers in flight or some such. This seems more simple/sane to me,
> > > > however, and survives my testing so far...
> > > > 
> > > > Brian
> > > > 
> > > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 4665ff6..3f03df9 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > > >  		return NULL;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > > +	 * else beat us to insertion and we can toss the new one.
> > > > +	 *
> > > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > > +	 * nothing else prevents an unmount before I/O completion.
> > > > +	 */
> > > >  	if (bp != new_bp)
> > > >  		xfs_buf_free(new_bp);
> > > > +	else if (flags & XBF_READ_AHEAD &&
> > > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > > +		atomic_inc(&bp->b_hold);
> > > 
> > > This doesn't sit right with me. The LRU is for "unused" objects, and
> > > readahead objects are not unused until IO completes and nobody is
> > > waiting on them.
> > > 
> > > As it is, this points out another problem with readahead buffers -
> > > they aren't actually being cached properly because b_lru_ref == 0,
> > > which means they are immediately reclaimed on IO completion rather
> > > than being added to the LRU....
> > > 
> > > I also think that it's not sufficient to cover the generic case of
> > > async IO that has no waiter. i.e. we could do get_buf, submit async
> > > write, drop submitter reference, and now we have the same problem
> > > but on a write.  i.e. this problem is and async IO issue, not a
> > > readahead issue.
> > > 
> > > I think that it might be better to fix it by doing this:
> > > 
> > > 	1. ensure async IO submission always has b_lru_ref set, and
> > > 	if it isn't, set it to 1. This ensures the buffer will be
> > > 	added to the LRU on completion if it isn't already there.
> > > 
> > > 	2. keep a count of async buffer IO in progress. A per-cpu
> > > 	counter in the buftarg will be fine for this. Increment in
> > > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > > 	xfs_buf_iodone() once we've determined if the buffer needs
> > > 	adding to the LRU or not.
> > > 
> > > 	3. make xfs_wait_buftarg() wait until the async IO count
> > > 	goes to zero before it gives up trying to release buffers on
> > > 	the LRU.
> > > 
> > 
> > After playing with this a bit this afternoon, I don't think it is so
> > straightforward to maintain consistency between xfs_buf_submit() and
> > xfs_buf_rele(). Some buffers are actually never released (superblock,
> > log buffers). Other buffers can actually be submitted for I/O multiple
> > times before they are ultimately released (e.g., log recovery buffer
> > read -> delwri submission).
> > 
> 
> I think I can get around these problems by skipping all uncached I/O and
> maintaining a per-buffer I/O count that is sunk into the global buftarg
> count once the buffer is released. E.g., something like the following
> patch. Not fully tested, but works on some quick tests...
....
> +static inline void
> +xfs_buf_rele_iocount(
> +	struct xfs_buf	*bp)
> +{
> +	int		val;
> +
> +	val = atomic_read(&bp->b_io_count);
> +	if (!val)
> +		return;
> +
> +	atomic_sub(val, &bp->b_io_count);
> +	percpu_counter_add(&bp->b_target->bt_io_count, -val);
> +	wake_up(&bp->b_target->bt_io_wait);
> +}
> +
>  /*
>   *	Releases a hold on the specified buffer.  If the
>   *	the hold count is 1, calls xfs_buf_free.
> @@ -880,8 +896,10 @@ xfs_buf_rele(
>  	if (!pag) {
>  		ASSERT(list_empty(&bp->b_lru));
>  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> -		if (atomic_dec_and_test(&bp->b_hold))
> +		if (atomic_dec_and_test(&bp->b_hold)) {
> +			xfs_buf_rele_iocount(bp);
>  			xfs_buf_free(bp);
> +		}
>  		return;
>  	}
>  
> @@ -890,6 +908,9 @@ xfs_buf_rele(
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
>  	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
>  		spin_lock(&bp->b_lock);
> +
> +		xfs_buf_rele_iocount(bp);
> +
>  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
>  			/*
>  			 * If the buffer is added to the LRU take a new
> @@ -1277,6 +1298,18 @@ _xfs_buf_ioapply(
>  	rw |= REQ_META;
>  
>  	/*
> +	 * XXX: uncached check indirectly filters out the sb buffer and log
> +	 * buffers (possibly others?), that are held and never released to the
> +	 * LRU
> +	 */
> +	if (bp->b_flags & XBF_ASYNC &&
> +	    bp->b_bn != XFS_BUF_DADDR_NULL &&	/* uncached */
> +	    atomic_read(&bp->b_lru_ref) && list_empty(&bp->b_lru)) {
> +		percpu_counter_inc(&bp->b_target->bt_io_count);
> +		atomic_inc(&bp->b_io_count);
> +	}

This seems rather specific and potentially fragile. Why not just
count /all/ IO using the per-cpu counter? If you're worried about
overhead, just use a large custom batch value so they are almost
never folded back into the global count and so the vast majority of
counter updates are single unlocked CPU instructions.

Counting everything is far better than needing to run through 4
branch tests on every buffer we submit IO on. i.e. counting
everything with an appropriately configured percpu counter is likely
to have a lower fast path overhead testing every buffer to capture
only the specific case in question...

> @@ -1533,6 +1566,8 @@ xfs_wait_buftarg(
>  	 * ensure here that all reference counts have been dropped before we
>  	 * start walking the LRU list.
>  	 */
> +	wait_event(btp->bt_io_wait,
> +		   (percpu_counter_sum(&btp->bt_io_count) == 0));
>  	drain_workqueue(btp->bt_mount->m_buf_workqueue);

I also don't think that per-buffer IO accounting is justified for
this. This:

	while(percpu_counter_sum(&btp->bt_io_count))
		delay(100);

Will do the same thing without adding an atomic counter inc/dec to
every buffer we want to capture here.

The waiting should probably also happen after we drain all the
workqueues, because the bt_io_count won't drop to zero until we
process all the pending IO completions on the work queues and
release the buffers...

> +	/* XXX: would atomic_t suffice? */
> +	struct percpu_counter	bt_io_count;

Too much cacheline contention in large machines to use an atomic_t
as a global counter.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-11  5:20       ` Dave Chinner
@ 2016-07-11 13:52         ` Brian Foster
  2016-07-11 15:29           ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-11 13:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jul 11, 2016 at 03:20:57PM +1000, Dave Chinner wrote:
> On Tue, Jul 05, 2016 at 12:45:52PM -0400, Brian Foster wrote:
> > On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> > > On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > > > count is released, which typically occurs after I/O completion. There is
> > > > > no other mechanism at current that tracks the existence or I/O state of
> > > > > a new buffer. Further, readahead I/O tends to be submitted
> > > > > asynchronously by nature, which means the I/O can remain in flight and
> > > > > actually complete long after the calling context is gone. This means
> > > > > that file descriptors or any other holds on the filesystem can be
> > > > > released, allowing the filesystem to be unmounted while I/O is still in
> > > > > flight. When I/O completion occurs, core data structures may have been
> > > > > freed, causing completion to run into invalid memory accesses and likely
> > > > > to panic.
> > > > > 
> > > > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > > > is mounted, a directory is opened/closed and the filesystem immediately
> > > > > unmounted. The open/close cycle triggers a directory readahead that if
> > > > > delayed long enough, runs buffer I/O completion after the unmount has
> > > > > completed.
> > > > > 
> > > > > To work around this problem, add readahead buffers to the LRU earlier
> > > > > than other buffers (when the buffer is allocated, specifically). The
> > > > > buffer hold count will ultimately remain until I/O completion, which
> > > > > means any shrinker activity will skip the buffer until then. This makes
> > > > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > > > unmount or quiesce waits for I/O completion appropriately.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > > 
> > > > > This addresses the problem reproduced by the recently posted xfstests
> > > > > test:
> > > > > 
> > > > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > > > 
> > > > > This could probably be made more involved, i.e., to create another list
> > > > > of buffers in flight or some such. This seems more simple/sane to me,
> > > > > however, and survives my testing so far...
> > > > > 
> > > > > Brian
> > > > > 
> > > > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > > > >  1 file changed, 12 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index 4665ff6..3f03df9 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > > > >  		return NULL;
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > > > +	 * else beat us to insertion and we can toss the new one.
> > > > > +	 *
> > > > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > > > +	 * nothing else prevents an unmount before I/O completion.
> > > > > +	 */
> > > > >  	if (bp != new_bp)
> > > > >  		xfs_buf_free(new_bp);
> > > > > +	else if (flags & XBF_READ_AHEAD &&
> > > > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > > > +		atomic_inc(&bp->b_hold);
> > > > 
> > > > This doesn't sit right with me. The LRU is for "unused" objects, and
> > > > readahead objects are not unused until IO completes and nobody is
> > > > waiting on them.
> > > > 
> > > > As it is, this points out another problem with readahead buffers -
> > > > they aren't actually being cached properly because b_lru_ref == 0,
> > > > which means they are immediately reclaimed on IO completion rather
> > > > than being added to the LRU....
> > > > 
> > > > I also think that it's not sufficient to cover the generic case of
> > > > async IO that has no waiter. i.e. we could do get_buf, submit async
> > > > write, drop submitter reference, and now we have the same problem
> > > > but on a write.  i.e. this problem is and async IO issue, not a
> > > > readahead issue.
> > > > 
> > > > I think that it might be better to fix it by doing this:
> > > > 
> > > > 	1. ensure async IO submission always has b_lru_ref set, and
> > > > 	if it isn't, set it to 1. This ensures the buffer will be
> > > > 	added to the LRU on completion if it isn't already there.
> > > > 
> > > > 	2. keep a count of async buffer IO in progress. A per-cpu
> > > > 	counter in the buftarg will be fine for this. Increment in
> > > > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > > > 	xfs_buf_iodone() once we've determined if the buffer needs
> > > > 	adding to the LRU or not.
> > > > 
> > > > 	3. make xfs_wait_buftarg() wait until the async IO count
> > > > 	goes to zero before it gives up trying to release buffers on
> > > > 	the LRU.
> > > > 
> > > 
> > > After playing with this a bit this afternoon, I don't think it is so
> > > straightforward to maintain consistency between xfs_buf_submit() and
> > > xfs_buf_rele(). Some buffers are actually never released (superblock,
> > > log buffers). Other buffers can actually be submitted for I/O multiple
> > > times before they are ultimately released (e.g., log recovery buffer
> > > read -> delwri submission).
> > > 
> > 
> > I think I can get around these problems by skipping all uncached I/O and
> > maintaining a per-buffer I/O count that is sunk into the global buftarg
> > count once the buffer is released. E.g., something like the following
> > patch. Not fully tested, but works on some quick tests...
> ....
...
> > @@ -1277,6 +1298,18 @@ _xfs_buf_ioapply(
> >  	rw |= REQ_META;
> >  
> >  	/*
> > +	 * XXX: uncached check indirectly filters out the sb buffer and log
> > +	 * buffers (possibly others?), that are held and never released to the
> > +	 * LRU
> > +	 */
> > +	if (bp->b_flags & XBF_ASYNC &&
> > +	    bp->b_bn != XFS_BUF_DADDR_NULL &&	/* uncached */
> > +	    atomic_read(&bp->b_lru_ref) && list_empty(&bp->b_lru)) {
> > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > +		atomic_inc(&bp->b_io_count);
> > +	}
> 
> This seems rather specific and potentially fragile. Why not just
> count /all/ IO using the per-cpu counter? If you're worried about
> overhead, just use a large custom batch value so they are almost
> never folded back into the global count and so the vast majority of
> counter updates are single unlocked CPU instructions.
> 

I'm not really worried about overhead (yet). I'm just trying to come up
with a relatively straightforward scheme that solves the problem. The
extra checks here are basically the cost of deferring the I/O count
decrement to buffer release.

> Counting everything is far better than needing to run through 4
> branch tests on every buffer we submit IO on. i.e. counting
> everything with an appropriately configured percpu counter is likely
> to have a lower fast path overhead testing every buffer to capture
> only the specific case in question...
> 

See my immediately previous reply to this thread. I hadn't posted code,
but the initial variant of this patch uses more of a pure I/O count. The
primary drawback with that is that it is racy in that the I/O count is
decremented before the buffer is added to the LRU. The workqueue drain
helps prevent the original problem, but at the same time I'm not sure it
makes this approach any more robust than the original one-liner patch.

I also defined a new buffer flag in that version to help simplify
dealing with variable behavior between error cases and iodone handlers
and whatnot. That trickiness might be avoided by pushing the counter
further down to more of a bio count (e.g., a global variant of
bp>b_io_remaining count), but I haven't experimented with that as of
yet.

> > @@ -1533,6 +1566,8 @@ xfs_wait_buftarg(
> >  	 * ensure here that all reference counts have been dropped before we
> >  	 * start walking the LRU list.
> >  	 */
> > +	wait_event(btp->bt_io_wait,
> > +		   (percpu_counter_sum(&btp->bt_io_count) == 0));
> >  	drain_workqueue(btp->bt_mount->m_buf_workqueue);
> 
> I also don't think that per-buffer IO accounting is justified for
> this. This:
> 
> 	while(percpu_counter_sum(&btp->bt_io_count))
> 		delay(100);
> 
> Will do the same thing without adding an atomic counter inc/dec to
> every buffer we want to capture here.
> 

The purpose of the per-buffer I/O accounting is not to drive the
synchronization mechanism. The purpose is to cover the cases where
buffers are submitted for I/O multiple times before they are released
(also mentioned in my immediately previous reply). Otherwise, the
counter gets out of whack and never returns to 0.

This is required for an implementation that defers decrementing the I/O
count to buffer release time (when the buffer is added to the LRU). I
actually think this patch as posted doesn't quite order things correctly
in that regard, but the broader point holds I think. This additional
accounting isn't required for a pure I/O counter (inc on submit, dec on
complete).

> The waiting should probably also happen after we drain all the
> workqueues, because the bt_io_count won't drop to zero until we
> process all the pending IO completions on the work queues and
> release the buffers...
> 

I'm not sure it matters much for this implementation. I figured it more
natural to wait for pending I/O followed by completions. Note that this
order might be more critical with a pure I/O count approach because the
drain_workqueue() is what ends up serializing the addition to the LRU,
and that only happens if we wait for the I/O to complete first.

But I'll look at this more closely once we've established a direction...

> > +	/* XXX: would atomic_t suffice? */
> > +	struct percpu_counter	bt_io_count;
> 
> Too much cacheline contention in large machines to use an atomic_t
> as a global counter.
> 

Ok.

So what is your preference out of the possible approaches here? AFAICS,
we have the following options:

1.) The original "add readahead to LRU early" approach.
	Pros: simple one-liner
	Cons: bit of a hack, only covers readahead scenario
2.) Defer I/O count decrement to buffer release (this patch).
	Pros: should cover all cases (reads/writes)
	Cons: more complex (requires per-buffer accounting, etc.)
3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
	Pros: eliminates some complexity from #2
	Cons: still more complex than #1, racy in that decrement does
	not serialize against LRU addition (requires drain_workqueue(),
	which still doesn't cover error conditions)

As noted above, option #3 also allows for either a buffer based count or
bio based count, the latter of which might simplify things a bit further
(TBD). Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-11 13:52         ` Brian Foster
@ 2016-07-11 15:29           ` Brian Foster
  2016-07-11 22:44             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-11 15:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> On Mon, Jul 11, 2016 at 03:20:57PM +1000, Dave Chinner wrote:
> > On Tue, Jul 05, 2016 at 12:45:52PM -0400, Brian Foster wrote:
> > > On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> > > > On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > > > > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > > > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > > > > count is released, which typically occurs after I/O completion. There is
> > > > > > no other mechanism at current that tracks the existence or I/O state of
> > > > > > a new buffer. Further, readahead I/O tends to be submitted
> > > > > > asynchronously by nature, which means the I/O can remain in flight and
> > > > > > actually complete long after the calling context is gone. This means
> > > > > > that file descriptors or any other holds on the filesystem can be
> > > > > > released, allowing the filesystem to be unmounted while I/O is still in
> > > > > > flight. When I/O completion occurs, core data structures may have been
> > > > > > freed, causing completion to run into invalid memory accesses and likely
> > > > > > to panic.
> > > > > > 
> > > > > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > > > > is mounted, a directory is opened/closed and the filesystem immediately
> > > > > > unmounted. The open/close cycle triggers a directory readahead that if
> > > > > > delayed long enough, runs buffer I/O completion after the unmount has
> > > > > > completed.
> > > > > > 
> > > > > > To work around this problem, add readahead buffers to the LRU earlier
> > > > > > than other buffers (when the buffer is allocated, specifically). The
> > > > > > buffer hold count will ultimately remain until I/O completion, which
> > > > > > means any shrinker activity will skip the buffer until then. This makes
> > > > > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > > > > unmount or quiesce waits for I/O completion appropriately.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > > This addresses the problem reproduced by the recently posted xfstests
> > > > > > test:
> > > > > > 
> > > > > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > > > > 
> > > > > > This could probably be made more involved, i.e., to create another list
> > > > > > of buffers in flight or some such. This seems more simple/sane to me,
> > > > > > however, and survives my testing so far...
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > > > > >  1 file changed, 12 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > > index 4665ff6..3f03df9 100644
> > > > > > --- a/fs/xfs/xfs_buf.c
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > > > > >  		return NULL;
> > > > > >  	}
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > > > > +	 * else beat us to insertion and we can toss the new one.
> > > > > > +	 *
> > > > > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > > > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > > > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > > > > +	 * nothing else prevents an unmount before I/O completion.
> > > > > > +	 */
> > > > > >  	if (bp != new_bp)
> > > > > >  		xfs_buf_free(new_bp);
> > > > > > +	else if (flags & XBF_READ_AHEAD &&
> > > > > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > > > > +		atomic_inc(&bp->b_hold);
> > > > > 
> > > > > This doesn't sit right with me. The LRU is for "unused" objects, and
> > > > > readahead objects are not unused until IO completes and nobody is
> > > > > waiting on them.
> > > > > 
> > > > > As it is, this points out another problem with readahead buffers -
> > > > > they aren't actually being cached properly because b_lru_ref == 0,
> > > > > which means they are immediately reclaimed on IO completion rather
> > > > > than being added to the LRU....
> > > > > 
> > > > > I also think that it's not sufficient to cover the generic case of
> > > > > async IO that has no waiter. i.e. we could do get_buf, submit async
> > > > > write, drop submitter reference, and now we have the same problem
> > > > > but on a write.  i.e. this problem is and async IO issue, not a
> > > > > readahead issue.
> > > > > 
> > > > > I think that it might be better to fix it by doing this:
> > > > > 
> > > > > 	1. ensure async IO submission always has b_lru_ref set, and
> > > > > 	if it isn't, set it to 1. This ensures the buffer will be
> > > > > 	added to the LRU on completion if it isn't already there.
> > > > > 
> > > > > 	2. keep a count of async buffer IO in progress. A per-cpu
> > > > > 	counter in the buftarg will be fine for this. Increment in
> > > > > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > > > > 	xfs_buf_iodone() once we've determined if the buffer needs
> > > > > 	adding to the LRU or not.
> > > > > 
> > > > > 	3. make xfs_wait_buftarg() wait until the async IO count
> > > > > 	goes to zero before it gives up trying to release buffers on
> > > > > 	the LRU.
> > > > > 
...
> So what is your preference out of the possible approaches here? AFAICS,
> we have the following options:
> 
> 1.) The original "add readahead to LRU early" approach.
> 	Pros: simple one-liner
> 	Cons: bit of a hack, only covers readahead scenario
> 2.) Defer I/O count decrement to buffer release (this patch).
> 	Pros: should cover all cases (reads/writes)
> 	Cons: more complex (requires per-buffer accounting, etc.)
> 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> 	Pros: eliminates some complexity from #2
> 	Cons: still more complex than #1, racy in that decrement does
> 	not serialize against LRU addition (requires drain_workqueue(),
> 	which still doesn't cover error conditions)
> 
> As noted above, option #3 also allows for either a buffer based count or
> bio based count, the latter of which might simplify things a bit further
> (TBD). Thoughts?
> 

FWIW, the following is a slightly cleaned up version of my initial
approach (option #3 above). Note that the flag is used to help deal with
varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
buffers, multiple times for others with an iodone callback, that
behavior changes in some cases when an error is set, etc. (I'll add
comments before an official post.)

Brian

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..45d3ddd 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1018,7 +1018,10 @@ xfs_buf_ioend(
 
 	trace_xfs_buf_iodone(bp, _RET_IP_);
 
-	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
+	if (bp->b_flags & XBF_IN_FLIGHT)
+		percpu_counter_dec(&bp->b_target->bt_io_count);
+
+	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
 
 	/*
 	 * Pull in IO completion errors now. We are guaranteed to be running
@@ -1341,6 +1344,11 @@ xfs_buf_submit(
 	 * xfs_buf_ioend too early.
 	 */
 	atomic_set(&bp->b_io_remaining, 1);
+	if (bp->b_flags & XBF_ASYNC) {
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+		bp->b_flags |= XBF_IN_FLIGHT;
+	}
+
 	_xfs_buf_ioapply(bp);
 
 	/*
@@ -1533,6 +1541,8 @@ xfs_wait_buftarg(
 	 * ensure here that all reference counts have been dropped before we
 	 * start walking the LRU list.
 	 */
+	while (percpu_counter_sum(&btp->bt_io_count))
+		delay(100);
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1639,8 @@ xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1705,9 @@ xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..e1f95e0 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -43,6 +43,7 @@ typedef enum {
 #define XBF_READ	 (1 << 0) /* buffer intended for reading from device */
 #define XBF_WRITE	 (1 << 1) /* buffer intended for writing to device */
 #define XBF_READ_AHEAD	 (1 << 2) /* asynchronous read-ahead */
+#define XBF_IN_FLIGHT	 (1 << 3)
 #define XBF_ASYNC	 (1 << 4) /* initiator will not wait for completion */
 #define XBF_DONE	 (1 << 5) /* all pages in the buffer uptodate */
 #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
@@ -115,6 +116,8 @@ typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	struct percpu_counter	bt_io_count;
 } xfs_buftarg_t;
 
 struct xfs_buf;

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-11 15:29           ` Brian Foster
@ 2016-07-11 22:44             ` Dave Chinner
  2016-07-12 12:03               ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-07-11 22:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> ...
> > So what is your preference out of the possible approaches here? AFAICS,
> > we have the following options:
> > 
> > 1.) The original "add readahead to LRU early" approach.
> > 	Pros: simple one-liner
> > 	Cons: bit of a hack, only covers readahead scenario
> > 2.) Defer I/O count decrement to buffer release (this patch).
> > 	Pros: should cover all cases (reads/writes)
> > 	Cons: more complex (requires per-buffer accounting, etc.)
> > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > 	Pros: eliminates some complexity from #2
> > 	Cons: still more complex than #1, racy in that decrement does
> > 	not serialize against LRU addition (requires drain_workqueue(),
> > 	which still doesn't cover error conditions)
> > 
> > As noted above, option #3 also allows for either a buffer based count or
> > bio based count, the latter of which might simplify things a bit further
> > (TBD). Thoughts?

Pretty good summary :P

> FWIW, the following is a slightly cleaned up version of my initial
> approach (option #3 above). Note that the flag is used to help deal with
> varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> buffers, multiple times for others with an iodone callback, that
> behavior changes in some cases when an error is set, etc. (I'll add
> comments before an official post.)

The approach looks good - I think there's a couple of things we can
do to clean it up and make it robust. Comments inline.

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 4665ff6..45d3ddd 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
>  
>  	trace_xfs_buf_iodone(bp, _RET_IP_);
>  
> -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> +	if (bp->b_flags & XBF_IN_FLIGHT)
> +		percpu_counter_dec(&bp->b_target->bt_io_count);
> +
> +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
>  
>  	/*
>  	 * Pull in IO completion errors now. We are guaranteed to be running

I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
processing if:

> @@ -1341,6 +1344,11 @@ xfs_buf_submit(
>  	 * xfs_buf_ioend too early.
>  	 */
>  	atomic_set(&bp->b_io_remaining, 1);
> +	if (bp->b_flags & XBF_ASYNC) {
> +		percpu_counter_inc(&bp->b_target->bt_io_count);
> +		bp->b_flags |= XBF_IN_FLIGHT;
> +	}

You change this to:

	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
		percpu_counter_inc(&bp->b_target->bt_io_count);
		bp->b_flags |= XBF_IN_FLIGHT;
	}

We shouldn't have to check for XBF_ASYNC in xfs_buf_submit() - it is
the path taken for async IO submission, so we should probably
ASSERT(bp->b_flags & XBF_ASYNC) in this function to ensure that is
the case.

[Thinking aloud - __test_and_set_bit() might make this code a bit
cleaner]

> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 8bfb974..e1f95e0 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -43,6 +43,7 @@ typedef enum {
>  #define XBF_READ	 (1 << 0) /* buffer intended for reading from device */
>  #define XBF_WRITE	 (1 << 1) /* buffer intended for writing to device */
>  #define XBF_READ_AHEAD	 (1 << 2) /* asynchronous read-ahead */
> +#define XBF_IN_FLIGHT	 (1 << 3)

Hmmm - it's an internal flag, so probably should be prefixed with an
"_" and moved down to the section with _XBF_KMEM and friends.

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-11 22:44             ` Dave Chinner
@ 2016-07-12 12:03               ` Brian Foster
  2016-07-12 17:22                 ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-12 12:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > ...
> > > So what is your preference out of the possible approaches here? AFAICS,
> > > we have the following options:
> > > 
> > > 1.) The original "add readahead to LRU early" approach.
> > > 	Pros: simple one-liner
> > > 	Cons: bit of a hack, only covers readahead scenario
> > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > 	Pros: should cover all cases (reads/writes)
> > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > 	Pros: eliminates some complexity from #2
> > > 	Cons: still more complex than #1, racy in that decrement does
> > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > 	which still doesn't cover error conditions)
> > > 
> > > As noted above, option #3 also allows for either a buffer based count or
> > > bio based count, the latter of which might simplify things a bit further
> > > (TBD). Thoughts?
> 
> Pretty good summary :P
> 
> > FWIW, the following is a slightly cleaned up version of my initial
> > approach (option #3 above). Note that the flag is used to help deal with
> > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > buffers, multiple times for others with an iodone callback, that
> > behavior changes in some cases when an error is set, etc. (I'll add
> > comments before an official post.)
> 
> The approach looks good - I think there's a couple of things we can
> do to clean it up and make it robust. Comments inline.
> 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 4665ff6..45d3ddd 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> >  
> >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> >  
> > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > +
> > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> >  
> >  	/*
> >  	 * Pull in IO completion errors now. We are guaranteed to be running
> 
> I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> processing if:
> 
> > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> >  	 * xfs_buf_ioend too early.
> >  	 */
> >  	atomic_set(&bp->b_io_remaining, 1);
> > +	if (bp->b_flags & XBF_ASYNC) {
> > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > +		bp->b_flags |= XBF_IN_FLIGHT;
> > +	}
> 
> You change this to:
> 
> 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> 		percpu_counter_inc(&bp->b_target->bt_io_count);
> 		bp->b_flags |= XBF_IN_FLIGHT;
> 	}
> 

Ok, so use the flag to cap the I/O count and defer the decrement to
release. I think that should work and addresses the raciness issue. I'll
give it a try.

> We shouldn't have to check for XBF_ASYNC in xfs_buf_submit() - it is
> the path taken for async IO submission, so we should probably
> ASSERT(bp->b_flags & XBF_ASYNC) in this function to ensure that is
> the case.
> 

Yeah, that's unnecessary. There's already such an assert in
xfs_buf_submit(), actually.

> [Thinking aloud - __test_and_set_bit() might make this code a bit
> cleaner]
> 

On a quick try, this complains about b_flags being an unsigned int. I
think I'll leave the set bit as is and use a helper for the release,
which also provides a location to explain how the count works.

> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 8bfb974..e1f95e0 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -43,6 +43,7 @@ typedef enum {
> >  #define XBF_READ	 (1 << 0) /* buffer intended for reading from device */
> >  #define XBF_WRITE	 (1 << 1) /* buffer intended for writing to device */
> >  #define XBF_READ_AHEAD	 (1 << 2) /* asynchronous read-ahead */
> > +#define XBF_IN_FLIGHT	 (1 << 3)
> 
> Hmmm - it's an internal flag, so probably should be prefixed with an
> "_" and moved down to the section with _XBF_KMEM and friends.
> 

Indeed, thanks.

Brian

> Thoughts?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-12 12:03               ` Brian Foster
@ 2016-07-12 17:22                 ` Brian Foster
  2016-07-12 23:57                   ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-12 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > ...
> > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > we have the following options:
> > > > 
> > > > 1.) The original "add readahead to LRU early" approach.
> > > > 	Pros: simple one-liner
> > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > 	Pros: should cover all cases (reads/writes)
> > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > 	Pros: eliminates some complexity from #2
> > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > 	which still doesn't cover error conditions)
> > > > 
> > > > As noted above, option #3 also allows for either a buffer based count or
> > > > bio based count, the latter of which might simplify things a bit further
> > > > (TBD). Thoughts?
> > 
> > Pretty good summary :P
> > 
> > > FWIW, the following is a slightly cleaned up version of my initial
> > > approach (option #3 above). Note that the flag is used to help deal with
> > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > buffers, multiple times for others with an iodone callback, that
> > > behavior changes in some cases when an error is set, etc. (I'll add
> > > comments before an official post.)
> > 
> > The approach looks good - I think there's a couple of things we can
> > do to clean it up and make it robust. Comments inline.
> > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..45d3ddd 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > >  
> > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > >  
> > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > +
> > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > >  
> > >  	/*
> > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > 
> > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > processing if:
> > 
> > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > >  	 * xfs_buf_ioend too early.
> > >  	 */
> > >  	atomic_set(&bp->b_io_remaining, 1);
> > > +	if (bp->b_flags & XBF_ASYNC) {
> > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > +	}
> > 
> > You change this to:
> > 
> > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > 		bp->b_flags |= XBF_IN_FLIGHT;
> > 	}
> > 
> 
> Ok, so use the flag to cap the I/O count and defer the decrement to
> release. I think that should work and addresses the raciness issue. I'll
> give it a try.
> 

This appears to be doable, but it reintroduces some ugliness from the
previous approach. For example, we have to start filtering out uncached
buffers again (if we defer the decrement to release, we must handle
never-released buffers one way or another). Also, given the feedback on
the previous patch with regard to filtering out non-new buffers from the
I/O count, I've dropped that and replaced it with updates to
xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
either have to filter out buffers already on the LRU at submit time or
make sure that they are decremented when released back to the LRU).

Code follows...

Brian

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..b7afbac 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,6 +80,25 @@ xfs_buf_vmap_len(
 }
 
 /*
+ * Clear the in-flight state on a buffer about to be released to the LRU or
+ * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
+ * of held buffers that have undergone at least one I/O in the current hold
+ * cycle (e.g., not a total I/O count). This provides protection against unmount
+ * for buffer I/O completion (see xfs_wait_buftarg()) processing.
+ */
+static inline void
+xfs_buf_rele_in_flight(
+	struct xfs_buf	*bp)
+{
+	if (!(bp->b_flags & _XBF_IN_FLIGHT))
+		return;
+
+	ASSERT(bp->b_flags & XBF_ASYNC);
+	bp->b_flags &= ~_XBF_IN_FLIGHT;
+	percpu_counter_dec(&bp->b_target->bt_io_count);
+}
+
+/*
  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
  * b_lru_ref count so that the buffer is freed immediately when the buffer
  * reference count falls to zero. If the buffer is already on the LRU, we need
@@ -866,30 +885,37 @@ xfs_buf_hold(
 }
 
 /*
- *	Releases a hold on the specified buffer.  If the
- *	the hold count is 1, calls xfs_buf_free.
+ * Release a hold on the specified buffer. If the hold count is 1, the buffer is
+ * placed on LRU or freed (depending on b_lru_ref).
  */
 void
 xfs_buf_rele(
 	xfs_buf_t		*bp)
 {
 	struct xfs_perag	*pag = bp->b_pag;
+	bool			release;
+	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
 	if (!pag) {
 		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
-		if (atomic_dec_and_test(&bp->b_hold))
+		if (atomic_dec_and_test(&bp->b_hold)) {
+			xfs_buf_rele_in_flight(bp);
 			xfs_buf_free(bp);
+		}
 		return;
 	}
 
 	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
-	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
-		spin_lock(&bp->b_lock);
+
+	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+	spin_lock(&bp->b_lock);
+	if (release) {
+		xfs_buf_rele_in_flight(bp);
 		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 			/*
 			 * If the buffer is added to the LRU take a new
@@ -900,7 +926,6 @@ xfs_buf_rele(
 				bp->b_state &= ~XFS_BSTATE_DISPOSE;
 				atomic_inc(&bp->b_hold);
 			}
-			spin_unlock(&bp->b_lock);
 			spin_unlock(&pag->pag_buf_lock);
 		} else {
 			/*
@@ -914,15 +939,24 @@ xfs_buf_rele(
 			} else {
 				ASSERT(list_empty(&bp->b_lru));
 			}
-			spin_unlock(&bp->b_lock);
 
 			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
 			xfs_perag_put(pag);
-			xfs_buf_free(bp);
+			freebuf = true;
 		}
+	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
+		/*
+		 * The buffer is already on the LRU and it holds the only
+		 * reference. Drop the in flight state.
+		 */
+		xfs_buf_rele_in_flight(bp);
 	}
+	spin_unlock(&bp->b_lock);
+
+	if (freebuf)
+		xfs_buf_free(bp);
 }
 
 
@@ -1341,6 +1375,18 @@ xfs_buf_submit(
 	 * xfs_buf_ioend too early.
 	 */
 	atomic_set(&bp->b_io_remaining, 1);
+
+	/*
+	 * Bump the I/O in flight count on the buftarg if we haven't yet done
+	 * so for this buffer. Skip uncached buffers because many of those
+	 * (e.g., superblock, log buffers) are never released.
+	 */
+	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
+	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
+		bp->b_flags |= _XBF_IN_FLIGHT;
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+	}
+
 	_xfs_buf_ioapply(bp);
 
 	/*
@@ -1526,13 +1572,19 @@ xfs_wait_buftarg(
 	int loop = 0;
 
 	/*
-	 * We need to flush the buffer workqueue to ensure that all IO
-	 * completion processing is 100% done. Just waiting on buffer locks is
-	 * not sufficient for async IO as the reference count held over IO is
-	 * not released until after the buffer lock is dropped. Hence we need to
-	 * ensure here that all reference counts have been dropped before we
-	 * start walking the LRU list.
+	 * First wait on the buftarg I/O count for all in-flight buffers to be
+	 * released. This is critical as new buffers do not make the LRU until
+	 * they are released.
+	 *
+	 * Next, flush the buffer workqueue to ensure all completion processing
+	 * has finished. Just waiting on buffer locks is not sufficient for
+	 * async IO as the reference count held over IO is not released until
+	 * after the buffer lock is dropped. Hence we need to ensure here that
+	 * all reference counts have been dropped before we start walking the
+	 * LRU list.
 	 */
+	while (percpu_counter_sum(&btp->bt_io_count))
+		delay(100);
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1681,8 @@ xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1747,9 @@ xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..19f70e2 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -62,6 +62,7 @@ typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
+#define _XBF_IN_FLIGHT	 (1 << 25) /* I/O in flight, for accounting purposes */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -115,6 +116,8 @@ typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	struct percpu_counter	bt_io_count;
 } xfs_buftarg_t;
 
 struct xfs_buf;

> > We shouldn't have to check for XBF_ASYNC in xfs_buf_submit() - it is
> > the path taken for async IO submission, so we should probably
> > ASSERT(bp->b_flags & XBF_ASYNC) in this function to ensure that is
> > the case.
> > 
> 
> Yeah, that's unnecessary. There's already such an assert in
> xfs_buf_submit(), actually.
> 
> > [Thinking aloud - __test_and_set_bit() might make this code a bit
> > cleaner]
> > 
> 
> On a quick try, this complains about b_flags being an unsigned int. I
> think I'll leave the set bit as is and use a helper for the release,
> which also provides a location to explain how the count works.
> 
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 8bfb974..e1f95e0 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -43,6 +43,7 @@ typedef enum {
> > >  #define XBF_READ	 (1 << 0) /* buffer intended for reading from device */
> > >  #define XBF_WRITE	 (1 << 1) /* buffer intended for writing to device */
> > >  #define XBF_READ_AHEAD	 (1 << 2) /* asynchronous read-ahead */
> > > +#define XBF_IN_FLIGHT	 (1 << 3)
> > 
> > Hmmm - it's an internal flag, so probably should be prefixed with an
> > "_" and moved down to the section with _XBF_KMEM and friends.
> > 
> 
> Indeed, thanks.
> 
> Brian
> 
> > Thoughts?
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-12 17:22                 ` Brian Foster
@ 2016-07-12 23:57                   ` Dave Chinner
  2016-07-13 11:32                     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-07-12 23:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > ...
> > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > we have the following options:
> > > > > 
> > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > 	Pros: simple one-liner
> > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > 	Pros: should cover all cases (reads/writes)
> > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > 	Pros: eliminates some complexity from #2
> > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > 	which still doesn't cover error conditions)
> > > > > 
> > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > bio based count, the latter of which might simplify things a bit further
> > > > > (TBD). Thoughts?
> > > 
> > > Pretty good summary :P
> > > 
> > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > buffers, multiple times for others with an iodone callback, that
> > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > comments before an official post.)
> > > 
> > > The approach looks good - I think there's a couple of things we can
> > > do to clean it up and make it robust. Comments inline.
> > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 4665ff6..45d3ddd 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > >  
> > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > >  
> > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > +
> > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > >  
> > > >  	/*
> > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > 
> > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > processing if:
> > > 
> > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > >  	 * xfs_buf_ioend too early.
> > > >  	 */
> > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > +	}
> > > 
> > > You change this to:
> > > 
> > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > 	}
> > > 
> > 
> > Ok, so use the flag to cap the I/O count and defer the decrement to
> > release. I think that should work and addresses the raciness issue. I'll
> > give it a try.
> > 
> 
> This appears to be doable, but it reintroduces some ugliness from the
> previous approach.

Ah, so it does. Bugger.

> For example, we have to start filtering out uncached
> buffers again (if we defer the decrement to release, we must handle
> never-released buffers one way or another).

So the problem is limited to the superblock buffer and the iclog
buffers, right? How about making that special case explicit via a
flag set on the buffer? e.g. XBF_NO_IOCOUNT. THat way the exceptions
are clearly spelt out, rather than avoiding all uncached buffers?

> Also, given the feedback on
> the previous patch with regard to filtering out non-new buffers from the
> I/O count, I've dropped that and replaced it with updates to
> xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> either have to filter out buffers already on the LRU at submit time or
> make sure that they are decremented when released back to the LRU).
> 
> Code follows...
> 
> Brian
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 4665ff6..b7afbac 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
>  }
>  
>  /*
> + * Clear the in-flight state on a buffer about to be released to the LRU or
> + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> + * of held buffers that have undergone at least one I/O in the current hold
> + * cycle (e.g., not a total I/O count). This provides protection against unmount
> + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> + */
> +static inline void
> +xfs_buf_rele_in_flight(
> +	struct xfs_buf	*bp)

Not sure about the name: xfs_buf_ioacct_dec()?

> +{
> +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> +		return;
> +
> +	ASSERT(bp->b_flags & XBF_ASYNC);
> +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> +	percpu_counter_dec(&bp->b_target->bt_io_count);
> +}
> +
> +/*
>   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
>   * b_lru_ref count so that the buffer is freed immediately when the buffer
>   * reference count falls to zero. If the buffer is already on the LRU, we need
> @@ -866,30 +885,37 @@ xfs_buf_hold(
>  }
>  
>  /*
> - *	Releases a hold on the specified buffer.  If the
> - *	the hold count is 1, calls xfs_buf_free.
> + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> + * placed on LRU or freed (depending on b_lru_ref).
>   */
>  void
>  xfs_buf_rele(
>  	xfs_buf_t		*bp)
>  {
>  	struct xfs_perag	*pag = bp->b_pag;
> +	bool			release;
> +	bool			freebuf = false;
>  
>  	trace_xfs_buf_rele(bp, _RET_IP_);
>  
>  	if (!pag) {
>  		ASSERT(list_empty(&bp->b_lru));
>  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> -		if (atomic_dec_and_test(&bp->b_hold))
> +		if (atomic_dec_and_test(&bp->b_hold)) {
> +			xfs_buf_rele_in_flight(bp);
>  			xfs_buf_free(bp);
> +		}
>  		return;
>  	}
>  
>  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
>  
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
> -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> -		spin_lock(&bp->b_lock);
> +
> +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> +	spin_lock(&bp->b_lock);
> +	if (release) {
> +		xfs_buf_rele_in_flight(bp);
>  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
>  			/*
>  			 * If the buffer is added to the LRU take a new
> @@ -900,7 +926,6 @@ xfs_buf_rele(
>  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
>  				atomic_inc(&bp->b_hold);
>  			}
> -			spin_unlock(&bp->b_lock);
>  			spin_unlock(&pag->pag_buf_lock);
>  		} else {
>  			/*
> @@ -914,15 +939,24 @@ xfs_buf_rele(
>  			} else {
>  				ASSERT(list_empty(&bp->b_lru));
>  			}
> -			spin_unlock(&bp->b_lock);
>  
>  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
>  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
>  			spin_unlock(&pag->pag_buf_lock);
>  			xfs_perag_put(pag);
> -			xfs_buf_free(bp);
> +			freebuf = true;
>  		}
> +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> +		/*
> +		 * The buffer is already on the LRU and it holds the only
> +		 * reference. Drop the in flight state.
> +		 */
> +		xfs_buf_rele_in_flight(bp);
>  	}

This b_hold check is racy - bp->b_lock is not enough to stabilise
the b_hold count. Because we don't hold the buffer semaphore any
more, another buffer reference holder can successfully run the above
atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
can be taken in xfs_buf_find() so the count could go up, but I
think that's fine given the eventual case we care about here is
draining references on unmount.

I think this is still ok for draining references, too, because of
the flag check inside xfs_buf_rele_in_flight(). If we race on a
transition a value of 1, then we end running the branch in each
caller. If we race on transition to zero, then the caller that is
releasing the buffer will execute xfs_buf_rele_in_flight() and all
will be well.

Needs comments, and maybe restructing the code to handle the
xfs_buf_rele_in_flight() call up front so it's clear that io
accounting is clearly a separate case from the rest of release
handling. e.g.

	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
	spin_lock(&bp->b_lock);
	if (!release) {
		if (!(atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
			xfs_buf_ioacct_dec(bp);
		goto out_unlock;
	}
	xfs_buf_ioacct_dec(bp);

	/* rest of release code, one level of indentation removed */

out_unlock:
	spin_unlock(&bp->b_lock);

	if (freebuf)
		xfs_buf_free(bp);



> @@ -1341,6 +1375,18 @@ xfs_buf_submit(
>  	 * xfs_buf_ioend too early.
>  	 */
>  	atomic_set(&bp->b_io_remaining, 1);
> +
> +	/*
> +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> +	 * so for this buffer. Skip uncached buffers because many of those
> +	 * (e.g., superblock, log buffers) are never released.
> +	 */
> +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> +		bp->b_flags |= _XBF_IN_FLIGHT;
> +		percpu_counter_inc(&bp->b_target->bt_io_count);
> +	}

xfs_buf_ioacct_inc()
{
	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
		return;
	percpu_counter_inc(&bp->b_target->bt_io_count);
	bp->b_flags |= _XBF_IN_FLIGHT;
}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-12 23:57                   ` Dave Chinner
@ 2016-07-13 11:32                     ` Brian Foster
  2016-07-13 12:49                       ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-07-13 11:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 13, 2016 at 09:57:52AM +1000, Dave Chinner wrote:
> On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> > On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > > ...
> > > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > > we have the following options:
> > > > > > 
> > > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > > 	Pros: simple one-liner
> > > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > > 	Pros: should cover all cases (reads/writes)
> > > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > > 	Pros: eliminates some complexity from #2
> > > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > > 	which still doesn't cover error conditions)
> > > > > > 
> > > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > > bio based count, the latter of which might simplify things a bit further
> > > > > > (TBD). Thoughts?
> > > > 
> > > > Pretty good summary :P
> > > > 
> > > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > > buffers, multiple times for others with an iodone callback, that
> > > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > > comments before an official post.)
> > > > 
> > > > The approach looks good - I think there's a couple of things we can
> > > > do to clean it up and make it robust. Comments inline.
> > > > 
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index 4665ff6..45d3ddd 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > > >  
> > > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > > >  
> > > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > > +
> > > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > > >  
> > > > >  	/*
> > > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > > 
> > > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > > processing if:
> > > > 
> > > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > > >  	 * xfs_buf_ioend too early.
> > > > >  	 */
> > > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > +	}
> > > > 
> > > > You change this to:
> > > > 
> > > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > > 	}
> > > > 
> > > 
> > > Ok, so use the flag to cap the I/O count and defer the decrement to
> > > release. I think that should work and addresses the raciness issue. I'll
> > > give it a try.
> > > 
> > 
> > This appears to be doable, but it reintroduces some ugliness from the
> > previous approach.
> 
> Ah, so it does. Bugger.
> 
> > For example, we have to start filtering out uncached
> > buffers again (if we defer the decrement to release, we must handle
> > never-released buffers one way or another).
> 
> So the problem is limited to the superblock buffer and the iclog
> buffers, right? How about making that special case explicit via a
> flag set on the buffer? e.g. XBF_NO_IOCOUNT. THat way the exceptions
> are clearly spelt out, rather than avoiding all uncached buffers?
> 

I think so. I considered a similar approach earlier, but I didn't want
to spend time tracking down the associated users until the broader
approach was nailed down. More specifically, I think we could set
b_lru_ref to 0 on those buffers and use that to bypass the accounting.
That makes it clear that these buffers are not destined for the LRU and
alternative synchronization is required (which already exists in the
form of lock cycles).

The rest of the feedback makes sense, so I'll incorporate that and give
the above a try... thanks.

Brian

> > Also, given the feedback on
> > the previous patch with regard to filtering out non-new buffers from the
> > I/O count, I've dropped that and replaced it with updates to
> > xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> > either have to filter out buffers already on the LRU at submit time or
> > make sure that they are decremented when released back to the LRU).
> > 
> > Code follows...
> > 
> > Brian
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 4665ff6..b7afbac 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
> >  }
> >  
> >  /*
> > + * Clear the in-flight state on a buffer about to be released to the LRU or
> > + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> > + * of held buffers that have undergone at least one I/O in the current hold
> > + * cycle (e.g., not a total I/O count). This provides protection against unmount
> > + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> > + */
> > +static inline void
> > +xfs_buf_rele_in_flight(
> > +	struct xfs_buf	*bp)
> 
> Not sure about the name: xfs_buf_ioacct_dec()?
> 
> > +{
> > +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> > +		return;
> > +
> > +	ASSERT(bp->b_flags & XBF_ASYNC);
> > +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> > +	percpu_counter_dec(&bp->b_target->bt_io_count);
> > +}
> > +
> > +/*
> >   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
> >   * b_lru_ref count so that the buffer is freed immediately when the buffer
> >   * reference count falls to zero. If the buffer is already on the LRU, we need
> > @@ -866,30 +885,37 @@ xfs_buf_hold(
> >  }
> >  
> >  /*
> > - *	Releases a hold on the specified buffer.  If the
> > - *	the hold count is 1, calls xfs_buf_free.
> > + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> > + * placed on LRU or freed (depending on b_lru_ref).
> >   */
> >  void
> >  xfs_buf_rele(
> >  	xfs_buf_t		*bp)
> >  {
> >  	struct xfs_perag	*pag = bp->b_pag;
> > +	bool			release;
> > +	bool			freebuf = false;
> >  
> >  	trace_xfs_buf_rele(bp, _RET_IP_);
> >  
> >  	if (!pag) {
> >  		ASSERT(list_empty(&bp->b_lru));
> >  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> > -		if (atomic_dec_and_test(&bp->b_hold))
> > +		if (atomic_dec_and_test(&bp->b_hold)) {
> > +			xfs_buf_rele_in_flight(bp);
> >  			xfs_buf_free(bp);
> > +		}
> >  		return;
> >  	}
> >  
> >  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
> >  
> >  	ASSERT(atomic_read(&bp->b_hold) > 0);
> > -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> > -		spin_lock(&bp->b_lock);
> > +
> > +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > +	spin_lock(&bp->b_lock);
> > +	if (release) {
> > +		xfs_buf_rele_in_flight(bp);
> >  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
> >  			/*
> >  			 * If the buffer is added to the LRU take a new
> > @@ -900,7 +926,6 @@ xfs_buf_rele(
> >  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
> >  				atomic_inc(&bp->b_hold);
> >  			}
> > -			spin_unlock(&bp->b_lock);
> >  			spin_unlock(&pag->pag_buf_lock);
> >  		} else {
> >  			/*
> > @@ -914,15 +939,24 @@ xfs_buf_rele(
> >  			} else {
> >  				ASSERT(list_empty(&bp->b_lru));
> >  			}
> > -			spin_unlock(&bp->b_lock);
> >  
> >  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> >  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
> >  			spin_unlock(&pag->pag_buf_lock);
> >  			xfs_perag_put(pag);
> > -			xfs_buf_free(bp);
> > +			freebuf = true;
> >  		}
> > +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> > +		/*
> > +		 * The buffer is already on the LRU and it holds the only
> > +		 * reference. Drop the in flight state.
> > +		 */
> > +		xfs_buf_rele_in_flight(bp);
> >  	}
> 
> This b_hold check is racy - bp->b_lock is not enough to stabilise
> the b_hold count. Because we don't hold the buffer semaphore any
> more, another buffer reference holder can successfully run the above
> atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
> can be taken in xfs_buf_find() so the count could go up, but I
> think that's fine given the eventual case we care about here is
> draining references on unmount.
> 
> I think this is still ok for draining references, too, because of
> the flag check inside xfs_buf_rele_in_flight(). If we race on a
> transition a value of 1, then we end running the branch in each
> caller. If we race on transition to zero, then the caller that is
> releasing the buffer will execute xfs_buf_rele_in_flight() and all
> will be well.
> 
> Needs comments, and maybe restructing the code to handle the
> xfs_buf_rele_in_flight() call up front so it's clear that io
> accounting is clearly a separate case from the rest of release
> handling. e.g.
> 
> 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> 	spin_lock(&bp->b_lock);
> 	if (!release) {
> 		if (!(atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> 			xfs_buf_ioacct_dec(bp);
> 		goto out_unlock;
> 	}
> 	xfs_buf_ioacct_dec(bp);
> 
> 	/* rest of release code, one level of indentation removed */
> 
> out_unlock:
> 	spin_unlock(&bp->b_lock);
> 
> 	if (freebuf)
> 		xfs_buf_free(bp);
> 
> 
> 
> > @@ -1341,6 +1375,18 @@ xfs_buf_submit(
> >  	 * xfs_buf_ioend too early.
> >  	 */
> >  	atomic_set(&bp->b_io_remaining, 1);
> > +
> > +	/*
> > +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> > +	 * so for this buffer. Skip uncached buffers because many of those
> > +	 * (e.g., superblock, log buffers) are never released.
> > +	 */
> > +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> > +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> > +		bp->b_flags |= _XBF_IN_FLIGHT;
> > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > +	}
> 
> xfs_buf_ioacct_inc()
> {
> 	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
> 		return;
> 	percpu_counter_inc(&bp->b_target->bt_io_count);
> 	bp->b_flags |= _XBF_IN_FLIGHT;
> }
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
  2016-07-13 11:32                     ` Brian Foster
@ 2016-07-13 12:49                       ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2016-07-13 12:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 13, 2016 at 07:32:26AM -0400, Brian Foster wrote:
> On Wed, Jul 13, 2016 at 09:57:52AM +1000, Dave Chinner wrote:
> > On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> > > On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > > > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > > > ...
> > > > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > > > we have the following options:
> > > > > > > 
> > > > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > > > 	Pros: simple one-liner
> > > > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > > > 	Pros: should cover all cases (reads/writes)
> > > > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > > > 	Pros: eliminates some complexity from #2
> > > > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > > > 	which still doesn't cover error conditions)
> > > > > > > 
> > > > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > > > bio based count, the latter of which might simplify things a bit further
> > > > > > > (TBD). Thoughts?
> > > > > 
> > > > > Pretty good summary :P
> > > > > 
> > > > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > > > buffers, multiple times for others with an iodone callback, that
> > > > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > > > comments before an official post.)
> > > > > 
> > > > > The approach looks good - I think there's a couple of things we can
> > > > > do to clean it up and make it robust. Comments inline.
> > > > > 
> > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > > index 4665ff6..45d3ddd 100644
> > > > > > --- a/fs/xfs/xfs_buf.c
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > > > >  
> > > > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > > > >  
> > > > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > > > +
> > > > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > > > 
> > > > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > > > processing if:
> > > > > 
> > > > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > > > >  	 * xfs_buf_ioend too early.
> > > > > >  	 */
> > > > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > > +	}
> > > > > 
> > > > > You change this to:
> > > > > 
> > > > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > 	}
> > > > > 
> > > > 
> > > > Ok, so use the flag to cap the I/O count and defer the decrement to
> > > > release. I think that should work and addresses the raciness issue. I'll
> > > > give it a try.
> > > > 
> > > 
> > > This appears to be doable, but it reintroduces some ugliness from the
> > > previous approach.
> > 
> > Ah, so it does. Bugger.
> > 
> > > For example, we have to start filtering out uncached
> > > buffers again (if we defer the decrement to release, we must handle
> > > never-released buffers one way or another).
> > 
> > So the problem is limited to the superblock buffer and the iclog
> > buffers, right? How about making that special case explicit via a
> > flag set on the buffer? e.g. XBF_NO_IOCOUNT. THat way the exceptions
> > are clearly spelt out, rather than avoiding all uncached buffers?
> > 
> 
> I think so. I considered a similar approach earlier, but I didn't want
> to spend time tracking down the associated users until the broader
> approach was nailed down. More specifically, I think we could set
> b_lru_ref to 0 on those buffers and use that to bypass the accounting.
> That makes it clear that these buffers are not destined for the LRU and
> alternative synchronization is required (which already exists in the
> form of lock cycles).
> 

It occurs to me that this probably won't work in all cases because
b_lru_ref is decremented naturally as part of the LRU shrinker
mechanism. Disregard this, I'll look into the flag..

Brian

> The rest of the feedback makes sense, so I'll incorporate that and give
> the above a try... thanks.
> 
> Brian
> 
> > > Also, given the feedback on
> > > the previous patch with regard to filtering out non-new buffers from the
> > > I/O count, I've dropped that and replaced it with updates to
> > > xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> > > either have to filter out buffers already on the LRU at submit time or
> > > make sure that they are decremented when released back to the LRU).
> > > 
> > > Code follows...
> > > 
> > > Brian
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..b7afbac 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
> > >  }
> > >  
> > >  /*
> > > + * Clear the in-flight state on a buffer about to be released to the LRU or
> > > + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> > > + * of held buffers that have undergone at least one I/O in the current hold
> > > + * cycle (e.g., not a total I/O count). This provides protection against unmount
> > > + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> > > + */
> > > +static inline void
> > > +xfs_buf_rele_in_flight(
> > > +	struct xfs_buf	*bp)
> > 
> > Not sure about the name: xfs_buf_ioacct_dec()?
> > 
> > > +{
> > > +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> > > +		return;
> > > +
> > > +	ASSERT(bp->b_flags & XBF_ASYNC);
> > > +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> > > +	percpu_counter_dec(&bp->b_target->bt_io_count);
> > > +}
> > > +
> > > +/*
> > >   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
> > >   * b_lru_ref count so that the buffer is freed immediately when the buffer
> > >   * reference count falls to zero. If the buffer is already on the LRU, we need
> > > @@ -866,30 +885,37 @@ xfs_buf_hold(
> > >  }
> > >  
> > >  /*
> > > - *	Releases a hold on the specified buffer.  If the
> > > - *	the hold count is 1, calls xfs_buf_free.
> > > + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> > > + * placed on LRU or freed (depending on b_lru_ref).
> > >   */
> > >  void
> > >  xfs_buf_rele(
> > >  	xfs_buf_t		*bp)
> > >  {
> > >  	struct xfs_perag	*pag = bp->b_pag;
> > > +	bool			release;
> > > +	bool			freebuf = false;
> > >  
> > >  	trace_xfs_buf_rele(bp, _RET_IP_);
> > >  
> > >  	if (!pag) {
> > >  		ASSERT(list_empty(&bp->b_lru));
> > >  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> > > -		if (atomic_dec_and_test(&bp->b_hold))
> > > +		if (atomic_dec_and_test(&bp->b_hold)) {
> > > +			xfs_buf_rele_in_flight(bp);
> > >  			xfs_buf_free(bp);
> > > +		}
> > >  		return;
> > >  	}
> > >  
> > >  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
> > >  
> > >  	ASSERT(atomic_read(&bp->b_hold) > 0);
> > > -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> > > -		spin_lock(&bp->b_lock);
> > > +
> > > +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > > +	spin_lock(&bp->b_lock);
> > > +	if (release) {
> > > +		xfs_buf_rele_in_flight(bp);
> > >  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
> > >  			/*
> > >  			 * If the buffer is added to the LRU take a new
> > > @@ -900,7 +926,6 @@ xfs_buf_rele(
> > >  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
> > >  				atomic_inc(&bp->b_hold);
> > >  			}
> > > -			spin_unlock(&bp->b_lock);
> > >  			spin_unlock(&pag->pag_buf_lock);
> > >  		} else {
> > >  			/*
> > > @@ -914,15 +939,24 @@ xfs_buf_rele(
> > >  			} else {
> > >  				ASSERT(list_empty(&bp->b_lru));
> > >  			}
> > > -			spin_unlock(&bp->b_lock);
> > >  
> > >  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> > >  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
> > >  			spin_unlock(&pag->pag_buf_lock);
> > >  			xfs_perag_put(pag);
> > > -			xfs_buf_free(bp);
> > > +			freebuf = true;
> > >  		}
> > > +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> > > +		/*
> > > +		 * The buffer is already on the LRU and it holds the only
> > > +		 * reference. Drop the in flight state.
> > > +		 */
> > > +		xfs_buf_rele_in_flight(bp);
> > >  	}
> > 
> > This b_hold check is racy - bp->b_lock is not enough to stabilise
> > the b_hold count. Because we don't hold the buffer semaphore any
> > more, another buffer reference holder can successfully run the above
> > atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
> > can be taken in xfs_buf_find() so the count could go up, but I
> > think that's fine given the eventual case we care about here is
> > draining references on unmount.
> > 
> > I think this is still ok for draining references, too, because of
> > the flag check inside xfs_buf_rele_in_flight(). If we race on a
> > transition a value of 1, then we end running the branch in each
> > caller. If we race on transition to zero, then the caller that is
> > releasing the buffer will execute xfs_buf_rele_in_flight() and all
> > will be well.
> > 
> > Needs comments, and maybe restructing the code to handle the
> > xfs_buf_rele_in_flight() call up front so it's clear that io
> > accounting is clearly a separate case from the rest of release
> > handling. e.g.
> > 
> > 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > 	spin_lock(&bp->b_lock);
> > 	if (!release) {
> > 		if (!(atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> > 			xfs_buf_ioacct_dec(bp);
> > 		goto out_unlock;
> > 	}
> > 	xfs_buf_ioacct_dec(bp);
> > 
> > 	/* rest of release code, one level of indentation removed */
> > 
> > out_unlock:
> > 	spin_unlock(&bp->b_lock);
> > 
> > 	if (freebuf)
> > 		xfs_buf_free(bp);
> > 
> > 
> > 
> > > @@ -1341,6 +1375,18 @@ xfs_buf_submit(
> > >  	 * xfs_buf_ioend too early.
> > >  	 */
> > >  	atomic_set(&bp->b_io_remaining, 1);
> > > +
> > > +	/*
> > > +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> > > +	 * so for this buffer. Skip uncached buffers because many of those
> > > +	 * (e.g., superblock, log buffers) are never released.
> > > +	 */
> > > +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> > > +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> > > +		bp->b_flags |= _XBF_IN_FLIGHT;
> > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > +	}
> > 
> > xfs_buf_ioacct_inc()
> > {
> > 	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
> > 		return;
> > 	percpu_counter_inc(&bp->b_target->bt_io_count);
> > 	bp->b_flags |= _XBF_IN_FLIGHT;
> > }
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-07-13 12:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-30 12:53 [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic Brian Foster
2016-06-30 22:44 ` Dave Chinner
2016-06-30 23:56   ` Brian Foster
2016-07-01  4:33     ` Dave Chinner
2016-07-01 12:53       ` Brian Foster
2016-07-04  0:08         ` Dave Chinner
2016-07-05 13:42           ` Brian Foster
2016-07-01 22:30   ` Brian Foster
2016-07-05 16:45     ` Brian Foster
2016-07-11  5:20       ` Dave Chinner
2016-07-11 13:52         ` Brian Foster
2016-07-11 15:29           ` Brian Foster
2016-07-11 22:44             ` Dave Chinner
2016-07-12 12:03               ` Brian Foster
2016-07-12 17:22                 ` Brian Foster
2016-07-12 23:57                   ` Dave Chinner
2016-07-13 11:32                     ` Brian Foster
2016-07-13 12:49                       ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.