* [PATCH] [RFC] Release buffer locks in case of IO error
@ 2016-09-29 13:03 Carlos Maiolino
  2016-09-30  0:01 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maiolino @ 2016-09-29 13:03 UTC (permalink / raw)
  To: linux-xfs

I have been working on a bug regarding the xfs fail_at_unmount configuration,
where, even though the option is enabled, an unmount attempt will still hang
if the AIL buf items are locked as a result of a previous failed attempt to
flush these items.

Currently, if there is a failure while trying to flush inode buffers to disk,
these items are kept in the AIL with FLUSHING status and with the locks held,
making them impossible to retry. Whether during unmount, where they should be
retried and 'failed', or on a thin provisioned device whose pool is extended
to accommodate these not-yet-flushed items, xfs is unable to retry them,
since they are already locked.

Brian came up with a suggestion for how to fix it: releasing the locks in case
of a failed write attempt.

I know there are some other people also involved in this problem, so I think
it's a good idea to send Brian's patch here and get some more comments about it.

I'm keeping Brian's signed-off-by, since he wrote the whole patch.


Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 892c2ac..cce0373 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -517,7 +517,26 @@ xfs_inode_item_push(
 	 * the AIL.
 	 */
 	if (!xfs_iflock_nowait(ip)) {
-		rval = XFS_ITEM_FLUSHING;
+		int error;
+		struct xfs_dinode *dip;
+
+		error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
+				       &bp, XBF_TRYLOCK, 0);
+		if (error) {
+			rval = XFS_ITEM_FLUSHING;
+			goto out_unlock;
+		}
+
+		if (!(bp->b_flags & XBF_WRITE_FAIL)) {
+			rval = XFS_ITEM_FLUSHING;
+			xfs_buf_relse(bp);
+			goto out_unlock;
+		}
+
+		if (!xfs_buf_delwri_queue(bp, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+
+		xfs_buf_relse(bp);
 		goto out_unlock;
 	}
 
-- 
2.7.4



* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-29 13:03 [PATCH] [RFC] Release buffer locks in case of IO error Carlos Maiolino
@ 2016-09-30  0:01 ` Dave Chinner
  2016-09-30  9:14   ` Carlos Maiolino
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dave Chinner @ 2016-09-30  0:01 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Thu, Sep 29, 2016 at 03:03:19PM +0200, Carlos Maiolino wrote:
> I have been working on a bug regarding the xfs fail_at_unmount configuration,
> where, even though the option is enabled, an unmount attempt will still hang
> if the AIL buf items are locked as a result of a previous failed attempt to
> flush these items.
> 
> Currently, if there is a failure while trying to flush inode buffers to disk,
> these items are kept in the AIL with FLUSHING status and with the locks held,
> making them impossible to retry. Whether during unmount, where they should be
> retried and 'failed', or on a thin provisioned device whose pool is extended
> to accommodate these not-yet-flushed items, xfs is unable to retry them,
> since they are already locked.

[....]

> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 892c2ac..cce0373 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -517,7 +517,26 @@ xfs_inode_item_push(
>  	 * the AIL.
>  	 */
>  	if (!xfs_iflock_nowait(ip)) {
> -		rval = XFS_ITEM_FLUSHING;
> +		int error;
> +		struct xfs_dinode *dip;
> +
> +		error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
> +				       &bp, XBF_TRYLOCK, 0);

So now, when we have tens of thousands of inodes in flushing state,
we'll hammer the buffer cache doing lookups to determine the state
of the buffer. That's a large amount of additional runtime overhead
that is unnecessary - this is only needed at unmount, according to
the problem description.

> +		if (error) {
> +			rval = XFS_ITEM_FLUSHING;
> +			goto out_unlock;
> +		}

If we are stuck in a shutdown situation, then xfs_imap_to_bp() will
detect a shutdown and return -EIO here. So this doesn't work for an
unmount with a stuck inode in a shutdown situation.

> +
> +		if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> +			rval = XFS_ITEM_FLUSHING;
> +			xfs_buf_relse(bp);
> +			goto out_unlock;
> +		}

So if the last write of the buffer was OK, do nothing? How does
that get the inode unlocked if we've failed to flush at unmount?

> +
> +		if (!xfs_buf_delwri_queue(bp, buffer_list))
> +			rval = XFS_ITEM_FLUSHING;
> +
> +		xfs_buf_relse(bp);
>  		goto out_unlock;


Ok, I'm pretty sure that this just addresses a symptom of the
underlying problem rather than solving the root cause. e.g. dquot flushing
has exactly the same problem.

The underlying problem is that when the buffer was failed, the
callbacks attached to the buffer were not run. Hence the inodes
locked and attached to the buffer were not aborted and unlocked
when the buffer IO was failed. That's the underlying problem that
needs fixing - this cannot be solved sanely by trying to guess why
an inode is flush locked when walking the AIL....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-30  0:01 ` Dave Chinner
@ 2016-09-30  9:14   ` Carlos Maiolino
  2016-09-30  9:37   ` Carlos Maiolino
  2016-09-30 14:54   ` Brian Foster
  2 siblings, 0 replies; 11+ messages in thread
From: Carlos Maiolino @ 2016-09-30  9:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Sep 30, 2016 at 10:01:47AM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 03:03:19PM +0200, Carlos Maiolino wrote:
> > I have been working on a bug regarding the xfs fail_at_unmount configuration,
> > where, even though the option is enabled, an unmount attempt will still hang
> > if the AIL buf items are locked as a result of a previous failed attempt to
> > flush these items.
> > 
> > Currently, if there is a failure while trying to flush inode buffers to disk,
> > these items are kept in the AIL with FLUSHING status and with the locks held,
> > making them impossible to retry. Whether during unmount, where they should be
> > retried and 'failed', or on a thin provisioned device whose pool is extended
> > to accommodate these not-yet-flushed items, xfs is unable to retry them,
> > since they are already locked.
> 
> [....]
> 
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 892c2ac..cce0373 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -517,7 +517,26 @@ xfs_inode_item_push(
> >  	 * the AIL.
> >  	 */
> >  	if (!xfs_iflock_nowait(ip)) {
> > -		rval = XFS_ITEM_FLUSHING;
> > +		int error;
> > +		struct xfs_dinode *dip;
> > +
> > +		error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
> > +				       &bp, XBF_TRYLOCK, 0);
> 
> So now, when we have tens of thousands of inodes in flushing state,
> we'll hammer the buffer cache doing lookups to determine the state
> of the buffer. That's a large amount of additional runtime overhead
> that is unnecessary - this is only needed at unmount, according to
> the problem description.
> 
> > +		if (error) {
> > +			rval = XFS_ITEM_FLUSHING;
> > +			goto out_unlock;
> > +		}
> 
> If we are stuck in a shutdown situation, then xfs_imap_to_bp() will
> detect a shutdown and return -EIO here. So this doesn't work for an
> unmount with a stuck inode in a shutdown situation.
> 
> > +
> > +		if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > +			rval = XFS_ITEM_FLUSHING;
> > +			xfs_buf_relse(bp);
> > +			goto out_unlock;
> > +		}
> 
> So if the last write of the buffer was OK, do nothing? How does
> that get the inode unlocked if we've failed to flush at unmount?
> 
> > +
> > +		if (!xfs_buf_delwri_queue(bp, buffer_list))
> > +			rval = XFS_ITEM_FLUSHING;
> > +
> > +		xfs_buf_relse(bp);
> >  		goto out_unlock;
> 
> 
> Ok, I'm pretty sure that this just addresses a symptom of the
> underlying problem rather than solving the root cause. e.g. dquot flushing
> has exactly the same problem.
> 
> The underlying problem is that when the buffer was failed, the
> callbacks attached to the buffer were not run. Hence the inodes
> locked and attached to the buffer were not aborted and unlocked
> when the buffer IO was failed. That's the underlying problem that
> needs fixing - this cannot be solved sanely by trying to guess why
> an inode is flush locked when walking the AIL....
> 

Thanks for the comments, I shall work in that direction now.

Cheers

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Carlos


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-30  0:01 ` Dave Chinner
  2016-09-30  9:14   ` Carlos Maiolino
@ 2016-09-30  9:37   ` Carlos Maiolino
  2016-09-30 14:54   ` Brian Foster
  2 siblings, 0 replies; 11+ messages in thread
From: Carlos Maiolino @ 2016-09-30  9:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 892c2ac..cce0373 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -517,7 +517,26 @@ xfs_inode_item_push(
> >  	 * the AIL.
> >  	 */
> >  	if (!xfs_iflock_nowait(ip)) {
> > -		rval = XFS_ITEM_FLUSHING;
> > +		int error;
> > +		struct xfs_dinode *dip;
> > +
> > +		error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
> > +				       &bp, XBF_TRYLOCK, 0);
> 
> So now, when we have tens of thousands of inodes in flushing state,
> we'll hammer the buffer cache doing lookups to determine the state
> of the buffer. That's a large amount of additional runtime overhead
> that is unnecessary - this is only needed at unmount, according to
> the problem description.

Ah, btw, this is not only needed at unmount time: if we have such buffer IO
failures and the sysadmin extends the underlying thin pool, the buffers will
still be hanging around without progress, never being retried.

But anyway, I'll work in the direction you pointed.

cheers

> 
> > +		if (error) {
> > +			rval = XFS_ITEM_FLUSHING;
> > +			goto out_unlock;
> > +		}
> 
> If we are stuck in a shutdown situation, then xfs_imap_to_bp() will
> detect a shutdown and return -EIO here. So this doesn't work for an
> unmount with a stuck inode in a shutdown situation.
> 
> > +
> > +		if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > +			rval = XFS_ITEM_FLUSHING;
> > +			xfs_buf_relse(bp);
> > +			goto out_unlock;
> > +		}
> 
> So if the last write of the buffer was OK, do nothing? How does
> that get the inode unlocked if we've failed to flush at unmount?
> 
> > +
> > +		if (!xfs_buf_delwri_queue(bp, buffer_list))
> > +			rval = XFS_ITEM_FLUSHING;
> > +
> > +		xfs_buf_relse(bp);
> >  		goto out_unlock;
> 
> 
> Ok, I'm pretty sure that this just addresses a symptom of the
> underlying problem rather than solving the root cause. e.g. dquot flushing
> has exactly the same problem.
> 
> The underlying problem is that when the buffer was failed, the
> callbacks attached to the buffer were not run. Hence the inodes
> locked and attached to the buffer were not aborted and unlocked
> when the buffer IO was failed. That's the underlying problem that
> needs fixing - this cannot be solved sanely by trying to guess why
> an inode is flush locked when walking the AIL....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Carlos


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-30  0:01 ` Dave Chinner
  2016-09-30  9:14   ` Carlos Maiolino
  2016-09-30  9:37   ` Carlos Maiolino
@ 2016-09-30 14:54   ` Brian Foster
  2016-10-01  0:21     ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2016-09-30 14:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Carlos Maiolino, linux-xfs

On Fri, Sep 30, 2016 at 10:01:47AM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 03:03:19PM +0200, Carlos Maiolino wrote:
> > I have been working on a bug regarding the xfs fail_at_unmount configuration,
> > where, even though the option is enabled, an unmount attempt will still hang
> > if the AIL buf items are locked as a result of a previous failed attempt to
> > flush these items.
> > 
> > Currently, if there is a failure while trying to flush inode buffers to disk,
> > these items are kept in the AIL with FLUSHING status and with the locks held,
> > making them impossible to retry. Whether during unmount, where they should be
> > retried and 'failed', or on a thin provisioned device whose pool is extended
> > to accommodate these not-yet-flushed items, xfs is unable to retry them,
> > since they are already locked.
> 
> [....]
> 

So this was originally written simply as a hack/experiment to test a
theory about what could be going wrong here, based on Carlos'
investigation so far into the issue. It wasn't really intended to be
posted as a proposed fix, so I'm going to skip over the details...

...
> 
> Ok, I'm pretty sure that this just addresses a symptom of the
> underlying problem, not solve the root cause. e.g. dquot flushing
> has exactly the same problem.
> 

Agree.

> The underlying problem is that when the buffer was failed, the
> callbacks attached to the buffer were not run. Hence the inodes
> locked and attached to the buffer were not aborted and unlocked
> when the buffer IO was failed. That's the underlying problem that
> needs fixing - this cannot be solved sanely by trying to guess why
> an inode is flush locked when walking the AIL....
> 

Are you making the assumption that the filesystem is already shutdown in
this scenario? I assume so, otherwise I'm not sure simply running the
callbacks (that remove items from the AIL) is really appropriate.

My _unconfirmed suspicion_ is that the core problem is that any log
items that are flush locked upon AIL push remain flush locked in the
event of I/O error (independent of fs shutdown, which is not guaranteed
after a metadata I/O error). E.g., consider the case of a transient
error or error configuration that expects more than one retry cycle
through xfsaild. IOW, the current AIL error retry mechanism doesn't work
for flush locked items.

(FWIW, another experiment I was thinking about was an optional
error-specific log item callback that would be specified and invoked in
the event of any metadata I/O error to release things like flush locks
and prepare for another retry, but I think that is complicated by the
fact that the in-core struct has already been flushed to the buffer.)

But stepping back from all that, this is just a theory based on Carlos'
investigation so far and could easily be wrong. I haven't debugged the
issue and so I'm not totally confident I actually understand the root
cause. As such, I'm not sure it's worth getting further into the weeds
until we have a root cause analysis.

Carlos,

Unless you completely disagree and are confident you have a handle on
the problem (in which case you can just ignore me and send a new patch
:), I'd suggest that what we probably need here is a detailed writeup of
the root cause. E.g., a step by step progression of what happens to the
log item in relation to the I/O errors and shutdown (if one actually
occurs), preferably backed by tracepoint data. Just my .02.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-30 14:54   ` Brian Foster
@ 2016-10-01  0:21     ` Dave Chinner
  2016-10-01 13:27       ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2016-10-01  0:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: Carlos Maiolino, linux-xfs

On Fri, Sep 30, 2016 at 10:54:13AM -0400, Brian Foster wrote:
> On Fri, Sep 30, 2016 at 10:01:47AM +1000, Dave Chinner wrote:
> > On Thu, Sep 29, 2016 at 03:03:19PM +0200, Carlos Maiolino wrote:
> > > I have been working on a bug regarding the xfs fail_at_unmount configuration,
> > > where, even though the option is enabled, an unmount attempt will still hang
> > > if the AIL buf items are locked as a result of a previous failed attempt to
> > > flush these items.
> > > 
> > > Currently, if there is a failure while trying to flush inode buffers to disk,
> > > these items are kept in the AIL with FLUSHING status and with the locks held,
> > > making them impossible to retry. Whether during unmount, where they should be
> > > retried and 'failed', or on a thin provisioned device whose pool is extended
> > > to accommodate these not-yet-flushed items, xfs is unable to retry them,
> > > since they are already locked.
> > 
> > [....]
> > 
> 
> So this was originally written simply as a hack/experiment to test a
> theory about what could be going wrong here, based on Carlos'
> investigation so far into the issue. It wasn't really intended to be
> posted as a proposed fix, so I'm going to skip over the details...
> 
> ...
> > 
> > Ok, I'm pretty sure that this just addresses a symptom of the
> > underlying problem rather than solving the root cause. e.g. dquot flushing
> > has exactly the same problem.
> > 
> 
> Agree.
> 
> > The underlying problem is that when the buffer was failed, the
> > callbacks attached to the buffer were not run. Hence the inodes
> > locked and attached to the buffer were not aborted and unlocked
> > when the buffer IO was failed. That's the underlying problem that
> > needs fixing - this cannot be solved sanely by trying to guess why
> > an inode is flush locked when walking the AIL....
> > 
> 
> Are you making the assumption that the filesystem is already shutdown in
> this scenario? I assume so, otherwise I'm not sure simply running the
> callbacks (that remove items from the AIL) is really appropriate.

No, I'm not. If we end up permanently failing /any/ metadata
writeback, we need to trigger a shutdown. That's the whole point of
these failure configurations - to avoid retrying forever in the hope
that the error goes away and we can avoid a shutdown.

IOWs, on permanent inode buffer writeback failure, we should be:

	a) shutting down the filesystem if it isn't already; and
	b) calling xfs_iflush_abort() to unlock the attached inodes
	   and remove them from the AIL.
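
In rough code - a completely untested sketch, with a made-up helper
name, and glossing over the careful li_bio_list handling the real
iodone callbacks have to do - that failure leg would be something
like:

STATIC void
xfs_buf_inode_io_fail(
	struct xfs_buf		*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_log_item	*lip;

	/* a) permanent writeback failure: make sure we shut down */
	if (!XFS_FORCED_SHUTDOWN(mp))
		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);

	/*
	 * b) abort every inode attached to the buffer; this drops the
	 * flush locks and removes the items from the AIL, so xfsaild
	 * can't get stuck spinning on them at unmount.
	 */
	while ((lip = bp->b_fspriv) != NULL) {
		struct xfs_inode_log_item	*iip;

		iip = container_of(lip, struct xfs_inode_log_item,
				   ili_item);
		bp->b_fspriv = lip->li_bio_list;
		lip->li_bio_list = NULL;
		xfs_iflush_abort(iip->ili_inode, false);
	}
}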

> My _unconfirmed suspicion_ is that the core problem is that any
> log items that are flush locked upon AIL push remain flush locked
> in the event of I/O error (independent of fs shutdown, which is
> not guaranteed after a metadata I/O error).  E.g., consider the
> case of a transient error or error configuration that expects more
> than one retry cycle through xfsaild. IOW, the current AIL error
> retry mechanism doesn't work for flush locked items.

So, that's different to the context the patch description set, which
was "testing fail_at_unmount configurations". i.e. there were hangs
at unmount because failed IO was not unlocking the inodes that were
attached. This "abort on permanent failure" problem exists and needs
fixing.

What you're now saying is that there is /another/ problem to do with
the retry side of the code, where an item that is flush locked is not
correctly resubmitted for IO, because the item is still flush
locked.

Ok, yes, now that it's clear this isn't actually addressing
fail_at_unmount permanent errors, I can see that there is a retry
problem here. However, I don't think this patch is
the correct way to address that, either, because the buffer lookup
cost is prohibitive.

I think the correct thing to do - again - is to run the iodone
callbacks on a "retriable error". i.e. on an error that needs to be
retried, we:

	- run callbacks on each object attached to buffer
	- each object gets marked as "write fail"
	- each object remains flush locked
	- the position in the AIL of each object is unchanged
	- buffer remains marked with _XBF_WRITE_FAIL

We cannot unlock the flush lock until the state on disk
matches the state we flushed from the in-memory object to the buffer
that is failing, hence we need some other method for indicating a
new submission is required.

If we then find an object flush locked on an AIL push, we:

	- check the object for the "write fail" flag. If not set,
	  there's nothing more to do.
	- look up the underlying buffer, check the _XBF_WRITE_FAIL
	  flag. If that's not set, we've raced with another IO doing
	  a retry or new submission. Nothing more to do.
	- with the buffer now locked, walk all the objects attached
	  and clear the "write fail" flags. This is needed to
	  prevent attempting to resubmit the buffer for every object
	  attached to the buffer as the AIL push walks down the
	  list.
	- add the buffer to the delwri list.

At that point, the buffer will be retried, and if it succeeds the
inodes will be unlocked and on we'll go. If the buffer runs out of
retries, then we need to shut down the filesystem and abort all the
objects attached to it.....
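
To make the push side concrete, here's an equally untested sketch.
XFS_LI_FAILED is a made-up flag name (set by the iodone callback on a
retriable error), and the locking around li_flags is elided:

STATIC uint
xfs_inode_item_push_failed(
	struct xfs_inode_log_item	*iip,
	struct list_head		*buffer_list)
{
	struct xfs_inode	*ip = iip->ili_inode;
	struct xfs_dinode	*dip;
	struct xfs_buf		*bp;
	struct xfs_log_item	*lip;
	int			error;

	/* not marked "write fail" by the iodone callback: nothing to do */
	if (!(iip->ili_item.li_flags & XFS_LI_FAILED))
		return XFS_ITEM_FLUSHING;

	error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
			       &bp, XBF_TRYLOCK, 0);
	if (error)
		return XFS_ITEM_FLUSHING;

	/* buffer no longer failed: raced with a retry or new submission */
	if (!(bp->b_flags & XBF_WRITE_FAIL)) {
		xfs_buf_relse(bp);
		return XFS_ITEM_FLUSHING;
	}

	/*
	 * Clear the failed state on everything attached so the AIL walk
	 * queues this buffer for retry once, not once per attached inode.
	 */
	for (lip = bp->b_fspriv; lip; lip = lip->li_bio_list)
		lip->li_flags &= ~XFS_LI_FAILED;

	xfs_buf_delwri_queue(bp, buffer_list);
	xfs_buf_relse(bp);
	return XFS_ITEM_FLUSHING;
}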

i.e. there are multiple problems here that need fixing, not one...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-10-01  0:21     ` Dave Chinner
@ 2016-10-01 13:27       ` Brian Foster
  2016-10-03 14:03         ` Carlos Maiolino
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2016-10-01 13:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Carlos Maiolino, linux-xfs

On Sat, Oct 01, 2016 at 10:21:04AM +1000, Dave Chinner wrote:
> On Fri, Sep 30, 2016 at 10:54:13AM -0400, Brian Foster wrote:
> > On Fri, Sep 30, 2016 at 10:01:47AM +1000, Dave Chinner wrote:
> > > On Thu, Sep 29, 2016 at 03:03:19PM +0200, Carlos Maiolino wrote:
> > > > I have been working on a bug regarding the xfs fail_at_unmount configuration,
> > > > where, even though the option is enabled, an unmount attempt will still hang
> > > > if the AIL buf items are locked as a result of a previous failed attempt to
> > > > flush these items.
> > > > 
> > > > Currently, if there is a failure while trying to flush inode buffers to disk,
> > > > these items are kept in the AIL with FLUSHING status and with the locks held,
> > > > making them impossible to retry. Whether during unmount, where they should be
> > > > retried and 'failed', or on a thin provisioned device whose pool is extended
> > > > to accommodate these not-yet-flushed items, xfs is unable to retry them,
> > > > since they are already locked.
> > > 
> > > [....]
> > > 
> > 
> > So this was originally written simply as a hack/experiment to test a
> > theory about what could be going wrong here, based on Carlos'
> > investigation so far into the issue. It wasn't really intended to be
> > posted as a proposed fix, so I'm going to skip over the details...
> > 
> > ...
> > > 
> > > Ok, I'm pretty sure that this just addresses a symptom of the
> > > underlying problem rather than solving the root cause. e.g. dquot flushing
> > > has exactly the same problem.
> > > 
> > 
> > Agree.
> > 
> > > The underlying problem is that when the buffer was failed, the
> > > callbacks attached to the buffer were not run. Hence the inodes
> > > locked and attached to the buffer were not aborted and unlocked
> > > when the buffer IO was failed. That's the underlying problem that
> > > needs fixing - this cannot be solved sanely by trying to guess why
> > > an inode is flush locked when walking the AIL....
> > > 
> > 
> > Are you making the assumption that the filesystem is already shutdown in
> > this scenario? I assume so, otherwise I'm not sure simply running the
> > callbacks (that remove items from the AIL) is really appropriate.
> 
> No, I'm not. If we end up permanently failing /any/ metadata
> writeback, we need to trigger a shutdown. That's the whole point of
> these failure configurations - to avoid retrying forever in the hope
> that the error goes away and we can avoid a shutdown.
> 
> IOWs, on permanent inode buffer writeback failure, we should be:
> 
> 	a) shutting down the filesystem if it isn't already; and
> 	b) calling xfs_iflush_abort() to unlock the attached inodes
> 	   and remove them from the AIL.
> 

Ok. All I'm suggesting here is that the filesystem is shutdown before we
run the callbacks in this (error) case, which is what you're saying, so
I think we're saying the same thing. ;)

> > My _unconfirmed suspicion_ is that the core problem is that any
> > log items that are flush locked upon AIL push remain flush locked
> > in the event of I/O error (independent of fs shutdown, which is
> > not guaranteed after a metadata I/O error).  E.g., consider the
> > case of a transient error or error configuration that expects more
> > than one retry cycle through xfsaild. IOW, the current AIL error
> > retry mechanism doesn't work for flush locked items.
> 
> So, that's different to the context the patch description set, which
> was "testing fail_at_unmount configurations". i.e. there were hangs
> at unmount because failed IO was not unlocking the inodes that were
> attached. This "abort on permanent failure" problem exists and needs
> fixing.
> 
> What you're now saying is that there is /another/ problem to do with
> the retry side of the code, where an item that is flush locked is not
> correctly resubmitted for IO, because the item is still flush
> locked.
> 

Well, that's the only issue that I'm actually aware of and that was the
target of the experiment. What I'm saying, specifically, is Carlos' test
reproduces an unmount hang due to the broken AIL retry problem for flush
locked buffers. For example:

- Filesystem is humming along normally until the dm device shuts down,
  as expected, and metadata writeback begins to fail.
- One such failed item is a flush locked inode. We hit
  xfs_buf_iodone_callback_error() and do not shut down the fs because
  we're using the default error config (retry forever).
- The inode log item ends up back on the AIL in the flush locked state.
- An unmount occurs, some other item is retried, fails and triggers the
  shutdown on umount check in *_iodone_callback_error().
- The unmount remains hung, despite the shutdown, because the AIL is
  spinning on the flush locked inode item. It can't abort the item
  because it cannot be retried, which would detect the shutdown state in
  the iodone handler, issue the callbacks and remove it from the AIL.

This is the state I seemed to reproduce in the handful of times I ran
Carlos' reproducer. E.g., the filesystem does actually shut down
correctly on unmount, but the umount was still hung spinning on a flush
locked item. Hence the experiment to retry the apparently stuck inode
buffer. This tests the broken retry theory and seemingly works around
(not fixes) the problem.

With that (hopefully clarified), you're losing me with regard to the
separate "abort on permanent failure" problem. It's not clear to me
whether we're confusing the root cause of the same fundamental issue or
there is just another problem that I'm just not aware of. Assuming the
latter, could you please elaborate on that issue to clarify it from the
one described above?

> Ok, yes, now that it's clear this isn't actually addressing
> fail_at_unmount permanent errors, I can see that there is a retry
> problem here. However, I don't think this patch is
> the correct way to address that, either, because the buffer lookup
> cost is prohibitive.
> 

As mentioned, this was just a test to see if a resubmission would
unstick the fs and if so, start a discussion on a proper fix.

> I think the correct thing to do - again - is to run the iodone
> callbacks on a "retriable error". i.e. on an error that needs to be
> retried, we:
> 
> 	- run callbacks on each object attached to buffer
> 	- each object gets marked as "write fail"
> 	- each object remains flush locked
> 	- the position in the AIL of each object is unchanged
> 	- buffer remains marked with _XBF_WRITE_FAIL
> 
> We cannot unlock the flush lock until the state on disk
> matches the state we flushed from the in-memory object to the buffer
> that is failing, hence we need some other method for indicating a
> new submission is required.
> 
> If we then find an object flush locked on an AIL push, we:
> 
> 	- check the object for the "write fail" flag. If not set,
> 	  there's nothing more to do.
> 	- look up the underlying buffer, check the _XBF_WRITE_FAIL
> 	  flag. If that's not set, we've raced with another IO doing
> 	  a retry or new submission. Nothing more to do.
> 	- with the buffer now locked, walk all the objects attached
> 	  and clear the "write fail" flags. This is needed to
> 	  prevent attempting to resubmit the buffer for every object
> 	  attached to the buffer as the AIL push walks down the
> 	  list.
> 	- add the buffer to the delwri list.
> 
> At that point, the buffer will be retried, and if it succeeds the
> inodes will be unlocked and on we'll go. If the buffer runs out of
> retries, then we need to shut down the filesystem and abort all the
> objects attached to it.....
> 

This all sounds like a good idea to me, but I'll defer to Carlos and
whether he groks it enough to cook up a prototype. :)

Brian

> i.e. there are multiple problems here that need fixing, not one...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-10-01 13:27       ` Brian Foster
@ 2016-10-03 14:03         ` Carlos Maiolino
  2016-10-03 22:08           ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maiolino @ 2016-10-03 14:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

Hi folks,

First, I apologize if I caused any confusion about the root cause here. I will
try to clear it up a bit in the lines below, and I'll try to write a more
detailed e-mail later explaining what I have so far; I just need to condense it
and put my investigation notes into some form everybody else will understand.


> > What you're now saying is that there is /another/ problem to do with
> > the retry side of the code, where an item that is flush locked is not
> > correctly resubmitted for IO, because the item is still flush
> > locked.
> > 
> 
> Well, that's the only issue that I'm actually aware of and that was the
> target of the experiment. What I'm saying, specifically, is Carlos' test
> reproduces an unmount hang due to the broken AIL retry problem for flush
> locked buffers. For example:

This is basically all we are seeing, and it is causing the problem: items that
are flush locked are not being resubmitted for IO, and, in the case of unmount,
xfsaild keeps spinning on these locked items since they can't be resubmitted.

Just another example where this bug can be triggered: still using dm-thin,
but, instead of unmounting the filesystem and expecting it to shut down, we
extend the POOL device, allowing more space so the items can be properly
flushed.

In this case, the following steps can trigger the problem (using fake values
just to make it simpler to understand):

- Create a dm-thin POOL (ex: 100MB)
- Create a dm-thin device overcommitting the POOL (ex: 200MB)
- mkfs.xfs the dm-thin device
- snapshot the dm-thin device
- Mount the snapshot and fill it (using a single dd works here)

- The dd command will fail with EIO; dm-thin will report it back to XFS, which
will trigger metadata IO errors and keep the AIL items locked, 'theoretically'
trying to resubmit the buffers.

- Extend dm-thin POOL to something big enough (let's say 500MB)

At this point, the metadata IO errors stop being reported by XFS, but even
though we now have enough space, the AIL flush locked items are never
resubmitted; xfsaild keeps spinning on them and they are never flushed.


Here, though, we might see two different behaviors:

- xfs will lock up spinning in xfsaild if anyone tries to unmount the
  filesystem, the same behavior as before (without extending the dm-thin POOL)

- If any IO is issued to the filesystem after extending the pool (a simple
  `touch /mnt/foobar` works), everything goes back to normal: items in the AIL
  are resubmitted, and the filesystem can be unmounted normally.

I will try to explain it with call paths later, as soon as I reorganize my logs
here.



> 
> - Filesystem is humming along normally until the dm device shuts down,
>   as expected, and metadata writeback begins to fail.
> - One such failed item is a flush locked inode. We hit
>   xfs_buf_iodone_callback_error() and do not shut down the fs because
>   we're using the default error config (retry forever).
> - The inode log item ends up back on the AIL in the flush locked state.
> - An unmount occurs, some other item is retried, fails and triggers the
>   shutdown on umount check in *_iodone_callback_error().
> - The unmount remains hung, despite the shutdown, because the AIL is
>   spinning on the flush locked inode item. It can't abort the item
>   because it cannot be retried, which would detect the shutdown state in
>   the iodone handler, issue the callbacks and remove it from the AIL.
> 
> This is the state I seemed to reproduce in the handful of times I ran
> Carlos' reproducer. E.g., the filesystem does actually shut down
> correctly on unmount, but the umount was still hung spinning on a flush
> locked item. Hence the experiment to retry the apparently stuck inode
> buffer. This tests the broken retry theory and seemingly works around
> (not fixes) the problem.
> 
> With that (hopefully clarified), you're losing me with regard to the
> separate "abort on permanent failure" problem. It's not clear to me
> whether we're confusing the root cause of the same fundamental issue or
> there is just another problem that I'm just not aware of. Assuming the
> latter, could you please elaborate on that issue to clarify it from the
> one described above?
> 
> > Ok, yes, now that it's clear this isn't actually addressing
> > fail_at_unmount permanent errors, I can see that there is a retry
> > problem here. However, I don't think this patch is
> > the correct way to address that, either, because the buffer lookup
> > cost is prohibitive.
> > 
> 
> As mentioned, this was just a test to see if a resubmission would
> unstick the fs and if so, start a discussion on a proper fix.
> 
> > I think the correct thing to do - again - is to run the iodone
> > callbacks on a "retriable error". i.e. on an error that needs to be
> > retried, we:
> > 
> > 	- run callbacks on each object attached to buffer
> > 	- each object gets marked as "write fail"
> > 	- each object remains flush locked
> > 	- the position in the AIL of each object is unchanged
> > 	- buffer remains marked with _XBF_WRITE_FAIL
> > 
> > We cannot unlock the flush lock until the state on disk
> > matches the state we flushed from the in-memory object to the buffer
> > that is failing, hence we need some other method for indicating a
> > new submission is required.
> > 
> > If we then find an object flush locked on an AIL push, we:
> > 
> > 	- check the object for the "write fail" flag. If not set,
> > 	  there's nothing more to do.
> > 	- look up the underlying buffer, check the _XBF_WRITE_FAIL
> > 	  flag. If that's not set, we've raced with another IO doing
> > 	  a retry or new submission. Nothing more to do.
> > 	- with the buffer now locked, walk all the objects attached
> > 	  and clear the "write fail" flags. This is needed to
> > 	  prevent attempting to resubmit the buffer for every object
> > 	  attached to the buffer as the AIL push walks down the
> > 	  list.
> > 	- add the buffer to the delwri list.
> > 
> > At that point, the buffer will be retried, and if it succeeds the
> > inodes will be unlocked and on we'll go. If the buffer runs out of
> > retries, then we need to shut down the filesystem and abort all the
> > objects attached to it.....
> > 
> 
> This all sounds like a good idea to me, but I'll defer to Carlos and
> whether he groks it enough to cook up a prototype. :)
> 
> Brian
> 
> > i.e. there are multiple problems here that need fixing, not one...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

-- 
Carlos


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-10-03 14:03         ` Carlos Maiolino
@ 2016-10-03 22:08           ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2016-10-03 22:08 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Brian Foster, linux-xfs

On Mon, Oct 03, 2016 at 04:03:54PM +0200, Carlos Maiolino wrote:
> > > What you're now saying is that there is /another/ problem to do with
> > > the retry side of the code, where an item that is flush locked is not
> > > correctly resubmitted for IO, because the item is still flush
> > > locked.
> > > 
> > 
> > Well, that's the only issue that I'm actually aware of and that was the
> > target of the experiment. What I'm saying, specifically, is Carlos' test
> > reproduces an unmount hang due to the broken AIL retry problem for flush
> > locked buffers. For example:
> 
> This is basically all we are seeing, and it is causing the problem: items that
> are flush locked are not being resubmitted for IO, and, in the case of unmount,
> xfsaild keeps spinning on these locked items since they can't be resubmitted.

Hi Carlos,

I understand the issue - I don't need more explanation of the
problem. Unfortunately, having the problem redescribed to me tends
to indicate that my explanation of the underlying cause and solution
was not sufficient for people to understand what I was describing...
:/ That's my fault for not explaining it clearly enough....

The basic problem is that we are not propagating error state from
the buffer to the objects attached to the buffer. In the case of
inodes and dquots, the primary item being controlled by the AIL is
the inode/dquot, not the buffer being submitted for IO. Hence for
the AIL to be able to issue a retry on IO failure on an inode or
dquot, we have to first mark those items as having failed. This is
necessary so the AIL push code can explicitly handle this state, and
not have to guess or infer whether a retry may be required or not.

> Just another example where this bug can be triggered: still using dm-thin,
> but, instead of unmounting the filesystem and expecting it to shut down, we
> extend the POOL device, allowing more space so the items can be properly
> flushed.
> 
> In this case, the following steps can trigger the problem (using fake values
> just to make it simpler to understand):
> 
> - Create a dm-thin POOL (ex: 100MB)
> - Create a dm-thin device overcommitting the POOL (ex: 200MB)
> - mkfs.xfs the dm-thin device
> - snapshot the dm-thin device
> - Mount the snapshot and fill it (using a single dd works here)
> 
> - The dd command will fail with EIO; dm-thin will report it back to XFS, which
> will trigger metadata IO errors and keep the AIL items locked, 'theoretically'
> trying to resubmit the buffers.

XFS resubmits buffer items that fail from the AIL just fine - we've
got tests that trigger retries and shutdowns on this sort of
behaviour. The point is that these buffers are the primary items that the
AIL is aware of, whilst the AIL is not aware of inode/dquot buffers
at all(*).

What we need to do is propagate the buffer error to the primary
items that are attached to the buffer and processed by the
bp->b_iodone callback. When we get an error and we need to retry,
the callback can mark the inodes/dquots with a "write fail" state
without unlocking the inode.  We can then check this state when we
find the item flush locked while trying to push it. This takes all
the guess work out of determining if a retry is needed for any given
inode item.  Retry at this point is then simple - similar to the
first patch that you sent.

However, if the buffer has expired all its retries and we now
permanently fail the buffer, we also have to permanently fail
all the inodes/dquots. This involves triggering a shutdown, then
aborting all the inode/dquots on the buffer. This will remove the
inodes/dquots from the AIL and hence prevent the inodes/dquots from
getting stuck flush locked in the AIL after a shutdown has occurred.

IOWs, to be able to implement a reliable retry behaviour we first
need to have a solid error propagation model. Error propagation
comes first, retry logic comes second.
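
As a rough, untested sketch of that propagation step - XFS_LI_FAILED
is again a hypothetical flag, and the locking around li_flags is
elided - the retriable-error leg of the iodone callback would do
something like:

STATIC void
xfs_buf_iodone_mark_failed(
	struct xfs_buf		*bp)
{
	struct xfs_log_item	*lip;

	/*
	 * Retriable error: mark each attached item as failed without
	 * running its done callback and without touching its flush
	 * lock, so its position in the AIL is unchanged and the push
	 * code can recognise the state later.
	 */
	for (lip = bp->b_fspriv; lip; lip = lip->li_bio_list)
		lip->li_flags |= XFS_LI_FAILED;

	/* the push side cross-checks this before requeuing the buffer */
	bp->b_flags |= XBF_WRITE_FAIL;
}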

Cheers,

Dave.

(*) gross simplification - it is aware of them in certain situations such
as allocation and freeing, but these aren't relevant to the problem
at hand.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] [RFC] Release buffer locks in case of IO error
  2016-09-29 13:00 Carlos Maiolino
@ 2016-09-29 13:04 ` Carlos Maiolino
  0 siblings, 0 replies; 11+ messages in thread
From: Carlos Maiolino @ 2016-09-29 13:04 UTC (permalink / raw)
  To: linux-xfs

Sorry, git send-email returned an error while sending this, but it looks like
it actually worked, so sorry for the several posts of the same patch ='(



-- 
Carlos


* [PATCH] [RFC] Release buffer locks in case of IO error
@ 2016-09-29 13:00 Carlos Maiolino
  2016-09-29 13:04 ` Carlos Maiolino
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maiolino @ 2016-09-29 13:00 UTC (permalink / raw)
  To: linux-xfs

I have been working on a bug regarding the xfs fail_at_unmount configuration,
where, even though the option is enabled, an unmount attempt will still hang
if the AIL buf items are locked as a result of a previous failed attempt to
flush these items.

Currently, if there is a failure while trying to flush inode buffers to disk,
these items are kept in the AIL with FLUSHING status and with the locks held,
making them impossible to retry. Whether during unmount, where they should be
retried and 'failed', or on a thin provisioned device whose pool is extended
to accommodate these not-yet-flushed items, xfs is unable to retry them,
since they are already locked.

Brian came up with a suggestion for how to fix it: releasing the locks in case
of a failed write attempt.

I know there are some other people also involved in this problem, so I think
it's a good idea to send Brian's patch here and get some more comments about it.

I'm keeping Brian's signed-off-by, since he wrote the whole patch.


Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 892c2ac..cce0373 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -517,7 +517,26 @@ xfs_inode_item_push(
 	 * the AIL.
 	 */
 	if (!xfs_iflock_nowait(ip)) {
-		rval = XFS_ITEM_FLUSHING;
+		int error;
+		struct xfs_dinode *dip;
+
+		error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, &dip,
+				       &bp, XBF_TRYLOCK, 0);
+		if (error) {
+			rval = XFS_ITEM_FLUSHING;
+			goto out_unlock;
+		}
+
+		if (!(bp->b_flags & XBF_WRITE_FAIL)) {
+			rval = XFS_ITEM_FLUSHING;
+			xfs_buf_relse(bp);
+			goto out_unlock;
+		}
+
+		if (!xfs_buf_delwri_queue(bp, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+
+		xfs_buf_relse(bp);
 		goto out_unlock;
 	}
 
-- 
2.7.4



Thread overview: 11+ messages
2016-09-29 13:03 [PATCH] [RFC] Release buffer locks in case of IO error Carlos Maiolino
2016-09-30  0:01 ` Dave Chinner
2016-09-30  9:14   ` Carlos Maiolino
2016-09-30  9:37   ` Carlos Maiolino
2016-09-30 14:54   ` Brian Foster
2016-10-01  0:21     ` Dave Chinner
2016-10-01 13:27       ` Brian Foster
2016-10-03 14:03         ` Carlos Maiolino
2016-10-03 22:08           ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2016-09-29 13:00 Carlos Maiolino
2016-09-29 13:04 ` Carlos Maiolino
