* [PATCH 0/2] Resubmit items failed during writeback
@ 2017-05-11 13:57 Carlos Maiolino
  2017-05-11 13:57 ` [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure Carlos Maiolino
  2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  0 siblings, 2 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-11 13:57 UTC (permalink / raw)
  To: linux-xfs


This patchset aims to fix the AIL items retry problem that has been on my
plate for a while, and it's based on a discussion I had with Dave off-list.

This basically consists of:

1- A mechanism to properly report writeback failures back to the items
   attached to the failed buffer
2- Use this mechanism to detect such failures and properly resubmit the
   items that failed previously

It is worth mentioning that this same problem also exists in the dquot code,
but the fix is almost identical to the inode one.

I am not submitting a fix for dquot yet, to avoid the need to create new
versions (vX) of both patches; once we agree on the solution, I'll submit the
dquot fix.

Reviews are welcome :-)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure
  2017-05-11 13:57 [PATCH 0/2] Resubmit items failed during writeback Carlos Maiolino
@ 2017-05-11 13:57 ` Carlos Maiolino
  2017-05-11 16:51   ` Brian Foster
  2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  1 sibling, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-11 13:57 UTC (permalink / raw)
  To: linux-xfs

To be able to resubmit a log item for IO, we need a way to mark an item
as failed if, for any reason, the buffer to which the item is attached
fails during writeback.

Add a new log item callback to be used after an IO completion failure
and make the needed cleanups.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 27 ++++++++++++++++++++++++++-
 fs/xfs/xfs_trans.h    |  5 ++++-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0306168..026aed4 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1051,6 +1051,24 @@ xfs_buf_do_callbacks(
 	}
 }
 
+STATIC void
+xfs_buf_do_callbacks_fail(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *next;
+	unsigned int		bflags = bp->b_flags;
+
+	lip = bp->b_fspriv;
+	while (lip != NULL) {
+		next = lip->li_bio_list;
+
+		if (lip->li_ops->iop_error)
+			lip->li_ops->iop_error(lip, bflags);
+
+		lip = next;
+	}
+}
+
 static bool
 xfs_buf_iodone_callback_error(
 	struct xfs_buf		*bp)
@@ -1153,8 +1171,15 @@ xfs_buf_iodone_callbacks(
 	 * to run callbacks after failure processing is done so we
 	 * detect that and take appropriate action.
 	 */
-	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
+	if (bp->b_error && xfs_buf_iodone_callback_error(bp)) {
+
+		/*
+		 * We've got an error during buffer writeback, we need to notify
+		 * the items in the buffer
+		 */
+		xfs_buf_do_callbacks_fail(bp);
 		return;
+	}
 
 	/*
 	 * Successful IO or permanent error. Either way, we can clear the
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index a07acbf..c57181a 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -65,10 +65,12 @@ typedef struct xfs_log_item {
 
 #define	XFS_LI_IN_AIL	0x1
 #define XFS_LI_ABORTED	0x2
+#define XFS_LI_FAILED	0x3
 
 #define XFS_LI_FLAGS \
 	{ XFS_LI_IN_AIL,	"IN_AIL" }, \
-	{ XFS_LI_ABORTED,	"ABORTED" }
+	{ XFS_LI_ABORTED,	"ABORTED" }, \
+	{ XFS_LI_FAILED,	"FAILED" }
 
 struct xfs_item_ops {
 	void (*iop_size)(xfs_log_item_t *, int *, int *);
@@ -79,6 +81,7 @@ struct xfs_item_ops {
 	void (*iop_unlock)(xfs_log_item_t *);
 	xfs_lsn_t (*iop_committed)(xfs_log_item_t *, xfs_lsn_t);
 	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
+	void (*iop_error)(xfs_log_item_t *, unsigned int bflags);
 };
 
 void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 13:57 [PATCH 0/2] Resubmit items failed during writeback Carlos Maiolino
  2017-05-11 13:57 ` [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure Carlos Maiolino
@ 2017-05-11 13:57 ` Carlos Maiolino
  2017-05-11 15:32   ` Eric Sandeen
                     ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-11 13:57 UTC (permalink / raw)
  To: linux-xfs

When a buffer fails during writeback, the inode items in it are kept
flush locked and are never resubmitted due to the flush lock, so, if any
buffer fails to be written, the items in the AIL are never written to
disk and never unlocked.

This causes a filesystem to be unmountable due to these flush locked
items in the AIL, but it also causes the items in the AIL to never be
written back, even when the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device and keeps spinning in xfsaild (when the 'retry
forever' configuration is set).

At this point, the filesystem is unmountable because of the flush locked
items in the AIL, but worse, the items in the AIL are never retried at all
(since xfs_inode_item_push() skips items that are flush locked), even if
the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that failed previously,
using the infrastructure provided by the previous patch.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

This same problem is also possible in the dquot code, but the fix is
almost identical.

I am not submitting a fix for dquot yet, to avoid the need to create new
versions (vX) of both patches; once we agree on the solution, I'll submit
the dquot fix.

 fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 08cb7d1..583fa9e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -475,6 +475,21 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
+STATIC void
+xfs_inode_item_error(
+	struct xfs_log_item	*lip,
+	unsigned int		bflags)
+{
+
+	/*
+	 * The buffer writeback containing this inode has been failed
+	 * mark it as failed and unlock the flush lock, so it can be retried
+	 * again
+	 */
+	if (bflags & XBF_WRITE_FAIL)
+		lip->li_flags |= XFS_LI_FAILED;
+}
+
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -517,8 +532,44 @@ xfs_inode_item_push(
 	 * the AIL.
 	 */
 	if (!xfs_iflock_nowait(ip)) {
+		if (lip->li_flags & XFS_LI_FAILED) {
+
+			struct xfs_dinode	*dip;
+			struct xfs_log_item	*next;
+			int			error;
+
+			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
+					       &dip, &bp, XBF_TRYLOCK, 0);
+
+			if (error) {
+				rval = XFS_ITEM_FLUSHING;
+				goto out_unlock;
+			}
+
+			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
+				rval = XFS_ITEM_FLUSHING;
+				xfs_buf_relse(bp);
+				goto out_unlock;
+			}
+
+			while (lip != NULL) {
+				next = lip->li_bio_list;
+
+				if (lip->li_flags & XFS_LI_FAILED)
+					lip->li_flags &= XFS_LI_FAILED;
+				lip = next;
+			}
+
+			if (!xfs_buf_delwri_queue(bp, buffer_list))
+				rval = XFS_ITEM_FLUSHING;
+
+			xfs_buf_relse(bp);
+			goto out_unlock;
+		}
+
 		rval = XFS_ITEM_FLUSHING;
 		goto out_unlock;
+
 	}
 
 	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
@@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_committing = xfs_inode_item_committing
+	.iop_committing = xfs_inode_item_committing,
+	.iop_error	= xfs_inode_item_error
 };
 
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
@ 2017-05-11 15:32   ` Eric Sandeen
  2017-05-12  8:19     ` Carlos Maiolino
  2017-05-11 17:08   ` Brian Foster
  2017-05-17  0:57   ` Dave Chinner
  2 siblings, 1 reply; 44+ messages in thread
From: Eric Sandeen @ 2017-05-11 15:32 UTC (permalink / raw)
  To: Carlos Maiolino, linux-xfs

On 5/11/17 8:57 AM, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
> 
> This causes a filesystem to be unmountable due these items flush locked

I think you mean "not unmountable?"

> in AIL, but this also causes the items in AIL to never be written back,
> even when the IO device comes back to normal.
> 
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
> 
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
> 
> At this point, the filesystem is unmountable because of the flush locked

(or cannot be unmounted ...)

> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.

Can you turn that into an xfstest?

> This patch fixes both cases, retrying any item that has been failed
> previously, using the infra-structure provided by the previous patch.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> 
> This same problem is also possible in dquot code, but the fix is almost
> identical.
> 
> I am not submitting a fix for dquot yet to avoid the need to create VX for both
> patches, once we agree with the solution, I'll submit a fix to dquot.
> 
>  fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 08cb7d1..583fa9e 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -475,6 +475,21 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> +STATIC void
> +xfs_inode_item_error(
> +	struct xfs_log_item	*lip,
> +	unsigned int		bflags)
> +{
> +
> +	/*
> +	 * The buffer writeback containing this inode has been failed
> +	 * mark it as failed and unlock the flush lock, so it can be retried
> +	 * again
> +	 */
> +	if (bflags & XBF_WRITE_FAIL)
> +		lip->li_flags |= XFS_LI_FAILED;
> +}
> +
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -517,8 +532,44 @@ xfs_inode_item_push(
>  	 * the AIL.
>  	 */
>  	if (!xfs_iflock_nowait(ip)) {

Some comments about what this new block is for would be helpful, I think.
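
e.g. something along the lines of (just a wording suggestion):

		/*
		 * The inode is flush locked, so we can't flush it from here; but
		 * if a previous writeback of its buffer failed, grab the buffer
		 * and re-queue it for delayed write instead of just giving up.
		 */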

> +		if (lip->li_flags & XFS_LI_FAILED) {
> +
> +			struct xfs_dinode	*dip;
> +			struct xfs_log_item	*next;
> +			int			error;
> +
> +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> +					       &dip, &bp, XBF_TRYLOCK, 0);
> +
> +			if (error) {
> +				rval = XFS_ITEM_FLUSHING;
> +				goto out_unlock;
> +			}
> +
> +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> +				rval = XFS_ITEM_FLUSHING;
> +				xfs_buf_relse(bp);
> +				goto out_unlock;
> +			}
> +
> +			while (lip != NULL) {
> +				next = lip->li_bio_list;
> +
> +				if (lip->li_flags & XFS_LI_FAILED)
> +					lip->li_flags &= XFS_LI_FAILED;

This confuses me.  If XFS_LI_FAILED is set, set XFS_LI_FAILED?
I assume you meant to clear it?
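
i.e. presumably:

				lip->li_flags &= ~XFS_LI_FAILED;
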

> +				lip = next;
> +			}
> +

			/* Add this buffer back to the delayed write list */

> +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> +				rval = XFS_ITEM_FLUSHING;

> +			xfs_buf_relse(bp);

So by here we have an implicit rval = XFS_ITEM_SUCCESS, I guess?

(I wonder about setting FLUSHING at the top, and setting SUCCESS
only if everything in here works out - but maybe that would be
more confusing)

Anyway that's my first drive-by review, I'm not sure I have all the state & 
locking clear in my head for this stuff.

Thanks,
-Eric

> +			goto out_unlock;
> +		}
> +
>  		rval = XFS_ITEM_FLUSHING;
>  		goto out_unlock;
> +
>  	}
>  
>  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_unlock	= xfs_inode_item_unlock,
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
> -	.iop_committing = xfs_inode_item_committing
> +	.iop_committing = xfs_inode_item_committing,
> +	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure
  2017-05-11 13:57 ` [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure Carlos Maiolino
@ 2017-05-11 16:51   ` Brian Foster
  2017-05-12  8:41     ` Carlos Maiolino
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-11 16:51 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Thu, May 11, 2017 at 03:57:32PM +0200, Carlos Maiolino wrote:
> To be able to resubmit an log item for IO, we need a way to mark an item
> as failed, if, for any reason the buffer which the item belonged to
> failed during writeback.
> 
> Add a new log item callback to be used after an IO completion failure
> and make the needed clean ups.
> 

I think the commit log description should call out the problem with
flush locked items (i.e., that we will currently never resubmit their
buffers) as the motivation for the patch.

> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 27 ++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans.h    |  5 ++++-
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 0306168..026aed4 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1051,6 +1051,24 @@ xfs_buf_do_callbacks(
>  	}
>  }
>  
> +STATIC void
> +xfs_buf_do_callbacks_fail(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip, *next;
> +	unsigned int		bflags = bp->b_flags;
> +
> +	lip = bp->b_fspriv;
> +	while (lip != NULL) {
> +		next = lip->li_bio_list;
> +
> +		if (lip->li_ops->iop_error)
> +			lip->li_ops->iop_error(lip, bflags);

I still don't see why we need the iop callback here. This type of
callback is typically required when an operation requires some action on
the specific subtype (e.g., _inode_item_error() does one particular
thing to an inode, buf_item_error() might do something different to an
xfs_buf, etc.), but that doesn't appear to be the case here. Indeed, the
next patch shows that the inode item error handler does:

	lip->li_flags |= XFS_LI_FAILED;

... which doesn't even require dereferencing the inode_log_item type.

So can we just set the flag directly from xfs_buf_do_callbacks_fail()
and kill off ->iop_error() until/unless we come to a point where it is
actually needed?
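
Something like this (completely untested, just to illustrate the idea):

	STATIC void
	xfs_buf_do_callbacks_fail(
		struct xfs_buf		*bp)
	{
		struct xfs_log_item	*lip;

		/* mark every log item attached to the failed buffer */
		for (lip = bp->b_fspriv; lip; lip = lip->li_bio_list)
			lip->li_flags |= XFS_LI_FAILED;
	}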

> +
> +		lip = next;
> +	}
> +}
> +
>  static bool
>  xfs_buf_iodone_callback_error(
>  	struct xfs_buf		*bp)
> @@ -1153,8 +1171,15 @@ xfs_buf_iodone_callbacks(
>  	 * to run callbacks after failure processing is done so we
>  	 * detect that and take appropriate action.
>  	 */
> -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> +	if (bp->b_error && xfs_buf_iodone_callback_error(bp)) {
> +
> +		/*
> +		 * We've got an error during buffer writeback, we need to notify
> +		 * the items in the buffer
> +		 */
> +		xfs_buf_do_callbacks_fail(bp);

xfs_buf_iodone_callback_error() returns true when the I/O has failed. It
also returns true when it has submitted the internal retry[1], however,
so I don't think this is quite correct. We should only mark items as
failed once this internal sequence has completed and the buffer is no
longer under I/O. As it is, this looks like it would mark the items as
failed while they are still under the internal retry I/O (and possibly
leave them marked as such if this retry actually succeeds..?).

Side note: I really dislike the semantics of
xfs_buf_iodone_callback_error() in that I have to re-read it and its only
call site to work out what the return value means every time I look at
it. Could we add a comment above that function explaining that the return
value dictates whether to run callbacks, while we're working in this
area?

Brian

[1] Recall that every buffer submitted through xfsaild() is quietly
retried one time in the event of I/O error (via XBF_WRITE_FAIL) before
the buffer is unlocked and effectively released back to the AIL. This is
presumably to help deal with transient errors. It is only when this
second I/O fails that the buffer is unlocked and it is up to the AIL to
resubmit the buffer on a subsequent push.
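
Written out, the sequence I'm describing above is roughly:

	/*
	 * 1. First completion with b_error set and XBF_WRITE_FAIL clear:
	 *    set XBF_WRITE_FAIL, clear the error and resubmit the buffer
	 *    internally; return true, so no callbacks run yet.
	 * 2. The internal retry also fails: honour the error configuration;
	 *    if we are still allowed to retry, unlock the buffer back to
	 *    the AIL and return true.
	 * 3. Permanent error (or shutdown): return false so that
	 *    xfs_buf_do_callbacks() runs and the buffer is torn down.
	 *
	 * It is only at step 2 and later that the items should be marked
	 * failed.
	 */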

>  		return;
> +	}
>  
>  	/*
>  	 * Successful IO or permanent error. Either way, we can clear the
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index a07acbf..c57181a 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -65,10 +65,12 @@ typedef struct xfs_log_item {
>  
>  #define	XFS_LI_IN_AIL	0x1
>  #define XFS_LI_ABORTED	0x2
> +#define XFS_LI_FAILED	0x3
>  
>  #define XFS_LI_FLAGS \
>  	{ XFS_LI_IN_AIL,	"IN_AIL" }, \
> -	{ XFS_LI_ABORTED,	"ABORTED" }
> +	{ XFS_LI_ABORTED,	"ABORTED" }, \
> +	{ XFS_LI_FAILED,	"FAILED" }
>  
>  struct xfs_item_ops {
>  	void (*iop_size)(xfs_log_item_t *, int *, int *);
> @@ -79,6 +81,7 @@ struct xfs_item_ops {
>  	void (*iop_unlock)(xfs_log_item_t *);
>  	xfs_lsn_t (*iop_committed)(xfs_log_item_t *, xfs_lsn_t);
>  	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
> +	void (*iop_error)(xfs_log_item_t *, unsigned int bflags);
>  };
>  
>  void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  2017-05-11 15:32   ` Eric Sandeen
@ 2017-05-11 17:08   ` Brian Foster
  2017-05-12  8:21     ` Carlos Maiolino
  2017-05-17  0:57   ` Dave Chinner
  2 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-11 17:08 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
> 
> This causes a filesystem to be unmountable due these items flush locked
> in AIL, but this also causes the items in AIL to never be written back,
> even when the IO device comes back to normal.
> 
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
> 
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
> 
> At this point, the filesystem is unmountable because of the flush locked
> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.
> 
> This patch fixes both cases, retrying any item that has been failed
> previously, using the infra-structure provided by the previous patch.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> 
> This same problem is also possible in dquot code, but the fix is almost
> identical.
> 
> I am not submitting a fix for dquot yet to avoid the need to create VX for both
> patches, once we agree with the solution, I'll submit a fix to dquot.
> 
>  fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 08cb7d1..583fa9e 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -475,6 +475,21 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> +STATIC void
> +xfs_inode_item_error(
> +	struct xfs_log_item	*lip,
> +	unsigned int		bflags)
> +{
> +
> +	/*
> +	 * The buffer writeback containing this inode has been failed
> +	 * mark it as failed and unlock the flush lock, so it can be retried
> +	 * again
> +	 */
> +	if (bflags & XBF_WRITE_FAIL)
> +		lip->li_flags |= XFS_LI_FAILED;
> +}
> +
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -517,8 +532,44 @@ xfs_inode_item_push(
>  	 * the AIL.
>  	 */
>  	if (!xfs_iflock_nowait(ip)) {
> +		if (lip->li_flags & XFS_LI_FAILED) {
> +
> +			struct xfs_dinode	*dip;
> +			struct xfs_log_item	*next;
> +			int			error;
> +
> +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> +					       &dip, &bp, XBF_TRYLOCK, 0);
> +
> +			if (error) {
> +				rval = XFS_ITEM_FLUSHING;
> +				goto out_unlock;
> +			}
> +
> +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> +				rval = XFS_ITEM_FLUSHING;
> +				xfs_buf_relse(bp);
> +				goto out_unlock;
> +			}
> +
> +			while (lip != NULL) {
> +				next = lip->li_bio_list;
> +
> +				if (lip->li_flags & XFS_LI_FAILED)
> +					lip->li_flags &= XFS_LI_FAILED;

Eric already pointed out that you probably intend to clear the flag
here..?

> +				lip = next;
> +			}

This whole hunk might be better off in a helper function (with the
comment Eric suggested as well).

Those points and the ->iop_error() thing aside, this otherwise seems Ok
to me.

Brian

> +
> +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> +				rval = XFS_ITEM_FLUSHING;
> +
> +			xfs_buf_relse(bp);
> +			goto out_unlock;
> +		}
> +
>  		rval = XFS_ITEM_FLUSHING;
>  		goto out_unlock;
> +
>  	}
>  
>  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_unlock	= xfs_inode_item_unlock,
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
> -	.iop_committing = xfs_inode_item_committing
> +	.iop_committing = xfs_inode_item_committing,
> +	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 15:32   ` Eric Sandeen
@ 2017-05-12  8:19     ` Carlos Maiolino
  0 siblings, 0 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-12  8:19 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

Ahoj!

On Thu, May 11, 2017 at 10:32:16AM -0500, Eric Sandeen wrote:
> On 5/11/17 8:57 AM, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> > 
> > This causes a filesystem to be unmountable due these items flush locked
> 
> I think you mean "not unmountable?"
> 
Yeah, my bad, fast typing slow thinking :)

> > in AIL, but this also causes the items in AIL to never be written back,
> > even when the IO device comes back to normal.
> > 
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> > 
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> > 
> > At this point, the filesystem is unmountable because of the flush locked
> 
> (or cannot be unmounted ...)
> 

*nod*

> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
> 
> Can you turn that into an xfstest?
>

Yeah, I am planning to do that; it is really not that hard to turn into an
xfstests case, although it will have to go into the dangerous sub-set, since
it will lock up the filesystem.
 
> > This patch fixes both cases, retrying any item that has been failed
> > previously, using the infra-structure provided by the previous patch.
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> > 
> > This same problem is also possible in dquot code, but the fix is almost
> > identical.
> > 
> > I am not submitting a fix for dquot yet to avoid the need to create VX for both
> > patches, once we agree with the solution, I'll submit a fix to dquot.
> > 
> >  fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 53 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 08cb7d1..583fa9e 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -475,6 +475,21 @@ xfs_inode_item_unpin(
> >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> >  }
> >  
> > +STATIC void
> > +xfs_inode_item_error(
> > +	struct xfs_log_item	*lip,
> > +	unsigned int		bflags)
> > +{
> > +
> > +	/*
> > +	 * The buffer writeback containing this inode has been failed
> > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > +	 * again
> > +	 */
> > +	if (bflags & XBF_WRITE_FAIL)
> > +		lip->li_flags |= XFS_LI_FAILED;
> > +}
> > +
> >  STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> > @@ -517,8 +532,44 @@ xfs_inode_item_push(
> >  	 * the AIL.
> >  	 */
> >  	if (!xfs_iflock_nowait(ip)) {
> 
> Some comments about what this new block is for would be helpful, I think.
> 

/me replies on Brian's comment

> > +		if (lip->li_flags & XFS_LI_FAILED) {
> > +
> > +			struct xfs_dinode	*dip;
> > +			struct xfs_log_item	*next;
> > +			int			error;
> > +
> > +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> > +					       &dip, &bp, XBF_TRYLOCK, 0);
> > +
> > +			if (error) {
> > +				rval = XFS_ITEM_FLUSHING;
> > +				goto out_unlock;
> > +			}
> > +
> > +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > +				rval = XFS_ITEM_FLUSHING;
> > +				xfs_buf_relse(bp);
> > +				goto out_unlock;
> > +			}
> > +
> > +			while (lip != NULL) {
> > +				next = lip->li_bio_list;
> > +
> > +				if (lip->li_flags & XFS_LI_FAILED)
> > +					lip->li_flags &= XFS_LI_FAILED;
> 
> This confuses me.  If XFS_LI_FAILED is set, set XFS_LI_FAILED?
> I assume you meant to clear it?
>

*nod* fix going to V2
 
> > +				lip = next;
> > +			}
> > +
> 
> 			/* Add this buffer back to the delayed write list */
> 
> > +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> > +				rval = XFS_ITEM_FLUSHING;
> 
> > +			xfs_buf_relse(bp);
> 
> So by here we have an implicit rval = XFS_ITEM_SUCCESS, I guess?
> 

AFAIK this is the current behavior of xfs_inode_item_push() without my patch;
at first glance it looked weird to me too, but then I just decided to leave it
as-is.
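
For reference, if I'm reading the current code correctly, rval starts as

	uint			rval = XFS_ITEM_SUCCESS;

at the top of xfs_inode_item_push() and is only changed on the failure and
flushing paths, so reaching out_unlock without touching it reports success.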

> (I wonder about setting FLUSHING at the top, and setting SUCCESS
> only if everything in here works out - but maybe that would be
> more confusing)
> 
> Anyway that's my first drive-by review, I'm not sure I have all the state & 
> locking clear in my head for this stuff.
> 

I really appreciate the review, thanks for your time :)

> Thanks,
> -Eric
> 
> > +			goto out_unlock;
> > +		}
> > +
> >  		rval = XFS_ITEM_FLUSHING;
> >  		goto out_unlock;
> > +
> >  	}
> >  
> >  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> >  	.iop_unlock	= xfs_inode_item_unlock,
> >  	.iop_committed	= xfs_inode_item_committed,
> >  	.iop_push	= xfs_inode_item_push,
> > -	.iop_committing = xfs_inode_item_committing
> > +	.iop_committing = xfs_inode_item_committing,
> > +	.iop_error	= xfs_inode_item_error
> >  };
> >  
> >  
> > 

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 17:08   ` Brian Foster
@ 2017-05-12  8:21     ` Carlos Maiolino
  2017-05-12 11:37       ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-12  8:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, May 11, 2017 at 01:08:05PM -0400, Brian Foster wrote:
> On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> > 
> > This causes a filesystem to be unmountable due these items flush locked
> > in AIL, but this also causes the items in AIL to never be written back,
> > even when the IO device comes back to normal.
> > 
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> > 
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> > 
> > At this point, the filesystem is unmountable because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
> > 
> > This patch fixes both cases, retrying any item that has been failed
> > previously, using the infra-structure provided by the previous patch.
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> > 
> > This same problem is also possible in dquot code, but the fix is almost
> > identical.
> > 
> > I am not submitting a fix for dquot yet to avoid the need to create VX for both
> > patches, once we agree with the solution, I'll submit a fix to dquot.
> > 
> >  fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 53 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 08cb7d1..583fa9e 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -475,6 +475,21 @@ xfs_inode_item_unpin(
> >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> >  }
> >  
> > +STATIC void
> > +xfs_inode_item_error(
> > +	struct xfs_log_item	*lip,
> > +	unsigned int		bflags)
> > +{
> > +
> > +	/*
> > +	 * The buffer writeback containing this inode has been failed
> > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > +	 * again
> > +	 */
> > +	if (bflags & XBF_WRITE_FAIL)
> > +		lip->li_flags |= XFS_LI_FAILED;
> > +}
> > +
> >  STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> > @@ -517,8 +532,44 @@ xfs_inode_item_push(
> >  	 * the AIL.
> >  	 */
> >  	if (!xfs_iflock_nowait(ip)) {
> > +		if (lip->li_flags & XFS_LI_FAILED) {
> > +
> > +			struct xfs_dinode	*dip;
> > +			struct xfs_log_item	*next;
> > +			int			error;
> > +
> > +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> > +					       &dip, &bp, XBF_TRYLOCK, 0);
> > +
> > +			if (error) {
> > +				rval = XFS_ITEM_FLUSHING;
> > +				goto out_unlock;
> > +			}
> > +
> > +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > +				rval = XFS_ITEM_FLUSHING;
> > +				xfs_buf_relse(bp);
> > +				goto out_unlock;
> > +			}
> > +
> > +			while (lip != NULL) {
> > +				next = lip->li_bio_list;
> > +
> > +				if (lip->li_flags & XFS_LI_FAILED)
> > +					lip->li_flags &= XFS_LI_FAILED;
> 
> Eric already pointed out that you probably intend to clear the flag
> here..?
> 

Yup, my bad.

> > +				lip = next;
> > +			}
> 
> This whole hunk might be better off in a helper function (with the
> comment Eric suggested as well).
>

Agreed, a helper function can be used here and in the dquot code as well, so
I'll try to make it common code for both the dquot and inode items.
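
Something along these lines is what I have in mind (name and exact placement
are just illustrative, completely untested):

	/*
	 * The caller holds the locked buffer backing a previously failed
	 * item; clear XFS_LI_FAILED on everything attached to it and put
	 * the buffer back on the delwri list so it gets retried.
	 */
	STATIC bool
	xfs_item_resubmit_failed_buffer(
		struct xfs_log_item	*lip,
		struct xfs_buf		*bp,
		struct list_head	*buffer_list)
	{
		struct xfs_log_item	*next;

		for (; lip != NULL; lip = next) {
			next = lip->li_bio_list;
			lip->li_flags &= ~XFS_LI_FAILED;
		}

		return xfs_buf_delwri_queue(bp, buffer_list);
	}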
 
> Those points and the ->iop_error() thing aside, this otherwise seems Ok
> to me.
>

 
> Brian
> 
> > +
> > +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> > +				rval = XFS_ITEM_FLUSHING;
> > +
> > +			xfs_buf_relse(bp);
> > +			goto out_unlock;
> > +		}
> > +
> >  		rval = XFS_ITEM_FLUSHING;
> >  		goto out_unlock;
> > +
> >  	}
> >  
> >  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> >  	.iop_unlock	= xfs_inode_item_unlock,
> >  	.iop_committed	= xfs_inode_item_committed,
> >  	.iop_push	= xfs_inode_item_push,
> > -	.iop_committing = xfs_inode_item_committing
> > +	.iop_committing = xfs_inode_item_committing,
> > +	.iop_error	= xfs_inode_item_error
> >  };
> >  
> >  
> > -- 
> > 2.9.3
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure
  2017-05-11 16:51   ` Brian Foster
@ 2017-05-12  8:41     ` Carlos Maiolino
  2017-05-12 11:37       ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-12  8:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Howdy!

On Thu, May 11, 2017 at 12:51:24PM -0400, Brian Foster wrote:
> On Thu, May 11, 2017 at 03:57:32PM +0200, Carlos Maiolino wrote:
> > To be able to resubmit an log item for IO, we need a way to mark an item
> > as failed, if, for any reason the buffer which the item belonged to
> > failed during writeback.
> > 
> > Add a new log item callback to be used after an IO completion failure
> > and make the needed clean ups.
> > 
> 
> I think the commit log description should call out the problem with
> flush locked items (i.e., that we will currently never resubmit their
> buffers) as the motiviation for the patch.
>

I believe this patch is more of a generic way to handle failed writebacks
than something specific to the never-resubmitted buffers. Although I agree
that the main reason for this patch is the failed buffer resubmission, I
don't think this patch deals with it directly; that's why I didn't mention
it in this patch, but in patch two.

I have no objection to mentioning it here, though, if required.

> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> >  fs/xfs/xfs_buf_item.c | 27 ++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans.h    |  5 ++++-
> >  2 files changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 0306168..026aed4 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1051,6 +1051,24 @@ xfs_buf_do_callbacks(
> >  	}
> >  }
> >  
> > +STATIC void
> > +xfs_buf_do_callbacks_fail(
> > +	struct xfs_buf		*bp)
> > +{
> > +	struct xfs_log_item	*lip, *next;
> > +	unsigned int		bflags = bp->b_flags;
> > +
> > +	lip = bp->b_fspriv;
> > +	while (lip != NULL) {
> > +		next = lip->li_bio_list;
> > +
> > +		if (lip->li_ops->iop_error)
> > +			lip->li_ops->iop_error(lip, bflags);
> 
> I still don't see why we need the iop callback here. This type of
> callback is typically required when an operation requires some action on
> the specific subtype (e.g., _inode_item_error() does one particular
> thing to an inode, buf_item_error() might do something different to an
> xfs_buf, etc.), but that doesn't appear to be the case here. Indeed, the
> next patch shows that the inode item error handler does:
> 
> 	lip->li_flags |= XFS_LI_FAILED;
> 
> ... which doesn't even require to dereference the inode_log_item type.
> 
> So can we just set the flag directly from xfs_buf_do_callbacks_fail()
> and kill of ->iop_error() until/unless we come to a point where it is
> actually needed?

Well, I really have no preference here; this could also be handled directly
in xfs_inode_callback_error(), but it seems like the idea was to keep those
changes in item-type specific callbacks. I won't oppose any approach; at this
point in time, I just want to have this fixed :)

Although, after giving it some thought, I tend to agree that leaving it in
the callback error handler for all of them looks more correct, and more
convenient, but that's just my opinion.

> 
> > +
> > +		lip = next;
> > +	}
> > +}
> > +
> >  static bool
> >  xfs_buf_iodone_callback_error(
> >  	struct xfs_buf		*bp)
> > @@ -1153,8 +1171,15 @@ xfs_buf_iodone_callbacks(
> >  	 * to run callbacks after failure processing is done so we
> >  	 * detect that and take appropriate action.
> >  	 */
> > -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> > +	if (bp->b_error && xfs_buf_iodone_callback_error(bp)) {
> > +
> > +		/*
> > +		 * We've got an error during buffer writeback, we need to notify
> > +		 * the items in the buffer
> > +		 */
> > +		xfs_buf_do_callbacks_fail(bp);
> 
> xfs_buf_iodone_callback_error() returns true when the I/O has failed. It
> also returns true when it has submitted the internal retry[1], however,
> so I don't think this is quite correct. We should only mark items as
> failed once this internal sequence has completed and the buffer is no
> longer under I/O. As it is, this looks like it would mark the items as
> failed while they are still under the internal retry I/O (and possibly
> leave them marked as such if this retry actually succeeds..?).
> 

> Side note: I really dislike the semantics of
> xfs_buf_iodone_callback_error() in that I have to read it and the only
> call site to re-understand what the return value means every time I look
> at it. Could we add a comment above that function that explains the
> return value dictates whether to run callbacks while we're working in
> this area?
> 
> Brian
> 
> [1] Recall that every buffer submitted through xfsaild() is quietly
> retried one time in the event of I/O error (via XBF_WRITE_FAIL) before
> the buffer is unlocked and effectively released back to the AIL. This is
> presumably to help deal with transient errors. It is only when this
> second I/O fails that the buffer is unlocked and it is up to the AIL to
> resubmit the buffer on a subsequent push.
> 
Yeah, I agree here. I wonder whether it wouldn't be a good approach to change
xfs_buf_iodone_callback_error() to return something other than a bool, so we
can properly check what happened, i.e. internal retry, retry based on
configuration, or whatever. I will write some POC for this; I just wonder,
though, whether, after having the configuration handlers available, we really
need to do this first forced retry.


Thanks for the review. I'll wait for Dave's comments before sending a V2,
though, since the callbacks idea came from him; in the meantime I'll take a
look at how I can improve the callback_error stuff.

Cheers

> >  		return;
> > +	}
> >  
> >  	/*
> >  	 * Successful IO or permanent error. Either way, we can clear the
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index a07acbf..c57181a 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -65,10 +65,12 @@ typedef struct xfs_log_item {
> >  
> >  #define	XFS_LI_IN_AIL	0x1
> >  #define XFS_LI_ABORTED	0x2
> > +#define XFS_LI_FAILED	0x3
> >  
> >  #define XFS_LI_FLAGS \
> >  	{ XFS_LI_IN_AIL,	"IN_AIL" }, \
> > -	{ XFS_LI_ABORTED,	"ABORTED" }
> > +	{ XFS_LI_ABORTED,	"ABORTED" }, \
> > +	{ XFS_LI_FAILED,	"FAILED" }
> >  
> >  struct xfs_item_ops {
> >  	void (*iop_size)(xfs_log_item_t *, int *, int *);
> > @@ -79,6 +81,7 @@ struct xfs_item_ops {
> >  	void (*iop_unlock)(xfs_log_item_t *);
> >  	xfs_lsn_t (*iop_committed)(xfs_log_item_t *, xfs_lsn_t);
> >  	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
> > +	void (*iop_error)(xfs_log_item_t *, unsigned int bflags);
> >  };
> >  
> >  void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> > -- 
> > 2.9.3
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure
  2017-05-12  8:41     ` Carlos Maiolino
@ 2017-05-12 11:37       ` Brian Foster
  0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2017-05-12 11:37 UTC (permalink / raw)
  To: linux-xfs

On Fri, May 12, 2017 at 10:41:15AM +0200, Carlos Maiolino wrote:
> Howdy!
> 
> On Thu, May 11, 2017 at 12:51:24PM -0400, Brian Foster wrote:
> > On Thu, May 11, 2017 at 03:57:32PM +0200, Carlos Maiolino wrote:
> > > To be able to resubmit an log item for IO, we need a way to mark an item
> > > as failed, if, for any reason the buffer which the item belonged to
> > > failed during writeback.
> > > 
> > > Add a new log item callback to be used after an IO completion failure
> > > and make the needed clean ups.
> > > 
> > 
> > I think the commit log description should call out the problem with
> > flush locked items (i.e., that we will currently never resubmit their
> > buffers) as the motiviation for the patch.
> >
> 
> I believe this patch is more a generic way to handle failed writebacks than
> related to the never re-submitted buffers, although, I agree that the main
> reason for this patch is the failed buffer resubmission, I don't think this
> patch deals with it directly, that's why I didn't comment it in this patch,
> but in patch two.
> 
> I have no objection in mentioning it here though if required.
> 

The commit log currently just says "To be able to resubmit an log item
for IO, ...", which is something the AIL does today. So why do we need
this patch? :)

IOWs, I'm just asking that the commit log include a couple sentences or
so on the purpose of the patch. E.g., why is what the AIL does today
insufficient? (You could point out explicitly that it doesn't handle
certain cases correctly and we need a more generic mechanism in place to
handle those cases, citing the flush lock example).

> > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf_item.c | 27 ++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_trans.h    |  5 ++++-
> > >  2 files changed, 30 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > index 0306168..026aed4 100644
> > > --- a/fs/xfs/xfs_buf_item.c
> > > +++ b/fs/xfs/xfs_buf_item.c
> > > @@ -1051,6 +1051,24 @@ xfs_buf_do_callbacks(
> > >  	}
> > >  }
> > >  
> > > +STATIC void
> > > +xfs_buf_do_callbacks_fail(
> > > +	struct xfs_buf		*bp)
> > > +{
> > > +	struct xfs_log_item	*lip, *next;
> > > +	unsigned int		bflags = bp->b_flags;
> > > +
> > > +	lip = bp->b_fspriv;
> > > +	while (lip != NULL) {
> > > +		next = lip->li_bio_list;
> > > +
> > > +		if (lip->li_ops->iop_error)
> > > +			lip->li_ops->iop_error(lip, bflags);
> > 
> > I still don't see why we need the iop callback here. This type of
> > callback is typically required when an operation requires some action on
> > the specific subtype (e.g., _inode_item_error() does one particular
> > thing to an inode, buf_item_error() might do something different to an
> > xfs_buf, etc.), but that doesn't appear to be the case here. Indeed, the
> > next patch shows that the inode item error handler does:
> > 
> > 	lip->li_flags |= XFS_LI_FAILED;
> > 
> > ... which doesn't even require to dereference the inode_log_item type.
> > 
> > So can we just set the flag directly from xfs_buf_do_callbacks_fail()
> > and kill of ->iop_error() until/unless we come to a point where it is
> > actually needed?
> 
> Well, I really have no preference here, this could also be handled directly into
> xfs_inode_callback_error(), but seems like the idea is to keep those changes
> into item-type specific callbacks, but I won't oppose to any approach, at this
> point in time, I just want to have his fixed :)
> 

Heh, I feel your pain (and sorry :P). I'd care less if we were
discussing details of a helper function or something like that, but I
really don't think we should be expanding ->li_ops for calls that don't
care about the type-specific log item or otherwise have a specific
purpose.

> Although, after giving it some thought, I tend to agree that leaving it into
> their all callback error handler looks more correct, and more convenient, but,
> that's just my opinion.
> 

AFAICT, the only side effect it has currently is to enable selective
appropriation of the flag (e.g., we only currently set it for inodes,
and soon dquots). That in and of itself seems inappropriate if we're
using a generic xfs_log_item flag. Put another way, if this is truly a
generic mechanism, then the new XFS_LI_FAILED state that we are creating
should apply generically to all xfs_log_item objects rather than
selectively to the ones where we need it to solve a specific deadlock
problem. Otherwise, anybody looking at XFS_LI_FAILED usage years down
the road is probably just going to be confused.

We've basically defined a metadata type-specific callback handler (e.g.,
xfs_inode_log_item) for something that doesn't even need to look at the
specific log item type. It looks to me like unnecessary code, an
unnecessary level of indirection and probably something that somebody
else will just end up removing after the fact.

> > 
> > > +
> > > +		lip = next;
> > > +	}
> > > +}
> > > +
> > >  static bool
> > >  xfs_buf_iodone_callback_error(
> > >  	struct xfs_buf		*bp)
> > > @@ -1153,8 +1171,15 @@ xfs_buf_iodone_callbacks(
> > >  	 * to run callbacks after failure processing is done so we
> > >  	 * detect that and take appropriate action.
> > >  	 */
> > > -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> > > +	if (bp->b_error && xfs_buf_iodone_callback_error(bp)) {
> > > +
> > > +		/*
> > > +		 * We've got an error during buffer writeback, we need to notify
> > > +		 * the items in the buffer
> > > +		 */
> > > +		xfs_buf_do_callbacks_fail(bp);
> > 
> > xfs_buf_iodone_callback_error() returns true when the I/O has failed. It
> > also returns true when it has submitted the internal retry[1], however,
> > so I don't think this is quite correct. We should only mark items as
> > failed once this internal sequence has completed and the buffer is no
> > longer under I/O. As it is, this looks like it would mark the items as
> > failed while they are still under the internal retry I/O (and possibly
> > leave them marked as such if this retry actually succeeds..?).
> > 
> 
> > Side note: I really dislike the semantics of
> > xfs_buf_iodone_callback_error() in that I have to read it and the only
> > call site to re-understand what the return value means every time I look
> > at it. Could we add a comment above that function that explains the
> > return value dictates whether to run callbacks while we're working in
> > this area?
> > 
> > Brian
> > 
> > [1] Recall that every buffer submitted through xfsaild() is quietly
> > retried one time in the event of I/O error (via XBF_WRITE_FAIL) before
> > the buffer is unlocked and effectively released back to the AIL. This is
> > presumably to help deal with transient errors. It is only when this
> > second I/O fails that the buffer is unlocked and it is up to the AIL to
> > resubmit the buffer on a subsequent push.
> > 
> Yeah, I agree here, I wonder if wouldn't be a good approach to change the
> xfs_buf_iodone_callback_error to return something other than a bool, so we can
> properly check what happened, i.e. internal retry, retry based on configuration,
> or whatever. I will write some POC for this, I just wonder though, if, after
> having the configuration handlers available, do we really need to do this first
> forced retry.
> 

Perhaps not, but that's something that should probably be evaluated
separately and after this issue is resolved.

WRT xfs_buf_iodone_callback_error(), I personally think returning void
and setting a do_callbacks bool as a parameter would be notably more
clear than the current semantics. It could be expanded in the future to
return more specific state, if necessary.

But xfs_buf_do_callbacks_fail() should probably be called directly from
xfs_buf_iodone_callback_error() before the buffer lock is released and
the buffer implicitly released back to the AIL. Otherwise, an xfsaild
race can lead to observing the log item in a spurious flushing state
after the I/O had failed, or worse, possibly set XFS_LI_FAILED on some
items after the buffer has been resubmitted.
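
To be clear about the shape I have in mind (untested, with the existing logic
elided and paraphrased from memory):

	STATIC void
	xfs_buf_iodone_callback_error(
		struct xfs_buf		*bp,
		bool			*do_callbacks)
	{
		/* ... existing internal retry / error configuration logic ... */

		/*
		 * Buffer goes back to the AIL for a later retry: mark the
		 * attached items as failed *before* dropping the buffer lock.
		 */
		xfs_buf_do_callbacks_fail(bp);
		xfs_buf_ioerror(bp, 0);
		xfs_buf_relse(bp);
		*do_callbacks = false;
	}

	void
	xfs_buf_iodone_callbacks(
		struct xfs_buf		*bp)
	{
		bool			do_callbacks = true;

		if (bp->b_error)
			xfs_buf_iodone_callback_error(bp, &do_callbacks);
		if (!do_callbacks)
			return;

		/* ... existing "successful IO or permanent error" tail ... */
	}
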

> 
> Thanks, for the review, I'll wait Dave's comments before sending a V2 though
> once the callbacks idea came from him, in the meantime I'll take a look how can
> I improve the callback_error stuff.
> 

Sounds good.

Brian

> Cheers
> 
> > >  		return;
> > > +	}
> > >  
> > >  	/*
> > >  	 * Successful IO or permanent error. Either way, we can clear the
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index a07acbf..c57181a 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -65,10 +65,12 @@ typedef struct xfs_log_item {
> > >  
> > >  #define	XFS_LI_IN_AIL	0x1
> > >  #define XFS_LI_ABORTED	0x2
> > > +#define XFS_LI_FAILED	0x3
> > >  
> > >  #define XFS_LI_FLAGS \
> > >  	{ XFS_LI_IN_AIL,	"IN_AIL" }, \
> > > -	{ XFS_LI_ABORTED,	"ABORTED" }
> > > +	{ XFS_LI_ABORTED,	"ABORTED" }, \
> > > +	{ XFS_LI_FAILED,	"FAILED" }
> > >  
> > >  struct xfs_item_ops {
> > >  	void (*iop_size)(xfs_log_item_t *, int *, int *);
> > > @@ -79,6 +81,7 @@ struct xfs_item_ops {
> > >  	void (*iop_unlock)(xfs_log_item_t *);
> > >  	xfs_lsn_t (*iop_committed)(xfs_log_item_t *, xfs_lsn_t);
> > >  	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
> > > +	void (*iop_error)(xfs_log_item_t *, unsigned int bflags);
> > >  };
> > >  
> > >  void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> > > -- 
> > > 2.9.3
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-12  8:21     ` Carlos Maiolino
@ 2017-05-12 11:37       ` Brian Foster
  2017-05-17 11:47         ` Carlos Maiolino
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-12 11:37 UTC (permalink / raw)
  To: linux-xfs

On Fri, May 12, 2017 at 10:21:35AM +0200, Carlos Maiolino wrote:
> On Thu, May 11, 2017 at 01:08:05PM -0400, Brian Foster wrote:
> > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > When a buffer has been failed during writeback, the inode items into it
> > > are kept flush locked, and are never resubmitted due the flush lock, so,
> > > if any buffer fails to be written, the items in AIL are never written to
> > > disk and never unlocked.
> > > 
> > > This causes a filesystem to be unmountable due these items flush locked
> > > in AIL, but this also causes the items in AIL to never be written back,
> > > even when the IO device comes back to normal.
> > > 
> > > I've been testing this patch with a DM-thin device, creating a
> > > filesystem larger than the real device.
> > > 
> > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > > errors from the device, and keep spinning on xfsaild (when 'retry
> > > forever' configuration is set).
> > > 
> > > At this point, the filesystem is unmountable because of the flush locked
> > > items in AIL, but worse, the items in AIL are never retried at all
> > > (once xfs_inode_item_push() will skip the items that are flush locked),
> > > even if the underlying DM-thin device is expanded to the proper size.
> > > 
> > > This patch fixes both cases, retrying any item that has been failed
> > > previously, using the infra-structure provided by the previous patch.
> > > 
> > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > > ---
> > > 
> > > This same problem is also possible in dquot code, but the fix is almost
> > > identical.
> > > 
> > > I am not submitting a fix for dquot yet to avoid the need to create VX for both
> > > patches, once we agree with the solution, I'll submit a fix to dquot.
> > > 
> > >  fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 53 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index 08cb7d1..583fa9e 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -475,6 +475,21 @@ xfs_inode_item_unpin(
> > >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> > >  }
> > >  
> > > +STATIC void
> > > +xfs_inode_item_error(
> > > +	struct xfs_log_item	*lip,
> > > +	unsigned int		bflags)
> > > +{
> > > +
> > > +	/*
> > > +	 * The buffer writeback containing this inode has been failed
> > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > +	 * again
> > > +	 */
> > > +	if (bflags & XBF_WRITE_FAIL)
> > > +		lip->li_flags |= XFS_LI_FAILED;
> > > +}
> > > +
> > >  STATIC uint
> > >  xfs_inode_item_push(
> > >  	struct xfs_log_item	*lip,
> > > @@ -517,8 +532,44 @@ xfs_inode_item_push(
> > >  	 * the AIL.
> > >  	 */
> > >  	if (!xfs_iflock_nowait(ip)) {
> > > +		if (lip->li_flags & XFS_LI_FAILED) {
> > > +
> > > +			struct xfs_dinode	*dip;
> > > +			struct xfs_log_item	*next;
> > > +			int			error;
> > > +
> > > +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> > > +					       &dip, &bp, XBF_TRYLOCK, 0);
> > > +
> > > +			if (error) {
> > > +				rval = XFS_ITEM_FLUSHING;
> > > +				goto out_unlock;
> > > +			}
> > > +
> > > +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > > +				rval = XFS_ITEM_FLUSHING;
> > > +				xfs_buf_relse(bp);
> > > +				goto out_unlock;
> > > +			}

I think I glossed over this on my first pass, but I don't think we need
to (or should) check XBF_WRITE_FAIL here or in the error handler. It's a
flag used to control the internal retry and that is kind of irrelevant
to this mechanism. Unless I'm missing something.. I don't think this
state can occur..?

Brian

> > > +
> > > +			while (lip != NULL) {
> > > +				next = lip->li_bio_list;
> > > +
> > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > +					lip->li_flags &= XFS_LI_FAILED;
> > 
> > Eric already pointed out that you probably intend to clear the flag
> > here..?
> > 
> 
> Yup, my bad.
> 
> > > +				lip = next;
> > > +			}
> > 
> > This whole hunk might be better off in a helper function (with the
> > comment Eric suggested as well).
> >
> 
> Agreed, a helper function can be used here and in dquot code as well, so I agree
> that a helper function can be useful, I'll try to make it a common code for both
> dquot and inode items.
>  
> > Those points and the ->iop_error() thing aside, this otherwise seems Ok
> > to me.
> >
> 
>  
> > Brian
> > 
> > > +
> > > +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> > > +				rval = XFS_ITEM_FLUSHING;
> > > +
> > > +			xfs_buf_relse(bp);
> > > +			goto out_unlock;
> > > +		}
> > > +
> > >  		rval = XFS_ITEM_FLUSHING;
> > >  		goto out_unlock;
> > > +
> > >  	}
> > >  
> > >  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> > >  	.iop_unlock	= xfs_inode_item_unlock,
> > >  	.iop_committed	= xfs_inode_item_committed,
> > >  	.iop_push	= xfs_inode_item_push,
> > > -	.iop_committing = xfs_inode_item_committing
> > > +	.iop_committing = xfs_inode_item_committing,
> > > +	.iop_error	= xfs_inode_item_error
> > >  };
> > >  
> > >  
> > > -- 
> > > 2.9.3
> > > 
> 
> -- 
> Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  2017-05-11 15:32   ` Eric Sandeen
  2017-05-11 17:08   ` Brian Foster
@ 2017-05-17  0:57   ` Dave Chinner
  2017-05-17 10:41     ` Carlos Maiolino
  2 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-17  0:57 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> +xfs_inode_item_error(
> +	struct xfs_log_item	*lip,
> +	unsigned int		bflags)
> +{
> +
> +	/*
> +	 * The buffer writeback containing this inode has been failed
> +	 * mark it as failed and unlock the flush lock, so it can be retried
> +	 * again
> +	 */
> +	if (bflags & XBF_WRITE_FAIL)
> +		lip->li_flags |= XFS_LI_FAILED;
> +}

I'm pretty sure we can't modify the log item state like this in
IO context because the parent object is not locked by this context.
We can modify the state during IO submission because we hold the
parent object lock, but we don't hold it here.

i.e. we can race with other log item state changes done during a
concurrent transaction (e.g. marking the log item dirty in
xfs_trans_log_inode()).

> +
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -517,8 +532,44 @@ xfs_inode_item_push(
>  	 * the AIL.
>  	 */
>  	if (!xfs_iflock_nowait(ip)) {
> +		if (lip->li_flags & XFS_LI_FAILED) {
> +
> +			struct xfs_dinode	*dip;
> +			struct xfs_log_item	*next;
> +			int			error;
> +
> +			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
> +					       &dip, &bp, XBF_TRYLOCK, 0);
> +
> +			if (error) {
> +				rval = XFS_ITEM_FLUSHING;
> +				goto out_unlock;
> +			}
> +
> +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> +				rval = XFS_ITEM_FLUSHING;
> +				xfs_buf_relse(bp);
> +				goto out_unlock;
> +			}
> +
> +			while (lip != NULL) {
> +				next = lip->li_bio_list;
> +
> +				if (lip->li_flags & XFS_LI_FAILED)
> +					lip->li_flags &= XFS_LI_FAILED;
> +				lip = next;
> +			}

Same race here - we hold the lock for this inode, but not all the
other inodes linked to the buffer.

This implies that lip->li_flags needs to use atomic bit updates so
there are no RMW update races like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-17  0:57   ` Dave Chinner
@ 2017-05-17 10:41     ` Carlos Maiolino
  2017-05-19  0:22       ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-17 10:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > +xfs_inode_item_error(
> > +	struct xfs_log_item	*lip,
> > +	unsigned int		bflags)
> > +{
> > +
> > +	/*
> > +	 * The buffer writeback containing this inode has been failed
> > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > +	 * again
> > +	 */
> > +	if (bflags & XBF_WRITE_FAIL)
> > +		lip->li_flags |= XFS_LI_FAILED;
> > +}
> 
> I'm pretty sure we can't modify the log item state like this in
> IO context because the parent object is not locked by this context.
> We can modify the state during IO submission because we hold the
> parent object lock, but we don't hold it here.
> 
> i.e. we can race with other log item state changes done during a
> concurrent transaction (e.g. marking the log item dirty in
> xfs_trans_log_inode()).
> 
> > +
> > +				if (lip->li_flags & XFS_LI_FAILED)
> > +					lip->li_flags &= XFS_LI_FAILED;
> > +				lip = next;
> > +			}
> 
> Same race here - we hold the lock for this inode, but not all the
> other inodes linked to the buffer.
> 
> This implies that lip->li_flags needs to use atomic bit updates so
> there are no RMW update races like this.
> 

I believe this refers to both cases above, i.e. both setting and clearing the
flag?

I can simply use set_bit() and clear_bit() here, if that is enough, although to
use them I'll need to change lip->li_flags to an `unsigned long`, if people are
ok with that.
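
Something like this is what I have in mind - just a rough sketch on top of
the patch above, not tested, and the bit numbers below are made up:

	/* xfs_trans.h: li_flags becomes an unsigned long holding bit numbers */
	unsigned long		li_flags;

#define	XFS_LI_IN_AIL	0
#define	XFS_LI_ABORTED	1
#define	XFS_LI_FAILED	2

	/* ->iop_error(): mark the item failed at IO completion */
	set_bit(XFS_LI_FAILED, &lip->li_flags);

	/* ->iop_push(): clear it again once the buffer is requeued */
	clear_bit(XFS_LI_FAILED, &lip->li_flags);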

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-12 11:37       ` Brian Foster
@ 2017-05-17 11:47         ` Carlos Maiolino
  0 siblings, 0 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-17 11:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

> > > > +			}
> > > > +
> > > > +			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
> > > > +				rval = XFS_ITEM_FLUSHING;
> > > > +				xfs_buf_relse(bp);
> > > > +				goto out_unlock;
> > > > +			}
> 
> I think I glossed over this on my first pass, but I don't think we need
> to (or should) check XBF_WRITE_FAIL here or in the error handler. It's a
> flag used to control the internal retry and that is kind of irrelevant
> to this mechanism. Unless I'm missing something.. I don't think this
> state can occur..?
> 

Yup, right, it is not required; I'll remove it for the next version.

Thanks for the review.

> Brian
> 
> > > > +
> > > > +			while (lip != NULL) {
> > > > +				next = lip->li_bio_list;
> > > > +
> > > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > > +					lip->li_flags &= XFS_LI_FAILED;
> > > 
> > > Eric already pointed out that you probably intend to clear the flag
> > > here..?
> > > 
> > 
> > Yup, my bad.
> > 
> > > > +				lip = next;
> > > > +			}
> > > 
> > > This whole hunk might be better off in a helper function (with the
> > > comment Eric suggested as well).
> > >
> > 
> > Agreed, a helper function can be used here and in dquot code as well, so I agree
> > that a helper function can be useful, I'll try to make it a common code for both
> > dquot and inode items.
> >  
> > > Those points and the ->iop_error() thing aside, this otherwise seems Ok
> > > to me.
> > >
> > 
> >  
> > > Brian
> > > 
> > > > +
> > > > +			if (!xfs_buf_delwri_queue(bp, buffer_list))
> > > > +				rval = XFS_ITEM_FLUSHING;
> > > > +
> > > > +			xfs_buf_relse(bp);
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > >  		rval = XFS_ITEM_FLUSHING;
> > > >  		goto out_unlock;
> > > > +
> > > >  	}
> > > >  
> > > >  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> > > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> > > >  	.iop_unlock	= xfs_inode_item_unlock,
> > > >  	.iop_committed	= xfs_inode_item_committed,
> > > >  	.iop_push	= xfs_inode_item_push,
> > > > -	.iop_committing = xfs_inode_item_committing
> > > > +	.iop_committing = xfs_inode_item_committing,
> > > > +	.iop_error	= xfs_inode_item_error
> > > >  };
> > > >  
> > > >  
> > > > -- 
> > > > 2.9.3
> > > > 
> > 
> > -- 
> > Carlos

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-17 10:41     ` Carlos Maiolino
@ 2017-05-19  0:22       ` Dave Chinner
  2017-05-19 11:27         ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-19  0:22 UTC (permalink / raw)
  To: linux-xfs

On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> Hi Dave,
> 
> On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > +xfs_inode_item_error(
> > > +	struct xfs_log_item	*lip,
> > > +	unsigned int		bflags)
> > > +{
> > > +
> > > +	/*
> > > +	 * The buffer writeback containing this inode has been failed
> > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > +	 * again
> > > +	 */
> > > +	if (bflags & XBF_WRITE_FAIL)
> > > +		lip->li_flags |= XFS_LI_FAILED;
> > > +}
> > 
> > I'm pretty sure we can't modify the log item state like this in
> > IO context because the parent object is not locked by this context.
> > We can modify the state during IO submission because we hold the
> > parent object lock, but we don't hold it here.
> > 
> > i.e. we can race with other log item state changes done during a
> > concurrent transaction (e.g. marking the log item dirty in
> > xfs_trans_log_inode()).
> > 
> > > +
> > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > +					lip->li_flags &= XFS_LI_FAILED;
> > > +				lip = next;
> > > +			}
> > 
> > Same race here - we hold the lock for this inode, but not all the
> > other inodes linked to the buffer.
> > 
> > This implies that lip->li_flags needs to use atomic bit updates so
> > there are no RMW update races like this.
> > 
> 
> I believe refers to both cases above, while setting and removing the flag?
> 
> I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> use them I'll need to change lip->li_flags size to `unsigned long` if people are
> ok with that.

Yup, and you need to change every other flag in li_flags to use
atomic bitops, too. That also means using test_bit() for value
checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
so on.

i.e. it's no good making just one flag atomic while all the other
updates use RMW accesses to change the state, because that doesn't
close the update race conditions.
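
To be concrete, all the existing accesses need converting in the same way,
e.g. roughly (sketch only, assuming li_flags is switched to an unsigned long
holding bit numbers):

	/* value checks: "if (lip->li_flags & XFS_LI_IN_AIL)" becomes */
	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags))
		...

	/* and "if (bit) { clear bit }" becomes */
	if (test_and_clear_bit(XFS_LI_IN_AIL, &lip->li_flags))
		...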

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-19  0:22       ` Dave Chinner
@ 2017-05-19 11:27         ` Brian Foster
  2017-05-19 23:39           ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-19 11:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 19, 2017 at 10:22:27AM +1000, Dave Chinner wrote:
> On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> > Hi Dave,
> > 
> > On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> > > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > > +xfs_inode_item_error(
> > > > +	struct xfs_log_item	*lip,
> > > > +	unsigned int		bflags)
> > > > +{
> > > > +
> > > > +	/*
> > > > +	 * The buffer writeback containing this inode has been failed
> > > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > > +	 * again
> > > > +	 */
> > > > +	if (bflags & XBF_WRITE_FAIL)
> > > > +		lip->li_flags |= XFS_LI_FAILED;
> > > > +}
> > > 
> > > I'm pretty sure we can't modify the log item state like this in
> > > IO context because the parent object is not locked by this context.
> > > We can modify the state during IO submission because we hold the
> > > parent object lock, but we don't hold it here.
> > > 
> > > i.e. we can race with other log item state changes done during a
> > > concurrent transaction (e.g. marking the log item dirty in
> > > xfs_trans_log_inode()).
> > > 
> > > > +
> > > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > > +					lip->li_flags &= XFS_LI_FAILED;
> > > > +				lip = next;
> > > > +			}
> > > 
> > > Same race here - we hold the lock for this inode, but not all the
> > > other inodes linked to the buffer.
> > > 
> > > This implies that lip->li_flags needs to use atomic bit updates so
> > > there are no RMW update races like this.
> > > 
> > 
> > I believe refers to both cases above, while setting and removing the flag?
> > 
> > I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> > use them I'll need to change lip->li_flags size to `unsigned long` if people are
> > ok with that.
> 
> Yup, and you need to change every other flag in li_flags to use
> atomic bitops, too. That also means using test_bit() for value
> checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
> so on.
> 
> i.e. it's no good just making one flag atomic and all the other
> updates using RMW accesses to change the state because that doesn't
> close the update race conditions.
> 

Am I following this correctly that the only potential race here looks
like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
specifically)? Note that functions like xfs_trans_log_inode() set dirty
state on the log item descriptor, which is associated with the
transaction, not the log item itself. The only other log item flag is
LI_IN_AIL and that is going to remain set during the lifetime of
LI_FAILED by design. Maybe I'm missing something else going on here..

But otherwise given the above, that this is primarily I/O error path
code, and that we presumably already have serialization mechanisms in
place in the form of parent object locks, why not just grab the inode
lock in the appropriate places?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-19 11:27         ` Brian Foster
@ 2017-05-19 23:39           ` Dave Chinner
  2017-05-20 11:46             ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-19 23:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, May 19, 2017 at 07:27:01AM -0400, Brian Foster wrote:
> On Fri, May 19, 2017 at 10:22:27AM +1000, Dave Chinner wrote:
> > On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> > > Hi Dave,
> > > 
> > > On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> > > > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > > > +xfs_inode_item_error(
> > > > > +	struct xfs_log_item	*lip,
> > > > > +	unsigned int		bflags)
> > > > > +{
> > > > > +
> > > > > +	/*
> > > > > +	 * The buffer writeback containing this inode has been failed
> > > > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > > > +	 * again
> > > > > +	 */
> > > > > +	if (bflags & XBF_WRITE_FAIL)
> > > > > +		lip->li_flags |= XFS_LI_FAILED;
> > > > > +}
> > > > 
> > > > I'm pretty sure we can't modify the log item state like this in
> > > > IO context because the parent object is not locked by this context.
> > > > We can modify the state during IO submission because we hold the
> > > > parent object lock, but we don't hold it here.
> > > > 
> > > > i.e. we can race with other log item state changes done during a
> > > > concurrent transaction (e.g. marking the log item dirty in
> > > > xfs_trans_log_inode()).
> > > > 
> > > > > +
> > > > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > > > +					lip->li_flags &= XFS_LI_FAILED;
> > > > > +				lip = next;
> > > > > +			}
> > > > 
> > > > Same race here - we hold the lock for this inode, but not all the
> > > > other inodes linked to the buffer.
> > > > 
> > > > This implies that lip->li_flags needs to use atomic bit updates so
> > > > there are no RMW update races like this.
> > > > 
> > > 
> > > I believe refers to both cases above, while setting and removing the flag?
> > > 
> > > I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> > > use them I'll need to change lip->li_flags size to `unsigned long` if people are
> > > ok with that.
> > 
> > Yup, and you need to change every other flag in li_flags to use
> > atomic bitops, too. That also means using test_bit() for value
> > checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
> > so on.
> > 
> > i.e. it's no good just making one flag atomic and all the other
> > updates using RMW accesses to change the state because that doesn't
> > close the update race conditions.
> > 
> 
> Am I following this correctly that the only potential race here looks
> like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
> specifically)?

XFS_LI_ABORTED is used during a shutdown (dirty trans cancel causes
a shutdown) so races don't really matter here.

However, the XFS_LI_IN_AIL flag is the problem because of its
serialisation requirements. That is, log IO completion can be
running asynchronously to metadata object IO completion and the
manipulation of the XFS_LI_IN_AIL flag and state is protected by the
ailp->xa_lock.

Adding new flags to the same field that can be asynchronously
updated by RMW operations outside the ailp->xa_lock will cause
problems in future. There *may not* be a problem right now, but it
is poor programming practice to have different coherency processes
for the same variable that is updated via RMW operations. In these
situations, the only safe way to update the variable is to use an
atomic operation.

i.e. I'm not looking for a specific bug in the code - I'm pointing
out a concurrent programming error that is being made.

> But otherwise given the above, that this is primarily I/O error path
> code, and that we presumably already have serialization mechanisms in
> place in the form of parent object locks, why not just grab the inode
> lock in the appropriate places?

Because we can't guarantee that we can lock the parent object in IO
completion. Deadlock is trivial: process A grabs parent object,
blocks on flush lock. IO completion holds flush lock, blocks trying
to get parent object lock.
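
i.e. roughly this interleaving, if IO completion were to take the parent
(inode) lock (illustration only):

	process A				IO completion
	---------				-------------
	xfs_ilock(ip, XFS_ILOCK_EXCL)
						<already holds ip's flush lock>
	xfs_iflock(ip)		<- blocks
						xfs_ilock(ip, ...)	<- blocks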

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-19 23:39           ` Dave Chinner
@ 2017-05-20 11:46             ` Brian Foster
  2017-05-21 23:19               ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-20 11:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> On Fri, May 19, 2017 at 07:27:01AM -0400, Brian Foster wrote:
> > On Fri, May 19, 2017 at 10:22:27AM +1000, Dave Chinner wrote:
> > > On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> > > > Hi Dave,
> > > > 
> > > > On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> > > > > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > > > > +xfs_inode_item_error(
> > > > > > +	struct xfs_log_item	*lip,
> > > > > > +	unsigned int		bflags)
> > > > > > +{
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * The buffer writeback containing this inode has been failed
> > > > > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > > > > +	 * again
> > > > > > +	 */
> > > > > > +	if (bflags & XBF_WRITE_FAIL)
> > > > > > +		lip->li_flags |= XFS_LI_FAILED;
> > > > > > +}
> > > > > 
> > > > > I'm pretty sure we can't modify the log item state like this in
> > > > > IO context because the parent object is not locked by this context.
> > > > > We can modify the state during IO submission because we hold the
> > > > > parent object lock, but we don't hold it here.
> > > > > 
> > > > > i.e. we can race with other log item state changes done during a
> > > > > concurrent transaction (e.g. marking the log item dirty in
> > > > > xfs_trans_log_inode()).
> > > > > 
> > > > > > +
> > > > > > +				if (lip->li_flags & XFS_LI_FAILED)
> > > > > > +					lip->li_flags &= XFS_LI_FAILED;
> > > > > > +				lip = next;
> > > > > > +			}
> > > > > 
> > > > > Same race here - we hold the lock for this inode, but not all the
> > > > > other inodes linked to the buffer.
> > > > > 
> > > > > This implies that lip->li_flags needs to use atomic bit updates so
> > > > > there are no RMW update races like this.
> > > > > 
> > > > 
> > > > I believe refers to both cases above, while setting and removing the flag?
> > > > 
> > > > I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> > > > use them I'll need to change lip->li_flags size to `unsigned long` if people are
> > > > ok with that.
> > > 
> > > Yup, and you need to change every other flag in li_flags to use
> > > atomic bitops, too. That also means using test_bit() for value
> > > checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
> > > so on.
> > > 
> > > i.e. it's no good just making one flag atomic and all the other
> > > updates using RMW accesses to change the state because that doesn't
> > > close the update race conditions.
> > > 
> > 
> > Am I following this correctly that the only potential race here looks
> > like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
> > specifically)?
> 
> XFS_LI_ABORTED is used during a shutdown (dirty trans cancel causes
> a shutdown) so races don't really matter here.
> 

Right.. though I don't think we should dismiss this case because in
theory, if a race causes loss of a flag or some such that happens to
affect the ability to tear down in-core structures, we could still end
up with things like memory leaks or unmount hangs on a shutdown fs.

> However, the XFS_LI_IN_AIL flag is the problem because of it's
> serialisation requirements. That is, log IO completion can be
> running asynchronously to metadata object IO completion and the
> manipulation of the XFS_LI_IN_AIL flag and state is protected by the
> ailp->xa_lock.
> 

Indeed..

> Adding new flags to the same field that can be asynchronously
> updated by RMW operations outside the ailp->xa_lock will cause
> problems in future. There *may not* be a problem right now, but it
> is poor programming practice to have different coherency processes
> for the same variable that is updated via RMW operations. In these
> situations, the only safe way to update the variable is to use an
> atomic operation.
> 

So is there a reason why we couldn't acquire ->xa_lock to fail the log
items as we would have done anyways if the metadata writeback had
succeeded and we were removing the log items from the AIL.. (e.g., why
introduce new serialization rules if we don't need to)?
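
For the sake of discussion, something like the sketch below is all I have in
mind - the names are taken from the patches above, li_ailp is assumed to point
at the AIL, and it is obviously untested:

STATIC void
xfs_buf_do_callbacks_fail(
	struct xfs_buf		*bp)
{
	struct xfs_log_item	*lip = bp->b_fspriv;
	struct xfs_ail		*ailp;

	if (!lip)
		return;

	ailp = lip->li_ailp;
	spin_lock(&ailp->xa_lock);
	for (; lip; lip = lip->li_bio_list) {
		if (lip->li_ops->iop_error)
			lip->li_ops->iop_error(lip, bp->b_flags);
	}
	spin_unlock(&ailp->xa_lock);
}

i.e. the same lock we would have taken anyway to remove the items from the
AIL had the write succeeded.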

> i.e. I'm not looking for a specific bug in the code - I'm pointing
> out a concurrent programming error that is being made.
> 

Ok. That's not what was described above, but sounds reasonable enough to
me regardless so long as the patch description is clear on what it is
doing. I don't think we're closing any race outside of the aborted thing
above because IN_AIL has to be set for metadata writeback to occur (and
potentially fail).

> > But otherwise given the above, that this is primarily I/O error path
> > code, and that we presumably already have serialization mechanisms in
> > place in the form of parent object locks, why not just grab the inode
> > lock in the appropriate places?
> 
> Because we can't guarantee that we can lock the parent object in IO
> completion. Deadlock is trivial: process A grabs parent object,
> blocks on flush lock. IO completion holds flush lock, blocks trying
> to get parent object lock.
> 

Ah, right. So once we acquire the flush lock and drop the ilock, we
can no longer block on the ilock from a context responsible for
releasing the flush lock.

That makes sense, but raises a larger question... if "process A" can
cause this kind of locking problem in I/O completion context, what
prevents it from interfering with (re)submission context just the same?
For example, the AIL pushes an inode, locks, flushes and submits the
underlying buffer. The buffer I/O fails and is ultimately released back
to the AIL. Process A comes in, locks the inode and blocks on the flush
lock. Subsequent AIL pushes and retry attempts fail to lock the inode,
never even get to the point of checking for LI_FAILED and thus we're
back to the same problem we have today.

IOW, doesn't this mean we need to check and handle LI_FAILED first off
in ->iop_push() and not just in response to flush lock failure?
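
e.g., something along these lines at the top of xfs_inode_item_push(), before
we even try to take the ilock (rough sketch, the helper name is hypothetical):

	/*
	 * Deal with a previously failed buffer up front so the retry can't
	 * be starved by a third party holding the ilock and waiting on the
	 * flush lock.
	 */
	if (lip->li_flags & XFS_LI_FAILED)
		return xfs_inode_item_push_failed(lip, buffer_list);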

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-20 11:46             ` Brian Foster
@ 2017-05-21 23:19               ` Dave Chinner
  2017-05-22 12:51                 ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-21 23:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > Adding new flags to the same field that can be asynchronously
> > updated by RMW operations outside the ailp->xa_lock will cause
> > problems in future. There *may not* be a problem right now, but it
> > is poor programming practice to have different coherency processes
> > for the same variable that is updated via RMW operations. In these
> > situations, the only safe way to update the variable is to use an
> > atomic operation.
> > 
> 
> So is there a reason why we couldn't acquire ->xa_lock to fail the log
> items as we would have done anyways if the metadata writeback had
> succeeded and we were removing the log items from the AIL..

Yes. The ailp->xa_lock protects AIL state and is a highly contended
lock. It should not be used for things that aren't AIL related
because that will have performance and scalability implications.

> (e.g., why
> introduce new serialization rules if we don't need to)?

Because we don't tie independent object operational state to a
single subsystem's internal locking requirements. If the log item
does not have its own context lock (which it doesn't), then we need
to use atomic operations to serialise flag updates across different
subsystems so they can safely maintain their own independent
internal locking schemes.

An example is the inode flags - they are serialised by the
i_flags_lock spinlock because there are multiple state changes that
need to be done atomically through different subsystems (including
RCU freeing state).  We serialise these updates by an object lock
rather than a subsystem lock because it's the only way we can share
the flag field across different subsystems safely...
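
i.e. the update pattern is always (as in xfs_iflags_set()):

	spin_lock(&ip->i_flags_lock);
	ip->i_flags |= flags;
	spin_unlock(&ip->i_flags_lock);

and every subsystem that needs to change ip->i_flags takes that same object
lock, regardless of whatever subsystem locks it already holds.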

> > > But otherwise given the above, that this is primarily I/O error path
> > > code, and that we presumably already have serialization mechanisms in
> > > place in the form of parent object locks, why not just grab the inode
> > > lock in the appropriate places?
> > 
> > Because we can't guarantee that we can lock the parent object in IO
> > completion. Deadlock is trivial: process A grabs parent object,
> > blocks on flush lock. IO completion holds flush lock, blocks trying
> > to get parent object lock.
> > 
> 
> Ah, right. So once we acquire the flush lock and drop the ilock, we
> can no longer block on the ilock from a context responsible for
> releasing the flush lock.
> 
> That makes sense, but raises a larger question... if "process A" can
> cause this kind of locking problem in I/O completion context, what
> prevents it from interfering with (re)submission context just the same?
> For example, the AIL pushes an inode, locks, flushes and submits the
> underlying buffer. The buffer I/O fails and is ultimately released back
> to the AIL. Process A comes in, locks the inode and blocks on the flush
> lock.  Subsequent AIL pushes and retry attempts fail to lock the inode,
> never even get to the point of checking for LI_FAILED and thus we're
> back to the same problem we have today.
> 
> IOW, doesn't this mean we need to check and handle LI_FAILED first off
> in ->iop_push() and not just in response to flush lock failure?

It's entirely possible that we need to do that. This whole "don't
endlessly retry failed buffer writes" thing is a substantial change
in behaviour, so there's bound to be interactions that we don't get
right the first time...

Cheers,

Dave.

> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-21 23:19               ` Dave Chinner
@ 2017-05-22 12:51                 ` Brian Foster
  2017-05-23 11:23                   ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-22 12:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > Adding new flags to the same field that can be asynchronously
> > > updated by RMW operations outside the ailp->xa_lock will cause
> > > problems in future. There *may not* be a problem right now, but it
> > > is poor programming practice to have different coherency processes
> > > for the same variable that is updated via RMW operations. In these
> > > situations, the only safe way to update the variable is to use an
> > > atomic operation.
> > > 
> > 
> > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > items as we would have done anyways if the metadata writeback had
> > succeeded and we were removing the log items from the AIL..
> 
> Yes. the alip->xa_lock protects AIL state is a highly contended
> lock. It should not be used for things that aren't AIL related
> because that will have performance and scalability implications.
> 

The purpose of this flag is to control AIL retry processing; how is this
not AIL related? If the I/O succeeds, we grab ->xa_lock to remove the
items from the AIL. If the I/O fails, we can't grab ->xa_lock to tag the
items as failed (so the AIL knows if/how to retry the item(s)) because
of performance and scalability concerns..?

In fact, having taken another look at the code, you could make the
argument that failure to take the ->xa_lock is a landmine should the AIL
task ever expect to observe consistent state across the affected set of
items (independent from atomic li_flags updates).

> > (e.g., why
> > introduce new serialization rules if we don't need to)?
> 
> Becuase we don't tie independent object operational state to a
> single subsystem's internal locking requirements. If the log item
> does not have it's own context lock (which it doesn't), then we need
> to use atomic operations to serialise flag updates across different
> subsystems so they can safely maintain their own independent
> internal locking schemes.
> 
> An example is the inode flags - they are serialised by the
> i_flags_lock spinlock because there are multiple state changes that
> need to be done atomically through different subsystems (including
> RCU freeing state).  We serialise these updates by an object lock
> rather than a subsystem lock because it's the only way we can share
> the flag field across different subsystems safely...
> 

This makes sense in principle and from the perspective that it might
matter for some future changes/flags, but I just don't think this patch
really departs from our existing serialization model wrt the AIL. We
aren't adding the serialization requirement to any new context where it
doesn't already exist. Therefore, (IMO) it's mostly an overcomplication
to rework serialization as a dependency for this fix.

All that said, the bitops change is harmless and there are only a few
flags to deal with, so I don't think it really matters much. I just
think it would be nice to avoid an artificial backport dependency. IOW,
I think this patch should use ->xa_lock as is and can be immediately
followed by a patch to convert the li_flags to bit ops and remove the
->xa_lock from contexts where it is no longer necessary (with documented
justification).

> > > > But otherwise given the above, that this is primarily I/O error path
> > > > code, and that we presumably already have serialization mechanisms in
> > > > place in the form of parent object locks, why not just grab the inode
> > > > lock in the appropriate places?
> > > 
> > > Because we can't guarantee that we can lock the parent object in IO
> > > completion. Deadlock is trivial: process A grabs parent object,
> > > blocks on flush lock. IO completion holds flush lock, blocks trying
> > > to get parent object lock.
> > > 
> > 
> > Ah, right. So once we acquire the flush lock and drop the ilock, we
> > can no longer block on the ilock from a context responsible for
> > releasing the flush lock.
> > 
> > That makes sense, but raises a larger question... if "process A" can
> > cause this kind of locking problem in I/O completion context, what
> > prevents it from interfering with (re)submission context just the same?
> > For example, the AIL pushes an inode, locks, flushes and submits the
> > underlying buffer. The buffer I/O fails and is ultimately released back
> > to the AIL. Process A comes in, locks the inode and blocks on the flush
> > lock.  Subsequent AIL pushes and retry attempts fail to lock the inode,
> > never even get to the point of checking for LI_FAILED and thus we're
> > back to the same problem we have today.
> > 
> > IOW, doesn't this mean we need to check and handle LI_FAILED first off
> > in ->iop_push() and not just in response to flush lock failure?
> 
> It's entirely possible that we need to do that. This whole "don't
> endlessly retry failed buffer writes" thing is a substantial change
> in behaviour, so there's bound to be interactions that we don't get
> right the first time...
> 

I'm not sure if you're referring to this patch or the broader error
configuration stuff here... Note that this patch doesn't change the
fundamental behavior that the AIL retries failed buffers (subject to the
error config). I tend to get this mixed up, but IIUC this has been
traditional behavior for things like buffers, for example, for quite
some time. The problem here is that the AIL doesn't correctly issue
retries for certain items (i.e., those with flush locks), so we're just
unraveling a livelock in the log subsystem (as opposed to changing
behavior).

Brian

> Cheers,
> 
> Dave.
> 
> > 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-22 12:51                 ` Brian Foster
@ 2017-05-23 11:23                   ` Dave Chinner
  2017-05-23 16:22                     ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-23 11:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > Adding new flags to the same field that can be asynchronously
> > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > problems in future. There *may not* be a problem right now, but it
> > > > is poor programming practice to have different coherency processes
> > > > for the same variable that is updated via RMW operations. In these
> > > > situations, the only safe way to update the variable is to use an
> > > > atomic operation.
> > > > 
> > > 
> > > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > > items as we would have done anyways if the metadata writeback had
> > > succeeded and we were removing the log items from the AIL..
> > 
> > Yes. the alip->xa_lock protects AIL state is a highly contended
> > lock. It should not be used for things that aren't AIL related
> > because that will have performance and scalability implications.
> > 
> 
> The purpose of this flag is to control AIL retry processing, how is this
> not AIL related?

It's IO state, not AIL state. IO submission occurs from more places
than an AIL push (e.g. inode reclaim, inode clustering, etc) and
there's no way we should be exposing the internal AIL state lock in
places like that.

> All that said, the bitops change is harmless and there are only a few
> flags to deal with, so I don't think it really matters much. I just
> think it would be nice to avoid an artificial backport dependency. IOW,
> I think this patch should use ->xa_lock as is and can be immediately
> followed by a patch to convert the li_flags to bit ops and remove the
> ->xa_lock from contexts where it is no longer necessary (with documented
> justification).

Then it needs to be done as a single patch set with the fix you want
to backport as the first patch, otherwise the bitop change will not get
done until someone does scalability tests and trips over it and then
we've got more shit to backport to fix performance regressions.

> > > IOW, doesn't this mean we need to check and handle LI_FAILED first off
> > > in ->iop_push() and not just in response to flush lock failure?
> > 
> > It's entirely possible that we need to do that. This whole "don't
> > endlessly retry failed buffer writes" thing is a substantial change
> > in behaviour, so there's bound to be interactions that we don't get
> > right the first time...
> > 
> 
> I'm not sure if you're referring to this patch or the broader error
> configuration stuff here... Note that this patch doesn't change the
> fundamental behavior that the AIL retries failed buffers (subject to the
> error config). I tend to get this mixed up, but IIUC this has been
> traditional behavior for things like buffers, for example, for quite
> some time.

Yes, but the change is that now we rely on the AIL push to trigger
cleanup of failed buffers on unmount, whereas previously the unmount
just hung endlessly retrying the failed buffers. i.e. we used to
accept this hang as "expected behaviour" but now it's considered a
bug we have to fix. Hence we now have to handle retries and failures
and untangle the locking issues we've not had to care about for 20
years.

As it is, the "only check failure if flush lock fails" idea was
designed to prevent having to look up the backing buffer to check for
failure for every inode we wanted to flush as those lookups are too
expensive to do on every inode we need to flush. However, if we are
propagating the failure state to the log item on IO completion,
checking this state is not expensive any more, so there's no need to
hide it until we detect a state that may indicate an IO failure....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-23 11:23                   ` Dave Chinner
@ 2017-05-23 16:22                     ` Brian Foster
  2017-05-24  1:06                       ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-23 16:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote:
> On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > > Adding new flags to the same field that can be asynchronously
> > > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > > problems in future. There *may not* be a problem right now, but it
> > > > > is poor programming practice to have different coherency processes
> > > > > for the same variable that is updated via RMW operations. In these
> > > > > situations, the only safe way to update the variable is to use an
> > > > > atomic operation.
> > > > > 
> > > > 
> > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > > > items as we would have done anyways if the metadata writeback had
> > > > succeeded and we were removing the log items from the AIL..
> > > 
> > > Yes. the alip->xa_lock protects AIL state is a highly contended
> > > lock. It should not be used for things that aren't AIL related
> > > because that will have performance and scalability implications.
> > > 
> > 
> > The purpose of this flag is to control AIL retry processing, how is this
> > not AIL related?
> 
> It's IO state, not AIL state. IO submission occurs from more places
> than and AIL push (e.g. inode reclaim, inode clustering, etc) and
> there's no way we should be exposing the internal AIL state lock in
> places like that.
> 

That's reasonable, though it suggests different code from what this
patchset currently does. This would be far more useful if we could focus
on correctness of the code first and foremost. If the code we commit
actually warrants this kind of change, then this becomes a much easier
(i.e., pointless :P) discussion from my end.

As it is, this patch only ever clears the flag on AIL resubmission. ISTM
you are indirectly suggesting that should occur in other contexts as
well (e.g., consider a reclaim and flush of another inode backed by a
buffer that is already failed, and has other inodes sitting in the AIL
flush locked). If that is the case, please elaborate, because otherwise
it is kind of impossible to evaluate and discuss without specifics.

> > All that said, the bitops change is harmless and there are only a few
> > flags to deal with, so I don't think it really matters much. I just
> > think it would be nice to avoid an artificial backport dependency. IOW,
> > I think this patch should use ->xa_lock as is and can be immediately
> > followed by a patch to convert the li_flags to bit ops and remove the
> > ->xa_lock from contexts where it is no longer necessary (with documented
> > justification).
> 
> Then it needs to be done as a single patch set with the fix you want
> to backport as the first patch, otherwise the bitop change not get
> done until someone does scalability tests and trips over it and then
> we've got more shit to backport to fix performance regressions.
> 

The point is that the atomic bitmask change doesn't need to be
backported at all. Notwithstanding potential changes to the patches as
they currently stand (as noted above), it doesn't actually fix anything,
performance or otherwise.

As it is, this patch would only add an acquisition of ->xa_lock in
completion context on I/O failure. Exactly what workload would you
expect to be affected by changes to I/O error handling? How would it be
any different than what we already do on normal I/O completion?

> > > > IOW, doesn't this mean we need to check and handle LI_FAILED first off
> > > > in ->iop_push() and not just in response to flush lock failure?
> > > 
> > > It's entirely possible that we need to do that. This whole "don't
> > > endlessly retry failed buffer writes" thing is a substantial change
> > > in behaviour, so there's bound to be interactions that we don't get
> > > right the first time...
> > > 
> > 
> > I'm not sure if you're referring to this patch or the broader error
> > configuration stuff here... Note that this patch doesn't change the
> > fundamental behavior that the AIL retries failed buffers (subject to the
> > error config). I tend to get this mixed up, but IIUC this has been
> > traditional behavior for things like buffers, for example, for quite
> > some time.
> 
> Yes, but the change is that now we rely on the AIL push to trigger
> cleanup of failed buffers on unmount, whereas previously the unmount
> just hung endlessly retrying the failed buffers. i.e. we used to
> accept this hang as "expected behaviour" but now it's considered a
> bug we have to fix. Hence we now have to handle retries and failures
> and untangle the locking issues we've not had to care about for 20
> years.
> 

This completely mischaracterizes the problem. If this problem occurs,
log deadlock is the most likely outcome. An unmount is simply part of
the most common test method to immediately observe a hang that is
otherwise already imminent based on previous events. This is not a pure
unmount problem and afaict not associated with any recent behavior
changes with regard to buffer I/O error handling.

> As it is, the "only check failure if flush lock fails" idea was
> designed to prevent having to lookup the backing buffer to check for
> failure for every inode we wanted to flush as those lookups are too
> expensive to do on every inode we need to flush. However, if we are
> propagating the failure state to the log item on IO completion,
> checking this state is not expensive any more, so there's no need to
> hide it until we detect a state that may indicate an IO failure....
> 

Indeed.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-23 16:22                     ` Brian Foster
@ 2017-05-24  1:06                       ` Dave Chinner
  2017-05-24 12:42                         ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2017-05-24  1:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote:
> On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote:
> > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > > > Adding new flags to the same field that can be asynchronously
> > > > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > > > problems in future. There *may not* be a problem right now, but it
> > > > > > is poor programming practice to have different coherency processes
> > > > > > for the same variable that is updated via RMW operations. In these
> > > > > > situations, the only safe way to update the variable is to use an
> > > > > > atomic operation.
> > > > > > 
> > > > > 
> > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > > > > items as we would have done anyways if the metadata writeback had
> > > > > succeeded and we were removing the log items from the AIL..
> > > > 
> > > > Yes. the alip->xa_lock protects AIL state is a highly contended
> > > > lock. It should not be used for things that aren't AIL related
> > > > because that will have performance and scalability implications.
> > > > 
> > > 
> > > The purpose of this flag is to control AIL retry processing, how is this
> > > not AIL related?
> > 
> > It's IO state, not AIL state. IO submission occurs from more places
> > than and AIL push (e.g. inode reclaim, inode clustering, etc) and
> > there's no way we should be exposing the internal AIL state lock in
> > places like that.
> > 
> 
> That's reasonable, though it suggests different code from what this
> patchset currently does. This would be far more useful if we could focus
> on correctness of the code first and foremost. If the code we commit
> actually warrants this kind of change, then this becomes a much easier
> (i.e., pointless :P) discussion from my end.

Hmmm - this is not the first time recently you've said something
similar in a review discussion where you didn't like the issues
that have been pointed out. Last time it was "handwavey", now it's
"pointless". We haven't worked out a solution that solves all the
known problems, so these sorts of comments really don't help close
on one....

> As it is, this patch only ever clears the flag on AIL resubmission. ISTM
> you are indirectly suggesting that should occur in other contexts as
> well (e.g., consider a reclaim and flush of another inode backed by a
> buffer that is already failed, and has other inodes sitting in the AIL
> flush locked). If that is the case, please elaborate, because otherwise
> it is kind of impossible to evaluate and discuss without specifics.

All I've done is take the deadlock scenario you pointed
out and applied it to other places where we block on the flush lock.

Go look at inode reclaim - it can do blocking reclaim and so will
block on the flush lock. How is this code supposed to handle hitting
a failed, flush locked inode?  If it just blocks, then it's a likely
deadlock vector as you've already pointed out. Hence we'll need the
same failed write checks to trigger buffer resubmission before we
retry the inode reclaim process.

[....]

> As it is, this patch would only add an acquisition of ->xa_lock in
> completion context on I/O failure. Exactly what workload would you
> expect to be affected by changes to I/O error handling? How would it be
> any different than what we already do on normal I/O completion?

If you want to check and clear the failed flag in the log item in a
non-racy way  in IO submission (because we have to do this outside
the flush lock) then we'd have to hold the AIL lock. And that means
the AIL lock is then in the hot path of inode flushing in all
situations, not just in IO completion.

We've known for a long time that the AIL lock is the second most
contended lock in the filesystem (behind the CIL context lock), and
that it is most heavily contended on inode IO completion because
completion runs on the CPU that the interrupt is delivered to and
so the AIL lock gets contended from multiple CPUs at once on
multi-disk filesystems (see xfs_iflush_done() for more info).
As such, we do not want to increase its use anywhere in the inode
IO submission/completion path at all if we can possibly avoid it.

So with all the problems that leaving failed inodes flush locked
presents, I'm starting to think that the inode failed callback
should remove the inode from the buffer and drop the flush lock
rather than leave it held and attached to the buffer. We can check
the LI_FAILED flag in xfs_iflush_int() and simply skip the inode
flush to the buffer to effectively "resubmit" the inode unchanged.
If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only
when a successful write occurs), we don't need to jump through hoops
to clear the flag at IO submission time. We'd probably also need a
failed flag on the buffer (maybe with a hold ref, too) so that it
isn't reclaimed by memory pressure while we have inodes in the
failed state pending writeback.
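
In rough pseudo-code - none of this is written or tested, and the names are
only indicative:

	/* ->iop_error(): fail the item, detach it and drop the flush lock */
	set_bit(XFS_LI_FAILED, &lip->li_flags);
	/* remove ip from the buffer's b_fspriv item list here */
	xfs_ifunlock(ip);

	/* xfs_iflush_int(): skip re-flushing a failed inode so the buffer
	 * gets resubmitted with the inode unchanged */
	if (test_bit(XFS_LI_FAILED, &iip->ili_item.li_flags))
		return 0;

	/* xfs_iflush_done(): only a successful write clears the state */
	clear_bit(XFS_LI_FAILED, &lip->li_flags);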

This leaves the rest of the submission code completely unchanged -
including the write clustering.  It also gets rid of all the flush
lock deadlock scenarios because we always unlock the flush lock on
IO completion, regardless of whether the buffer IO succeeded or not.
And we don't have to change any infrastructure, either - it all just
works as it is because the LI_FAILED flag is handled at the lowest
levels possible.

This, to me, seems like a far cleaner solution than anything we've
discussed so far....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-24  1:06                       ` Dave Chinner
@ 2017-05-24 12:42                         ` Brian Foster
  2017-05-24 13:26                           ` Carlos Maiolino
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-05-24 12:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 24, 2017 at 11:06:01AM +1000, Dave Chinner wrote:
> On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote:
> > On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote:
> > > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> > > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > > > > Adding new flags to the same field that can be asynchronously
> > > > > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > > > > problems in future. There *may not* be a problem right now, but it
> > > > > > > is poor programming practice to have different coherency processes
> > > > > > > for the same variable that is updated via RMW operations. In these
> > > > > > > situations, the only safe way to update the variable is to use an
> > > > > > > atomic operation.
> > > > > > > 
> > > > > > 
> > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > > > > > items as we would have done anyways if the metadata writeback had
> > > > > > succeeded and we were removing the log items from the AIL..
> > > > > 
> > > > > Yes. the alip->xa_lock protects AIL state is a highly contended
> > > > > lock. It should not be used for things that aren't AIL related
> > > > > because that will have performance and scalability implications.
> > > > > 
> > > > 
> > > > The purpose of this flag is to control AIL retry processing, how is this
> > > > not AIL related?
> > > 
> > > It's IO state, not AIL state. IO submission occurs from more places
> > > than and AIL push (e.g. inode reclaim, inode clustering, etc) and
> > > there's no way we should be exposing the internal AIL state lock in
> > > places like that.
> > > 
> > 
> > That's reasonable, though it suggests different code from what this
> > patchset currently does. This would be far more useful if we could focus
> > on correctness of the code first and foremost. If the code we commit
> > actually warrants this kind of change, then this becomes a much easier
> > (i.e., pointless :P) discussion from my end.
> 
> Hmmm - this is not the first time recently you've said something
> similar in a review discussion where you didn't like this issues
> have been pointed out. Last time is was "handwavey", now it's
> "pointless". We haven't worked out a solution that solves all the
> known problems, so these sorts of comments really don't help close
> on one....
> 

You misinterpret what I wrote. What I am saying is that _my_ argument
would become pointless if you could provide context around what locking
specifically leads to your performance/scalability concern.

I.e., just above you refer to taking the AIL lock in various places that
are not modified by this patch. Since this patch does not touch those
areas, I can only assume you see something wrong with the patch that
would actually require locking or some additional changes in those
areas. If that is the case, and you could explicitly point out where/why
that extra locking is necessary, then that would clearly undermine my
argument that there is no risk of performance degradation here. Instead,
you are just asserting that there is additional lock contention and I'm
kind of left guessing where these mysterious ->xa_lock acquisitions are,
because I haven't suggested using it anywhere outside of I/O completion.

Indeed, from this most recent mail, ISTM that your concern comes from
the implication that AIL locking would be required in various other I/O
submission contexts to clear the failed flags. More on that below...

(And I don't see what any of this has to do with the quotacheck deadlock
thing.)

> > As it is, this patch only ever clears the flag on AIL resubmission. ISTM
> > you are indirectly suggesting that should occur in other contexts as
> > well (e.g., consider a reclaim and flush of another inode backed by a
> > buffer that is already failed, and has other inodes sitting in the AIL
> > flush locked). If that is the case, please elaborate, because otherwise
> > it is kind of impossible to evaluate and discuss without specifics.
> 
> All I've done is take the deadlock scenario you pointed
> out and applied to it other places we block on the flush lock.
> 

Ok...

> Go look at inode reclaim - it can do blocking reclaim and so will
> block on the flush lock. How is this code supposed to handle hitting
> a failed, flush locked inode?  If it just blocks, then it's a likely
> deadlock vector as you've already pointed out. Hence we'll need the
> same failed write checks to trigger buffer resubmission before we
> retry the inode reclaim process.
> 

I'm aware we take the flush lock there, but I'm not convinced that is a
deadlock vector. AFAICT, nothing in that callpath blocks the subsequent
AIL retries from occurring. If that AIL retry eventually succeeds, this
all unwinds and we carry on as normal.

If _blocking_ as such is a problem for the codepath in question, then
that seems like more of a question of whether a trylock is more
appropriate.
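
For reference, blocking reclaim already does a trylock-first dance today;
paraphrasing (loosely, not the exact upstream code) the relevant bit of
xfs_reclaim_inode():

	if (!xfs_iflock_nowait(ip)) {
		if (!(sync_mode & SYNC_WAIT))
			goto out;	/* non-blocking reclaim just backs off */
		xfs_iflock(ip);		/* only SYNC_WAIT reclaim blocks here */
	}

So only the SYNC_WAIT case would ever sit behind a failed, flush locked
inode.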

> [....]
> 
> > As it is, this patch would only add an acquisition of ->xa_lock in
> > completion context on I/O failure. Exactly what workload would you
> > expect to be affected by changes to I/O error handling? How would it be
> > any different than what we already do on normal I/O completion?
> 
> If you want to check and clear the failed flag in the log item in a
> non-racy way  in IO submission (because we have to do this outside
> the flush lock) then we'd have to hold the AIL lock. And that means
> the AIL lock is then in the hot path of inode flushing in all
> situations, not just in IO completion.
> 

Personally, I have no specific desire to clear the failed flags in I/O
submission context as a general solution. The AIL retry originally did
that as a sort of an optimization to avoid looking up the buffer for
every flush locked inode (rightly and based on your suggestion).

This appears to have clouded the fact that we still need to clear the
failed flag on successful I/O completion. (I thought I called that out on a
previous iteration, but I've lost track, so maybe not. I also haven't gotten
to Carlos' latest version yet.)

Conditionally relying on clearing the failed flag from submission
context (as this patch does) is not sufficient because it is still
possible to modify and flush an inode to an already failed buffer (and
thus the buffer has pre-existing failed inodes). If an AIL push ended up
resubmitting the buffer due to the newly flushed/notfailed inode, it
never clears the failed flags on the other inodes attached to the
buffer. I haven't combed through all of the codepaths there, but I
wouldn't be surprised to see a data loss bug lurking in there. (Note
again that this is based on the code in this patch.)

This _could_ imply that all submission contexts need to unconditionally
check for and clear failed items on the buffer or that the buffer needs
a flag too, but it would be much simpler and more straightforward to
clear the flag after successful I/O completion (when ->xa_lock is
already held). This avoids the need for extra flags or serialization
requirements. To be fair, this _may_ require an additional ->xa_lock
cycle on I/O completion to clear the flag from relogged items so they
are flushed correctly again on a subsequent push (without having left
the AIL), but again, that should only occur after an I/O error has
occurred.
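
As a loose sketch of that last point (assuming an XFS_LI_FAILED bit and not
compile-tested): xfs_iflush_done() already walks the log items attached to
the buffer and takes ailp->xa_lock when it needs to remove them from the
AIL, so the flag could be cleared in that same walk:

	/* in xfs_iflush_done(), under an ->xa_lock cycle */
	for (blip = bp->b_fspriv; blip; blip = blip->li_bio_list) {
		blip->li_flags &= ~XFS_LI_FAILED;	/* the write made it to disk */
		/* existing logic: delete from the AIL if li_lsn == ili_flush_lsn */
	}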

[NOTE: After reading the rest of the email and coming back here, I see
you suggested the same notion of clearing the flag after I/O success in
your flush lock rework idea, so I suspect I'm not totally off the rails
here. ;)]

If that all works, there is no need to acquire the AIL lock in any
context we don't already have it in general. Again, if I'm missing
something there, please point it out.

> We've known for a long time that the AIL lock is the second most
> contended lock in the filesystem (behind the CIL context lock), and
> that it is most heavily contended on inode IO completion because
> completion runs on the CPU that the interrupt is delivered to and
> so the AIL lock gets contended from multiple CPUs at once on
> multi-disk filesystems (see xfs_iflush_done() for more info).
> As such, we do not want to increase it's use anywhere in the inode
> IO submission/completion path at all if we can possibly avoid it.
> 

Understood and agreed. But as mentioned previously, I've not suggested
taking ->xa_lock anywhere else outside of when I/O errors occur and so
performance in general is not affected.

> So with all the problems that leaving failed inodes flush locked
> presents, I'm starting to think that the inode failed callback
> should remove the inode from the buffer and drop the flush lock
> rather than leave it held and attached to the buffer. We can check
> the LI_FAILED flag in xfs_iflush_int() and simply skip the inode
> flush to the buffer to effectively "resubmit" the inode unchanged.
> If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only
> when a successful write occurs), we don't need to jump through hoops
> to clear the flag at IO submission time. We'd probably also need a
> failed flag on the buffer (maybe with a hold ref, too) so that it
> isn't reclaimed by memory pressure while we have inodes in the
> failed state pending writeback.
> 

A couple thoughts... first on the idea itself and second on the
trajectory of this particular patch series.

1.) Interesting idea. Some of my initial questions/thoughts on the idea
itself...

It's not clear to me how we'd manage whether a failed (and
flush-unlocked) inode has been subsequently modified and actually needs
to be flushed again. I'm also not following how we'd propagate
successful completion state back to the parent inodes if they had been
previously removed from the buffer due to an I/O failure.

It sounds like this _may_ conflict with providing a generic solution to
this problem (i.e., we have to fix this problem for dquots as well, and
hence this patch is focused on providing a generic solution). Is that
not the case, or would we have to repeat this pattern for every flush
locked item?

Finally, I'm a bit skeptical that it's wise to provide such a
fundamental shift in flush locking behavior based on the rare case of an
I/O error. That means every case that currently acquires a flush lock
has a new state to consider (IOW, we've effectively changed the meaning
of the lock). My experience is that our error handling
testing/regression capabilities are fairly poor as it is (which I don't
think is something unique to us), and thus error path regressions are
generally easier to introduce and tend to go undetected for quite some
time (how long has AIL retry been broken? ;).

Of course, this is just a first impression and could all be alleviated
with further discussion and/or code.

2.) Independent of whether we want to do something like the above, I
disagree with the premise that this series causes any new problems with
regard to flush locked metadata items (as explained above). It is
focused on fixing a particular AIL retry bug in the uncommon I/O error
case within the AIL retry framework that has been in place for quite
some time. Once the kinks are worked out, I think this is now very close
to fixing that bug succinctly, correctly and without any performance
side effects.

As such, I would very much prefer that we actually make progress on
fixing this specific bug before moving on to the new shiny design that
redefines flush locking. If there are still fundamental flush lock
issues beyond what this patch is intended to fix, we can address those
incrementally and rework flush locking along with LI_FAILED handling on
top of a known working base provided by this patch. At the absolute very
least, I'd prefer that we work out the remaining kinks in this patchset
(even if it is not ultimately merged) to provide a common reference for
a correct solution. If there is a hard technical impasse with this fix,
then that would at least help Carlos and me understand where the actual
problems lie.

That's just my .02 and is ultimately contingent on Carlos' preferred
approach, since he's the one doing the heavy lifting here (though I
suspect he would also prefer the incremental approach).

Brian

> This leaves the rest of the submission code completely unchanged -
> including the write clustering.  It also gets rid of all the flush
> lock deadlock scenarios because we always unlock the flush lock on
> IO completion, regardless of whether the buffer IO succeeded or not.
> ANd we don't have to change any infrastructure, either - it all just
> works as it is because the LI_FAILED flag is handled at the lowest
> levels possible.
> 
> This, to me, seems like a far cleaner solution than anything we've
> discussed so far....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-24 12:42                         ` Brian Foster
@ 2017-05-24 13:26                           ` Carlos Maiolino
  2017-05-24 17:08                             ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-05-24 13:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, May 24, 2017 at 08:42:47AM -0400, Brian Foster wrote:
> On Wed, May 24, 2017 at 11:06:01AM +1000, Dave Chinner wrote:
> > On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote:
> > > On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote:
> > > > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> > > > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > > > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > > > > > Adding new flags to the same field that can be asynchronously
> > > > > > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > > > > > problems in future. There *may not* be a problem right now, but it
> > > > > > > > is poor programming practice to have different coherency processes
> > > > > > > > for the same variable that is updated via RMW operations. In these
> > > > > > > > situations, the only safe way to update the variable is to use an
> > > > > > > > atomic operation.
> > > > > > > > 
> > > > > > > 
> > > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > > > > > > items as we would have done anyways if the metadata writeback had
> > > > > > > succeeded and we were removing the log items from the AIL..
> > > > > > 
> > > > > > Yes. the alip->xa_lock protects AIL state is a highly contended
> > > > > > lock. It should not be used for things that aren't AIL related
> > > > > > because that will have performance and scalability implications.
> > > > > > 
> > > > > 
> > > > > The purpose of this flag is to control AIL retry processing, how is this
> > > > > not AIL related?
> > > > 
> > > > It's IO state, not AIL state. IO submission occurs from more places
> > > > than and AIL push (e.g. inode reclaim, inode clustering, etc) and
> > > > there's no way we should be exposing the internal AIL state lock in
> > > > places like that.
> > > > 
> > > 
> > > That's reasonable, though it suggests different code from what this
> > > patchset currently does. This would be far more useful if we could focus
> > > on correctness of the code first and foremost. If the code we commit
> > > actually warrants this kind of change, then this becomes a much easier
> > > (i.e., pointless :P) discussion from my end.
> > 
> > Hmmm - this is not the first time recently you've said something
> > similar in a review discussion where you didn't like this issues
> > have been pointed out. Last time is was "handwavey", now it's
> > "pointless". We haven't worked out a solution that solves all the
> > known problems, so these sorts of comments really don't help close
> > on one....
> > 
> 
> You misinterpret what I wrote. What I am saying is that _my_ argument
> would become pointless if you could provide context around what locking
> specifically leads to your performance/scalability concern.
> 
> I.e., just above you refer to taking the AIL lock in various places that
> are not modified by this patch. Since this patch does not touch those
> areas, I can only assume you see something wrong with the patch that
> would actually require locking or some additional changes in those
> areas. If that is the case, and you could explicitly point out where/why
> that extra locking is necessary, then that would clearly undermine my
> argument that there is no risk of performance degradation here. Instead,
> you are just asserting that there is additional lock contention and I'm
> kind of left guessing where these mysterious ->xa_lock acquisitions are,
> because I haven't suggested using it anywhere outside of I/O completion.
> 
> Indeed, from this most recent mail, ISTM that your concern comes from
> the implication that AIL locking would be required in various other I/O
> submission contexts to clear the failed flags. More on that below...
> 
> (And I don't see what any of this has to do with the quotacheck deadlock
> thing.)
> 
> > > As it is, this patch only ever clears the flag on AIL resubmission. ISTM
> > > you are indirectly suggesting that should occur in other contexts as
> > > well (e.g., consider a reclaim and flush of another inode backed by a
> > > buffer that is already failed, and has other inodes sitting in the AIL
> > > flush locked). If that is the case, please elaborate, because otherwise
> > > it is kind of impossible to evaluate and discuss without specifics.
> > 
> > All I've done is take the deadlock scenario you pointed
> > out and applied to it other places we block on the flush lock.
> > 
> 
> Ok...
> 
> > Go look at inode reclaim - it can do blocking reclaim and so will
> > block on the flush lock. How is this code supposed to handle hitting
> > a failed, flush locked inode?  If it just blocks, then it's a likely
> > deadlock vector as you've already pointed out. Hence we'll need the
> > same failed write checks to trigger buffer resubmission before we
> > retry the inode reclaim process.
> > 
> 
> I'm aware we take the flush lock there, but I'm not convinced that is a
> deadlock vector. AFAICT, nothing in that callpath blocks the subsequent
> AIL retries from occurring. If that AIL retry eventually succeeds, this
> all unwinds and we carry on as normal.
> 
> If _blocking_ as such is a problem for the codepath in question, then
> that seems like more of a question of whether a trylock is more
> appropriate.
> 
> > [....]
> > 
> > > As it is, this patch would only add an acquisition of ->xa_lock in
> > > completion context on I/O failure. Exactly what workload would you
> > > expect to be affected by changes to I/O error handling? How would it be
> > > any different than what we already do on normal I/O completion?
> > 
> > If you want to check and clear the failed flag in the log item in a
> > non-racy way  in IO submission (because we have to do this outside
> > the flush lock) then we'd have to hold the AIL lock. And that means
> > the AIL lock is then in the hot path of inode flushing in all
> > situations, not just in IO completion.
> > 
> 
> Personally, I have no specific desire to clear the failed flags in I/O
> submission context as a general solution. The AIL retry originally did
> that as a sort of an optimization to avoid looking up the buffer for
> every flush locked inode (rightly and based on your suggestion).
> 
> This appears to have clouded the fact that we still need to clear the
> failed flag on successful I/O completion (I thought I called that out on
> a previous iteration, but I've lost track so maybe not. I also haven't
> got to Carlos' latest version.).
> 
> Conditionally relying on clearing the failed flag from submission
> context (as this patch does) is not sufficient because it is still
> possible to modify and flush an inode to an already failed buffer (and
> thus the buffer has pre-existing failed inodes). If an AIL push ended up
> resubmitting the buffer due to the newly flushed/notfailed inode, it
> never clears the failed flags on the other inodes attached to the
> buffer. I haven't combed through all of the codepaths there, but I
> wouldn't be surprised to see a data loss bug lurking in there. (Note
> again that this is based on the code in this patch.)
> 
> This _could_ imply that all submission contexts need to unconditionally
> check for and clear failed items on the buffer or that the buffer needs
> a flag too, but it would be much more simple and straightforward to
> clear the flag after successful I/O completion (when ->xa_lock is
> already held). This avoids the need for extra flags or serialization
> requirements. To be fair, this _may_ require an additional ->xa_lock
> cycle on I/O completion to clear the flag from relogged items so they
> are flushed correctly again on a subsequent push (without having left
> the AIL), but again, that should only occur after an I/O error has
> occurred.
> 
> [NOTE: After reading the rest of the email and coming back here, I see
> you suggested the same notion of clearing the flag after I/O success in
> your flush lock rework idea, so I suspect I'm not totally off the rails
> here. ;)]
> 
> If that all works, there is no need to acquire the AIL lock in any
> context we don't already have it in general. Again, if I'm missing
> something there, please point it out.
> 
> > We've known for a long time that the AIL lock is the second most
> > contended lock in the filesystem (behind the CIL context lock), and
> > that it is most heavily contended on inode IO completion because
> > completion runs on the CPU that the interrupt is delivered to and
> > so the AIL lock gets contended from multiple CPUs at once on
> > multi-disk filesystems (see xfs_iflush_done() for more info).
> > As such, we do not want to increase it's use anywhere in the inode
> > IO submission/completion path at all if we can possibly avoid it.
> > 
> 
> Understood and agreed. But as mentioned previously, I've not suggested
> taking ->xa_lock anywhere else outside of when I/O errors occur and so
> performance in general is not affected.
> 
> > So with all the problems that leaving failed inodes flush locked
> > presents, I'm starting to think that the inode failed callback
> > should remove the inode from the buffer and drop the flush lock
> > rather than leave it held and attached to the buffer. We can check
> > the LI_FAILED flag in xfs_iflush_int() and simply skip the inode
> > flush to the buffer to effectively "resubmit" the inode unchanged.
> > If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only
> > when a successful write occurs), we don't need to jump through hoops
> > to clear the flag at IO submission time. We'd probably also need a
> > failed flag on the buffer (maybe with a hold ref, too) so that it
> > isn't reclaimed by memory pressure while we have inodes in the
> > failed state pending writeback.
> > 
> 
> A couple thoughts... first on the idea itself and second on the
> trajectory of this particular patch series.
> 
> 1.) Interesting idea. Some of my initial questions/thoughts on the idea
> itself...
> 
> It's not clear to me how we'd manage whether a failed (and
> flush-unlocked) inode has been subsequently modified and actually needs
> to be flushed again. I'm also not following how we'd propagate
> successful completion state back to the parent inodes if they had been
> previously removed from the buffer due to an I/O failure.
> 
> It sounds like this _may_ conflict with providing a generic solution to
> this problem (i.e., we have to fix this problem for dquots as well, and
> hence this patch is focused on providing a generic solution). Is that
> not the case, or would we have to repeat this pattern for every flush
> locked item?
> 
> Finally, I'm a bit skeptical that it's wise to provide such a
> fundamental shift in flush locking behavior based on the rare case of an
> I/O error. That means every case that currently acquires a flush lock
> has a new state to consider (IOW, we've effectively changed the meaning
> of the lock). My experience is that our error handling
> testing/regression capabilities are fairly poor as it is (which I don't
> think is something unique to us), and thus error path regressions are
> generally easier to introduce and tend to go undetected for quite some
> time (how long has AIL retry been broken? ;).
> 
> Of course, this is just a first impression and could all be alleviated
> with further discussion and/or code.
> 
> 2.) Independent of whether we want to do something like the above, I
> disagree with the premise that this series causes any new problems with
> regard to flush locked metadata items (as explained above). It is
> focused on fixing a particular AIL retry bug in the uncommon I/O error
> case within the AIL retry framework that has been in place for quite
> some time. Once the kinks are worked out, I think this is now very close
> to fixing that bug succinctly, correctly and without any performance
> side effects.
> 
> As such, I would very much prefer that we actually make progress on
> fixing this specific bug before moving on to the new shiny design that
> redefines flush locking. If there are still fundamental flush lock
> issues beyond what this patch is intended to fix, we can address those
> incrementally and rework flush locking along with LI_FAILED handling on
> top of a known working base provided by this patch. At the absolute very
> least, I'd prefer that we work out the remaining kinks in this patchset
> (even if it is not ultimately merged) to provide a common reference for
> a correct solution. If there is a hard technical impasse with this fix,
> then that would at least help Carlos and I understand where the actual
> problems lie.
> 
> That's just my .02 and is ultimately contingent on Carlos' preferred
> approach, since he's the one doing the heavy lifting here (though I
> suspect he would also prefer the incremental approach).
> 
> Brian

I need to give some more thought to the new flush locking design Dave
suggested. It really looks like a nice idea that needs some polish before
implementation, and I can certainly work on that, but, from my POV, I agree
with Brian here.

Reworking the flush lock design will take time, more discussion, more patches,
etc. In the meantime, we have an increasing number of users hitting this bug;
with the growing usage of dm-thin this tends to happen more and more often,
since the bug is easily triggered when using dm-thin.

The unmount hang is just one of the symptoms of the problem; several other
things go wrong and might go unnoticed by the user. The unmount is the most
noticeable symptom a user will see, and I believe we can just polish what needs
to be polished in my patchset, fix the bug, then move on, and I'll work on the
design Dave suggested.

My patchset shouldn't introduce any new bug or regression (save that it still
needs a successful IO completion to clear the LI_FAILED flag), and it fixes, in
a reasonable way, a bug that is appearing more and more often.

Then I can move on and look over the design Dave mentioned, instead of
sticking with this bug in the code for... who knows how much longer.

Anyway, I'm not trying to 'sell my fish' here by saying this patchset is the
best solution, I'm far from doing anything like that, but this patchset uses
the approach we took weeks to decide on, so why not use it now, fix the bug,
and improve things later, once users at least are no longer hitting the bug?

Cheers.


> 
> > This leaves the rest of the submission code completely unchanged -
> > including the write clustering.  It also gets rid of all the flush
> > lock deadlock scenarios because we always unlock the flush lock on
> > IO completion, regardless of whether the buffer IO succeeded or not.
> > ANd we don't have to change any infrastructure, either - it all just
> > works as it is because the LI_FAILED flag is handled at the lowest
> > levels possible.
> > 
> > This, to me, seems like a far cleaner solution than anything we've
> > discussed so far....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-05-24 13:26                           ` Carlos Maiolino
@ 2017-05-24 17:08                             ` Brian Foster
  0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2017-05-24 17:08 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On Wed, May 24, 2017 at 03:26:04PM +0200, Carlos Maiolino wrote:
...
> 
> I need to give some more thinking about Dave's suggestion on this new flush
> locking design Dave suggested. This really looks a nice idea that needs some
> polish before the implementation, and I can certainly work on that, but, from my
> POV, I'll agree with Brian here.
> 
> Reworking in the flush lock design will take time, more discussions, more
> patches, etc. In the mean time, we have an increasing number in users hitting
> this bug, with the increasing usage of DM-Thin, this tends to increase quick,
> once, the bug is easily triggered when using dm-thin.
> 
> The unmount hang is just one of the symptoms of the problem, but there are
> several other things that go wrong and might be unnoticed by the user. The
> unmount is the most noticeable symptom an user will see, and I believe we can
> just polish what needed to be polished on my patchset, fix the bug, then move on
> and I'll work on this design Dave suggested.
> 
> My patchset shouldn't introduce any new bug or regression (saving it still needs
> that a successful IO completion clears the LI_FAILED flag), and fixes a bug that
> is appearing more and more often in a reasonable way.
> 
> Then I move on and look over the design Dave mentioned, instead of stick with
> this bug in the code for... who knows how longer.
> 
> Anyway, I'm not trying to 'sell my fish' here saying this patchset is the best
> solution, I'm far away from doing things like that, but this patchset uses the
> approach we took weeks to decide, then, why not use this now, fix the bug and
> improve it when at least users are not hitting the bug anymore?
> 

FWIW, I just posted some review comments to Carlos' v2 that attempt to
explicitly describe how I was thinking we could handle this without the
need for atomic flags and without introducing performance regressions.
Perhaps it would be easier to carry this discussion over there...

If what I was thinking is broken or has deeper problems, then we can
drop it and stick with the atomic flags approach that v2 uses or
otherwise look into something completely different like what Dave
suggests above.

Brian

> Cheers.
> 
> 
> > 
> > > This leaves the rest of the submission code completely unchanged -
> > > including the write clustering.  It also gets rid of all the flush
> > > lock deadlock scenarios because we always unlock the flush lock on
> > > IO completion, regardless of whether the buffer IO succeeded or not.
> > > ANd we don't have to change any infrastructure, either - it all just
> > > works as it is because the LI_FAILED flag is handled at the lowest
> > > levels possible.
> > > 
> > > This, to me, seems like a far cleaner solution than anything we've
> > > discussed so far....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> 
> -- 
> Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-21 10:10                   ` Brian Foster
@ 2017-06-21 15:25                     ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-21 15:25 UTC (permalink / raw)
  To: Brian Foster
  Cc: Luis R. Rodriguez, Eric Sandeen, Darrick J. Wong,
	Carlos Maiolino, linux-xfs

On Wed, Jun 21, 2017 at 06:10:52AM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 08:05:05PM +0200, Luis R. Rodriguez wrote:
> > On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> > > On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > > > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > > > It hasn't seemed necessary to me to take that approach given the lower
> > > > > prevalence of the issue 
> > > > 
> > > > Of this issue? I suppose its why I asked about examples of issues, I seem
> > > > to have found it likely much more possible out in the wild than expected.
> > > > It would seem folks might be working around it somehow.
> > > > 
> > > 
> > > If we're talking about the thin provisioning case, I suspect most people
> > > work around it by properly configuring their storage. ;)
> > 
> > The fact that we *hang* makes it more serious, so even if folks misconfigured
> > storage with less space it should be no reason to consider hangs any less
> > severe. Specially if it seems to be a common issue, and I'm alluding to the
> > fact that this might be more common than the patch describes.
> > 
> 
> My point is simply that a hang was a likely outcome before the patch
> that introduced the regression as well, so the benefit of doing a proper
> revert doesn't clearly outweigh the cost.

Sure agreed.

> Despite what the side effect
> is, the fact that this tends to primarily affect users who have thin
> volumes going inactive also suggests that this can be worked around
> reasonably well enough via storage configuration. This all suggests to
> me that Carlos' current approach is the most reasonable one. 

OK thanks.

> I'm not following what the line of argument is here. Are you suggesting
> a different approach? If so, based on what use case/reasoning?

No, it just seemed to me you were indicating that the hang was not that
serious an issue given you could work around it with proper storage
configuration. I see now you were using that analogy just to indicate it was
also an issue before, so the revert is with merit.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20 16:24       ` Luis R. Rodriguez
@ 2017-06-21 11:51         ` Carlos Maiolino
  0 siblings, 0 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-06-21 11:51 UTC (permalink / raw)
  To: Luis R. Rodriguez; +Cc: linux-xfs, Gionatan Danti

Hi Luis

> 
> Curious, since you can reproduce what happens if you do a hard reset on the
> system when this trigger, once it boots back up? I'd guess it covers but just
> want to be sure.
> 

Just for context, the problem with the stuck unmounts happens because the items
in the AIL can't be written back to their specific locations on disk due to
lack of real space. But instead of shutting down the FS when somebody tries to
unmount, or permanently failing the buffer when trying to write it back (if XFS
is configured to fail at some point), xfsaild keeps spinning on such buffers
because the items are flush locked and are never retried at all.

Given this brief context, now to answer your question regarding a hard reset:

When you hard reset the system in such a state, after the system comes back up,
the filesystem in question will fail to mount, because the journal will be
dirty and XFS won't be able to replay it during mount due to lack of space in
the physical device:

# mount <volume> /mnt
[   91.843429] XFS (dm-5): Mounting V5 Filesystem
[   91.864321] device-mapper: thin: 253:2: reached low water mark for data
device: sending event.
[   91.889451] device-mapper: thin: 253:2: switching pool to out-of-data-space
(error IO) mode
[   91.890821] XFS (dm-5): xfs_do_force_shutdown(0x1) called from line 1201 of
file fs/xfs/xfs_buf.c.  Return address = 0xffffffff813bb416
[   91.893590] XFS (dm-5): I/O Error Detected. Shutting down filesystem
[   91.894813] XFS (dm-5): Please umount the filesystem and rectify the
problem(s)
[   91.896158] XFS (dm-5): metadata I/O error: block 0x31f80 ("xlog_bwrite")
error 28 numblks 4096
[   91.902234] XFS (dm-5): failed to locate log tail
[   91.902920] XFS (dm-5): log mount/recovery failed: error -28
[   91.903712] XFS (dm-5): log mount failed
mount: mount <volume> on /mnt failed: No space left
on device

However, simply by expanding the thin pool, everything comes back to normal
again:

#lvextend -L +500M <POOL>

# mount <volume> /mnt
[  248.935258] XFS (dm-5): Mounting V5 Filesystem
[  248.954288] XFS (dm-5): Starting recovery (logdev: internal)
[  248.985238] XFS (dm-5): Ending recovery (logdev: internal)


-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20 18:05                 ` Luis R. Rodriguez
@ 2017-06-21 10:10                   ` Brian Foster
  2017-06-21 15:25                     ` Luis R. Rodriguez
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-06-21 10:10 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Eric Sandeen, Darrick J. Wong, Carlos Maiolino, linux-xfs

On Tue, Jun 20, 2017 at 08:05:05PM +0200, Luis R. Rodriguez wrote:
> On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> > On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > > It hasn't seemed necessary to me to take that approach given the lower
> > > > prevalence of the issue 
> > > 
> > > Of this issue? I suppose its why I asked about examples of issues, I seem
> > > to have found it likely much more possible out in the wild than expected.
> > > It would seem folks might be working around it somehow.
> > > 
> > 
> > If we're talking about the thin provisioning case, I suspect most people
> > work around it by properly configuring their storage. ;)
> 
> The fact that we *hang* makes it more serious, so even if folks misconfigured
> storage with less space it should be no reason to consider hangs any less
> severe. Specially if it seems to be a common issue, and I'm alluding to the
> fact that this might be more common than the patch describes.
> 

My point is simply that a hang was a likely outcome before the patch
that introduced the regression as well, so the benefit of doing a proper
revert doesn't clearly outweigh the cost. Despite what the side effect
is, the fact that this tends to primarily affect users who have thin
volumes going inactive also suggests that this can be worked around
reasonably well enough via storage configuration. This all suggests to
me that Carlos' current approach is the most reasonable one. 

I'm not following what the line of argument is here. Are you suggesting
a different approach? If so, based on what use case/reasoning?

Brian

>   Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20 17:20               ` Brian Foster
  2017-06-20 18:05                 ` Luis R. Rodriguez
@ 2017-06-20 18:38                 ` Luis R. Rodriguez
  1 sibling, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-20 18:38 UTC (permalink / raw)
  To: Brian Foster
  Cc: Luis R. Rodriguez, Eric Sandeen, Darrick J. Wong,
	Carlos Maiolino, linux-xfs

On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > If we truly needed a stable worthy fix in short order, that would
> > > probably be to revert ac8809f9a ("xfs: abort metadata writeback on
> > > permanent errors"), which caused this regression by making the AIL
> > > responsible for failed retries. 
> > 
> > Should the following tag be added then to this commit proposed by Carlos:
> > 
> > Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors")
> > 
> 
> That seems reasonable to me.

Then in this case I'd like the commit log to also explain *why* the described
fix did not work. It actually describes the issue being considered, "thin
provisioned devices that have run out of backing space", and recovering from
it. So did recovery never really work? Does recovery actually work for some
cases but not others? If so, why not for those?

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20 17:20               ` Brian Foster
@ 2017-06-20 18:05                 ` Luis R. Rodriguez
  2017-06-21 10:10                   ` Brian Foster
  2017-06-20 18:38                 ` Luis R. Rodriguez
  1 sibling, 1 reply; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-20 18:05 UTC (permalink / raw)
  To: Brian Foster
  Cc: Luis R. Rodriguez, Eric Sandeen, Darrick J. Wong,
	Carlos Maiolino, linux-xfs

On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > It hasn't seemed necessary to me to take that approach given the lower
> > > prevalence of the issue 
> > 
> > Of this issue? I suppose its why I asked about examples of issues, I seem
> > to have found it likely much more possible out in the wild than expected.
> > It would seem folks might be working around it somehow.
> > 
> 
> If we're talking about the thin provisioning case, I suspect most people
> work around it by properly configuring their storage. ;)

The fact that we *hang* makes it more serious, so even if folks misconfigured
storage with less space, that is no reason to consider hangs any less severe.
Especially if it seems to be a common issue, and I'm alluding to the fact that
this might be more common than the patch describes.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20 16:52             ` Luis R. Rodriguez
@ 2017-06-20 17:20               ` Brian Foster
  2017-06-20 18:05                 ` Luis R. Rodriguez
  2017-06-20 18:38                 ` Luis R. Rodriguez
  0 siblings, 2 replies; 44+ messages in thread
From: Brian Foster @ 2017-06-20 17:20 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Eric Sandeen, Darrick J. Wong, Carlos Maiolino, linux-xfs

On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote:
> > > On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> > > > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> > > >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > > > It seems like a rather large set of changes, if the issue was sevee I was hoping
> > > > for a stable candidate fix first. If its not fixing a severe issue then sure.
> > > 
> > > Fixes go uptream first, then to stable kernels if appropriate, right?
> > > 
> > > But not every fix is a trivial one-liner.  I don't think there is any simpler
> > > fix to be had, here.
> > > 
> > 
> > Yeah, this is kind of intended to be the simplest fix available as far
> > as I'm aware. TBH, I don't think the fix here is fundamentally that
> > complex (define a state for already flushed/failed log items), but
> > rather the code that needs to be modified to implement it has various
> > states and corner cases to manage that make it tricky to get right.
> 
> OK this helps, thanks.
> 
> > If we truly needed a stable worthy fix in short order, that would
> > probably be to revert ac8809f9a ("xfs: abort metadata writeback on
> > permanent errors"), which caused this regression by making the AIL
> > responsible for failed retries. 
> 
> Should the following tag be added then to this commit proposed by Carlos:
> 
> Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors")
> 

That seems reasonable to me.

> FWIW this was merged as of v3.13. Even though that seems to have taken the
> failed buffer and kept it the commit log of that change seems to indicate we
> already ran into similar situations before, after this commit we seem to just
> retry IO once, but keep the buffer around on the AIL, to allow further
> modifications of the buffer.
> 

Before that commit, I believe we would retry the metadata writeback
indefinitely so long as it failed. If the underlying I/O failure ceased
to occur, then this loop ends and the fs proceeds as normal. If not,
then the filesystem is probably going to hang. After that commit, we
retry once from I/O completion handling and otherwise rely on the AIL to
issue the next (indefinite) retry. If the item that failed has a flush
lock, we won't actually ever submit the I/O (which is the bug), however,
and thus you're toast regardless of whether the I/O would ultimately
succeed or not.
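
For anyone following along, the "bug" boils down to a couple of lines in
xfs_inode_item_push() (paraphrased here, not the exact upstream code):

	if (!xfs_iflock_nowait(ip)) {
		/*
		 * Flush locked: assume a write is already in flight and try
		 * again later.  If that write actually failed, "later" never
		 * comes, because nothing releases the flush lock again.
		 */
		rval = XFS_ITEM_FLUSHING;
		goto out_unlock;
	}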

> 
> > A couple caveats I can think of with
> > that are that 1.) this would likely short circuit the recently added
> > error configuration api (which is kind of irrelevant if the underlying
> > error management code doesn't work correctly, but suggests that should
> > be reverted as well) and 2.) in doing so, this reverts back to the
> > hardcoded infinite retry behavior in XFS. That means that transient
> > errors may eventually recover, but the thinp enospc use case and whatnot
> > are still going to hang on umount. 
> 
> Right, and also one of the gains of the patch seems to have been to allow
> thinp devices to keep on chugging with modifications on the buffer, so that
> would be lost as well. That seems to be like an optimization though so its
> worth loosing IMHO if would have resolved the hang. Since that's not the case
> though...
> 

I think that's just a secondary effect of unlocking the buffer. Probably
not that important if I/Os are failing.

> > It hasn't seemed necessary to me to take that approach given the lower
> > prevalence of the issue 
> 
> Of this issue? I suppose its why I asked about examples of issues, I seem
> to have found it likely much more possible out in the wild than expected.
> It would seem folks might be working around it somehow.
> 

If we're talking about the thin provisioning case, I suspect most people
work around it by properly configuring their storage. ;)

Brian

> > and the fact that a solution had been worked out
> > for a while now. Though I suppose the longer we go without a fix in
> > place, the stronger the argument for something like that becomes.
> 
>   Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-19 10:59           ` Brian Foster
@ 2017-06-20 16:52             ` Luis R. Rodriguez
  2017-06-20 17:20               ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-20 16:52 UTC (permalink / raw)
  To: Brian Foster
  Cc: Eric Sandeen, Luis R. Rodriguez, Darrick J. Wong,
	Carlos Maiolino, linux-xfs

On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote:
> > On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> > > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> > >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > > It seems like a rather large set of changes, if the issue was sevee I was hoping
> > > for a stable candidate fix first. If its not fixing a severe issue then sure.
> > 
> > Fixes go uptream first, then to stable kernels if appropriate, right?
> > 
> > But not every fix is a trivial one-liner.  I don't think there is any simpler
> > fix to be had, here.
> > 
> 
> Yeah, this is kind of intended to be the simplest fix available as far
> as I'm aware. TBH, I don't think the fix here is fundamentally that
> complex (define a state for already flushed/failed log items), but
> rather the code that needs to be modified to implement it has various
> states and corner cases to manage that make it tricky to get right.

OK this helps, thanks.

> If we truly needed a stable worthy fix in short order, that would
> probably be to revert ac8809f9a ("xfs: abort metadata writeback on
> permanent errors"), which caused this regression by making the AIL
> responsible for failed retries. 

Should the following tag be added then to this commit proposed by Carlos:

Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors")

FWIW this was merged as of v3.13. Even though that commit seems to have taken
the failed buffer and kept it, its commit log seems to indicate we already ran
into similar situations before; after this commit we seem to just retry the IO
once, but keep the buffer around on the AIL, to allow further modifications of
the buffer.


> A couple caveats I can think of with
> that are that 1.) this would likely short circuit the recently added
> error configuration api (which is kind of irrelevant if the underlying
> error management code doesn't work correctly, but suggests that should
> be reverted as well) and 2.) in doing so, this reverts back to the
> hardcoded infinite retry behavior in XFS. That means that transient
> errors may eventually recover, but the thinp enospc use case and whatnot
> are still going to hang on umount. 

Right, and also one of the gains of the patch seems to have been to allow
thinp devices to keep on chugging with modifications on the buffer, so that
would be lost as well. That seems like an optimization though, so it's worth
losing IMHO if it would have resolved the hang. Since that's not the case
though...

> It hasn't seemed necessary to me to take that approach given the lower
> prevalence of the issue 

Of this issue? I suppose that's why I asked about examples of issues; I seem
to have found it much more likely out in the wild than expected. It would seem
folks might be working around it somehow.

> and the fact that a solution had been worked out
> for a while now. Though I suppose the longer we go without a fix in
> place, the stronger the argument for something like that becomes.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-20  7:01     ` Carlos Maiolino
@ 2017-06-20 16:24       ` Luis R. Rodriguez
  2017-06-21 11:51         ` Carlos Maiolino
  0 siblings, 1 reply; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-20 16:24 UTC (permalink / raw)
  To: Luis R. Rodriguez, linux-xfs; +Cc: Gionatan Danti

On Tue, Jun 20, 2017 at 09:01:03AM +0200, Carlos Maiolino wrote:
> Hello Luis.
> 
> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > > When a buffer has been failed during writeback, the inode items into it
> > > are kept flush locked, and are never resubmitted due the flush lock, so,
> > > if any buffer fails to be written, the items in AIL are never written to
> > > disk and never unlocked.
> > > 
> > > This causes unmount operation to hang due these items flush locked in AIL,
> > 
> > What type of hang? If it has occurred in production is there a trace somewhere?
> > what does it look like?
> > 
> 
> No, there isn't any specific trace, the hang can be seen in several different
> places, when unmounting the filesystem, it will hang in xfs_ail_push_all_sync(),
> but this will be hit even if no unmount is attempted, with items stuck forever
> in ail.

Curious, since you can reproduce: what happens if you do a hard reset on the
system when this triggers, once it boots back up? I'd guess it recovers, but I
just want to be sure.

> I think the easier way to track this is to look at the device stats in sysfs,
> and you will see a forever increase in push_ail statistics even with no work
> going on in the filesystem.

But the above seems to note it can happen for *any* failed buffer on writeback?
Has that really been so rare, or was this also just because the 'retry forever'
option was required? Or did 'retry forever' help increase the chances of
reproducing?

I suppose I was hoping for a real world case which might land us in the worst
case. For instance, the dm-thin case with a pool that runs out of space seems
like a common situation for some docker users, and from the looks of it folks
are, as a side effect, seeing this as docker just hanging after trying to
unmount [0] and reaching xfs_ail_push_all_sync(). Could this be an example of a
real world issue? There was no mention of the 'retry forever' option being
required in these cases, though. If this is an alternative real world situation
that triggers the bug, then this might be more common than what the commit log
seems to describe.

[0] https://github.com/moby/moby/issues/20707

> > You said you would work on an xfstest for this, how's that going? Otherewise
> > a commit log description of how to reproduce would be useful.
> >
> 
> The xfstests is not done yet, and I'm actually not focusing on it right now, I
> already have a reproducer, pointed on the beginning of the discussion from this
> problem and having this fixed by now is my priority, once the patches are in
> shape and accepted, I'll work on the xfstests.

OK thanks.

> Not to mention that this problem is still possible to occur not only with
> inode items, but also with dquot items, which will also be fixed as soon as we
> reach a consensus of how to best fix this problem by now. Once the dquot items
> fix will use the same infra-structure as the inode items use in this patchset,
> and quite the same code, one of the reasons I segmented the buffer resubmission
> into a different function that can be used for both item types.

I see... thanks!

> > > but this also causes the items in AIL to never be written back, even when
> > > the IO device comes back to normal.
> > > 
> > > I've been testing this patch with a DM-thin device, creating a
> > > filesystem larger than the real device.
> > > 
> > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > > errors from the device, and keep spinning on xfsaild (when 'retry
> > > forever' configuration is set).
> > > 
> > > At this point, the filesystem can not be unmounted because of the flush locked
> > > items in AIL, but worse, the items in AIL are never retried at all
> > > (once xfs_inode_item_push() will skip the items that are flush locked),
> > > even if the underlying DM-thin device is expanded to the proper size.
> > 
> > Jeesh.
> > 
> > If the above issue is a real hang, shoudl we not consider a sensible stable fix
> > to start off with ?
> > 
> 
> Please take a look at the whole history of this issue, this patchset is supposed
> to be the stable fix, that's why one of the reqs was to use xa_lock here, to
> change the log_item flags, instead of using atomic ops, making it easier to
> backport it to stable kernels, without messing around with atomic ops and field
> type changes and yes, this is a real hang problem, which we already received
> several reports on this along the time I'm working on it.

I'm thrilled to hear stable is being considered here. Thanks!

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 18:35   ` Luis R. Rodriguez
  2017-06-16 19:24     ` Darrick J. Wong
@ 2017-06-20  7:01     ` Carlos Maiolino
  2017-06-20 16:24       ` Luis R. Rodriguez
  1 sibling, 1 reply; 44+ messages in thread
From: Carlos Maiolino @ 2017-06-20  7:01 UTC (permalink / raw)
  To: Luis R. Rodriguez; +Cc: linux-xfs

Hello Luis.

On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> > 
> > This causes unmount operation to hang due these items flush locked in AIL,
> 
> What type of hang? If it has occurred in production is there a trace somewhere?
> what does it look like?
> 

No, there isn't any specific trace; the hang can be seen in several different
places. When unmounting the filesystem, it will hang in
xfs_ail_push_all_sync(), but this will be hit even if no unmount is attempted,
with items stuck forever in the AIL.

I think the easiest way to track this is to look at the device stats in sysfs,
where you will see the push_ail statistics increasing forever even with no work
going on in the filesystem.

> You said you would work on an xfstest for this, how's that going? Otherewise
> a commit log description of how to reproduce would be useful.
>

The xfstest is not done yet, and I'm actually not focusing on it right now. I
already have a reproducer, pointed out at the beginning of the discussion of
this problem, and getting this fixed is my priority for now; once the patches
are in shape and accepted, I'll work on the xfstest.

Not to mention that this problem can occur not only with inode items, but also
with dquot items, which will also be fixed as soon as we reach a consensus on
how best to fix this problem. Since the dquot fix will use the same
infrastructure as the inode items use in this patchset, and nearly the same
code, that is one of the reasons I split the buffer resubmission into a
separate function that can be used for both item types.


 
> > but this also causes the items in AIL to never be written back, even when
> > the IO device comes back to normal.
> > 
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> > 
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> > 
> > At this point, the filesystem can not be unmounted because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
> 
> Jeesh.
> 
> If the above issue is a real hang, shoudl we not consider a sensible stable fix
> to start off with ?
> 

Please take a look at the whole history of this issue; this patchset is supposed
to be the stable fix. That's why one of the requirements was to use xa_lock here
to change the log_item flags, instead of using atomic ops, making it easier to
backport to stable kernels without messing around with atomic ops and field type
changes. And yes, this is a real hang problem, on which we have already received
several reports over the time I've been working on it.

Cheers

>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-19 13:49   ` Brian Foster
@ 2017-06-19 15:09     ` Brian Foster
  0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2017-06-19 15:09 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Mon, Jun 19, 2017 at 09:49:42AM -0400, Brian Foster wrote:
> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> > 
> > This causes unmount operation to hang due these items flush locked in AIL,
> > but this also causes the items in AIL to never be written back, even when
> > the IO device comes back to normal.
> > 
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> > 
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> > 
> > At this point, the filesystem can not be unmounted because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
> > 
> > This patch fixes both cases, retrying any item that has been failed
> > previously, using the infra-structure provided by the previous patch.
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
...
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 08cb7d1..2719ac6 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
...
> > @@ -475,6 +476,18 @@ xfs_inode_item_unpin(
> >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> >  }
> >  
> > +/*
> > + * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> > + * have been failed during writeback
> > + */
> > +STATIC void
> > +xfs_inode_item_error(
> > +	struct xfs_log_item	*lip,
> > +	struct xfs_buf		*bp)
> > +{

Also if we're going to keep the ->iop_error() thing around, could we add
an 'ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode))' here?
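
For illustration, the helper as posted with that assert added would look
something like this (untested sketch, reusing only names already in the patch):

	STATIC void
	xfs_inode_item_error(
		struct xfs_log_item	*lip,
		struct xfs_buf		*bp)
	{
		ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
		xfs_set_li_failed(lip, bp);
	}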

Brian

> > +	xfs_set_li_failed(lip, bp);
> > +}
> > +
> >  STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> > @@ -491,6 +504,17 @@ xfs_inode_item_push(
> >  	if (xfs_ipincount(ip) > 0)
> >  		return XFS_ITEM_PINNED;
> >  
> > +	/*
> > +	 * The buffer containing this item failed to be written back
> > +	 * previously. Resubmit the buffer for IO.
> > +	 */
> > +	if (lip->li_flags & XFS_LI_FAILED) {
> > +		if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
> > +			rval = XFS_ITEM_FLUSHING;
> > +
> > +		return rval;
> > +	}
> > +
> >  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> >  		return XFS_ITEM_LOCKED;
> >  
> > @@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> >  	.iop_unlock	= xfs_inode_item_unlock,
> >  	.iop_committed	= xfs_inode_item_committed,
> >  	.iop_push	= xfs_inode_item_push,
> > -	.iop_committing = xfs_inode_item_committing
> > +	.iop_committing = xfs_inode_item_committing,
> > +	.iop_error	= xfs_inode_item_error
> >  };
> >  
> >  
> > @@ -710,7 +735,8 @@ xfs_iflush_done(
> >  		 * the AIL lock.
> >  		 */
> >  		iip = INODE_ITEM(blip);
> > -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> > +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> > +		    lip->li_flags & XFS_LI_FAILED)
> >  			need_ail++;
> >  
> >  		blip = next;
> > @@ -718,7 +744,8 @@ xfs_iflush_done(
> >  
> >  	/* make sure we capture the state of the initial inode. */
> >  	iip = INODE_ITEM(lip);
> > -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> > +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> > +	    lip->li_flags & XFS_LI_FAILED)
> >  		need_ail++;
> >  
> >  	/*
> > @@ -739,6 +766,9 @@ xfs_iflush_done(
> >  			if (INODE_ITEM(blip)->ili_logged &&
> >  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> >  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> > +			else {
> > +				xfs_clear_li_failed(blip);
> > +			}
> >  		}
> >  
> >  		if (mlip_changed) {
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 50df5367..2d7cf91 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
> >  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
> >  	uint				li_type;	/* item type */
> >  	uint				li_flags;	/* misc flags */
> > +	struct xfs_buf			*li_buf;	/* real buffer pointer */
> >  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
> >  	void				(*li_cb)(struct xfs_buf *,
> >  						 struct xfs_log_item *);
> > @@ -72,6 +73,31 @@ typedef struct xfs_log_item {
> >  	{ XFS_LI_ABORTED,	"ABORTED" }, \
> >  	{ XFS_LI_FAILED,	"FAILED" }
> >  
> > +static inline void
> > +xfs_clear_li_failed(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_buf	*bp = lip->li_buf;
> > +
> 
> I think we should assert that ->xa_lock is held in both of these
> helpers. Note that we have to use lockdep_assert_held() for spinlocks.
> 
> It also couldn't hurt to assert that XFS_LI_IN_AIL is set as well (we'd
> just have to reorder the _clear_li_failed() call in xfs_ail_delete_one()
> below).
> 
> > +	if (lip->li_flags & XFS_LI_FAILED) {
> > +		lip->li_flags &= ~XFS_LI_FAILED;
> > +		lip->li_buf = NULL;
> > +		xfs_buf_rele(bp);
> > +	}
> > +}
> > +
> > +static inline void
> > +xfs_set_li_failed(
> > +	struct xfs_log_item	*lip,
> > +	struct xfs_buf		*bp)
> > +{
> > +	if (lip->li_flags & ~XFS_LI_FAILED) {
> 
> I think you want !(lip->li_flags & XFS_LI_FAILED). ;)
> 
> Brian
> 
> > +		xfs_buf_hold(bp);
> > +		lip->li_flags |= XFS_LI_FAILED;
> > +		lip->li_buf = bp;
> > +	}
> > +}
> > +
> >  struct xfs_item_ops {
> >  	void (*iop_size)(xfs_log_item_t *, int *, int *);
> >  	void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *);
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 9056c0f..c41d640 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk(
> >  bool
> >  xfs_ail_delete_one(
> >  	struct xfs_ail		*ailp,
> > -	struct xfs_log_item 	*lip)
> > +	struct xfs_log_item	*lip)
> >  {
> >  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> >  
> >  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
> >  	xfs_ail_delete(ailp, lip);
> >  	lip->li_flags &= ~XFS_LI_IN_AIL;
> > +	xfs_clear_li_failed(lip);
> >  	lip->li_lsn = 0;
> > -
> >  	return mlip == lip;
> >  }
> >  
> > -- 
> > 2.9.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 10:54 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  2017-06-16 11:06   ` Carlos Maiolino
  2017-06-16 18:35   ` Luis R. Rodriguez
@ 2017-06-19 13:49   ` Brian Foster
  2017-06-19 15:09     ` Brian Foster
  2 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-06-19 13:49 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
> 
> This causes unmount operation to hang due these items flush locked in AIL,
> but this also causes the items in AIL to never be written back, even when
> the IO device comes back to normal.
> 
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
> 
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
> 
> At this point, the filesystem can not be unmounted because of the flush locked
> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.
> 
> This patch fixes both cases, retrying any item that has been failed
> previously, using the infra-structure provided by the previous patch.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> V2:
> 	- Fix XFS_LI_FAILED flag removal
> 	- Use atomic operations to set and clear XFS_LI_FAILED flag
> 	- Remove check for XBF_WRITE_FAIL in xfs_inode_item_push
> 	- Add more comments to the code
> 	- Add a helper function to resubmit the failed buffers, so this
> 	  can be also used in dquot system without duplicating code
> 
> V3:
> 	- kill xfs_imap_to_bp call using a pointer in the log item to
> 	  hold the buffer address
> 	- use xa_lock instead of atomic operations to handle log item
> 	  flags
> 	- Add a hold to the buffer for each log item failed
> 	- move buffer resubmission up in xfs_inode_item_push()
> 
> V4:
> 	- Remove bflags argument from iop_error callback
> 	- Remove ip argument from xfs_buf_resubmit_failed_buffers
> 	- Use helpers to set/clear XFS_LI_FAILED flag
> 	- remove ->xa_lock from the iop->error callback and move it up
> 	  on the stack, so all log items are processed into a single
> 	  pair of lock/unlock
> 
>  fs/xfs/xfs_buf_item.c   | 37 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_buf_item.h   |  1 +
>  fs/xfs/xfs_inode_item.c | 36 +++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_trans.h      | 26 ++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_ail.c  |  4 ++--
>  5 files changed, 99 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 4fa68c9..c5d21ea 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1222,3 +1222,40 @@ xfs_buf_iodone(
>  	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
>  	xfs_buf_item_free(BUF_ITEM(lip));
>  }
> +
> +/*
> + * Requeue a failed buffer for writeback
> + *
> + * Return true if the buffer has been re-queued properly, false otherwise
> + */
> +bool
> +xfs_buf_resubmit_failed_buffers(
> +	struct xfs_log_item	*lip,
> +	struct list_head	*buffer_list)
> +{
> +	struct xfs_log_item	*next;
> +	struct xfs_buf		*bp;
> +	bool			ret;
> +
> +	bp = lip->li_buf;
> +
> +	/* Clear XFS_LI_FAILED flag from all items before resubmit

Nit: start of the comment (/*) should be on its own line.

> +	 *
> +	 * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
> +	 * function already have it acquired
> +	 */
> +	for (; lip; lip = next) {
> +		next = lip->li_bio_list;
> +		xfs_clear_li_failed(lip);
> +	}
> +
> +	/* Add this buffer back to the delayed write list */
> +	xfs_buf_lock(bp);

This should be a trylock and we should return XFS_ITEM_LOCKED from the
caller if it fails so xfsaild can make progress rather than block
sitting on a buffer lock (and this is all much cleaner with the
trylock/unlock in the caller).
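
Roughly, an untested sketch of that idea in xfs_inode_item_push() (assuming the
existing xfs_buf_trylock() helper; other names as in this patch), with the
lock/unlock then dropped from xfs_buf_resubmit_failed_buffers() itself:

	if (lip->li_flags & XFS_LI_FAILED) {
		struct xfs_buf	*bp = lip->li_buf;

		if (!xfs_buf_trylock(bp))
			return XFS_ITEM_LOCKED;

		if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
			rval = XFS_ITEM_FLUSHING;

		xfs_buf_unlock(bp);
		return rval;
	}

That way xfsaild just backs off and retries the item later instead of sleeping
on the buffer lock.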

> +	if (!xfs_buf_delwri_queue(bp, buffer_list))
> +		ret = false;
> +	else
> +		ret = true;
> +
> +	xfs_buf_unlock(bp);
> +	return ret;
> +}
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index f7eba99..82c3d64 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -70,6 +70,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      xfs_log_item_t *);
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +bool	xfs_buf_resubmit_failed_buffers(struct xfs_log_item *, struct list_head *);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 08cb7d1..2719ac6 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -27,6 +27,7 @@
>  #include "xfs_error.h"
>  #include "xfs_trace.h"
>  #include "xfs_trans_priv.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  
>  
> @@ -475,6 +476,18 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> +/*
> + * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> + * have been failed during writeback
> + */
> +STATIC void
> +xfs_inode_item_error(
> +	struct xfs_log_item	*lip,
> +	struct xfs_buf		*bp)
> +{
> +	xfs_set_li_failed(lip, bp);
> +}
> +
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -491,6 +504,17 @@ xfs_inode_item_push(
>  	if (xfs_ipincount(ip) > 0)
>  		return XFS_ITEM_PINNED;
>  
> +	/*
> +	 * The buffer containing this item failed to be written back
> +	 * previously. Resubmit the buffer for IO.
> +	 */
> +	if (lip->li_flags & XFS_LI_FAILED) {
> +		if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
> +			rval = XFS_ITEM_FLUSHING;
> +
> +		return rval;
> +	}
> +
>  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
>  		return XFS_ITEM_LOCKED;
>  
> @@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_unlock	= xfs_inode_item_unlock,
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
> -	.iop_committing = xfs_inode_item_committing
> +	.iop_committing = xfs_inode_item_committing,
> +	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> @@ -710,7 +735,8 @@ xfs_iflush_done(
>  		 * the AIL lock.
>  		 */
>  		iip = INODE_ITEM(blip);
> -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> +		    lip->li_flags & XFS_LI_FAILED)
>  			need_ail++;
>  
>  		blip = next;
> @@ -718,7 +744,8 @@ xfs_iflush_done(
>  
>  	/* make sure we capture the state of the initial inode. */
>  	iip = INODE_ITEM(lip);
> -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> +	    lip->li_flags & XFS_LI_FAILED)
>  		need_ail++;
>  
>  	/*
> @@ -739,6 +766,9 @@ xfs_iflush_done(
>  			if (INODE_ITEM(blip)->ili_logged &&
>  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
>  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> +			else {
> +				xfs_clear_li_failed(blip);
> +			}
>  		}
>  
>  		if (mlip_changed) {
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 50df5367..2d7cf91 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
>  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
>  	uint				li_type;	/* item type */
>  	uint				li_flags;	/* misc flags */
> +	struct xfs_buf			*li_buf;	/* real buffer pointer */
>  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
>  	void				(*li_cb)(struct xfs_buf *,
>  						 struct xfs_log_item *);
> @@ -72,6 +73,31 @@ typedef struct xfs_log_item {
>  	{ XFS_LI_ABORTED,	"ABORTED" }, \
>  	{ XFS_LI_FAILED,	"FAILED" }
>  
> +static inline void
> +xfs_clear_li_failed(
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_buf	*bp = lip->li_buf;
> +

I think we should assert that ->xa_lock is held in both of these
helpers. Note that we have to use lockdep_assert_held() for spinlocks.

It also couldn't hurt to assert that XFS_LI_IN_AIL is set as well (we'd
just have to reorder the _clear_li_failed() call in xfs_ail_delete_one()
below).
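
For the clear side, an untested sketch of those asserts (assuming the AIL lock
is reachable as lip->li_ailp->xa_lock, and with the reorder in
xfs_ail_delete_one() mentioned above):

	static inline void
	xfs_clear_li_failed(
		struct xfs_log_item	*lip)
	{
		struct xfs_buf	*bp = lip->li_buf;

		lockdep_assert_held(&lip->li_ailp->xa_lock);
		ASSERT(lip->li_flags & XFS_LI_IN_AIL);

		if (lip->li_flags & XFS_LI_FAILED) {
			lip->li_flags &= ~XFS_LI_FAILED;
			lip->li_buf = NULL;
			xfs_buf_rele(bp);
		}
	}

with a matching lockdep_assert_held() at the top of xfs_set_li_failed().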

> +	if (lip->li_flags & XFS_LI_FAILED) {
> +		lip->li_flags &= ~XFS_LI_FAILED;
> +		lip->li_buf = NULL;
> +		xfs_buf_rele(bp);
> +	}
> +}
> +
> +static inline void
> +xfs_set_li_failed(
> +	struct xfs_log_item	*lip,
> +	struct xfs_buf		*bp)
> +{
> +	if (lip->li_flags & ~XFS_LI_FAILED) {

I think you want !(lip->li_flags & XFS_LI_FAILED). ;)
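
As written, the test fires whenever any flag *other* than XFS_LI_FAILED is set,
so the helper takes another buffer hold every time it is called on an already
failed item. Untested sketch of the corrected check:

	if (!(lip->li_flags & XFS_LI_FAILED)) {
		xfs_buf_hold(bp);
		lip->li_flags |= XFS_LI_FAILED;
		lip->li_buf = bp;
	}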

Brian

> +		xfs_buf_hold(bp);
> +		lip->li_flags |= XFS_LI_FAILED;
> +		lip->li_buf = bp;
> +	}
> +}
> +
>  struct xfs_item_ops {
>  	void (*iop_size)(xfs_log_item_t *, int *, int *);
>  	void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9056c0f..c41d640 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk(
>  bool
>  xfs_ail_delete_one(
>  	struct xfs_ail		*ailp,
> -	struct xfs_log_item 	*lip)
> +	struct xfs_log_item	*lip)
>  {
>  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
>  	lip->li_flags &= ~XFS_LI_IN_AIL;
> +	xfs_clear_li_failed(lip);
>  	lip->li_lsn = 0;
> -
>  	return mlip == lip;
>  }
>  
> -- 
> 2.9.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 19:45         ` Eric Sandeen
@ 2017-06-19 10:59           ` Brian Foster
  2017-06-20 16:52             ` Luis R. Rodriguez
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2017-06-19 10:59 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Luis R. Rodriguez, Darrick J. Wong, Carlos Maiolino, linux-xfs

On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote:
> On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> >>> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> >>>> When a buffer has been failed during writeback, the inode items into it
> >>>> are kept flush locked, and are never resubmitted due the flush lock, so,
> >>>> if any buffer fails to be written, the items in AIL are never written to
> >>>> disk and never unlocked.
> >>>>
> >>>> This causes unmount operation to hang due these items flush locked in AIL,
> >>>
> >>> What type of hang? If it has occurred in production is there a trace somewhere?
> >>> what does it look like?
> >>>
> >>> You said you would work on an xfstest for this, how's that going? Otherewise
> >>> a commit log description of how to reproduce would be useful.
> >>
> >> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
> >> reproduce -- create a thinp device, format XFS, fill it up, and try to
> >> unmount.
> > 
> > Well he did mention to create a Dm-thin device with a fileystem larger than
> > the real device. You seem to have say just filling up a filsystem?
> 
> No - the case is filling up the /thinp device/, not the filesystem.
> 
> > 
> > Do both cases trigger the issue?
> 
> This whole issue has to do with errors from the block device during writeback;
> that's why the thin device is involved.  When the backing store fills, we
> see the IO errors that cause this problem.
> 
> >> I don't think there /is/ much of a trace, just xfsaild doing
> >> nothing and a bunch of ail items that are flush locked and stuck that way.
> > 
> > OK so no hung task seek, no crash, just a system call that never ends?
> > 
> > Is the issue recoverable? So unmount just never completes? Can we CTRL-C
> > (SIGINT) out of it?
> 
> No, you can't ctrl-c a stuck kernel.
> 
> >>>> but this also causes the items in AIL to never be written back, even when
> >>>> the IO device comes back to normal.
> >>>>
> >>>> I've been testing this patch with a DM-thin device, creating a
> >>>> filesystem larger than the real device.
> >>>>
> >>>> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> >>>> errors from the device, and keep spinning on xfsaild (when 'retry
> >>>> forever' configuration is set).
> >>>>
> >>>> At this point, the filesystem can not be unmounted because of the flush locked
> >>>> items in AIL, but worse, the items in AIL are never retried at all
> >>>> (once xfs_inode_item_push() will skip the items that are flush locked),
> >>>> even if the underlying DM-thin device is expanded to the proper size.
> >>>
> >>> Jeesh.
> >>>
> >>> If the above issue is a real hang, shoudl we not consider a sensible stable fix
> >>> to start off with ?
> >>
> >> Huh?  I thought this series is supposed to be the fix.
> > 
> > It seems like a rather large set of changes, if the issue was sevee I was hoping
> > for a stable candidate fix first. If its not fixing a severe issue then sure.
> 
> Fixes go uptream first, then to stable kernels if appropriate, right?
> 
> But not every fix is a trivial one-liner.  I don't think there is any simpler
> fix to be had, here.
> 

Yeah, this is kind of intended to be the simplest fix available as far
as I'm aware. TBH, I don't think the fix here is fundamentally that
complex (define a state for already flushed/failed log items), but
rather the code that needs to be modified to implement it has various
states and corner cases to manage that make it tricky to get right.

If we truly needed a stable worthy fix in short order, that would
probably be to revert ac8809f9a ("xfs: abort metadata writeback on
permanent errors"), which caused this regression by making the AIL
responsible for failed retries. A couple caveats I can think of with
that are that 1.) this would likely short circuit the recently added
error configuration api (which is kind of irrelevant if the underlying
error management code doesn't work correctly, but suggests that should
be reverted as well) and 2.) in doing so, this reverts back to the
hardcoded infinite retry behavior in XFS. That means that transient
errors may eventually recover, but the thinp enospc use case and whatnot
are still going to hang on umount. 

It hasn't seemed necessary to me to take that approach given the lower
prevalence of the issue and the fact that a solution had been worked out
for a while now. Though I suppose the longer we go without a fix in
place, the stronger the argument for something like that becomes.

Brian

> -Eric
> 
> >   Luis
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 19:37       ` Luis R. Rodriguez
@ 2017-06-16 19:45         ` Eric Sandeen
  2017-06-19 10:59           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Eric Sandeen @ 2017-06-16 19:45 UTC (permalink / raw)
  To: Luis R. Rodriguez, Darrick J. Wong; +Cc: Carlos Maiolino, linux-xfs

On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
>>> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
>>>> When a buffer has been failed during writeback, the inode items into it
>>>> are kept flush locked, and are never resubmitted due the flush lock, so,
>>>> if any buffer fails to be written, the items in AIL are never written to
>>>> disk and never unlocked.
>>>>
>>>> This causes unmount operation to hang due these items flush locked in AIL,
>>>
>>> What type of hang? If it has occurred in production is there a trace somewhere?
>>> what does it look like?
>>>
>>> You said you would work on an xfstest for this, how's that going? Otherewise
>>> a commit log description of how to reproduce would be useful.
>>
>> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
>> reproduce -- create a thinp device, format XFS, fill it up, and try to
>> unmount.
> 
> Well he did mention to create a Dm-thin device with a fileystem larger than
> the real device. You seem to have say just filling up a filsystem?

No - the case is filling up the /thinp device/, not the filesystem.

> 
> Do both cases trigger the issue?

This whole issue has to do with errors from the block device during writeback;
that's why the thin device is involved.  When the backing store fills, we
see the IO errors that cause this problem.

>> I don't think there /is/ much of a trace, just xfsaild doing
>> nothing and a bunch of ail items that are flush locked and stuck that way.
> 
> OK so no hung task seek, no crash, just a system call that never ends?
> 
> Is the issue recoverable? So unmount just never completes? Can we CTRL-C
> (SIGINT) out of it?

No, you can't ctrl-c a stuck kernel.

>>>> but this also causes the items in AIL to never be written back, even when
>>>> the IO device comes back to normal.
>>>>
>>>> I've been testing this patch with a DM-thin device, creating a
>>>> filesystem larger than the real device.
>>>>
>>>> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
>>>> errors from the device, and keep spinning on xfsaild (when 'retry
>>>> forever' configuration is set).
>>>>
>>>> At this point, the filesystem can not be unmounted because of the flush locked
>>>> items in AIL, but worse, the items in AIL are never retried at all
>>>> (once xfs_inode_item_push() will skip the items that are flush locked),
>>>> even if the underlying DM-thin device is expanded to the proper size.
>>>
>>> Jeesh.
>>>
>>> If the above issue is a real hang, shoudl we not consider a sensible stable fix
>>> to start off with ?
>>
>> Huh?  I thought this series is supposed to be the fix.
> 
> It seems like a rather large set of changes, if the issue was sevee I was hoping
> for a stable candidate fix first. If its not fixing a severe issue then sure.

Fixes go upstream first, then to stable kernels if appropriate, right?

But not every fix is a trivial one-liner. I don't think there is any simpler
fix to be had here.

-Eric

>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 19:24     ` Darrick J. Wong
@ 2017-06-16 19:37       ` Luis R. Rodriguez
  2017-06-16 19:45         ` Eric Sandeen
  0 siblings, 1 reply; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-16 19:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Luis R. Rodriguez, Carlos Maiolino, linux-xfs

On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > > When a buffer has been failed during writeback, the inode items into it
> > > are kept flush locked, and are never resubmitted due the flush lock, so,
> > > if any buffer fails to be written, the items in AIL are never written to
> > > disk and never unlocked.
> > > 
> > > This causes unmount operation to hang due these items flush locked in AIL,
> > 
> > What type of hang? If it has occurred in production is there a trace somewhere?
> > what does it look like?
> > 
> > You said you would work on an xfstest for this, how's that going? Otherewise
> > a commit log description of how to reproduce would be useful.
> 
> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
> reproduce -- create a thinp device, format XFS, fill it up, and try to
> unmount.

Well, he did mention creating a DM-thin device with a filesystem larger than
the real device. You seem to be saying just to fill up a filesystem?

Do both cases trigger the issue?

> I don't think there /is/ much of a trace, just xfsaild doing
> nothing and a bunch of ail items that are flush locked and stuck that way.

OK, so no hung task report, no crash, just a system call that never ends?

Is the issue recoverable? So unmount just never completes? Can we CTRL-C
(SIGINT) out of it?

> > > but this also causes the items in AIL to never be written back, even when
> > > the IO device comes back to normal.
> > > 
> > > I've been testing this patch with a DM-thin device, creating a
> > > filesystem larger than the real device.
> > > 
> > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > > errors from the device, and keep spinning on xfsaild (when 'retry
> > > forever' configuration is set).
> > > 
> > > At this point, the filesystem can not be unmounted because of the flush locked
> > > items in AIL, but worse, the items in AIL are never retried at all
> > > (once xfs_inode_item_push() will skip the items that are flush locked),
> > > even if the underlying DM-thin device is expanded to the proper size.
> > 
> > Jeesh.
> > 
> > If the above issue is a real hang, shoudl we not consider a sensible stable fix
> > to start off with ?
> 
> Huh?  I thought this series is supposed to be the fix.

It seems like a rather large set of changes; if the issue is severe I was hoping
for a stable candidate fix first. If it's not fixing a severe issue then sure.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 18:35   ` Luis R. Rodriguez
@ 2017-06-16 19:24     ` Darrick J. Wong
  2017-06-16 19:37       ` Luis R. Rodriguez
  2017-06-20  7:01     ` Carlos Maiolino
  1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2017-06-16 19:24 UTC (permalink / raw)
  To: Luis R. Rodriguez; +Cc: Carlos Maiolino, linux-xfs

On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> > 
> > This causes unmount operation to hang due these items flush locked in AIL,
> 
> What type of hang? If it has occurred in production is there a trace somewhere?
> what does it look like?
> 
> You said you would work on an xfstest for this, how's that going? Otherewise
> a commit log description of how to reproduce would be useful.

I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
reproduce -- create a thinp device, format XFS, fill it up, and try to
unmount.  I don't think there /is/ much of a trace, just xfsaild doing
nothing and a bunch of ail items that are flush locked and stuck that way.

> > but this also causes the items in AIL to never be written back, even when
> > the IO device comes back to normal.
> > 
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> > 
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> > 
> > At this point, the filesystem can not be unmounted because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
> 
> Jeesh.
> 
> If the above issue is a real hang, shoudl we not consider a sensible stable fix
> to start off with ?

Huh?  I thought this series is supposed to be the fix.

--D

> 
>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 10:54 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
  2017-06-16 11:06   ` Carlos Maiolino
@ 2017-06-16 18:35   ` Luis R. Rodriguez
  2017-06-16 19:24     ` Darrick J. Wong
  2017-06-20  7:01     ` Carlos Maiolino
  2017-06-19 13:49   ` Brian Foster
  2 siblings, 2 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-06-16 18:35 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs

On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
> 
> This causes unmount operation to hang due these items flush locked in AIL,

What type of hang? If it has occurred in production, is there a trace somewhere?
What does it look like?

You said you would work on an xfstest for this, how's that going? Otherwise
a commit log description of how to reproduce would be useful.

> but this also causes the items in AIL to never be written back, even when
> the IO device comes back to normal.
> 
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
> 
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
> 
> At this point, the filesystem can not be unmounted because of the flush locked
> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.

Jeesh.

If the above issue is a real hang, should we not consider a sensible stable fix
to start off with?

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 10:54 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
@ 2017-06-16 11:06   ` Carlos Maiolino
  2017-06-16 18:35   ` Luis R. Rodriguez
  2017-06-19 13:49   ` Brian Foster
  2 siblings, 0 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-06-16 11:06 UTC (permalink / raw)
  To: linux-xfs

I forgot to add "V4" to the subject here, but the patch is still the V4 one

-- 
Carlos

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback
  2017-06-16 10:54 [PATCH 0/2 V4] Resubmit items failed during writeback Carlos Maiolino
@ 2017-06-16 10:54 ` Carlos Maiolino
  2017-06-16 11:06   ` Carlos Maiolino
                     ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Carlos Maiolino @ 2017-06-16 10:54 UTC (permalink / raw)
  To: linux-xfs

When a buffer fails during writeback, the inode items in it are kept
flush locked and are never resubmitted due to the flush lock, so, if any
buffer fails to be written, the items in the AIL are never written to
disk and never unlocked.

This causes the unmount operation to hang because of these flush-locked
items in the AIL, but it also causes the items in the AIL to never be
written back, even when the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device and keeps spinning on xfsaild (when the 'retry
forever' configuration is set).

At this point, the filesystem cannot be unmounted because of the flush-locked
items in the AIL, but worse, the items in the AIL are never retried at all
(since xfs_inode_item_push() will skip the items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that previously failed,
using the infrastructure provided by the previous patch.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
V2:
	- Fix XFS_LI_FAILED flag removal
	- Use atomic operations to set and clear XFS_LI_FAILED flag
	- Remove check for XBF_WRITE_FAIL in xfs_inode_item_push
	- Add more comments to the code
	- Add a helper function to resubmit the failed buffers, so this
	  can be also used in dquot system without duplicating code

V3:
	- kill xfs_imap_to_bp call using a pointer in the log item to
	  hold the buffer address
	- use xa_lock instead of atomic operations to handle log item
	  flags
	- Add a hold to the buffer for each log item failed
	- move buffer resubmission up in xfs_inode_item_push()

V4:
	- Remove bflags argument from iop_error callback
	- Remove ip argument from xfs_buf_resubmit_failed_buffers
	- Use helpers to set/clear XFS_LI_FAILED flag
	- remove ->xa_lock from the iop->error callback and move it up
	  on the stack, so all log items are processed into a single
	  pair of lock/unlock

 fs/xfs/xfs_buf_item.c   | 37 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf_item.h   |  1 +
 fs/xfs/xfs_inode_item.c | 36 +++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trans.h      | 26 ++++++++++++++++++++++++++
 fs/xfs/xfs_trans_ail.c  |  4 ++--
 5 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4fa68c9..c5d21ea 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1222,3 +1222,40 @@ xfs_buf_iodone(
 	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
 	xfs_buf_item_free(BUF_ITEM(lip));
 }
+
+/*
+ * Requeue a failed buffer for writeback
+ *
+ * Return true if the buffer has been re-queued properly, false otherwise
+ */
+bool
+xfs_buf_resubmit_failed_buffers(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	struct xfs_log_item	*next;
+	struct xfs_buf		*bp;
+	bool			ret;
+
+	bp = lip->li_buf;
+
+	/* Clear XFS_LI_FAILED flag from all items before resubmit
+	 *
+	 * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
+	 * function already have it acquired
+	 */
+	for (; lip; lip = next) {
+		next = lip->li_bio_list;
+		xfs_clear_li_failed(lip);
+	}
+
+	/* Add this buffer back to the delayed write list */
+	xfs_buf_lock(bp);
+	if (!xfs_buf_delwri_queue(bp, buffer_list))
+		ret = false;
+	else
+		ret = true;
+
+	xfs_buf_unlock(bp);
+	return ret;
+}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index f7eba99..82c3d64 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -70,6 +70,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      xfs_log_item_t *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+bool	xfs_buf_resubmit_failed_buffers(struct xfs_log_item *, struct list_head *);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 08cb7d1..2719ac6 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -27,6 +27,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
 #include "xfs_log.h"
 
 
@@ -475,6 +476,18 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
+/*
+ * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
+ * have been failed during writeback
+ */
+STATIC void
+xfs_inode_item_error(
+	struct xfs_log_item	*lip,
+	struct xfs_buf		*bp)
+{
+	xfs_set_li_failed(lip, bp);
+}
+
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -491,6 +504,17 @@ xfs_inode_item_push(
 	if (xfs_ipincount(ip) > 0)
 		return XFS_ITEM_PINNED;
 
+	/*
+	 * The buffer containing this item failed to be written back
+	 * previously. Resubmit the buffer for IO.
+	 */
+	if (lip->li_flags & XFS_LI_FAILED) {
+		if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+
+		return rval;
+	}
+
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 		return XFS_ITEM_LOCKED;
 
@@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_committing = xfs_inode_item_committing
+	.iop_committing = xfs_inode_item_committing,
+	.iop_error	= xfs_inode_item_error
 };
 
 
@@ -710,7 +735,8 @@ xfs_iflush_done(
 		 * the AIL lock.
 		 */
 		iip = INODE_ITEM(blip);
-		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
+		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		    lip->li_flags & XFS_LI_FAILED)
 			need_ail++;
 
 		blip = next;
@@ -718,7 +744,8 @@ xfs_iflush_done(
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
+	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	    lip->li_flags & XFS_LI_FAILED)
 		need_ail++;
 
 	/*
@@ -739,6 +766,9 @@ xfs_iflush_done(
 			if (INODE_ITEM(blip)->ili_logged &&
 			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
 				mlip_changed |= xfs_ail_delete_one(ailp, blip);
+			else {
+				xfs_clear_li_failed(blip);
+			}
 		}
 
 		if (mlip_changed) {
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 50df5367..2d7cf91 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -49,6 +49,7 @@ typedef struct xfs_log_item {
 	struct xfs_ail			*li_ailp;	/* ptr to AIL */
 	uint				li_type;	/* item type */
 	uint				li_flags;	/* misc flags */
+	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct xfs_log_item		*li_bio_list;	/* buffer item list */
 	void				(*li_cb)(struct xfs_buf *,
 						 struct xfs_log_item *);
@@ -72,6 +73,31 @@ typedef struct xfs_log_item {
 	{ XFS_LI_ABORTED,	"ABORTED" }, \
 	{ XFS_LI_FAILED,	"FAILED" }
 
+static inline void
+xfs_clear_li_failed(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_buf	*bp = lip->li_buf;
+
+	if (lip->li_flags & XFS_LI_FAILED) {
+		lip->li_flags &= ~XFS_LI_FAILED;
+		lip->li_buf = NULL;
+		xfs_buf_rele(bp);
+	}
+}
+
+static inline void
+xfs_set_li_failed(
+	struct xfs_log_item	*lip,
+	struct xfs_buf		*bp)
+{
+	if (lip->li_flags & ~XFS_LI_FAILED) {
+		xfs_buf_hold(bp);
+		lip->li_flags |= XFS_LI_FAILED;
+		lip->li_buf = bp;
+	}
+}
+
 struct xfs_item_ops {
 	void (*iop_size)(xfs_log_item_t *, int *, int *);
 	void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..c41d640 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk(
 bool
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
-	struct xfs_log_item 	*lip)
+	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
 	lip->li_flags &= ~XFS_LI_IN_AIL;
+	xfs_clear_li_failed(lip);
 	lip->li_lsn = 0;
-
 	return mlip == lip;
 }
 
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

end of thread

Thread overview: 44+ messages
2017-05-11 13:57 [PATCH 0/2] Resubmit items failed during writeback Carlos Maiolino
2017-05-11 13:57 ` [PATCH 1/2] xfs: Add infrastructure needed for error propagation during buffer IO failure Carlos Maiolino
2017-05-11 16:51   ` Brian Foster
2017-05-12  8:41     ` Carlos Maiolino
2017-05-12 11:37       ` Brian Foster
2017-05-11 13:57 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
2017-05-11 15:32   ` Eric Sandeen
2017-05-12  8:19     ` Carlos Maiolino
2017-05-11 17:08   ` Brian Foster
2017-05-12  8:21     ` Carlos Maiolino
2017-05-12 11:37       ` Brian Foster
2017-05-17 11:47         ` Carlos Maiolino
2017-05-17  0:57   ` Dave Chinner
2017-05-17 10:41     ` Carlos Maiolino
2017-05-19  0:22       ` Dave Chinner
2017-05-19 11:27         ` Brian Foster
2017-05-19 23:39           ` Dave Chinner
2017-05-20 11:46             ` Brian Foster
2017-05-21 23:19               ` Dave Chinner
2017-05-22 12:51                 ` Brian Foster
2017-05-23 11:23                   ` Dave Chinner
2017-05-23 16:22                     ` Brian Foster
2017-05-24  1:06                       ` Dave Chinner
2017-05-24 12:42                         ` Brian Foster
2017-05-24 13:26                           ` Carlos Maiolino
2017-05-24 17:08                             ` Brian Foster
2017-06-16 10:54 [PATCH 0/2 V4] Resubmit items failed during writeback Carlos Maiolino
2017-06-16 10:54 ` [PATCH 2/2] xfs: Properly retry failed inode items in case of error during buffer writeback Carlos Maiolino
2017-06-16 11:06   ` Carlos Maiolino
2017-06-16 18:35   ` Luis R. Rodriguez
2017-06-16 19:24     ` Darrick J. Wong
2017-06-16 19:37       ` Luis R. Rodriguez
2017-06-16 19:45         ` Eric Sandeen
2017-06-19 10:59           ` Brian Foster
2017-06-20 16:52             ` Luis R. Rodriguez
2017-06-20 17:20               ` Brian Foster
2017-06-20 18:05                 ` Luis R. Rodriguez
2017-06-21 10:10                   ` Brian Foster
2017-06-21 15:25                     ` Luis R. Rodriguez
2017-06-20 18:38                 ` Luis R. Rodriguez
2017-06-20  7:01     ` Carlos Maiolino
2017-06-20 16:24       ` Luis R. Rodriguez
2017-06-21 11:51         ` Carlos Maiolino
2017-06-19 13:49   ` Brian Foster
2017-06-19 15:09     ` Brian Foster
