* [PATCH v2 00/13] xfs: flush related error handling cleanups
@ 2020-04-22 17:54 Brian Foster
  2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
                   ` (12 more replies)
  0 siblings, 13 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's a v2 of the various error handling cleanup patches. I opted to
retain and slightly rework the xfs_qm_dqflush() error handling patch
into something that seems a bit more graceful by simply expanding the
error path to include the buffer association. The dqflush verifier check
has been fixed up to cover the in-core structure instead of being
removed (note that this results in a small tweak to patch 7, but I
retained the R-b tags from v1). Finally, a couple new patches are
inserted to combine the AIL item removal functions and remove unused
shutdown types. Various other small changes are noted in the changelog
below.

Thoughts, reviews, flames appreciated.

Brian

git repo: https://github.com/bsfost/linux-xfs/tree/xfs-flush-error-handling-cleanups-v2

v2:
- Rename some helper functions.
- Fix up dquot flush verifier instead of removing it.
- Drop quotaoff push handler removal patch.
- Reuse existing ratelimit state for buffer error messages.
- Combine AIL removal helpers.
- Refactor iflush error handling rework to update log item.
- Remove unused shutdown types.
v1: https://lore.kernel.org/linux-xfs/20200417150859.14734-1-bfoster@redhat.com/

Brian Foster (13):
  xfs: refactor failed buffer resubmission into xfsaild
  xfs: factor out buffer I/O failure simulation code
  xfs: fallthru to buffer attach on error and simplify error handling
  xfs: remove unnecessary shutdown check from xfs_iflush()
  xfs: ratelimit unmount time per-buffer I/O error message
  xfs: fix duplicate verification from xfs_qm_dqflush()
  xfs: abort consistently on dquot flush failure
  xfs: elide the AIL lock on log item failure tracking
  xfs: clean up AIL log item removal functions
  xfs: combine xfs_trans_ail_[remove|delete]()
  xfs: remove unused iflush stale parameter
  xfs: random buffer write failure errortag
  xfs: remove unused shutdown types

 fs/xfs/libxfs/xfs_errortag.h  |   4 +-
 fs/xfs/libxfs/xfs_inode_buf.c |   7 +-
 fs/xfs/xfs_bmap_item.c        |   2 +-
 fs/xfs/xfs_buf.c              |  42 ++++++++++--
 fs/xfs/xfs_buf.h              |   2 +
 fs/xfs/xfs_buf_item.c         |  96 ++++----------------------
 fs/xfs/xfs_buf_item.h         |   2 -
 fs/xfs/xfs_dquot.c            |  84 +++++++++--------------
 fs/xfs/xfs_dquot_item.c       |  17 +----
 fs/xfs/xfs_error.c            |   3 +
 fs/xfs/xfs_extfree_item.c     |   2 +-
 fs/xfs/xfs_fsops.c            |   5 +-
 fs/xfs/xfs_icache.c           |   2 +-
 fs/xfs/xfs_inode.c            | 124 ++++++++++------------------------
 fs/xfs/xfs_inode_item.c       |  39 ++---------
 fs/xfs/xfs_inode_item.h       |   2 +-
 fs/xfs/xfs_mount.h            |   2 -
 fs/xfs/xfs_refcount_item.c    |   2 +-
 fs/xfs/xfs_rmap_item.c        |   2 +-
 fs/xfs/xfs_trans_ail.c        |  79 ++++++++++++++++------
 fs/xfs/xfs_trans_priv.h       |  23 +------
 21 files changed, 200 insertions(+), 341 deletions(-)

-- 
2.21.1



* [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:09   ` Dave Chinner
  2020-04-25 17:21   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

Flush-locked log items whose underlying buffers fail metadata
writeback are tagged with a special flag to indicate that the flush
lock is already held. This is currently implemented in the
type-specific ->iop_push() callbacks, but the processing required for
such items is not type-specific because we're only doing basic state
management on the underlying buffer.

Factor the failed log item handling out of the inode and dquot
->iop_push() callbacks and open code the buffer resubmit logic in a
single helper called from xfsaild_push_item(). This provides a
generic mechanism for handling failed metadata buffer writeback with
a bit less code.
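
For reference, the resulting push-time dispatch boils down to the
following condensed sketch of the hunks below (not standalone code;
it relies on the kernel-internal helpers shown in the diff):

static inline uint
xfsaild_push_item(
	struct xfs_ail		*ailp,
	struct xfs_log_item	*lip)
{
	...
	/* failed items take the generic resubmit path, not ->iop_push() */
	if (test_bit(XFS_LI_FAILED, &lip->li_flags))
		return xfsaild_resubmit_item(lip, &ailp->ail_buf_list);
	return lip->li_ops->iop_push(lip, &ailp->ail_buf_list);
}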

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf_item.c   | 39 ---------------------------------------
 fs/xfs/xfs_buf_item.h   |  2 --
 fs/xfs/xfs_dquot_item.c | 15 ---------------
 fs/xfs/xfs_inode_item.c | 15 ---------------
 fs/xfs/xfs_trans_ail.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 41 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 1545657c3ca0..8796adde2d12 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1248,42 +1248,3 @@ xfs_buf_iodone(
 	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
 	xfs_buf_item_free(BUF_ITEM(lip));
 }
-
-/*
- * Requeue a failed buffer for writeback.
- *
- * We clear the log item failed state here as well, but we have to be careful
- * about reference counts because the only active reference counts on the buffer
- * may be the failed log items. Hence if we clear the log item failed state
- * before queuing the buffer for IO we can release all active references to
- * the buffer and free it, leading to use after free problems in
- * xfs_buf_delwri_queue. It makes no difference to the buffer or log items which
- * order we process them in - the buffer is locked, and we own the buffer list
- * so nothing on them is going to change while we are performing this action.
- *
- * Hence we can safely queue the buffer for IO before we clear the failed log
- * item state, therefore  always having an active reference to the buffer and
- * avoiding the transient zero-reference state that leads to use-after-free.
- *
- * Return true if the buffer was added to the buffer list, false if it was
- * already on the buffer list.
- */
-bool
-xfs_buf_resubmit_failed_buffers(
-	struct xfs_buf		*bp,
-	struct list_head	*buffer_list)
-{
-	struct xfs_log_item	*lip;
-	bool			ret;
-
-	ret = xfs_buf_delwri_queue(bp, buffer_list);
-
-	/*
-	 * XFS_LI_FAILED set/clear is protected by ail_lock, caller of this
-	 * function already have it acquired
-	 */
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
-		xfs_clear_li_failed(lip);
-
-	return ret;
-}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 30114b510332..c9c57e2da932 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -59,8 +59,6 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      struct xfs_log_item *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
-bool	xfs_buf_resubmit_failed_buffers(struct xfs_buf *,
-					struct list_head *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index baad1748d0d1..5a7808299a32 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -145,21 +145,6 @@ xfs_qm_dquot_logitem_push(
 	if (atomic_read(&dqp->q_pincount) > 0)
 		return XFS_ITEM_PINNED;
 
-	/*
-	 * The buffer containing this item failed to be written back
-	 * previously. Resubmit the buffer for IO
-	 */
-	if (test_bit(XFS_LI_FAILED, &lip->li_flags)) {
-		if (!xfs_buf_trylock(bp))
-			return XFS_ITEM_LOCKED;
-
-		if (!xfs_buf_resubmit_failed_buffers(bp, buffer_list))
-			rval = XFS_ITEM_FLUSHING;
-
-		xfs_buf_unlock(bp);
-		return rval;
-	}
-
 	if (!xfs_dqlock_nowait(dqp))
 		return XFS_ITEM_LOCKED;
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f779cca2346f..1d4d256a2e96 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -497,21 +497,6 @@ xfs_inode_item_push(
 	if (xfs_ipincount(ip) > 0)
 		return XFS_ITEM_PINNED;
 
-	/*
-	 * The buffer containing this item failed to be written back
-	 * previously. Resubmit the buffer for IO.
-	 */
-	if (test_bit(XFS_LI_FAILED, &lip->li_flags)) {
-		if (!xfs_buf_trylock(bp))
-			return XFS_ITEM_LOCKED;
-
-		if (!xfs_buf_resubmit_failed_buffers(bp, buffer_list))
-			rval = XFS_ITEM_FLUSHING;
-
-		xfs_buf_unlock(bp);
-		return rval;
-	}
-
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 		return XFS_ITEM_LOCKED;
 
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 564253550b75..2574d01e4a83 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -345,6 +345,45 @@ xfs_ail_delete(
 	xfs_trans_ail_cursor_clear(ailp, lip);
 }
 
+/*
+ * Requeue a failed buffer for writeback.
+ *
+ * We clear the log item failed state here as well, but we have to be careful
+ * about reference counts because the only active reference counts on the buffer
+ * may be the failed log items. Hence if we clear the log item failed state
+ * before queuing the buffer for IO we can release all active references to
+ * the buffer and free it, leading to use after free problems in
+ * xfs_buf_delwri_queue. It makes no difference to the buffer or log items which
+ * order we process them in - the buffer is locked, and we own the buffer list
+ * so nothing on them is going to change while we are performing this action.
+ *
+ * Hence we can safely queue the buffer for IO before we clear the failed log
+ * item state, therefore  always having an active reference to the buffer and
+ * avoiding the transient zero-reference state that leads to use-after-free.
+ */
+static inline int
+xfsaild_resubmit_item(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	struct xfs_buf		*bp = lip->li_buf;
+
+	if (!xfs_buf_trylock(bp))
+		return XFS_ITEM_LOCKED;
+
+	if (!xfs_buf_delwri_queue(bp, buffer_list)) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_FLUSHING;
+	}
+
+	/* protected by ail_lock */
+	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
+		xfs_clear_li_failed(lip);
+
+	xfs_buf_unlock(bp);
+	return XFS_ITEM_SUCCESS;
+}
+
 static inline uint
 xfsaild_push_item(
 	struct xfs_ail		*ailp,
@@ -365,6 +404,8 @@ xfsaild_push_item(
 	 */
 	if (!lip->li_ops->iop_push)
 		return XFS_ITEM_PINNED;
+	if (test_bit(XFS_LI_FAILED, &lip->li_flags))
+		return xfsaild_resubmit_item(lip, &ailp->ail_buf_list);
 	return lip->li_ops->iop_push(lip, &ailp->ail_buf_list);
 }
 
-- 
2.21.1



* [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
  2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:10   ` Dave Chinner
  2020-04-25 17:23   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

We use the same buffer I/O failure simulation code in a few
different places. It's not much code, but it's not necessarily
self-explanatory. Factor it into a helper and document it in one
place.
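
As a rough sketch of the intent (condensed from the hunks below, not
standalone code), the open-coded sequences collapse to a single
helper call:

void
xfs_buf_ioend_fail(
	struct xfs_buf	*bp,
	int		flags)		/* e.g. XBF_ASYNC, or 0 */
{
	bp->b_flags |= flags;
	bp->b_flags &= ~XBF_DONE;
	xfs_buf_stale(bp);
	xfs_buf_ioerror(bp, -EIO);
	xfs_buf_ioend(bp);
}

/* e.g. the shutdown case in __xfs_buf_submit() becomes: */
if (XFS_FORCED_SHUTDOWN(bp->b_mount)) {
	xfs_buf_ioend_fail(bp, 0);
	return -EIO;
}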

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c      | 23 +++++++++++++++++++----
 fs/xfs/xfs_buf.h      |  1 +
 fs/xfs/xfs_buf_item.c | 22 +++-------------------
 fs/xfs/xfs_inode.c    |  7 +------
 4 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9ec3eaf1c618..7a6bc617f0a9 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1248,6 +1248,24 @@ xfs_buf_ioerror_alert(
 			-bp->b_error);
 }
 
+/*
+ * To simulate an I/O failure, the buffer must be locked and held with at least
+ * three references. The LRU reference is dropped by the stale call. The buf
+ * item reference is dropped via ioend processing. The third reference is owned
+ * by the caller and is dropped on I/O completion if the buffer is XBF_ASYNC.
+ */
+void
+xfs_buf_ioend_fail(
+	struct xfs_buf	*bp,
+	int		flags)
+{
+	bp->b_flags |= flags;
+	bp->b_flags &= ~XBF_DONE;
+	xfs_buf_stale(bp);
+	xfs_buf_ioerror(bp, -EIO);
+	xfs_buf_ioend(bp);
+}
+
 int
 xfs_bwrite(
 	struct xfs_buf		*bp)
@@ -1480,10 +1498,7 @@ __xfs_buf_submit(
 
 	/* on shutdown we stale and complete the buffer immediately */
 	if (XFS_FORCED_SHUTDOWN(bp->b_mount)) {
-		xfs_buf_ioerror(bp, -EIO);
-		bp->b_flags &= ~XBF_DONE;
-		xfs_buf_stale(bp);
-		xfs_buf_ioend(bp);
+		xfs_buf_ioend_fail(bp, 0);
 		return -EIO;
 	}
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 9a04c53c2488..598b93b17d95 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -263,6 +263,7 @@ extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
 extern void xfs_buf_ioerror_alert(struct xfs_buf *bp, xfs_failaddr_t fa);
+void xfs_buf_ioend_fail(struct xfs_buf *, int);
 
 extern int __xfs_buf_submit(struct xfs_buf *bp, bool);
 static inline int xfs_buf_submit(struct xfs_buf *bp)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8796adde2d12..e34298227f87 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -471,28 +471,12 @@ xfs_buf_item_unpin(
 		xfs_buf_relse(bp);
 	} else if (freed && remove) {
 		/*
-		 * There are currently two references to the buffer - the active
-		 * LRU reference and the buf log item. What we are about to do
-		 * here - simulate a failed IO completion - requires 3
-		 * references.
-		 *
-		 * The LRU reference is removed by the xfs_buf_stale() call. The
-		 * buf item reference is removed by the xfs_buf_iodone()
-		 * callback that is run by xfs_buf_do_callbacks() during ioend
-		 * processing (via the bp->b_iodone callback), and then finally
-		 * the ioend processing will drop the IO reference if the buffer
-		 * is marked XBF_ASYNC.
-		 *
-		 * Hence we need to take an additional reference here so that IO
-		 * completion processing doesn't free the buffer prematurely.
+		 * The buffer must be locked and held by the caller to simulate
+		 * an async I/O failure.
 		 */
 		xfs_buf_lock(bp);
 		xfs_buf_hold(bp);
-		bp->b_flags |= XBF_ASYNC;
-		xfs_buf_ioerror(bp, -EIO);
-		bp->b_flags &= ~XBF_DONE;
-		xfs_buf_stale(bp);
-		xfs_buf_ioend(bp);
+		xfs_buf_ioend_fail(bp, XBF_ASYNC);
 	}
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d1772786af29..1aea19ca6601 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3629,12 +3629,7 @@ xfs_iflush_cluster(
 	 * xfs_buf_submit().
 	 */
 	ASSERT(bp->b_iodone);
-	bp->b_flags |= XBF_ASYNC;
-	bp->b_flags &= ~XBF_DONE;
-	xfs_buf_stale(bp);
-	xfs_buf_ioerror(bp, -EIO);
-	xfs_buf_ioend(bp);
-
+	xfs_buf_ioend_fail(bp, XBF_ASYNC);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 
 	/* abort the corrupt inode, as it was not attached to the buffer */
-- 
2.21.1



* [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
  2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
  2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:18   ` Dave Chinner
  2020-04-25 17:26   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The inode flush code has several layers of error handling between
the inode and cluster flushing code. If the inode flush fails before
acquiring the backing buffer, the inode flush is aborted. If the
cluster flush fails, the current inode flush is aborted and the
cluster buffer is failed (error completion is run on it) to handle
the initial inode and any others that might have been attached before
the error.

Since xfs_iflush() is the only caller of xfs_iflush_cluster(), the
error handling between the two can be condensed in the top-level
function. If we update xfs_iflush_int() to always fall through to
the log item update and attach the item completion handler to the
buffer, any errors that occur after the first call to
xfs_iflush_int() can be handled with a buffer I/O failure.

Lift the error handling from xfs_iflush_cluster() into xfs_iflush()
and consolidate with the existing error handling. This also replaces
the need to release the buffer because failing the buffer with
XBF_ASYNC drops the current reference.
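
The tail of xfs_iflush() then reduces to roughly the following
(condensed from the hunks below, not standalone code):

	error = xfs_iflush_int(ip, bp);
	if (!error)
		error = xfs_iflush_cluster(ip, bp);
	if (error) {
		/* inode(s) are attached to bp: fail the buffer, then shut down */
		xfs_buf_ioend_fail(bp, XBF_ASYNC);
		goto shutdown;
	}

	*bpp = bp;
	return 0;

abort:
	/* the inode was never attached to the buffer */
	xfs_iflush_abort(ip, false);
shutdown:
	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
	return error;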

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 108 ++++++++++++++++-----------------------------
 1 file changed, 38 insertions(+), 70 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1aea19ca6601..6cdb9fbe2d89 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3496,6 +3496,7 @@ xfs_iflush_cluster(
 	struct xfs_inode	**cilist;
 	struct xfs_inode	*cip;
 	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	int			error = 0;
 	int			nr_found;
 	int			clcount = 0;
 	int			i;
@@ -3588,11 +3589,10 @@ xfs_iflush_cluster(
 		 * re-check that it's dirty before flushing.
 		 */
 		if (!xfs_inode_clean(cip)) {
-			int	error;
 			error = xfs_iflush_int(cip, bp);
 			if (error) {
 				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto cluster_corrupt_out;
+				goto out_free;
 			}
 			clcount++;
 		} else {
@@ -3611,32 +3611,7 @@ xfs_iflush_cluster(
 	kmem_free(cilist);
 out_put:
 	xfs_perag_put(pag);
-	return 0;
-
-
-cluster_corrupt_out:
-	/*
-	 * Corruption detected in the clustering loop.  Invalidate the
-	 * inode buffer and shut down the filesystem.
-	 */
-	rcu_read_unlock();
-
-	/*
-	 * We'll always have an inode attached to the buffer for completion
-	 * process by the time we are called from xfs_iflush(). Hence we have
-	 * always need to do IO completion processing to abort the inodes
-	 * attached to the buffer.  handle them just like the shutdown case in
-	 * xfs_buf_submit().
-	 */
-	ASSERT(bp->b_iodone);
-	xfs_buf_ioend_fail(bp, XBF_ASYNC);
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-
-	/* abort the corrupt inode, as it was not attached to the buffer */
-	xfs_iflush_abort(cip, false);
-	kmem_free(cilist);
-	xfs_perag_put(pag);
-	return -EFSCORRUPTED;
+	return error;
 }
 
 /*
@@ -3692,17 +3667,16 @@ xfs_iflush(
 	 */
 	if (XFS_FORCED_SHUTDOWN(mp)) {
 		error = -EIO;
-		goto abort_out;
+		goto abort;
 	}
 
 	/*
 	 * Get the buffer containing the on-disk inode. We are doing a try-lock
-	 * operation here, so we may get  an EAGAIN error. In that case, we
-	 * simply want to return with the inode still dirty.
+	 * operation here, so we may get an EAGAIN error. In that case, return
+	 * leaving the inode dirty.
 	 *
 	 * If we get any other error, we effectively have a corruption situation
-	 * and we cannot flush the inode, so we treat it the same as failing
-	 * xfs_iflush_int().
+	 * and we cannot flush the inode. Abort the flush and shut down.
 	 */
 	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK,
 			       0);
@@ -3711,14 +3685,7 @@ xfs_iflush(
 		return error;
 	}
 	if (error)
-		goto corrupt_out;
-
-	/*
-	 * First flush out the inode that xfs_iflush was called with.
-	 */
-	error = xfs_iflush_int(ip, bp);
-	if (error)
-		goto corrupt_out;
+		goto abort;
 
 	/*
 	 * If the buffer is pinned then push on the log now so we won't
@@ -3728,28 +3695,28 @@ xfs_iflush(
 		xfs_log_force(mp, 0);
 
 	/*
-	 * inode clustering: try to gather other inodes into this write
+	 * Flush the provided inode then attempt to gather others from the
+	 * cluster into the write.
 	 *
-	 * Note: Any error during clustering will result in the filesystem
-	 * being shut down and completion callbacks run on the cluster buffer.
-	 * As we have already flushed and attached this inode to the buffer,
-	 * it has already been aborted and released by xfs_iflush_cluster() and
-	 * so we have no further error handling to do here.
+	 * Note: Once we attempt to flush an inode, we must run buffer
+	 * completion callbacks on any failure. If this fails, simulate an I/O
+	 * failure on the buffer and shut down.
 	 */
-	error = xfs_iflush_cluster(ip, bp);
-	if (error)
-		return error;
+	error = xfs_iflush_int(ip, bp);
+	if (!error)
+		error = xfs_iflush_cluster(ip, bp);
+	if (error) {
+		xfs_buf_ioend_fail(bp, XBF_ASYNC);
+		goto shutdown;
+	}
 
 	*bpp = bp;
 	return 0;
 
-corrupt_out:
-	if (bp)
-		xfs_buf_relse(bp);
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-abort_out:
-	/* abort the corrupt inode, as it was not attached to the buffer */
+abort:
 	xfs_iflush_abort(ip, false);
+shutdown:
+	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	return error;
 }
 
@@ -3791,6 +3758,7 @@ xfs_iflush_int(
 	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_dinode	*dip;
 	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
 	ASSERT(xfs_isiflocked(ip));
@@ -3801,12 +3769,13 @@ xfs_iflush_int(
 	/* set *dip = inode's place in the buffer */
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
+	error = -EFSCORRUPTED;
 	if (XFS_TEST_ERROR(dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC),
 			       mp, XFS_ERRTAG_IFLUSH_1)) {
 		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 			"%s: Bad inode %Lu magic number 0x%x, ptr "PTR_FMT,
 			__func__, ip->i_ino, be16_to_cpu(dip->di_magic), dip);
-		goto corrupt_out;
+		goto flush_out;
 	}
 	if (S_ISREG(VFS_I(ip)->i_mode)) {
 		if (XFS_TEST_ERROR(
@@ -3816,7 +3785,7 @@ xfs_iflush_int(
 			xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 				"%s: Bad regular inode %Lu, ptr "PTR_FMT,
 				__func__, ip->i_ino, ip);
-			goto corrupt_out;
+			goto flush_out;
 		}
 	} else if (S_ISDIR(VFS_I(ip)->i_mode)) {
 		if (XFS_TEST_ERROR(
@@ -3827,7 +3796,7 @@ xfs_iflush_int(
 			xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 				"%s: Bad directory inode %Lu, ptr "PTR_FMT,
 				__func__, ip->i_ino, ip);
-			goto corrupt_out;
+			goto flush_out;
 		}
 	}
 	if (XFS_TEST_ERROR(ip->i_d.di_nextents + ip->i_d.di_anextents >
@@ -3838,14 +3807,14 @@ xfs_iflush_int(
 			__func__, ip->i_ino,
 			ip->i_d.di_nextents + ip->i_d.di_anextents,
 			ip->i_d.di_nblocks, ip);
-		goto corrupt_out;
+		goto flush_out;
 	}
 	if (XFS_TEST_ERROR(ip->i_d.di_forkoff > mp->m_sb.sb_inodesize,
 				mp, XFS_ERRTAG_IFLUSH_6)) {
 		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 			"%s: bad inode %Lu, forkoff 0x%x, ptr "PTR_FMT,
 			__func__, ip->i_ino, ip->i_d.di_forkoff, ip);
-		goto corrupt_out;
+		goto flush_out;
 	}
 
 	/*
@@ -3862,7 +3831,7 @@ xfs_iflush_int(
 
 	/* Check the inline fork data before we write out. */
 	if (!xfs_inode_verify_forks(ip))
-		goto corrupt_out;
+		goto flush_out;
 
 	/*
 	 * Copy the dirty parts of the inode into the on-disk inode.  We always
@@ -3905,6 +3874,8 @@ xfs_iflush_int(
 	 * need the AIL lock, because it is a 64 bit value that cannot be read
 	 * atomically.
 	 */
+	error = 0;
+flush_out:
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
@@ -3914,10 +3885,10 @@ xfs_iflush_int(
 				&iip->ili_item.li_lsn);
 
 	/*
-	 * Attach the function xfs_iflush_done to the inode's
-	 * buffer.  This will remove the inode from the AIL
-	 * and unlock the inode's flush lock when the inode is
-	 * completely written to disk.
+	 * Attach the inode item callback to the buffer whether the flush
+	 * succeeded or not. If not, the caller will shut down and fail I/O
+	 * completion on the buffer to remove the inode from the AIL and release
+	 * the flush lock.
 	 */
 	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
 
@@ -3926,10 +3897,7 @@ xfs_iflush_int(
 
 	ASSERT(!list_empty(&bp->b_li_list));
 	ASSERT(bp->b_iodone != NULL);
-	return 0;
-
-corrupt_out:
-	return -EFSCORRUPTED;
+	return error;
 }
 
 /* Release an inode. */
-- 
2.21.1



* [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush()
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (2 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-25 17:27   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The shutdown check in xfs_iflush() duplicates checks down in the
buffer code. If the fs is shut down, xfs_trans_read_buf_map() always
returns an error and falls into the same error path. Remove the
unnecessary check along with the warning in xfs_imap_to_bp()
that generates excessive noise in the log if the fs is shut down.
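
The error path in xfs_imap_to_bp() then becomes roughly (condensed
from the hunk below):

	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, &imap->im_blkno,
				   (int)imap->im_len, buf_flags, &bp,
				   &xfs_inode_buf_ops);
	if (error) {
		/* -EAGAIN is only expected for trylock callers; no warning */
		ASSERT(error != -EAGAIN || (buf_flags & XBF_TRYLOCK));
		return error;
	}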

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |  7 +------
 fs/xfs/xfs_inode.c            | 13 -------------
 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 39c5a6e24915..b102e611bf54 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -172,12 +172,7 @@ xfs_imap_to_bp(
 				   (int)imap->im_len, buf_flags, &bp,
 				   &xfs_inode_buf_ops);
 	if (error) {
-		if (error == -EAGAIN) {
-			ASSERT(buf_flags & XBF_TRYLOCK);
-			return error;
-		}
-		xfs_warn(mp, "%s: xfs_trans_read_buf() returned error %d.",
-			__func__, error);
+		ASSERT(error != -EAGAIN || (buf_flags & XBF_TRYLOCK));
 		return error;
 	}
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6cdb9fbe2d89..aa490efdcaa8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3657,19 +3657,6 @@ xfs_iflush(
 		return 0;
 	}
 
-	/*
-	 * This may have been unpinned because the filesystem is shutting
-	 * down forcibly. If that's the case we must not write this inode
-	 * to disk, because the log record didn't make it to disk.
-	 *
-	 * We also have to remove the log item from the AIL in this case,
-	 * as we wait for an empty AIL as part of the unmount process.
-	 */
-	if (XFS_FORCED_SHUTDOWN(mp)) {
-		error = -EIO;
-		goto abort;
-	}
-
 	/*
 	 * Get the buffer containing the on-disk inode. We are doing a try-lock
 	 * operation here, so we may get an EAGAIN error. In that case, return
-- 
2.21.1



* [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (3 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:46   ` Dave Chinner
  2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

At unmount time, XFS emits a warning for every in-core buffer that
might have undergone a write error. In practice this behavior is
probably reasonable given that the filesystem is likely short lived
once I/O errors begin to occur consistently. Under certain test or
otherwise expected error conditions, this can spam the logs and slow
down the unmount.

We already have a ratelimit state defined for buffers failing
writeback. Fold this state into the buftarg and reuse it for the
unmount time errors.
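
A rough sketch of the reuse (condensed from the hunks below; not
standalone code):

	/* in xfs_alloc_buftarg(): one shared ratelimit per buftarg */
	ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
			     DEFAULT_RATELIMIT_BURST);

	/* in xfs_wait_buftarg(): warn at most at the shared rate */
	if (bp->b_flags & XBF_WRITE_FAIL &&
	    ___ratelimit(&bp->b_target->bt_ioerror_rl, "XFS: Corruption Alert")) {
		xfs_alert(btp->bt_mount,
"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
			  (long long)bp->b_bn);
	}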

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c      | 13 +++++++++++--
 fs/xfs/xfs_buf.h      |  1 +
 fs/xfs/xfs_buf_item.c | 10 +---------
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7a6bc617f0a9..c28a93d2fd8c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1684,10 +1684,12 @@ xfs_wait_buftarg(
 			struct xfs_buf *bp;
 			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 			list_del_init(&bp->b_lru);
-			if (bp->b_flags & XBF_WRITE_FAIL) {
+			if (bp->b_flags & XBF_WRITE_FAIL &&
+			    ___ratelimit(&bp->b_target->bt_ioerror_rl,
+					 "XFS: Corruption Alert")) {
 				xfs_alert(btp->bt_mount,
 "Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
-					(long long)bp->b_bn);
+					  (long long)bp->b_bn);
 				xfs_alert(btp->bt_mount,
 "Please run xfs_repair to determine the extent of the problem.");
 			}
@@ -1828,6 +1830,13 @@ xfs_alloc_buftarg(
 	btp->bt_bdev = bdev;
 	btp->bt_daxdev = dax_dev;
 
+	/*
+	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
+	 * per 30 seconds so as to not spam logs too much on repeated errors.
+	 */
+	ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
+			     DEFAULT_RATELIMIT_BURST);
+
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error_free;
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 598b93b17d95..10492f68fd4b 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -91,6 +91,7 @@ typedef struct xfs_buftarg {
 	struct list_lru		bt_lru;
 
 	struct percpu_counter	bt_io_count;
+	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
 struct xfs_buf;
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e34298227f87..23cbfeb82183 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -480,14 +480,6 @@ xfs_buf_item_unpin(
 	}
 }
 
-/*
- * Buffer IO error rate limiting. Limit it to no more than 10 messages per 30
- * seconds so as to not spam logs too much on repeated detection of the same
- * buffer being bad..
- */
-
-static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10);
-
 STATIC uint
 xfs_buf_item_push(
 	struct xfs_log_item	*lip,
@@ -518,7 +510,7 @@ xfs_buf_item_push(
 
 	/* has a previous flush failed due to IO errors? */
 	if ((bp->b_flags & XBF_WRITE_FAIL) &&
-	    ___ratelimit(&xfs_buf_write_fail_rl_state, "XFS: Failing async write")) {
+	    ___ratelimit(&bp->b_target->bt_ioerror_rl, "XFS: Failing async write")) {
 		xfs_warn(bp->b_mount,
 "Failing async write on buffer block 0x%llx. Retrying async write.",
 			 (long long)bp->b_bn);
-- 
2.21.1



* [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush()
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (4 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:47   ` Dave Chinner
  2020-04-25 17:28   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The pre-flush dquot verification in xfs_qm_dqflush() duplicates the
read verifier by checking the dquot in the on-disk buffer. Instead,
verify the in-core variant before it is flushed to the buffer.
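
In sketch form (condensed from the hunk below), the check moves from
the on-disk copy in the buffer to the in-core dquot:

	/* sanity check the in-core structure before we flush */
	fa = xfs_dquot_verify(mp, &dqp->q_core,
			      be32_to_cpu(dqp->q_core.d_id), 0);
	if (fa) {
		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
			  be32_to_cpu(dqp->q_core.d_id), fa);
		/* release the buffer, unlock and shut down as before */
	}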

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_dquot.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index af2c8e5ceea0..265feb62290d 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1116,13 +1116,12 @@ xfs_qm_dqflush(
 	dqb = bp->b_addr + dqp->q_bufoffset;
 	ddqp = &dqb->dd_diskdq;
 
-	/*
-	 * A simple sanity check in case we got a corrupted dquot.
-	 */
-	fa = xfs_dqblk_verify(mp, dqb, be32_to_cpu(ddqp->d_id), 0);
+	/* sanity check the in-core structure before we flush */
+	fa = xfs_dquot_verify(mp, &dqp->q_core, be32_to_cpu(dqp->q_core.d_id),
+			      0);
 	if (fa) {
 		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
-				be32_to_cpu(ddqp->d_id), fa);
+				be32_to_cpu(dqp->q_core.d_id), fa);
 		xfs_buf_relse(bp);
 		xfs_dqfunlock(dqp);
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-- 
2.21.1



* [PATCH v2 07/13] xfs: abort consistently on dquot flush failure
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (5 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-25 17:30   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The dquot flush handler effectively aborts the dquot flush if the
filesystem is already shut down, but doesn't actually shut down if
the flush itself fails. Update xfs_qm_dqflush() to consistently abort
the dquot flush and shut down the fs if the flush fails with an
unexpected error.
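
The resulting error flow is roughly (condensed from the hunks below,
not standalone code):

	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
				   mp->m_quotainfo->qi_dqchunklen, XBF_TRYLOCK,
				   &bp, &xfs_dquot_buf_ops);
	if (error == -EAGAIN)
		goto out_unlock;	/* trylock miss: leave the dquot dirty */
	if (error)
		goto out_abort;
	...
out_abort:
	dqp->dq_flags &= ~XFS_DQ_DIRTY;
	xfs_trans_ail_remove(lip, SHUTDOWN_CORRUPT_INCORE);
	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
out_unlock:
	xfs_dqfunlock(dqp);
	return error;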

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_dquot.c | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 265feb62290d..ffe607733c50 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1068,6 +1068,7 @@ xfs_qm_dqflush(
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dqp->q_mount;
+	struct xfs_log_item	*lip = &dqp->q_logitem.qli_item;
 	struct xfs_buf		*bp;
 	struct xfs_dqblk	*dqb;
 	struct xfs_disk_dquot	*ddqp;
@@ -1083,32 +1084,16 @@ xfs_qm_dqflush(
 
 	xfs_qm_dqunpin_wait(dqp);
 
-	/*
-	 * This may have been unpinned because the filesystem is shutting
-	 * down forcibly. If that's the case we must not write this dquot
-	 * to disk, because the log record didn't make it to disk.
-	 *
-	 * We also have to remove the log item from the AIL in this case,
-	 * as we wait for an emptry AIL as part of the unmount process.
-	 */
-	if (XFS_FORCED_SHUTDOWN(mp)) {
-		struct xfs_log_item	*lip = &dqp->q_logitem.qli_item;
-		dqp->dq_flags &= ~XFS_DQ_DIRTY;
-
-		xfs_trans_ail_remove(lip, SHUTDOWN_CORRUPT_INCORE);
-
-		error = -EIO;
-		goto out_unlock;
-	}
-
 	/*
 	 * Get the buffer containing the on-disk dquot
 	 */
 	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
 				   mp->m_quotainfo->qi_dqchunklen, XBF_TRYLOCK,
 				   &bp, &xfs_dquot_buf_ops);
-	if (error)
+	if (error == -EAGAIN)
 		goto out_unlock;
+	if (error)
+		goto out_abort;
 
 	/*
 	 * Calculate the location of the dquot inside the buffer.
@@ -1123,9 +1108,8 @@ xfs_qm_dqflush(
 		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
 				be32_to_cpu(dqp->q_core.d_id), fa);
 		xfs_buf_relse(bp);
-		xfs_dqfunlock(dqp);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return -EFSCORRUPTED;
+		error = -EFSCORRUPTED;
+		goto out_abort;
 	}
 
 	/* This is the only portion of data that needs to persist */
@@ -1174,6 +1158,10 @@ xfs_qm_dqflush(
 	*bpp = bp;
 	return 0;
 
+out_abort:
+	dqp->dq_flags &= ~XFS_DQ_DIRTY;
+	xfs_trans_ail_remove(lip, SHUTDOWN_CORRUPT_INCORE);
+	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out_unlock:
 	xfs_dqfunlock(dqp);
 	return error;
-- 
2.21.1



* [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (6 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  5:59   ` Dave Chinner
  2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The log item failed flag is used to indicate a log item was flushed
but the underlying buffer was not successfully written to disk. If
the error configuration allows for writeback retries, xfsaild uses
the flag to resubmit the underlying buffer without acquiring the
flush lock of the item.

The flag is currently set and cleared under the AIL lock and buffer
lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
completion time. The flag is checked at xfsaild push time under AIL
lock and cleared under buffer lock before resubmission. If I/O
eventually succeeds, the flag is cleared in the _done() handler for
the associated item type, again under AIL lock and buffer lock.

As far as I can tell, the only reason for holding the AIL lock
across sets/clears is to manage consistency between the log item
bitop state and the temporary buffer pointer that is attached to the
log item. The bit itself is used to manage consistency of the
attached buffer, but is not enough to guarantee the buffer is still
attached by the time xfsaild attempts to access it. However, since
failure state is always set or cleared under the buffer lock (either
via I/O completion or xfsaild), this particular case can be handled
at item push time by verifying failure state once under the buffer
lock.

Remove the AIL lock protection from the various bits of log item
failure state management and simplify the surrounding code where
applicable. Use the buffer lock in the xfsaild resubmit code to
detect failure state changes and temporarily treat the item as
locked.
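
The push-time check then looks roughly like this (condensed from the
xfsaild_resubmit_item() hunk below; not standalone code):

	struct xfs_buf	*bp = lip->li_buf;

	/*
	 * Without the AIL lock, li_buf and XFS_LI_FAILED may change under us
	 * until the buffer lock is held; back off as "locked" if they do.
	 */
	if (!bp || !xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;
	if (!test_bit(XFS_LI_FAILED, &lip->li_flags)) {
		xfs_buf_unlock(bp);
		return XFS_ITEM_LOCKED;
	}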

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c   |  4 ----
 fs/xfs/xfs_dquot.c      | 41 +++++++++++++++--------------------------
 fs/xfs/xfs_inode_item.c | 11 +++--------
 fs/xfs/xfs_trans_ail.c  | 12 ++++++++++--
 fs/xfs/xfs_trans_priv.h |  5 -----
 5 files changed, 28 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 23cbfeb82183..59906a6e49ae 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1038,7 +1038,6 @@ xfs_buf_do_callbacks_fail(
 	struct xfs_buf		*bp)
 {
 	struct xfs_log_item	*lip;
-	struct xfs_ail		*ailp;
 
 	/*
 	 * Buffer log item errors are handled directly by xfs_buf_item_push()
@@ -1050,13 +1049,10 @@ xfs_buf_do_callbacks_fail(
 
 	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
 			li_bio_list);
-	ailp = lip->li_ailp;
-	spin_lock(&ailp->ail_lock);
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 		if (lip->li_ops->iop_error)
 			lip->li_ops->iop_error(lip, bp);
 	}
-	spin_unlock(&ailp->ail_lock);
 }
 
 static bool
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index ffe607733c50..baeb111ae9dc 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1023,34 +1023,23 @@ xfs_qm_dqflush_done(
 	struct xfs_ail		*ailp = lip->li_ailp;
 
 	/*
-	 * We only want to pull the item from the AIL if its
-	 * location in the log has not changed since we started the flush.
-	 * Thus, we only bother if the dquot's lsn has
-	 * not changed. First we check the lsn outside the lock
-	 * since it's cheaper, and then we recheck while
-	 * holding the lock before removing the dquot from the AIL.
+	 * Only pull the item from the AIL if its location in the log has not
+	 * changed since it was flushed. Do a lockless check first to reduce
+	 * lock traffic.
 	 */
-	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags) &&
-	    ((lip->li_lsn == qip->qli_flush_lsn) ||
-	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
-
-		/* xfs_trans_ail_delete() drops the AIL lock. */
-		spin_lock(&ailp->ail_lock);
-		if (lip->li_lsn == qip->qli_flush_lsn) {
-			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
-		} else {
-			/*
-			 * Clear the failed state since we are about to drop the
-			 * flush lock
-			 */
-			xfs_clear_li_failed(lip);
-			spin_unlock(&ailp->ail_lock);
-		}
-	}
+	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags) ||
+	    lip->li_lsn != qip->qli_flush_lsn)
+		goto out;
 
-	/*
-	 * Release the dq's flush lock since we're done with it.
-	 */
+	spin_lock(&ailp->ail_lock);
+	if (lip->li_lsn == qip->qli_flush_lsn)
+		/* xfs_trans_ail_delete() drops the AIL lock */
+		xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
+	else
+		spin_unlock(&ailp->ail_lock);
+
+out:
+	xfs_clear_li_failed(lip);
 	xfs_dqfunlock(dqp);
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 1d4d256a2e96..0ae61844b224 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -682,9 +682,7 @@ xfs_iflush_done(
 	 * Scan the buffer IO completions for other inodes being completed and
 	 * attach them to the current inode log item.
 	 */
-
 	list_add_tail(&lip->li_bio_list, &tmp);
-
 	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
 		if (lip->li_cb != xfs_iflush_done)
 			continue;
@@ -695,15 +693,13 @@ xfs_iflush_done(
 		 * the AIL lock.
 		 */
 		iip = INODE_ITEM(blip);
-		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
-		    test_bit(XFS_LI_FAILED, &blip->li_flags))
+		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
 			need_ail++;
 	}
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
-	    test_bit(XFS_LI_FAILED, &lip->li_flags))
+	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
 		need_ail++;
 
 	/*
@@ -732,8 +728,6 @@ xfs_iflush_done(
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
-			} else {
-				xfs_clear_li_failed(blip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
@@ -746,6 +740,7 @@ xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
+		xfs_clear_li_failed(blip);
 		iip = INODE_ITEM(blip);
 		iip->ili_logged = 0;
 		iip->ili_last_fields = 0;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 2574d01e4a83..e03643efdac1 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -368,15 +368,23 @@ xfsaild_resubmit_item(
 {
 	struct xfs_buf		*bp = lip->li_buf;
 
-	if (!xfs_buf_trylock(bp))
+	/*
+	 * Log item state bits are racy so we cannot assume the temporary buffer
+	 * pointer is set. Treat the item as locked if the pointer is clear or
+	 * the failure state has changed once we've locked out I/O completion.
+	 */
+	if (!bp || !xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
+	if (!test_bit(XFS_LI_FAILED, &lip->li_flags)) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_LOCKED;
+	}
 
 	if (!xfs_buf_delwri_queue(bp, buffer_list)) {
 		xfs_buf_unlock(bp);
 		return XFS_ITEM_FLUSHING;
 	}
 
-	/* protected by ail_lock */
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
 		xfs_clear_li_failed(lip);
 
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 35655eac01a6..9135afdcee9d 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -158,9 +158,6 @@ xfs_clear_li_failed(
 {
 	struct xfs_buf	*bp = lip->li_buf;
 
-	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
-	lockdep_assert_held(&lip->li_ailp->ail_lock);
-
 	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
 		lip->li_buf = NULL;
 		xfs_buf_rele(bp);
@@ -172,8 +169,6 @@ xfs_set_li_failed(
 	struct xfs_log_item	*lip,
 	struct xfs_buf		*bp)
 {
-	lockdep_assert_held(&lip->li_ailp->ail_lock);
-
 	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
 		xfs_buf_hold(bp);
 		lip->li_buf = bp;
-- 
2.21.1



* [PATCH v2 09/13] xfs: clean up AIL log item removal functions
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (7 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:54   ` Dave Chinner
  2020-04-25 17:37   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

We have two AIL removal functions with slightly different semantics.
xfs_trans_ail_delete() expects the caller to have the AIL lock and
for the associated item to be AIL resident. If not, the filesystem
is shut down. xfs_trans_ail_remove() acquires the AIL lock, checks
that the item is AIL resident and calls the former if so.

These semantics lead to confused usage between the two. For example,
the _remove() variant takes a shutdown parameter to pass to the
_delete() variant, but immediately returns if the AIL bit is not
set. This means that _remove() would never shut down if an item is
not AIL resident, even though it appears that many callers would
expect it to.

Make the following changes to clean up both of these functions:

- Most callers of xfs_trans_ail_delete() acquire the AIL lock just
  before the call. Update _delete() to acquire the lock and open
  code the couple of callers that make additional checks under AIL
  lock.
- Drop the unnecessary ailp parameter from _delete().
- Drop the unused shutdown parameter from _remove() and open code
  the implementation.

In summary, this leaves a _delete() variant that expects an AIL
resident item and a _remove() helper that checks the AIL bit. Audit
the existing callsites for use of the appropriate function and
update as necessary.
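
The end result, in sketch form (condensed from the hunks below):

/* expects the item to be AIL resident; complains/shuts down otherwise */
void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type);

/* tolerates a non-resident item and never shuts down */
static inline void
xfs_trans_ail_remove(
	struct xfs_log_item	*lip)
{
	struct xfs_ail		*ailp = lip->li_ailp;
	xfs_lsn_t		tail_lsn;

	spin_lock(&ailp->ail_lock);
	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
		/* xfs_ail_update_finish() drops the AIL lock */
		tail_lsn = xfs_ail_delete_one(ailp, lip);
		xfs_ail_update_finish(ailp, tail_lsn);
	} else {
		spin_unlock(&ailp->ail_lock);
	}
}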

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_item.c     |  2 +-
 fs/xfs/xfs_buf_item.c      | 21 ++++++++-------------
 fs/xfs/xfs_dquot.c         | 12 +++++++-----
 fs/xfs/xfs_dquot_item.c    |  2 +-
 fs/xfs/xfs_extfree_item.c  |  2 +-
 fs/xfs/xfs_inode_item.c    |  6 +-----
 fs/xfs/xfs_refcount_item.c |  2 +-
 fs/xfs/xfs_rmap_item.c     |  2 +-
 fs/xfs/xfs_trans_ail.c     |  3 ++-
 fs/xfs/xfs_trans_priv.h    | 17 +++++++++--------
 10 files changed, 32 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index ee6f4229cebc..909221a4a8ab 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -51,7 +51,7 @@ xfs_bui_release(
 {
 	ASSERT(atomic_read(&buip->bui_refcount) > 0);
 	if (atomic_dec_and_test(&buip->bui_refcount)) {
-		xfs_trans_ail_remove(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_trans_ail_delete(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
 		xfs_bui_item_free(buip);
 	}
 }
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 59906a6e49ae..55e215237b2b 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -410,7 +410,6 @@ xfs_buf_item_unpin(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	xfs_buf_t		*bp = bip->bli_buf;
-	struct xfs_ail		*ailp = lip->li_ailp;
 	int			stale = bip->bli_flags & XFS_BLI_STALE;
 	int			freed;
 
@@ -463,8 +462,7 @@ xfs_buf_item_unpin(
 			list_del_init(&bp->b_li_list);
 			bp->b_iodone = NULL;
 		} else {
-			spin_lock(&ailp->ail_lock);
-			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
+			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
 			ASSERT(bp->b_log_item == NULL);
 		}
@@ -560,7 +558,7 @@ xfs_buf_item_put(
 	 * state.
 	 */
 	if (aborted)
-		xfs_trans_ail_remove(lip, SHUTDOWN_LOG_IO_ERROR);
+		xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 	xfs_buf_item_relse(bip->bli_buf);
 	return true;
 }
@@ -1201,22 +1199,19 @@ xfs_buf_iodone(
 	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
-	struct xfs_ail		*ailp = lip->li_ailp;
-
 	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
 
 	xfs_buf_rele(bp);
 
 	/*
-	 * If we are forcibly shutting down, this may well be
-	 * off the AIL already. That's because we simulate the
-	 * log-committed callbacks to unpin these buffers. Or we may never
-	 * have put this item on AIL because of the transaction was
-	 * aborted forcibly. xfs_trans_ail_delete() takes care of these.
+	 * If we are forcibly shutting down, this may well be off the AIL
+	 * already. That's because we simulate the log-committed callbacks to
+	 * unpin these buffers. Or we may never have put this item on AIL
+	 * because of the transaction was aborted forcibly.
+	 * xfs_trans_ail_delete() takes care of these.
 	 *
 	 * Either way, AIL is useless if we're forcing a shutdown.
 	 */
-	spin_lock(&ailp->ail_lock);
-	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
+	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
 	xfs_buf_item_free(BUF_ITEM(lip));
 }
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index baeb111ae9dc..36ffc09b1a31 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1021,6 +1021,7 @@ xfs_qm_dqflush_done(
 	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
 	struct xfs_dquot	*dqp = qip->qli_dquot;
 	struct xfs_ail		*ailp = lip->li_ailp;
+	xfs_lsn_t		tail_lsn;
 
 	/*
 	 * Only pull the item from the AIL if its location in the log has not
@@ -1032,10 +1033,11 @@ xfs_qm_dqflush_done(
 		goto out;
 
 	spin_lock(&ailp->ail_lock);
-	if (lip->li_lsn == qip->qli_flush_lsn)
-		/* xfs_trans_ail_delete() drops the AIL lock */
-		xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
-	else
+	if (lip->li_lsn == qip->qli_flush_lsn) {
+		/* xfs_ail_update_finish() drops the AIL lock */
+		tail_lsn = xfs_ail_delete_one(ailp, lip);
+		xfs_ail_update_finish(ailp, tail_lsn);
+	} else
 		spin_unlock(&ailp->ail_lock);
 
 out:
@@ -1149,7 +1151,7 @@ xfs_qm_dqflush(
 
 out_abort:
 	dqp->dq_flags &= ~XFS_DQ_DIRTY;
-	xfs_trans_ail_remove(lip, SHUTDOWN_CORRUPT_INCORE);
+	xfs_trans_ail_remove(lip);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out_unlock:
 	xfs_dqfunlock(dqp);
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 5a7808299a32..8bd46810d5db 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -343,7 +343,7 @@ xfs_qm_qoff_logitem_relse(
 	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags) ||
 	       test_bit(XFS_LI_ABORTED, &lip->li_flags) ||
 	       XFS_FORCED_SHUTDOWN(lip->li_mountp));
-	xfs_trans_ail_remove(lip, SHUTDOWN_LOG_IO_ERROR);
+	xfs_trans_ail_remove(lip);
 	kmem_free(lip->li_lv_shadow);
 	kmem_free(qoff);
 }
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 6ea847f6e298..cd98eba24884 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -55,7 +55,7 @@ xfs_efi_release(
 {
 	ASSERT(atomic_read(&efip->efi_refcount) > 0);
 	if (atomic_dec_and_test(&efip->efi_refcount)) {
-		xfs_trans_ail_remove(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_trans_ail_delete(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
 		xfs_efi_item_free(efip);
 	}
 }
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0ae61844b224..f8dd9bb8c851 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -763,11 +763,7 @@ xfs_iflush_abort(
 	xfs_inode_log_item_t	*iip = ip->i_itemp;
 
 	if (iip) {
-		if (test_bit(XFS_LI_IN_AIL, &iip->ili_item.li_flags)) {
-			xfs_trans_ail_remove(&iip->ili_item,
-					     stale ? SHUTDOWN_LOG_IO_ERROR :
-						     SHUTDOWN_CORRUPT_INCORE);
-		}
+		xfs_trans_ail_remove(&iip->ili_item);
 		iip->ili_logged = 0;
 		/*
 		 * Clear the ili_last_fields bits now that we know that the
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 8eeed73928cd..712939a015a9 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -50,7 +50,7 @@ xfs_cui_release(
 {
 	ASSERT(atomic_read(&cuip->cui_refcount) > 0);
 	if (atomic_dec_and_test(&cuip->cui_refcount)) {
-		xfs_trans_ail_remove(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_trans_ail_delete(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
 		xfs_cui_item_free(cuip);
 	}
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 4911b68f95dd..ff949b32c051 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -50,7 +50,7 @@ xfs_rui_release(
 {
 	ASSERT(atomic_read(&ruip->rui_refcount) > 0);
 	if (atomic_dec_and_test(&ruip->rui_refcount)) {
-		xfs_trans_ail_remove(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_trans_ail_delete(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
 		xfs_rui_item_free(ruip);
 	}
 }
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index e03643efdac1..4be7b40060f9 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -872,13 +872,14 @@ xfs_ail_delete_one(
  */
 void
 xfs_trans_ail_delete(
-	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip,
 	int			shutdown_type)
 {
+	struct xfs_ail		*ailp = lip->li_ailp;
 	struct xfs_mount	*mp = ailp->ail_mount;
 	xfs_lsn_t		tail_lsn;
 
+	spin_lock(&ailp->ail_lock);
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
 		if (!XFS_FORCED_SHUTDOWN(mp)) {
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 9135afdcee9d..7563c78e2997 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -94,22 +94,23 @@ xfs_trans_ail_update(
 xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
 void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
 			__releases(ailp->ail_lock);
-void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
-		int shutdown_type);
+void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type);
 
 static inline void
 xfs_trans_ail_remove(
-	struct xfs_log_item	*lip,
-	int			shutdown_type)
+	struct xfs_log_item	*lip)
 {
 	struct xfs_ail		*ailp = lip->li_ailp;
+	xfs_lsn_t		tail_lsn;
 
 	spin_lock(&ailp->ail_lock);
-	/* xfs_trans_ail_delete() drops the AIL lock */
-	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags))
-		xfs_trans_ail_delete(ailp, lip, shutdown_type);
-	else
+	/* xfs_ail_update_finish() drops the AIL lock */
+	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
+		tail_lsn = xfs_ail_delete_one(ailp, lip);
+		xfs_ail_update_finish(ailp, tail_lsn);
+	} else {
 		spin_unlock(&ailp->ail_lock);
+	}
 }
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
-- 
2.21.1



* [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]()
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (8 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  4:55   ` Dave Chinner
  2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

Now that the functions and callers of
xfs_trans_ail_[remove|delete]() have been fixed up appropriately,
the only difference between the two is the shutdown behavior. There
are only a few callers of the _remove() variant, so make the
shutdown conditional on the parameter and combine the two functions.
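
In sketch form, the merged function's missing-from-AIL case becomes
(condensed from the hunks below, not standalone code):

	spin_lock(&ailp->ail_lock);
	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
		spin_unlock(&ailp->ail_lock);
		/* shutdown_type == 0: a missing item is expected, stay quiet */
		if (shutdown_type && !XFS_FORCED_SHUTDOWN(mp)) {
			xfs_alert_tag(mp, XFS_PTAG_AILDELETE,
	"%s: attempting to delete a log item that is not in the AIL",
					__func__);
			xfs_force_shutdown(mp, shutdown_type);
		}
		return;
	}

	/* xfs_ail_update_finish() drops the AIL lock */
	tail_lsn = xfs_ail_delete_one(ailp, lip);
	xfs_ail_update_finish(ailp, tail_lsn);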

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_dquot.c      |  2 +-
 fs/xfs/xfs_dquot_item.c |  2 +-
 fs/xfs/xfs_inode_item.c |  2 +-
 fs/xfs/xfs_trans_ail.c  | 27 ++++++++-------------------
 fs/xfs/xfs_trans_priv.h | 17 -----------------
 5 files changed, 11 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 36ffc09b1a31..d711476724fc 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1151,7 +1151,7 @@ xfs_qm_dqflush(
 
 out_abort:
 	dqp->dq_flags &= ~XFS_DQ_DIRTY;
-	xfs_trans_ail_remove(lip);
+	xfs_trans_ail_delete(lip, 0);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out_unlock:
 	xfs_dqfunlock(dqp);
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 8bd46810d5db..349c92d26570 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -343,7 +343,7 @@ xfs_qm_qoff_logitem_relse(
 	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags) ||
 	       test_bit(XFS_LI_ABORTED, &lip->li_flags) ||
 	       XFS_FORCED_SHUTDOWN(lip->li_mountp));
-	xfs_trans_ail_remove(lip);
+	xfs_trans_ail_delete(lip, 0);
 	kmem_free(lip->li_lv_shadow);
 	kmem_free(qoff);
 }
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f8dd9bb8c851..f8f2475804bd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -763,7 +763,7 @@ xfs_iflush_abort(
 	xfs_inode_log_item_t	*iip = ip->i_itemp;
 
 	if (iip) {
-		xfs_trans_ail_remove(&iip->ili_item);
+		xfs_trans_ail_delete(&iip->ili_item, 0);
 		iip->ili_logged = 0;
 		/*
 		 * Clear the ili_last_fields bits now that we know that the
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 4be7b40060f9..3cf3f7c52220 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -850,25 +850,13 @@ xfs_ail_delete_one(
 }
 
 /**
- * Remove a log items from the AIL
+ * Remove a log item from the AIL.
  *
- * @xfs_trans_ail_delete_bulk takes an array of log items that all need to
- * removed from the AIL. The caller is already holding the AIL lock, and done
- * all the checks necessary to ensure the items passed in via @log_items are
- * ready for deletion. This includes checking that the items are in the AIL.
- *
- * For each log item to be removed, unlink it  from the AIL, clear the IN_AIL
- * flag from the item and reset the item's lsn to 0. If we remove the first
- * item in the AIL, update the log tail to match the new minimum LSN in the
- * AIL.
- *
- * This function will not drop the AIL lock until all items are removed from
- * the AIL to minimise the amount of lock traffic on the AIL. This does not
- * greatly increase the AIL hold time, but does significantly reduce the amount
- * of traffic on the lock, especially during IO completion.
- *
- * This function must be called with the AIL lock held.  The lock is dropped
- * before returning.
+ * For each log item to be removed, unlink it from the AIL, clear the IN_AIL
+ * flag from the item and reset the item's lsn to 0. If we remove the first item
+ * in the AIL, update the log tail to match the new minimum LSN in the AIL. If
+ * the item is not in the AIL, shut down if the caller has provided a shutdown
+ * type. Otherwise return quietly as this state is expected.
  */
 void
 xfs_trans_ail_delete(
@@ -882,7 +870,7 @@ xfs_trans_ail_delete(
 	spin_lock(&ailp->ail_lock);
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
-		if (!XFS_FORCED_SHUTDOWN(mp)) {
+		if (shutdown_type && !XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_alert_tag(mp, XFS_PTAG_AILDELETE,
 	"%s: attempting to delete a log item that is not in the AIL",
 					__func__);
@@ -891,6 +879,7 @@ xfs_trans_ail_delete(
 		return;
 	}
 
+	/* xfs_ail_update_finish() drops the AIL lock */
 	tail_lsn = xfs_ail_delete_one(ailp, lip);
 	xfs_ail_update_finish(ailp, tail_lsn);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 7563c78e2997..2ef653a05c77 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -96,23 +96,6 @@ void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
 			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type);
 
-static inline void
-xfs_trans_ail_remove(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_ail		*ailp = lip->li_ailp;
-	xfs_lsn_t		tail_lsn;
-
-	spin_lock(&ailp->ail_lock);
-	/* xfs_ail_update_finish() drops the AIL lock */
-	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
-		tail_lsn = xfs_ail_delete_one(ailp, lip);
-		xfs_ail_update_finish(ailp, tail_lsn);
-	} else {
-		spin_unlock(&ailp->ail_lock);
-	}
-}
-
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
 void			xfs_ail_push_all_sync(struct xfs_ail *);
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 11/13] xfs: remove unused iflush stale parameter
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (9 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-25 17:37   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
  2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
  12 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

The stale parameter was used to control the now unused shutdown
parameter of xfs_trans_ail_remove().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_icache.c     | 2 +-
 fs/xfs/xfs_inode.c      | 2 +-
 fs/xfs/xfs_inode_item.c | 7 +++----
 fs/xfs/xfs_inode_item.h | 2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8bf1d15be3f6..7032efcb6814 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1128,7 +1128,7 @@ xfs_reclaim_inode(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
 		/* xfs_iflush_abort() drops the flush lock */
-		xfs_iflush_abort(ip, false);
+		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index aa490efdcaa8..01df6836ae18 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3701,7 +3701,7 @@ xfs_iflush(
 	return 0;
 
 abort:
-	xfs_iflush_abort(ip, false);
+	xfs_iflush_abort(ip);
 shutdown:
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	return error;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f8f2475804bd..111016f1fd14 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -757,10 +757,9 @@ xfs_iflush_done(
  */
 void
 xfs_iflush_abort(
-	xfs_inode_t		*ip,
-	bool			stale)
+	struct xfs_inode		*ip)
 {
-	xfs_inode_log_item_t	*iip = ip->i_itemp;
+	struct xfs_inode_log_item	*iip = ip->i_itemp;
 
 	if (iip) {
 		xfs_trans_ail_delete(&iip->ili_item, 0);
@@ -788,7 +787,7 @@ xfs_istale_done(
 	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
-	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode, true);
+	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 07a60e74c39c..a68c114b79a0 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -34,7 +34,7 @@ extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
 extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
 extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
-extern void xfs_iflush_abort(struct xfs_inode *, bool);
+extern void xfs_iflush_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
 
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 12/13] xfs: random buffer write failure errortag
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (10 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  5:11   ` Dave Chinner
  2020-04-25 17:38   ` Christoph Hellwig
  2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

Introduce an error tag to randomly fail async buffer writes. This is
primarily to facilitate testing of the XFS error configuration
mechanism.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_errortag.h | 4 +++-
 fs/xfs/xfs_buf.c             | 6 ++++++
 fs/xfs/xfs_error.c           | 3 +++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 79e6c4fb1d8a..2486dab19023 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -55,7 +55,8 @@
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
-#define XFS_ERRTAG_MAX					35
+#define XFS_ERRTAG_BUF_IOERROR				35
+#define XFS_ERRTAG_MAX					36
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -95,5 +96,6 @@
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
+#define XFS_RANDOM_BUF_IOERROR				XFS_RANDOM_DEFAULT
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c28a93d2fd8c..4d2151deacef 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1289,6 +1289,12 @@ xfs_buf_bio_end_io(
 	struct bio		*bio)
 {
 	struct xfs_buf		*bp = (struct xfs_buf *)bio->bi_private;
+	struct xfs_mount	*mp = bp->b_mount;
+
+	if (!bio->bi_status &&
+	    (bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) &&
+	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BUF_IOERROR))
+		bio->bi_status = errno_to_blk_status(-EIO);
 
 	/*
 	 * don't overwrite existing errors - otherwise we can lose errors on
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index a21e9cc6516a..7f6e20899473 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -53,6 +53,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
 	XFS_RANDOM_IUNLINK_FALLBACK,
+	XFS_RANDOM_BUF_IOERROR,
 };
 
 struct xfs_errortag_attr {
@@ -162,6 +163,7 @@ XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
+XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -199,6 +201,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
+	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
 	NULL,
 };
 
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 13/13] xfs: remove unused shutdown types
  2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
                   ` (11 preceding siblings ...)
  2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
@ 2020-04-22 17:54 ` Brian Foster
  2020-04-23  5:13   ` Dave Chinner
  2020-04-25 17:39   ` Christoph Hellwig
  12 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-22 17:54 UTC (permalink / raw)
  To: linux-xfs

Both types control shutdown messaging and neither is used in the
current codebase.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fsops.c | 5 +----
 fs/xfs/xfs_mount.h | 2 --
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3e61d0cc23f8..ef1d5bb88b93 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -504,10 +504,7 @@ xfs_do_force_shutdown(
 	} else if (logerror) {
 		xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
 			"Log I/O Error Detected. Shutting down filesystem");
-	} else if (flags & SHUTDOWN_DEVICE_REQ) {
-		xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_IOERROR,
-			"All device paths lost. Shutting down filesystem");
-	} else if (!(flags & SHUTDOWN_REMOTE_REQ)) {
+	} else {
 		xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_IOERROR,
 			"I/O Error Detected. Shutting down filesystem");
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b2e4598fdf7d..07b5ba7e5fbd 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -259,8 +259,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
-#define SHUTDOWN_REMOTE_REQ	0x0010	/* shutdown came from remote cell */
-#define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
 /*
  * Flags for xfs_mountfs
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild
  2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
@ 2020-04-23  4:09   ` Dave Chinner
  2020-04-25 17:21   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:17PM -0400, Brian Foster wrote:
> Flush locked log items whose underlying buffers fail metadata
> writeback are tagged with a special flag to indicate that the flush
> lock is already held. This is currently implemented in the type
> specific ->iop_push() callback, but the processing required for such
> items is not type specific because we're only doing basic state
> management on the underlying buffer.
> 
> Factor the failed log item handling out of the inode and dquot
> ->iop_push() callbacks and open code the buffer resubmit helper into
> a single helper called from xfsaild_push_item(). This provides a
> generic mechanism for handling failed metadata buffer writeback with
> a bit less code.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Allison Collins <allison.henderson@oracle.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code
  2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
@ 2020-04-23  4:10   ` Dave Chinner
  2020-04-25 17:23   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:18PM -0400, Brian Foster wrote:
> We use the same buffer I/O failure simulation code in a few
> different places. It's not much code, but it's not necessarily
> self-explanatory. Factor it into a helper and document it in one
> place.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Allison Collins <allison.henderson@oracle.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling
  2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
@ 2020-04-23  4:18   ` Dave Chinner
  2020-04-23 14:28     ` Brian Foster
  2020-04-25 17:26   ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:19PM -0400, Brian Foster wrote:
> The inode flush code has several layers of error handling between
> the inode and cluster flushing code. If the inode flush fails before
> acquiring the backing buffer, the inode flush is aborted. If the
> cluster flush fails, the current inode flush is aborted and the
> cluster buffer is failed to handle the initial inode and any others
> that might have been attached before the error.
> 
> Since xfs_iflush() is the only caller of xfs_iflush_cluster(), the
> error handling between the two can be condensed in the top-level
> function. If we update xfs_iflush_int() to always fall through to
> the log item update and attach the item completion handler to the
> buffer, any errors that occur after the first call to
> xfs_iflush_int() can be handled with a buffer I/O failure.
> 
> Lift the error handling from xfs_iflush_cluster() into xfs_iflush()
> and consolidate with the existing error handling. This also replaces
> the need to release the buffer because failing the buffer with
> XBF_ASYNC drops the current reference.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Needs a better subject line, because I had no idea what it meant
until I got to the last hunks in the patch.  Perhaps: "Simplify
inode flush error handling" would be a better summary of the
patch....

> @@ -3791,6 +3758,7 @@ xfs_iflush_int(
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_dinode	*dip;
>  	struct xfs_mount	*mp = ip->i_mount;
> +	int			error;

There needs to be a comment added to this function to explain why we
always attached the inode to the buffer and update the flush state,
even on error. This:

> @@ -3914,10 +3885,10 @@ xfs_iflush_int(
>  				&iip->ili_item.li_lsn);
>  
>  	/*
> -	 * Attach the function xfs_iflush_done to the inode's
> -	 * buffer.  This will remove the inode from the AIL
> -	 * and unlock the inode's flush lock when the inode is
> -	 * completely written to disk.
> +	 * Attach the inode item callback to the buffer whether the flush
> +	 * succeeded or not. If not, the caller will shut down and fail I/O
> +	 * completion on the buffer to remove the inode from the AIL and release
> +	 * the flush lock.
>  	 */
>  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);

isn't obviously associated with the "flush_out" label, and so the
structure of the function really isn't explained until you get to
the end of the function. And that's still easy to miss...

Other than that, the code looks OK.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
@ 2020-04-23  4:46   ` Dave Chinner
  2020-04-23 14:29     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> At unmount time, XFS emits a warning for every in-core buffer that
> might have undergone a write error. In practice this behavior is
> probably reasonable given that the filesystem is likely short lived
> once I/O errors begin to occur consistently. Under certain test or
> otherwise expected error conditions, this can spam the logs and slow
> down the unmount.
> 
> We already have a ratelimit state defined for buffers failing
> writeback. Fold this state into the buftarg and reuse it for the
> unmount time errors.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks fine, but I suspect we both missed something here:
xfs_buf_ioerror_alert() was made a ratelimited printk in the last
cycle:

void
xfs_buf_ioerror_alert(
        struct xfs_buf          *bp,
        xfs_failaddr_t          func)
{
        xfs_alert_ratelimited(bp->b_mount,
"metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
                        func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
                        -bp->b_error);
}

Hence I think all these buffer error alerts can be brought under the
same rate limiting variable. Something like this in xfs_message.c:

void
xfs_buf_alert_ratelimited(
        struct xfs_buf          *bp,
	const char		*rlmsg,
	const char		*fmt,
	...)
{
	struct va_format        vaf;
	va_list                 args;

	if (!___ratelimit(&bp->b_target->bt_ioerror_rl, rlmsg))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.args = &args;
	__xfs_printk(KERN_ALERT, bp->b_mount, &vaf);
	va_end(args);
}

and:

void
xfs_buf_ioerror_alert(
        struct xfs_buf          *bp,
        xfs_failaddr_t          func)
{
	xfs_buf_alert_ratelimited(bp, "XFS: metadata IO error",
		"metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
		func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length, -bp->b_error);
}


> ---
>  fs/xfs/xfs_buf.c      | 13 +++++++++++--
>  fs/xfs/xfs_buf.h      |  1 +
>  fs/xfs/xfs_buf_item.c | 10 +---------
>  3 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7a6bc617f0a9..c28a93d2fd8c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1684,10 +1684,12 @@ xfs_wait_buftarg(
>  			struct xfs_buf *bp;
>  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
>  			list_del_init(&bp->b_lru);
> -			if (bp->b_flags & XBF_WRITE_FAIL) {
> +			if (bp->b_flags & XBF_WRITE_FAIL &&
> +			    ___ratelimit(&bp->b_target->bt_ioerror_rl,
> +					 "XFS: Corruption Alert")) {
>  				xfs_alert(btp->bt_mount,
>  "Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> -					(long long)bp->b_bn);
> +					  (long long)bp->b_bn);
>  				xfs_alert(btp->bt_mount,
>  "Please run xfs_repair to determine the extent of the problem.");
>  			}

I think if we are tossing away metadata here, we should probably
shut down the filesystem once the loop has completed. That way we
get all the normal warnings about running xfs_repair and don't have
to open code it here...
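
Something like this is what I have in mind - an untested sketch only,
and the "write_fail" local is a made-up name for illustration:

	bool		write_fail = false;

	/*
	 * In the existing dispose loop, just note that we threw away
	 * dirty metadata instead of warning about every single buffer.
	 */
	if (bp->b_flags & XBF_WRITE_FAIL)
		write_fail = true;

	/*
	 * Once the loop has completed, shut down the filesystem. That
	 * emits the normal corruption warnings for us rather than open
	 * coding them here per buffer.
	 */
	if (write_fail)
		xfs_force_shutdown(btp->bt_mount, SHUTDOWN_CORRUPT_INCORE);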

> -
>  STATIC uint
>  xfs_buf_item_push(
>  	struct xfs_log_item	*lip,
> @@ -518,7 +510,7 @@ xfs_buf_item_push(
>  
>  	/* has a previous flush failed due to IO errors? */
>  	if ((bp->b_flags & XBF_WRITE_FAIL) &&
> -	    ___ratelimit(&xfs_buf_write_fail_rl_state, "XFS: Failing async write")) {
> +	    ___ratelimit(&bp->b_target->bt_ioerror_rl, "XFS: Failing async write")) {
>  		xfs_warn(bp->b_mount,
>  "Failing async write on buffer block 0x%llx. Retrying async write.",
>  			 (long long)bp->b_bn);

This gets simplified to:

	if (bp->b_flags & XBF_WRITE_FAIL) {
		xfs_buf_alert_ratelimited(bp, "XFS: Failing async write",
"Failing async write on buffer block 0x%llx. Retrying async write.",
					(long long)bp->b_bn);
	}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush()
  2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
@ 2020-04-23  4:47   ` Dave Chinner
  2020-04-25 17:28   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:47 UTC (permalink / raw)
  To: Brian Foster, g; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:22PM -0400, Brian Foster wrote:
> The pre-flush dquot verification in xfs_qm_dqflush() duplicates the
> read verifier by checking the dquot in the on-disk buffer. Instead,
> verify the in-core variant before it is flushed to the buffer.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 09/13] xfs: clean up AIL log item removal functions
  2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
@ 2020-04-23  4:54   ` Dave Chinner
  2020-04-25 17:37   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:25PM -0400, Brian Foster wrote:
> We have two AIL removal functions with slightly different semantics.
> xfs_trans_ail_delete() expects the caller to have the AIL lock and
> for the associated item to be AIL resident. If not, the filesystem
> is shut down. xfs_trans_ail_remove() acquires the AIL lock, checks
> that the item is AIL resident and calls the former if so.
> 
> These semantics lead to confused usage between the two. For example,
> the _remove() variant takes a shutdown parameter to pass to the
> _delete() variant, but immediately returns if the AIL bit is not
> set. This means that _remove() would never shut down if an item is
> not AIL resident, even though it appears that many callers would
> expect it to.
> 
> Make the following changes to clean up both of these functions:
> 
> - Most callers of xfs_trans_ail_delete() acquire the AIL lock just
>   before the call. Update _delete() to acquire the lock and open
>   code the couple of callers that make additional checks under AIL
>   lock.
> - Drop the unnecessary ailp parameter from _delete().
> - Drop the unused shutdown parameter from _remove() and open code
>   the implementation.
> 
> In summary, this leaves a _delete() variant that expects an AIL
> resident item and a _remove() helper that checks the AIL bit. Audit
> the existing callsites for use of the appropriate function and
> update as necessary.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

In conjunction with the followed patch to combine the remove and
delete functions, I'm good with this.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]()
  2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
@ 2020-04-23  4:55   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  4:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:26PM -0400, Brian Foster wrote:
> Now that the functions and callers of
> xfs_trans_ail_[remove|delete]() have been fixed up appropriately,
> the only difference between the two is the shutdown behavior. There
> are only a few callers of the _remove() variant, so make the
> shutdown conditional on the parameter and combine the two functions.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 12/13] xfs: random buffer write failure errortag
  2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
@ 2020-04-23  5:11   ` Dave Chinner
  2020-04-25 17:38   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  5:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:28PM -0400, Brian Foster wrote:
> Introduce an error tag to randomly fail async buffer writes. This is
> primarily to facilitate testing of the XFS error configuration
> mechanism.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Allison Collins <allison.henderson@oracle.com>

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 13/13] xfs: remove unused shutdown types
  2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
@ 2020-04-23  5:13   ` Dave Chinner
  2020-04-25 17:39   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  5:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:29PM -0400, Brian Foster wrote:
> Both types control shutdown messaging and neither is used in the
> current codebase.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Yup, historical artifacts, stuff never implemented on Linux.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
  2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
@ 2020-04-23  5:59   ` Dave Chinner
  2020-04-23 14:36     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-23  5:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> The log item failed flag is used to indicate a log item was flushed
> but the underlying buffer was not successfully written to disk. If
> the error configuration allows for writeback retries, xfsaild uses
> the flag to resubmit the underlying buffer without acquiring the
> flush lock of the item.
> 
> The flag is currently set and cleared under the AIL lock and buffer
> lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> completion time. The flag is checked at xfsaild push time under AIL
> lock and cleared under buffer lock before resubmission. If I/O
> eventually succeeds, the flag is cleared in the _done() handler for
> the associated item type, again under AIL lock and buffer lock.

Actually, I think you've missed the relevant lock here: the item's
flush lock. The XFS_LI_FAILED flag is item flush state, and flush
state is protected by the flush lock. When the item has been flushed
and attached to the buffer for completion callbacks, the flush lock
context gets handed to the buffer.

i.e. the buffer owns the flush lock and so while that buffer is
locked for IO we know that the item flush state (and hence the
XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
lock.

(Note: this is how xfs_ifree_cluster() works - it grabs the buffer
lock then walks the items on the buffer and changes the callback
functions because those items are flush locked and hence holding the
buffer lock gives exclusive access to the flush state of those
items....)

> As far as I can tell, the only reason for holding the AIL lock
> across sets/clears is to manage consistency between the log item
> bitop state and the temporary buffer pointer that is attached to the
> log item. The bit itself is used to manage consistency of the
> attached buffer, but is not enough to guarantee the buffer is still
> attached by the time xfsaild attempts to access it.

Correct. The guarantee that the buffer is still attached to the log
item is what the AIL lock provides us with.

> However since
> failure state is always set or cleared under buffer lock (either via
> I/O completion or xfsaild), this particular case can be handled at
> item push time by verifying failure state once under buffer lock.

In theory, yes, but there's a problem before you get that buffer
lock. That is: what serialises access to lip->li_buf?

The xfsaild does not hold a reference to the buffer and, without the
AIL lock to provide synchronisation, the log item reference to the
buffer can be dropped asynchronously by buffer IO completion. Hence
the buffer could be freed between the xfsaild reading lip->li_buf
and calling xfs_buf_trylock(bp). i.e. this introduces a
use-after-free race condition on the initial buffer access.
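
i.e. without any other synchronisation, the access pattern in the
pusher is effectively this (illustrative fragment only):

	struct xfs_buf	*bp = lip->li_buf;	/* no reference held */

	/*
	 * IO completion can run right here, clear XFS_LI_FAILED and
	 * drop the log item's buffer reference. If that was the last
	 * reference, bp now points at freed memory.
	 */

	if (!xfs_buf_trylock(bp))		/* use-after-free */
		return XFS_ITEM_LOCKED;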

IOWs, the xfsaild cannot access lip->li_buf safely unless the
set/clear operations are protected by the same lock the xfsaild
holds. The xfsaild holds neither the buffer lock, a buffer reference
or an item flush lock, hence it's the AIL lock or nothing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling
  2020-04-23  4:18   ` Dave Chinner
@ 2020-04-23 14:28     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-23 14:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 23, 2020 at 02:18:23PM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 01:54:19PM -0400, Brian Foster wrote:
> > The inode flush code has several layers of error handling between
> > the inode and cluster flushing code. If the inode flush fails before
> > acquiring the backing buffer, the inode flush is aborted. If the
> > cluster flush fails, the current inode flush is aborted and the
> > cluster buffer is failed to handle the initial inode and any others
> > that might have been attached before the error.
> > 
> > Since xfs_iflush() is the only caller of xfs_iflush_cluster(), the
> > error handling between the two can be condensed in the top-level
> > function. If we update xfs_iflush_int() to always fall through to
> > the log item update and attach the item completion handler to the
> > buffer, any errors that occur after the first call to
> > xfs_iflush_int() can be handled with a buffer I/O failure.
> > 
> > Lift the error handling from xfs_iflush_cluster() into xfs_iflush()
> > and consolidate with the existing error handling. This also replaces
> > the need to release the buffer because failing the buffer with
> > XBF_ASYNC drops the current reference.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Needs a better subject line, because I had no idea what it meant
> until I got to the last hunks in the patch.  Perhaps: "Simplify
> inode flush error handling" would be a better summary of the
> patch....
> 

Ok, fixed.

> > @@ -3791,6 +3758,7 @@ xfs_iflush_int(
> >  	struct xfs_inode_log_item *iip = ip->i_itemp;
> >  	struct xfs_dinode	*dip;
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	int			error;
> 
> There needs to be a comment added to this function to explain why we
> always attached the inode to the buffer and update the flush state,
> even on error. This:
> 

Indeed. Updated as follows with a comment before the first corruption
check:

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6cdb9fbe2d89..6b8266eeae2d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3766,9 +3766,14 @@ xfs_iflush_int(
 	       ip->i_d.di_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 	ASSERT(iip != NULL && iip->ili_fields != 0);
 
-	/* set *dip = inode's place in the buffer */
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
+	/*
+	 * We don't flush the inode if any of the following checks fail, but we
+	 * do still update the log item and attach to the backing buffer as if
+	 * the flush happened. This is a formality to facilitate predictable
+	 * error handling as the caller will shutdown and fail the buffer.
+	 */
 	error = -EFSCORRUPTED;
 	if (XFS_TEST_ERROR(dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC),
 			       mp, XFS_ERRTAG_IFLUSH_1)) {

Brian

> > @@ -3914,10 +3885,10 @@ xfs_iflush_int(
> >  				&iip->ili_item.li_lsn);
> >  
> >  	/*
> > -	 * Attach the function xfs_iflush_done to the inode's
> > -	 * buffer.  This will remove the inode from the AIL
> > -	 * and unlock the inode's flush lock when the inode is
> > -	 * completely written to disk.
> > +	 * Attach the inode item callback to the buffer whether the flush
> > +	 * succeeded or not. If not, the caller will shut down and fail I/O
> > +	 * completion on the buffer to remove the inode from the AIL and release
> > +	 * the flush lock.
> >  	 */
> >  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
> 
> isn't obviously associated with the "flush_out" label, and so the
> structure of the function really isn't explained until you get to
> the end of the function. And that's still easy to miss...
> 
> Other than that, the code looks OK.
> 
> CHeers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-23  4:46   ` Dave Chinner
@ 2020-04-23 14:29     ` Brian Foster
  2020-04-23 21:14       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-23 14:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > At unmount time, XFS emits a warning for every in-core buffer that
> > might have undergone a write error. In practice this behavior is
> > probably reasonable given that the filesystem is likely short lived
> > once I/O errors begin to occur consistently. Under certain test or
> > otherwise expected error conditions, this can spam the logs and slow
> > down the unmount.
> > 
> > We already have a ratelimit state defined for buffers failing
> > writeback. Fold this state into the buftarg and reuse it for the
> > unmount time errors.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Looks fine, but I suspect we both missed something here:
> xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> cycle:
> 
> void
> xfs_buf_ioerror_alert(
>         struct xfs_buf          *bp,
>         xfs_failaddr_t          func)
> {
>         xfs_alert_ratelimited(bp->b_mount,
> "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
>                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
>                         -bp->b_error);
> }
> 

Yeah, I hadn't noticed that.

> Hence I think all these buffer error alerts can be brought under the
> same rate limiting variable. Something like this in xfs_message.c:
> 

One thing to note is that xfs_alert_ratelimited() ultimately uses
the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
here uses 30s (both use a burst of 10). That seems reasonable enough to
me for I/O errors so I'm good with the changes below.
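
For reference, the shared state gets initialized when the buftarg is
set up, roughly along these lines, so the 30s interval and default
burst of 10 carry over unchanged:

	/* buffer I/O error rate limit: 10 messages per 30 seconds */
	ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
			     DEFAULT_RATELIMIT_BURST);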

FWIW, that also means we could just call xfs_buf_alert_ratelimited()
from xfs_buf_item_push() if we're also Ok with using an "alert" instead
of a "warn." I'm not immediately aware of a reason to use one over the
other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
hear an objection. The xfs_wait_buftarg() ratelimit presumably remains
open coded because it's two separate calls and we probably don't want
them to individually count against the limit.

Brian

> void
> xfs_buf_alert_ratelimited(
>         struct xfs_buf          *bp,
> 	const char		*rlmsg,
> 	const char		*fmt,
> 	...)
> {
> 	struct va_format        vaf;
> 	va_list                 args;
> 
> 	if (!___ratelimit(&bp->b_target->bt_ioerror_rl, rlmsg))
> 		return;
> 
> 	va_start(args, fmt);
> 	vaf.fmt = fmt;
> 	vaf.args = &args;
> 	__xfs_printk(KERN_ALERT, bp->b_mount, &vaf);
> 	va_end(args);
> }
> 
> and:
> 
> void
> xfs_buf_ioerror_alert(
>         struct xfs_buf          *bp,
>         xfs_failaddr_t          func)
> {
> 	xfs_buf_alert_ratelimited(bp, "XFS: metadata IO error",
> 		"metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> 		func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length, -bp->b_error);
> }
> 
> 
> > ---
> >  fs/xfs/xfs_buf.c      | 13 +++++++++++--
> >  fs/xfs/xfs_buf.h      |  1 +
> >  fs/xfs/xfs_buf_item.c | 10 +---------
> >  3 files changed, 13 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 7a6bc617f0a9..c28a93d2fd8c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1684,10 +1684,12 @@ xfs_wait_buftarg(
> >  			struct xfs_buf *bp;
> >  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
> >  			list_del_init(&bp->b_lru);
> > -			if (bp->b_flags & XBF_WRITE_FAIL) {
> > +			if (bp->b_flags & XBF_WRITE_FAIL &&
> > +			    ___ratelimit(&bp->b_target->bt_ioerror_rl,
> > +					 "XFS: Corruption Alert")) {
> >  				xfs_alert(btp->bt_mount,
> >  "Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> > -					(long long)bp->b_bn);
> > +					  (long long)bp->b_bn);
> >  				xfs_alert(btp->bt_mount,
> >  "Please run xfs_repair to determine the extent of the problem.");
> >  			}
> 
> I think if we are tossing away metadata here, we should probably
> shut down the filesystem once the loop has completed. That way we
> get all the normal warnings about running xfs_repair and don't have
> to open code it here...
> 
> > -
> >  STATIC uint
> >  xfs_buf_item_push(
> >  	struct xfs_log_item	*lip,
> > @@ -518,7 +510,7 @@ xfs_buf_item_push(
> >  
> >  	/* has a previous flush failed due to IO errors? */
> >  	if ((bp->b_flags & XBF_WRITE_FAIL) &&
> > -	    ___ratelimit(&xfs_buf_write_fail_rl_state, "XFS: Failing async write")) {
> > +	    ___ratelimit(&bp->b_target->bt_ioerror_rl, "XFS: Failing async write")) {
> >  		xfs_warn(bp->b_mount,
> >  "Failing async write on buffer block 0x%llx. Retrying async write.",
> >  			 (long long)bp->b_bn);
> 
> This gets simplified to:
> 
> 	if (bp->b_flags & XBF_WRITE_FAIL) {
> 		xfs_buf_alert_ratelimited(bp, "XFS: Failing async write",
> "Failing async write on buffer block 0x%llx. Retrying async write.",
> 					(long long)bp->b_bn);
> 	}
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
  2020-04-23  5:59   ` Dave Chinner
@ 2020-04-23 14:36     ` Brian Foster
  2020-04-23 21:38       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-23 14:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > The log item failed flag is used to indicate a log item was flushed
> > but the underlying buffer was not successfully written to disk. If
> > the error configuration allows for writeback retries, xfsaild uses
> > the flag to resubmit the underlying buffer without acquiring the
> > flush lock of the item.
> > 
> > The flag is currently set and cleared under the AIL lock and buffer
> > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > completion time. The flag is checked at xfsaild push time under AIL
> > lock and cleared under buffer lock before resubmission. If I/O
> > eventually succeeds, the flag is cleared in the _done() handler for
> > the associated item type, again under AIL lock and buffer lock.
> 
> Actually, I think you've missed the relevant lock here: the item's
> flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> state is protected by the flush lock. When the item has been flushed
> and attached to the buffer for completion callbacks, the flush lock
> context gets handed to the buffer.
> 
> i.e. the buffer owns the flush lock and so while that buffer is
> locked for IO we know that the item flush state (and hence the
> XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> lock.
> 

Right..

> (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> lock then walks the items on the buffer and changes the callback
> functions because those items are flush locked and hence holding the
> buffer lock gives exclusive access to the flush state of those
> items....)
> 
> > As far as I can tell, the only reason for holding the AIL lock
> > across sets/clears is to manage consistency between the log item
> > bitop state and the temporary buffer pointer that is attached to the
> > log item. The bit itself is used to manage consistency of the
> > attached buffer, but is not enough to guarantee the buffer is still
> > attached by the time xfsaild attempts to access it.
> 
> Correct. The guarantee that the buffer is still attached to the log
> item is what the AIL lock provides us with.
> 
> > However since
> > failure state is always set or cleared under buffer lock (either via
> > I/O completion or xfsaild), this particular case can be handled at
> > item push time by verifying failure state once under buffer lock.
> 
> In theory, yes, but there's a problem before you get that buffer
> lock. That is: what serialises access to lip->li_buf?
> 

That's partly what I was referring to above by the push time changes.
This patch was an attempt to replace reliance on ail_lock with a push
time sequence that would serialize access to a failed buffer
(essentially relying on the failed bit). Whether it's correct or not is
another matter... ;)

> The xfsaild does not hold a reference to the buffer and, without the
> AIL lock to provide synchronisation, the log item reference to the
> buffer can be dropped asynchronously by buffer IO completion. Hence
> the buffer could be freed between the xfsaild reading lip->li_buf
> and calling xfs_buf_trylock(bp). i.e. this introduces a
> use-after-free race condition on the initial buffer access.
> 

Hmm.. the log item holds a temporary reference to the buffer when the
failed state is set. On submission, xfsaild queues the failed buffer
(new ref for the delwri queue), clears the failed state and drops the
failure reference of every failed item that is attached. xfsaild is also
the only task that knows how to process LI_FAILED items, so I don't
think we'd ever race with a failed buffer becoming "unfailed" from
xfsaild (which is the scenario where a buffer could disappear from I/O
completion). In some sense, clearing LI_FAILED of an item is essentially
retaking ownership of the item flush lock and hold of the underlying
buffer, but the existing code is certainly not written that way.

The issue I was wrestling with wrt this patch was primarily
maintaining consistency between the bit and the assignment of the
pointer on a failing I/O. E.g., the buffer pointer is set by the task
that sets the bit, but xfsaild checking the bit doesn't guarantee the
pointer had been set yet. I did add a post-buffer lock LI_FAILED check
as well, but that probably could have been an assert.

> IOWs, the xfsaild cannot access lip->li_buf safely unless the
> set/clear operations are protected by the same lock the xfsaild
> holds. The xfsaild holds neither the buffer lock, a buffer reference
> or an item flush lock, hence it's the AIL lock or nothing...
> 

Yeah, that was my impression from reading the code. I realize from this
discussion that this doesn't actually simplify the logic. That was the
primary goal, so I think I need to revisit the approach here. Even if
this is correct (which I'm still not totally sure of), it's fragile in
the sense that it partly relies on single-threadedness of xfsaild. I
initially thought about adding a log item spinlock for this kind of
thing, but it didn't (and still doesn't) seem appropriate to burden
every log item in the system with a spinlock for the rare case of buffer
I/O errors.

I mentioned in an earlier patch that it would be nice to consider
removing ->li_buf entirely but hadn't thought it through. That might
alleviate the need to serialize the pointer and the state in the first
place as well as remove the subtle ordering requirement from the
resubmit code to manage the buffer reference. Suppose we relied on
LI_FAILED to represent the buffer hold + flush lock, and then relied on
the hold to facilitate an in-core lookup from xfsaild. For example:

xfs_set_li_failed(lip, bp)
{
	/* keep the backing buffer in-core so long as the item is failed */
	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_hold(bp);
}

xfs_clear_li_failed(lip, bp)
{
	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_rele(bp);
}

/*
 * helper to get a mapping from a buffer backed log item and use it to
 * find an in-core buf
 */
xfs_item_to_bp(lip)
{
	if (inode item) {
		...
		blk = ip->imap->im_blkno;
		len = ip->imap->im_len;
	} else if (dquot item) {
		...
		blk = dqp->q_blkno;
		len = mp->m_quotainfo->qi_dqchunklen;
	}

	return xfs_buf_incore(blk, len, XBF_TRYLOCK);
}

xfsaild_resubmit_item(lip)
{
	...

	bp = xfs_item_to_bp(lip);
	if (!bp)
		return XFS_ITEM_LOCKED;
	if (!test_bit(XFS_LI_FAILED, &lip->li_flags))
		goto out_unlock;

	/* drop all failure refs */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
		xfs_clear_li_failed(lip, bp);

	xfs_buf_delwri_queue(...);

	xfs_buf_relse(bp);
	return XFS_ITEM_SUCCESS;
}

I think that would push everything under the buffer lock (and remove
->li_buf) with the caveat that we'd have to lock the inode/dquot to get
the buffer. Any thoughts on something like that? Note that at this point
I might drop this patch from the series regardless due to the complexity
mismatch and consider it separately.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-23 14:29     ` Brian Foster
@ 2020-04-23 21:14       ` Dave Chinner
  2020-04-24 11:12         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-23 21:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > At unmount time, XFS emits a warning for every in-core buffer that
> > > might have undergone a write error. In practice this behavior is
> > > probably reasonable given that the filesystem is likely short lived
> > > once I/O errors begin to occur consistently. Under certain test or
> > > otherwise expected error conditions, this can spam the logs and slow
> > > down the unmount.
> > > 
> > > We already have a ratelimit state defined for buffers failing
> > > writeback. Fold this state into the buftarg and reuse it for the
> > > unmount time errors.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > 
> > Looks fine, but I suspect we both missed something here:
> > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > cycle:
> > 
> > void
> > xfs_buf_ioerror_alert(
> >         struct xfs_buf          *bp,
> >         xfs_failaddr_t          func)
> > {
> >         xfs_alert_ratelimited(bp->b_mount,
> > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> >                         -bp->b_error);
> > }
> > 
> 
> Yeah, I hadn't noticed that.
> 
> > Hence I think all these buffer error alerts can be brought under the
> > same rate limiting variable. Something like this in xfs_message.c:
> > 
> 
> One thing to note is that xfs_alert_ratelimited() ultimately uses
> the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> here uses 30s (both use a burst of 10). That seems reasonable enough to
> me for I/O errors so I'm good with the changes below.
> 
> FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> of a "warn." I'm not immediately aware of a reason to use one over the
> other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> hear an objection.

Sounds fine to me.

> The xfs_wait_buftarg() ratelimit presumably remains
> open coded because it's two separate calls and we probably don't want
> them to individually count against the limit.

That's why I suggested dropping the second "run xfs_repair" message
and triggering a shutdown after the wait loop. That way we don't
issue "run xfs_repair" for every single failed buffer (largely
noise!), and we get a non-rate-limited common "run xfs_repair"
message once we've processed all the failed writes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
  2020-04-23 14:36     ` Brian Foster
@ 2020-04-23 21:38       ` Dave Chinner
  2020-04-24 11:14         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-23 21:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > The log item failed flag is used to indicate a log item was flushed
> > > but the underlying buffer was not successfully written to disk. If
> > > the error configuration allows for writeback retries, xfsaild uses
> > > the flag to resubmit the underlying buffer without acquiring the
> > > flush lock of the item.
> > > 
> > > The flag is currently set and cleared under the AIL lock and buffer
> > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > completion time. The flag is checked at xfsaild push time under AIL
> > > lock and cleared under buffer lock before resubmission. If I/O
> > > eventually succeeds, the flag is cleared in the _done() handler for
> > > the associated item type, again under AIL lock and buffer lock.
> > 
> > Actually, I think you've missed the relevant lock here: the item's
> > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > state is protected by the flush lock. When the item has been flushed
> > and attached to the buffer for completion callbacks, the flush lock
> > context gets handed to the buffer.
> > 
> > i.e. the buffer owns the flush lock and so while that buffer is
> > locked for IO we know that the item flush state (and hence the
> > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > lock.
> > 
> 
> Right..
> 
> > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > lock then walks the items on the buffer and changes the callback
> > functions because those items are flush locked and hence holding the
> > buffer lock gives exclusive access to the flush state of those
> > items....)
> > 
> > > As far as I can tell, the only reason for holding the AIL lock
> > > across sets/clears is to manage consistency between the log item
> > > bitop state and the temporary buffer pointer that is attached to the
> > > log item. The bit itself is used to manage consistency of the
> > > attached buffer, but is not enough to guarantee the buffer is still
> > > attached by the time xfsaild attempts to access it.
> > 
> > Correct. The guarantee that the buffer is still attached to the log
> > item is what the AIL lock provides us with.
> > 
> > > However since
> > > failure state is always set or cleared under buffer lock (either via
> > > I/O completion or xfsaild), this particular case can be handled at
> > > item push time by verifying failure state once under buffer lock.
> > 
> > In theory, yes, but there's a problem before you get that buffer
> > lock. That is: what serialises access to lip->li_buf?
> > 
> 
> That's partly what I was referring to above by the push time changes.
> This patch was an attempt to replace reliance on ail_lock with a push
> time sequence that would serialize access to a failed buffer
> (essentially relying on the failed bit). Whether it's correct or not is
> another matter... ;)
> 
> > The xfsaild does not hold a reference to the buffer and, without the
> > AIL lock to provide synchronisation, the log item reference to the
> > buffer can be dropped asynchronously by buffer IO completion. Hence
> > the buffer could be freed between the xfsaild reading lip->li_buf
> > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > use-after-free race condition on the initial buffer access.
> > 
> 
> Hmm.. the log item holds a temporary reference to the buffer when the
> failed state is set. On submission, xfsaild queues the failed buffer
> (new ref for the delwri queue), clears the failed state and drops the
> failure reference of every failed item that is attached. xfsaild is also
> the only task that knows how to process LI_FAILED items, so I don't
> think we'd ever race with a failed buffer becoming "unfailed" from
> xfsaild (which is the scenario where a buffer could disappear from I/O
> completion).

Sure we can. Every inode on the buffer has XFS_LI_FAILED set on it,
so the first inode in the AIL triggers the resubmit and starts the
IO. The AIL runs again, coming across another inode on the buffer
further into the AIL. That inode has been modified in memory while
the flush was in progress, so it no longer gets removed from the AIL
by the IO completion. Instead, xfs_clear_li_failed() is called from
xfs_iflush_done() and the buffer is removed from the logitem and the
reference dropped.

> In some sense, clearing LI_FAILED of an item is essentially
> retaking ownership of the item flush lock and hold of the underlying
> buffer, but the existing code is certainly not written that way.

Only IO completion clears the LI_FAILED state of the item, not the
xfsaild. IO completion already owns the flush lock - the xfsaild
only holds it long enough to flush an inode to the buffer and then
give the flush lock to the buffer.

Also, we clear LI_FAILED when removing the item from the AIL under
the AIL lock, so the only case here that we are concerned about is
the above "inode was relogged while being flushed" situation. This
situation is rare, failed buffers are rare, and so requiring the AIL
lock to be held is hardly a performance limiting factor here...

> The issue I was wrestling with wrt to this patch was primarily
> maintaining consistency between the bit and the assignment of the
> pointer on a failing I/O. E.g., the buffer pointer is set by the task
> that sets the bit, but xfsaild checking the bit doesn't guarantee the
> pointer had been set yet. I did add a post-buffer lock LI_FAILED check
> as well, but that probably could have been an assert.

So you've been focusing on races that may occur when setting the
li_buf pointer, but I'm looking at races involving the clearing of
the li_buf pointer. :)
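
For reference, the set/clear helpers in question currently look roughly
like this (paraphrased from fs/xfs/xfs_trans_priv.h, so treat the
details as approximate):

static inline void
xfs_set_li_failed(
	struct xfs_log_item	*lip,
	struct xfs_buf		*bp)
{
	lockdep_assert_held(&lip->li_ailp->ail_lock);

	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
		xfs_buf_hold(bp);	/* failure reference */
		lip->li_buf = bp;
	}
}

static inline void
xfs_clear_li_failed(
	struct xfs_log_item	*lip)
{
	struct xfs_buf	*bp = lip->li_buf;

	lockdep_assert_held(&lip->li_ailp->ail_lock);

	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
		/* this clear of li_buf is what the xfsaild can race with */
		lip->li_buf = NULL;
		xfs_buf_rele(bp);
	}
}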

> > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > set/clear operations are protected by the same lock the xfsaild
> > holds. The xfsaild holds neither the buffer lock, a buffer reference
> > or an item flush lock, hence it's the AIL lock or nothing...
> > 
> 
> Yeah, that was my impression from reading the code. I realize from this
> discussion that this doesn't actually simplify the logic. That was the
> primary goal, so I think I need to revisit the approach here. Even if
> this is correct (which I'm still not totally sure of), it's fragile in
> the sense that it partly relies on single-threadedness of xfsaild. I
> initially thought about adding a log item spinlock for this kind of
> thing, but it didn't (and still doesn't) seem appropriate to burden
> every log item in the system with a spinlock for the rare case of buffer
> I/O errors.

I feel this is all largely moot, because my patches make this whole
problem of setting/clearing the li_buf for failed buffers go away
altogether. Hence I would suggest that this patch gets dropped for
now, because getting rid of the AIL lock is much less troublesome
once LI_FAILED is no longer dependent on the inode/dquot log item
taking temporary references to the underlying buffer....

> 
> I mentioned in an earlier patch that it would be nice to consider
> removing ->li_buf entirely but hadn't thought it through.

I'm going the opposite way entirely, such that LI_FAILED doesn't
have any special state associated with it at all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-23 21:14       ` Dave Chinner
@ 2020-04-24 11:12         ` Brian Foster
  2020-04-24 22:08           ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2020-04-24 11:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 07:14:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > > At unmount time, XFS emits a warning for every in-core buffer that
> > > > might have undergone a write error. In practice this behavior is
> > > > probably reasonable given that the filesystem is likely short lived
> > > > once I/O errors begin to occur consistently. Under certain test or
> > > > otherwise expected error conditions, this can spam the logs and slow
> > > > down the unmount.
> > > > 
> > > > We already have a ratelimit state defined for buffers failing
> > > > writeback. Fold this state into the buftarg and reuse it for the
> > > > unmount time errors.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > 
> > > Looks fine, but I suspect we both missed something here:
> > > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > > cycle:
> > > 
> > > void
> > > xfs_buf_ioerror_alert(
> > >         struct xfs_buf          *bp,
> > >         xfs_failaddr_t          func)
> > > {
> > >         xfs_alert_ratelimited(bp->b_mount,
> > > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> > >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> > >                         -bp->b_error);
> > > }
> > > 
> > 
> > Yeah, I hadn't noticed that.
> > 
> > > Hence I think all these buffer error alerts can be brought under the
> > > same rate limiting variable. Something like this in xfs_message.c:
> > > 
> > 
> > One thing to note is that xfs_alert_ratelimited() ultimately uses
> > the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> > here uses 30s (both use a burst of 10). That seems reasonable enough to
> > me for I/O errors so I'm good with the changes below.
> > 
> > FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> > from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> > of a "warn." I'm not immediately aware of a reason to use one over the
> > other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> > hear an objection.
> 
> Sounds fine to me.
> 
> > The xfs_wait_buftarg() ratelimit presumably remains
> > open coded because it's two separate calls and we probably don't want
> > them to individually count against the limit.
> 
> That's why I suggested dropping the second "run xfs_repair" message
> and triggering a shutdown after the wait loop. That way we don't
> issue "run xfs_repair" for every single failed buffer (largely
> noise!), and we get a non-rate-limited common "run xfs-repair"
> message once we processed all the failed writes.
> 

Sorry, must have missed that in the last reply. I don't think we want to
shut down here because XBF_WRITE_FAIL only reflects that the internal
async write retry (e.g. the one historically used to mitigate transient
I/O errors) has occurred on the buffer, not necessarily that the
immediately previous I/O has failed. For that reason I've kind of looked
at this particular instance as more of a warning that I/O errors have
occurred in the past and the user might want to verify it didn't result
in unexpected damage. FWIW, I've observed plenty of these messages long
after I've disabled error injection and allowed I/O to proceed and the
fs to unmount cleanly.

Looking at it again, the wording of the message is much stronger than a
warning (i.e. "Corruption Alert", "permanent write failures"), so
perhaps we should revisit what we're actually trying to accomplish with
this message. Note that we do have the buffer push time alert to
indicate that I/O errors and retries are occurring, as well as error
handling logic to shutdown on I/O failure while the fs is unmounting
(xfs_buf_iodone_callback_error()), so both of those cases are
fundamentally covered already.

ISTM that leaves at least a few simple options for the
xfs_wait_buftarg() message:

1.) Remove it.
2.) Massage it into more of a "Warning: buffer I/O errors have occurred
in the past" type of message. This could perhaps retain the existing
XBF_WRITE_FAIL logic, but use a flag to lift it out of the loop and thus
avoid the need to rate limit it at all.
3.) Fix up the logic to only dump (ratelimited) specifics on buffers
that have actually failed I/O. This means we use the existing message
but perhaps check ->b_error instead of XBF_WRITE_FAIL. We still don't
need to shut down (I'd probably add an assert), but we could also lift
the second xfs_repair line of the message out of the loop to reduce the
noise.

We could also look into clearing XBF_WRITE_FAIL on successful I/O
completion, FWIW. The existing logic kind of bugs me regardless.
Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
  2020-04-23 21:38       ` Dave Chinner
@ 2020-04-24 11:14         ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-24 11:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 07:38:39AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > > The log item failed flag is used to indicate a log item was flushed
> > > > but the underlying buffer was not successfully written to disk. If
> > > > the error configuration allows for writeback retries, xfsaild uses
> > > > the flag to resubmit the underlying buffer without acquiring the
> > > > flush lock of the item.
> > > > 
> > > > The flag is currently set and cleared under the AIL lock and buffer
> > > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > > completion time. The flag is checked at xfsaild push time under AIL
> > > > lock and cleared under buffer lock before resubmission. If I/O
> > > > eventually succeeds, the flag is cleared in the _done() handler for
> > > > the associated item type, again under AIL lock and buffer lock.
> > > 
> > > Actually, I think you've missed the relevant lock here: the item's
> > > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > > state is protected by the flush lock. When the item has been flushed
> > > and attached to the buffer for completion callbacks, the flush lock
> > > context gets handed to the buffer.
> > > 
> > > i.e. the buffer owns the flush lock and so while that buffer is
> > > locked for IO we know that the item flush state (and hence the
> > > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > > lock.
> > > 
> > 
> > Right..
> > 
> > > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > > lock then walks the items on the buffer and changes the callback
> > > functions because those items are flush locked and hence holding the
> > > buffer lock gives exclusive access to the flush state of those
> > > items....)
> > > 
> > > > As far as I can tell, the only reason for holding the AIL lock
> > > > across sets/clears is to manage consistency between the log item
> > > > bitop state and the temporary buffer pointer that is attached to the
> > > > log item. The bit itself is used to manage consistency of the
> > > > attached buffer, but is not enough to guarantee the buffer is still
> > > > attached by the time xfsaild attempts to access it.
> > > 
> > > Correct. The guarantee that the buffer is still attached to the log
> > > item is what the AIL lock provides us with.
> > > 
> > > > However since
> > > > failure state is always set or cleared under buffer lock (either via
> > > > I/O completion or xfsaild), this particular case can be handled at
> > > > item push time by verifying failure state once under buffer lock.
> > > 
> > > In theory, yes, but there's a problem before you get that buffer
> > > lock. That is: what serialises access to lip->li_buf?
> > > 
> > 
> > That's partly what I was referring to above by the push time changes.
> > This patch was an attempt to replace reliance on ail_lock with a push
> > time sequence that would serialize access to a failed buffer
> > (essentially relying on the failed bit). Whether it's correct or not is
> > another matter... ;)
> > 
> > > The xfsaild does not hold a reference to the buffer and, without the
> > > AIL lock to provide synchronisation, the log item reference to the
> > > buffer can be dropped asynchronously by buffer IO completion. Hence
> > > the buffer could be freed between the xfsaild reading lip->li_buf
> > > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > > use-after-free race condition on the initial buffer access.
> > > 
> > 
> > Hmm.. the log item holds a temporary reference to the buffer when the
> > failed state is set. On submission, xfsaild queues the failed buffer
> > (new ref for the delwri queue), clears the failed state and drops the
> > failure reference of every failed item that is attached. xfsaild is also
> > the only task that knows how to process LI_FAILED items, so I don't
> > think we'd ever race with a failed buffer becoming "unfailed" from
> > xfsaild (which is the scenario where a buffer could disappear from I/O
> > completion).
> 
> Sure we can. Every inode on the buffer has XFS_LI_FAILED set on it,
> so the first inode in the AIL triggers the resubmit and starts the
> IO. The AIL runs again, coming across another inode on the buffer
> further into the AIL. That inode has been modified in memory while
> the flush was in progress, so it no longer gets removed from the AIL
> by the IO completion. Instead, xfs_clear_li_failed() is called from
> xfs_iflush_done(), the buffer is removed from the log item and the
> reference is dropped.
> 

When xfsaild encounters the first failed item, the buffer resubmit path
(xfsaild_resubmit_item(), as of this series) calls xfs_clear_li_failed()
on every inode attached to the buffer. AFAICT, that means xfsaild will
see any associated inode as FLUSHING (or PINNED, if another in-core
modification has come through) once the buffer is requeued.
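
For reference, the resubmit path as described here looks roughly like
the following (a sketch paraphrased from the series, not a verbatim
copy of the patch):

static int
xfsaild_resubmit_item(
	struct xfs_log_item	*lip,
	struct list_head	*buffer_list)
{
	struct xfs_buf		*bp = lip->li_buf;

	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

	if (!xfs_buf_delwri_queue(bp, buffer_list)) {
		xfs_buf_unlock(bp);
		return XFS_ITEM_FLUSHING;
	}

	/*
	 * Clear the failed state (and drop the failure reference) of
	 * every item attached to the buffer, not just the item that
	 * triggered the resubmit.
	 */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
		xfs_clear_li_failed(lip);

	xfs_buf_unlock(bp);
	return XFS_ITEM_SUCCESS;
}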

> > In some sense, clearing LI_FAILED of an item is essentially
> > retaking ownership of the item flush lock and hold of the underlying
> > buffer, but the existing code is certainly not written that way.
> 
> Only IO completion clears the LI_FAILED state of the item, not the
> xfsaild. IO completion already owns the flush lock - the xfsaild
> only holds it long enough to flush an inode to the buffer and then
> give the flush lock to the buffer.
> 
> Also, we clear LI_FAILED when removing the item from the AIL under
> the AIL lock, so the only case here that we are concerned about is
> the above "inode was relogged while being flushed" situation. This
> situation is rare, failed buffers are rare, and so requiring the AIL
> lock to be held is hardly a performance limiting factor here...
> 

Yep, though because of the above I think I/O completion clearing the
failed state might require a more subtle dance between xfsaild and
another submission context, such as if reclaim happens to flush an inode
that wasn't already attached to a previously failed buffer (for
example).

Eh, it doesn't matter that much in any event because..

> > The issue I was wrestling with wrt to this patch was primarily
> > maintaining consistency between the bit and the assignment of the
> > pointer on a failing I/O. E.g., the buffer pointer is set by the task
> > that sets the bit, but xfsaild checking the bit doesn't guarantee the
> > pointer had been set yet. I did add a post-buffer lock LI_FAILED check
> > as well, but that probably could have been an assert.
> 
> So you've been focusing on races that may occur when setting the
> li_buf pointer, but I'm looking at races involving the clearing of
> the li_buf pointer. :)
> 
> > > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > > set/clear operations are protected by the same lock the xfsaild
> > > holds. The xfsaild holds neither the buffer lock, a buffer reference
> > > or an item flush lock, hence it's the AIL lock or nothing...
> > > 
> > 
> > Yeah, that was my impression from reading the code. I realize from this
> > discussion that this doesn't actually simplify the logic. That was the
> > primary goal, so I think I need to revisit the approach here. Even if
> > this is correct (which I'm still not totally sure of), it's fragile in
> > the sense that it partly relies on single-threadedness of xfsaild. I
> > initially thought about adding a log item spinlock for this kind of
> > thing, but it didn't (and still doesn't) seem appropriate to burden
> > every log item in the system with a spinlock for the rare case of buffer
> > I/O errors.
> 
> I feel this is all largely moot, because my patches make this whole
> problem of setting/clearing the li_buf for failed buffers go away
> altogether. Hence I would suggest that this patch gets dropped for
> now, because getting rid of the AIL lock is much less troublesome
> once LI_FAILED is no longer dependent on the inode/dquot log item
> taking temporary references to the underlying buffer....
> 

... I'm good with that. ;) This was intended to be a low level cleanup
to eliminate a fairly minor ail_lock abuse. Working around the ->li_buf
thing is the primary challenge, so if you have a broader rework that
addresses that as a side effect then that presumably makes things
easier.  I'll revisit this if necessary once your work is further
along...

Brian

> > 
> > I mentioned in an earlier patch that it would be nice to consider
> > removing ->li_buf entirely but hadn't thought it through.
> 
> I'm going the opposite way entirely, such that LI_FAILED doesn't
> have any special state associated with it at all...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-24 11:12         ` Brian Foster
@ 2020-04-24 22:08           ` Dave Chinner
  2020-04-27 11:11             ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2020-04-24 22:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 07:12:32AM -0400, Brian Foster wrote:
> On Fri, Apr 24, 2020 at 07:14:37AM +1000, Dave Chinner wrote:
> > On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> > > On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > > > At unmount time, XFS emits a warning for every in-core buffer that
> > > > > might have undergone a write error. In practice this behavior is
> > > > > probably reasonable given that the filesystem is likely short lived
> > > > > once I/O errors begin to occur consistently. Under certain test or
> > > > > otherwise expected error conditions, this can spam the logs and slow
> > > > > down the unmount.
> > > > > 
> > > > > We already have a ratelimit state defined for buffers failing
> > > > > writeback. Fold this state into the buftarg and reuse it for the
> > > > > unmount time errors.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > 
> > > > Looks fine, but I suspect we both missed something here:
> > > > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > > > cycle:
> > > > 
> > > > void
> > > > xfs_buf_ioerror_alert(
> > > >         struct xfs_buf          *bp,
> > > >         xfs_failaddr_t          func)
> > > > {
> > > >         xfs_alert_ratelimited(bp->b_mount,
> > > > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> > > >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> > > >                         -bp->b_error);
> > > > }
> > > > 
> > > 
> > > Yeah, I hadn't noticed that.
> > > 
> > > > Hence I think all these buffer error alerts can be brought under the
> > > > same rate limiting variable. Something like this in xfs_message.c:
> > > > 
> > > 
> > > One thing to note is that xfs_alert_ratelimited() ultimately uses
> > > the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> > > here uses 30s (both use a burst of 10). That seems reasonable enough to
> > > me for I/O errors so I'm good with the changes below.
> > > 
> > > FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> > > from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> > > of a "warn." I'm not immediately aware of a reason to use one over the
> > > other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> > > hear an objection.
> > 
> > Sounds fine to me.
> > 
> > > The xfs_wait_buftarg() ratelimit presumably remains
> > > open coded because it's two separate calls and we probably don't want
> > > them to individually count against the limit.
> > 
> > That's why I suggested dropping the second "run xfs_repair" message
> > and triggering a shutdown after the wait loop. That way we don't
> > issue "run xfs_repair" for every single failed buffer (largely
> > noise!), and we get a non-rate-limited common "run xfs-repair"
> > message once we processed all the failed writes.
> > 
> 
> Sorry, must have missed that in the last reply. I don't think we want to
> shut down here because XBF_WRITE_FAIL only reflects that the internal
> async write retry (e.g. the one historically used to mitigate transient
> I/O errors) has occurred on the buffer, not necessarily that the
> immediately previous I/O has failed.

I think this is an incorrect reading of how XBF_WRITE_FAIL
functions. XBF_WRITE_FAIL is used to indicate that the previous
write failed, not that a historic write failed. The flag is cleared
at buffer submission time - see xfs_buf_delwri_submit_buffers() and
xfs_bwrite() - and so it is only set on buffers whose previous IO
failed and hence are still dirty and have not been flushed back to
disk.
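
i.e. the delwri submission path does, roughly (paraphrased):

	/* xfs_buf_delwri_submit_buffers(); xfs_bwrite() clears it the same way */
	bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
	bp->b_flags |= XBF_WRITE;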

If we hit this in xfs_buftarg_wait() after we've pushed the AIL in
xfs_log_quiesce() on unmount, then we've got write failures that
could not be resolved by repeated retries, and the filesystem is, at
this instant in time, inconsistent on disk.

That's a shutdown error condition...

> For that reason I've kind of looked
> at this particular instance as more of a warning that I/O errors have
> occurred in the past and the user might want to verify it didn't result
> in unexpected damage. FWIW, I've observed plenty of these messages long
> after I've disabled error injection and allowed I/O to proceed and the
> fs to unmount cleanly.

Right. That's the whole point of the flag - the buffer has been
dirtied and it hasn't been able to be written back when we are
purging the buffer cache at the end of unmount. i.e. When
xfs_buftarg_wait() is called, all buffers should be clean because we
are about to write an unmount record to mark the log clean once all
the logged metadata is written back.

What we do right now - write an unmount record after failing metadata
writeback - is actually a bug, and that is one of the
reasons why I suggested a shutdown should be done. i.e. we should
not be writing an unmount record to mark the log clean if we failed
to write back metadata. That metadata is still valid in the journal,
and so it should remain valid in the journal to allow it to be
replayed on the next mount. i.e. retry the writes from log recovery
after the hardware failure has been addressed and the IO errors have
gone away.

Tossing the dirty buffers at unmount and then marking the journal
clean is actually making the buffer write failure -worse-. Doing this
guarantees the filesystem is inconsistent on disk (by updating the
journal to indicate those writes actually succeeded) and absolutely
requires xfs_repair to fix as a result.

If we shut down on XBF_WRITE_FAIL buffers in xfs_buftarg_wait(), we
will not write an unmount record and so give the filesystem a chance
to recover on next mount (e.g. after a power cycle to clear whatever
raid hardware bug was being hit) and write that dirty metadata back
without failure.  If recovery fails with IO errors, then the user
really does need to run repair.  However, the situation at this
point is still better than if we write a clean unmount record after
write failures...
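
Concretely, the suggestion amounts to something like this after the
buffer drain loop in xfs_wait_buftarg() (sketch only, untested;
"write_fail" is an assumed local flag set in the existing per-buffer
loop rather than anything in the current code):

	if (write_fail) {
		xfs_alert(btp->bt_mount,
"Failed metadata writeback, filesystem may be inconsistent. Run xfs_repair.");
		xfs_force_shutdown(btp->bt_mount, SHUTDOWN_META_IO_ERROR);
	}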

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild
  2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
  2020-04-23  4:09   ` Dave Chinner
@ 2020-04-25 17:21   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code
  2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
  2020-04-23  4:10   ` Dave Chinner
@ 2020-04-25 17:23   ` Christoph Hellwig
  2020-04-27 11:11     ` Brian Foster
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:18PM -0400, Brian Foster wrote:
> We use the same buffer I/O failure simulation code in a few
> different places. It's not much code, but it's not necessarily
> self-explanatory. Factor it into a helper and document it in one
> place.

The code looks good, but the term simulation sounds rather strange in
this context.  We don't really simulate an I/O failure, but we fail the
buffer with -EIO due to a file system shutdown.

I'd also just keep the ORing of XBF_ASYNC into b_flags in the two
callers, as that keeps the function a little more "compact".

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling
  2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
  2020-04-23  4:18   ` Dave Chinner
@ 2020-04-25 17:26   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

I agree with Dave that the subject sounds rather odd.  The actual change
looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush()
  2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
@ 2020-04-25 17:27   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:20PM -0400, Brian Foster wrote:
> The shutdown check in xfs_iflush() duplicates checks down in the
> buffer code. If the fs is shut down, xfs_trans_read_buf_map() always
> returns an error and falls into the same error path. Remove the
> unnecessary check along with the warning in xfs_imap_to_bp()
> that generates excessive noise in the log if the fs is shut down.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush()
  2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
  2020-04-23  4:47   ` Dave Chinner
@ 2020-04-25 17:28   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:22PM -0400, Brian Foster wrote:
> The pre-flush dquot verification in xfs_qm_dqflush() duplicates the
> read verifier by checking the dquot in the on-disk buffer. Instead,
> verify the in-core variant before it is flushed to the buffer.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 07/13] xfs: abort consistently on dquot flush failure
  2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
@ 2020-04-25 17:30   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:23PM -0400, Brian Foster wrote:
> The dquot flush handler effectively aborts the dquot flush if the
> filesystem is already shut down, but doesn't actually shut down if
> the flush fails. Update xfs_qm_dqflush() to consistently abort the
> dquot flush and shutdown the fs if the flush fails with an
> unexpected error.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 09/13] xfs: clean up AIL log item removal functions
  2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
  2020-04-23  4:54   ` Dave Chinner
@ 2020-04-25 17:37   ` Christoph Hellwig
  2020-04-27 11:12     ` Brian Foster
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:25PM -0400, Brian Foster wrote:
> Make the following changes to clean up both of these functions:
> 
> - Most callers of xfs_trans_ail_delete() acquire the AIL lock just
>   before the call. Update _delete() to acquire the lock and open
>   code the couple of callers that make additional checks under AIL
>   lock.
> - Drop the unnecessary ailp parameter from _delete().
> - Drop the unused shutdown parameter from _remove() and open code
>   the implementation.
> 
> In summary, this leaves a _delete() variant that expects an AIL
> resident item and a _remove() helper that checks the AIL bit. Audit
> the existing callsites for use of the appropriate function and
> update as necessary.

Wouldn't it make sense to split this into separate patches for the
items above?  I have a bit of a hard time verifying this patch, even
if the goal sounds just right.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 11/13] xfs: remove unused iflush stale parameter
  2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
@ 2020-04-25 17:37   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:27PM -0400, Brian Foster wrote:
> The stale parameter was used to control the now unused shutdown
> parameter of xfs_trans_ail_remove().
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Allison Collins <allison.henderson@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 12/13] xfs: random buffer write failure errortag
  2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
  2020-04-23  5:11   ` Dave Chinner
@ 2020-04-25 17:38   ` Christoph Hellwig
  2020-04-27 11:12     ` Brian Foster
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:28PM -0400, Brian Foster wrote:
> @@ -1289,6 +1289,12 @@ xfs_buf_bio_end_io(
>  	struct bio		*bio)
>  {
>  	struct xfs_buf		*bp = (struct xfs_buf *)bio->bi_private;
> +	struct xfs_mount	*mp = bp->b_mount;
> +
> +	if (!bio->bi_status &&
> +	    (bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) &&
> +	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BUF_IOERROR))
> +		bio->bi_status = errno_to_blk_status(-EIO);

Please just use BLK_STS_IOERR directly here.
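
i.e. the check above would simply become:

	if (!bio->bi_status &&
	    (bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) &&
	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BUF_IOERROR))
		bio->bi_status = BLK_STS_IOERR;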

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 13/13] xfs: remove unused shutdown types
  2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
  2020-04-23  5:13   ` Dave Chinner
@ 2020-04-25 17:39   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2020-04-25 17:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 22, 2020 at 01:54:29PM -0400, Brian Foster wrote:
> Both types control shutdown messaging and neither is used in the
> current codebase.

Hmm, I'm pretty sure I submitted the same patch a few weeks ago :)

Because of that it obviously looks fine, of course:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
  2020-04-24 22:08           ` Dave Chinner
@ 2020-04-27 11:11             ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-27 11:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 25, 2020 at 08:08:23AM +1000, Dave Chinner wrote:
> On Fri, Apr 24, 2020 at 07:12:32AM -0400, Brian Foster wrote:
> > On Fri, Apr 24, 2020 at 07:14:37AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> > > > On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > > > > At unmount time, XFS emits a warning for every in-core buffer that
> > > > > > might have undergone a write error. In practice this behavior is
> > > > > > probably reasonable given that the filesystem is likely short lived
> > > > > > once I/O errors begin to occur consistently. Under certain test or
> > > > > > otherwise expected error conditions, this can spam the logs and slow
> > > > > > down the unmount.
> > > > > > 
> > > > > > We already have a ratelimit state defined for buffers failing
> > > > > > writeback. Fold this state into the buftarg and reuse it for the
> > > > > > unmount time errors.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > 
> > > > > Looks fine, but I suspect we both missed something here:
> > > > > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > > > > cycle:
> > > > > 
> > > > > void
> > > > > xfs_buf_ioerror_alert(
> > > > >         struct xfs_buf          *bp,
> > > > >         xfs_failaddr_t          func)
> > > > > {
> > > > >         xfs_alert_ratelimited(bp->b_mount,
> > > > > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> > > > >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> > > > >                         -bp->b_error);
> > > > > }
> > > > > 
> > > > 
> > > > Yeah, I hadn't noticed that.
> > > > 
> > > > > Hence I think all these buffer error alerts can be brought under the
> > > > > same rate limiting variable. Something like this in xfs_message.c:
> > > > > 
> > > > 
> > > > One thing to note is that xfs_alert_ratelimited() ultimately uses
> > > > the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> > > > here uses 30s (both use a burst of 10). That seems reasonable enough to
> > > > me for I/O errors so I'm good with the changes below.
> > > > 
> > > > FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> > > > from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> > > > of a "warn." I'm not immediately aware of a reason to use one over the
> > > > other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> > > > hear an objection.
> > > 
> > > Sounds fine to me.
> > > 
> > > > The xfs_wait_buftarg() ratelimit presumably remains
> > > > open coded because it's two separate calls and we probably don't want
> > > > them to individually count against the limit.
> > > 
> > > That's why I suggested dropping the second "run xfs_repair" message
> > > and triggering a shutdown after the wait loop. That way we don't
> > > issue "run xfs_repair" for every single failed buffer (largely
> > > noise!), and we get a non-rate-limited common "run xfs-repair"
> > > message once we processed all the failed writes.
> > > 
> > 
> > Sorry, must have missed that in the last reply. I don't think we want to
> > shut down here because XBF_WRITE_FAIL only reflects that the internal
> > async write retry (e.g. the one historically used to mitigate transient
> > I/O errors) has occurred on the buffer, not necessarily that the
> > immediately previous I/O has failed.
> 
> I think this is an incorrect reading of how XBF_WRITE_FAIL
> functions. XBF_WRITE_FAIL is used to indicate that the previous
> write failed, not that a historic write failed. The flag is cleared
> at buffer submission time - see xfs_buf_delwri_submit_buffers() and
> xfs_bwrite() - and so it is only set on buffers whose previous IO
> failed and hence are still dirty and have not been flushed back to
> disk.
> 

Right, but the flag is only ever cleared in those particular submission
paths (which don't include the resubmit path). The point being
overlooked here is that it is never cleared on successful I/O
completion. That means if an I/O fails, we set the flag and resubmit. If
the resubmitted I/O succeeds, the flag remains set on the buffer until
the next time it is written, but there's no guarantee the buffer is
dirtied again before we unmount. Therefore, it's possible to have
XBF_WRITE_FAIL set on clean/consistent buffers at unmount time without
any underlying inconsistency.

> If we hit this in xfs_buftarg_wait() after we've pushed the AIL in
> xfs_log_quiesce() on unmount, then we've got write failures that
> could not be resolved by repeated retries, and the filesystem is, at
> this instant in time, inconsistent on disk.
> 
> That's a shutdown error condition...
> 

Yes, but XBF_WRITE_FAIL does not reflect that condition with certainty.

> > For that reason I've kind of looked
> > at this particular instance as more of a warning that I/O errors have
> > occurred in the past and the user might want to verify it didn't result
> > in unexpected damage. FWIW, I've observed plenty of these messages long
> > after I've disabled error injection and allowed I/O to proceed and the
> > fs to unmount cleanly.
> 
> Right. That's the whole point of the flag - the buffer has been
> dirtied and it hasn't been able to be written back when we are
> purging the buffer cache at the end of unmount. i.e. When
> xfs_buftarg_wait() is called, all buffers should be clean because we
> are about to write an unmount record to mark the log clean once all
> the logged metadata is written back.
> 
> What we do right now - write an unmount record after failing metadata
> writeback - is actually a bug, and that is one of the
> reasons why I suggested a shutdown should be done. i.e. we should
> not be writing an unmount record to mark the log clean if we failed
> to write back metadata. That metadata is still valid in the journal,
> and so it should remain valid in the journal to allow it to be
> replayed on the next mount. i.e. retry the writes from log recovery
> after the hardware failure has been addressed and the IO errors have
> gone away.
> 
> Tossing the dirty buffers at unmount and then marking the journal
> clean is actually making the buffer write failure -worse-. Doing this
> guarantees the filesystem is inconsistent on disk (by updating the
> journal to indicate those writes actually succeeded) and absolutely
> requires xfs_repair to fix as a result.
> 
> If we shut down on XBF_WRITE_FAIL buffers in xfs_buftarg_wait(), we
> will not write an unmount record and so give the filesystem a chance
> to recover on next mount (e.g. after a power cycle to clear whatever
> raid hardware bug was being hit) and write that dirty metadata back
> without failure.  If recovery fails with IO errors, then the user
> really does need to run repair.  However, the situation at this
> point is still better than if we write a clean unmount record after
> write failures...
> 

I agree with all of this when we are actually tossing dirty buffers. ;)
As noted above, though, I'm referring to how this state can persist
after error injection is disabled and I/O errors have ceased.
error handling code already shuts down the filesystem in the completion
handling path of any buffer undergoing retries at unmount time (see
XFS_MOUNT_UNMOUNTING). IIRC this was introduced to prevent hanging
unmount on fs' configured with infinite retries, but it ensures the fs
is shut down if dirty buffers have failed to write back before we get to
xfs_wait_buftarg(). So there isn't a functional bug from what I can see
if XBF_WRITE_FAIL does happen to reflect a failed+dirty buffer in
xfs_wait_buftarg().

It's pretty clear that the implementation of XBF_WRITE_FAIL is
unnecessarily fragile and confusing. Aside from not being able to
establish the meaning of the xfs_wait_buftarg() warning message between
the two of us (good luck, users ;P), I'm guessing it's also not obvious
that there's an internal retry associated with each xfsaild submitted
write, even for buffers that persistently fail. For example, xfsaild
submits a buffer, I/O fails, we set XBF_WRITE_FAIL and resubmit, I/O
fails again, XBF_WRITE_FAIL is already set and retries are configured so
we unlock the buffer, xfsaild comes around again and requeues, delwri
submit clears XBF_WRITE_FAIL and we start over from the beginning.

Perhaps a better solution would be to clear XBF_WRITE_FAIL on successful
I/O completion. That way we only issue the internal retry once since the
last time a buffer was successfully written and xfs_wait_buftarg() can
legitimately warn about tossing dirty buffers and shut down. I do think
it makes sense to shut down from there in that scenario as well,
provided the logic is correct, simply because the dependency on error
configuration shutdown behavior is not obvious. Thoughts?
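
A sketch of that idea (untested; placement in the buffer write
completion path, e.g. xfs_buf_ioend(), is an assumption):

	/*
	 * Clear the write failure state once a write actually completes
	 * successfully, so XBF_WRITE_FAIL only ever reflects the most
	 * recent write attempt.
	 */
	if ((bp->b_flags & XBF_WRITE) && !bp->b_error)
		bp->b_flags &= ~XBF_WRITE_FAIL;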

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code
  2020-04-25 17:23   ` Christoph Hellwig
@ 2020-04-27 11:11     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-27 11:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, Apr 25, 2020 at 10:23:39AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2020 at 01:54:18PM -0400, Brian Foster wrote:
> > We use the same buffer I/O failure simulation code in a few
> > different places. It's not much code, but it's not necessarily
> > self-explanatory. Factor it into a helper and document it in one
> > place.
> 
> The code looks good, but the term simulation sounds rather strange in
> this context.  We don't really simulate an I/O failure, but we fail the
> buffer with -EIO due to a file system shutdown.
> 

I was just using the terminology in the existing code. I've updated the
commit log to refer to it as "I/O failure code," but note that the
comments in the code still use the original terminology (unless you want
to suggest alternative wording..).

> I'd also just keep the ORing of XBF_ASYNC into b_flags in the two
> callers, as that keeps the function a little more "compact".
> 

Ok, fine by me.

Brian


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 09/13] xfs: clean up AIL log item removal functions
  2020-04-25 17:37   ` Christoph Hellwig
@ 2020-04-27 11:12     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-27 11:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, Apr 25, 2020 at 10:37:17AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2020 at 01:54:25PM -0400, Brian Foster wrote:
> > Make the following changes to clean up both of these functions:
> > 
> > - Most callers of xfs_trans_ail_delete() acquire the AIL lock just
> >   before the call. Update _delete() to acquire the lock and open
> >   code the couple of callers that make additional checks under AIL
> >   lock.
> > - Drop the unnecessary ailp parameter from _delete().
> > - Drop the unused shutdown parameter from _remove() and open code
> >   the implementation.
> > 
> > In summary, this leaves a _delete() variant that expects an AIL
> > resident item and a _remove() helper that checks the AIL bit. Audit
> > the existing callsites for use of the appropriate function and
> > update as necessary.
> 
> Wouldn't it make sense to split this into separate patches for the
> items above?  I have a bit of a hard time verifying this patch, even
> if the goal sounds just right.
> 

This ended up as one patch pretty much because I couldn't figure out
what the factoring steps needed to be until I came to the end result.
I'll revisit splitting it up..

Brian


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 12/13] xfs: random buffer write failure errortag
  2020-04-25 17:38   ` Christoph Hellwig
@ 2020-04-27 11:12     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2020-04-27 11:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, Apr 25, 2020 at 10:38:33AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2020 at 01:54:28PM -0400, Brian Foster wrote:
> > @@ -1289,6 +1289,12 @@ xfs_buf_bio_end_io(
> >  	struct bio		*bio)
> >  {
> >  	struct xfs_buf		*bp = (struct xfs_buf *)bio->bi_private;
> > +	struct xfs_mount	*mp = bp->b_mount;
> > +
> > +	if (!bio->bi_status &&
> > +	    (bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) &&
> > +	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BUF_IOERROR))
> > +		bio->bi_status = errno_to_blk_status(-EIO);
> 
> Please just use BLK_STS_IOERR directly here.
> 

Ok, fixed.

Brian


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2020-04-27 11:12 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
2020-04-23  4:09   ` Dave Chinner
2020-04-25 17:21   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
2020-04-23  4:10   ` Dave Chinner
2020-04-25 17:23   ` Christoph Hellwig
2020-04-27 11:11     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
2020-04-23  4:18   ` Dave Chinner
2020-04-23 14:28     ` Brian Foster
2020-04-25 17:26   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
2020-04-25 17:27   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
2020-04-23  4:46   ` Dave Chinner
2020-04-23 14:29     ` Brian Foster
2020-04-23 21:14       ` Dave Chinner
2020-04-24 11:12         ` Brian Foster
2020-04-24 22:08           ` Dave Chinner
2020-04-27 11:11             ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
2020-04-23  4:47   ` Dave Chinner
2020-04-25 17:28   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
2020-04-25 17:30   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
2020-04-23  5:59   ` Dave Chinner
2020-04-23 14:36     ` Brian Foster
2020-04-23 21:38       ` Dave Chinner
2020-04-24 11:14         ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
2020-04-23  4:54   ` Dave Chinner
2020-04-25 17:37   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
2020-04-23  4:55   ` Dave Chinner
2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
2020-04-25 17:37   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
2020-04-23  5:11   ` Dave Chinner
2020-04-25 17:38   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
2020-04-23  5:13   ` Dave Chinner
2020-04-25 17:39   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.