* [PATCH 0/2] xfs: fix buffer delwri queue state race
@ 2018-06-07 12:41 Brian Foster
  2018-06-07 12:41 ` [PATCH 1/2] xfs: define internal buf flag for delwri queue wait list Brian Foster
  2018-06-07 12:41 ` [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers Brian Foster
From: Brian Foster @ 2018-06-07 12:41 UTC (permalink / raw)
  To: linux-xfs

Hi all,

This is an attempt to fix a race that occurs when a buffer is queued
to a delwri queue while it still sits on a delwri queue wait list. The
problem is reproduced by xfs/305 under particular conditions; the
details are described in the commit log of patch 2.

This survives xfstests and repeated runs of xfs/305. Note that the test
still reproduces the known log deadlock problem when quotaoff runs
against a heavily loaded fs. I've hacked around that problem locally to
rule it out of the repeated xfs/305 testing.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (2):
  xfs: define internal buf flag for delwri queue wait list
  xfs: allow delwri requeue of wait listed buffers

 fs/xfs/xfs_buf.c | 79 +++++++++++++++++++++++++++++++-----------------
 fs/xfs/xfs_buf.h |  4 ++-
 2 files changed, 54 insertions(+), 29 deletions(-)

-- 
2.17.1



* [PATCH 1/2] xfs: define internal buf flag for delwri queue wait list
  2018-06-07 12:41 [PATCH 0/2] xfs: fix buffer delwri queue state race Brian Foster
@ 2018-06-07 12:41 ` Brian Foster
  2018-06-07 12:41 ` [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers Brian Foster
From: Brian Foster @ 2018-06-07 12:41 UTC (permalink / raw)
  To: linux-xfs

The delwri queue mechanism does not provide the ability to
distinguish between stale (or lazily removed) buffers in a delwri
queue and buffers that have been submitted from a delwri queue and
sit on a wait list. In both cases the buffer has the _XBF_DELWRI_Q
flag cleared and ->b_list still linked into an associated list.
This means that an attempt to queue a buffer cannot always tell
whether the buffer actually sits on a delwri queue when ->b_list is
non-empty and _XBF_DELWRI_Q is not already set.

Define a new flag for wait listed buffers to distinguish this case
and prepare to handle it properly. This patch defines, sets and
clears the _XBF_DELWRI_W flag at the appropriate places and
otherwise does not change behavior.
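
As an illustrative aside (not part of the patch), the states a delwri
queue attempt can observe once this flag exists look roughly like the
table below; the helper at the end is hypothetical and shown only to
make the disambiguation concrete:

/*
 * Sketch only: buffer list state as seen under the buffer lock.
 *
 *   _XBF_DELWRI_Q set,  ->b_list non-empty:  on a delwri queue
 *   _XBF_DELWRI_W set,  ->b_list non-empty:  on a submitter's wait list
 *   both flags clear,   ->b_list non-empty:  lazily removed (written
 *                                            synchronously or stale)
 *   both flags clear,   ->b_list empty:      not on any list
 */
static inline bool
xfs_buf_on_wait_list(		/* hypothetical helper, not in this patch */
	struct xfs_buf	*bp)
{
	return (bp->b_flags & _XBF_DELWRI_W) != 0;
}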

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c | 5 +++--
 fs/xfs/xfs_buf.h | 4 +++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 980bc48979e9..912cbbf44db7 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2033,6 +2033,7 @@ xfs_buf_delwri_submit_buffers(
 		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
 		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
 		if (wait_list) {
+			bp->b_flags |= _XBF_DELWRI_W;
 			xfs_buf_hold(bp);
 			list_move_tail(&bp->b_list, wait_list);
 		} else
@@ -2083,10 +2084,10 @@ xfs_buf_delwri_submit(
 	while (!list_empty(&wait_list)) {
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		list_del_init(&bp->b_list);
-
 		/* locking the buffer will wait for async IO completion. */
 		xfs_buf_lock(bp);
+		bp->b_flags &= ~_XBF_DELWRI_W;
+		list_del_init(&bp->b_list);
 		error2 = bp->b_error;
 		xfs_buf_relse(bp);
 		if (!error)
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d24dbd4dac39..07409b72dcad 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -50,7 +50,8 @@ typedef enum {
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
-#define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
+#define _XBF_DELWRI_W	 (1 << 23)/* buffer on a delwri wait list */
+#define _XBF_COMPOUND	 (1 << 24)/* compound buffer */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -71,6 +72,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
+	{ _XBF_DELWRI_W,	"DELWRI_W" }, \
 	{ _XBF_COMPOUND,	"COMPOUND" }
 
 
-- 
2.17.1



* [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
  2018-06-07 12:41 [PATCH 0/2] xfs: fix buffer delwri queue state race Brian Foster
  2018-06-07 12:41 ` [PATCH 1/2] xfs: define internal buf flag for delwri queue wait list Brian Foster
@ 2018-06-07 12:41 ` Brian Foster
  2018-06-07 23:27   ` Dave Chinner
From: Brian Foster @ 2018-06-07 12:41 UTC (permalink / raw)
  To: linux-xfs

If a delwri queue occurs of a buffer that was previously submitted
from a delwri queue but has not yet been removed from a wait list,
the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
This occurs, for example, if another thread beats the submitter
thread to the buffer lock after I/O completion. Once the submitter
acquires the lock, it removes the buffer from the wait list and
leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
This results in a lost buffer submission and in turn can result in
assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
filesystem lockups if the buffer happens to cover an item in the
AIL.

This problem has been reproduced by repeated iterations of xfs/305
on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
dquot reclaim races with an xfsaild push of a separate dquot backed
by the same buffer such that the buffer sits on the reclaim wait
list at the time xfsaild attempts to queue it. Since the latter
dquot has been flush locked but the underlying buffer not submitted
for I/O, the dquot pins the AIL and causes the filesystem to
livelock.

To address this problem, allow a delwri queue of a wait listed
buffer to steal the buffer from the wait list and add it to the
associated delwri queue. This is fundamentally safe because the
purpose of the wait list is to provide synchronous I/O. The buffer
lock of each wait listed buffer is cycled to ensure that I/O has
completed. If another thread acquires the buffer lock first, then
the I/O has completed and the submitter lock cycle is a formality.

The tradeoff to this approach is that the submitter loses the
ability to check error state of stolen buffers. This is technically
already possible as once the lock is released there is no guarantee
another thread has not interfered with the buffer error state by the
time the submitter reacquires the lock. Further, most critical error
handling occurs in the iodone callbacks in completion context of the
specific buffer since the delwri submitter has no idea which buffer
failed in the first place. Finally, the stolen buffer case should be
relatively rare and limited to races when under the highly parallel
and low memory conditions described above.
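
For illustration only (this timeline is not part of the patch), the race
and the requeue-based fix can be sketched as the following interleaving
between the delwri submitter and a second thread that wants to requeue
the same buffer:

/*
 * Sketch of the race, assuming the _XBF_DELWRI_W flag from patch 1:
 *
 * delwri submitter                      requeueing thread (e.g. xfsaild)
 * ----------------                      --------------------------------
 * xfs_buf_delwri_submit_buffers():
 *   clear _XBF_DELWRI_Q, set DELWRI_W,
 *   move bp to the wait list, submit
 *                                       I/O completes, bp is unlocked
 *                                       xfs_buf_lock(bp)
 *                                       xfs_buf_delwri_queue(bp, list):
 *                                         DELWRI_W is set, so steal bp off
 *                                         the wait list onto the new delwri
 *                                         queue and clear DELWRI_W
 *                                       xfs_buf_unlock(bp)
 * xfs_buf_lock(bp):
 *   DELWRI_W is clear, so the buffer
 *   was requeued; skip the wait list
 *   removal and reference drop
 *
 * Without the fix, the requeueing thread would only set _XBF_DELWRI_Q and
 * the submitter would then list_del_init() the buffer, leaving DELWRI_Q
 * set on a buffer that is on no list, i.e. the lost submission described
 * above.
 */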

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c | 78 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 912cbbf44db7..ca0520b173c6 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1917,6 +1917,8 @@ xfs_buf_delwri_queue(
 	struct xfs_buf		*bp,
 	struct list_head	*list)
 {
+	bool			wait = (bp->b_flags & _XBF_DELWRI_W);
+
 	ASSERT(xfs_buf_islocked(bp));
 	ASSERT(!(bp->b_flags & XBF_READ));
 
@@ -1926,23 +1928,37 @@ xfs_buf_delwri_queue(
 	 * case.
 	 */
 	if (bp->b_flags & _XBF_DELWRI_Q) {
+		ASSERT(!wait);
 		trace_xfs_buf_delwri_queued(bp, _RET_IP_);
 		return false;
 	}
 
 	trace_xfs_buf_delwri_queue(bp, _RET_IP_);
 
+	/*
+	 * If the buffer is on a wait list, the I/O has completed and the
+	 * current thread beat the waiter to the lock. Drop the buffer from the
+	 * wait list and requeue it. The waiter will either see that DELWRI_W is
+	 * cleared or skip waiting on the buffer. The wait list reference is
+	 * reused for the delwri queue.
+	 */
+	if (wait) {
+		bp->b_flags &= ~_XBF_DELWRI_W;
+		list_del_init(&bp->b_list);
+	}
+
 	/*
 	 * If a buffer gets written out synchronously or marked stale while it
 	 * is on a delwri list we lazily remove it. To do this, the other party
 	 * clears the  _XBF_DELWRI_Q flag but otherwise leaves the buffer alone.
-	 * It remains referenced and on the list.  In a rare corner case it
-	 * might get readded to a delwri list after the synchronous writeout, in
-	 * which case we need just need to re-add the flag here.
+	 * It remains referenced and on the list. In a rare corner case it might
+	 * get readded to a delwri list after the synchronous writeout, in which
+	 * case we need just need to re-add the flag here.
 	 */
 	bp->b_flags |= _XBF_DELWRI_Q;
 	if (list_empty(&bp->b_list)) {
-		atomic_inc(&bp->b_hold);
+		if (!wait)
+			atomic_inc(&bp->b_hold);
 		list_add_tail(&bp->b_list, list);
 	}
 
@@ -2082,14 +2098,27 @@ xfs_buf_delwri_submit(
 
 	/* Wait for IO to complete. */
 	while (!list_empty(&wait_list)) {
+		bool release = false;
+
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		/* locking the buffer will wait for async IO completion. */
+		/*
+		 * Lock the buffer to wait for async I/O completion. Note that
+		 * somebody else could have requeued the buffer in the meantime.
+		 * That means some buffers might drop from the wait list, but
+		 * that requires the buffer lock and thus means the I/O had
+		 * completed first.
+		 */
 		xfs_buf_lock(bp);
-		bp->b_flags &= ~_XBF_DELWRI_W;
-		list_del_init(&bp->b_list);
+		if (bp->b_flags & _XBF_DELWRI_W) {
+			bp->b_flags &= ~_XBF_DELWRI_W;
+			list_del_init(&bp->b_list);
+			release = true;
+		}
 		error2 = bp->b_error;
-		xfs_buf_relse(bp);
+		xfs_buf_unlock(bp);
+		if (release)
+			xfs_buf_rele(bp);
 		if (!error)
 			error = error2;
 	}
@@ -2120,38 +2149,31 @@ xfs_buf_delwri_pushbuf(
 	LIST_HEAD		(submit_list);
 	int			error;
 
-	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
-
 	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
 
 	/*
-	 * Isolate the buffer to a new local list so we can submit it for I/O
-	 * independently from the rest of the original list.
+	 * Make sure the buffer sits on a queue and isolate it to a new local
+	 * list so we can submit it independently from the original list.
 	 */
 	xfs_buf_lock(bp);
+	if (!(bp->b_flags & _XBF_DELWRI_Q) || bp->b_flags & _XBF_DELWRI_W) {
+		xfs_buf_unlock(bp);
+		return 0;
+	}
 	list_move(&bp->b_list, &submit_list);
 	xfs_buf_unlock(bp);
 
 	/*
-	 * Delwri submission clears the DELWRI_Q buffer flag and returns with
-	 * the buffer on the wait list with an associated reference. Rather than
-	 * bounce the buffer from a local wait list back to the original list
-	 * after I/O completion, reuse the original list as the wait list.
+	 * Delwri submission moves the buffer over to the wait list. Reuse the
+	 * original list so there's no need to declare a local wait list.
+	 *
+	 * The buffer most likely still resides on the wait list by the time the
+	 * lock is reacquired. Requeue it just in case somebody else beat us to
+	 * the lock and changed the state.
 	 */
 	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
-
-	/*
-	 * The buffer is now under I/O and wait listed as during typical delwri
-	 * submission. Lock the buffer to wait for I/O completion. Rather than
-	 * remove the buffer from the wait list and release the reference, we
-	 * want to return with the buffer queued to the original list. The
-	 * buffer already sits on the original list with a wait list reference,
-	 * however. If we let the queue inherit that wait list reference, all we
-	 * need to do is reset the DELWRI_Q flag.
-	 */
 	xfs_buf_lock(bp);
-	error = bp->b_error;
-	bp->b_flags |= _XBF_DELWRI_Q;
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_unlock(bp);
 
 	return error;
-- 
2.17.1



* Re: [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
  2018-06-07 12:41 ` [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers Brian Foster
@ 2018-06-07 23:27   ` Dave Chinner
  2018-06-08 12:07     ` Brian Foster
From: Dave Chinner @ 2018-06-07 23:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jun 07, 2018 at 08:41:25AM -0400, Brian Foster wrote:
> If a delwri queue occurs of a buffer that was previously submitted
> from a delwri queue but has not yet been removed from a wait list,
> the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
> This occurs, for example, if another thread beats the submitter
> thread to the buffer lock after I/O completion. Once the submitter
> acquires the lock, it removes the buffer from the wait list and
> leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
> This results in a lost buffer submission and in turn can result in
> assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
> filesystem lockups if the buffer happens to cover an item in the
> AIL.

I just so happened to have this ASSERT happen over night on
generic/232 testing some code I wrote yesterday. It never ceases to
amaze me how bugs that have been around for ages always seem to be
hit at the same time by different people in completely different
contexts....

> This problem has been reproduced by repeated iterations of xfs/305
> on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
> dquot reclaim races with an xfsaild push of a separate dquot backed
> by the same buffer such that the buffer sits on the reclaim wait
> list at the time xfsaild attempts to queue it. Since the latter
> dquot has been flush locked but the underlying buffer not submitted
> for I/O, the dquot pins the AIL and causes the filesystem to
> livelock.
> 
> To address this problem, allow a delwri queue of a wait listed
> buffer to steal the buffer from the wait list and add it to the
> associated delwri queue. This is fundamentally safe because the
> purpose of the wait list is to provide synchronous I/O. The buffer
> lock of each wait listed buffer is cycled to ensure that I/O has
> completed. If another thread acquires the buffer lock first, then
> the I/O has completed and the submitter lock cycle is a formality.
> 
> The tradeoff to this approach is that the submitter loses the
> ability to check error state of stolen buffers. This is technically
> already possible as once the lock is released there is no guarantee
> another thread has not interfered with the buffer error state by the
> time the submitter reacquires the lock. Further, most critical error
> handling occurs in the iodone callbacks in completion context of the
> specific buffer since the delwri submitter has no idea which buffer
> failed in the first place. Finally, the stolen buffer case should be
> relatively rare and limited to races when under the highly parallel
> and low memory conditions described above.

This seems all a bit broken.

The fundamental problem is that we are waiting on buffer locks for
completion, assuming that nobody else can get the lock before we do
to tell us that completion has occurred. IMO, it's the way we are
doing the bulk buffer IO submission and waiting that is broken, not
the way the delwri queues are handled.

i.e. we need to take ownership of the buffer lock across
xfs_buf_delwri_submit_buffers() and the wait loop in
xfs_buf_delwri_submit() because we assume that the delwri code is
the only context with access to the buffer while it is under IO. We
already have an IO waiting mechanism that does this - it's used by
xfs_buf_submit_wait().

So what I think we really need to do is split xfs_buf_submit_wait()
into two halves so we can separate the submission and waiting. To
save trying to explain it in great detail, I just wrote some
(untested!) code below that makes delwri submission hold the lock
itself until the IO has completed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: fix races waiting for delwri buffers

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>

---
 fs/xfs/xfs_buf.c | 147 +++++++++++++++++++++++++++----------------------------
 1 file changed, 71 insertions(+), 76 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a9678c2f3810..40f441e96dc5 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1453,27 +1453,20 @@ _xfs_buf_ioapply(
 }
 
 /*
- * Asynchronous IO submission path. This transfers the buffer lock ownership and
- * the current reference to the IO. It is not safe to reference the buffer after
- * a call to this function unless the caller holds an additional reference
- * itself.
+ * Internal I/O submission helpers for split submission and waiting actions.
  */
-void
-xfs_buf_submit(
+static int
+__xfs_buf_submit(
 	struct xfs_buf	*bp)
 {
-	trace_xfs_buf_submit(bp, _RET_IP_);
-
 	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
-	ASSERT(bp->b_flags & XBF_ASYNC);
 
 	/* on shutdown we stale and complete the buffer immediately */
 	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
 		xfs_buf_ioerror(bp, -EIO);
 		bp->b_flags &= ~XBF_DONE;
 		xfs_buf_stale(bp);
-		xfs_buf_ioend(bp);
-		return;
+		return -EIO;
 	}
 
 	if (bp->b_flags & XBF_WRITE)
@@ -1483,12 +1476,8 @@ xfs_buf_submit(
 	bp->b_io_error = 0;
 
 	/*
-	 * The caller's reference is released during I/O completion.
-	 * This occurs some time after the last b_io_remaining reference is
-	 * released, so after we drop our Io reference we have to have some
-	 * other reference to ensure the buffer doesn't go away from underneath
-	 * us. Take a direct reference to ensure we have safe access to the
-	 * buffer until we are finished with it.
+	 * I/O needs a reference, because the caller may go away before we are
+	 * done with the IO. Completion will deal with it.
 	 */
 	xfs_buf_hold(bp);
 
@@ -1498,21 +1487,66 @@ xfs_buf_submit(
 	 * xfs_buf_ioend too early.
 	 */
 	atomic_set(&bp->b_io_remaining, 1);
-	xfs_buf_ioacct_inc(bp);
+	if (bp->b_flags & XBF_ASYNC)
+		xfs_buf_ioacct_inc(bp);
 	_xfs_buf_ioapply(bp);
+	return 0;
+}
+
+static int
+__xfs_buf_iowait(
+	struct xfs_buf	*bp)
+{
+	int		error;
+
+	/*
+	 * make sure we run completion synchronously if it raced with us and is
+	 * already complete.
+	 */
+	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+		xfs_buf_ioend(bp);
+
+	/* wait for completion before gathering the error from the buffer */
+	trace_xfs_buf_iowait(bp, _RET_IP_);
+	wait_for_completion(&bp->b_iowait);
+	trace_xfs_buf_iowait_done(bp, _RET_IP_);
+	error = bp->b_error;
+
+	/*
+	 * all done now, we can release the hold that keeps the buffer
+	 * referenced for the entire IO.
+	 */
+	xfs_buf_rele(bp);
+	return error;
+}
+
+/*
+ * Asynchronous IO submission path. This transfers the buffer lock ownership and
+ * the current reference to the IO. It is not safe to reference the buffer after
+ * a call to this function unless the caller holds an additional reference
+ * itself.
+ */
+void
+xfs_buf_submit(
+	struct xfs_buf	*bp)
+{
+	int		error;
+
+	trace_xfs_buf_submit(bp, _RET_IP_);
+
+	error = __xfs_buf_submit(bp);
 
 	/*
 	 * If _xfs_buf_ioapply failed, we can get back here with only the IO
 	 * reference we took above. If we drop it to zero, run completion so
 	 * that we don't return to the caller with completion still pending.
 	 */
-	if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+	if (error || atomic_dec_and_test(&bp->b_io_remaining) == 1) {
 		if (bp->b_error)
 			xfs_buf_ioend(bp);
 		else
 			xfs_buf_ioend_async(bp);
 	}
-
 	xfs_buf_rele(bp);
 	/* Note: it is not safe to reference bp now we've dropped our ref */
 }
@@ -1527,57 +1561,13 @@ xfs_buf_submit_wait(
 	int		error;
 
 	trace_xfs_buf_submit_wait(bp, _RET_IP_);
+	ASSERT(!(bp->b_flags & XBF_ASYNC));
 
-	ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
-
-	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
-		xfs_buf_ioerror(bp, -EIO);
-		xfs_buf_stale(bp);
-		bp->b_flags &= ~XBF_DONE;
-		return -EIO;
-	}
-
-	if (bp->b_flags & XBF_WRITE)
-		xfs_buf_wait_unpin(bp);
-
-	/* clear the internal error state to avoid spurious errors */
-	bp->b_io_error = 0;
-
-	/*
-	 * For synchronous IO, the IO does not inherit the submitters reference
-	 * count, nor the buffer lock. Hence we cannot release the reference we
-	 * are about to take until we've waited for all IO completion to occur,
-	 * including any xfs_buf_ioend_async() work that may be pending.
-	 */
-	xfs_buf_hold(bp);
-
-	/*
-	 * Set the count to 1 initially, this will stop an I/O completion
-	 * callout which happens before we have started all the I/O from calling
-	 * xfs_buf_ioend too early.
-	 */
-	atomic_set(&bp->b_io_remaining, 1);
-	_xfs_buf_ioapply(bp);
-
-	/*
-	 * make sure we run completion synchronously if it raced with us and is
-	 * already complete.
-	 */
-	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
-		xfs_buf_ioend(bp);
-
-	/* wait for completion before gathering the error from the buffer */
-	trace_xfs_buf_iowait(bp, _RET_IP_);
-	wait_for_completion(&bp->b_iowait);
-	trace_xfs_buf_iowait_done(bp, _RET_IP_);
-	error = bp->b_error;
+	error =  __xfs_buf_submit(bp);
+	if (error)
+		return error;
 
-	/*
-	 * all done now, we can release the hold that keeps the buffer
-	 * referenced for the entire IO.
-	 */
-	xfs_buf_rele(bp);
-	return error;
+	return  __xfs_buf_iowait(bp);
 }
 
 void *
@@ -2045,14 +2035,21 @@ xfs_buf_delwri_submit_buffers(
 		 * at this point so the caller can still access it.
 		 */
 		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
-		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
+		bp->b_flags |= XBF_WRITE;
 		if (wait_list) {
+			/*
+			 * Split synchronous IO - we wait later, so we need ai
+			 * reference until we run completion processing and drop
+			 * the buffer lock ourselves
+			 */
 			xfs_buf_hold(bp);
 			list_move_tail(&bp->b_list, wait_list);
-		} else
+			__xfs_buf_submit(bp);
+		} else {
 			list_del_init(&bp->b_list);
-
-		xfs_buf_submit(bp);
+			bp->b_flags |= XBF_ASYNC;
+			xfs_buf_submit(bp);
+		}
 	}
 	blk_finish_plug(&plug);
 
@@ -2099,9 +2096,7 @@ xfs_buf_delwri_submit(
 
 		list_del_init(&bp->b_list);
 
-		/* locking the buffer will wait for async IO completion. */
-		xfs_buf_lock(bp);
-		error2 = bp->b_error;
+		error2 = __xfs_buf_iowait(bp);
 		xfs_buf_relse(bp);
 		if (!error)
 			error = error2;


* Re: [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
  2018-06-07 23:27   ` Dave Chinner
@ 2018-06-08 12:07     ` Brian Foster
  2018-06-08 22:49       ` Dave Chinner
From: Brian Foster @ 2018-06-08 12:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jun 08, 2018 at 09:27:13AM +1000, Dave Chinner wrote:
> On Thu, Jun 07, 2018 at 08:41:25AM -0400, Brian Foster wrote:
> > If a delwri queue occurs of a buffer that was previously submitted
> > from a delwri queue but has not yet been removed from a wait list,
> > the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
> > This occurs, for example, if another thread beats the submitter
> > thread to the buffer lock after I/O completion. Once the submitter
> > acquires the lock, it removes the buffer from the wait list and
> > leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
> > This results in a lost buffer submission and in turn can result in
> > assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
> > filesystem lockups if the buffer happens to cover an item in the
> > AIL.
> 
> I just so happened to have this ASSERT happen over night on
> generic/232 testing some code I wrote yesterday. It never ceases to
> amaze me how bugs that have been around for ages always seem to be
> hit at the same time by different people in completely different
> contexts....
> 

Interesting, out of curiosity was this in a memory limited environment?

> > This problem has been reproduced by repeated iterations of xfs/305
> > on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
> > dquot reclaim races with an xfsaild push of a separate dquot backed
> > by the same buffer such that the buffer sits on the reclaim wait
> > list at the time xfsaild attempts to queue it. Since the latter
> > dquot has been flush locked but the underlying buffer not submitted
> > for I/O, the dquot pins the AIL and causes the filesystem to
> > livelock.
> > 
> > To address this problem, allow a delwri queue of a wait listed
> > buffer to steal the buffer from the wait list and add it to the
> > associated delwri queue. This is fundamentally safe because the
> > purpose of the wait list is to provide synchronous I/O. The buffer
> > lock of each wait listed buffer is cycled to ensure that I/O has
> > completed. If another thread acquires the buffer lock first, then
> > the I/O has completed and the submitter lock cycle is a formality.
> > 
> > The tradeoff to this approach is that the submitter loses the
> > ability to check error state of stolen buffers. This is technically
> > already possible as once the lock is released there is no guarantee
> > another thread has not interfered with the buffer error state by the
> > time the submitter reacquires the lock. Further, most critical error
> > handling occurs in the iodone callbacks in completion context of the
> > specific buffer since the delwri submitter has no idea which buffer
> > failed in the first place. Finally, the stolen buffer case should be
> > relatively rare and limited to races when under the highly parallel
> > and low memory conditions described above.
> 
> This seems all a bit broken.
> 

Yes, the premise of this was to do something that didn't break it
further. ;) I figured using sync I/O would also address the problem, but
would introduce terrible submission->completion serialization...

> The fundamental problem is that we are waiting on buffer locks for
> completion, assuming that nobody else can get the lock before we do
> to tell us that completion has occurred. IMO, it's the way we are
> doing the bulk buffer IO submission and waiting that is broken, not
> the way the delwri queues are handled.
> 
> i.e. we need to take ownership of the buffer lock across
> xfs_buf_delwri_submit_buffers() and the wait loop in
> xfs_buf_delwri_submit() because we assume that the delwri code is
> the only context with access to the buffer while it is under IO. We
> already have an IO waiting mechanism that does this - it's used by
> xfs_buf_submit_wait().
> 
> So what I think we really need to do is split xfs_buf_submit_wait()
> into two halves so we can separate the submission and waiting. To
> save trying to explain it in great detail, I just wrote some
> (untested!) code below that makes delwri submission hold the lock
> itself until the IO has completed.
> 

So essentially take apart sync buffer I/O so we can do batched
submission/completion. That sounds like a nice idea to me.

Feedback on the code to follow. That aside, are you planning to turn
this into a real patch submission or would you like me to do it?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: fix races waiting for delwri buffers
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> ---
>  fs/xfs/xfs_buf.c | 147 +++++++++++++++++++++++++++----------------------------
>  1 file changed, 71 insertions(+), 76 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index a9678c2f3810..40f441e96dc5 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1453,27 +1453,20 @@ _xfs_buf_ioapply(
>  }
>  
>  /*
> - * Asynchronous IO submission path. This transfers the buffer lock ownership and
> - * the current reference to the IO. It is not safe to reference the buffer after
> - * a call to this function unless the caller holds an additional reference
> - * itself.
> + * Internal I/O submission helpers for split submission and waiting actions.
>   */
> -void
> -xfs_buf_submit(
> +static int
> +__xfs_buf_submit(

It looks like the buffer submission refactoring could be a separate
patch from the delwri queue race fix.

>  	struct xfs_buf	*bp)
>  {
> -	trace_xfs_buf_submit(bp, _RET_IP_);
> -
>  	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> -	ASSERT(bp->b_flags & XBF_ASYNC);
>  
>  	/* on shutdown we stale and complete the buffer immediately */
>  	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
>  		xfs_buf_ioerror(bp, -EIO);
>  		bp->b_flags &= ~XBF_DONE;
>  		xfs_buf_stale(bp);
> -		xfs_buf_ioend(bp);
> -		return;
> +		return -EIO;
>  	}
>  
>  	if (bp->b_flags & XBF_WRITE)
> @@ -1483,12 +1476,8 @@ xfs_buf_submit(
>  	bp->b_io_error = 0;
>  
>  	/*
> -	 * The caller's reference is released during I/O completion.
> -	 * This occurs some time after the last b_io_remaining reference is
> -	 * released, so after we drop our Io reference we have to have some
> -	 * other reference to ensure the buffer doesn't go away from underneath
> -	 * us. Take a direct reference to ensure we have safe access to the
> -	 * buffer until we are finished with it.
> +	 * I/O needs a reference, because the caller may go away before we are
> +	 * done with the IO. Completion will deal with it.
>  	 */
>  	xfs_buf_hold(bp);

I think this should be lifted in the callers. For one, it's confusing to
follow. Second, it looks like xfs_buf_submit() unconditionally drops a
reference whereas __xfs_buf_submit() may not acquire one (i.e. when we
return an error).

ISTM that the buffer reference calls could be balanced in the top-level
submit functions rather than split between the common submission path
and unique sync completion path.
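
As a rough sketch of that suggestion (illustrative only, untested, and
assuming the split helpers neither take nor drop the I/O hold
themselves), the hold/rele pair could live entirely in the top-level
function:

int
xfs_buf_submit_wait(
	struct xfs_buf	*bp)
{
	int		error;

	trace_xfs_buf_submit_wait(bp, _RET_IP_);

	/*
	 * One reference covers the whole submit + wait window and is
	 * dropped here, rather than being split between the common
	 * submission path and the sync completion path.
	 */
	xfs_buf_hold(bp);
	error = __xfs_buf_submit(bp);
	if (!error)
		error = __xfs_buf_iowait(bp);
	xfs_buf_rele(bp);
	return error;
}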

>  
> @@ -1498,21 +1487,66 @@ xfs_buf_submit(
>  	 * xfs_buf_ioend too early.
>  	 */
>  	atomic_set(&bp->b_io_remaining, 1);
> -	xfs_buf_ioacct_inc(bp);
> +	if (bp->b_flags & XBF_ASYNC)
> +		xfs_buf_ioacct_inc(bp);
>  	_xfs_buf_ioapply(bp);
> +	return 0;
> +}
> +
> +static int
> +__xfs_buf_iowait(
> +	struct xfs_buf	*bp)
> +{
> +	int		error;
> +
> +	/*
> +	 * make sure we run completion synchronously if it raced with us and is
> +	 * already complete.
> +	 */
> +	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
> +		xfs_buf_ioend(bp);
> +
> +	/* wait for completion before gathering the error from the buffer */
> +	trace_xfs_buf_iowait(bp, _RET_IP_);
> +	wait_for_completion(&bp->b_iowait);
> +	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> +	error = bp->b_error;
> +
> +	/*
> +	 * all done now, we can release the hold that keeps the buffer
> +	 * referenced for the entire IO.
> +	 */
> +	xfs_buf_rele(bp);
> +	return error;
> +}
> +
> +/*
> + * Asynchronous IO submission path. This transfers the buffer lock ownership and
> + * the current reference to the IO. It is not safe to reference the buffer after
> + * a call to this function unless the caller holds an additional reference
> + * itself.
> + */
> +void
> +xfs_buf_submit(
> +	struct xfs_buf	*bp)
> +{
> +	int		error;
> +
> +	trace_xfs_buf_submit(bp, _RET_IP_);
> +
> +	error = __xfs_buf_submit(bp);
>  
>  	/*
>  	 * If _xfs_buf_ioapply failed, we can get back here with only the IO
>  	 * reference we took above. If we drop it to zero, run completion so
>  	 * that we don't return to the caller with completion still pending.
>  	 */
> -	if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
> +	if (error || atomic_dec_and_test(&bp->b_io_remaining) == 1) {
>  		if (bp->b_error)
>  			xfs_buf_ioend(bp);
>  		else
>  			xfs_buf_ioend_async(bp);
>  	}
> -
>  	xfs_buf_rele(bp);
>  	/* Note: it is not safe to reference bp now we've dropped our ref */
>  }
> @@ -1527,57 +1561,13 @@ xfs_buf_submit_wait(
>  	int		error;
>  
>  	trace_xfs_buf_submit_wait(bp, _RET_IP_);
> +	ASSERT(!(bp->b_flags & XBF_ASYNC));
>  
> -	ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
> -
> -	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
> -		xfs_buf_ioerror(bp, -EIO);
> -		xfs_buf_stale(bp);
> -		bp->b_flags &= ~XBF_DONE;
> -		return -EIO;
> -	}
> -
> -	if (bp->b_flags & XBF_WRITE)
> -		xfs_buf_wait_unpin(bp);
> -
> -	/* clear the internal error state to avoid spurious errors */
> -	bp->b_io_error = 0;
> -
> -	/*
> -	 * For synchronous IO, the IO does not inherit the submitters reference
> -	 * count, nor the buffer lock. Hence we cannot release the reference we
> -	 * are about to take until we've waited for all IO completion to occur,
> -	 * including any xfs_buf_ioend_async() work that may be pending.
> -	 */
> -	xfs_buf_hold(bp);
> -
> -	/*
> -	 * Set the count to 1 initially, this will stop an I/O completion
> -	 * callout which happens before we have started all the I/O from calling
> -	 * xfs_buf_ioend too early.
> -	 */
> -	atomic_set(&bp->b_io_remaining, 1);
> -	_xfs_buf_ioapply(bp);
> -
> -	/*
> -	 * make sure we run completion synchronously if it raced with us and is
> -	 * already complete.
> -	 */
> -	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
> -		xfs_buf_ioend(bp);
> -
> -	/* wait for completion before gathering the error from the buffer */
> -	trace_xfs_buf_iowait(bp, _RET_IP_);
> -	wait_for_completion(&bp->b_iowait);
> -	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> -	error = bp->b_error;
> +	error =  __xfs_buf_submit(bp);
> +	if (error)
> +		return error;
>  
> -	/*
> -	 * all done now, we can release the hold that keeps the buffer
> -	 * referenced for the entire IO.
> -	 */
> -	xfs_buf_rele(bp);
> -	return error;
> +	return  __xfs_buf_iowait(bp);
>  }
>  
>  void *
> @@ -2045,14 +2035,21 @@ xfs_buf_delwri_submit_buffers(
>  		 * at this point so the caller can still access it.
>  		 */
>  		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
> -		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> +		bp->b_flags |= XBF_WRITE;

We set XBF_ASYNC below in the specific case, but this doesn't tell us
anything about whether it might have already been set on the buffer. Is
it not the responsibility of this function to set/clear XBF_ASYNC
appropriately?

>  		if (wait_list) {
> +			/*
> +			 * Split synchronous IO - we wait later, so we need ai
> +			 * reference until we run completion processing and drop
> +			 * the buffer lock ourselves
> +			 */

Might as well merge this with the comment above, which needs fixing
anyways since we no longer "do all IO submission async."

>  			xfs_buf_hold(bp);

I think the buffer reference counting is now broken here. We currently
transfer the existing hold (when the buffer was queued) to the async
buffer submission. The wait list case acquires the new hold above and
drops it after cycling the buffer lock and dropping it from the wait
list. Async I/O completion will have dropped the queue hold so when the
whole thing returns the buffer is essentially free.

The async/nowait case still looks Ok. The sync I/O case continues to
grab the wait list reference above, but now sends the buffer through the
sync submission path which will not release the original hold acquired
for the queue upon I/O completion. Unless I'm missing something, it
looks to me that we now return with an elevated hold count. Instead, I
think we should "transfer" the pre-existing queue hold to the wait list
(and document this mess in the comment).

>  			list_move_tail(&bp->b_list, wait_list);
> -		} else
> +			__xfs_buf_submit(bp);

I suspect we need to handle submission errors here, otherwise we wait on
a buffer that was never submitted.

One final thought.. ISTM that the nature of batched sync buffer
submission means that once we wait on one or two of those buffers,
there's a good chance many of the remaining buffer physical I/Os will
have completed by the time we get to the associated iowait. That means
that the current behavior of large (sync) delwri buffer completions
running in the async completion workqueue most likely changes to running
from __xfs_buf_iowait()->xfs_buf_ioend() context. I'm not sure that
really matters, but just something to note. It does make me wonder
whether the extra b_io_remaining submission reference could/should be
decremented in the submission path for both I/O types (which also seems
cleaner from a code perspective).

Brian

> +		} else {
>  			list_del_init(&bp->b_list);
> -
> -		xfs_buf_submit(bp);
> +			bp->b_flags |= XBF_ASYNC;
> +			xfs_buf_submit(bp);
> +		}
>  	}
>  	blk_finish_plug(&plug);
>  
> @@ -2099,9 +2096,7 @@ xfs_buf_delwri_submit(
>  
>  		list_del_init(&bp->b_list);
>  
> -		/* locking the buffer will wait for async IO completion. */
> -		xfs_buf_lock(bp);
> -		error2 = bp->b_error;
> +		error2 = __xfs_buf_iowait(bp);
>  		xfs_buf_relse(bp);
>  		if (!error)
>  			error = error2;

* Re: [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
  2018-06-08 12:07     ` Brian Foster
@ 2018-06-08 22:49       ` Dave Chinner
  2018-06-09 11:26         ` Brian Foster
From: Dave Chinner @ 2018-06-08 22:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jun 08, 2018 at 08:07:09AM -0400, Brian Foster wrote:
> On Fri, Jun 08, 2018 at 09:27:13AM +1000, Dave Chinner wrote:
> > On Thu, Jun 07, 2018 at 08:41:25AM -0400, Brian Foster wrote:
> > > If a delwri queue occurs of a buffer that was previously submitted
> > > from a delwri queue but has not yet been removed from a wait list,
> > > the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
> > > This occurs, for example, if another thread beats the submitter
> > > thread to the buffer lock after I/O completion. Once the submitter
> > > acquires the lock, it removes the buffer from the wait list and
> > > leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
> > > This results in a lost buffer submission and in turn can result in
> > > assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
> > > filesystem lockups if the buffer happens to cover an item in the
> > > AIL.
> > 
> > I just so happened to have this ASSERT happen over night on
> > generic/232 testing some code I wrote yesterday. It never ceases to
> > amaze me how bugs that have been around for ages always seem to be
> > hit at the same time by different people in completely different
> > contexts....
> > 
> 
> Interesting, out of curiosity was this in a memory limited environment?

No. had GBs of free RAM, but it's on a fakenuma setup so a node
could have been low on free memory and hence running the quota
shrinker....

....
> > This seems all a bit broken.
> 
> Yes, the premise of this was to do something that didn't break it
> further. ;) I figured using sync I/O would also address the problem, but
> would introduce terrible submission->completion serialization...

*nod*

> So essentially take apart sync buffer I/O so we can do batched
> submission/completion. That sounds like a nice idea to me.

*nod*

> Feedback on the code to follow. That aside, are you planning to turn
> this into a real patch submission or would you like me to do it?

I've got plenty of other things to do - if you want to take it and
run, I'm fine with that :P

> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index a9678c2f3810..40f441e96dc5 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1453,27 +1453,20 @@ _xfs_buf_ioapply(
> >  }
> >  
> >  /*
> > - * Asynchronous IO submission path. This transfers the buffer lock ownership and
> > - * the current reference to the IO. It is not safe to reference the buffer after
> > - * a call to this function unless the caller holds an additional reference
> > - * itself.
> > + * Internal I/O submission helpers for split submission and waiting actions.
> >   */
> > -void
> > -xfs_buf_submit(
> > +static int
> > +__xfs_buf_submit(
> 
> It looks like the buffer submission refactoring could be a separate
> patch from the delwri queue race fix.

Yup, it probably should be.

> > @@ -1483,12 +1476,8 @@ xfs_buf_submit(
> >  	bp->b_io_error = 0;
> >  
> >  	/*
> > -	 * The caller's reference is released during I/O completion.
> > -	 * This occurs some time after the last b_io_remaining reference is
> > -	 * released, so after we drop our Io reference we have to have some
> > -	 * other reference to ensure the buffer doesn't go away from underneath
> > -	 * us. Take a direct reference to ensure we have safe access to the
> > -	 * buffer until we are finished with it.
> > +	 * I/O needs a reference, because the caller may go away before we are
> > +	 * done with the IO. Completion will deal with it.
> >  	 */
> >  	xfs_buf_hold(bp);
> 
> I think this should be lifted in the callers. For one, it's confusing to
> follow.

Both paths take a reference here, both take it because the IO itself
needs a reference. The only difference is when it is dropped.

> Second, it looks like xfs_buf_submit() unconditionally drops a
> reference whereas __xfs_buf_submit() may not acquire one (i.e. when we
> return an error).
>
> ISTM that the buffer reference calls could be balanced in the top-level
> submit functions rather than split between the common submission path
> and unique sync completion path.

Yes, that may end up being cleaner, because the delwri path has to
take a reference across submit/complete anyway. I just chopped
quickly away at the code without thinking about this stuff too
deeply. I did say "untested", right? :P

> >  void *
> > @@ -2045,14 +2035,21 @@ xfs_buf_delwri_submit_buffers(
> >  		 * at this point so the caller can still access it.
> >  		 */
> >  		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
> > -		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> > +		bp->b_flags |= XBF_WRITE;
> 
> We set XBF_ASYNC below in the specific case, but this doesn't tell us
> anything about whether it might have already been set on the buffer. Is
> it not the responsibility of this function to set/clear XBF_ASYNC
> appropriately?

XBF_ASYNC should be clear (as the buffer is not under IO) but, yes,
I agree that it would be a good idea to clear it anyway.

> >  		if (wait_list) {
> > +			/*
> > +			 * Split synchronous IO - we wait later, so we need ai
> > +			 * reference until we run completion processing and drop
> > +			 * the buffer lock ourselves
> > +			 */
> 
> Might as well merge this with the comment above, which needs fixing
> anyways since we no longer "do all IO submission async."
> 
> >  			xfs_buf_hold(bp);
> 
> I think the buffer reference counting is now broken here.

More than likely. :P

....
> >  			list_move_tail(&bp->b_list, wait_list);
> > -		} else
> > +			__xfs_buf_submit(bp);
> 
> I suspect we need to handle submission errors here, otherwise we wait on
> a buffer that was never submitted.

Yup.

> One final thought.. ISTM that the nature of batched sync buffer
> submission means that once we wait on one or two of those buffers,
> there's a good chance many of the remaining buffer physical I/Os will
> have completed by the time we get to the associated iowait. That means
> that the current behavior of large (sync) delwri buffer completions
> running in the async completion workqueue most likely changes to running
> from __xfs_buf_iowait()->xfs_buf_ioend() context. I'm not sure that
> really matters, but just something to note.

Yup, that's something that crossed my mind. But we only wait for
buffers to complete in log recovery, quota check and the /quota
shrinker/ (now the generic/232 failure makes sense, doesn't it? :).

Hence I don't think there's a performance issue we need to worry
about - correctness is more important...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
  2018-06-08 22:49       ` Dave Chinner
@ 2018-06-09 11:26         ` Brian Foster
From: Brian Foster @ 2018-06-09 11:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Jun 09, 2018 at 08:49:09AM +1000, Dave Chinner wrote:
> On Fri, Jun 08, 2018 at 08:07:09AM -0400, Brian Foster wrote:
> > On Fri, Jun 08, 2018 at 09:27:13AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 07, 2018 at 08:41:25AM -0400, Brian Foster wrote:
> > > > If a delwri queue occurs of a buffer that was previously submitted
> > > > from a delwri queue but has not yet been removed from a wait list,
> > > > the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
> > > > This occurs, for example, if another thread beats the submitter
> > > > thread to the buffer lock after I/O completion. Once the submitter
> > > > acquires the lock, it removes the buffer from the wait list and
> > > > leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
> > > > This results in a lost buffer submission and in turn can result in
> > > > assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
> > > > filesystem lockups if the buffer happens to cover an item in the
> > > > AIL.
> > > 
> > > I just so happened to have this ASSERT happen over night on
> > > generic/232 testing some code I wrote yesterday. It never ceases to
> > > amaze me how bugs that have been around for ages always seem to be
> > > hit at the same time by different people in completely different
> > > contexts....
> > > 
> > 
> > Interesting, out of curiosity was this in a memory limited environment?
> 
> No. had GBs of free RAM, but it's on a fakenuma setup so a node
> could have been low on free memory and hence running the quota
> shrinker....
> 

Ok.

> ....
> > > This seems all a bit broken.
> > 
> > Yes, the premise of this was to do something that didn't break it
> > further. ;) I figured using sync I/O would also address the problem, but
> > would introduce terrible submission->completion serialization...
> 
> *nod*
> 
> > So essentially take apart sync buffer I/O so we can do batched
> > submission/completion. That sounds like a nice idea to me.
> 
> *nod*
> 
> > Feedback on the code to follow. That aside, are you planning to turn
> > this into a real patch submission or would you like me to do it?
> 
> I've got plenty of other things to do - if you want to take it and
> run, I'm fine with that :P
> 

Heh, no problem. I'll work this into something next week and throw some
testing at it. Thanks for getting it started.

Brian

> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index a9678c2f3810..40f441e96dc5 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1453,27 +1453,20 @@ _xfs_buf_ioapply(
> > >  }
> > >  
> > >  /*
> > > - * Asynchronous IO submission path. This transfers the buffer lock ownership and
> > > - * the current reference to the IO. It is not safe to reference the buffer after
> > > - * a call to this function unless the caller holds an additional reference
> > > - * itself.
> > > + * Internal I/O submission helpers for split submission and waiting actions.
> > >   */
> > > -void
> > > -xfs_buf_submit(
> > > +static int
> > > +__xfs_buf_submit(
> > 
> > It looks like the buffer submission refactoring could be a separate
> > patch from the delwri queue race fix.
> 
> Yup, it probably should be.
> 
> > > @@ -1483,12 +1476,8 @@ xfs_buf_submit(
> > >  	bp->b_io_error = 0;
> > >  
> > >  	/*
> > > -	 * The caller's reference is released during I/O completion.
> > > -	 * This occurs some time after the last b_io_remaining reference is
> > > -	 * released, so after we drop our Io reference we have to have some
> > > -	 * other reference to ensure the buffer doesn't go away from underneath
> > > -	 * us. Take a direct reference to ensure we have safe access to the
> > > -	 * buffer until we are finished with it.
> > > +	 * I/O needs a reference, because the caller may go away before we are
> > > +	 * done with the IO. Completion will deal with it.
> > >  	 */
> > >  	xfs_buf_hold(bp);
> > 
> > I think this should be lifted in the callers. For one, it's confusing to
> > follow.
> 
> Both paths take a reference here, both take it because the IO itself
> needs a reference. The only difference is when it is dropped.
> 
> > Second, it looks like xfs_buf_submit() unconditionally drops a
> > reference whereas __xfs_buf_submit() may not acquire one (i.e. when we
> > return an error).
> >
> > ISTM that the buffer reference calls could be balanced in the top-level
> > submit functions rather than split between the common submission path
> > and unique sync completion path.
> 
> Yes, that may end up being cleaner, because the delwri path has to
> take a reference across submit/complete anyway. I just chopped
> quickly away at the code without thinking about this stuff too
> deeply. I did say "untested", right? :P
> 
> > >  void *
> > > @@ -2045,14 +2035,21 @@ xfs_buf_delwri_submit_buffers(
> > >  		 * at this point so the caller can still access it.
> > >  		 */
> > >  		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
> > > -		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> > > +		bp->b_flags |= XBF_WRITE;
> > 
> > We set XBF_ASYNC below in the specific case, but this doesn't tell us
> > anything about whether it might have already been set on the buffer. Is
> > it not the responsibility of this function to set/clear XBF_ASYNC
> > appropriately?
> 
> XBF_ASYNC should be clear (as the buffer is not under IO) but, yes,
> I agree that it would be a good idea to clear it anyway.
> 
> > >  		if (wait_list) {
> > > +			/*
> > > +			 * Split synchronous IO - we wait later, so we need ai
> > > +			 * reference until we run completion processing and drop
> > > +			 * the buffer lock ourselves
> > > +			 */
> > 
> > Might as well merge this with the comment above, which needs fixing
> > anyways since we no longer "do all IO submission async."
> > 
> > >  			xfs_buf_hold(bp);
> > 
> > I think the buffer reference counting is now broken here.
> 
> More than likely. :P
> 
> ....
> > >  			list_move_tail(&bp->b_list, wait_list);
> > > -		} else
> > > +			__xfs_buf_submit(bp);
> > 
> > I suspect we need to handle submission errors here, otherwise we wait on
> > a buffer that was never submitted.
> 
> Yup.
> 
> > One final thought.. ISTM that the nature of batched sync buffer
> > submission means that once we wait on one or two of those buffers,
> > there's a good chance many of the remaining buffer physical I/Os will
> > have completed by the time we get to the associated iowait. That means
> > that the current behavior of large (sync) delwri buffer completions
> > running in the async completion workqueue most likely changes to running
> > from __xfs_buf_iowait()->xfs_buf_ioend() context. I'm not sure that
> > really matters, but just something to note.
> 
> Yup, that's something that crossed my mind. But we only wait for
> buffers to complete in log recovery, quota check and the /quota
> shrinker/ (now the generic/232 failure makes sense, doesn't it? :).
> 
> Hence I don't think there's a performance issue we need to worry
> about - correctness is more important...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
