All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
@ 2021-05-03 12:18 Brian Foster
  2021-05-03 23:25 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-05-03 12:18 UTC (permalink / raw)
  To: linux-xfs

The special processing used to simulate a buffer I/O failure on fs
shutdown has a difficult to reproduce race that can result in a use
after free of the associated buffer. Consider a buffer that has been
committed to the on-disk log and thus is AIL resident. The buffer
lands on the writeback delwri queue, but is subsequently locked,
committed and pinned by another transaction before submitted for
I/O. At this point, the buffer is stuck on the delwri queue as it
cannot be submitted for I/O until it is unpinned. A log checkpoint
I/O failure occurs sometime later, which aborts the bli. The unpin
handler is called with the aborted log item, drops the bli reference
count, the pin count, and falls into the I/O failure simulation
path.

The potential problem here is that once the pin count falls to zero
in ->iop_unpin(), xfsaild is free to retry delwri submission of the
buffer at any time, before the unpin handler even completes. If
delwri queue submission wins the race to the buffer lock, it
observes the shutdown state and simulates the I/O failure itself.
This releases both the bli and delwri queue holds and frees the
buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to
run through the same failure sequence. This problem is rare and
requires many iterations of fstest generic/019 (which simulates disk
I/O failures) to reproduce.

To avoid this problem, hold the buffer across the unpin sequence in
xfs_buf_item_unpin(). This is a bit unfortunate in that the new hold
is unconditional while really only necessary for a rare, fatal error
scenario, but it guarantees the buffer still exists in the off
chance that the handler attempts to access it.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This is a patch I've had around for a bit for a very rare corner case I
was able to reproduce in some past testing. I'm sending this as RFC
because I'm curious if folks have any thoughts on the approach. I'd be
Ok with this change as is, but I think there are alternatives available
too. We could do something fairly simple like bury the hold in the
remove (abort) case only, or perhaps consider checking IN_AIL state
before the pin count drops and base on that (though that seems a bit
more fragile to me). Thoughts?

Brian

 fs/xfs/xfs_buf_item.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index fb69879e4b2b..a1ad6901eb15 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -504,6 +504,7 @@ xfs_buf_item_unpin(
 
 	freed = atomic_dec_and_test(&bip->bli_refcount);
 
+	xfs_buf_hold(bp);
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 
@@ -560,6 +561,7 @@ xfs_buf_item_unpin(
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
 	}
+	xfs_buf_rele(bp);
 }
 
 STATIC uint
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-05-03 12:18 [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing Brian Foster
@ 2021-05-03 23:25 ` Dave Chinner
  2021-05-05 11:50   ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-05-03 23:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> The special processing used to simulate a buffer I/O failure on fs
> shutdown has a difficult to reproduce race that can result in a use
> after free of the associated buffer. Consider a buffer that has been
> committed to the on-disk log and thus is AIL resident. The buffer
> lands on the writeback delwri queue, but is subsequently locked,
> committed and pinned by another transaction before submitted for
> I/O. At this point, the buffer is stuck on the delwri queue as it
> cannot be submitted for I/O until it is unpinned. A log checkpoint
> I/O failure occurs sometime later, which aborts the bli. The unpin
> handler is called with the aborted log item, drops the bli reference
> count, the pin count, and falls into the I/O failure simulation
> path.
> 
> The potential problem here is that once the pin count falls to zero
> in ->iop_unpin(), xfsaild is free to retry delwri submission of the
> buffer at any time, before the unpin handler even completes. If
> delwri queue submission wins the race to the buffer lock, it
> observes the shutdown state and simulates the I/O failure itself.
> This releases both the bli and delwri queue holds and frees the
> buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to
> run through the same failure sequence. This problem is rare and
> requires many iterations of fstest generic/019 (which simulates disk
> I/O failures) to reproduce.

You've described the race but not the failure that occurs? I'm going
to guess this is a use-after-free situation or something similar,
but I'm not immediately sure...

> To avoid this problem, hold the buffer across the unpin sequence in
> xfs_buf_item_unpin(). This is a bit unfortunate in that the new hold
> is unconditional while really only necessary for a rare, fatal error
> scenario, but it guarantees the buffer still exists in the off
> chance that the handler attempts to access it.

Ok, so essentially the problem here is that in the case of a
non-stale buffer we can enter xfs_buf_item_unpin() with an unlocked,
buffer that only the bli holds a reference to, and that bli
reference to the can be dropped by a racing IO completion that calls
xfs_buf_item_done()?

i.e. the problem here is that we've dropped the bip->bli_refcount
before we've locked the buffer and taken a reference to it for
the fail path?

OK, I see that xfs_buf_item_done() (called from ioend processing)
simply frees the buf log item and doesn't care about the bli
refcount at all. So the first ioend caller will free the buf log
item regardless of whether there are other references to it at all.

IOWs, once we unpin the buffer, the bli attached to the buffer and
being tracked in the AIL has -zero- references to the bli and so it
gets freed unconditionally on IO completion.

That seems to the be the problem here - the bli is not reference
counted while it is the AIL....

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This is a patch I've had around for a bit for a very rare corner case I
> was able to reproduce in some past testing. I'm sending this as RFC
> because I'm curious if folks have any thoughts on the approach. I'd be
> Ok with this change as is, but I think there are alternatives available
> too. We could do something fairly simple like bury the hold in the
> remove (abort) case only, or perhaps consider checking IN_AIL state
> before the pin count drops and base on that (though that seems a bit
> more fragile to me). Thoughts?
> 
> Brian
> 
>  fs/xfs/xfs_buf_item.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index fb69879e4b2b..a1ad6901eb15 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -504,6 +504,7 @@ xfs_buf_item_unpin(
>  
>  	freed = atomic_dec_and_test(&bip->bli_refcount);
>  
> +	xfs_buf_hold(bp);
>  	if (atomic_dec_and_test(&bp->b_pin_count))
>  		wake_up_all(&bp->b_waiters);
>  
> @@ -560,6 +561,7 @@ xfs_buf_item_unpin(
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
>  	}
> +	xfs_buf_rele(bp);
>  }

Ok, so we take an extra reference for the xfs_buf_ioend_fail()
path because we've exposed the code running here to the bli
reference to the buffer being dropped at any point after the wakeup
due to IO completion being run by another party.

Yup, seems like the bli_refcount scope needs to be expanded here.

Seems to me that we need to change how and where we drop the buf log
item reference count so the reference to the buffer it owns isn't
dropped until -after- we process the IO failure case.

Something like the patch below (completely untested!), perhaps?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: extend bli_refcount to cover the AIL and IO

From: Dave Chinner <dchinner@redhat.com>

Untested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 207 +++++++++++++++++++++++++++++---------------------
 fs/xfs/xfs_buf_item.h |   1 -
 2 files changed, 120 insertions(+), 88 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index fb69879e4b2b..a0fa0f16d8c7 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -73,6 +73,32 @@ xfs_buf_item_straddle(
 	return false;
 }
 
+static void
+xfs_buf_item_free(
+	struct xfs_buf_log_item	*bip)
+{
+	xfs_buf_item_free_format(bip);
+	kmem_free(bip->bli_item.li_lv_shadow);
+	kmem_cache_free(xfs_buf_item_zone, bip);
+}
+
+/*
+ * xfs_buf_item_relse() is called when the buf log item is no longer needed.
+ */
+static void
+xfs_buf_item_relse(
+	struct xfs_buf	*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	trace_xfs_buf_item_relse(bp, _RET_IP_);
+	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
+
+	bp->b_log_item = NULL;
+	xfs_buf_rele(bp);
+	xfs_buf_item_free(bip);
+}
+
 /*
  * This returns the number of log iovecs needed to log the
  * given buf log item.
@@ -502,56 +528,26 @@ xfs_buf_item_unpin(
 
 	trace_xfs_buf_item_unpin(bip);
 
-	freed = atomic_dec_and_test(&bip->bli_refcount);
-
+	/*
+	 * We can wake pin waiters safely now because we still hold the
+	 * bli_refcount that was taken when the pin was gained.
+	 */
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 
-	if (freed && stale) {
-		ASSERT(bip->bli_flags & XFS_BLI_STALE);
-		ASSERT(xfs_buf_islocked(bp));
-		ASSERT(bp->b_flags & XBF_STALE);
-		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
-
-		trace_xfs_buf_item_unpin_stale(bip);
-
-		if (remove) {
-			/*
-			 * If we are in a transaction context, we have to
-			 * remove the log item from the transaction as we are
-			 * about to release our reference to the buffer.  If we
-			 * don't, the unlock that occurs later in
-			 * xfs_trans_uncommit() will try to reference the
-			 * buffer which we no longer have a hold on.
-			 */
-			if (!list_empty(&lip->li_trans))
-				xfs_trans_del_item(lip);
-
-			/*
-			 * Since the transaction no longer refers to the buffer,
-			 * the buffer should no longer refer to the transaction.
-			 */
-			bp->b_transp = NULL;
+	if (!stale) {
+		if (!remove) {
+			/* Nothing to do but drop the refcount the pin owned. */
+			atomic_dec(&bip->bli_refcount);
+			return;
 		}
 
 		/*
-		 * If we get called here because of an IO error, we may or may
-		 * not have the item on the AIL. xfs_trans_ail_delete() will
-		 * take care of that situation. xfs_trans_ail_delete() drops
-		 * the AIL lock.
-		 */
-		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			xfs_buf_item_done(bp);
-			xfs_buf_inode_iodone(bp);
-			ASSERT(list_empty(&bp->b_li_list));
-		} else {
-			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
-			xfs_buf_item_relse(bp);
-			ASSERT(bp->b_log_item == NULL);
-		}
-		xfs_buf_relse(bp);
-	} else if (freed && remove) {
-		/*
+		 * Fail the IO before we drop the bli refcount. This guarantees
+		 * that a racing writeback completion also failing the buffer
+		 * and running completion will not remove the last reference to
+		 * the bli and free it from under us.
+		 *
 		 * The buffer must be locked and held by the caller to simulate
 		 * an async I/O failure.
 		 */
@@ -559,7 +555,62 @@ xfs_buf_item_unpin(
 		xfs_buf_hold(bp);
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
+		xfs_buf_item_relse(bp);
+		return;
 	}
+
+	/*
+	 * Stale buffer - only process it if this is the last reference to the
+	 * BLI. If this is the last BLI reference, then the buffer will be
+	 * locked and have two references - once from the transaction commit and
+	 * one from the BLI - and we do not unlock and release transaction
+	 * reference until we've finished cleaning up the BLI.
+	 */
+	if (!atomic_dec_and_test(&bip->bli_refcount))
+		return;
+
+	ASSERT(bip->bli_flags & XFS_BLI_STALE);
+	ASSERT(xfs_buf_islocked(bp));
+	ASSERT(bp->b_flags & XBF_STALE);
+	ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
+
+	trace_xfs_buf_item_unpin_stale(bip);
+
+	if (remove) {
+		/*
+		 * If we are in a transaction context, we have to
+		 * remove the log item from the transaction as we are
+		 * about to release our reference to the buffer.  If we
+		 * don't, the unlock that occurs later in
+		 * xfs_trans_uncommit() will try to reference the
+		 * buffer which we no longer have a hold on.
+		 */
+		if (!list_empty(&lip->li_trans))
+			xfs_trans_del_item(lip);
+
+		/*
+		 * Since the transaction no longer refers to the buffer,
+		 * the buffer should no longer refer to the transaction.
+		 */
+		bp->b_transp = NULL;
+	}
+
+	/*
+	 * If we get called here because of an IO error, we may or may
+	 * not have the item on the AIL. xfs_trans_ail_delete() will
+	 * take care of that situation. xfs_trans_ail_delete() drops
+	 * the AIL lock.
+	 */
+	if (bip->bli_flags & XFS_BLI_STALE_INODE) {
+		xfs_buf_item_done(bp);
+		xfs_buf_inode_iodone(bp);
+		ASSERT(list_empty(&bp->b_li_list));
+	} else {
+		xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
+		xfs_buf_item_relse(bp);
+		ASSERT(bp->b_log_item == NULL);
+	}
+	xfs_buf_relse(bp);
 }
 
 STATIC uint
@@ -720,22 +771,24 @@ xfs_buf_item_committing(
 }
 
 /*
- * This is called to find out where the oldest active copy of the
- * buf log item in the on disk log resides now that the last log
- * write of it completed at the given lsn.
- * We always re-log all the dirty data in a buffer, so usually the
- * latest copy in the on disk log is the only one that matters.  For
- * those cases we simply return the given lsn.
+ * The item is about to be inserted into the AIL. If it is not already in the
+ * AIL, we need to take a reference to the BLI for the AIL. This "AIL reference"
+ * will be held until the item is removed from the AIL.
+ *
+ * This is called to find out where the oldest active copy of the buf log item
+ * in the on disk log resides now that the last log write of it completed at the
+ * given lsn.  We always re-log all the dirty data in a buffer, so usually the
+ * latest copy in the on disk log is the only one that matters.  For those cases
+ * we simply return the given lsn.
  *
- * The one exception to this is for buffers full of newly allocated
- * inodes.  These buffers are only relogged with the XFS_BLI_INODE_BUF
- * flag set, indicating that only the di_next_unlinked fields from the
- * inodes in the buffers will be replayed during recovery.  If the
- * original newly allocated inode images have not yet been flushed
- * when the buffer is so relogged, then we need to make sure that we
- * keep the old images in the 'active' portion of the log.  We do this
- * by returning the original lsn of that transaction here rather than
- * the current one.
+ * The one exception to this is for buffers full of newly allocated inodes.
+ * These buffers are only relogged with the XFS_BLI_INODE_BUF flag set,
+ * indicating that only the di_next_unlinked fields from the inodes in the
+ * buffers will be replayed during recovery.  If the original newly allocated
+ * inode images have not yet been flushed when the buffer is so relogged, then
+ * we need to make sure that we keep the old images in the 'active' portion of
+ * the log.  We do this by returning the original lsn of that transaction here
+ * rather than the current one.
  */
 STATIC xfs_lsn_t
 xfs_buf_item_committed(
@@ -746,6 +799,9 @@ xfs_buf_item_committed(
 
 	trace_xfs_buf_item_committed(bip);
 
+	if (!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags))
+		atomic_inc(&bli->bli_refcount);
+
 	if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
 		return lip->li_lsn;
 	return lsn;
@@ -1009,36 +1065,12 @@ xfs_buf_item_dirty_format(
 	return false;
 }
 
-STATIC void
-xfs_buf_item_free(
-	struct xfs_buf_log_item	*bip)
-{
-	xfs_buf_item_free_format(bip);
-	kmem_free(bip->bli_item.li_lv_shadow);
-	kmem_cache_free(xfs_buf_item_zone, bip);
-}
-
-/*
- * xfs_buf_item_relse() is called when the buf log item is no longer needed.
- */
-void
-xfs_buf_item_relse(
-	struct xfs_buf	*bp)
-{
-	struct xfs_buf_log_item	*bip = bp->b_log_item;
-
-	trace_xfs_buf_item_relse(bp, _RET_IP_);
-	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
-
-	bp->b_log_item = NULL;
-	xfs_buf_rele(bp);
-	xfs_buf_item_free(bip);
-}
-
 void
 xfs_buf_item_done(
 	struct xfs_buf		*bp)
 {
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
 	/*
 	 * If we are forcibly shutting down, this may well be off the AIL
 	 * already. That's because we simulate the log-committed callbacks to
@@ -1051,8 +1083,9 @@ xfs_buf_item_done(
 	 * Note that log recovery writes might have buffer items that are not on
 	 * the AIL even when the file system is not shut down.
 	 */
-	xfs_trans_ail_delete(&bp->b_log_item->bli_item,
+	xfs_trans_ail_delete(&bip->bli_item,
 			     (bp->b_flags & _XBF_LOGRECOVERY) ? 0 :
 			     SHUTDOWN_CORRUPT_INCORE);
-	xfs_buf_item_relse(bp);
+
+	xfs_buf_item_put(bp);
 }
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 50aa0f5ef959..e3ccbf3ca801 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -51,7 +51,6 @@ struct xfs_buf_log_item {
 
 int	xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *);
 void	xfs_buf_item_done(struct xfs_buf *bp);
-void	xfs_buf_item_relse(struct xfs_buf *);
 bool	xfs_buf_item_put(struct xfs_buf_log_item *);
 void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
 bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-05-03 23:25 ` Dave Chinner
@ 2021-05-05 11:50   ` Brian Foster
  2021-05-06  2:56     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-05-05 11:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > The special processing used to simulate a buffer I/O failure on fs
> > shutdown has a difficult to reproduce race that can result in a use
> > after free of the associated buffer. Consider a buffer that has been
> > committed to the on-disk log and thus is AIL resident. The buffer
> > lands on the writeback delwri queue, but is subsequently locked,
> > committed and pinned by another transaction before submitted for
> > I/O. At this point, the buffer is stuck on the delwri queue as it
> > cannot be submitted for I/O until it is unpinned. A log checkpoint
> > I/O failure occurs sometime later, which aborts the bli. The unpin
> > handler is called with the aborted log item, drops the bli reference
> > count, the pin count, and falls into the I/O failure simulation
> > path.
> > 
> > The potential problem here is that once the pin count falls to zero
> > in ->iop_unpin(), xfsaild is free to retry delwri submission of the
> > buffer at any time, before the unpin handler even completes. If
> > delwri queue submission wins the race to the buffer lock, it
> > observes the shutdown state and simulates the I/O failure itself.
> > This releases both the bli and delwri queue holds and frees the
> > buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to
> > run through the same failure sequence. This problem is rare and
> > requires many iterations of fstest generic/019 (which simulates disk
> > I/O failures) to reproduce.
> 
> You've described the race but not the failure that occurs? I'm going
> to guess this is a use-after-free situation or something similar,
> but I'm not immediately sure...
> 

Yes, it's a use after free. I don't recall the exact error output, but
the above implies it by pointing out the delwri queue "frees the buffer
while xfs_buf_item_unpin() sits on xfs_buf_lock() ..." Also, the first
sentence in the commit log explicitly states it's a use after free of
the buffer. :)

> > To avoid this problem, hold the buffer across the unpin sequence in
> > xfs_buf_item_unpin(). This is a bit unfortunate in that the new hold
> > is unconditional while really only necessary for a rare, fatal error
> > scenario, but it guarantees the buffer still exists in the off
> > chance that the handler attempts to access it.
> 
> Ok, so essentially the problem here is that in the case of a
> non-stale buffer we can enter xfs_buf_item_unpin() with an unlocked,
> buffer that only the bli holds a reference to, and that bli
> reference to the can be dropped by a racing IO completion that calls
> xfs_buf_item_done()?
> 

The delwri queue and bli hold references to the buffer on entry of
xfs_buf_item_unpin(). I believe the only reference to the bli at this
point is that acquired via ->iop_pin(). (Technically the bli reference
acquired by the transaction is released via ->iop_committing() after the
item is pinned, so the ref effectively transfers along...).

Once the pin count drops to zero, the buffer is available for I/O. Since
unpin also drops the last bli reference and has no indication of whether
the item is AIL resident, both unpin and the delwri submit race to
perform the simulated I/O failure on the buffer that is effectively only
held by the delwri queue (because the bli hold drops on I/O completion).
If the latter wins, it can free the buffer before unpin has acquired the
hold used to simulate the I/O.

> i.e. the problem here is that we've dropped the bip->bli_refcount
> before we've locked the buffer and taken a reference to it for
> the fail path?
> 
> OK, I see that xfs_buf_item_done() (called from ioend processing)
> simply frees the buf log item and doesn't care about the bli
> refcount at all. So the first ioend caller will free the buf log
> item regardless of whether there are other references to it at all.
> 
> IOWs, once we unpin the buffer, the bli attached to the buffer and
> being tracked in the AIL has -zero- references to the bli and so it
> gets freed unconditionally on IO completion.
> 
> That seems to the be the problem here - the bli is not reference
> counted while it is the AIL....
> 

I think it depends on how you look at it. As you point out, we've had
this odd bli reference count pattern for as long as I can remember where
the refcount seems to span the transaction and pin lifecycle, but then
goes to zero at final unpin so it can do certain bli processing there
and buffer writeback completion can just explicitly free the bli
(dropping the buffer hold). I don't think that by itself is necessarily
the problem since for one it's the buffer we care about here, but also
the current code is clearly written with this reference counting pattern
in mind (i.e. we continue to access the bli after the bli refcount drops
to zero).

That said, I'm not opposed to fixing this bug via an improvement to the
bli reference count handling if we can do so cleanly enough. I didn't
really consider that approach because of the difference in complexity
and scope between reworking bli refcount handling and just grabbing a
buffer hold in the context it's needed..

> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This is a patch I've had around for a bit for a very rare corner case I
> > was able to reproduce in some past testing. I'm sending this as RFC
> > because I'm curious if folks have any thoughts on the approach. I'd be
> > Ok with this change as is, but I think there are alternatives available
> > too. We could do something fairly simple like bury the hold in the
> > remove (abort) case only, or perhaps consider checking IN_AIL state
> > before the pin count drops and base on that (though that seems a bit
> > more fragile to me). Thoughts?
> > 
> > Brian
> > 
> >  fs/xfs/xfs_buf_item.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index fb69879e4b2b..a1ad6901eb15 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -504,6 +504,7 @@ xfs_buf_item_unpin(
> >  
> >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> >  
> > +	xfs_buf_hold(bp);
> >  	if (atomic_dec_and_test(&bp->b_pin_count))
> >  		wake_up_all(&bp->b_waiters);
> >  
> > @@ -560,6 +561,7 @@ xfs_buf_item_unpin(
> >  		bp->b_flags |= XBF_ASYNC;
> >  		xfs_buf_ioend_fail(bp);
> >  	}
> > +	xfs_buf_rele(bp);
> >  }
> 
> Ok, so we take an extra reference for the xfs_buf_ioend_fail()
> path because we've exposed the code running here to the bli
> reference to the buffer being dropped at any point after the wakeup
> due to IO completion being run by another party.
> 
> Yup, seems like the bli_refcount scope needs to be expanded here.
> 
> Seems to me that we need to change how and where we drop the buf log
> item reference count so the reference to the buffer it owns isn't
> dropped until -after- we process the IO failure case.
> 
> Something like the patch below (completely untested!), perhaps?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: extend bli_refcount to cover the AIL and IO
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Untested.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 207 +++++++++++++++++++++++++++++---------------------
>  fs/xfs/xfs_buf_item.h |   1 -
>  2 files changed, 120 insertions(+), 88 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index fb69879e4b2b..a0fa0f16d8c7 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
...
> @@ -502,56 +528,26 @@ xfs_buf_item_unpin(
>  
>  	trace_xfs_buf_item_unpin(bip);
>  
> -	freed = atomic_dec_and_test(&bip->bli_refcount);
> -
> +	/*
> +	 * We can wake pin waiters safely now because we still hold the
> +	 * bli_refcount that was taken when the pin was gained.
> +	 */
>  	if (atomic_dec_and_test(&bp->b_pin_count))
>  		wake_up_all(&bp->b_waiters);
>  
> -	if (freed && stale) {
> -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> -		ASSERT(xfs_buf_islocked(bp));
> -		ASSERT(bp->b_flags & XBF_STALE);
> -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> -
> -		trace_xfs_buf_item_unpin_stale(bip);
> -
> -		if (remove) {
> -			/*
> -			 * If we are in a transaction context, we have to
> -			 * remove the log item from the transaction as we are
> -			 * about to release our reference to the buffer.  If we
> -			 * don't, the unlock that occurs later in
> -			 * xfs_trans_uncommit() will try to reference the
> -			 * buffer which we no longer have a hold on.
> -			 */
> -			if (!list_empty(&lip->li_trans))
> -				xfs_trans_del_item(lip);
> -
> -			/*
> -			 * Since the transaction no longer refers to the buffer,
> -			 * the buffer should no longer refer to the transaction.
> -			 */
> -			bp->b_transp = NULL;
> +	if (!stale) {
> +		if (!remove) {
> +			/* Nothing to do but drop the refcount the pin owned. */
> +			atomic_dec(&bip->bli_refcount);
> +			return;
>  		}

Hmm.. this seems a bit wonky to me. This code historically acts on the
drop of the final reference to the bli. This is not critical for the
common (!stale && !remove) case because that's basically a no-op here
outside of dropping the reference, and it looks like the stale buffer
handling code further down continues to follow that model, but in this
branch it seems we're trying to be clever in how the reference is
managed and as a result can act on a bli that might actually have
additional references. If so, I don't think it's appropriate to run
through the error sequence that follows.

>  
>  		/*
> -		 * If we get called here because of an IO error, we may or may
> -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> -		 * take care of that situation. xfs_trans_ail_delete() drops
> -		 * the AIL lock.
> -		 */
> -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> -			xfs_buf_item_done(bp);
> -			xfs_buf_inode_iodone(bp);
> -			ASSERT(list_empty(&bp->b_li_list));
> -		} else {
> -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> -			xfs_buf_item_relse(bp);
> -			ASSERT(bp->b_log_item == NULL);
> -		}
> -		xfs_buf_relse(bp);
> -	} else if (freed && remove) {
> -		/*
> +		 * Fail the IO before we drop the bli refcount. This guarantees
> +		 * that a racing writeback completion also failing the buffer
> +		 * and running completion will not remove the last reference to
> +		 * the bli and free it from under us.
> +		 *
>  		 * The buffer must be locked and held by the caller to simulate
>  		 * an async I/O failure.
>  		 */
> @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
>  		xfs_buf_hold(bp);
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> +		xfs_buf_item_relse(bp);

Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
comment above implies dropping the bli refcount after the I/O failure,
but this just frees it. ISTM we will always have the AIL bli reference
here because it was bumped in xfs_buf_item_committed() unless it was AIL
resident already. OTOH, if the buffer is already AIL resident the
eventual xfs_buf_item_done() (via xfs_buf_ioend()) would remove it
before it drops the bli ref. It's not clear to me what the intent is
here. 

> +		return;
>  	}
> +
> +	/*
> +	 * Stale buffer - only process it if this is the last reference to the
> +	 * BLI. If this is the last BLI reference, then the buffer will be
> +	 * locked and have two references - once from the transaction commit and
> +	 * one from the BLI - and we do not unlock and release transaction
> +	 * reference until we've finished cleaning up the BLI.
> +	 */
> +	if (!atomic_dec_and_test(&bip->bli_refcount))
> +		return;
> +

If the buffer is stale, will this ever be the last reference now that
_item_committed() bumps the refcount? This change also seems to have
ramifications for the code that follows, such as if a staled buffer is
already in the AIL (with a bli ref), would this code ever get to the
point of removing it?

All in all, I'll reiterate that I think it would be nice to fix up the
bli reference count handling in general, but I think the scope and
complexity of that work is significantly beyond what is reasonably
necessary to fix this bug. A rework may not ultimately be a huge code
change, but the level of review and testing required to cover and
consider all the corner cases and whatnot for refcount handling is
definitely nontrivial. That's just my .02 on a first pass.

Brian

> +	ASSERT(bip->bli_flags & XFS_BLI_STALE);
> +	ASSERT(xfs_buf_islocked(bp));
> +	ASSERT(bp->b_flags & XBF_STALE);
> +	ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> +
> +	trace_xfs_buf_item_unpin_stale(bip);
> +
> +	if (remove) {
> +		/*
> +		 * If we are in a transaction context, we have to
> +		 * remove the log item from the transaction as we are
> +		 * about to release our reference to the buffer.  If we
> +		 * don't, the unlock that occurs later in
> +		 * xfs_trans_uncommit() will try to reference the
> +		 * buffer which we no longer have a hold on.
> +		 */
> +		if (!list_empty(&lip->li_trans))
> +			xfs_trans_del_item(lip);
> +
> +		/*
> +		 * Since the transaction no longer refers to the buffer,
> +		 * the buffer should no longer refer to the transaction.
> +		 */
> +		bp->b_transp = NULL;
> +	}
> +
> +	/*
> +	 * If we get called here because of an IO error, we may or may
> +	 * not have the item on the AIL. xfs_trans_ail_delete() will
> +	 * take care of that situation. xfs_trans_ail_delete() drops
> +	 * the AIL lock.
> +	 */
> +	if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> +		xfs_buf_item_done(bp);
> +		xfs_buf_inode_iodone(bp);
> +		ASSERT(list_empty(&bp->b_li_list));
> +	} else {
> +		xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> +		xfs_buf_item_relse(bp);
> +		ASSERT(bp->b_log_item == NULL);
> +	}
> +	xfs_buf_relse(bp);
>  }
>  
>  STATIC uint
> @@ -720,22 +771,24 @@ xfs_buf_item_committing(
>  }
>  
>  /*
> - * This is called to find out where the oldest active copy of the
> - * buf log item in the on disk log resides now that the last log
> - * write of it completed at the given lsn.
> - * We always re-log all the dirty data in a buffer, so usually the
> - * latest copy in the on disk log is the only one that matters.  For
> - * those cases we simply return the given lsn.
> + * The item is about to be inserted into the AIL. If it is not already in the
> + * AIL, we need to take a reference to the BLI for the AIL. This "AIL reference"
> + * will be held until the item is removed from the AIL.
> + *
> + * This is called to find out where the oldest active copy of the buf log item
> + * in the on disk log resides now that the last log write of it completed at the
> + * given lsn.  We always re-log all the dirty data in a buffer, so usually the
> + * latest copy in the on disk log is the only one that matters.  For those cases
> + * we simply return the given lsn.
>   *
> - * The one exception to this is for buffers full of newly allocated
> - * inodes.  These buffers are only relogged with the XFS_BLI_INODE_BUF
> - * flag set, indicating that only the di_next_unlinked fields from the
> - * inodes in the buffers will be replayed during recovery.  If the
> - * original newly allocated inode images have not yet been flushed
> - * when the buffer is so relogged, then we need to make sure that we
> - * keep the old images in the 'active' portion of the log.  We do this
> - * by returning the original lsn of that transaction here rather than
> - * the current one.
> + * The one exception to this is for buffers full of newly allocated inodes.
> + * These buffers are only relogged with the XFS_BLI_INODE_BUF flag set,
> + * indicating that only the di_next_unlinked fields from the inodes in the
> + * buffers will be replayed during recovery.  If the original newly allocated
> + * inode images have not yet been flushed when the buffer is so relogged, then
> + * we need to make sure that we keep the old images in the 'active' portion of
> + * the log.  We do this by returning the original lsn of that transaction here
> + * rather than the current one.
>   */
>  STATIC xfs_lsn_t
>  xfs_buf_item_committed(
> @@ -746,6 +799,9 @@ xfs_buf_item_committed(
>  
>  	trace_xfs_buf_item_committed(bip);
>  
> +	if (!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags))
> +		atomic_inc(&bli->bli_refcount);
> +
>  	if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
>  		return lip->li_lsn;
>  	return lsn;
> @@ -1009,36 +1065,12 @@ xfs_buf_item_dirty_format(
>  	return false;
>  }
>  
> -STATIC void
> -xfs_buf_item_free(
> -	struct xfs_buf_log_item	*bip)
> -{
> -	xfs_buf_item_free_format(bip);
> -	kmem_free(bip->bli_item.li_lv_shadow);
> -	kmem_cache_free(xfs_buf_item_zone, bip);
> -}
> -
> -/*
> - * xfs_buf_item_relse() is called when the buf log item is no longer needed.
> - */
> -void
> -xfs_buf_item_relse(
> -	struct xfs_buf	*bp)
> -{
> -	struct xfs_buf_log_item	*bip = bp->b_log_item;
> -
> -	trace_xfs_buf_item_relse(bp, _RET_IP_);
> -	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
> -
> -	bp->b_log_item = NULL;
> -	xfs_buf_rele(bp);
> -	xfs_buf_item_free(bip);
> -}
> -
>  void
>  xfs_buf_item_done(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> +
>  	/*
>  	 * If we are forcibly shutting down, this may well be off the AIL
>  	 * already. That's because we simulate the log-committed callbacks to
> @@ -1051,8 +1083,9 @@ xfs_buf_item_done(
>  	 * Note that log recovery writes might have buffer items that are not on
>  	 * the AIL even when the file system is not shut down.
>  	 */
> -	xfs_trans_ail_delete(&bp->b_log_item->bli_item,
> +	xfs_trans_ail_delete(&bip->bli_item,
>  			     (bp->b_flags & _XBF_LOGRECOVERY) ? 0 :
>  			     SHUTDOWN_CORRUPT_INCORE);
> -	xfs_buf_item_relse(bp);
> +
> +	xfs_buf_item_put(bp);
>  }
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 50aa0f5ef959..e3ccbf3ca801 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -51,7 +51,6 @@ struct xfs_buf_log_item {
>  
>  int	xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *);
>  void	xfs_buf_item_done(struct xfs_buf *bp);
> -void	xfs_buf_item_relse(struct xfs_buf *);
>  bool	xfs_buf_item_put(struct xfs_buf_log_item *);
>  void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
>  bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-05-05 11:50   ` Brian Foster
@ 2021-05-06  2:56     ` Dave Chinner
  2021-05-06 19:29       ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-05-06  2:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > i.e. the problem here is that we've dropped the bip->bli_refcount
> > before we've locked the buffer and taken a reference to it for
> > the fail path?
> > 
> > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > simply frees the buf log item and doesn't care about the bli
> > refcount at all. So the first ioend caller will free the buf log
> > item regardless of whether there are other references to it at all.
> > 
> > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > being tracked in the AIL has -zero- references to the bli and so it
> > gets freed unconditionally on IO completion.
> > 
> > That seems to the be the problem here - the bli is not reference
> > counted while it is the AIL....
> > 
> 
> I think it depends on how you look at it. As you point out, we've had
> this odd bli reference count pattern for as long as I can remember where

Yes, and it's been a constant source of use-after free bugs in
shutdown processing for as long as I can remember. I want to fix it
so we don't have to keep band-aiding this code every time we change
how some part of log item or stale inode/buffer processing works...


> >  
> > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > -
> > +	/*
> > +	 * We can wake pin waiters safely now because we still hold the
> > +	 * bli_refcount that was taken when the pin was gained.
> > +	 */
> >  	if (atomic_dec_and_test(&bp->b_pin_count))
> >  		wake_up_all(&bp->b_waiters);
> >  
> > -	if (freed && stale) {
> > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > -		ASSERT(xfs_buf_islocked(bp));
> > -		ASSERT(bp->b_flags & XBF_STALE);
> > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > -
> > -		trace_xfs_buf_item_unpin_stale(bip);
> > -
> > -		if (remove) {
> > -			/*
> > -			 * If we are in a transaction context, we have to
> > -			 * remove the log item from the transaction as we are
> > -			 * about to release our reference to the buffer.  If we
> > -			 * don't, the unlock that occurs later in
> > -			 * xfs_trans_uncommit() will try to reference the
> > -			 * buffer which we no longer have a hold on.
> > -			 */
> > -			if (!list_empty(&lip->li_trans))
> > -				xfs_trans_del_item(lip);
> > -
> > -			/*
> > -			 * Since the transaction no longer refers to the buffer,
> > -			 * the buffer should no longer refer to the transaction.
> > -			 */
> > -			bp->b_transp = NULL;
> > +	if (!stale) {
> > +		if (!remove) {
> > +			/* Nothing to do but drop the refcount the pin owned. */
> > +			atomic_dec(&bip->bli_refcount);
> > +			return;
> >  		}
> 
> Hmm.. this seems a bit wonky to me. This code historically acts on the
> drop of the final reference to the bli.

Yes, and that's the problem that needs fixing. The AIL needs a
reference so that we aren't racing with writeback from the AIL to
free the object because both sets of code run without actually
holding an active reference to the BLI...

> This is not critical for the
> common (!stale && !remove) case because that's basically a no-op here
> outside of dropping the reference, and it looks like the stale buffer
> handling code further down continues to follow that model, but in this
> branch it seems we're trying to be clever in how the reference is
> managed and as a result can act on a bli that might actually have
> additional references.

Who cares? If something else has active references, then we must not
free the bli or buffer here, anyway. The lack of active references
by active BLI usres is why we keep getting use-after-free bugs in
this code.....

> If so, I don't think it's appropriate to run
> through the error sequence that follows.
> 
> >  
> >  		/*
> > -		 * If we get called here because of an IO error, we may or may
> > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > -		 * the AIL lock.
> > -		 */
> > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > -			xfs_buf_item_done(bp);
> > -			xfs_buf_inode_iodone(bp);
> > -			ASSERT(list_empty(&bp->b_li_list));
> > -		} else {
> > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > -			xfs_buf_item_relse(bp);
> > -			ASSERT(bp->b_log_item == NULL);
> > -		}
> > -		xfs_buf_relse(bp);
> > -	} else if (freed && remove) {
> > -		/*
> > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > +		 * that a racing writeback completion also failing the buffer
> > +		 * and running completion will not remove the last reference to
> > +		 * the bli and free it from under us.
> > +		 *
> >  		 * The buffer must be locked and held by the caller to simulate
> >  		 * an async I/O failure.
> >  		 */
> > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> >  		xfs_buf_hold(bp);
> >  		bp->b_flags |= XBF_ASYNC;
> >  		xfs_buf_ioend_fail(bp);
> > +		xfs_buf_item_relse(bp);
> 
> Did you mean for this to be xfs_buf_item_put() instead of _relse()? The

Yes. I did say "untested" which implies the patch isn't complete or will
work. It's just a demonstration of how this reference counting might
be done, not a complete, working solution. Patches are a much faster
way of explaining the overall solution that plain text...

> >  	}
> > +
> > +	/*
> > +	 * Stale buffer - only process it if this is the last reference to the
> > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > +	 * locked and have two references - once from the transaction commit and
> > +	 * one from the BLI - and we do not unlock and release transaction
> > +	 * reference until we've finished cleaning up the BLI.
> > +	 */
> > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > +		return;
> > +
> 
> If the buffer is stale, will this ever be the last reference now that
> _item_committed() bumps the refcount?

If the AIL has a reference, then no.

> This change also seems to have
> ramifications for the code that follows, such as if a staled buffer is
> already in the AIL (with a bli ref), would this code ever get to the
> point of removing it?

That's easy enough to handle - if the buffer is stale and we are in
this code, we hold the buffer locked. Hence we can remove from the
AIL if it is in the AIL and drop that reference, too. Indeed, this
code already does the AIL removal, so all this requires is a simple
rearrangement of the logic in this function....

The only difference is that we have to do this before we drop the
current reference we have on the BLI, which is not what we do now
and that's where all the problems lie.

> All in all, I'll reiterate that I think it would be nice to fix up the
> bli reference count handling in general, but I think the scope and
> complexity of that work is significantly beyond what is reasonably
> necessary to fix this bug.

And so leaving the underlying problem in the code for the next set
of changes someone does to trigger the problem in a differen way.
We've indentified what the root cause is, so can we please spend
the time to fix it properly?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-05-06  2:56     ` Dave Chinner
@ 2021-05-06 19:29       ` Brian Foster
  2021-06-02 13:32         ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-05-06 19:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > before we've locked the buffer and taken a reference to it for
> > > the fail path?
> > > 
> > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > simply frees the buf log item and doesn't care about the bli
> > > refcount at all. So the first ioend caller will free the buf log
> > > item regardless of whether there are other references to it at all.
> > > 
> > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > being tracked in the AIL has -zero- references to the bli and so it
> > > gets freed unconditionally on IO completion.
> > > 
> > > That seems to the be the problem here - the bli is not reference
> > > counted while it is the AIL....
> > > 
> > 
> > I think it depends on how you look at it. As you point out, we've had
> > this odd bli reference count pattern for as long as I can remember where
> 
> Yes, and it's been a constant source of use-after free bugs in
> shutdown processing for as long as I can remember. I want to fix it
> so we don't have to keep band-aiding this code every time we change
> how some part of log item or stale inode/buffer processing works...
> 

IME, most of the bugs in this area tend to be shutdown/error related and
generally relate to the complexity of all the various states and
contexts for the different callback contexts. That isn't a direct result
of this bli refcount behavior, as odd as it is, though that certainly
contributes to the overall complexity.

> 
> > >  
> > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > -
> > > +	/*
> > > +	 * We can wake pin waiters safely now because we still hold the
> > > +	 * bli_refcount that was taken when the pin was gained.
> > > +	 */
> > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > >  		wake_up_all(&bp->b_waiters);
> > >  
> > > -	if (freed && stale) {
> > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > -		ASSERT(xfs_buf_islocked(bp));
> > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > -
> > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > -
> > > -		if (remove) {
> > > -			/*
> > > -			 * If we are in a transaction context, we have to
> > > -			 * remove the log item from the transaction as we are
> > > -			 * about to release our reference to the buffer.  If we
> > > -			 * don't, the unlock that occurs later in
> > > -			 * xfs_trans_uncommit() will try to reference the
> > > -			 * buffer which we no longer have a hold on.
> > > -			 */
> > > -			if (!list_empty(&lip->li_trans))
> > > -				xfs_trans_del_item(lip);
> > > -
> > > -			/*
> > > -			 * Since the transaction no longer refers to the buffer,
> > > -			 * the buffer should no longer refer to the transaction.
> > > -			 */
> > > -			bp->b_transp = NULL;
> > > +	if (!stale) {
> > > +		if (!remove) {
> > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > +			atomic_dec(&bip->bli_refcount);
> > > +			return;
> > >  		}
> > 
> > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > drop of the final reference to the bli.
> 
> Yes, and that's the problem that needs fixing. The AIL needs a
> reference so that we aren't racing with writeback from the AIL to
> free the object because both sets of code run without actually
> holding an active reference to the BLI...
> 
> > This is not critical for the
> > common (!stale && !remove) case because that's basically a no-op here
> > outside of dropping the reference, and it looks like the stale buffer
> > handling code further down continues to follow that model, but in this
> > branch it seems we're trying to be clever in how the reference is
> > managed and as a result can act on a bli that might actually have
> > additional references.
> 
> Who cares? If something else has active references, then we must not
> free the bli or buffer here, anyway. The lack of active references
> by active BLI usres is why we keep getting use-after-free bugs in
> this code.....
> 

Well, this impacts more than just whether we free the buffer or not in
the abort case. This potentially runs the I/O failure sequence on a
pinned buffer, or blocks log I/O completion on a buffer lock that might
be held by a transaction.

I don't know if these are immediate problems or not and this is all
abort/shutdown related, but re: my point above around shutdown issues,
I'd prefer to try and avoid these kind of oddball quirks if we can vs.
just replace the historical quirks with new ones.

> > If so, I don't think it's appropriate to run
> > through the error sequence that follows.
> > 
> > >  
> > >  		/*
> > > -		 * If we get called here because of an IO error, we may or may
> > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > -		 * the AIL lock.
> > > -		 */
> > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > -			xfs_buf_item_done(bp);
> > > -			xfs_buf_inode_iodone(bp);
> > > -			ASSERT(list_empty(&bp->b_li_list));
> > > -		} else {
> > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > -			xfs_buf_item_relse(bp);
> > > -			ASSERT(bp->b_log_item == NULL);
> > > -		}
> > > -		xfs_buf_relse(bp);
> > > -	} else if (freed && remove) {
> > > -		/*
> > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > +		 * that a racing writeback completion also failing the buffer
> > > +		 * and running completion will not remove the last reference to
> > > +		 * the bli and free it from under us.
> > > +		 *
> > >  		 * The buffer must be locked and held by the caller to simulate
> > >  		 * an async I/O failure.
> > >  		 */
> > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > >  		xfs_buf_hold(bp);
> > >  		bp->b_flags |= XBF_ASYNC;
> > >  		xfs_buf_ioend_fail(bp);
> > > +		xfs_buf_item_relse(bp);
> > 
> > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> 
> Yes. I did say "untested" which implies the patch isn't complete or will
> work. It's just a demonstration of how this reference counting might
> be done, not a complete, working solution. Patches are a much faster
> way of explaining the overall solution that plain text...
> 

Sure, I'm just trying to clarify intent. It's a refcount patch so
whether we drop a refcount or explicitly free an objects is particularly
relevant. ;)

> > >  	}
> > > +
> > > +	/*
> > > +	 * Stale buffer - only process it if this is the last reference to the
> > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > +	 * locked and have two references - once from the transaction commit and
> > > +	 * one from the BLI - and we do not unlock and release transaction
> > > +	 * reference until we've finished cleaning up the BLI.
> > > +	 */
> > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > +		return;
> > > +
> > 
> > If the buffer is stale, will this ever be the last reference now that
> > _item_committed() bumps the refcount?
> 
> If the AIL has a reference, then no.
> 

Ok, but how would we get here without an AIL reference? That seems
impossible to me based on your patch.

> > This change also seems to have
> > ramifications for the code that follows, such as if a staled buffer is
> > already in the AIL (with a bli ref), would this code ever get to the
> > point of removing it?
> 
> That's easy enough to handle - if the buffer is stale and we are in
> this code, we hold the buffer locked. Hence we can remove from the
> AIL if it is in the AIL and drop that reference, too. Indeed, this
> code already does the AIL removal, so all this requires is a simple
> rearrangement of the logic in this function....
> 

Hmm, I don't think that's a correct assertion. If the buffer is stale
and we're in this code and we have the last reference to the bli, then
we know we hold the buffer locked. Otherwise, ISTM we can get here while
the transaction that staled the buffer might still own the lock.

> The only difference is that we have to do this before we drop the
> current reference we have on the BLI, which is not what we do now
> and that's where all the problems lie.
> 
> > All in all, I'll reiterate that I think it would be nice to fix up the
> > bli reference count handling in general, but I think the scope and
> > complexity of that work is significantly beyond what is reasonably
> > necessary to fix this bug.
> 
> And so leaving the underlying problem in the code for the next set
> of changes someone does to trigger the problem in a differen way.
> We've indentified what the root cause is, so can we please spend
> the time to fix it properly?
> 

That is not what I'm saying at all. I'm saying that I'd prefer to fix
the bug first and then step back and evaluate the overall refcount
design independently because the latter is quite complex and there are
all kinds of subtle interactions that the RFC patch just glazes over (by
design). For example, how log recovery processes bli's slightly
differently looks like yet another impedence mismatch from when the fs
is fully active.

So I propose to rework my patch a bit into something that reflects the
intended direction (see below for an untested diff) and proceed from
there...

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index fb69879e4b2b..7ff31788512b 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -475,17 +475,8 @@ xfs_buf_item_pin(
 }
 
 /*
- * This is called to unpin the buffer associated with the buf log
- * item which was previously pinned with a call to xfs_buf_item_pin().
- *
- * Also drop the reference to the buf item for the current transaction.
- * If the XFS_BLI_STALE flag is set and we are the last reference,
- * then free up the buf log item and unlock the buffer.
- *
- * If the remove flag is set we are called from uncommit in the
- * forced-shutdown path.  If that is true and the reference count on
- * the log item is going to drop to zero we need to free the item's
- * descriptor in the transaction.
+ * This is called to unpin the buffer associated with the buf log item which
+ * was previously pinned with a call to xfs_buf_item_pin().
  */
 STATIC void
 xfs_buf_item_unpin(
@@ -502,12 +493,26 @@ xfs_buf_item_unpin(
 
 	trace_xfs_buf_item_unpin(bip);
 
+	/*
+	 * Drop the bli ref associated with the pin and grab the hold required
+	 * for the I/O simulation failure in the abort case. We have to do this
+	 * before the pin count drops because the AIL doesn't acquire a bli
+	 * reference. Therefore if the refcount drops to zero, the bli could
+	 * still be AIL resident and the buffer submitted for I/O (and freed on
+	 * completion) at any point before we return. This can be removed once
+	 * the AIL properly holds a reference on the bli.
+	 */
 	freed = atomic_dec_and_test(&bip->bli_refcount);
-
+	if (freed && !stale && remove)
+		xfs_buf_hold(bp);
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 
-	if (freed && stale) {
+	 /* nothing to do but drop the pin count if the bli is active */
+	if (!freed)
+		return;
+
+	if (stale) {
 		ASSERT(bip->bli_flags & XFS_BLI_STALE);
 		ASSERT(xfs_buf_islocked(bp));
 		ASSERT(bp->b_flags & XBF_STALE);
@@ -550,13 +555,13 @@ xfs_buf_item_unpin(
 			ASSERT(bp->b_log_item == NULL);
 		}
 		xfs_buf_relse(bp);
-	} else if (freed && remove) {
+	} else if (remove) {
 		/*
 		 * The buffer must be locked and held by the caller to simulate
-		 * an async I/O failure.
+		 * an async I/O failure. We acquired the hold for this case
+		 * before the buffer was unpinned.
 		 */
 		xfs_buf_lock(bp);
-		xfs_buf_hold(bp);
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
 	}


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-05-06 19:29       ` Brian Foster
@ 2021-06-02 13:32         ` Brian Foster
  2021-06-02 16:02           ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-06-02 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > before we've locked the buffer and taken a reference to it for
> > > > the fail path?
> > > > 
> > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > simply frees the buf log item and doesn't care about the bli
> > > > refcount at all. So the first ioend caller will free the buf log
> > > > item regardless of whether there are other references to it at all.
> > > > 
> > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > gets freed unconditionally on IO completion.
> > > > 
> > > > That seems to the be the problem here - the bli is not reference
> > > > counted while it is the AIL....
> > > > 
> > > 
> > > I think it depends on how you look at it. As you point out, we've had
> > > this odd bli reference count pattern for as long as I can remember where
> > 
> > Yes, and it's been a constant source of use-after free bugs in
> > shutdown processing for as long as I can remember. I want to fix it
> > so we don't have to keep band-aiding this code every time we change
> > how some part of log item or stale inode/buffer processing works...
> > 
> 
> IME, most of the bugs in this area tend to be shutdown/error related and
> generally relate to the complexity of all the various states and
> contexts for the different callback contexts. That isn't a direct result
> of this bli refcount behavior, as odd as it is, though that certainly
> contributes to the overall complexity.
> 
> > 
> > > >  
> > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > -
> > > > +	/*
> > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > +	 */
> > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > >  		wake_up_all(&bp->b_waiters);
> > > >  
> > > > -	if (freed && stale) {
> > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > -
> > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > -
> > > > -		if (remove) {
> > > > -			/*
> > > > -			 * If we are in a transaction context, we have to
> > > > -			 * remove the log item from the transaction as we are
> > > > -			 * about to release our reference to the buffer.  If we
> > > > -			 * don't, the unlock that occurs later in
> > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > -			 * buffer which we no longer have a hold on.
> > > > -			 */
> > > > -			if (!list_empty(&lip->li_trans))
> > > > -				xfs_trans_del_item(lip);
> > > > -
> > > > -			/*
> > > > -			 * Since the transaction no longer refers to the buffer,
> > > > -			 * the buffer should no longer refer to the transaction.
> > > > -			 */
> > > > -			bp->b_transp = NULL;
> > > > +	if (!stale) {
> > > > +		if (!remove) {
> > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > +			atomic_dec(&bip->bli_refcount);
> > > > +			return;
> > > >  		}
> > > 
> > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > drop of the final reference to the bli.
> > 
> > Yes, and that's the problem that needs fixing. The AIL needs a
> > reference so that we aren't racing with writeback from the AIL to
> > free the object because both sets of code run without actually
> > holding an active reference to the BLI...
> > 
> > > This is not critical for the
> > > common (!stale && !remove) case because that's basically a no-op here
> > > outside of dropping the reference, and it looks like the stale buffer
> > > handling code further down continues to follow that model, but in this
> > > branch it seems we're trying to be clever in how the reference is
> > > managed and as a result can act on a bli that might actually have
> > > additional references.
> > 
> > Who cares? If something else has active references, then we must not
> > free the bli or buffer here, anyway. The lack of active references
> > by active BLI usres is why we keep getting use-after-free bugs in
> > this code.....
> > 
> 
> Well, this impacts more than just whether we free the buffer or not in
> the abort case. This potentially runs the I/O failure sequence on a
> pinned buffer, or blocks log I/O completion on a buffer lock that might
> be held by a transaction.
> 
> I don't know if these are immediate problems or not and this is all
> abort/shutdown related, but re: my point above around shutdown issues,
> I'd prefer to try and avoid these kind of oddball quirks if we can vs.
> just replace the historical quirks with new ones.
> 
> > > If so, I don't think it's appropriate to run
> > > through the error sequence that follows.
> > > 
> > > >  
> > > >  		/*
> > > > -		 * If we get called here because of an IO error, we may or may
> > > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > > -		 * the AIL lock.
> > > > -		 */
> > > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > -			xfs_buf_item_done(bp);
> > > > -			xfs_buf_inode_iodone(bp);
> > > > -			ASSERT(list_empty(&bp->b_li_list));
> > > > -		} else {
> > > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > -			xfs_buf_item_relse(bp);
> > > > -			ASSERT(bp->b_log_item == NULL);
> > > > -		}
> > > > -		xfs_buf_relse(bp);
> > > > -	} else if (freed && remove) {
> > > > -		/*
> > > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > > +		 * that a racing writeback completion also failing the buffer
> > > > +		 * and running completion will not remove the last reference to
> > > > +		 * the bli and free it from under us.
> > > > +		 *
> > > >  		 * The buffer must be locked and held by the caller to simulate
> > > >  		 * an async I/O failure.
> > > >  		 */
> > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > >  		xfs_buf_hold(bp);
> > > >  		bp->b_flags |= XBF_ASYNC;
> > > >  		xfs_buf_ioend_fail(bp);
> > > > +		xfs_buf_item_relse(bp);
> > > 
> > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > 
> > Yes. I did say "untested" which implies the patch isn't complete or will
> > work. It's just a demonstration of how this reference counting might
> > be done, not a complete, working solution. Patches are a much faster
> > way of explaining the overall solution that plain text...
> > 
> 
> Sure, I'm just trying to clarify intent. It's a refcount patch so
> whether we drop a refcount or explicitly free an objects is particularly
> relevant. ;)
> 
> > > >  	}
> > > > +
> > > > +	/*
> > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > +	 * locked and have two references - once from the transaction commit and
> > > > +	 * one from the BLI - and we do not unlock and release transaction
> > > > +	 * reference until we've finished cleaning up the BLI.
> > > > +	 */
> > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > +		return;
> > > > +
> > > 
> > > If the buffer is stale, will this ever be the last reference now that
> > > _item_committed() bumps the refcount?
> > 
> > If the AIL has a reference, then no.
> > 
> 
> Ok, but how would we get here without an AIL reference? That seems
> impossible to me based on your patch.
> 
> > > This change also seems to have
> > > ramifications for the code that follows, such as if a staled buffer is
> > > already in the AIL (with a bli ref), would this code ever get to the
> > > point of removing it?
> > 
> > That's easy enough to handle - if the buffer is stale and we are in
> > this code, we hold the buffer locked. Hence we can remove from the
> > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > code already does the AIL removal, so all this requires is a simple
> > rearrangement of the logic in this function....
> > 
> 
> Hmm, I don't think that's a correct assertion. If the buffer is stale
> and we're in this code and we have the last reference to the bli, then
> we know we hold the buffer locked. Otherwise, ISTM we can get here while
> the transaction that staled the buffer might still own the lock.
> 
> > The only difference is that we have to do this before we drop the
> > current reference we have on the BLI, which is not what we do now
> > and that's where all the problems lie.
> > 
> > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > bli reference count handling in general, but I think the scope and
> > > complexity of that work is significantly beyond what is reasonably
> > > necessary to fix this bug.
> > 
> > And so leaving the underlying problem in the code for the next set
> > of changes someone does to trigger the problem in a differen way.
> > We've indentified what the root cause is, so can we please spend
> > the time to fix it properly?
> > 
> 
> That is not what I'm saying at all. I'm saying that I'd prefer to fix
> the bug first and then step back and evaluate the overall refcount
> design independently because the latter is quite complex and there are
> all kinds of subtle interactions that the RFC patch just glazes over (by
> design). For example, how log recovery processes bli's slightly
> differently looks like yet another impedence mismatch from when the fs
> is fully active.
> 

Just an update on this particular bit...

I've probably spent about a week and a half now working through some
attempts to rework the bli refcount handling (going back to the drawing
board at least a couple times) and while I can get to something mostly
functional (surviving an fstests run), the last steps to test/prove a
solid/reliable implementation end up falling short. On extended testing
I end up sorting through the same kind of shutdown/unmount hang, use
after free, extremely subtle interactions that I've spent many hours
trying to stamp out over the past several years.

Because of that, I don't think this is something worth pushing for in
the short term. While the proposed idea sounds nice in theory, my take
away in practice is that the current design is in place for a reason
(i.e. the refcount historically looks like a transaction reference count
used to provide unlocked/isolated access through the log subsystem as
opposed to a traditional memory object lifecycle reference count) and
has quite a lot of incremental stability that has been baked in over
time.

I do still think this is (or should be :P) ultimately fixable, but I'm
wondering if we're probably better served by exploring some of the
historical warts and subtle rules/dependencies/hurdles of the current
model to see if they can be removed or simplified to the point where the
reference counting incrementally and naturally becomes a bit more
straightforward. I probably need to reset my brain and think a little
more about that...

Brian

> So I propose to rework my patch a bit into something that reflects the
> intended direction (see below for an untested diff) and proceed from
> there...
> 
> Brian
> 
> --- 8< ---
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index fb69879e4b2b..7ff31788512b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -475,17 +475,8 @@ xfs_buf_item_pin(
>  }
>  
>  /*
> - * This is called to unpin the buffer associated with the buf log
> - * item which was previously pinned with a call to xfs_buf_item_pin().
> - *
> - * Also drop the reference to the buf item for the current transaction.
> - * If the XFS_BLI_STALE flag is set and we are the last reference,
> - * then free up the buf log item and unlock the buffer.
> - *
> - * If the remove flag is set we are called from uncommit in the
> - * forced-shutdown path.  If that is true and the reference count on
> - * the log item is going to drop to zero we need to free the item's
> - * descriptor in the transaction.
> + * This is called to unpin the buffer associated with the buf log item which
> + * was previously pinned with a call to xfs_buf_item_pin().
>   */
>  STATIC void
>  xfs_buf_item_unpin(
> @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
>  
>  	trace_xfs_buf_item_unpin(bip);
>  
> +	/*
> +	 * Drop the bli ref associated with the pin and grab the hold required
> +	 * for the I/O simulation failure in the abort case. We have to do this
> +	 * before the pin count drops because the AIL doesn't acquire a bli
> +	 * reference. Therefore if the refcount drops to zero, the bli could
> +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> +	 * completion) at any point before we return. This can be removed once
> +	 * the AIL properly holds a reference on the bli.
> +	 */
>  	freed = atomic_dec_and_test(&bip->bli_refcount);
> -
> +	if (freed && !stale && remove)
> +		xfs_buf_hold(bp);
>  	if (atomic_dec_and_test(&bp->b_pin_count))
>  		wake_up_all(&bp->b_waiters);
>  
> -	if (freed && stale) {
> +	 /* nothing to do but drop the pin count if the bli is active */
> +	if (!freed)
> +		return;
> +
> +	if (stale) {
>  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
>  		ASSERT(xfs_buf_islocked(bp));
>  		ASSERT(bp->b_flags & XBF_STALE);
> @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
>  			ASSERT(bp->b_log_item == NULL);
>  		}
>  		xfs_buf_relse(bp);
> -	} else if (freed && remove) {
> +	} else if (remove) {
>  		/*
>  		 * The buffer must be locked and held by the caller to simulate
> -		 * an async I/O failure.
> +		 * an async I/O failure. We acquired the hold for this case
> +		 * before the buffer was unpinned.
>  		 */
>  		xfs_buf_lock(bp);
> -		xfs_buf_hold(bp);
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
>  	}


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-06-02 13:32         ` Brian Foster
@ 2021-06-02 16:02           ` Darrick J. Wong
  2021-06-02 16:31             ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2021-06-02 16:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote:
> On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > > before we've locked the buffer and taken a reference to it for
> > > > > the fail path?
> > > > > 
> > > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > > simply frees the buf log item and doesn't care about the bli
> > > > > refcount at all. So the first ioend caller will free the buf log
> > > > > item regardless of whether there are other references to it at all.
> > > > > 
> > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > > gets freed unconditionally on IO completion.
> > > > > 
> > > > > That seems to the be the problem here - the bli is not reference
> > > > > counted while it is the AIL....
> > > > > 
> > > > 
> > > > I think it depends on how you look at it. As you point out, we've had
> > > > this odd bli reference count pattern for as long as I can remember where
> > > 
> > > Yes, and it's been a constant source of use-after free bugs in
> > > shutdown processing for as long as I can remember. I want to fix it
> > > so we don't have to keep band-aiding this code every time we change
> > > how some part of log item or stale inode/buffer processing works...
> > > 
> > 
> > IME, most of the bugs in this area tend to be shutdown/error related and
> > generally relate to the complexity of all the various states and
> > contexts for the different callback contexts. That isn't a direct result
> > of this bli refcount behavior, as odd as it is, though that certainly
> > contributes to the overall complexity.
> > 
> > > 
> > > > >  
> > > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > -
> > > > > +	/*
> > > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > > +	 */
> > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > >  		wake_up_all(&bp->b_waiters);
> > > > >  
> > > > > -	if (freed && stale) {
> > > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > > -
> > > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > > -
> > > > > -		if (remove) {
> > > > > -			/*
> > > > > -			 * If we are in a transaction context, we have to
> > > > > -			 * remove the log item from the transaction as we are
> > > > > -			 * about to release our reference to the buffer.  If we
> > > > > -			 * don't, the unlock that occurs later in
> > > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > > -			 * buffer which we no longer have a hold on.
> > > > > -			 */
> > > > > -			if (!list_empty(&lip->li_trans))
> > > > > -				xfs_trans_del_item(lip);
> > > > > -
> > > > > -			/*
> > > > > -			 * Since the transaction no longer refers to the buffer,
> > > > > -			 * the buffer should no longer refer to the transaction.
> > > > > -			 */
> > > > > -			bp->b_transp = NULL;
> > > > > +	if (!stale) {
> > > > > +		if (!remove) {
> > > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > > +			atomic_dec(&bip->bli_refcount);
> > > > > +			return;
> > > > >  		}
> > > > 
> > > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > > drop of the final reference to the bli.
> > > 
> > > Yes, and that's the problem that needs fixing. The AIL needs a
> > > reference so that we aren't racing with writeback from the AIL to
> > > free the object because both sets of code run without actually
> > > holding an active reference to the BLI...
> > > 
> > > > This is not critical for the
> > > > common (!stale && !remove) case because that's basically a no-op here
> > > > outside of dropping the reference, and it looks like the stale buffer
> > > > handling code further down continues to follow that model, but in this
> > > > branch it seems we're trying to be clever in how the reference is
> > > > managed and as a result can act on a bli that might actually have
> > > > additional references.
> > > 
> > > Who cares? If something else has active references, then we must not
> > > free the bli or buffer here, anyway. The lack of active references
> > > by active BLI usres is why we keep getting use-after-free bugs in
> > > this code.....
> > > 
> > 
> > Well, this impacts more than just whether we free the buffer or not in
> > the abort case. This potentially runs the I/O failure sequence on a
> > pinned buffer, or blocks log I/O completion on a buffer lock that might
> > be held by a transaction.
> > 
> > I don't know if these are immediate problems or not and this is all
> > abort/shutdown related, but re: my point above around shutdown issues,
> > I'd prefer to try and avoid these kind of oddball quirks if we can vs.
> > just replace the historical quirks with new ones.
> > 
> > > > If so, I don't think it's appropriate to run
> > > > through the error sequence that follows.
> > > > 
> > > > >  
> > > > >  		/*
> > > > > -		 * If we get called here because of an IO error, we may or may
> > > > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > > > -		 * the AIL lock.
> > > > > -		 */
> > > > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > > -			xfs_buf_item_done(bp);
> > > > > -			xfs_buf_inode_iodone(bp);
> > > > > -			ASSERT(list_empty(&bp->b_li_list));
> > > > > -		} else {
> > > > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > > -			xfs_buf_item_relse(bp);
> > > > > -			ASSERT(bp->b_log_item == NULL);
> > > > > -		}
> > > > > -		xfs_buf_relse(bp);
> > > > > -	} else if (freed && remove) {
> > > > > -		/*
> > > > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > > > +		 * that a racing writeback completion also failing the buffer
> > > > > +		 * and running completion will not remove the last reference to
> > > > > +		 * the bli and free it from under us.
> > > > > +		 *
> > > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > >  		 * an async I/O failure.
> > > > >  		 */
> > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > > >  		xfs_buf_hold(bp);
> > > > >  		bp->b_flags |= XBF_ASYNC;
> > > > >  		xfs_buf_ioend_fail(bp);
> > > > > +		xfs_buf_item_relse(bp);
> > > > 
> > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > > 
> > > Yes. I did say "untested" which implies the patch isn't complete or will
> > > work. It's just a demonstration of how this reference counting might
> > > be done, not a complete, working solution. Patches are a much faster
> > > way of explaining the overall solution that plain text...
> > > 
> > 
> > Sure, I'm just trying to clarify intent. It's a refcount patch so
> > whether we drop a refcount or explicitly free an objects is particularly
> > relevant. ;)
> > 
> > > > >  	}
> > > > > +
> > > > > +	/*
> > > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > > +	 * locked and have two references - once from the transaction commit and
> > > > > +	 * one from the BLI - and we do not unlock and release transaction
> > > > > +	 * reference until we've finished cleaning up the BLI.
> > > > > +	 */
> > > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > > +		return;
> > > > > +
> > > > 
> > > > If the buffer is stale, will this ever be the last reference now that
> > > > _item_committed() bumps the refcount?
> > > 
> > > If the AIL has a reference, then no.
> > > 
> > 
> > Ok, but how would we get here without an AIL reference? That seems
> > impossible to me based on your patch.
> > 
> > > > This change also seems to have
> > > > ramifications for the code that follows, such as if a staled buffer is
> > > > already in the AIL (with a bli ref), would this code ever get to the
> > > > point of removing it?
> > > 
> > > That's easy enough to handle - if the buffer is stale and we are in
> > > this code, we hold the buffer locked. Hence we can remove from the
> > > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > > code already does the AIL removal, so all this requires is a simple
> > > rearrangement of the logic in this function....
> > > 
> > 
> > Hmm, I don't think that's a correct assertion. If the buffer is stale
> > and we're in this code and we have the last reference to the bli, then
> > we know we hold the buffer locked. Otherwise, ISTM we can get here while
> > the transaction that staled the buffer might still own the lock.
> > 
> > > The only difference is that we have to do this before we drop the
> > > current reference we have on the BLI, which is not what we do now
> > > and that's where all the problems lie.
> > > 
> > > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > > bli reference count handling in general, but I think the scope and
> > > > complexity of that work is significantly beyond what is reasonably
> > > > necessary to fix this bug.
> > > 
> > > And so leaving the underlying problem in the code for the next set
> > > of changes someone does to trigger the problem in a differen way.
> > > We've indentified what the root cause is, so can we please spend
> > > the time to fix it properly?
> > > 
> > 
> > That is not what I'm saying at all. I'm saying that I'd prefer to fix
> > the bug first and then step back and evaluate the overall refcount
> > design independently because the latter is quite complex and there are
> > all kinds of subtle interactions that the RFC patch just glazes over (by
> > design). For example, how log recovery processes bli's slightly
> > differently looks like yet another impedence mismatch from when the fs
> > is fully active.
> > 
> 
> Just an update on this particular bit...
> 
> I've probably spent about a week and a half now working through some
> attempts to rework the bli refcount handling (going back to the drawing
> board at least a couple times) and while I can get to something mostly
> functional (surviving an fstests run), the last steps to test/prove a
> solid/reliable implementation end up falling short. On extended testing
> I end up sorting through the same kind of shutdown/unmount hang, use
> after free, extremely subtle interactions that I've spent many hours
> trying to stamp out over the past several years.
> 
> Because of that, I don't think this is something worth pushing for in
> the short term. While the proposed idea sounds nice in theory, my take
> away in practice is that the current design is in place for a reason
> (i.e. the refcount historically looks like a transaction reference count
> used to provide unlocked/isolated access through the log subsystem as
> opposed to a traditional memory object lifecycle reference count) and
> has quite a lot of incremental stability that has been baked in over
> time.
> 
> I do still think this is (or should be :P) ultimately fixable, but I'm
> wondering if we're probably better served by exploring some of the
> historical warts and subtle rules/dependencies/hurdles of the current
> model to see if they can be removed or simplified to the point where the
> reference counting incrementally and naturally becomes a bit more
> straightforward. I probably need to reset my brain and think a little
> more about that...

What do you want to do in the meantime?

How about: Elevate this patch from RFC to regular patch status (perhaps
with an XXX comment outlining the gap as a breadcrumb to remind future
us?) and merge that so that we at least fix the immediate UAF problem?

I suspected that figuring out all the subtleties of the bli lifetimes
would be an intense effort.

--D

> 
> Brian
> 
> > So I propose to rework my patch a bit into something that reflects the
> > intended direction (see below for an untested diff) and proceed from
> > there...
> > 
> > Brian
> > 
> > --- 8< ---
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index fb69879e4b2b..7ff31788512b 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> >  }
> >  
> >  /*
> > - * This is called to unpin the buffer associated with the buf log
> > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > - *
> > - * Also drop the reference to the buf item for the current transaction.
> > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > - * then free up the buf log item and unlock the buffer.
> > - *
> > - * If the remove flag is set we are called from uncommit in the
> > - * forced-shutdown path.  If that is true and the reference count on
> > - * the log item is going to drop to zero we need to free the item's
> > - * descriptor in the transaction.
> > + * This is called to unpin the buffer associated with the buf log item which
> > + * was previously pinned with a call to xfs_buf_item_pin().
> >   */
> >  STATIC void
> >  xfs_buf_item_unpin(
> > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> >  
> >  	trace_xfs_buf_item_unpin(bip);
> >  
> > +	/*
> > +	 * Drop the bli ref associated with the pin and grab the hold required
> > +	 * for the I/O simulation failure in the abort case. We have to do this
> > +	 * before the pin count drops because the AIL doesn't acquire a bli
> > +	 * reference. Therefore if the refcount drops to zero, the bli could
> > +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> > +	 * completion) at any point before we return. This can be removed once
> > +	 * the AIL properly holds a reference on the bli.
> > +	 */
> >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> > -
> > +	if (freed && !stale && remove)
> > +		xfs_buf_hold(bp);
> >  	if (atomic_dec_and_test(&bp->b_pin_count))
> >  		wake_up_all(&bp->b_waiters);
> >  
> > -	if (freed && stale) {
> > +	 /* nothing to do but drop the pin count if the bli is active */
> > +	if (!freed)
> > +		return;
> > +
> > +	if (stale) {
> >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> >  		ASSERT(xfs_buf_islocked(bp));
> >  		ASSERT(bp->b_flags & XBF_STALE);
> > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> >  			ASSERT(bp->b_log_item == NULL);
> >  		}
> >  		xfs_buf_relse(bp);
> > -	} else if (freed && remove) {
> > +	} else if (remove) {
> >  		/*
> >  		 * The buffer must be locked and held by the caller to simulate
> > -		 * an async I/O failure.
> > +		 * an async I/O failure. We acquired the hold for this case
> > +		 * before the buffer was unpinned.
> >  		 */
> >  		xfs_buf_lock(bp);
> > -		xfs_buf_hold(bp);
> >  		bp->b_flags |= XBF_ASYNC;
> >  		xfs_buf_ioend_fail(bp);
> >  	}
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-06-02 16:02           ` Darrick J. Wong
@ 2021-06-02 16:31             ` Brian Foster
  2021-06-02 16:35               ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-06-02 16:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 02, 2021 at 09:02:38AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote:
> > On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> > > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > > > before we've locked the buffer and taken a reference to it for
> > > > > > the fail path?
> > > > > > 
> > > > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > > > simply frees the buf log item and doesn't care about the bli
> > > > > > refcount at all. So the first ioend caller will free the buf log
> > > > > > item regardless of whether there are other references to it at all.
> > > > > > 
> > > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > > > gets freed unconditionally on IO completion.
> > > > > > 
> > > > > > That seems to the be the problem here - the bli is not reference
> > > > > > counted while it is the AIL....
> > > > > > 
> > > > > 
> > > > > I think it depends on how you look at it. As you point out, we've had
> > > > > this odd bli reference count pattern for as long as I can remember where
> > > > 
> > > > Yes, and it's been a constant source of use-after free bugs in
> > > > shutdown processing for as long as I can remember. I want to fix it
> > > > so we don't have to keep band-aiding this code every time we change
> > > > how some part of log item or stale inode/buffer processing works...
> > > > 
> > > 
> > > IME, most of the bugs in this area tend to be shutdown/error related and
> > > generally relate to the complexity of all the various states and
> > > contexts for the different callback contexts. That isn't a direct result
> > > of this bli refcount behavior, as odd as it is, though that certainly
> > > contributes to the overall complexity.
> > > 
> > > > 
> > > > > >  
> > > > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > > -
> > > > > > +	/*
> > > > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > > > +	 */
> > > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > > >  		wake_up_all(&bp->b_waiters);
> > > > > >  
> > > > > > -	if (freed && stale) {
> > > > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > > > -
> > > > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > > > -
> > > > > > -		if (remove) {
> > > > > > -			/*
> > > > > > -			 * If we are in a transaction context, we have to
> > > > > > -			 * remove the log item from the transaction as we are
> > > > > > -			 * about to release our reference to the buffer.  If we
> > > > > > -			 * don't, the unlock that occurs later in
> > > > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > > > -			 * buffer which we no longer have a hold on.
> > > > > > -			 */
> > > > > > -			if (!list_empty(&lip->li_trans))
> > > > > > -				xfs_trans_del_item(lip);
> > > > > > -
> > > > > > -			/*
> > > > > > -			 * Since the transaction no longer refers to the buffer,
> > > > > > -			 * the buffer should no longer refer to the transaction.
> > > > > > -			 */
> > > > > > -			bp->b_transp = NULL;
> > > > > > +	if (!stale) {
> > > > > > +		if (!remove) {
> > > > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > > > +			atomic_dec(&bip->bli_refcount);
> > > > > > +			return;
> > > > > >  		}
> > > > > 
> > > > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > > > drop of the final reference to the bli.
> > > > 
> > > > Yes, and that's the problem that needs fixing. The AIL needs a
> > > > reference so that we aren't racing with writeback from the AIL to
> > > > free the object because both sets of code run without actually
> > > > holding an active reference to the BLI...
> > > > 
> > > > > This is not critical for the
> > > > > common (!stale && !remove) case because that's basically a no-op here
> > > > > outside of dropping the reference, and it looks like the stale buffer
> > > > > handling code further down continues to follow that model, but in this
> > > > > branch it seems we're trying to be clever in how the reference is
> > > > > managed and as a result can act on a bli that might actually have
> > > > > additional references.
> > > > 
> > > > Who cares? If something else has active references, then we must not
> > > > free the bli or buffer here, anyway. The lack of active references
> > > > by active BLI usres is why we keep getting use-after-free bugs in
> > > > this code.....
> > > > 
> > > 
> > > Well, this impacts more than just whether we free the buffer or not in
> > > the abort case. This potentially runs the I/O failure sequence on a
> > > pinned buffer, or blocks log I/O completion on a buffer lock that might
> > > be held by a transaction.
> > > 
> > > I don't know if these are immediate problems or not and this is all
> > > abort/shutdown related, but re: my point above around shutdown issues,
> > > I'd prefer to try and avoid these kind of oddball quirks if we can vs.
> > > just replace the historical quirks with new ones.
> > > 
> > > > > If so, I don't think it's appropriate to run
> > > > > through the error sequence that follows.
> > > > > 
> > > > > >  
> > > > > >  		/*
> > > > > > -		 * If we get called here because of an IO error, we may or may
> > > > > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > > > > -		 * the AIL lock.
> > > > > > -		 */
> > > > > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > > > -			xfs_buf_item_done(bp);
> > > > > > -			xfs_buf_inode_iodone(bp);
> > > > > > -			ASSERT(list_empty(&bp->b_li_list));
> > > > > > -		} else {
> > > > > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > > > -			xfs_buf_item_relse(bp);
> > > > > > -			ASSERT(bp->b_log_item == NULL);
> > > > > > -		}
> > > > > > -		xfs_buf_relse(bp);
> > > > > > -	} else if (freed && remove) {
> > > > > > -		/*
> > > > > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > > > > +		 * that a racing writeback completion also failing the buffer
> > > > > > +		 * and running completion will not remove the last reference to
> > > > > > +		 * the bli and free it from under us.
> > > > > > +		 *
> > > > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > > >  		 * an async I/O failure.
> > > > > >  		 */
> > > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > > > >  		xfs_buf_hold(bp);
> > > > > >  		bp->b_flags |= XBF_ASYNC;
> > > > > >  		xfs_buf_ioend_fail(bp);
> > > > > > +		xfs_buf_item_relse(bp);
> > > > > 
> > > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > > > 
> > > > Yes. I did say "untested" which implies the patch isn't complete or will
> > > > work. It's just a demonstration of how this reference counting might
> > > > be done, not a complete, working solution. Patches are a much faster
> > > > way of explaining the overall solution that plain text...
> > > > 
> > > 
> > > Sure, I'm just trying to clarify intent. It's a refcount patch so
> > > whether we drop a refcount or explicitly free an objects is particularly
> > > relevant. ;)
> > > 
> > > > > >  	}
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > > > +	 * locked and have two references - once from the transaction commit and
> > > > > > +	 * one from the BLI - and we do not unlock and release transaction
> > > > > > +	 * reference until we've finished cleaning up the BLI.
> > > > > > +	 */
> > > > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > > > +		return;
> > > > > > +
> > > > > 
> > > > > If the buffer is stale, will this ever be the last reference now that
> > > > > _item_committed() bumps the refcount?
> > > > 
> > > > If the AIL has a reference, then no.
> > > > 
> > > 
> > > Ok, but how would we get here without an AIL reference? That seems
> > > impossible to me based on your patch.
> > > 
> > > > > This change also seems to have
> > > > > ramifications for the code that follows, such as if a staled buffer is
> > > > > already in the AIL (with a bli ref), would this code ever get to the
> > > > > point of removing it?
> > > > 
> > > > That's easy enough to handle - if the buffer is stale and we are in
> > > > this code, we hold the buffer locked. Hence we can remove from the
> > > > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > > > code already does the AIL removal, so all this requires is a simple
> > > > rearrangement of the logic in this function....
> > > > 
> > > 
> > > Hmm, I don't think that's a correct assertion. If the buffer is stale
> > > and we're in this code and we have the last reference to the bli, then
> > > we know we hold the buffer locked. Otherwise, ISTM we can get here while
> > > the transaction that staled the buffer might still own the lock.
> > > 
> > > > The only difference is that we have to do this before we drop the
> > > > current reference we have on the BLI, which is not what we do now
> > > > and that's where all the problems lie.
> > > > 
> > > > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > > > bli reference count handling in general, but I think the scope and
> > > > > complexity of that work is significantly beyond what is reasonably
> > > > > necessary to fix this bug.
> > > > 
> > > > And so leaving the underlying problem in the code for the next set
> > > > of changes someone does to trigger the problem in a differen way.
> > > > We've indentified what the root cause is, so can we please spend
> > > > the time to fix it properly?
> > > > 
> > > 
> > > That is not what I'm saying at all. I'm saying that I'd prefer to fix
> > > the bug first and then step back and evaluate the overall refcount
> > > design independently because the latter is quite complex and there are
> > > all kinds of subtle interactions that the RFC patch just glazes over (by
> > > design). For example, how log recovery processes bli's slightly
> > > differently looks like yet another impedence mismatch from when the fs
> > > is fully active.
> > > 
> > 
> > Just an update on this particular bit...
> > 
> > I've probably spent about a week and a half now working through some
> > attempts to rework the bli refcount handling (going back to the drawing
> > board at least a couple times) and while I can get to something mostly
> > functional (surviving an fstests run), the last steps to test/prove a
> > solid/reliable implementation end up falling short. On extended testing
> > I end up sorting through the same kind of shutdown/unmount hang, use
> > after free, extremely subtle interactions that I've spent many hours
> > trying to stamp out over the past several years.
> > 
> > Because of that, I don't think this is something worth pushing for in
> > the short term. While the proposed idea sounds nice in theory, my take
> > away in practice is that the current design is in place for a reason
> > (i.e. the refcount historically looks like a transaction reference count
> > used to provide unlocked/isolated access through the log subsystem as
> > opposed to a traditional memory object lifecycle reference count) and
> > has quite a lot of incremental stability that has been baked in over
> > time.
> > 
> > I do still think this is (or should be :P) ultimately fixable, but I'm
> > wondering if we're probably better served by exploring some of the
> > historical warts and subtle rules/dependencies/hurdles of the current
> > model to see if they can be removed or simplified to the point where the
> > reference counting incrementally and naturally becomes a bit more
> > straightforward. I probably need to reset my brain and think a little
> > more about that...
> 
> What do you want to do in the meantime?
> 
> How about: Elevate this patch from RFC to regular patch status (perhaps
> with an XXX comment outlining the gap as a breadcrumb to remind future
> us?) and merge that so that we at least fix the immediate UAF problem?
> 

Well I had already sent a v1 [1] of the original UAF fix independent
from this work because I didn't want to bury the bug fix behind a
broader rework. I just wanted to follow up here because I've been
looking into the latter and this is where most of the discussion around
the design rework took place.

That said, I am currently looking at (yet another) alternative approach
that might be more in the direction of broadly cleaning things up vs.
stamping out an isolated problem. I just don't know how it will turn out
quite yet, so I could go either way between waiting on that and possibly
sending a v2, or going with v1 for now (assuming it's reviewed) and
basing subsequent cleanups on that. My instinct is that to leave a
record of an isolated fix is probably the proper thing to do regardless,
but I don't feel that strongly about it.

Brian

[1] https://lore.kernel.org/linux-xfs/20210511135257.878743-1-bfoster@redhat.com/

> I suspected that figuring out all the subtleties of the bli lifetimes
> would be an intense effort.
> 
> --D
> 
> > 
> > Brian
> > 
> > > So I propose to rework my patch a bit into something that reflects the
> > > intended direction (see below for an untested diff) and proceed from
> > > there...
> > > 
> > > Brian
> > > 
> > > --- 8< ---
> > > 
> > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > index fb69879e4b2b..7ff31788512b 100644
> > > --- a/fs/xfs/xfs_buf_item.c
> > > +++ b/fs/xfs/xfs_buf_item.c
> > > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> > >  }
> > >  
> > >  /*
> > > - * This is called to unpin the buffer associated with the buf log
> > > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > > - *
> > > - * Also drop the reference to the buf item for the current transaction.
> > > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > > - * then free up the buf log item and unlock the buffer.
> > > - *
> > > - * If the remove flag is set we are called from uncommit in the
> > > - * forced-shutdown path.  If that is true and the reference count on
> > > - * the log item is going to drop to zero we need to free the item's
> > > - * descriptor in the transaction.
> > > + * This is called to unpin the buffer associated with the buf log item which
> > > + * was previously pinned with a call to xfs_buf_item_pin().
> > >   */
> > >  STATIC void
> > >  xfs_buf_item_unpin(
> > > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> > >  
> > >  	trace_xfs_buf_item_unpin(bip);
> > >  
> > > +	/*
> > > +	 * Drop the bli ref associated with the pin and grab the hold required
> > > +	 * for the I/O simulation failure in the abort case. We have to do this
> > > +	 * before the pin count drops because the AIL doesn't acquire a bli
> > > +	 * reference. Therefore if the refcount drops to zero, the bli could
> > > +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> > > +	 * completion) at any point before we return. This can be removed once
> > > +	 * the AIL properly holds a reference on the bli.
> > > +	 */
> > >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > -
> > > +	if (freed && !stale && remove)
> > > +		xfs_buf_hold(bp);
> > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > >  		wake_up_all(&bp->b_waiters);
> > >  
> > > -	if (freed && stale) {
> > > +	 /* nothing to do but drop the pin count if the bli is active */
> > > +	if (!freed)
> > > +		return;
> > > +
> > > +	if (stale) {
> > >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > >  		ASSERT(xfs_buf_islocked(bp));
> > >  		ASSERT(bp->b_flags & XBF_STALE);
> > > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> > >  			ASSERT(bp->b_log_item == NULL);
> > >  		}
> > >  		xfs_buf_relse(bp);
> > > -	} else if (freed && remove) {
> > > +	} else if (remove) {
> > >  		/*
> > >  		 * The buffer must be locked and held by the caller to simulate
> > > -		 * an async I/O failure.
> > > +		 * an async I/O failure. We acquired the hold for this case
> > > +		 * before the buffer was unpinned.
> > >  		 */
> > >  		xfs_buf_lock(bp);
> > > -		xfs_buf_hold(bp);
> > >  		bp->b_flags |= XBF_ASYNC;
> > >  		xfs_buf_ioend_fail(bp);
> > >  	}
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-06-02 16:31             ` Brian Foster
@ 2021-06-02 16:35               ` Darrick J. Wong
  2021-06-02 16:40                 ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2021-06-02 16:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 02, 2021 at 12:31:54PM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 09:02:38AM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote:
> > > On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> > > > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > > > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > > > > before we've locked the buffer and taken a reference to it for
> > > > > > > the fail path?
> > > > > > > 
> > > > > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > > > > simply frees the buf log item and doesn't care about the bli
> > > > > > > refcount at all. So the first ioend caller will free the buf log
> > > > > > > item regardless of whether there are other references to it at all.
> > > > > > > 
> > > > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > > > > gets freed unconditionally on IO completion.
> > > > > > > 
> > > > > > > That seems to the be the problem here - the bli is not reference
> > > > > > > counted while it is the AIL....
> > > > > > > 
> > > > > > 
> > > > > > I think it depends on how you look at it. As you point out, we've had
> > > > > > this odd bli reference count pattern for as long as I can remember where
> > > > > 
> > > > > Yes, and it's been a constant source of use-after free bugs in
> > > > > shutdown processing for as long as I can remember. I want to fix it
> > > > > so we don't have to keep band-aiding this code every time we change
> > > > > how some part of log item or stale inode/buffer processing works...
> > > > > 
> > > > 
> > > > IME, most of the bugs in this area tend to be shutdown/error related and
> > > > generally relate to the complexity of all the various states and
> > > > contexts for the different callback contexts. That isn't a direct result
> > > > of this bli refcount behavior, as odd as it is, though that certainly
> > > > contributes to the overall complexity.
> > > > 
> > > > > 
> > > > > > >  
> > > > > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > > > -
> > > > > > > +	/*
> > > > > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > > > > +	 */
> > > > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > > > >  		wake_up_all(&bp->b_waiters);
> > > > > > >  
> > > > > > > -	if (freed && stale) {
> > > > > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > > > > -
> > > > > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > > > > -
> > > > > > > -		if (remove) {
> > > > > > > -			/*
> > > > > > > -			 * If we are in a transaction context, we have to
> > > > > > > -			 * remove the log item from the transaction as we are
> > > > > > > -			 * about to release our reference to the buffer.  If we
> > > > > > > -			 * don't, the unlock that occurs later in
> > > > > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > > > > -			 * buffer which we no longer have a hold on.
> > > > > > > -			 */
> > > > > > > -			if (!list_empty(&lip->li_trans))
> > > > > > > -				xfs_trans_del_item(lip);
> > > > > > > -
> > > > > > > -			/*
> > > > > > > -			 * Since the transaction no longer refers to the buffer,
> > > > > > > -			 * the buffer should no longer refer to the transaction.
> > > > > > > -			 */
> > > > > > > -			bp->b_transp = NULL;
> > > > > > > +	if (!stale) {
> > > > > > > +		if (!remove) {
> > > > > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > > > > +			atomic_dec(&bip->bli_refcount);
> > > > > > > +			return;
> > > > > > >  		}
> > > > > > 
> > > > > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > > > > drop of the final reference to the bli.
> > > > > 
> > > > > Yes, and that's the problem that needs fixing. The AIL needs a
> > > > > reference so that we aren't racing with writeback from the AIL to
> > > > > free the object because both sets of code run without actually
> > > > > holding an active reference to the BLI...
> > > > > 
> > > > > > This is not critical for the
> > > > > > common (!stale && !remove) case because that's basically a no-op here
> > > > > > outside of dropping the reference, and it looks like the stale buffer
> > > > > > handling code further down continues to follow that model, but in this
> > > > > > branch it seems we're trying to be clever in how the reference is
> > > > > > managed and as a result can act on a bli that might actually have
> > > > > > additional references.
> > > > > 
> > > > > Who cares? If something else has active references, then we must not
> > > > > free the bli or buffer here, anyway. The lack of active references
> > > > > by active BLI usres is why we keep getting use-after-free bugs in
> > > > > this code.....
> > > > > 
> > > > 
> > > > Well, this impacts more than just whether we free the buffer or not in
> > > > the abort case. This potentially runs the I/O failure sequence on a
> > > > pinned buffer, or blocks log I/O completion on a buffer lock that might
> > > > be held by a transaction.
> > > > 
> > > > I don't know if these are immediate problems or not and this is all
> > > > abort/shutdown related, but re: my point above around shutdown issues,
> > > > I'd prefer to try and avoid these kind of oddball quirks if we can vs.
> > > > just replace the historical quirks with new ones.
> > > > 
> > > > > > If so, I don't think it's appropriate to run
> > > > > > through the error sequence that follows.
> > > > > > 
> > > > > > >  
> > > > > > >  		/*
> > > > > > > -		 * If we get called here because of an IO error, we may or may
> > > > > > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > > > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > > > > > -		 * the AIL lock.
> > > > > > > -		 */
> > > > > > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > > > > -			xfs_buf_item_done(bp);
> > > > > > > -			xfs_buf_inode_iodone(bp);
> > > > > > > -			ASSERT(list_empty(&bp->b_li_list));
> > > > > > > -		} else {
> > > > > > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > > > > -			xfs_buf_item_relse(bp);
> > > > > > > -			ASSERT(bp->b_log_item == NULL);
> > > > > > > -		}
> > > > > > > -		xfs_buf_relse(bp);
> > > > > > > -	} else if (freed && remove) {
> > > > > > > -		/*
> > > > > > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > > > > > +		 * that a racing writeback completion also failing the buffer
> > > > > > > +		 * and running completion will not remove the last reference to
> > > > > > > +		 * the bli and free it from under us.
> > > > > > > +		 *
> > > > > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > > > >  		 * an async I/O failure.
> > > > > > >  		 */
> > > > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > > > > >  		xfs_buf_hold(bp);
> > > > > > >  		bp->b_flags |= XBF_ASYNC;
> > > > > > >  		xfs_buf_ioend_fail(bp);
> > > > > > > +		xfs_buf_item_relse(bp);
> > > > > > 
> > > > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > > > > 
> > > > > Yes. I did say "untested" which implies the patch isn't complete or will
> > > > > work. It's just a demonstration of how this reference counting might
> > > > > be done, not a complete, working solution. Patches are a much faster
> > > > > way of explaining the overall solution that plain text...
> > > > > 
> > > > 
> > > > Sure, I'm just trying to clarify intent. It's a refcount patch so
> > > > whether we drop a refcount or explicitly free an objects is particularly
> > > > relevant. ;)
> > > > 
> > > > > > >  	}
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > > > > +	 * locked and have two references - once from the transaction commit and
> > > > > > > +	 * one from the BLI - and we do not unlock and release transaction
> > > > > > > +	 * reference until we've finished cleaning up the BLI.
> > > > > > > +	 */
> > > > > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > > > > +		return;
> > > > > > > +
> > > > > > 
> > > > > > If the buffer is stale, will this ever be the last reference now that
> > > > > > _item_committed() bumps the refcount?
> > > > > 
> > > > > If the AIL has a reference, then no.
> > > > > 
> > > > 
> > > > Ok, but how would we get here without an AIL reference? That seems
> > > > impossible to me based on your patch.
> > > > 
> > > > > > This change also seems to have
> > > > > > ramifications for the code that follows, such as if a staled buffer is
> > > > > > already in the AIL (with a bli ref), would this code ever get to the
> > > > > > point of removing it?
> > > > > 
> > > > > That's easy enough to handle - if the buffer is stale and we are in
> > > > > this code, we hold the buffer locked. Hence we can remove from the
> > > > > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > > > > code already does the AIL removal, so all this requires is a simple
> > > > > rearrangement of the logic in this function....
> > > > > 
> > > > 
> > > > Hmm, I don't think that's a correct assertion. If the buffer is stale
> > > > and we're in this code and we have the last reference to the bli, then
> > > > we know we hold the buffer locked. Otherwise, ISTM we can get here while
> > > > the transaction that staled the buffer might still own the lock.
> > > > 
> > > > > The only difference is that we have to do this before we drop the
> > > > > current reference we have on the BLI, which is not what we do now
> > > > > and that's where all the problems lie.
> > > > > 
> > > > > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > > > > bli reference count handling in general, but I think the scope and
> > > > > > complexity of that work is significantly beyond what is reasonably
> > > > > > necessary to fix this bug.
> > > > > 
> > > > > And so leaving the underlying problem in the code for the next set
> > > > > of changes someone does to trigger the problem in a differen way.
> > > > > We've indentified what the root cause is, so can we please spend
> > > > > the time to fix it properly?
> > > > > 
> > > > 
> > > > That is not what I'm saying at all. I'm saying that I'd prefer to fix
> > > > the bug first and then step back and evaluate the overall refcount
> > > > design independently because the latter is quite complex and there are
> > > > all kinds of subtle interactions that the RFC patch just glazes over (by
> > > > design). For example, how log recovery processes bli's slightly
> > > > differently looks like yet another impedence mismatch from when the fs
> > > > is fully active.
> > > > 
> > > 
> > > Just an update on this particular bit...
> > > 
> > > I've probably spent about a week and a half now working through some
> > > attempts to rework the bli refcount handling (going back to the drawing
> > > board at least a couple times) and while I can get to something mostly
> > > functional (surviving an fstests run), the last steps to test/prove a
> > > solid/reliable implementation end up falling short. On extended testing
> > > I end up sorting through the same kind of shutdown/unmount hang, use
> > > after free, extremely subtle interactions that I've spent many hours
> > > trying to stamp out over the past several years.
> > > 
> > > Because of that, I don't think this is something worth pushing for in
> > > the short term. While the proposed idea sounds nice in theory, my take
> > > away in practice is that the current design is in place for a reason
> > > (i.e. the refcount historically looks like a transaction reference count
> > > used to provide unlocked/isolated access through the log subsystem as
> > > opposed to a traditional memory object lifecycle reference count) and
> > > has quite a lot of incremental stability that has been baked in over
> > > time.
> > > 
> > > I do still think this is (or should be :P) ultimately fixable, but I'm
> > > wondering if we're probably better served by exploring some of the
> > > historical warts and subtle rules/dependencies/hurdles of the current
> > > model to see if they can be removed or simplified to the point where the
> > > reference counting incrementally and naturally becomes a bit more
> > > straightforward. I probably need to reset my brain and think a little
> > > more about that...
> > 
> > What do you want to do in the meantime?
> > 
> > How about: Elevate this patch from RFC to regular patch status (perhaps
> > with an XXX comment outlining the gap as a breadcrumb to remind future
> > us?) and merge that so that we at least fix the immediate UAF problem?
> > 
> 
> Well I had already sent a v1 [1] of the original UAF fix independent
> from this work because I didn't want to bury the bug fix behind a
> broader rework. I just wanted to follow up here because I've been
> looking into the latter and this is where most of the discussion around
> the design rework took place.
> 
> That said, I am currently looking at (yet another) alternative approach
> that might be more in the direction of broadly cleaning things up vs.
> stamping out an isolated problem. I just don't know how it will turn out
> quite yet, so I could go either way between waiting on that and possibly
> sending a v2, or going with v1 for now (assuming it's reviewed) and
> basing subsequent cleanups on that. My instinct is that to leave a
> record of an isolated fix is probably the proper thing to do regardless,
> but I don't feel that strongly about it.

Aha, you /did/ send a v1.  I thought I had seen something like that.

Oh, I reviewed it too.  Hm.  Do you have any changes after v1.1, or do
you want me to pull that in as it is now?

--D

> 
> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/20210511135257.878743-1-bfoster@redhat.com/
> 
> > I suspected that figuring out all the subtleties of the bli lifetimes
> > would be an intense effort.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > So I propose to rework my patch a bit into something that reflects the
> > > > intended direction (see below for an untested diff) and proceed from
> > > > there...
> > > > 
> > > > Brian
> > > > 
> > > > --- 8< ---
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > > index fb69879e4b2b..7ff31788512b 100644
> > > > --- a/fs/xfs/xfs_buf_item.c
> > > > +++ b/fs/xfs/xfs_buf_item.c
> > > > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> > > >  }
> > > >  
> > > >  /*
> > > > - * This is called to unpin the buffer associated with the buf log
> > > > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > > > - *
> > > > - * Also drop the reference to the buf item for the current transaction.
> > > > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > > > - * then free up the buf log item and unlock the buffer.
> > > > - *
> > > > - * If the remove flag is set we are called from uncommit in the
> > > > - * forced-shutdown path.  If that is true and the reference count on
> > > > - * the log item is going to drop to zero we need to free the item's
> > > > - * descriptor in the transaction.
> > > > + * This is called to unpin the buffer associated with the buf log item which
> > > > + * was previously pinned with a call to xfs_buf_item_pin().
> > > >   */
> > > >  STATIC void
> > > >  xfs_buf_item_unpin(
> > > > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> > > >  
> > > >  	trace_xfs_buf_item_unpin(bip);
> > > >  
> > > > +	/*
> > > > +	 * Drop the bli ref associated with the pin and grab the hold required
> > > > +	 * for the I/O simulation failure in the abort case. We have to do this
> > > > +	 * before the pin count drops because the AIL doesn't acquire a bli
> > > > +	 * reference. Therefore if the refcount drops to zero, the bli could
> > > > +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> > > > +	 * completion) at any point before we return. This can be removed once
> > > > +	 * the AIL properly holds a reference on the bli.
> > > > +	 */
> > > >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > -
> > > > +	if (freed && !stale && remove)
> > > > +		xfs_buf_hold(bp);
> > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > >  		wake_up_all(&bp->b_waiters);
> > > >  
> > > > -	if (freed && stale) {
> > > > +	 /* nothing to do but drop the pin count if the bli is active */
> > > > +	if (!freed)
> > > > +		return;
> > > > +
> > > > +	if (stale) {
> > > >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > >  		ASSERT(xfs_buf_islocked(bp));
> > > >  		ASSERT(bp->b_flags & XBF_STALE);
> > > > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> > > >  			ASSERT(bp->b_log_item == NULL);
> > > >  		}
> > > >  		xfs_buf_relse(bp);
> > > > -	} else if (freed && remove) {
> > > > +	} else if (remove) {
> > > >  		/*
> > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > -		 * an async I/O failure.
> > > > +		 * an async I/O failure. We acquired the hold for this case
> > > > +		 * before the buffer was unpinned.
> > > >  		 */
> > > >  		xfs_buf_lock(bp);
> > > > -		xfs_buf_hold(bp);
> > > >  		bp->b_flags |= XBF_ASYNC;
> > > >  		xfs_buf_ioend_fail(bp);
> > > >  	}
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing
  2021-06-02 16:35               ` Darrick J. Wong
@ 2021-06-02 16:40                 ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2021-06-02 16:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 02, 2021 at 09:35:37AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 02, 2021 at 12:31:54PM -0400, Brian Foster wrote:
> > On Wed, Jun 02, 2021 at 09:02:38AM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote:
> > > > On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> > > > > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > > > > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > > > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > > > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > > > > > before we've locked the buffer and taken a reference to it for
> > > > > > > > the fail path?
> > > > > > > > 
> > > > > > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > > > > > simply frees the buf log item and doesn't care about the bli
> > > > > > > > refcount at all. So the first ioend caller will free the buf log
> > > > > > > > item regardless of whether there are other references to it at all.
> > > > > > > > 
> > > > > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > > > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > > > > > gets freed unconditionally on IO completion.
> > > > > > > > 
> > > > > > > > That seems to the be the problem here - the bli is not reference
> > > > > > > > counted while it is the AIL....
> > > > > > > > 
> > > > > > > 
> > > > > > > I think it depends on how you look at it. As you point out, we've had
> > > > > > > this odd bli reference count pattern for as long as I can remember where
> > > > > > 
> > > > > > Yes, and it's been a constant source of use-after free bugs in
> > > > > > shutdown processing for as long as I can remember. I want to fix it
> > > > > > so we don't have to keep band-aiding this code every time we change
> > > > > > how some part of log item or stale inode/buffer processing works...
> > > > > > 
> > > > > 
> > > > > IME, most of the bugs in this area tend to be shutdown/error related and
> > > > > generally relate to the complexity of all the various states and
> > > > > contexts for the different callback contexts. That isn't a direct result
> > > > > of this bli refcount behavior, as odd as it is, though that certainly
> > > > > contributes to the overall complexity.
> > > > > 
> > > > > > 
> > > > > > > >  
> > > > > > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > > > > -
> > > > > > > > +	/*
> > > > > > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > > > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > > > > > +	 */
> > > > > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > > > > >  		wake_up_all(&bp->b_waiters);
> > > > > > > >  
> > > > > > > > -	if (freed && stale) {
> > > > > > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > > > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > > > > > -
> > > > > > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > > > > > -
> > > > > > > > -		if (remove) {
> > > > > > > > -			/*
> > > > > > > > -			 * If we are in a transaction context, we have to
> > > > > > > > -			 * remove the log item from the transaction as we are
> > > > > > > > -			 * about to release our reference to the buffer.  If we
> > > > > > > > -			 * don't, the unlock that occurs later in
> > > > > > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > > > > > -			 * buffer which we no longer have a hold on.
> > > > > > > > -			 */
> > > > > > > > -			if (!list_empty(&lip->li_trans))
> > > > > > > > -				xfs_trans_del_item(lip);
> > > > > > > > -
> > > > > > > > -			/*
> > > > > > > > -			 * Since the transaction no longer refers to the buffer,
> > > > > > > > -			 * the buffer should no longer refer to the transaction.
> > > > > > > > -			 */
> > > > > > > > -			bp->b_transp = NULL;
> > > > > > > > +	if (!stale) {
> > > > > > > > +		if (!remove) {
> > > > > > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > > > > > +			atomic_dec(&bip->bli_refcount);
> > > > > > > > +			return;
> > > > > > > >  		}
> > > > > > > 
> > > > > > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > > > > > drop of the final reference to the bli.
> > > > > > 
> > > > > > Yes, and that's the problem that needs fixing. The AIL needs a
> > > > > > reference so that we aren't racing with writeback from the AIL to
> > > > > > free the object because both sets of code run without actually
> > > > > > holding an active reference to the BLI...
> > > > > > 
> > > > > > > This is not critical for the
> > > > > > > common (!stale && !remove) case because that's basically a no-op here
> > > > > > > outside of dropping the reference, and it looks like the stale buffer
> > > > > > > handling code further down continues to follow that model, but in this
> > > > > > > branch it seems we're trying to be clever in how the reference is
> > > > > > > managed and as a result can act on a bli that might actually have
> > > > > > > additional references.
> > > > > > 
> > > > > > Who cares? If something else has active references, then we must not
> > > > > > free the bli or buffer here, anyway. The lack of active references
> > > > > > by active BLI usres is why we keep getting use-after-free bugs in
> > > > > > this code.....
> > > > > > 
> > > > > 
> > > > > Well, this impacts more than just whether we free the buffer or not in
> > > > > the abort case. This potentially runs the I/O failure sequence on a
> > > > > pinned buffer, or blocks log I/O completion on a buffer lock that might
> > > > > be held by a transaction.
> > > > > 
> > > > > I don't know if these are immediate problems or not and this is all
> > > > > abort/shutdown related, but re: my point above around shutdown issues,
> > > > > I'd prefer to try and avoid these kind of oddball quirks if we can vs.
> > > > > just replace the historical quirks with new ones.
> > > > > 
> > > > > > > If so, I don't think it's appropriate to run
> > > > > > > through the error sequence that follows.
> > > > > > > 
> > > > > > > >  
> > > > > > > >  		/*
> > > > > > > > -		 * If we get called here because of an IO error, we may or may
> > > > > > > > -		 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > > > > > -		 * take care of that situation. xfs_trans_ail_delete() drops
> > > > > > > > -		 * the AIL lock.
> > > > > > > > -		 */
> > > > > > > > -		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > > > > > -			xfs_buf_item_done(bp);
> > > > > > > > -			xfs_buf_inode_iodone(bp);
> > > > > > > > -			ASSERT(list_empty(&bp->b_li_list));
> > > > > > > > -		} else {
> > > > > > > > -			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > > > > > -			xfs_buf_item_relse(bp);
> > > > > > > > -			ASSERT(bp->b_log_item == NULL);
> > > > > > > > -		}
> > > > > > > > -		xfs_buf_relse(bp);
> > > > > > > > -	} else if (freed && remove) {
> > > > > > > > -		/*
> > > > > > > > +		 * Fail the IO before we drop the bli refcount. This guarantees
> > > > > > > > +		 * that a racing writeback completion also failing the buffer
> > > > > > > > +		 * and running completion will not remove the last reference to
> > > > > > > > +		 * the bli and free it from under us.
> > > > > > > > +		 *
> > > > > > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > > > > >  		 * an async I/O failure.
> > > > > > > >  		 */
> > > > > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > > > > > >  		xfs_buf_hold(bp);
> > > > > > > >  		bp->b_flags |= XBF_ASYNC;
> > > > > > > >  		xfs_buf_ioend_fail(bp);
> > > > > > > > +		xfs_buf_item_relse(bp);
> > > > > > > 
> > > > > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > > > > > 
> > > > > > Yes. I did say "untested" which implies the patch isn't complete or will
> > > > > > work. It's just a demonstration of how this reference counting might
> > > > > > be done, not a complete, working solution. Patches are a much faster
> > > > > > way of explaining the overall solution that plain text...
> > > > > > 
> > > > > 
> > > > > Sure, I'm just trying to clarify intent. It's a refcount patch so
> > > > > whether we drop a refcount or explicitly free an objects is particularly
> > > > > relevant. ;)
> > > > > 
> > > > > > > >  	}
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > > > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > > > > > +	 * locked and have two references - once from the transaction commit and
> > > > > > > > +	 * one from the BLI - and we do not unlock and release transaction
> > > > > > > > +	 * reference until we've finished cleaning up the BLI.
> > > > > > > > +	 */
> > > > > > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > > > > > +		return;
> > > > > > > > +
> > > > > > > 
> > > > > > > If the buffer is stale, will this ever be the last reference now that
> > > > > > > _item_committed() bumps the refcount?
> > > > > > 
> > > > > > If the AIL has a reference, then no.
> > > > > > 
> > > > > 
> > > > > Ok, but how would we get here without an AIL reference? That seems
> > > > > impossible to me based on your patch.
> > > > > 
> > > > > > > This change also seems to have
> > > > > > > ramifications for the code that follows, such as if a staled buffer is
> > > > > > > already in the AIL (with a bli ref), would this code ever get to the
> > > > > > > point of removing it?
> > > > > > 
> > > > > > That's easy enough to handle - if the buffer is stale and we are in
> > > > > > this code, we hold the buffer locked. Hence we can remove from the
> > > > > > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > > > > > code already does the AIL removal, so all this requires is a simple
> > > > > > rearrangement of the logic in this function....
> > > > > > 
> > > > > 
> > > > > Hmm, I don't think that's a correct assertion. If the buffer is stale
> > > > > and we're in this code and we have the last reference to the bli, then
> > > > > we know we hold the buffer locked. Otherwise, ISTM we can get here while
> > > > > the transaction that staled the buffer might still own the lock.
> > > > > 
> > > > > > The only difference is that we have to do this before we drop the
> > > > > > current reference we have on the BLI, which is not what we do now
> > > > > > and that's where all the problems lie.
> > > > > > 
> > > > > > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > > > > > bli reference count handling in general, but I think the scope and
> > > > > > > complexity of that work is significantly beyond what is reasonably
> > > > > > > necessary to fix this bug.
> > > > > > 
> > > > > > And so leaving the underlying problem in the code for the next set
> > > > > > of changes someone does to trigger the problem in a differen way.
> > > > > > We've indentified what the root cause is, so can we please spend
> > > > > > the time to fix it properly?
> > > > > > 
> > > > > 
> > > > > That is not what I'm saying at all. I'm saying that I'd prefer to fix
> > > > > the bug first and then step back and evaluate the overall refcount
> > > > > design independently because the latter is quite complex and there are
> > > > > all kinds of subtle interactions that the RFC patch just glazes over (by
> > > > > design). For example, how log recovery processes bli's slightly
> > > > > differently looks like yet another impedence mismatch from when the fs
> > > > > is fully active.
> > > > > 
> > > > 
> > > > Just an update on this particular bit...
> > > > 
> > > > I've probably spent about a week and a half now working through some
> > > > attempts to rework the bli refcount handling (going back to the drawing
> > > > board at least a couple times) and while I can get to something mostly
> > > > functional (surviving an fstests run), the last steps to test/prove a
> > > > solid/reliable implementation end up falling short. On extended testing
> > > > I end up sorting through the same kind of shutdown/unmount hang, use
> > > > after free, extremely subtle interactions that I've spent many hours
> > > > trying to stamp out over the past several years.
> > > > 
> > > > Because of that, I don't think this is something worth pushing for in
> > > > the short term. While the proposed idea sounds nice in theory, my take
> > > > away in practice is that the current design is in place for a reason
> > > > (i.e. the refcount historically looks like a transaction reference count
> > > > used to provide unlocked/isolated access through the log subsystem as
> > > > opposed to a traditional memory object lifecycle reference count) and
> > > > has quite a lot of incremental stability that has been baked in over
> > > > time.
> > > > 
> > > > I do still think this is (or should be :P) ultimately fixable, but I'm
> > > > wondering if we're probably better served by exploring some of the
> > > > historical warts and subtle rules/dependencies/hurdles of the current
> > > > model to see if they can be removed or simplified to the point where the
> > > > reference counting incrementally and naturally becomes a bit more
> > > > straightforward. I probably need to reset my brain and think a little
> > > > more about that...
> > > 
> > > What do you want to do in the meantime?
> > > 
> > > How about: Elevate this patch from RFC to regular patch status (perhaps
> > > with an XXX comment outlining the gap as a breadcrumb to remind future
> > > us?) and merge that so that we at least fix the immediate UAF problem?
> > > 
> > 
> > Well I had already sent a v1 [1] of the original UAF fix independent
> > from this work because I didn't want to bury the bug fix behind a
> > broader rework. I just wanted to follow up here because I've been
> > looking into the latter and this is where most of the discussion around
> > the design rework took place.
> > 
> > That said, I am currently looking at (yet another) alternative approach
> > that might be more in the direction of broadly cleaning things up vs.
> > stamping out an isolated problem. I just don't know how it will turn out
> > quite yet, so I could go either way between waiting on that and possibly
> > sending a v2, or going with v1 for now (assuming it's reviewed) and
> > basing subsequent cleanups on that. My instinct is that to leave a
> > record of an isolated fix is probably the proper thing to do regardless,
> > but I don't feel that strongly about it.
> 
> Aha, you /did/ send a v1.  I thought I had seen something like that.
> 
> Oh, I reviewed it too.  Hm.  Do you have any changes after v1.1, or do
> you want me to pull that in as it is now?
> 

I had no pending changes after the v1.1 tweak of patch 2. Thanks.

Brian

> --D
> 
> > 
> > Brian
> > 
> > [1] https://lore.kernel.org/linux-xfs/20210511135257.878743-1-bfoster@redhat.com/
> > 
> > > I suspected that figuring out all the subtleties of the bli lifetimes
> > > would be an intense effort.
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > So I propose to rework my patch a bit into something that reflects the
> > > > > intended direction (see below for an untested diff) and proceed from
> > > > > there...
> > > > > 
> > > > > Brian
> > > > > 
> > > > > --- 8< ---
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > > > index fb69879e4b2b..7ff31788512b 100644
> > > > > --- a/fs/xfs/xfs_buf_item.c
> > > > > +++ b/fs/xfs/xfs_buf_item.c
> > > > > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > - * This is called to unpin the buffer associated with the buf log
> > > > > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > > > > - *
> > > > > - * Also drop the reference to the buf item for the current transaction.
> > > > > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > > > > - * then free up the buf log item and unlock the buffer.
> > > > > - *
> > > > > - * If the remove flag is set we are called from uncommit in the
> > > > > - * forced-shutdown path.  If that is true and the reference count on
> > > > > - * the log item is going to drop to zero we need to free the item's
> > > > > - * descriptor in the transaction.
> > > > > + * This is called to unpin the buffer associated with the buf log item which
> > > > > + * was previously pinned with a call to xfs_buf_item_pin().
> > > > >   */
> > > > >  STATIC void
> > > > >  xfs_buf_item_unpin(
> > > > > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> > > > >  
> > > > >  	trace_xfs_buf_item_unpin(bip);
> > > > >  
> > > > > +	/*
> > > > > +	 * Drop the bli ref associated with the pin and grab the hold required
> > > > > +	 * for the I/O simulation failure in the abort case. We have to do this
> > > > > +	 * before the pin count drops because the AIL doesn't acquire a bli
> > > > > +	 * reference. Therefore if the refcount drops to zero, the bli could
> > > > > +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> > > > > +	 * completion) at any point before we return. This can be removed once
> > > > > +	 * the AIL properly holds a reference on the bli.
> > > > > +	 */
> > > > >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > -
> > > > > +	if (freed && !stale && remove)
> > > > > +		xfs_buf_hold(bp);
> > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > >  		wake_up_all(&bp->b_waiters);
> > > > >  
> > > > > -	if (freed && stale) {
> > > > > +	 /* nothing to do but drop the pin count if the bli is active */
> > > > > +	if (!freed)
> > > > > +		return;
> > > > > +
> > > > > +	if (stale) {
> > > > >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > >  		ASSERT(xfs_buf_islocked(bp));
> > > > >  		ASSERT(bp->b_flags & XBF_STALE);
> > > > > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> > > > >  			ASSERT(bp->b_log_item == NULL);
> > > > >  		}
> > > > >  		xfs_buf_relse(bp);
> > > > > -	} else if (freed && remove) {
> > > > > +	} else if (remove) {
> > > > >  		/*
> > > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > > -		 * an async I/O failure.
> > > > > +		 * an async I/O failure. We acquired the hold for this case
> > > > > +		 * before the buffer was unpinned.
> > > > >  		 */
> > > > >  		xfs_buf_lock(bp);
> > > > > -		xfs_buf_hold(bp);
> > > > >  		bp->b_flags |= XBF_ASYNC;
> > > > >  		xfs_buf_ioend_fail(bp);
> > > > >  	}
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-06-02 16:40 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-03 12:18 [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing Brian Foster
2021-05-03 23:25 ` Dave Chinner
2021-05-05 11:50   ` Brian Foster
2021-05-06  2:56     ` Dave Chinner
2021-05-06 19:29       ` Brian Foster
2021-06-02 13:32         ` Brian Foster
2021-06-02 16:02           ` Darrick J. Wong
2021-06-02 16:31             ` Brian Foster
2021-06-02 16:35               ` Darrick J. Wong
2021-06-02 16:40                 ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.