Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
Date: Wed, 6 May 2020 10:01:50 -0700	[thread overview]
Message-ID: <20200506170150.GY5703@magnolia> (raw)
In-Reply-To: <20200506135656.GA5054@bfoster>

On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > > to the inode that is involved in the bmap replay operation.  If the
> > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > create a deferred op to finish the unmapping work, and we retain a
> > > > pointer to the incore inode.
> > > > 
> > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > > way to join inodes to the defer ops freezer so that we can maintain the
> > > > xfs_inode reference until we're done with the inode.
> > > > 
> > > > Note: This imposes the requirement that there be enough memory to keep
> > > > every incore inode in memory throughout recovery.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Maybe I'm missing something, but I thought the discussion on the
> > > previous version[1] landed on an approach where the intent would hold a
> > > reference to the inode. Wouldn't that break the dependency on the dfops
> > > freeze/thaw mechanism?
> > 
> > We did, but as I started pondering how to make this work in practice,
> > I soon realized that it's not as simple as "don't irele the inode":
> > 
> > (Part 1)
> > 
> > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > means "when you're done, unlock and irele the inode".  You'd have to add
> > this capability to any log item that deals with inodes, but only
> > deferred bmap items need this.
> > 
> > Log recovery looks like this:
> > 
> > ----------------
> > Transaction 0
> > BUI1 unmap file1, offset1, startblock1, length == 200000
> > ----------------
> > <end>
> > 
> > What happens if the first bunmapi call doesn't actually unmap
> > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > finish the work.  This means that we can't just drop the inode when
> > we're done processing BUI1; we have to propagate the "unlock and irele"
> > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > there before.
> > 
> > Then I thought about it some more, and wondered what would happen if the
> > log contained two unfinished bmap items for the same inode?  If this
> > happened, you'd deadlock because you've not released the ILOCK for the
> > first unfinished bmap item when you want to acquire the ILOCK for the
> > second unfinished bmap item.
> > 
> 
> Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> dfops at runtime, but that's not necessarily the case during recovery
> due to the dfops chaining behavior. To me, this raises the question of
> why we need to reorder/split these particular intents during recovery if
> the original operation would have completed under a single inode lock
> cycle.

Log recovery /did/ function that way (no reordering) until
509955823cc9c, when it was discovered that we don't have a good way to
reconstruct the dfops chain from the list of recovered log intent items.

> The comments in xfs_log_recover.c don't really explain it. From going
> back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> ops in order"), it looks like the issue is that ordering gets screwed up
> for operations that commit multiple intents per-transaction and
> completion of those intents creates new child intents. At runtime, we'd
> process the intents in the current transaction first and then get to the
> child intents in the order they were created. If we didn't have the
> recovery time dfops shuffling implemented in this patch, recovery would
> process the first intent in the log, then all of the child intents of
> that one before getting to the next intent in the log that presumably
> would have been second at runtime. Am I following that correctly?

Correct.  If we could figure out how to rebuild the dfops chain (and
thus the ordering requirements), we would be able to eliminate this
freezer stuff entirely.

> I think the rest makes sense from the perspective of not wanting to
> plumb the conditional locking complexity through the dfops/intent
> processing for every intent type that eventually needs it, as opposed to
> containing in the log recovery code (via dfops magic). I need to think
> about all that some more, but I'd like to make sure I understand why we
> have this recovery requirement in the first place.

<nod>  It took me a few hours to figure out why that patch was correct
the first time I saw it.  It took me a few hours again(!) when I was
trying to write this series. :/

--D

> Brian
> 
> > You'd have to drop the ILOCK (but retain the inode reference) between
> > the two bmap items.  I think this is currently impossible because there
> > aren't any transactions that queue multiple bmap intents per
> > transaction, nobody else holds the ilock, and the existing rmap
> > implementation of the swapext ioctl doesn't allow swapping a file with
> > itself.  However...
> > 
> > (Part 2)
> > 
> > The second thing I thought about is, what if there's a lot of work after
> > transaction 0?  Say the recovery stream looks like:
> > 
> > ----------------
> > Transaction 0
> > BUI1 unmap file1, offset1, startblock1, length == 200000
> > ----------------
> > <10000 more intents applied to other things>
> > ----------------
> > <end>
> > 
> > This is a problem, because we must drop the ILOCK after completing the
> > BUI1 work so that we can process the 10000 more intents before we can
> > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > inode, so we must release the ILOCK after we're done with BUI1 and
> > re-acquire it when we're ready to deal with BUI2.
> > 
> > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > that we need to irele the inode at the end of processing, and another to
> > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > to BUI2, but the second flag must /not/ be propagated.
> > 
> > A potential counterargument is that we crammed all 10000 intents into
> > the log without pinning the log, so maybe we can hold the ILOCK.
> > However....
> > 
> > (Part 3)
> > 
> > Next, I considered how the swapext log item in the atomic file update
> > patchset would interact with log recovery.  A couple of notes about the
> > swapext log items:
> > 
> > 1. They are careful not to create the kinds of bunmap operations that
> > would cause the creation of /more/ BUIs.
> > 
> > 2. Every time we log one extent's worth of unmap/remap operations (using
> > deferred bmap log items, naturally) we log an done item for the original
> > swapext intent item and immediately log another swapext intent for the
> > work remaining.
> > 
> > I came up with the following sequence of transactions to recover:
> > 
> > ----------------
> > Transaction 0
> > SXI0 file1, offset1, file2, offset2, length == 10
> > ----------------
> > Transaction 1
> > BUI1 unmap file1, offset1, startblock1, length == 2
> > BUI2 unmap file2, offset2, startblock2, length == 2
> > BUI3 map file1, offset1, startblock2, length == 2
> > BUI4 map file2, offset2, startblock1, length == 2
> > SXD done (SXI0)
> > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > ---------------
> > <end of log>
> > 
> > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > processing.  Each of the 5 intent items gets its own recovery
> > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > require ilock'd access to file1, which means that each of them will have
> > to have the ability to drop the ILOCK but retain the reference.  The
> > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > in our brave new world, file1 and file2 can be the same.
> > 
> > For the swapext log item type this inode management is particularly
> > acute because we're certain to have new SXIs created as a result of
> > recovering an SXI.
> > 
> > (End)
> > 
> > In conclusion, we need a mechanism to drop both the inode lock and the
> > inode reference.  Each log item type that works with inodes will have to
> > set aside 2 flags somewhere in the incore extent structure, and provide
> > more or less the same code to manage that state.  Right now that's just
> > the bmap items, but then the swapext items and the deferred xattr items
> > will have that same requirement when they get added.
> > 
> > If we do that, the inode reference and ilock management diverges even
> > further from regular operations.  Regular callers are expected to iget
> > and ilock the inode and maintain that all the way to the end of
> > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > state handling to some of the log items so that we can drop the ILOCK
> > after the first item, and then drop the inode reference after finishing
> > the last possible user of that inode in the revived dfops chain.
> > 
> > On the other hand, keeping the inode management in the defer freezer
> > code means that I keep all that complexity and state management out of
> > the ->finish_item code.  When we go to finish the new incore intents
> > that get created during recovery, the inode reference and ILOCK rules
> > are the same as anywhere else -- the caller is required to grab the
> > inode, the transaction, and the ilock (in that order).
> > 
> > To the extent that recovering log intent items is /not/ the same as
> > running a normal transaction, all that weirdness is concentrated in
> > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > behavior is the same across all the log intent item types, so I
> > decided in favor of keeping the centralised behavior and (trying to)
> > contain the wacky.  We don't need all that right away, but we will.
> > 
> > Note that I did make it so that we retain the reference always, so
> > there's no longer a need for irele/igrab cycles, which makes capturing
> > the dfops chain less error prone.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > 
> > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > @@ -16,6 +16,7 @@
> > > >  #include "xfs_inode.h"
> > > >  #include "xfs_inode_item.h"
> > > >  #include "xfs_trace.h"
> > > > +#include "xfs_icache.h"
> > > >  
> > > >  /*
> > > >   * Deferred Operations in XFS
> > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > >  	struct xfs_defer_freezer	*dff,
> > > >  	struct xfs_trans		*tp)
> > > >  {
> > > > +	int				i;
> > > > +
> > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > >  
> > > > +	/* Re-acquire the inode locks. */
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (!dff->dff_inodes[i])
> > > > +			break;
> > > > +
> > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > +	}
> > > > +
> > > >  	/* Add the dfops items to the transaction. */
> > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > > >  	struct xfs_defer_freezer	*dff)
> > > >  {
> > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > +	xfs_defer_freezer_irele(dff);
> > > >  	kmem_free(dff);
> > > >  }
> > > > +
> > > > +/*
> > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > + * deferred ops.
> > > > + */
> > > > +int
> > > > +xfs_defer_freezer_ijoin(
> > > > +	struct xfs_defer_freezer	*dff,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	unsigned int			i;
> > > > +
> > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > +
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (dff->dff_inodes[i] == ip)
> > > > +			goto out;
> > > > +		if (dff->dff_inodes[i] == NULL)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > +		ASSERT(0);
> > > > +		return -EFSCORRUPTED;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > +	 * assume the caller will need to allocate a transaction.
> > > > +	 */
> > > > +	dff->dff_inodes[i] = ip;
> > > > +	dff->dff_ilocks[i] = 0;
> > > > +out:
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +	return 0;
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > index 7ae05e10d750..0052a0313283 100644
> > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > >  	/* Deferred ops state saved from the transaction. */
> > > >  	struct list_head	dff_dfops;
> > > >  	unsigned int		dff_tpflags;
> > > > +
> > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > >  };
> > > >  
> > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > > >  		struct xfs_defer_freezer *dff);
> > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > +		struct xfs_inode *ip);
> > > > +
> > > > +/* These functions must be provided by the xfs implementation. */
> > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > >  
> > > >  #endif /* __XFS_DEFER_H__ */
> > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > >  	}
> > > >  
> > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > -	xfs_irele(ip);
> > > > -	return error;
> > > > +	if (error)
> > > > +		goto err_rele;
> > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > >  
> > > >  err_inode:
> > > >  	xfs_trans_cancel(tp);
> > > > +err_rele:
> > > >  	if (ip) {
> > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  		xfs_irele(ip);
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -12,6 +12,7 @@
> > > >  #include "xfs_sb.h"
> > > >  #include "xfs_mount.h"
> > > >  #include "xfs_inode.h"
> > > > +#include "xfs_defer.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_trans_priv.h"
> > > >  #include "xfs_inode_item.h"
> > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > >  	xfs_queue_eofblocks(mp);
> > > >  	xfs_queue_cowblocks(mp);
> > > >  }
> > > > +
> > > > +/* Release all the inode resources attached to this freezer. */
> > > > +void
> > > > +xfs_defer_freezer_irele(
> > > > +	struct xfs_defer_freezer	*dff)
> > > > +{
> > > > +	unsigned int			i;
> > > > +
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (!dff->dff_inodes[i])
> > > > +			break;
> > > > +
> > > > +		if (dff->dff_ilocks[i])
> > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > +		dff->dff_inodes[i] = NULL;
> > > > +	}
> > > > +}
> > > > 
> > > 
> > 
>