From: Alli <allison.henderson@oracle.com>
To: Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RESEND v2 01/18] xfs: Fix multi-transaction larp replay
Date: Tue, 09 Aug 2022 22:01:49 -0700
Message-ID: <373809e97f15e14d181fea6e170bfd8e37a9c9e4.camel@oracle.com>
In-Reply-To: <20220810015809.GK3600936@dread.disaster.area>

On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > Recent parent pointer testing has exposed a bug in the underlying
> > > attr replay.  A multi transaction replay currently performs a
> > > > single step of the replay, then defers the rest if there is more
> > > to do.
> 
> Yup.
> 
> > > This causes race conditions with other attr replays that
> > > might be recovered before the remaining deferred work has had a
> > > chance to finish.
> 
> What other attr replays are we racing against?  There can only be
> one incomplete attr item intent/done chain per inode present in log
> recovery, right?
No, a rename queues up both a set and a remove before committing the
transaction: one to add the new parent pointer, and another to remove
the old one.  It can't be an attr replace because technically the names
are different.

So the recovered set grows the leaf and returns -EAGAIN, then the rest
gets capture-committed.  Next up is the recovered remove, which pulls
out the fork, and that causes problems when the rest of the set
operation resumes as a deferred operation.  Here is the link to the
original discussion; it was quite a while ago:

https://lore.kernel.org/all/Yrzw9F5aGsaldrmR@magnolia/
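
To make the ordering concrete, here is a toy userspace model of it.
This is only a sketch: none of these names exist in XFS, and the three
steps are a simplification assumed for illustration, not the real attr
state machine.  It just shows why running the remove between the two
halves of the set fails, while finishing the set first does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an inode; NOT a real XFS structure. */
struct toy_inode {
	bool	has_attr_fork;	/* is there an attr fork at all? */
	int	nattrs;		/* number of attrs stored in it */
};

/* Hypothetical, heavily simplified steps of the two recovered intents. */
enum toy_step {
	TOY_SET_GROW_LEAF,	/* first step of the set (the -EAGAIN point) */
	TOY_SET_FINISH,		/* rest of the set, run later as deferred work */
	TOY_REMOVE_OLD,		/* whole remove of the old parent pointer */
};

/* Apply one step; returns 0 on success, -1 if the attr fork is gone. */
static int toy_apply(struct toy_inode *ip, enum toy_step step)
{
	if (!ip->has_attr_fork)
		return -1;

	switch (step) {
	case TOY_SET_GROW_LEAF:
		break;			/* makes room, adds nothing yet */
	case TOY_SET_FINISH:
		ip->nattrs++;		/* new parent pointer lands */
		break;
	case TOY_REMOVE_OLD:
		ip->nattrs--;
		if (ip->nattrs == 0)	/* last attr gone: drop the fork */
			ip->has_attr_fork = false;
		break;
	}
	return 0;
}

/* Run the steps in the given order, stopping at the first failure. */
static int toy_replay(struct toy_inode *ip, const enum toy_step *steps,
		      size_t nr)
{
	assert(ip != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (toy_apply(ip, steps[i]))
			return -1;
	}
	return 0;
}

/* Ordering before the patch: one set step, then the remove, then the rest. */
static int toy_replay_interleaved(void)
{
	struct toy_inode ip = { .has_attr_fork = true, .nattrs = 1 };
	const enum toy_step steps[] = {
		TOY_SET_GROW_LEAF, TOY_REMOVE_OLD, TOY_SET_FINISH,
	};

	return toy_replay(&ip, steps, 3);
}

/* Ordering with everything deferred: the set completes before the remove. */
static int toy_replay_deferred(void)
{
	struct toy_inode ip = { .has_attr_fork = true, .nattrs = 1 };
	const enum toy_step steps[] = {
		TOY_SET_GROW_LEAF, TOY_SET_FINISH, TOY_REMOVE_OLD,
	};

	return toy_replay(&ip, steps, 3);
}
```

In this model toy_replay_interleaved() fails on the last step because
the remove already tore down the fork, and toy_replay_deferred()
succeeds.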

I hope that helps?
Allison

> 
> > > This can lead to interleaved set and remove
> > > operations that may clobber the attribute fork.  Fix this by
> > > deferring all work for any attribute operation.
> 
> Which means this should be an impossible situation.
> 
> That is, if we crash before the final attrd DONE intent is written
> to the log, it means that new attr intents for modifications made
> *after* the current attr modification was completed will not be
> present in the log. We have strict ordering of committed operations
> in the journal, hence an operation on an inode that has an incomplete
> intent *must* be the last operation and the *only* incomplete intent
> that is found in the journal for that inode.
> 
> Hence from an operational ordering perspective, this explanation
> for the issue being seen doesn't make any sense to me.  If there are
> multiple incomplete attri intents then we've either got a runtime
> journalling problem (a white-out issue? failing to relog the inode
> in each new intent?) or a log recovery problem (failing to match
> intent-done pairs correctly?), not a recovery deferral issue.
> 
> Hence I think we're still looking for the root cause of this
> problem...
> 
> > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> > > ---
> > >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> > >  1 file changed, 8 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > > index 5077a7ad5646..c13d724a3e13 100644
> > > --- a/fs/xfs/xfs_attr_item.c
> > > +++ b/fs/xfs/xfs_attr_item.c
> > > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> > >  		break;
> > >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> > >  		if (!xfs_inode_hasattr(args->dp))
> > > -			goto out;
> > > +			return 0;
> > >  		attr->xattri_dela_state =
> > > xfs_attr_init_remove_state(args);
> > >  		break;
> > >  	default:
> > >  		ASSERT(0);
> > > -		error = -EFSCORRUPTED;
> > > -		goto out;
> > > +		return -EFSCORRUPTED;
> > >  	}
> > >  
> > >  	xfs_init_attr_trans(args, &tres, &total);
> > >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE,
> > > &tp);
> > >  	if (error)
> > > -		goto out;
> > > +		return error;
> > >  
> > >  	args->trans = tp;
> > >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> > >  
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > -	error = xfs_xattri_finish_update(attr, done_item);
> > > -	if (error == -EAGAIN) {
> > > -		/*
> > > -		 * There's more work to do, so add the intent item to
> > > this
> > > -		 * transaction so that we can continue it later.
> > > -		 */
> > > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr-
> > > >xattri_list);
> > > -		error = xfs_defer_ops_capture_and_commit(tp,
> > > capture_list);
> > > -		if (error)
> > > -			goto out_unlock;
> > > -
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -		xfs_irele(ip);
> > > -		return 0;
> > > -	}
> > > -	if (error) {
> > > -		xfs_trans_cancel(tp);
> > > -		goto out_unlock;
> > > -	}
> > > -
> > > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> > 
> > This seems a little convoluted to me.  Maybe?  Maybe not?
> > 
> > 1. Log recovery recreates an incore xfs_attri_log_item from what it
> > finds in the log.
> > 
> > 2. This function then logs an xattrd for the recovered xattri item.
> > 
> > 3. Then it creates a new xfs_attr_intent to complete the operation.
> > 
> > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a
> > new xattri for the intent created in step 3 and also commits the
> > xattrd for the first xattri.
> > 
> > IOWs, the only difference between before and after is that we're not
> > advancing one more step through the state machine as part of log
> > recovery.  From the perspective of the log, the recovery function
> > merely replaces the recovered xattri log item with a new one.
> > 
> > Why can't we just attach the recovered xattri to the
> > xfs_defer_pending that is created to point to the xfs_attr_intent
> > that's created in step 3, and skip the xattrd?
> 
> Remember that attribute intents are different to all other intent
> types that we have. The existing extent based intents define a
> single independent operation that needs to be performed, and each
> step of the intent chain is completely independent of the previous
> step in the chain.  e.g. removing the extent from the rmap btree is
> completely independent of removing it from the inode bmap btree -
> all that matters is that the removal from the bmbt happens first.
> The rmapbt removal can happen at any time after that, and is
> completely independent of any other bmbt or rmapbt operation.
> Similarly, the EFI can be processed independently of all bmapbt and
> rmapbt modifications, it just has to happen after those
> modifications are done.
> 
> Hence if we crash during recovery, we can just restart from
> where-ever we got to in the middle of the intent chains and not have
> to care at all.  IOWs, eventual consistency works with these chains
> because there are no dependencies between each step of the intent
> chain and each step is completely independent of the other steps.
> 
> Attribute intent chains are completely different. They link steps in
> a state machine together in a non-trivial, highly dependent chain.
> We can't just restart the chain in the middle like we can for the
> BUI->RUI->CUI->EFI chain because the on-disk attribute is in an
> unknown state and recovering that exact state is .... complex.
> 
> Hence the first step of recovery is to return the attribute we
> are trying to modify back to a known state. That means we have to
> perform a removal of any existing attribute under that name first.
> Hence this first step should be replacing the existing attr intent
> with the intent that defines the recovery operation we are going to
> perform.
> 
> That means we need to translate set to replace so that cleanup is
> run first; replace needs to clean up the attr under that name
> regardless of whether it has the incomplete bit set on it or not.
> Remove is the only operation that runs the same as at runtime, as
> cleanup for remove is just repeating the remove operation from
> scratch.
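
A sketch of the translation described above, as a plain C helper.  The
enum and function names are hypothetical: they do not correspond to the
XFS_ATTRI_OP_FLAGS_* values or to any existing function.  The helper
only encodes the stated rule that set and replace both recover as
replace, while remove recovers as remove.

```c
#include <assert.h>

/* Hypothetical op codes for illustration; not the on-disk flag values. */
enum toy_attr_op {
	TOY_ATTR_SET,
	TOY_ATTR_REPLACE,
	TOY_ATTR_REMOVE,
};

/*
 * Map the operation recorded in a recovered intent to the operation
 * that recovery would run, following the rule quoted above.
 */
static enum toy_attr_op toy_recovery_op(enum toy_attr_op logged)
{
	switch (logged) {
	case TOY_ATTR_SET:
		/* Unknown on-disk state: clean up the name first. */
		return TOY_ATTR_REPLACE;
	case TOY_ATTR_REPLACE:
		/* Cleans up whether or not the incomplete bit is set. */
		return TOY_ATTR_REPLACE;
	case TOY_ATTR_REMOVE:
		/* Cleanup is just the remove, repeated from scratch. */
		return TOY_ATTR_REMOVE;
	}

	/* Unknown op code: a caller bug in this toy model. */
	assert(0);
	return logged;
}
```

For example, toy_recovery_op(TOY_ATTR_SET) yields TOY_ATTR_REPLACE, so
the rewritten intent would describe the cleanup-then-set that recovery
actually performs.
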
> 
> > I /think/ the answer to that question is that we might need to move
> > the log tail forward to free enough log space to finish the intent
> > items, so creating the extra xattrd/xattri (a) avoids the complexity
> > of submitting an incore intent item *and* a log intent item to the
> > defer ops machinery; and (b) avoids livelocks in log recovery.
> > Therefore, we actually need to do it this way.
> 
> We really need the initial operation to rewrite the intent to match
> the recovery operation we are going to perform. Everything else is
> secondary.
> 
> Cheers,
> 
> Dave.

