All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Alli <allison.henderson@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH RESEND v2 01/18] xfs: Fix multi-transaction larp replay
Date: Tue, 23 Aug 2022 08:07:30 -0700	[thread overview]
Message-ID: <YwTtMqOCtkxyox3x@magnolia> (raw)
In-Reply-To: <82cc6ff775832d34f32cdbfe9bd125487ec22226.camel@oracle.com>

On Thu, Aug 18, 2022 at 06:05:54PM -0700, Alli wrote:
> On Tue, 2022-08-16 at 13:41 -0700, Alli wrote:
> > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote:
> > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong
> > > > > > > > wrote:
> > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison
> > > > > > > > > Henderson
> > > > > > > > > wrote:
> > > > > > > > > > Recent parent pointer testing has exposed a bug in
> > > > > > > > > > the
> > > > > > > > > > underlying
> > > > > > > > > > attr replay.  A multi transaction replay currently
> > > > > > > > > > performs a
> > > > > > > > > > single step of the replay, then deferrs the rest if
> > > > > > > > > > there is
> > > > > > > > > > more
> > > > > > > > > > to do.
> > > > > > > > 
> > > > > > > > Yup.
> > > > > > > > 
> > > > > > > > > > This causes race conditions with other attr replays
> > > > > > > > > > that
> > > > > > > > > > might be recovered before the remaining deferred work
> > > > > > > > > > has had
> > > > > > > > > > a
> > > > > > > > > > chance to finish.
> > > > > > > > 
> > > > > > > > What other attr replays are we racing against?  There can
> > > > > > > > only be
> > > > > > > > one incomplete attr item intent/done chain per inode
> > > > > > > > present in
> > > > > > > > log
> > > > > > > > recovery, right?
> > > > > > > No, a rename queues up a set and remove before committing
> > > > > > > the
> > > > > > > transaction.  One for the new parent pointer, and another
> > > > > > > to
> > > > > > > remove
> > > > > > > the
> > > > > > > old one.
> > > > > > 
> > > > > > Ah. That really needs to be described in the commit message -
> > > > > > changing from "single intent chain per object" to "multiple
> > > > > > concurrent independent and unserialised intent chains per
> > > > > > object" is
> > > > > > a pretty important design rule change...
> > > > > > 
> > > > > > The whole point of intents is to allow complex, multi-stage
> > > > > > operations on a single object to be sequenced in a tightly
> > > > > > controlled manner. They weren't intended to be run as
> > > > > > concurrent
> > > > > > lines of modification on single items; if you need to do two
> > > > > > modifications on an object, the intent chain ties the two
> > > > > > modifications together into a single whole.
> > > > > > 
> > > > > > One of the reasons I rewrote the attr state machine for LARP
> > > > > > was to
> > > > > > enable new multiple attr operation chains to be easily build
> > > > > > from
> > > > > > the entry points the state machien provides. Parent attr
> > > > > > rename
> > > > > > needs a new intent chain to be built, not run multiple
> > > > > > independent
> > > > > > intent chains for each modification.
> > > > > > 
> > > > > > > It cant be an attr replace because technically the names
> > > > > > > are
> > > > > > > different.
> > > > > > 
> > > > > > I disagree - we have all the pieces we need in the state
> > > > > > machine
> > > > > > already, we just need to define separate attr names for the
> > > > > > remove and insert steps in the attr intent.
> > > > > > 
> > > > > > That is, the "replace" operation we execute when an attr set
> > > > > > overwrites the value is "technically" a "replace value"
> > > > > > operation,
> > > > > > but we actually implement it as a "replace entire attribute"
> > > > > > operation.
> > > > > > 
> > > > > > Without LARP, we do that overwrite in independent steps via
> > > > > > an
> > > > > > intermediate INCOMPLETE state to allow two xattrs of the same
> > > > > > name
> > > > > > to exist in the attr tree at the same time. IOWs, the attr
> > > > > > value
> > > > > > overwrite is effectively a "set-swap-remove" operation on two
> > > > > > entirely independent xattrs, ensuring that if we crash we
> > > > > > always
> > > > > > have either the old or new xattr visible.
> > > > > > 
> > > > > > With LARP, we can remove the original attr first, thereby
> > > > > > avoiding
> > > > > > the need for two versions of the xattr to exist in the tree
> > > > > > in
> > > > > > the
> > > > > > first place. However, we have to do these two operations as a
> > > > > > pair
> > > > > > of linked independent operations. The intent chain provides
> > > > > > the
> > > > > > linking, and requires us to log the name and the value of the
> > > > > > attr
> > > > > > that we are overwriting in the intent. Hence we can always
> > > > > > recover
> > > > > > the modification to completion no matter where in the
> > > > > > operation
> > > > > > we
> > > > > > fail.
> > > > > > 
> > > > > > When it comes to a parent attr rename operation, we are
> > > > > > effectively
> > > > > > doing two linked operations - remove the old attr, set the
> > > > > > new
> > > > > > attr
> > > > > > - on different attributes. Implementation wise, it is exactly
> > > > > > the
> > > > > > same sequence as a "replace value" operation, except for the
> > > > > > fact
> > > > > > that the new attr we add has a different name.
> > > > > > 
> > > > > > Hence the only real difference between the existing "attr
> > > > > > replace"
> > > > > > and the intent chain we need for "parent attr rename" is that
> > > > > > we
> > > > > > have to log two attr names instead of one. 
> > > > > 
> > > > > To be clear, this would imply expanding xfs_attri_log_format to
> > > > > have
> > > > > another alfi_new_name_len feild and another iovec for the attr
> > > > > intent
> > > > > right?  Does that cause issues to change the on disk log layout
> > > > > after
> > > > > the original has merged?  Or is that ok for things that are
> > > > > still
> > > > > experimental? Thanks!
> > > > 
> > > > I think we can get away with this quite easily without breaking
> > > > the
> > > > existing experimental code.
> > > > 
> > > > struct xfs_attri_log_format {
> > > >         uint16_t        alfi_type;      /* attri log item type */
> > > >         uint16_t        alfi_size;      /* size of this item */
> > > >         uint32_t        __pad;          /* pad to 64 bit aligned
> > > > */
> > > >         uint64_t        alfi_id;        /* attri identifier */
> > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > attr
> > > > operation */
> > > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > > or
> > > > remove */
> > > >         uint32_t        alfi_name_len;  /* attr name length */
> > > >         uint32_t        alfi_value_len; /* attr value length */
> > > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > > };
> > > > 
> > > > We have a padding field in there that is currently all zeros.
> > > > Let's
> > > > make that a count of the number of {name, value} tuples that are
> > > > appended to the format. i.e.
> > > > 
> > > > struct xfs_attri_log_name {
> > > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > > or
> > > > remove */
> > > >         uint32_t        alfi_name_len;  /* attr name length */
> > > >         uint32_t        alfi_value_len; /* attr value length */
> > > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > > };
> > > > 
> > > > struct xfs_attri_log_format {
> > > >         uint16_t        alfi_type;      /* attri log item type */
> > > >         uint16_t        alfi_size;      /* size of this item */
> > > > 	uint8_t		alfi_attr_cnt;	/* count of name/val
> > > > pairs
> > > > */
> > > >         uint8_t		__pad1;          /* pad to 64 bit
> > > > aligned */
> > > >         uint16_t	__pad2;          /* pad to 64 bit aligned */
> > > >         uint64_t        alfi_id;        /* attri identifier */
> > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > attr
> > > > operation */
> > > > 	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on
> > > > */
> > > > };
> > > > 
> > > > Basically, the size and shape of the structure has not changed,
> > > > and
> > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1
> > > > as
> > > > the backwards compat code for the existing code.
> > > > 
> > > > And then we just have as many followup regions for name/val pairs
> > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
> > > > structure. Each attr can have a different operation performed on
> > > > them, and they can have different filters applied so they can
> > > > exist
> > > > in different namespaces, too.
> > > > 
> > > > SO I don't think we need a new on-disk feature bit for this
> > > > enhancement - it definitely comes under the heading of "this
> > > > stuff
> > > > is experimental, this is the sort of early structure revision
> > > > that
> > > > EXPERIMENTAL is supposed to cover....
> > > 
> > > You might even callit "alfi_extra_names" to avoid the "0 means 1"
> > > stuff.
> > > ;)
> > > 
> > > --D
> > 
> > Oh, I just noticed these comments this morning when I sent out the
> > new
> > attri/d patch.  I'll add this changes to v2.  Please let me know if
> > there's anything else you'd like me to change from the v1.  Thx!
> > 
> > Allison
> 
> Ok, so I am part way through coding this up, and I'm getting this
> feeling like this is not going to work out very well due to the size
> checks for the log formats:
> 
> root@garnet:/home/achender/work_area/xfs-linux# git diff
> fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h
> diff --git a/fs/xfs/libxfs/xfs_log_format.h
> b/fs/xfs/libxfs/xfs_log_format.h
> index f1ff52ebb982..5a4e700f32fc 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -922,6 +922,13 @@ struct xfs_icreate_log {
>                                          XFS_ATTR_PARENT | \
>                                          XFS_ATTR_INCOMPLETE)
>  
> +struct xfs_attri_log_name {
> +       uint32_t        alfi_op_flags;  /* marks the op as a set or
> remove */
> +       uint32_t        alfi_name_len;  /* attr name length */
> +       uint32_t        alfi_value_len; /* attr value length */
> +       uint32_t        alfi_attr_filter;/* attr filter flags */
> +};
> +
>  /*
>   * This is the structure used to lay out an attr log item in the
>   * log.
> @@ -929,14 +936,12 @@ struct xfs_icreate_log {
>  struct xfs_attri_log_format {
>         uint16_t        alfi_type;      /* attri log item type */
>         uint16_t        alfi_size;      /* size of this item */
> -       uint32_t        __pad;          /* pad to 64 bit aligned */
> +       uint8_t         alfi_extra_names;/* count of name/val pairs */
> +       uint8_t         __pad1;         /* pad to 64 bit aligned */
> +       uint16_t        __pad2;         /* pad to 64 bit aligned */
>         uint64_t        alfi_id;        /* attri identifier */
>         uint64_t        alfi_ino;       /* the inode for this attr
> operation */
> -       uint32_t        alfi_op_flags;  /* marks the op as a set or
> remove */
> -       uint32_t        alfi_name_len;  /* attr name length */
> -       uint32_t        alfi_value_len; /* attr value length */
> -       uint32_t        alfi_attr_filter;/* attr filter flags */
> +       struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on

What's the length of this VLA?  1 for a normal SET or REPLACE
operation, and 2 for the "rename and replace value" operation?

If so, why do we need two xfs_attri_log_name structures?  The old value
is unimportant, so we only need one alfi_value_len per operation.  Each
xfs_attri_log_format only describes one change, so it only needs one
alfi_op_flags per op.

For now I also don't think attributes should be able to jump namespaces,
so we'd only need one alfi_attr_filter per op as well.

*lightbulb comes on*  Oops, I think I led you astray with my unfortunate
comment. :(

IOWs, the only change to struct xfs_attri_log_format is:

-       uint32_t        __pad;          /* pad to 64 bit aligned */
+       uint32_t        alfi_new_namelen;/* new attr name length */

and the rest of the changes in "[PATCH] xfs: Add new name to attri/d"
are more or less fine as is.

I'll go reply to that before I get back to Dave's log accounting stuff.

--D

> */
>  };
>  
>  struct xfs_attrd_log_format {
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 3e7f7eaa5b96..c040eeb88def 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void)
>         XFS_CHECK_STRUCT_SIZE(struct xfs_inode_log_format,      56);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_qoff_logformat,        20);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_trans_header,          16);
> -       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      48);
> +       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      24);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_attrd_log_format,      16);
>  
>         /* parent pointer ioctls */
> root@garnet:/home/achender/work_area/xfs-linux# 
> 
> 
> 
> If the on disk size check thinks the format is 24 bytes, and then we
> surprise pack an array of structs after it, isnt that going to run over
> the next item?  I think anything dynamic like this has to be an nvec.
>  Maybe we leave the existing alfi_* as they are so the size doesnt
> change, and then if we have a value in alfi_extra_names, then we have
> an extra nvec that has the array in it.  I think that would work.
> 
> FWIW, an alternate solution would be to use the pad for a second name
> length, and then we get a patch that's very similar to the one I sent
> out last Tues, but backward compatible.  Though it does eat the
> remaining pad and wouldn't be as flexible, I cant think of an attr op
> that would need more than two names either?
> 
> Let me know what people think.  Thanks!
> Allison
> 
> 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> 

  reply	other threads:[~2022-08-23 17:32 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-04 19:39 [PATCH RESEND v2 00/18] Parent Pointers Allison Henderson
2022-08-04 19:39 ` [PATCH RESEND v2 01/18] xfs: Fix multi-transaction larp replay Allison Henderson
2022-08-09 16:52   ` Darrick J. Wong
2022-08-10  1:58     ` Dave Chinner
2022-08-10  5:01       ` Alli
2022-08-10  6:12         ` Dave Chinner
2022-08-10 15:52           ` Darrick J. Wong
2022-08-10 19:28             ` Alli
2022-08-12  1:55           ` Alli
2022-08-12  3:05             ` Darrick J. Wong
2022-08-16  0:54             ` Dave Chinner
2022-08-16  5:07               ` Darrick J. Wong
2022-08-16 20:41                 ` Alli
2022-08-19  1:05                   ` Alli
2022-08-23 15:07                     ` Darrick J. Wong [this message]
2022-08-24 18:47                       ` Alli
2022-08-10  3:08     ` Alli
2022-08-04 19:39 ` [PATCH RESEND v2 02/18] xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 Allison Henderson
2022-08-09 16:38   ` Darrick J. Wong
2022-08-10  3:07     ` Alli
2022-08-04 19:39 ` [PATCH RESEND v2 03/18] xfs: Hold inode locks in xfs_ialloc Allison Henderson
2022-08-04 19:39 ` [PATCH RESEND v2 04/18] xfs: Hold inode locks in xfs_trans_alloc_dir Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 05/18] xfs: get directory offset when adding directory name Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 06/18] xfs: get directory offset when removing " Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 07/18] xfs: get directory offset when replacing a " Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 08/18] xfs: add parent pointer support to attribute code Allison Henderson
2022-08-09 16:54   ` Darrick J. Wong
2022-08-10  3:08     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 09/18] xfs: define parent pointer xattr format Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 10/18] xfs: Add xfs_verify_pptr Allison Henderson
2022-08-09 16:59   ` Darrick J. Wong
2022-08-10  3:08     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 11/18] xfs: extend transaction reservations for parent attributes Allison Henderson
2022-08-09 17:48   ` Darrick J. Wong
2022-08-10  3:08     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 12/18] xfs: parent pointer attribute creation Allison Henderson
2022-08-09 18:01   ` Darrick J. Wong
2022-08-09 18:13     ` Darrick J. Wong
2022-08-10  3:09       ` Alli
2022-08-10  3:08     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 13/18] xfs: add parent attributes to link Allison Henderson
2022-08-09 18:43   ` Darrick J. Wong
2022-08-10  3:09     ` Alli
2022-09-23 20:25       ` Darrick J. Wong
2022-08-04 19:40 ` [PATCH RESEND v2 14/18] xfs: remove parent pointers in unlink Allison Henderson
2022-08-09 18:45   ` Darrick J. Wong
2022-08-10  3:09     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 15/18] xfs: Add parent pointers to rename Allison Henderson
2022-08-09 18:49   ` Darrick J. Wong
2022-08-10  3:09     ` Alli
2022-08-04 19:40 ` [PATCH RESEND v2 16/18] xfs: Add the parent pointer support to the superblock version 5 Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 17/18] xfs: Add helper function xfs_attr_list_context_init Allison Henderson
2022-08-04 19:40 ` [PATCH RESEND v2 18/18] xfs: Add parent pointer ioctl Allison Henderson
2022-08-09 19:26   ` Darrick J. Wong
2022-08-10  3:09     ` Alli
2022-09-24  0:01       ` Darrick J. Wong
2022-08-09 22:55 ` [RFC PATCH 19/18] xfs: fix unit conversion error in xfs_log_calc_max_attrsetm_res Darrick J. Wong
2022-08-09 22:56 ` [RFC PATCH 20/18] xfs: drop compatibility minimum log size computations for reflink Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YwTtMqOCtkxyox3x@magnolia \
    --to=djwong@kernel.org \
    --cc=allison.henderson@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.