All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Naohiro Aota <Naohiro.Aota@wdc.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"kernel-team@fb.com" <kernel-team@fb.com>
Subject: Re: [PATCH v5 4/5] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
Date: Thu, 23 Mar 2023 09:22:27 -0700	[thread overview]
Message-ID: <20230323162227.GB8070@zen> (raw)
In-Reply-To: <20230323083608.m2ut2whbk2smjjpu@naota-xeon>

On Thu, Mar 23, 2023 at 08:36:08AM +0000, Naohiro Aota wrote:
> On Wed, Mar 22, 2023 at 12:11:51PM -0700, Boris Burkov wrote:
> > if pre != 0 in btrfs_split_ordered_extent, then we do the following:
> > 1. remove ordered (at file_offset) from the rb tree
> > 2. modify file_offset+=pre
> > 3. re-insert ordered
> > 4. clone an ordered extent at offset 0 length pre from ordered.
> > 5. clone an ordered extent for the post range, if necessary.
> > 
> > step 4 is not correct, as at this point, the start of ordered is already
> > the end of the desired new pre extent. Further this causes a panic when
> > btrfs_alloc_ordered_extent sees that the node (from the modified and
> > re-inserted ordered) is already present at file_offset + 0 = file_offset.
> > 
> > We can fix this by either using a negative offset, or by moving the
> > clone of the pre extent to after we remove the original one, but before
> > we modify and re-insert it. The former feels quite kludgy, as we are
> > "cloning" from outside the range of the ordered extent, so opt for the
> > latter, which does have some locking annoyances.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/ordered-data.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index 4bebebb9b434..d14a3fe1a113 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
> >  	if (tree->last == node)
> >  		tree->last = NULL;
> >  
> > +	if (pre) {
> > +		spin_unlock_irq(&tree->lock);
> > +		oe = clone_ordered_extent(ordered, 0, pre);
> > +		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> > +		if (!ret && ret_pre)
> > +			*ret_pre = oe;
> > +		if (ret)
> > +			goto out;
> 
> How about just "return ret;"?
> 
> > +		spin_lock_irq(&tree->lock);
> 
> I'm concerned about the locking too. Before this spin_lock_irq() is taken,
> nothing in the ordered extent range in the tree. So, maybe, someone might
> insert or lookup that range in the meantime, and fail? Well, this function
> is called under the IO for this range, so it might be OK, though...
> 
> So, I considered another approach that factoring out some parts of
> btrfs_add_ordered_extent() and use them to rewrite
> btrfs_split_ordered_extent().
> 
> btrfs_add_ordered_extent() is doing three things:
> 
> 1. Space accounting
>    - btrfs_qgroup_free_data() or btrfs_qgroup_release_data()
>    - percpu_counter_add_batch(...)
> 2. Allocating and initializing btrfs_ordered_extent
> 3. Adding the btrfs_ordered_extent entry to trees, incrementing OE counter
>    - tree_insert(&tree->tree, ...)
>    - list_add_tail(&entry->root_extent_list, &root->ordered_extents);
>    - btrfs_mod_outstanding_extents(inode, 1);
> 
> For btrfs_split_ordered_extent(), we don't need to do #1 above as it was
> already done for the original ordered extent. So, if we factor #2 to a
> function e.g, init_ordered_extent(), we can rewrite clone_ordered_extent()
> to return a cloned OE (doing #2), and also rewrite
> btrfs_split_ordered_extent() o do #3 as following:
> 
> /* clone_ordered_extent() now returns new ordered extent. */
> /* It is not inserted into the trees, yet. */
> static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
> 				u64 len)
> {
> 	struct inode *inode = ordered->inode;
> 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> 	u64 file_offset = ordered->file_offset + pos;
> 	u64 disk_bytenr = ordered->disk_bytenr + pos;
> 	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
> 
> 	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
> 	return init_ordered_extent(BTRFS_I(inode), file_offset, len, len,
> 				   disk_bytenr, len, 0, flags,
> 				   ordered->compress_type);
> }
> 
> int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
> 				u64 post)
> {
> ...
> 	/* clone new OEs first */
> 	if (pre)
> 		pre_oe = clone_ordered_extent(ordered, 0, pre);
> 	if (post)
> 		post_oe = clone_ordered_extent(ordered, ordered->disk_num_bytes - post, ost);
> 	/* check pre_oe and post_oe */
> 
> 	spin_lock_irq(&tree->lock);
> 	/* remove original OE from tree */
> 	...
> 
> 	/* modify the original OE params */
> 	ordered->file_offset += pre;
> 	...
> 
> 	/* Re-insert the original OE */
> 	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
> 	...
> 
> 	/* Insert new OEs */
> 	if (pre_oe)
> 		node = tree_insert(...);
> 	...
> 	spin_unlock_irq(&tree->lock);
> 
> 	/* And, do the root->ordered_extents and outstanding_extents works */
> 	...
> }
> 
> With this approach, no one can see the intermediate state that an OE is
> missing for some area in the original OE range.

I like this solution, I think it is nice to split it up so that three
steps are separate. i.e., initialize the two new OEs with the old state,
then modify the middle OE with the new state and re-insert the new OEs
together. And everything after the initialization can be under the lock.

However, based on Christoph's response, I would lean towards getting rid
of the three way split altogether. I would love to hear your thoughts in
that thread as well, before committing to that, though.

If we do keep the three way split, then I will definitely implement your
idea here, I think it's nicer than the weird lock dropping/re-taking
stuff I was doing.

Thanks,
Boris

> 
> > +	}
> > +
> >  	ordered->file_offset += pre;
> >  	ordered->disk_bytenr += pre;
> >  	ordered->num_bytes -= (pre + post);
> > @@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
> >  
> >  	spin_unlock_irq(&tree->lock);
> >  
> > -	if (pre) {
> > -		oe = clone_ordered_extent(ordered, 0, pre);
> > -		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> > -		if (!ret && ret_pre)
> > -			*ret_pre = oe;
> > -	}
> > -	if (!ret && post) {
> > +	if (post) {
> >  		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
> >  		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> >  		if (!ret && ret_post)
> >  			*ret_post = oe;
> >  	}
> > +out:
> >  	return ret;
> >  }
> >  
> > -- 
> > 2.38.1
> > 

  reply	other threads:[~2023-03-23 16:22 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-22 19:11 [PATCH v5 0/5] btrfs: fix corruption caused by partial dio writes Boris Burkov
2023-03-22 19:11 ` [PATCH v5 1/5] btrfs: add function to create and return an ordered extent Boris Burkov
2023-03-22 19:11 ` [PATCH v5 2/5] btrfs: stash ordered extent in dio_data during iomap dio Boris Burkov
2023-03-22 19:11 ` [PATCH v5 3/5] btrfs: return ordered_extent splits from bio extraction Boris Burkov
2023-03-23  8:47   ` Christoph Hellwig
2023-03-23 16:15     ` Boris Burkov
2023-03-23 17:00       ` Boris Burkov
2023-03-23 17:45         ` Boris Burkov
2023-03-23 21:29         ` Christoph Hellwig
2023-03-23 22:43           ` Boris Burkov
2023-03-24  0:24             ` Christoph Hellwig
2023-03-22 19:11 ` [PATCH v5 4/5] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent Boris Burkov
2023-03-23  8:36   ` Naohiro Aota
2023-03-23 16:22     ` Boris Burkov [this message]
2023-03-22 19:11 ` [PATCH v5 5/5] btrfs: split partial dio bios before submit Boris Burkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230323162227.GB8070@zen \
    --to=boris@bur.io \
    --cc=Naohiro.Aota@wdc.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.