From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: avoid unnecessary splits when setting bits on an extent io tree
Date: Fri, 14 Feb 2020 15:45:52 -0500 [thread overview]
Message-ID: <152c9721-8c15-f976-c9ac-df04e01f63ce@toxicpanda.com> (raw)
In-Reply-To: <20200213102002.6176-1-fdmanana@kernel.org>
On 2/13/20 5:20 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When attempting to set bits on a range of an exent io tree that already
> has those bits set we can end up splitting an extent state record, use
> the preallocated extent state record, insert it into the red black tree,
> do another search on the red black tree, merge the preallocated extent
> state record with the previous extent state record, remove that previous
> record from the red black tree and then free it. This is all unnecessary
> work that consumes time.
>
> This happens specifically at the following case at __set_extent_bit():
>
> $ cat -n fs/btrfs/extent_io.c
> 957 static int __must_check
> 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> (...)
> 1044 /*
> 1045 * | ---- desired range ---- |
> 1046 * | state |
> 1047 * or
> 1048 * | ------------- state -------------- |
> 1049 *
> (...)
> 1060 if (state->start < start) {
> 1061 if (state->state & exclusive_bits) {
> 1062 *failed_start = start;
> 1063 err = -EEXIST;
> 1064 goto out;
> 1065 }
> 1066
> 1067 prealloc = alloc_extent_state_atomic(prealloc);
> 1068 BUG_ON(!prealloc);
> 1069 err = split_state(tree, state, prealloc, start);
> 1070 if (err)
> 1071 extent_io_tree_panic(tree, err);
> 1072
> 1073 prealloc = NULL;
>
> So if our extent state represents a range from 0 to 1Mb for example, and
> we want to set bits in the range 128Kb to 256Kb for example, and that
> extent state record already has all those bits set, we end up splitting
> that record, so we end up with extent state records in the tree which
> represent the ranges from 0 to 128Kb and from 128Kb to 1Mb. This is
> temporary because a subsequent iteration in that function will end up
> merging the records.
>
> The splitting requires using the preallocated extent state record, so
> a future iteration that needs to do another split will need to allocate
> another extent state record in an atomic context, something not ideal
> that we try to avoid as much as possible. The splitting also requires
> an insertion in the red black tree, and a subsequent merge will require
> a deletion from the red black tree and freeing an extent state record.
>
> This change just skips the splitting of an extent state record when it
> already has all the bits the we need to set.
>
> Setting a bit that is already set for a range is very common in the
> inode's 'file_extent_tree' extent io tree for example, where we keep
> setting the EXTENT_DIRTY bit every time we replace an extent.
>
> This change also fixes a bug that happens after the recent patchset from
> Josef that avoids having implicit holes after a power failure when not
> using the NO_HOLES feature, more specifically the patch with the subject:
>
> "btrfs: introduce the inode->file_extent_tree"
>
> This patch introduced an extent io tree per inode to keep track of
> completed ordered extents and figure out at any time what is the safe
> value for the inode's disk_i_size. This assumes that for contiguous
> ranges in a file we always end up with a single extent state record in
> the io tree, but that is not the case, as there is a short time window
> where we can have two extent state records representing contiguous
> ranges. When this happens we end setting up an incorrect value for the
> inode's disk_i_size, resulting in data loss after a clean unmount
> of the filesystem. The following example explains how this can happen.
>
> Suppose we have an inode with an i_size and a disk_i_size of 1Mb, so in
> the inode's file_extent_tree we have a single extent state record that
> represents the range [0, 1Mb[ with the EXTENT_DIRTY bit set. Then the
> following steps happen:
>
> 1) A buffered write against file range [512Kb, 768Kb[ is made. At this
> point delalloc was not flushed yet;
>
> 2) Deduplication from some other inode into this inode's range
> [128Kb, 256Kb[ is made. This causes btrfs_inode_set_file_extent_range()
> to be called, from btrfs_insert_clone_extent(), to mark the range
> [128Kb, 256Kb[ with EXTENT_DIRTY in the inode's file_extent_tree;
>
> 3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
> end up at __set_extent_bit(). In the first iteration of that function's
> loop we end up in the following branch:
>
> $ cat -n fs/btrfs/extent_io.c
> 957 static int __must_check
> 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> (...)
> 1044 /*
> 1045 * | ---- desired range ---- |
> 1046 * | state |
> 1047 * or
> 1048 * | ------------- state -------------- |
> 1049 *
> (...)
> 1060 if (state->start < start) {
> 1061 if (state->state & exclusive_bits) {
> 1062 *failed_start = start;
> 1063 err = -EEXIST;
> 1064 goto out;
> 1065 }
> 1066
> 1067 prealloc = alloc_extent_state_atomic(prealloc);
> 1068 BUG_ON(!prealloc);
> 1069 err = split_state(tree, state, prealloc, start);
> 1070 if (err)
> 1071 extent_io_tree_panic(tree, err);
> 1072
> 1073 prealloc = NULL;
> (...)
> 1089 goto search_again;
>
> This splits the state record into two, one for range [0, 128Kb[ and
> another for the range [128Kb, 1Mb[. Both already have the EXTENT_DIRTY
> bit set. Then we jump to the 'search_again' label, where we unlock the
> the spinlock protecting the extent io tree before jumping to the
> 'again' label to perform the next iteration;
>
> 4) In the meanwhile, delalloc is flushed, the ordered extent for the range
> [512Kb, 768Kb[ is created and when it completes, at
> btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
> with a value of 0 for its 'new_size' argument;
>
> 5) Before the deduplication task currently at __set_extent_bit() moves to
> the next iteration, the task finishing the ordered extent calls
> find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
> and gets 'start' set to 0 and 'end' set to 128Kb - because at this
> moment the io tree has two extent state records, one representing the
> range [0, 128Kb[ and another representing the range [128Kb, 1Mb[,
> both with EXTENT_DIRTY set. Then we set 'isize' to:
>
> isize = min(isize, end + 1)
> = min(1Mb, 128Kb - 1 + 1)
> = 128Kb
>
> Then we set the inode's disk_i_size to 128Kb (isize).
>
> After a clean unmount of the filesystem and mounting it again, we have
> the file with a size of 128Kb, and effectively lost all the data it
> had before in the range from 128Kb to 1Mb.
>
> This change fixes that issue too, as we never end up splitting extent
> state records when they already have all the bits we want set.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Sorry, forgot to say
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2020-02-14 20:46 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-13 10:20 [PATCH] Btrfs: avoid unnecessary splits when setting bits on an extent io tree fdmanana
2020-02-14 20:45 ` Josef Bacik [this message]
2020-02-18 15:49 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=152c9721-8c15-f976-c9ac-df04e01f63ce@toxicpanda.com \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).