Re: [PATCH] Btrfs: avoid unnecessary splits when setting bits on an extent io tree

From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: avoid unnecessary splits when setting bits on an extent io tree
Date: Fri, 14 Feb 2020 15:45:52 -0500	[thread overview]
Message-ID: <152c9721-8c15-f976-c9ac-df04e01f63ce@toxicpanda.com> (raw)
In-Reply-To: <20200213102002.6176-1-fdmanana@kernel.org>

On 2/13/20 5:20 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When attempting to set bits on a range of an exent io tree that already
> has those bits set we can end up splitting an extent state record, use
> the preallocated extent state record, insert it into the red black tree,
> do another search on the red black tree, merge the preallocated extent
> state record with the previous extent state record, remove that previous
> record from the red black tree and then free it. This is all unnecessary
> work that consumes time.
> 
> This happens specifically at the following case at __set_extent_bit():
> 
>    $ cat -n fs/btrfs/extent_io.c
>     957  static int __must_check
>     958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>    (...)
>    1044          /*
>    1045           *     | ---- desired range ---- |
>    1046           * | state |
>    1047           *   or
>    1048           * | ------------- state -------------- |
>    1049           *
>    (...)
>    1060          if (state->start < start) {
>    1061                  if (state->state & exclusive_bits) {
>    1062                          *failed_start = start;
>    1063                          err = -EEXIST;
>    1064                          goto out;
>    1065                  }
>    1066
>    1067                  prealloc = alloc_extent_state_atomic(prealloc);
>    1068                  BUG_ON(!prealloc);
>    1069                  err = split_state(tree, state, prealloc, start);
>    1070                  if (err)
>    1071                          extent_io_tree_panic(tree, err);
>    1072
>    1073                  prealloc = NULL;
> 
> So if our extent state represents a range from 0 to 1Mb for example, and
> we want to set bits in the range 128Kb to 256Kb for example, and that
> extent state record already has all those bits set, we end up splitting
> that record, so we end up with extent state records in the tree which
> represent the ranges from 0 to 128Kb and from 128Kb to 1Mb. This is
> temporary because a subsequent iteration in that function will end up
> merging the records.
> 
> The splitting requires using the preallocated extent state record, so
> a future iteration that needs to do another split will need to allocate
> another extent state record in an atomic context, something not ideal
> that we try to avoid as much as possible. The splitting also requires
> an insertion in the red black tree, and a subsequent merge will require
> a deletion from the red black tree and freeing an extent state record.
> 
> This change just skips the splitting of an extent state record when it
> already has all the bits the we need to set.
> 
> Setting a bit that is already set for a range is very common in the
> inode's 'file_extent_tree' extent io tree for example, where we keep
> setting the EXTENT_DIRTY bit every time we replace an extent.
> 
> This change also fixes a bug that happens after the recent patchset from
> Josef that avoids having implicit holes after a power failure when not
> using the NO_HOLES feature, more specifically the patch with the subject:
> 
>    "btrfs: introduce the inode->file_extent_tree"
> 
> This patch introduced an extent io tree per inode to keep track of
> completed ordered extents and figure out at any time what is the safe
> value for the inode's disk_i_size. This assumes that for contiguous
> ranges in a file we always end up with a single extent state record in
> the io tree, but that is not the case, as there is a short time window
> where we can have two extent state records representing contiguous
> ranges. When this happens we end setting up an incorrect value for the
> inode's disk_i_size, resulting in data loss after a clean unmount
> of the filesystem. The following example explains how this can happen.
> 
> Suppose we have an inode with an i_size and a disk_i_size of 1Mb, so in
> the inode's file_extent_tree we have a single extent state record that
> represents the range [0, 1Mb[ with the EXTENT_DIRTY bit set. Then the
> following steps happen:
> 
> 1) A buffered write against file range [512Kb, 768Kb[ is made. At this
>     point delalloc was not flushed yet;
> 
> 2) Deduplication from some other inode into this inode's range
>     [128Kb, 256Kb[ is made. This causes btrfs_inode_set_file_extent_range()
>     to be called, from btrfs_insert_clone_extent(), to mark the range
>     [128Kb, 256Kb[ with EXTENT_DIRTY in the inode's file_extent_tree;
> 
> 3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
>     end up at __set_extent_bit(). In the first iteration of that function's
>     loop we end up in the following branch:
> 
>     $ cat -n fs/btrfs/extent_io.c
>      957  static int __must_check
>      958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>     (...)
>     1044          /*
>     1045           *     | ---- desired range ---- |
>     1046           * | state |
>     1047           *   or
>     1048           * | ------------- state -------------- |
>     1049           *
>     (...)
>     1060          if (state->start < start) {
>     1061                  if (state->state & exclusive_bits) {
>     1062                          *failed_start = start;
>     1063                          err = -EEXIST;
>     1064                          goto out;
>     1065                  }
>     1066
>     1067                  prealloc = alloc_extent_state_atomic(prealloc);
>     1068                  BUG_ON(!prealloc);
>     1069                  err = split_state(tree, state, prealloc, start);
>     1070                  if (err)
>     1071                          extent_io_tree_panic(tree, err);
>     1072
>     1073                  prealloc = NULL;
>     (...)
>     1089                  goto search_again;
> 
>     This splits the state record into two, one for range [0, 128Kb[ and
>     another for the range [128Kb, 1Mb[. Both already have the EXTENT_DIRTY
>     bit set. Then we jump to the 'search_again' label, where we unlock the
>     the spinlock protecting the extent io tree before jumping to the
>     'again' label to perform the next iteration;
> 
> 4) In the meanwhile, delalloc is flushed, the ordered extent for the range
>     [512Kb, 768Kb[ is created and when it completes, at
>     btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
>     with a value of 0 for its 'new_size' argument;
> 
> 5) Before the deduplication task currently at __set_extent_bit() moves to
>     the next iteration, the task finishing the ordered extent calls
>     find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
>     and gets 'start' set to 0 and 'end' set to 128Kb - because at this
>     moment the io tree has two extent state records, one representing the
>     range [0, 128Kb[ and another representing the range [128Kb, 1Mb[,
>     both with EXTENT_DIRTY set. Then we set 'isize' to:
> 
>     isize = min(isize, end + 1)
>           = min(1Mb, 128Kb - 1 + 1)
>           = 128Kb
> 
>     Then we set the inode's disk_i_size to 128Kb (isize).
> 
>     After a clean unmount of the filesystem and mounting it again, we have
>     the file with a size of 128Kb, and effectively lost all the data it
>     had before in the range from 128Kb to 1Mb.
> 
> This change fixes that issue too, as we never end up splitting extent
> state records when they already have all the bits we want set.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Sorry, forgot to say

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef