All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Sterba <dsterba@suse.cz>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
Date: Wed, 17 Feb 2021 18:24:07 +0100	[thread overview]
Message-ID: <20210217172407.GU1993@twin.jikos.cz> (raw)
In-Reply-To: <07067d184eb90be19874190df45cc83f06186307.1613473473.git.fdmanana@suse.com>

On Tue, Feb 16, 2021 at 11:09:25AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When using the NO_HOLES feature, if we clone a file range that spans only
> a hole into a range that is at or beyond the current i_size of the
> destination file, we end up not setting the full sync runtime flag on the
> inode. As a result, if we then fsync the destination file and have a power
> failure, after log replay we can end up exposing stale data instead of
> having a hole for that range.
> 
> The conditions for this to happen are the following:
> 
> 1) We have a file with a size of, for example, 1280K;
> 
> 2) There is a written (non-prealloc) extent for the file range from 1024K
>    to 1280K with a length of 256K;
> 
> 3) This particular file extent layout is durably persisted, so that the
>    existing superblock persisted on disk points to a subvolume root where
>    the file has that exact file extent layout and state;
> 
> 4) The file is truncated to a smaller size, to an offset lower than the
>    start offset of its last extent, for example to 800K. The truncate sets
>    the full sync runtime flag on the inode;
> 
> 6) Fsync the file to log it and clear the full sync runtime flag;
> 
> 7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
>    into the file with a destination offset that starts at or beyond the
>    256K file extent item we had - for example to offset 1024K;
> 
> 8) Since the clone operation does not find extents in the source range,
>    we end up in the if branch at the bottom of btrfs_clone() where we
>    punch a hole for the file range starting at offset 1024K by calling
>    btrfs_replace_file_extents(). There we end up not setting the full
>    sync flag on the inode, because we don't know we are being called in
>    a clone context (and not fallocate's punch hole operation), and
>    neither do we create an extent map to represent a hole because the
>    requested range is beyond eof;
> 
> 9) A further fsync to the file will be a fast fsync, since the clone
>    operation did not set the full sync flag, and therefore it relies on
>    modified extent maps to correctly log the file layout. But since
>    it does not find any extent map marking the range from 1024K (the
>    previous eof) to the new eof, it does not log a file extent item
>    for that range representing the hole;
> 
> 10) After a power failure no hole for the range starting at 1024K is
>    punched and we end up exposing stale data from the old 256K extent.
> 
> Turning this into exact steps:
> 
>   $ mkfs.btrfs -f -O no-holes /dev/sdi
>   $ mount /dev/sdi /mnt
> 
>   # Create our test file with 3 extents of 256K and a 256K hole at offset
>   # 256K. The file has a size of 1280K.
>   $ xfs_io -f -s \
>               -c "pwrite -S 0xab -b 256K 0 256K" \
>               -c "pwrite -S 0xcd -b 256K 512K 256K" \
>               -c "pwrite -S 0xef -b 256K 768K 256K" \
>               -c "pwrite -S 0x73 -b 256K 1024K 256K" \
>               /mnt/sdi/foobar
> 
>   # Make sure it's durably persisted. We want the last committed super
>   # block to point to this particular file extent layout.
>   sync
> 
>   # Now truncate our file to a smaller size, falling within a position of
>   # the second extent. This sets the full sync runtime flag on the inode.
>   # Then fsync the file to log it and clear the full sync flag from the
>   # inode. The third extent is no longer part of the file and therefore
>   # it is not logged.
>   $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
> 
>   # Now do a clone operation that only clones the hole and sets back the
>   # file size to match the size it had before the truncate operation
>   # (1280K).
>   $ xfs_io \
>         -c "reflink /mnt/foobar 256K 1024K 256K" \
>         -c "fsync" \
>         /mnt/foobar
> 
>   # File data before power failure:
>   $ od -A d -t x1 /mnt/foobar
>   0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
>   *
>   0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   *
>   0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
>   *
>   0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
>   *
>   0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   *
>   1310720
> 
>   <power fail>
> 
>   # Mount the fs again to replay the log tree.
>   $ mount /dev/sdi /mnt
> 
>   # File data after power failure:
>   $ od -A d -t x1 /mnt/foobar
>   0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
>   *
>   0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   *
>   0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
>   *
>   0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
>   *
>   0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   *
>   1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
>   *
>   1310720
> 
> The range from 1024K to 1280K should correspond to a hole but instead it
> points to stale data, to the 256K extent that should not exist after the
> truncate operation.
> 
> The issue does not exists when not using NO_HOLES, because for that case
> we use file extent items to represent holes, these are found and copied
> during the loop that iterates over extents at btrfs_clone(), and that
> causes btrfs_replace_file_extents() to be called with a non-NULL
> extent_info argument and therefore set the full sync runtime flag on the
> inode.
> 
> So fix this by making the code that deals with a trailing hole during
> cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
> range starts at or beyond the current i_size.
> 
> A test case for fstests will follow soon.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.

      parent reply	other threads:[~2021-02-17 17:27 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-16 11:09 [PATCH] btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled fdmanana
2021-02-16 15:48 ` Josef Bacik
2021-02-17 17:24 ` David Sterba [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210217172407.GU1993@twin.jikos.cz \
    --to=dsterba@suse.cz \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.