Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] Btrfs: fix file corruption after snapshotting
@ 2019-02-04 14:28 fdmanana
  2019-02-18 17:11 ` David Sterba
  2019-02-27 13:42 ` [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes fdmanana
  0 siblings, 2 replies; 12+ messages in thread
From: fdmanana @ 2019-02-04 14:28 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we are mixing buffered writes with direct IO writes against the same
file and snapshotting is happening concurrently, we can end up with a
corrupt file content in the snapshot. Example:

1) Inode/file is empty.

2) Snapshotting starts.

2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
   inode to 256Kb, disk_i_size remains zero. This happens after the task
   doing the snapshot flushes all existing delalloc.

3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
   completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
   updates the inode item in the fs tree with a size of 1Mb (which is
   the value of disk_i_size).

4) The dealloc for the range [0, 256Kb[ did not start yet.

5) The transaction used in the DIO ordered extent completion, which updated
   the inode item, is committed by the snapshotting task.

6) Snapshot creation completes.

7) Dealloc for the range [0, 256Kb[ is flushed.

After that when reading the file from the snapshot we always get zeroes for
the range [0, 256Kb[, the file has a size of 1Mb and the data written by
the direct IO write is found. From an application's point of view this is
a corruption, since in the source subvolume it could never read a version
of the file that included the data from the direct IO write without the
data from the buffered write included as well. In the snapshot's tree,
file extent items are missing for the range [0, 256Kb[.

The issue, obviously, does not happen when using the -o flushoncommit
mount option.

Fix this by flushing delalloc for all the roots that are about to be
snapshotted when committing a transaction. This guarantees total ordering
when updating the disk_i_size of an inode since the flush for dealloc is
done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
is done once no more external writers exist. This is similar to what we
do when using the flushoncommit mount option, but we do it only if the
transaction has snapshots to create and only for the roots of the
subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
snapshot creation ioctl, so the flush work we do inside the transaction
is minimized.

This issue, involving buffered and direct IO writes with snapshotting, is
often triggered by fstest btrfs/078, and got reported by fsck when not
using the NO_HOLES features, for example:

  $ cat results/btrfs/078.full
  (...)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
  *** fsck.btrfs output ***
  [1/7] checking root items
  [2/7] checking extents
  [3/7] checking free space cache
  [4/7] checking fs roots
  root 258 inode 264 errors 100, file extent discount
  Found file extent holes:
        start: 524288, len: 65536
  ERROR: errors found in fs roots

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4ec2b660d014..2e8f15eee2e8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
        }
 }
 
-static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
 {
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
 	/*
 	 * We use writeback_inodes_sb here because if we used
 	 * btrfs_start_delalloc_roots we would deadlock with fs freeze.
@@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
 	 * from already being in a transaction and our join_transaction doesn't
 	 * have to re-take the fs freeze lock.
 	 */
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		list_for_each_entry(pending, head, list) {
+			int ret;
+
+			ret = btrfs_start_delalloc_snapshot(pending->root);
+			if (ret)
+				return ret;
+		}
+	}
 	return 0;
 }
 
-static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
 {
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		list_for_each_entry(pending, head, list)
+			btrfs_wait_ordered_extents(pending->root,
+						   U64_MAX, 0, U64_MAX);
+	}
 }
 
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
@@ -2024,7 +2048,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	extwriter_counter_dec(cur_trans, trans->type);
 
-	ret = btrfs_start_delalloc_flush(fs_info);
+	ret = btrfs_start_delalloc_flush(trans);
 	if (ret)
 		goto cleanup_transaction;
 
@@ -2040,7 +2064,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto cleanup_transaction;
 
-	btrfs_wait_delalloc_flush(fs_info);
+	btrfs_wait_delalloc_flush(trans);
 
 	btrfs_scrub_pause(fs_info);
 	/*
-- 
2.11.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix file corruption after snapshotting
  2019-02-04 14:28 [PATCH] Btrfs: fix file corruption after snapshotting fdmanana
@ 2019-02-18 17:11 ` David Sterba
  2019-02-18 17:27   ` Filipe Manana
  2019-02-27 13:42 ` [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes fdmanana
  1 sibling, 1 reply; 12+ messages in thread
From: David Sterba @ 2019-02-18 17:11 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we are mixing buffered writes with direct IO writes against the same
> file and snapshotting is happening concurrently, we can end up with a
> corrupt file content in the snapshot. Example:

The patch subject sounds like it's a corruption in generic snapshot
behaviour. But it's when mixing buffered and direct io, which is a
potential corruption scenario on itself, so snapshotting does make it
that much worse. I'd like to see that somhow reflected in the subject.

> 1) Inode/file is empty.
> 
> 2) Snapshotting starts.
> 
> 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
>    inode to 256Kb, disk_i_size remains zero. This happens after the task
>    doing the snapshot flushes all existing delalloc.
> 
> 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
>    completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
>    updates the inode item in the fs tree with a size of 1Mb (which is
>    the value of disk_i_size).
> 
> 4) The dealloc for the range [0, 256Kb[ did not start yet.
> 
> 5) The transaction used in the DIO ordered extent completion, which updated
>    the inode item, is committed by the snapshotting task.
> 
> 6) Snapshot creation completes.
> 
> 7) Dealloc for the range [0, 256Kb[ is flushed.
> 
> After that when reading the file from the snapshot we always get zeroes for
> the range [0, 256Kb[, the file has a size of 1Mb and the data written by
> the direct IO write is found. From an application's point of view this is
> a corruption, since in the source subvolume it could never read a version
> of the file that included the data from the direct IO write without the
> data from the buffered write included as well. In the snapshot's tree,
> file extent items are missing for the range [0, 256Kb[.
> 
> The issue, obviously, does not happen when using the -o flushoncommit
> mount option.
> 
> Fix this by flushing delalloc for all the roots that are about to be
> snapshotted when committing a transaction. This guarantees total ordering
> when updating the disk_i_size of an inode since the flush for dealloc is
> done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
> is done once no more external writers exist. This is similar to what we
> do when using the flushoncommit mount option, but we do it only if the
> transaction has snapshots to create and only for the roots of the
> subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
> snapshot creation ioctl, so the flush work we do inside the transaction
> is minimized.
> 
> This issue, involving buffered and direct IO writes with snapshotting, is
> often triggered by fstest btrfs/078, and got reported by fsck when not
> using the NO_HOLES features, for example:
> 
>   $ cat results/btrfs/078.full
>   (...)
>   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>   *** fsck.btrfs output ***
>   [1/7] checking root items
>   [2/7] checking extents
>   [3/7] checking free space cache
>   [4/7] checking fs roots
>   root 258 inode 264 errors 100, file extent discount
>   Found file extent holes:
>         start: 524288, len: 65536
>   ERROR: errors found in fs roots
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 4ec2b660d014..2e8f15eee2e8 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
>         }
>  }
>  
> -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
>  {
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +
>  	/*
>  	 * We use writeback_inodes_sb here because if we used
>  	 * btrfs_start_delalloc_roots we would deadlock with fs freeze.
> @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
>  	 * from already being in a transaction and our join_transaction doesn't
>  	 * have to re-take the fs freeze lock.
>  	 */
> -	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> +	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
>  		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> +	} else {
> +		struct btrfs_pending_snapshot *pending;
> +		struct list_head *head = &trans->transaction->pending_snapshots;
> +

A comment would be good here as it's not obvious why the sync is done
here (and similarly the waiting part in btrfs_wait_delalloc_flush).

> +		list_for_each_entry(pending, head, list) {
> +			int ret;
> +
> +			ret = btrfs_start_delalloc_snapshot(pending->root);
> +			if (ret)
> +				return ret;

This adds a potential failure to the middle of transaction commit. I've
checked the errors, there's EROFS (after a global fs error state) and
ENOMEM (from start_delalloc_inodes).

The EROFS could be filtered as a non-issue. ENOMEM means that the
flushing was not possible and skipping it would bring back the problem
this patch is fixing.

So, how about calling writeback_inodes_sb in that case as a fallback?
I'd really like to avoid returning failure from
btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
theoretical error case anyway) should be ok.

> +		}
> +	}
>  	return 0;
>  }
>  
> -static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
> +static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
>  {
> -	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +
> +	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
>  		btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
> +	} else {
> +		struct btrfs_pending_snapshot *pending;
> +		struct list_head *head = &trans->transaction->pending_snapshots;
> +
> +		list_for_each_entry(pending, head, list)
> +			btrfs_wait_ordered_extents(pending->root,
> +						   U64_MAX, 0, U64_MAX);
> +	}
>  }
>  
>  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)

The patch has been in for-next for some time, I just did not get to
writing the comments. Though the dio/buffered use is discouraged, the
errors reported by the test should be fixed. The obvious concern was the
perf penalty, but from that point I it's ok as you point out above.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix file corruption after snapshotting
  2019-02-18 17:11 ` David Sterba
@ 2019-02-18 17:27   ` Filipe Manana
  2019-02-27 12:47     ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-02-18 17:27 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs

On Mon, Feb 18, 2019 at 5:09 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we are mixing buffered writes with direct IO writes against the same
> > file and snapshotting is happening concurrently, we can end up with a
> > corrupt file content in the snapshot. Example:
>
> The patch subject sounds like it's a corruption in generic snapshot
> behaviour. But it's when mixing buffered and direct io, which is a
> potential corruption scenario on itself, so snapshotting does make it
> that much worse. I'd like to see that somhow reflected in the subject.

It's a kind of corruption created by snapshotting.
I tried to come up with a better subject that wasn't too long to fit
in the 74-75 characters limit, but couldn't come up with any.
So I left the explanation and example in the remainder of the change log.

If you have a better suggestion... I'm open to it.

>
> > 1) Inode/file is empty.
> >
> > 2) Snapshotting starts.
> >
> > 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
> >    inode to 256Kb, disk_i_size remains zero. This happens after the task
> >    doing the snapshot flushes all existing delalloc.
> >
> > 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
> >    completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
> >    updates the inode item in the fs tree with a size of 1Mb (which is
> >    the value of disk_i_size).
> >
> > 4) The dealloc for the range [0, 256Kb[ did not start yet.
> >
> > 5) The transaction used in the DIO ordered extent completion, which updated
> >    the inode item, is committed by the snapshotting task.
> >
> > 6) Snapshot creation completes.
> >
> > 7) Dealloc for the range [0, 256Kb[ is flushed.
> >
> > After that when reading the file from the snapshot we always get zeroes for
> > the range [0, 256Kb[, the file has a size of 1Mb and the data written by
> > the direct IO write is found. From an application's point of view this is
> > a corruption, since in the source subvolume it could never read a version
> > of the file that included the data from the direct IO write without the
> > data from the buffered write included as well. In the snapshot's tree,
> > file extent items are missing for the range [0, 256Kb[.
> >
> > The issue, obviously, does not happen when using the -o flushoncommit
> > mount option.
> >
> > Fix this by flushing delalloc for all the roots that are about to be
> > snapshotted when committing a transaction. This guarantees total ordering
> > when updating the disk_i_size of an inode since the flush for dealloc is
> > done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
> > is done once no more external writers exist. This is similar to what we
> > do when using the flushoncommit mount option, but we do it only if the
> > transaction has snapshots to create and only for the roots of the
> > subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
> > snapshot creation ioctl, so the flush work we do inside the transaction
> > is minimized.
> >
> > This issue, involving buffered and direct IO writes with snapshotting, is
> > often triggered by fstest btrfs/078, and got reported by fsck when not
> > using the NO_HOLES features, for example:
> >
> >   $ cat results/btrfs/078.full
> >   (...)
> >   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
> >   *** fsck.btrfs output ***
> >   [1/7] checking root items
> >   [2/7] checking extents
> >   [3/7] checking free space cache
> >   [4/7] checking fs roots
> >   root 258 inode 264 errors 100, file extent discount
> >   Found file extent holes:
> >         start: 524288, len: 65536
> >   ERROR: errors found in fs roots
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 4ec2b660d014..2e8f15eee2e8 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
> >         }
> >  }
> >
> > -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
> >  {
> > +     struct btrfs_fs_info *fs_info = trans->fs_info;
> > +
> >       /*
> >        * We use writeback_inodes_sb here because if we used
> >        * btrfs_start_delalloc_roots we would deadlock with fs freeze.
> > @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> >        * from already being in a transaction and our join_transaction doesn't
> >        * have to re-take the fs freeze lock.
> >        */
> > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > +     } else {
> > +             struct btrfs_pending_snapshot *pending;
> > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > +
>
> A comment would be good here as it's not obvious why the sync is done
> here (and similarly the waiting part in btrfs_wait_delalloc_flush).

Intentionally left out due to the changelog explaining it and avoiding
a too long comment explaining why/how the corruption happens.

>
> > +             list_for_each_entry(pending, head, list) {
> > +                     int ret;
> > +
> > +                     ret = btrfs_start_delalloc_snapshot(pending->root);
> > +                     if (ret)
> > +                             return ret;
>
> This adds a potential failure to the middle of transaction commit. I've
> checked the errors, there's EROFS (after a global fs error state) and
> ENOMEM (from start_delalloc_inodes).
>
> The EROFS could be filtered as a non-issue. ENOMEM means that the
> flushing was not possible and skipping it would bring back the problem
> this patch is fixing.
>
> So, how about calling writeback_inodes_sb in that case as a fallback?

Thought about it, but the reason I didn't do it is that if you
fallback to writeback_unodes_sb, you'll never have the error reported
to the user.
It's very unlikely the user will do an fsync on the snapshot version
of the file, specially if it's a RO snapshot, which would be the only
way to report the error.


> I'd really like to avoid returning failure from
> btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
> theoretical error case anyway) should be ok.

Me too, as long as the error is reported (a message in syslog/dmesg is
very likely to be missed).


>
> > +             }
> > +     }
> >       return 0;
> >  }
> >
> > -static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
> > +static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
> >  {
> > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > +     struct btrfs_fs_info *fs_info = trans->fs_info;
> > +
> > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> >               btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
> > +     } else {
> > +             struct btrfs_pending_snapshot *pending;
> > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > +
> > +             list_for_each_entry(pending, head, list)
> > +                     btrfs_wait_ordered_extents(pending->root,
> > +                                                U64_MAX, 0, U64_MAX);
> > +     }
> >  }
> >
> >  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>
> The patch has been in for-next for some time, I just did not get to
> writing the comments. Though the dio/buffered use is discouraged, the
> errors reported by the test should be fixed. The obvious concern was the
> perf penalty, but from that point I it's ok as you point out above.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix file corruption after snapshotting
  2019-02-18 17:27   ` Filipe Manana
@ 2019-02-27 12:47     ` David Sterba
  2019-02-27 13:42       ` Filipe Manana
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2019-02-27 12:47 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs

On Mon, Feb 18, 2019 at 05:27:54PM +0000, Filipe Manana wrote:
> On Mon, Feb 18, 2019 at 5:09 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When we are mixing buffered writes with direct IO writes against the same
> > > file and snapshotting is happening concurrently, we can end up with a
> > > corrupt file content in the snapshot. Example:
> >
> > The patch subject sounds like it's a corruption in generic snapshot
> > behaviour. But it's when mixing buffered and direct io, which is a
> > potential corruption scenario on itself, so snapshotting does make it
> > that much worse. I'd like to see that somhow reflected in the subject.
> 
> It's a kind of corruption created by snapshotting.
> I tried to come up with a better subject that wasn't too long to fit
> in the 74-75 characters limit, but couldn't come up with any.
> So I left the explanation and example in the remainder of the change log.
> 
> If you have a better suggestion... I'm open to it.

The 74 chars applies namely to the changelog text, there are commits
with long subject line (sample from 4.19 with 75 to 103). I don't mind
if it's for better descriptivity.

"btrfs: fix corruption after snapshotting file with mixed buffer/DIO writes"

> > > 1) Inode/file is empty.
> > >
> > > 2) Snapshotting starts.
> > >
> > > 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
> > >    inode to 256Kb, disk_i_size remains zero. This happens after the task
> > >    doing the snapshot flushes all existing delalloc.
> > >
> > > 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
> > >    completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
> > >    updates the inode item in the fs tree with a size of 1Mb (which is
> > >    the value of disk_i_size).
> > >
> > > 4) The dealloc for the range [0, 256Kb[ did not start yet.
> > >
> > > 5) The transaction used in the DIO ordered extent completion, which updated
> > >    the inode item, is committed by the snapshotting task.
> > >
> > > 6) Snapshot creation completes.
> > >
> > > 7) Dealloc for the range [0, 256Kb[ is flushed.
> > >
> > > After that when reading the file from the snapshot we always get zeroes for
> > > the range [0, 256Kb[, the file has a size of 1Mb and the data written by
> > > the direct IO write is found. From an application's point of view this is
> > > a corruption, since in the source subvolume it could never read a version
> > > of the file that included the data from the direct IO write without the
> > > data from the buffered write included as well. In the snapshot's tree,
> > > file extent items are missing for the range [0, 256Kb[.
> > >
> > > The issue, obviously, does not happen when using the -o flushoncommit
> > > mount option.
> > >
> > > Fix this by flushing delalloc for all the roots that are about to be
> > > snapshotted when committing a transaction. This guarantees total ordering
> > > when updating the disk_i_size of an inode since the flush for dealloc is
> > > done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
> > > is done once no more external writers exist. This is similar to what we
> > > do when using the flushoncommit mount option, but we do it only if the
> > > transaction has snapshots to create and only for the roots of the
> > > subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
> > > snapshot creation ioctl, so the flush work we do inside the transaction
> > > is minimized.
> > >
> > > This issue, involving buffered and direct IO writes with snapshotting, is
> > > often triggered by fstest btrfs/078, and got reported by fsck when not
> > > using the NO_HOLES features, for example:
> > >
> > >   $ cat results/btrfs/078.full
> > >   (...)
> > >   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
> > >   *** fsck.btrfs output ***
> > >   [1/7] checking root items
> > >   [2/7] checking extents
> > >   [3/7] checking free space cache
> > >   [4/7] checking fs roots
> > >   root 258 inode 264 errors 100, file extent discount
> > >   Found file extent holes:
> > >         start: 524288, len: 65536
> > >   ERROR: errors found in fs roots
> > >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
> > >  1 file changed, 30 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index 4ec2b660d014..2e8f15eee2e8 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
> > >         }
> > >  }
> > >
> > > -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > > +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
> > >  {
> > > +     struct btrfs_fs_info *fs_info = trans->fs_info;
> > > +
> > >       /*
> > >        * We use writeback_inodes_sb here because if we used
> > >        * btrfs_start_delalloc_roots we would deadlock with fs freeze.
> > > @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > >        * from already being in a transaction and our join_transaction doesn't
> > >        * have to re-take the fs freeze lock.
> > >        */
> > > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> > >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > > +     } else {
> > > +             struct btrfs_pending_snapshot *pending;
> > > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > > +
> >
> > A comment would be good here as it's not obvious why the sync is done
> > here (and similarly the waiting part in btrfs_wait_delalloc_flush).
> 
> Intentionally left out due to the changelog explaining it and avoiding
> a too long comment explaining why/how the corruption happens.

I see, so what if the comment is only a short version giving pointers,
something like the first paragraph of the changelog and the fix.

/*
 * Flush delalloc roots about to be snapshotted to guarantee total
 * ordering when updating disk_i_size. This could happen for files with
 * mixed buffered and direct IO
 */

> 
> >
> > > +             list_for_each_entry(pending, head, list) {
> > > +                     int ret;
> > > +
> > > +                     ret = btrfs_start_delalloc_snapshot(pending->root);
> > > +                     if (ret)
> > > +                             return ret;
> >
> > This adds a potential failure to the middle of transaction commit. I've
> > checked the errors, there's EROFS (after a global fs error state) and
> > ENOMEM (from start_delalloc_inodes).
> >
> > The EROFS could be filtered as a non-issue. ENOMEM means that the
> > flushing was not possible and skipping it would bring back the problem
> > this patch is fixing.
> >
> > So, how about calling writeback_inodes_sb in that case as a fallback?
> 
> Thought about it, but the reason I didn't do it is that if you
> fallback to writeback_unodes_sb, you'll never have the error reported
> to the user.
> It's very unlikely the user will do an fsync on the snapshot version
> of the file, specially if it's a RO snapshot, which would be the only
> way to report the error.
> 
> 
> > I'd really like to avoid returning failure from
> > btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
> > theoretical error case anyway) should be ok.
> 
> Me too, as long as the error is reported (a message in syslog/dmesg is
> very likely to be missed).

I probably don't understand. EROFS could be ignored and ENOMEM can be
worked around. I don't see what needs to be reported to the user.

The goal is to make btrfs_start_delalloc_flush return void and drop the
'if (ret)' in transaction commit.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-04 14:28 [PATCH] Btrfs: fix file corruption after snapshotting fdmanana
  2019-02-18 17:11 ` David Sterba
@ 2019-02-27 13:42 ` fdmanana
  2019-02-27 17:26   ` David Sterba
  1 sibling, 1 reply; 12+ messages in thread
From: fdmanana @ 2019-02-27 13:42 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we are mixing buffered writes with direct IO writes against the same
file and snapshotting is happening concurrently, we can end up with a
corrupt file content in the snapshot. Example:

1) Inode/file is empty.

2) Snapshotting starts.

2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
   inode to 256Kb, disk_i_size remains zero. This happens after the task
   doing the snapshot flushes all existing delalloc.

3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
   completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
   updates the inode item in the fs tree with a size of 1Mb (which is
   the value of disk_i_size).

4) The dealloc for the range [0, 256Kb[ did not start yet.

5) The transaction used in the DIO ordered extent completion, which updated
   the inode item, is committed by the snapshotting task.

6) Snapshot creation completes.

7) Dealloc for the range [0, 256Kb[ is flushed.

After that when reading the file from the snapshot we always get zeroes for
the range [0, 256Kb[, the file has a size of 1Mb and the data written by
the direct IO write is found. From an application's point of view this is
a corruption, since in the source subvolume it could never read a version
of the file that included the data from the direct IO write without the
data from the buffered write included as well. In the snapshot's tree,
file extent items are missing for the range [0, 256Kb[.

The issue, obviously, does not happen when using the -o flushoncommit
mount option.

Fix this by flushing delalloc for all the roots that are about to be
snapshotted when committing a transaction. This guarantees total ordering
when updating the disk_i_size of an inode since the flush for dealloc is
done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
is done once no more external writers exist. This is similar to what we
do when using the flushoncommit mount option, but we do it only if the
transaction has snapshots to create and only for the roots of the
subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
snapshot creation ioctl, so the flush work we do inside the transaction
is minimized.

This issue, involving buffered and direct IO writes with snapshotting, is
often triggered by fstest btrfs/078, and got reported by fsck when not
using the NO_HOLES features, for example:

  $ cat results/btrfs/078.full
  (...)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
  *** fsck.btrfs output ***
  [1/7] checking root items
  [2/7] checking extents
  [3/7] checking free space cache
  [4/7] checking fs roots
  root 258 inode 264 errors 100, file extent discount
  Found file extent holes:
        start: 524288, len: 65536
  ERROR: errors found in fs roots

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Changed subject (used to be "Btrfs: fix file corruption after snapshotting").
    Added a couple comments.
    Started to ignore errors from delalloc flushing.

 fs/btrfs/transaction.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4ec2b660d014..f96867e55350 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
        }
 }
 
-static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
 {
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
 	/*
 	 * We use writeback_inodes_sb here because if we used
 	 * btrfs_start_delalloc_roots we would deadlock with fs freeze.
@@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
 	 * from already being in a transaction and our join_transaction doesn't
 	 * have to re-take the fs freeze lock.
 	 */
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		/*
+		 * Flush dellaloc for any root that is going to be snapshotted.
+		 * This is done to avoid a corrupted version of files, in the
+		 * snapshots, that had both buffered and direct IO writes (even
+		 * if they were done sequentially) due to an unordered update of
+		 * the inode's size on disk.
+		 */
+		list_for_each_entry(pending, head, list)
+			btrfs_start_delalloc_snapshot(pending->root);
+	}
 	return 0;
 }
 
-static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
 {
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		/*
+		 * Wait for any dellaloc that we started previously for the roots
+		 * that are going to be snapshotted. This is to avoid a corrupted
+		 * version of files in the snapshots that had both buffered and
+		 * direct IO writes (even if they were done sequentially).
+		 */
+		list_for_each_entry(pending, head, list)
+			btrfs_wait_ordered_extents(pending->root,
+						   U64_MAX, 0, U64_MAX);
+	}
 }
 
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
@@ -2024,7 +2056,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	extwriter_counter_dec(cur_trans, trans->type);
 
-	ret = btrfs_start_delalloc_flush(fs_info);
+	ret = btrfs_start_delalloc_flush(trans);
 	if (ret)
 		goto cleanup_transaction;
 
@@ -2040,7 +2072,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto cleanup_transaction;
 
-	btrfs_wait_delalloc_flush(fs_info);
+	btrfs_wait_delalloc_flush(trans);
 
 	btrfs_scrub_pause(fs_info);
 	/*
-- 
2.11.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix file corruption after snapshotting
  2019-02-27 12:47     ` David Sterba
@ 2019-02-27 13:42       ` Filipe Manana
  2019-02-27 17:19         ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-02-27 13:42 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs

On Wed, Feb 27, 2019 at 1:04 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Feb 18, 2019 at 05:27:54PM +0000, Filipe Manana wrote:
> > On Mon, Feb 18, 2019 at 5:09 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > When we are mixing buffered writes with direct IO writes against the same
> > > > file and snapshotting is happening concurrently, we can end up with a
> > > > corrupt file content in the snapshot. Example:
> > >
> > > The patch subject sounds like it's a corruption in generic snapshot
> > > behaviour. But it's when mixing buffered and direct io, which is a
> > > potential corruption scenario on itself, so snapshotting does make it
> > > that much worse. I'd like to see that somhow reflected in the subject.
> >
> > It's a kind of corruption created by snapshotting.
> > I tried to come up with a better subject that wasn't too long to fit
> > in the 74-75 characters limit, but couldn't come up with any.
> > So I left the explanation and example in the remainder of the change log.
> >
> > If you have a better suggestion... I'm open to it.
>
> The 74 chars applies namely to the changelog text, there are commits
> with long subject line (sample from 4.19 with 75 to 103). I don't mind
> if it's for better descriptivity.
>
> "btrfs: fix corruption after snapshotting file with mixed buffer/DIO writes"
>
> > > > 1) Inode/file is empty.
> > > >
> > > > 2) Snapshotting starts.
> > > >
> > > > 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
> > > >    inode to 256Kb, disk_i_size remains zero. This happens after the task
> > > >    doing the snapshot flushes all existing delalloc.
> > > >
> > > > 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
> > > >    completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
> > > >    updates the inode item in the fs tree with a size of 1Mb (which is
> > > >    the value of disk_i_size).
> > > >
> > > > 4) The dealloc for the range [0, 256Kb[ did not start yet.
> > > >
> > > > 5) The transaction used in the DIO ordered extent completion, which updated
> > > >    the inode item, is committed by the snapshotting task.
> > > >
> > > > 6) Snapshot creation completes.
> > > >
> > > > 7) Dealloc for the range [0, 256Kb[ is flushed.
> > > >
> > > > After that when reading the file from the snapshot we always get zeroes for
> > > > the range [0, 256Kb[, the file has a size of 1Mb and the data written by
> > > > the direct IO write is found. From an application's point of view this is
> > > > a corruption, since in the source subvolume it could never read a version
> > > > of the file that included the data from the direct IO write without the
> > > > data from the buffered write included as well. In the snapshot's tree,
> > > > file extent items are missing for the range [0, 256Kb[.
> > > >
> > > > The issue, obviously, does not happen when using the -o flushoncommit
> > > > mount option.
> > > >
> > > > Fix this by flushing delalloc for all the roots that are about to be
> > > > snapshotted when committing a transaction. This guarantees total ordering
> > > > when updating the disk_i_size of an inode since the flush for dealloc is
> > > > done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
> > > > is done once no more external writers exist. This is similar to what we
> > > > do when using the flushoncommit mount option, but we do it only if the
> > > > transaction has snapshots to create and only for the roots of the
> > > > subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
> > > > snapshot creation ioctl, so the flush work we do inside the transaction
> > > > is minimized.
> > > >
> > > > This issue, involving buffered and direct IO writes with snapshotting, is
> > > > often triggered by fstest btrfs/078, and got reported by fsck when not
> > > > using the NO_HOLES features, for example:
> > > >
> > > >   $ cat results/btrfs/078.full
> > > >   (...)
> > > >   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
> > > >   *** fsck.btrfs output ***
> > > >   [1/7] checking root items
> > > >   [2/7] checking extents
> > > >   [3/7] checking free space cache
> > > >   [4/7] checking fs roots
> > > >   root 258 inode 264 errors 100, file extent discount
> > > >   Found file extent holes:
> > > >         start: 524288, len: 65536
> > > >   ERROR: errors found in fs roots
> > > >
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >  fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
> > > >  1 file changed, 30 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > > index 4ec2b660d014..2e8f15eee2e8 100644
> > > > --- a/fs/btrfs/transaction.c
> > > > +++ b/fs/btrfs/transaction.c
> > > > @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
> > > >         }
> > > >  }
> > > >
> > > > -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > > > +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
> > > >  {
> > > > +     struct btrfs_fs_info *fs_info = trans->fs_info;
> > > > +
> > > >       /*
> > > >        * We use writeback_inodes_sb here because if we used
> > > >        * btrfs_start_delalloc_roots we would deadlock with fs freeze.
> > > > @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > > >        * from already being in a transaction and our join_transaction doesn't
> > > >        * have to re-take the fs freeze lock.
> > > >        */
> > > > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > > > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> > > >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > > > +     } else {
> > > > +             struct btrfs_pending_snapshot *pending;
> > > > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > > > +
> > >
> > > A comment would be good here as it's not obvious why the sync is done
> > > here (and similarly the waiting part in btrfs_wait_delalloc_flush).
> >
> > Intentionally left out due to the changelog explaining it and avoiding
> > a too long comment explaining why/how the corruption happens.
>
> I see, so what if the comment is only a short version giving pointers,
> something like the first paragraph of the changelog and the fix.
>
> /*
>  * Flush delalloc roots about to be snapshotted to guarantee total
>  * ordering when updating disk_i_size. This could happen for files with
>  * mixed buffered and direct IO
>  */
>
> >
> > >
> > > > +             list_for_each_entry(pending, head, list) {
> > > > +                     int ret;
> > > > +
> > > > +                     ret = btrfs_start_delalloc_snapshot(pending->root);
> > > > +                     if (ret)
> > > > +                             return ret;
> > >
> > > This adds a potential failure to the middle of transaction commit. I've
> > > checked the errors, there's EROFS (after a global fs error state) and
> > > ENOMEM (from start_delalloc_inodes).
> > >
> > > The EROFS could be filtered as a non-issue. ENOMEM means that the
> > > flushing was not possible and skipping it would bring back the problem
> > > this patch is fixing.
> > >
> > > So, how about calling writeback_inodes_sb in that case as a fallback?
> >
> > Thought about it, but the reason I didn't do it is that if you
> > fallback to writeback_unodes_sb, you'll never have the error reported
> > to the user.
> > It's very unlikely the user will do an fsync on the snapshot version
> > of the file, specially if it's a RO snapshot, which would be the only
> > way to report the error.
> >
> >
> > > I'd really like to avoid returning failure from
> > > btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
> > > theoretical error case anyway) should be ok.
> >
> > Me too, as long as the error is reported (a message in syslog/dmesg is
> > very likely to be missed).
>
> I probably don't understand. EROFS could be ignored and ENOMEM can be
> worked around. I don't see what needs to be reported to the user.

For the same reason we don't ignore the error from the initial flush
done in the ioctl (at create_snapshot()).
If the flush fails and we ignore the error, we risk getting a snapshot
with a corrupted version of files,
without the user knowing about it.

Yes, I know here, inside the transaction commit it means ending up in
an abort and turning the fs into RO mode,
which is very inconvenient.

It's a choice.

Anyway, an updated v2 that ignores any error was just sent.
This is probably something where different people will always have a
different preference.

thanks

>
> The goal is to make btrfs_start_delalloc_flush return void and drop the
> 'if (ret)' in transaction commit.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix file corruption after snapshotting
  2019-02-27 13:42       ` Filipe Manana
@ 2019-02-27 17:19         ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2019-02-27 17:19 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs

On Wed, Feb 27, 2019 at 01:42:31PM +0000, Filipe Manana wrote:
> > > > So, how about calling writeback_inodes_sb in that case as a fallback?
> > >
> > > Thought about it, but the reason I didn't do it is that if you
> > > fallback to writeback_unodes_sb, you'll never have the error reported
> > > to the user.
> > > It's very unlikely the user will do an fsync on the snapshot version
> > > of the file, specially if it's a RO snapshot, which would be the only
> > > way to report the error.
> > >
> > >
> > > > I'd really like to avoid returning failure from
> > > > btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
> > > > theoretical error case anyway) should be ok.
> > >
> > > Me too, as long as the error is reported (a message in syslog/dmesg is
> > > very likely to be missed).
> >
> > I probably don't understand. EROFS could be ignored and ENOMEM can be
> > worked around. I don't see what needs to be reported to the user.
> 
> For the same reason we don't ignore the error from the initial flush
> done in the ioctl (at create_snapshot()).
> If the flush fails and we ignore the error, we risk getting a snapshot
> with a corrupted version of files,
> without the user knowing about it.
> 
> Yes, I know here, inside the transaction commit it means ending up in
> an abort and turning the fs into RO mode,
> which is very inconvenient.

create_snapshot is quite different, because the error happens outside of
a transaction.  If it fails at btrfs_start_delalloc_snapshot, it goes
directly to a cleanup and back to userspace. Then it can be restarted
if necessary, the filesystem is still operable (unlike the whole system
that could have the memory exhausted if this was the reason of the
failure).

> It's a choice.

So the way I choose is by the overall impact and try to avoid the abort
if possible.

> Anyway, an updated v2 that ignores any error was just sent.
> This is probably something where different people will always have a
> different preference.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-27 13:42 ` [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes fdmanana
@ 2019-02-27 17:26   ` David Sterba
  2019-02-27 17:32     ` Filipe Manana
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2019-02-27 17:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote:
> @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
>  	 * from already being in a transaction and our join_transaction doesn't
>  	 * have to re-take the fs freeze lock.
>  	 */
> -	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> +	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
>  		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> +	} else {
> +		struct btrfs_pending_snapshot *pending;
> +		struct list_head *head = &trans->transaction->pending_snapshots;
> +
> +		/*
> +		 * Flush dellaloc for any root that is going to be snapshotted.
> +		 * This is done to avoid a corrupted version of files, in the
> +		 * snapshots, that had both buffered and direct IO writes (even
> +		 * if they were done sequentially) due to an unordered update of
> +		 * the inode's size on disk.
> +		 */
> +		list_for_each_entry(pending, head, list)
> +			btrfs_start_delalloc_snapshot(pending->root);
> +	}

A diff would be better than words, incremental on top of your patch:

@@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
                 * if they were done sequentially) due to an unordered update of
                 * the inode's size on disk.
                 */
-               list_for_each_entry(pending, head, list)
-                       btrfs_start_delalloc_snapshot(pending->root);
+               list_for_each_entry(pending, head, list) {
+                       int ret;
+
+                       ret = btrfs_start_delalloc_snapshot(pending->root);
+                       if (ret < 0) {
+                               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
+                               break;
+                       }
+               }
        }
        return 0;
 }
---

Making a switch by the exact error is probably not necessary and wouldn't be
future proof anyway.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-27 17:26   ` David Sterba
@ 2019-02-27 17:32     ` Filipe Manana
  2019-02-27 18:31       ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-02-27 17:32 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs

On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote:
> > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> >        * from already being in a transaction and our join_transaction doesn't
> >        * have to re-take the fs freeze lock.
> >        */
> > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > +     } else {
> > +             struct btrfs_pending_snapshot *pending;
> > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > +
> > +             /*
> > +              * Flush dellaloc for any root that is going to be snapshotted.
> > +              * This is done to avoid a corrupted version of files, in the
> > +              * snapshots, that had both buffered and direct IO writes (even
> > +              * if they were done sequentially) due to an unordered update of
> > +              * the inode's size on disk.
> > +              */
> > +             list_for_each_entry(pending, head, list)
> > +                     btrfs_start_delalloc_snapshot(pending->root);
> > +     }
>
> A diff would be better than words, incremental on top of your patch:

You mean, better than a full 2nd version patch I suppose.

>
> @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
>                  * if they were done sequentially) due to an unordered update of
>                  * the inode's size on disk.
>                  */
> -               list_for_each_entry(pending, head, list)
> -                       btrfs_start_delalloc_snapshot(pending->root);
> +               list_for_each_entry(pending, head, list) {
> +                       int ret;
> +
> +                       ret = btrfs_start_delalloc_snapshot(pending->root);
> +                       if (ret < 0) {
> +                               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> +                               break;
> +                       }

What do you expect by falling back to writeback_inodes_sb()?
It all ends up going through the same btrfs writeback path.
And as I left it, if an error happens for one root, it still tries to
flush writeback for all the remaining roots as well, so I don't get it
why you fallback to writeback_inodes_sb().

> +               }
>         }
>         return 0;
>  }
> ---
>
> Making a switch by the exact error is probably not necessary and wouldn't be
> future proof anyway.

Not sure I understood that sentence.

Anyway, it's not clear to me whether as it is it's fine, or do you
want an incremental patch, or something else.

thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-27 17:32     ` Filipe Manana
@ 2019-02-27 18:31       ` David Sterba
  2019-02-27 18:56         ` Filipe Manana
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2019-02-27 18:31 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs

On Wed, Feb 27, 2019 at 05:32:55PM +0000, Filipe Manana wrote:
> On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote:
> > > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > >        * from already being in a transaction and our join_transaction doesn't
> > >        * have to re-take the fs freeze lock.
> > >        */
> > > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> > >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > > +     } else {
> > > +             struct btrfs_pending_snapshot *pending;
> > > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > > +
> > > +             /*
> > > +              * Flush dellaloc for any root that is going to be snapshotted.
> > > +              * This is done to avoid a corrupted version of files, in the
> > > +              * snapshots, that had both buffered and direct IO writes (even
> > > +              * if they were done sequentially) due to an unordered update of
> > > +              * the inode's size on disk.
> > > +              */
> > > +             list_for_each_entry(pending, head, list)
> > > +                     btrfs_start_delalloc_snapshot(pending->root);
> > > +     }
> >
> > A diff would be better than words, incremental on top of your patch:
> 
> You mean, better than a full 2nd version patch I suppose.

I mean better than my attempts to explain by words the error handling
logic that I was proposing.

> >
> > @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
> >                  * if they were done sequentially) due to an unordered update of
> >                  * the inode's size on disk.
> >                  */
> > -               list_for_each_entry(pending, head, list)
> > -                       btrfs_start_delalloc_snapshot(pending->root);
> > +               list_for_each_entry(pending, head, list) {
> > +                       int ret;
> > +
> > +                       ret = btrfs_start_delalloc_snapshot(pending->root);
> > +                       if (ret < 0) {
> > +                               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > +                               break;
> > +                       }
> 
> What do you expect by falling back to writeback_inodes_sb()?
> It all ends up going through the same btrfs writeback path.
> And as I left it, if an error happens for one root, it still tries to
> flush writeback for all the remaining roots as well, so I don't get it
> why you fallback to writeback_inodes_sb().

As I read the changelog, you say that the corruption does not happen
with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be
too heavy, thus you only start the delalloc on snapshotted roots.

So the idea is to use the same logic of flushoncommit in the unlikely
case when btrfs_start_delalloc_snapshot would fail. Only at some
performance cost, unless I'm missing something.

As for the v2 as you implement it without any error handling, doesn't
this allow the corruption to happen? If start_delalloc_inodes has a lot
of inodes for which it needs to allocate delalloc_work, the failure is
possible. That the list_for_each continues does not affect that
particular root.

> > Making a switch by the exact error is probably not necessary and wouldn't be
> > future proof anyway.
> 
> Not sure I understood that sentence.

Under v1 I was suggesting to filter out all possible errors from
btrfs_start_delalloc_snapshot, EROFS and ENOMEM. So by 'making a switch'
I meant

	if (ret == -EROFS) {
		break;
	} else {
		writeback_inodes_sb();
		break;
	}

> Anyway, it's not clear to me whether as it is it's fine, or do you
> want an incremental patch, or something else.

I want to continue the discussion about the error handling.  The
incremental diff was to show actual code behind my idea.

If this weren't a correctness and commit related code I probably would
not go that far be ok with the errors and abort. So I'm hoping you
could help me find good options with low impact as the change affects
the default commit path.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-27 18:31       ` David Sterba
@ 2019-02-27 18:56         ` Filipe Manana
  2019-03-12 17:13           ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-02-27 18:56 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs

On Wed, Feb 27, 2019 at 6:30 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Feb 27, 2019 at 05:32:55PM +0000, Filipe Manana wrote:
> > On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote:
> > > > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> > > >        * from already being in a transaction and our join_transaction doesn't
> > > >        * have to re-take the fs freeze lock.
> > > >        */
> > > > -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> > > > +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
> > > >               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > > > +     } else {
> > > > +             struct btrfs_pending_snapshot *pending;
> > > > +             struct list_head *head = &trans->transaction->pending_snapshots;
> > > > +
> > > > +             /*
> > > > +              * Flush dellaloc for any root that is going to be snapshotted.
> > > > +              * This is done to avoid a corrupted version of files, in the
> > > > +              * snapshots, that had both buffered and direct IO writes (even
> > > > +              * if they were done sequentially) due to an unordered update of
> > > > +              * the inode's size on disk.
> > > > +              */
> > > > +             list_for_each_entry(pending, head, list)
> > > > +                     btrfs_start_delalloc_snapshot(pending->root);
> > > > +     }
> > >
> > > A diff would be better than words, incremental on top of your patch:
> >
> > You mean, better than a full 2nd version patch I suppose.
>
> I mean better than my attempts to explain by words the error handling
> logic that I was proposing.
>
> > >
> > > @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
> > >                  * if they were done sequentially) due to an unordered update of
> > >                  * the inode's size on disk.
> > >                  */
> > > -               list_for_each_entry(pending, head, list)
> > > -                       btrfs_start_delalloc_snapshot(pending->root);
> > > +               list_for_each_entry(pending, head, list) {
> > > +                       int ret;
> > > +
> > > +                       ret = btrfs_start_delalloc_snapshot(pending->root);
> > > +                       if (ret < 0) {
> > > +                               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> > > +                               break;
> > > +                       }
> >
> > What do you expect by falling back to writeback_inodes_sb()?
> > It all ends up going through the same btrfs writeback path.
> > And as I left it, if an error happens for one root, it still tries to
> > flush writeback for all the remaining roots as well, so I don't get it
> > why you fallback to writeback_inodes_sb().
>
> As I read the changelog, you say that the corruption does not happen
> with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be
> too heavy, thus you only start the delalloc on snapshotted roots.

It does not happen with flushoncommit because delalloc is flushed for all roots.
If flushoncommit (writeback_inodes_sb()) would flush only for roots
that are about to be snapshotted, the corruption wouldn't happen as
well.

>
> So the idea is to use the same logic of flushoncommit in the unlikely
> case when btrfs_start_delalloc_snapshot would fail. Only at some
> performance cost, unless I'm missing something.

Ok, so I hope the first paragraph above explains why there's no need
to fallback to the "flush all delalloc for all roots" logic
(writeback_inodes_sb()).

>
> As for the v2 as you implement it without any error handling, doesn't
> this allow the corruption to happen? If start_delalloc_inodes has a lot
> of inodes for which it needs to allocate delalloc_work, the failure is
> possible. That the list_for_each continues does not affect that
> particular root.

Yes, it allows for the corruption to happen, that's why I had the
error returned in v1.

writeback_inodes_sb() isn't special in any way - flushing some
delalloc range can fail the same way, it ends up calling the same
btrfs writepages() callback. Yes, it doesn't return any error, because
it's relying on callers either not caring about it, or if they do,
checking the inode's mapping for an error, which btrfs sets through
its writepages() callback if any an error happens (by calling
mapping_set_error()),
or any other fs specific way of storing/checking for writeback errors.

If we get an error when flushing the dealloc range from the example in
the changelog, then the corruption happens, regardless of writeback
being
triggered by writeback_inodes_sb() or btrfs_start_delalloc_snapshot().

>
> > > Making a switch by the exact error is probably not necessary and wouldn't be
> > > future proof anyway.
> >
> > Not sure I understood that sentence.
>
> Under v1 I was suggesting to filter out all possible errors from
> btrfs_start_delalloc_snapshot, EROFS and ENOMEM. So by 'making a switch'
> I meant
>
>         if (ret == -EROFS) {
>                 break;
>         } else {
>                 writeback_inodes_sb();
>                 break;
>         }
>
> > Anyway, it's not clear to me whether as it is it's fine, or do you
> > want an incremental patch, or something else.
>
> I want to continue the discussion about the error handling.  The
> incremental diff was to show actual code behind my idea.
>
> If this weren't a correctness and commit related code I probably would
> not go that far be ok with the errors and abort. So I'm hoping you
> could help me find good options with low impact as the change affects
> the default commit path.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
  2019-02-27 18:56         ` Filipe Manana
@ 2019-03-12 17:13           ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2019-03-12 17:13 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs

On Wed, Feb 27, 2019 at 06:56:10PM +0000, Filipe Manana wrote:
> > > What do you expect by falling back to writeback_inodes_sb()?
> > > It all ends up going through the same btrfs writeback path.
> > > And as I left it, if an error happens for one root, it still tries to
> > > flush writeback for all the remaining roots as well, so I don't get it
> > > why you fallback to writeback_inodes_sb().
> >
> > As I read the changelog, you say that the corruption does not happen
> > with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be
> > too heavy, thus you only start the delalloc on snapshotted roots.
> 
> It does not happen with flushoncommit because delalloc is flushed for all roots.
> If flushoncommit (writeback_inodes_sb()) would flush only for roots
> that are about to be snapshotted, the corruption wouldn't happen as
> well.
> 
> >
> > So the idea is to use the same logic of flushoncommit in the unlikely
> > case when btrfs_start_delalloc_snapshot would fail. Only at some
> > performance cost, unless I'm missing something.
> 
> Ok, so I hope the first paragraph above explains why there's no need
> to fallback to the "flush all delalloc for all roots" logic
> (writeback_inodes_sb()).

Yeah, I think I get it now.

> > As for the v2 as you implement it without any error handling, doesn't
> > this allow the corruption to happen? If start_delalloc_inodes has a lot
> > of inodes for which it needs to allocate delalloc_work, the failure is
> > possible. That the list_for_each continues does not affect that
> > particular root.
> 
> Yes, it allows for the corruption to happen, that's why I had the
> error returned in v1.
> 
> writeback_inodes_sb() isn't special in any way - flushing some
> delalloc range can fail the same way, it ends up calling the same
> btrfs writepages() callback. Yes, it doesn't return any error, because
> it's relying on callers either not caring about it, or if they do,
> checking the inode's mapping for an error, which btrfs sets through
> its writepages() callback if any an error happens (by calling
> mapping_set_error()),
> or any other fs specific way of storing/checking for writeback errors.

Ok, so the the error handling needs to stay and there's no simple way
around it.

> If we get an error when flushing the dealloc range from the example in
> the changelog, then the corruption happens, regardless of writeback
> being
> triggered by writeback_inodes_sb() or btrfs_start_delalloc_snapshot().

Thanks for the time discussing that. I'll use code from v1 and the
subject from v2 and add the patch to the queue.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-04 14:28 [PATCH] Btrfs: fix file corruption after snapshotting fdmanana
2019-02-18 17:11 ` David Sterba
2019-02-18 17:27   ` Filipe Manana
2019-02-27 12:47     ` David Sterba
2019-02-27 13:42       ` Filipe Manana
2019-02-27 17:19         ` David Sterba
2019-02-27 13:42 ` [PATCH v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes fdmanana
2019-02-27 17:26   ` David Sterba
2019-02-27 17:32     ` Filipe Manana
2019-02-27 18:31       ` David Sterba
2019-02-27 18:56         ` Filipe Manana
2019-03-12 17:13           ` David Sterba

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox