linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Btrfs: fix data loss after inode eviction, renaming it, and fsync it
@ 2019-06-06 11:07 fdmanana
  2019-06-06 13:19 ` [PATCH v2] " fdmanana
  2019-06-07 10:25 ` [PATCH v3] " fdmanana
  0 siblings, 2 replies; 4+ messages in thread
From: fdmanana @ 2019-06-06 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.

However, if we write data to a file, then evict its inode after all the
dealloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/foo
  $ touch /mnt/dir/bar

  # Do a direct IO write instead of a buffered write because with a
  # buffered write we would need to make sure dealloc gets flushed and
  # complete before we do the inode eviction later, and we can not do that
  # from user space with call to things such as sync(2) since that results
  # in a transaction commit as well.
  $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar

  # Keep the directory dir in use while we evict inodes. We want our file
  # bar's inode to be evicted but we don't want our directory's inode to
  # be evicted (if it were evicted too, we would not be able to reproduce
  # the issue since the first fsync below, of file foo, would result in a
  # transaction commit.
  $ ( cd /mnt/dir; while true; do :; done ) &
  $ pid=$!

  # Wait a bit to give time for the background process to chdir.
  $ sleep 0.1

  # Evict all inodes, except the inode for the directory dir because it is
  # currently in use by our background process.
  $ echo 2 > /proc/sys/vm/drop_caches

  # fsync file foo, which ends up persisting information about the parent
  # directory because it is a new inode.
  $ xfs_io -c fsync /mnt/dir/foo

  # Rename bar, this results in logging that this inode exists (inode item,
  # names, xattrs) because the parent directory is in the log.
  $ mv /mnt/dir/bar /mnt/dir/baz

  # Now fsync baz, which ends up doing absolutely nothing because of the
  # rename operation which logged that the inode exists only.
  $ xfs_io -c fsync /mnt/dir/baz

  <power failure>

  $ mount /dev/sdb /mnt
  $ od -t x1 -A d /mnt/dir/baz
  0000000

    --> Empty file, data we wrote is missing.

Fix this by not updating the logged_trans and last_sub_trans of an inode
when we are logging only that it exists and the inode was not yet logged
since it was loaded from disk (full_sync bit set).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 83755d3b96e3..8db230935098 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5407,10 +5407,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	spin_lock(&inode->lock);
-	inode->logged_trans = trans->transid;
-	inode->last_log_commit = inode->last_sub_trans;
-	spin_unlock(&inode->lock);
+	/*
+	 * Don't update logged_trans and last_log_commit if we logged that an
+	 * inode exists after it was loaded from memory (full_sync bit set).
+	 * This is to prevent data loss when we do a write to the inode, then
+	 * the inode gets evicted after all delalloc was flushed, then we log
+	 * it exists (due to a rename for example) and then fsync it. This last
+	 * fsync would do nothing (not logging the extents previously written).
+	 */
+	if (inode_only != LOG_INODE_EXISTS ||
+	    !test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags)) {
+		spin_lock(&inode->lock);
+		inode->logged_trans = trans->transid;
+		inode->last_log_commit = inode->last_sub_trans;
+		spin_unlock(&inode->lock);
+	}
 out_unlock:
 	mutex_unlock(&inode->log_mutex);
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v2] Btrfs: fix data loss after inode eviction, renaming it, and fsync it
  2019-06-06 11:07 [PATCH] Btrfs: fix data loss after inode eviction, renaming it, and fsync it fdmanana
@ 2019-06-06 13:19 ` fdmanana
  2019-06-07 10:25 ` [PATCH v3] " fdmanana
  1 sibling, 0 replies; 4+ messages in thread
From: fdmanana @ 2019-06-06 13:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.

However, if we write data to a file, then evict its inode after all the
dealloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/foo
  $ touch /mnt/dir/bar

  # Do a direct IO write instead of a buffered write because with a
  # buffered write we would need to make sure dealloc gets flushed and
  # complete before we do the inode eviction later, and we can not do that
  # from user space with call to things such as sync(2) since that results
  # in a transaction commit as well.
  $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar

  # Keep the directory dir in use while we evict inodes. We want our file
  # bar's inode to be evicted but we don't want our directory's inode to
  # be evicted (if it were evicted too, we would not be able to reproduce
  # the issue since the first fsync below, of file foo, would result in a
  # transaction commit.
  $ ( cd /mnt/dir; while true; do :; done ) &
  $ pid=$!

  # Wait a bit to give time for the background process to chdir.
  $ sleep 0.1

  # Evict all inodes, except the inode for the directory dir because it is
  # currently in use by our background process.
  $ echo 2 > /proc/sys/vm/drop_caches

  # fsync file foo, which ends up persisting information about the parent
  # directory because it is a new inode.
  $ xfs_io -c fsync /mnt/dir/foo

  # Rename bar, this results in logging that this inode exists (inode item,
  # names, xattrs) because the parent directory is in the log.
  $ mv /mnt/dir/bar /mnt/dir/baz

  # Now fsync baz, which ends up doing absolutely nothing because of the
  # rename operation which logged that the inode exists only.
  $ xfs_io -c fsync /mnt/dir/baz

  <power failure>

  $ mount /dev/sdb /mnt
  $ od -t x1 -A d /mnt/dir/baz
  0000000

    --> Empty file, data we wrote is missing.

Fix this by not updating the logged_trans and last_sub_trans of an inode
when we are logging only that it exists and the inode was not yet logged
since it was loaded from disk (full_sync bit set).

Fixes: 3a5f1d458ad161 ("Btrfs: Optimize btree walking while logging inodes")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Fixed typo in comment "from memory" -> "to memory", and added missing
    "Fixes:" tag.

 fs/btrfs/tree-log.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 83755d3b96e3..3c208c563f5e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5407,10 +5407,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	spin_lock(&inode->lock);
-	inode->logged_trans = trans->transid;
-	inode->last_log_commit = inode->last_sub_trans;
-	spin_unlock(&inode->lock);
+	/*
+	 * Don't update logged_trans and last_log_commit if we logged that an
+	 * inode exists after it was loaded to memory (full_sync bit set).
+	 * This is to prevent data loss when we do a write to the inode, then
+	 * the inode gets evicted after all delalloc was flushed, then we log
+	 * it exists (due to a rename for example) and then fsync it. This last
+	 * fsync would do nothing (not logging the extents previously written).
+	 */
+	if (inode_only != LOG_INODE_EXISTS ||
+	    !test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags)) {
+		spin_lock(&inode->lock);
+		inode->logged_trans = trans->transid;
+		inode->last_log_commit = inode->last_sub_trans;
+		spin_unlock(&inode->lock);
+	}
 out_unlock:
 	mutex_unlock(&inode->log_mutex);
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v3] Btrfs: fix data loss after inode eviction, renaming it, and fsync it
  2019-06-06 11:07 [PATCH] Btrfs: fix data loss after inode eviction, renaming it, and fsync it fdmanana
  2019-06-06 13:19 ` [PATCH v2] " fdmanana
@ 2019-06-07 10:25 ` fdmanana
  2019-06-11 17:02   ` David Sterba
  1 sibling, 1 reply; 4+ messages in thread
From: fdmanana @ 2019-06-07 10:25 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.

However, if we write data to a file, then evict its inode after all the
dealloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/foo
  $ touch /mnt/dir/bar

  # Do a direct IO write instead of a buffered write because with a
  # buffered write we would need to make sure dealloc gets flushed and
  # complete before we do the inode eviction later, and we can not do that
  # from user space with call to things such as sync(2) since that results
  # in a transaction commit as well.
  $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar

  # Keep the directory dir in use while we evict inodes. We want our file
  # bar's inode to be evicted but we don't want our directory's inode to
  # be evicted (if it were evicted too, we would not be able to reproduce
  # the issue since the first fsync below, of file foo, would result in a
  # transaction commit.
  $ ( cd /mnt/dir; while true; do :; done ) &
  $ pid=$!

  # Wait a bit to give time for the background process to chdir.
  $ sleep 0.1

  # Evict all inodes, except the inode for the directory dir because it is
  # currently in use by our background process.
  $ echo 2 > /proc/sys/vm/drop_caches

  # fsync file foo, which ends up persisting information about the parent
  # directory because it is a new inode.
  $ xfs_io -c fsync /mnt/dir/foo

  # Rename bar, this results in logging that this inode exists (inode item,
  # names, xattrs) because the parent directory is in the log.
  $ mv /mnt/dir/bar /mnt/dir/baz

  # Now fsync baz, which ends up doing absolutely nothing because of the
  # rename operation which logged that the inode exists only.
  $ xfs_io -c fsync /mnt/dir/baz

  <power failure>

  $ mount /dev/sdb /mnt
  $ od -t x1 -A d /mnt/dir/baz
  0000000

    --> Empty file, data we wrote is missing.

Fix this by not updating last_sub_trans of an inode when we are logging
only that it exists and the inode was not yet logged since it was loaded
from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
return false for this scenario and make us log the inode. The logged_trans
of the inode is still always setsince that alone is used to track if names
need to be deleted as part of unlink operations.

Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Fixed typo in comment "from memory" -> "to memory", and added missing
    "Fixes:" tag.
V3: Updated the logic to still set the inode's logged_trans always, and
    updated change log to justify it.

 fs/btrfs/tree-log.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 83755d3b96e3..40c839161596 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5407,9 +5407,19 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		}
 	}
 
+	/*
+	 * Don't update last_log_commit if we logged that an inode exists after
+	 * it was loaded to memory (full_sync bit set).
+	 * This is to prevent data loss when we do a write to the inode, then
+	 * the inode gets evicted after all delalloc was flushed, then we log
+	 * it exists (due to a rename for example) and then fsync it. This last
+	 * fsync would do nothing (not logging the extents previously written).
+	 */
 	spin_lock(&inode->lock);
 	inode->logged_trans = trans->transid;
-	inode->last_log_commit = inode->last_sub_trans;
+	if (inode_only != LOG_INODE_EXISTS ||
+	    !test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags))
+		inode->last_log_commit = inode->last_sub_trans;
 	spin_unlock(&inode->lock);
 out_unlock:
 	mutex_unlock(&inode->log_mutex);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] Btrfs: fix data loss after inode eviction, renaming it, and fsync it
  2019-06-07 10:25 ` [PATCH v3] " fdmanana
@ 2019-06-11 17:02   ` David Sterba
  0 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2019-06-11 17:02 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, Jun 07, 2019 at 11:25:24AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we log an inode, regardless of logging it completely or only that it
> exists, we always update it as logged (logged_trans and last_log_commit
> fields of the inode are updated). This is generally fine and avoids future
> attempts to log it from having to do repeated work that brings no value.
> 
> However, if we write data to a file, then evict its inode after all the
> dealloc was flushed (and ordered extents completed), rename the file and
> fsync it, we end up not logging the new extents, since the rename may
> result in logging that the inode exists in case the parent directory was
> logged before. The following reproducer shows and explains how this can
> happen:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ mkdir /mnt/dir
>   $ touch /mnt/dir/foo
>   $ touch /mnt/dir/bar
> 
>   # Do a direct IO write instead of a buffered write because with a
>   # buffered write we would need to make sure dealloc gets flushed and
>   # complete before we do the inode eviction later, and we can not do that
>   # from user space with call to things such as sync(2) since that results
>   # in a transaction commit as well.
>   $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
> 
>   # Keep the directory dir in use while we evict inodes. We want our file
>   # bar's inode to be evicted but we don't want our directory's inode to
>   # be evicted (if it were evicted too, we would not be able to reproduce
>   # the issue since the first fsync below, of file foo, would result in a
>   # transaction commit.
>   $ ( cd /mnt/dir; while true; do :; done ) &
>   $ pid=$!
> 
>   # Wait a bit to give time for the background process to chdir.
>   $ sleep 0.1
> 
>   # Evict all inodes, except the inode for the directory dir because it is
>   # currently in use by our background process.
>   $ echo 2 > /proc/sys/vm/drop_caches
> 
>   # fsync file foo, which ends up persisting information about the parent
>   # directory because it is a new inode.
>   $ xfs_io -c fsync /mnt/dir/foo
> 
>   # Rename bar, this results in logging that this inode exists (inode item,
>   # names, xattrs) because the parent directory is in the log.
>   $ mv /mnt/dir/bar /mnt/dir/baz
> 
>   # Now fsync baz, which ends up doing absolutely nothing because of the
>   # rename operation which logged that the inode exists only.
>   $ xfs_io -c fsync /mnt/dir/baz
> 
>   <power failure>
> 
>   $ mount /dev/sdb /mnt
>   $ od -t x1 -A d /mnt/dir/baz
>   0000000
> 
>     --> Empty file, data we wrote is missing.
> 
> Fix this by not updating last_sub_trans of an inode when we are logging
> only that it exists and the inode was not yet logged since it was loaded
> from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
> return false for this scenario and make us log the inode. The logged_trans
> of the inode is still always setsince that alone is used to track if names
> need to be deleted as part of unlink operations.
> 
> Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-06-11 17:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-06 11:07 [PATCH] Btrfs: fix data loss after inode eviction, renaming it, and fsync it fdmanana
2019-06-06 13:19 ` [PATCH v2] " fdmanana
2019-06-07 10:25 ` [PATCH v3] " fdmanana
2019-06-11 17:02   ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).