linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] btrfs: fix incorrect updating of log root tree
@ 2019-09-30 20:27 Josef Bacik
  2019-10-01 11:46 ` Filipe Manana
  2019-10-01 16:39 ` David Sterba
  0 siblings, 2 replies; 3+ messages in thread
From: Josef Bacik @ 2019-09-30 20:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: stable, Chris Mason

We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read.  Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.

The modification of the individual log root items are serialized on the
per-log root root_mutex.  This means that any modification to the
per-subvol log root_item is completely protected.

However we update the root item in the log root tree outside of the log
root tree log_mutex.  We do this in order to allow multiple subvolumes
to be updated in each log transaction.

This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information.  Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.

This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at.  Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.

Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree.  This way we do not modify the
log_root_tree while we're committing it, fixing the problem.

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chris Mason <clm@fb.com>
---
 fs/btrfs/tree-log.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7cac09a6f007..1d7f22951ef2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2908,7 +2908,8 @@ static int walk_log_tree(struct btrfs_trans_handle *trans,
  * in the tree of log roots
  */
 static int update_log_root(struct btrfs_trans_handle *trans,
-			   struct btrfs_root *log)
+			   struct btrfs_root *log,
+			   struct btrfs_root_item *root_item)
 {
 	struct btrfs_fs_info *fs_info = log->fs_info;
 	int ret;
@@ -2916,10 +2917,10 @@ static int update_log_root(struct btrfs_trans_handle *trans,
 	if (log->log_transid == 1) {
 		/* insert root item on the first sync */
 		ret = btrfs_insert_root(trans, fs_info->log_root_tree,
-				&log->root_key, &log->root_item);
+				&log->root_key, root_item);
 	} else {
 		ret = btrfs_update_root(trans, fs_info->log_root_tree,
-				&log->root_key, &log->root_item);
+				&log->root_key, root_item);
 	}
 	return ret;
 }
@@ -3017,6 +3018,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log = root->log_root;
 	struct btrfs_root *log_root_tree = fs_info->log_root_tree;
+	struct btrfs_root_item new_root_item;
 	int log_transid = 0;
 	struct btrfs_log_ctx root_log_ctx;
 	struct blk_plug plug;
@@ -3080,17 +3082,25 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	/*
+	 * We _must_ update under the root->log_mutex in order to make sure we
+	 * have a consistent view of the log root we are trying to commit at
+	 * this moment.
+	 *
+	 * We _must_ copy this into a local copy, because we are not holding the
+	 * log_root_tree->log_mutex yet.  This is important because when we
+	 * commit the log_root_tree we must have a consistent view of the
+	 * log_root_tree when we update the super block to point at the
+	 * log_root_tree bytenr.  If we update the log_root_tree here we'll race
+	 * with the commit and possibly point at the new block which we may not
+	 * have written out.
+	 */
 	btrfs_set_root_node(&log->root_item, log->node);
+	memcpy(&new_root_item, &log->root_item, sizeof(new_root_item));
 
 	root->log_transid++;
 	log->log_transid = root->log_transid;
 	root->log_start_pid = 0;
-	/*
-	 * Update or create log root item under the root's log_mutex to prevent
-	 * races with concurrent log syncs that can lead to failure to update
-	 * log root item because it was not created yet.
-	 */
-	ret = update_log_root(trans, log);
 	/*
 	 * IO has been started, blocks of the log tree have WRITTEN flag set
 	 * in their headers. new modifications of the log will be written to
@@ -3111,6 +3121,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	mutex_unlock(&log_root_tree->log_mutex);
 
 	mutex_lock(&log_root_tree->log_mutex);
+
+	/*
+	 * Now we are safe to update the log_root_tree because we're under the
+	 * log_mutex, and we're a current writer so we're holding the commit
+	 * open until we drop the log_mutex.
+	 */
+	ret = update_log_root(trans, log, &new_root_item);
+
 	if (atomic_dec_and_test(&log_root_tree->log_writers)) {
 		/* atomic_dec_and_test implies a barrier */
 		cond_wake_up_nomb(&log_root_tree->log_writer_wait);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] btrfs: fix incorrect updating of log root tree
  2019-09-30 20:27 [PATCH] btrfs: fix incorrect updating of log root tree Josef Bacik
@ 2019-10-01 11:46 ` Filipe Manana
  2019-10-01 16:39 ` David Sterba
  1 sibling, 0 replies; 3+ messages in thread
From: Filipe Manana @ 2019-10-01 11:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable, Chris Mason

On Mon, Sep 30, 2019 at 11:25 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> We've historically had reports of being unable to mount file systems
> because the tree log root couldn't be read.  Usually this is the "parent
> transid failure", but could be any of the related errors, including
> "fsid mismatch" or "bad tree block", depending on which block got
> allocated.
>
> The modification of the individual log root items are serialized on the
> per-log root root_mutex.  This means that any modification to the
> per-subvol log root_item is completely protected.
>
> However we update the root item in the log root tree outside of the log
> root tree log_mutex.  We do this in order to allow multiple subvolumes
> to be updated in each log transaction.
>
> This is problematic however because when we are writing the log root
> tree out we update the super block with the _current_ log root node
> information.  Since these two operations happen independently of each
> other, you can end up updating the log root tree in between writing out
> the dirty blocks and setting the super block to point at the current
> root.
>
> This means we'll point at the new root node that hasn't been written
> out, instead of the one we should be pointing at.  Thus whatever garbage
> or old block we end up pointing at complains when we mount the file
> system later and try to replay the log.
>
> Fix this by copying the log's root item into a local root item copy.
> Then once we're safely under the log_root_tree->log_mutex we update the
> root item in the log_root_tree.  This way we do not modify the
> log_root_tree while we're committing it, fixing the problem.
>
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Chris Mason <clm@fb.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good to me, great catch!
Thanks.

> ---
>  fs/btrfs/tree-log.c | 36 +++++++++++++++++++++++++++---------
>  1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 7cac09a6f007..1d7f22951ef2 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2908,7 +2908,8 @@ static int walk_log_tree(struct btrfs_trans_handle *trans,
>   * in the tree of log roots
>   */
>  static int update_log_root(struct btrfs_trans_handle *trans,
> -                          struct btrfs_root *log)
> +                          struct btrfs_root *log,
> +                          struct btrfs_root_item *root_item)
>  {
>         struct btrfs_fs_info *fs_info = log->fs_info;
>         int ret;
> @@ -2916,10 +2917,10 @@ static int update_log_root(struct btrfs_trans_handle *trans,
>         if (log->log_transid == 1) {
>                 /* insert root item on the first sync */
>                 ret = btrfs_insert_root(trans, fs_info->log_root_tree,
> -                               &log->root_key, &log->root_item);
> +                               &log->root_key, root_item);
>         } else {
>                 ret = btrfs_update_root(trans, fs_info->log_root_tree,
> -                               &log->root_key, &log->root_item);
> +                               &log->root_key, root_item);
>         }
>         return ret;
>  }
> @@ -3017,6 +3018,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_root *log = root->log_root;
>         struct btrfs_root *log_root_tree = fs_info->log_root_tree;
> +       struct btrfs_root_item new_root_item;
>         int log_transid = 0;
>         struct btrfs_log_ctx root_log_ctx;
>         struct blk_plug plug;
> @@ -3080,17 +3082,25 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                 goto out;
>         }
>
> +       /*
> +        * We _must_ update under the root->log_mutex in order to make sure we
> +        * have a consistent view of the log root we are trying to commit at
> +        * this moment.
> +        *
> +        * We _must_ copy this into a local copy, because we are not holding the
> +        * log_root_tree->log_mutex yet.  This is important because when we
> +        * commit the log_root_tree we must have a consistent view of the
> +        * log_root_tree when we update the super block to point at the
> +        * log_root_tree bytenr.  If we update the log_root_tree here we'll race
> +        * with the commit and possibly point at the new block which we may not
> +        * have written out.
> +        */
>         btrfs_set_root_node(&log->root_item, log->node);
> +       memcpy(&new_root_item, &log->root_item, sizeof(new_root_item));
>
>         root->log_transid++;
>         log->log_transid = root->log_transid;
>         root->log_start_pid = 0;
> -       /*
> -        * Update or create log root item under the root's log_mutex to prevent
> -        * races with concurrent log syncs that can lead to failure to update
> -        * log root item because it was not created yet.
> -        */
> -       ret = update_log_root(trans, log);
>         /*
>          * IO has been started, blocks of the log tree have WRITTEN flag set
>          * in their headers. new modifications of the log will be written to
> @@ -3111,6 +3121,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         mutex_unlock(&log_root_tree->log_mutex);
>
>         mutex_lock(&log_root_tree->log_mutex);
> +
> +       /*
> +        * Now we are safe to update the log_root_tree because we're under the
> +        * log_mutex, and we're a current writer so we're holding the commit
> +        * open until we drop the log_mutex.
> +        */
> +       ret = update_log_root(trans, log, &new_root_item);
> +
>         if (atomic_dec_and_test(&log_root_tree->log_writers)) {
>                 /* atomic_dec_and_test implies a barrier */
>                 cond_wake_up_nomb(&log_root_tree->log_writer_wait);
> --
> 2.21.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] btrfs: fix incorrect updating of log root tree
  2019-09-30 20:27 [PATCH] btrfs: fix incorrect updating of log root tree Josef Bacik
  2019-10-01 11:46 ` Filipe Manana
@ 2019-10-01 16:39 ` David Sterba
  1 sibling, 0 replies; 3+ messages in thread
From: David Sterba @ 2019-10-01 16:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable, Chris Mason

On Mon, Sep 30, 2019 at 04:27:25PM -0400, Josef Bacik wrote:
> We've historically had reports of being unable to mount file systems
> because the tree log root couldn't be read.  Usually this is the "parent
> transid failure", but could be any of the related errors, including
> "fsid mismatch" or "bad tree block", depending on which block got
> allocated.
> 
> The modification of the individual log root items are serialized on the
> per-log root root_mutex.  This means that any modification to the
> per-subvol log root_item is completely protected.
> 
> However we update the root item in the log root tree outside of the log
> root tree log_mutex.  We do this in order to allow multiple subvolumes
> to be updated in each log transaction.
> 
> This is problematic however because when we are writing the log root
> tree out we update the super block with the _current_ log root node
> information.  Since these two operations happen independently of each
> other, you can end up updating the log root tree in between writing out
> the dirty blocks and setting the super block to point at the current
> root.
> 
> This means we'll point at the new root node that hasn't been written
> out, instead of the one we should be pointing at.  Thus whatever garbage
> or old block we end up pointing at complains when we mount the file
> system later and try to replay the log.
> 
> Fix this by copying the log's root item into a local root item copy.
> Then once we're safely under the log_root_tree->log_mutex we update the
> root item in the log_root_tree.  This way we do not modify the
> log_root_tree while we're committing it, fixing the problem.
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Chris Mason <clm@fb.com>

Added to 5.4 queue, thanks.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-10-01 16:39 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-30 20:27 [PATCH] btrfs: fix incorrect updating of log root tree Josef Bacik
2019-10-01 11:46 ` Filipe Manana
2019-10-01 16:39 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).