Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
@ 2019-09-10 14:26 fdmanana
  2019-09-11 16:04 ` Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: fdmanana @ 2019-09-10 14:26 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Sometimes when fsync'ing a file we need to log that other inodes exist and
when we need to do that we acquire a reference on the inodes and then drop
that reference using iput() after logging them.

That generally is not a problem except if we end up doing the final iput()
(dropping the last reference) on the inode and that inode has a link count
of 0, which can happen in a very short time window if the logging path
gets a reference on the inode while it's being unlinked.

In that case we end up getting the eviction callback, btrfs_evict_inode(),
invoked through the iput() call chain which needs to drop all of the
inode's items from its subvolume btree, and in order to do that, it needs
to join a transaction at the helper function evict_refill_and_join().
However because the task previously started a transaction at the fsync
handler, btrfs_sync_file(), it has current->journal_info already pointing
to a transaction handle and therefore evict_refill_and_join() will get
that transaction handle from btrfs_join_transaction(). From this point on,
two different problems can happen:

1) evict_refill_and_join() will often change the transaction handle's
   block reserve (->block_rsv) and set its ->bytes_reserved field to a
   value greater than 0. If evict_refill_and_join() never commits the
   transaction, the eviction handler ends up decreasing the reference
   count (->use_count) of the transaction handle through the call to
   btrfs_end_transaction(), and after that point we have a transaction
   handle with a NULL ->block_rsv (which is the value prior to the
   transaction join from evict_refill_and_join()) and a ->bytes_reserved
   value greater than 0. If after the eviction/iput completes the inode
   logging path hits an error or it decides that it must fallback to a
   transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
   non-zero value from btrfs_log_dentry_safe(), and because of that
   non-zero value it tries to commit the transaction using a handle with
   a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
   the transaction commit hit an assertion failure at
   btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
   the ->block_rsv is NULL. The produced stack trace for that is like the
   following:

   [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
   [192922.917553] ------------[ cut here ]------------
   [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
   [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
   [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
   [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
   [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
   (...)
   [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
   [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
   [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
   [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
   [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
   [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
   [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
   [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
   [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [192922.925105] Call Trace:
   [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
   [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
   [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
   [192922.926731]  do_fsync+0x38/0x60
   [192922.927138]  __x64_sys_fdatasync+0x13/0x20
   [192922.927543]  do_syscall_64+0x60/0x1c0
   [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
   (...)
   [192922.934077] ---[ end trace f00808b12068168f ]---

2) If evict_refill_and_join() decides to commit the transaction, it will
   be able to do it, since the nested transaction join only increments the
   transaction handle's ->use_count reference counter and it does not
   prevent the transaction from getting committed. This means that after
   eviction completes, the fsync logging path will be using a transaction
   handle that refers to an already committed transaction. What happens
   when using such a stale transaction can be unpredictable, we are at
   least having a use-after-free on the transaction handle itself, since
   the transaction commit will call kmem_cache_free() against the handle
   regardless of its ->use_count value, or we can end up silently losing
   all the updates to the log tree after that iput() in the logging path,
   or using a transaction handle that in the meanwhile was allocated to
   another task for a new transaction, etc, pretty much unpredictable
   what can happen.

In order to fix both of them, instead of using iput() during logging, use
btrfs_add_delayed_iput(), so that the logging path of fsync never drops
the last reference on an inode, that step is offloaded to a safe context
(usually the cleaner kthread).

The assertion failure issue was sporadically triggered by the test case
generic/475 from fstests, which loads the dm error target while fsstress
is running, which lead to fsync failing while logging inodes with -EIO
errors and then trying later to commit the transaction, triggering the
assertion failure.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6c8297bcfeb7..1bfd7e34f31e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 						      BTRFS_I(inode),
 						      LOG_OTHER_INODE_ALL,
 						      0, LLONG_MAX, ctx);
-					iput(inode);
+					btrfs_add_delayed_iput(inode);
 				}
 			}
 			continue;
@@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 		ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 				      LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
 		if (ret) {
-			iput(inode);
+			btrfs_add_delayed_iput(inode);
 			continue;
 		}
 
@@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 		key.offset = 0;
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			iput(inode);
+			btrfs_add_delayed_iput(inode);
 			continue;
 		}
 
@@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 			}
 			path->slots[0]++;
 		}
-		iput(inode);
+		btrfs_add_delayed_iput(inode);
 	}
 
 	return ret;
@@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 			}
 
 			if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
-				iput(di_inode);
+				btrfs_add_delayed_iput(di_inode);
 				break;
 			}
 
@@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 			if (!ret &&
 			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
 				ret = 1;
-			iput(di_inode);
+			btrfs_add_delayed_iput(di_inode);
 			if (ret)
 				goto next_dir_inode;
 			if (ctx->log_new_dentries) {
@@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
 			if (!ret && ctx && ctx->log_new_dentries)
 				ret = log_new_dir_dentries(trans, root,
 						   BTRFS_I(dir_inode), ctx);
-			iput(dir_inode);
+			btrfs_add_delayed_iput(dir_inode);
 			if (ret)
 				goto out;
 		}
@@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
 			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 					      LOG_INODE_EXISTS,
 					      0, LLONG_MAX, ctx);
-		iput(inode);
+		btrfs_add_delayed_iput(inode);
 		if (ret)
 			return ret;
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
@ 2019-09-11 16:04 ` Josef Bacik
  2019-09-12  7:30 ` Sasha Levin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2019-09-11 16:04 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 
> In order to fix both of them, instead of using iput() during logging, use
> btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> the last reference on an inode, that step is offloaded to a safe context
> (usually the cleaner kthread).
> 
> The assertion failure issue was sporadically triggered by the test case
> generic/475 from fstests, which loads the dm error target while fsstress
> is running, which lead to fsync failing while logging inodes with -EIO
> errors and then trying later to commit the transaction, triggering the
> assertion failure.
> 
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
  2019-09-11 16:04 ` Josef Bacik
@ 2019-09-12  7:30 ` Sasha Levin
  2019-09-12  8:02   ` Filipe Manana
  2019-09-12 11:10 ` Nikolay Borisov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Sasha Levin @ 2019-09-12  7:30 UTC (permalink / raw)
  To: Sasha Levin; +Cc: stable

Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: 4.4+

The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.

v5.2.14: Build OK!
v4.19.72: Failed to apply! Possible dependencies:
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")

v4.14.143: Failed to apply! Possible dependencies:
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")

v4.9.192: Failed to apply! Possible dependencies:
    0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
    da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
    db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
    de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
    fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")

v4.4.192: Failed to apply! Possible dependencies:
    0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
    09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
    0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
    4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
    52db400fcd50 ("pmem, dax: clean up clear_pmem()")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
    bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
    cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
    d1a5f2b4d8a1 ("block: use DAX for partition table reads")
    db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
    de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12  7:30 ` Sasha Levin
@ 2019-09-12  8:02   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2019-09-12  8:02 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Filipe Manana, linux-btrfs, stable

On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: 4.4+
>
> The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.
>
> v5.2.14: Build OK!
> v4.19.72: Failed to apply! Possible dependencies:
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.14.143: Failed to apply! Possible dependencies:
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.9.192: Failed to apply! Possible dependencies:
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>     fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")
>
> v4.4.192: Failed to apply! Possible dependencies:
>     0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
>     09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     52db400fcd50 ("pmem, dax: clean up clear_pmem()")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
>     bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     d1a5f2b4d8a1 ("block: use DAX for partition table reads")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?

After it lands on Linus' tree, I'll try to send patch versions that
apply to different kernel releases.

Thanks.

>
> --
> Thanks,
> Sasha



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
  2019-09-11 16:04 ` Josef Bacik
  2019-09-12  7:30 ` Sasha Levin
@ 2019-09-12 11:10 ` Nikolay Borisov
  2019-09-12 11:24   ` Filipe Manana
  2019-09-12 11:58 ` Josef Bacik
  2019-09-12 12:18 ` Josef Bacik
  4 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2019-09-12 11:10 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
evict_refill_and_join only ever calls btrfs_join_transaction so it
cannot ever commit the transaction. Let's look into its sole caller,
btrfs_evict_inode, I see 3 cases where transactions handle is involved
in that function:

1. btrfs_commit_inode_delayed_inode - iot calls trans_join +
end_transaction => use_count = 2 -> no commit

2. btrfs_truncate_inode_items -> never calls end/commit transaction

3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a
handle that's obtained via btrfs_join_transaction, meaning that trans
handle's ->use_count will be 2. This in turn will hit:

 if (refcount_read(&trans->use_count) > 1) {
                  refcount_dec(&trans->use_count);

                  trans->block_rsv = trans->orig_rsv;

                  return 0;

}

in __btrfs_end_transaction. That code is in direct contradiction with
what you describe next? Am I missing something here? If my analysis is
correct this implies case 1) above also cannot happen because
trans->block_rsv is set to NULL in __btrfs_end_transaction only if
use_count == 1.

>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 
> In order to fix both of them, instead of using iput() during logging, use
> btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> the last reference on an inode, that step is offloaded to a safe context
> (usually the cleaner kthread).
> 
> The assertion failure issue was sporadically triggered by the test case
> generic/475 from fstests, which loads the dm error target while fsstress
> is running, which lead to fsync failing while logging inodes with -EIO
> errors and then trying later to commit the transaction, triggering the
> assertion failure.
> 
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 6c8297bcfeb7..1bfd7e34f31e 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  						      BTRFS_I(inode),
>  						      LOG_OTHER_INODE_ALL,
>  						      0, LLONG_MAX, ctx);
> -					iput(inode);
> +					btrfs_add_delayed_iput(inode);
>  				}
>  			}
>  			continue;
> @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  		ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
>  				      LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
>  		if (ret) {
> -			iput(inode);
> +			btrfs_add_delayed_iput(inode);
>  			continue;
>  		}
>  
> @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  		key.offset = 0;
>  		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			iput(inode);
> +			btrfs_add_delayed_iput(inode);
>  			continue;
>  		}
>  
> @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  			}
>  			path->slots[0]++;
>  		}
> -		iput(inode);
> +		btrfs_add_delayed_iput(inode);
>  	}
>  
>  	return ret;
> @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  			}
>  
>  			if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
> -				iput(di_inode);
> +				btrfs_add_delayed_iput(di_inode);
>  				break;
>  			}
>  
> @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  			if (!ret &&
>  			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
>  				ret = 1;
> -			iput(di_inode);
> +			btrfs_add_delayed_iput(di_inode);
>  			if (ret)
>  				goto next_dir_inode;
>  			if (ctx->log_new_dentries) {
> @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
>  			if (!ret && ctx && ctx->log_new_dentries)
>  				ret = log_new_dir_dentries(trans, root,
>  						   BTRFS_I(dir_inode), ctx);
> -			iput(dir_inode);
> +			btrfs_add_delayed_iput(dir_inode);
>  			if (ret)
>  				goto out;
>  		}
> @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
>  			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
>  					      LOG_INODE_EXISTS,
>  					      0, LLONG_MAX, ctx);
> -		iput(inode);
> +		btrfs_add_delayed_iput(inode);
>  		if (ret)
>  			return ret;
>  
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12 11:10 ` Nikolay Borisov
@ 2019-09-12 11:24   ` Filipe Manana
  2019-09-12 11:35     ` Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-09-12 11:24 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> evict_refill_and_join only ever calls btrfs_join_transaction so it
> cannot ever commit the transaction.

It can:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399

>  Let's look into its sole caller,
> btrfs_evict_inode, I see 3 cases where transactions handle is involved
> in that function:
>
> 1. btrfs_commit_inode_delayed_inode - iot calls trans_join +
> end_transaction => use_count = 2 -> no commit
>
> 2. btrfs_truncate_inode_items -> never calls end/commit transaction
>
> 3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a
> handle that's obtained via btrfs_join_transaction, meaning that trans
> handle's ->use_count will be 2. This in turn will hit:
>
>  if (refcount_read(&trans->use_count) > 1) {
>                   refcount_dec(&trans->use_count);
>
>                   trans->block_rsv = trans->orig_rsv;
>
>                   return 0;
>
> }
>
> in __btrfs_end_transaction. That code is in direct contradiction with
> what you describe next?

No, it isn't.

I explicitly mention it in point 1):

"... the eviction handler ends up decreasing the reference
count (->use_count) of the transaction handle through the call to
btrfs_end_transaction(), and after that point we have a transaction
handle with a NULL ->block_rsv (which is the value prior to the
transaction join from evict_refill_and_join()) and a ->bytes_reserved
value greater than 0."


>  Am I missing something here? If my analysis is
> correct this implies case 1) above also cannot happen because
> trans->block_rsv is set to NULL in __btrfs_end_transaction only if
> use_count == 1.

Again, go read the change log and the code more carefully.

Thanks.

>
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
> >    eviction completes, the fsync logging path will be using a transaction
> >    handle that refers to an already committed transaction. What happens
> >    when using such a stale transaction can be unpredictable, we are at
> >    least having a use-after-free on the transaction handle itself, since
> >    the transaction commit will call kmem_cache_free() against the handle
> >    regardless of its ->use_count value, or we can end up silently losing
> >    all the updates to the log tree after that iput() in the logging path,
> >    or using a transaction handle that in the meanwhile was allocated to
> >    another task for a new transaction, etc, pretty much unpredictable
> >    what can happen.
> >
> > In order to fix both of them, instead of using iput() during logging, use
> > btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> > the last reference on an inode, that step is offloaded to a safe context
> > (usually the cleaner kthread).
> >
> > The assertion failure issue was sporadically triggered by the test case
> > generic/475 from fstests, which loads the dm error target while fsstress
> > is running, which lead to fsync failing while logging inodes with -EIO
> > errors and then trying later to commit the transaction, triggering the
> > assertion failure.
> >
> > CC: stable@vger.kernel.org # 4.4+
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/tree-log.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 6c8297bcfeb7..1bfd7e34f31e 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >                                                     BTRFS_I(inode),
> >                                                     LOG_OTHER_INODE_ALL,
> >                                                     0, LLONG_MAX, ctx);
> > -                                     iput(inode);
> > +                                     btrfs_add_delayed_iput(inode);
> >                               }
> >                       }
> >                       continue;
> > @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >               ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
> >                                     LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
> >               if (ret) {
> > -                     iput(inode);
> > +                     btrfs_add_delayed_iput(inode);
> >                       continue;
> >               }
> >
> > @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >               key.offset = 0;
> >               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >               if (ret < 0) {
> > -                     iput(inode);
> > +                     btrfs_add_delayed_iput(inode);
> >                       continue;
> >               }
> >
> > @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >                       }
> >                       path->slots[0]++;
> >               }
> > -             iput(inode);
> > +             btrfs_add_delayed_iput(inode);
> >       }
> >
> >       return ret;
> > @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
> >                       }
> >
> >                       if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
> > -                             iput(di_inode);
> > +                             btrfs_add_delayed_iput(di_inode);
> >                               break;
> >                       }
> >
> > @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
> >                       if (!ret &&
> >                           btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
> >                               ret = 1;
> > -                     iput(di_inode);
> > +                     btrfs_add_delayed_iput(di_inode);
> >                       if (ret)
> >                               goto next_dir_inode;
> >                       if (ctx->log_new_dentries) {
> > @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
> >                       if (!ret && ctx && ctx->log_new_dentries)
> >                               ret = log_new_dir_dentries(trans, root,
> >                                                  BTRFS_I(dir_inode), ctx);
> > -                     iput(dir_inode);
> > +                     btrfs_add_delayed_iput(dir_inode);
> >                       if (ret)
> >                               goto out;
> >               }
> > @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
> >                       ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
> >                                             LOG_INODE_EXISTS,
> >                                             0, LLONG_MAX, ctx);
> > -             iput(inode);
> > +             btrfs_add_delayed_iput(inode);
> >               if (ret)
> >                       return ret;
> >
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12 11:24   ` Filipe Manana
@ 2019-09-12 11:35     ` Nikolay Borisov
  0 siblings, 0 replies; 12+ messages in thread
From: Nikolay Borisov @ 2019-09-12 11:35 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 12.09.19 г. 14:24 ч., Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Sometimes when fsync'ing a file we need to log that other inodes exist and
>>> when we need to do that we acquire a reference on the inodes and then drop
>>> that reference using iput() after logging them.
>>>
>>> That generally is not a problem except if we end up doing the final iput()
>>> (dropping the last reference) on the inode and that inode has a link count
>>> of 0, which can happen in a very short time window if the logging path
>>> gets a reference on the inode while it's being unlinked.
>>>
>>> In that case we end up getting the eviction callback, btrfs_evict_inode(),
>>> invoked through the iput() call chain which needs to drop all of the
>>> inode's items from its subvolume btree, and in order to do that, it needs
>>> to join a transaction at the helper function evict_refill_and_join().
>>> However because the task previously started a transaction at the fsync
>>> handler, btrfs_sync_file(), it has current->journal_info already pointing
>>> to a transaction handle and therefore evict_refill_and_join() will get
>>> that transaction handle from btrfs_join_transaction(). From this point on,
>>> two different problems can happen:
>>>
>>> 1) evict_refill_and_join() will often change the transaction handle's
>>>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>>>    value greater than 0. If evict_refill_and_join() never commits the
>>>    transaction, the eviction handler ends up decreasing the reference
>>>    count (->use_count) of the transaction handle through the call to
>>>    btrfs_end_transaction(), and after that point we have a transaction
>>>    handle with a NULL ->block_rsv (which is the value prior to the
>>>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>>>    value greater than 0. If after the eviction/iput completes the inode
>>>    logging path hits an error or it decides that it must fallback to a
>>>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>>>    non-zero value from btrfs_log_dentry_safe(), and because of that
>>>    non-zero value it tries to commit the transaction using a handle with
>>>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>>>    the transaction commit hit an assertion failure at
>>>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>>>    the ->block_rsv is NULL. The produced stack trace for that is like the
>>>    following:
>>>
>>>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>>>    [192922.917553] ------------[ cut here ]------------
>>>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>>>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>>>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>>>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>>>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>>>    (...)
>>>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>>>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>>>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>>>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>>>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>>>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>>>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>>>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>>>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>    [192922.925105] Call Trace:
>>>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>>>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>>>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>>>    [192922.926731]  do_fsync+0x38/0x60
>>>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>>>    [192922.927543]  do_syscall_64+0x60/0x1c0
>>>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>    (...)
>>>    [192922.934077] ---[ end trace f00808b12068168f ]---
>>>
>>> 2) If evict_refill_and_join() decides to commit the transaction, it will
>> evict_refill_and_join only ever calls btrfs_join_transaction so it
>> cannot ever commit the transaction.
> 
> It can:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399

Right, in 5.3 branch yes but in misc-next evict_and_join looks
completely different, that's where the confusion stems from. Thanks

<snip>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
                   ` (2 preceding siblings ...)
  2019-09-12 11:10 ` Nikolay Borisov
@ 2019-09-12 11:58 ` Josef Bacik
  2019-09-12 13:17   ` Filipe Manana
  2019-09-12 12:18 ` Josef Bacik
  4 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2019-09-12 11:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after

This brings up a good point, we should probably not allow the commit in this
case, or add an ASSERT(use_count == 1) or something, cause this would be bad.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
                   ` (3 preceding siblings ...)
  2019-09-12 11:58 ` Josef Bacik
@ 2019-09-12 12:18 ` Josef Bacik
  2019-09-12 13:19   ` Filipe Manana
  4 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2019-09-12 12:18 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 

And talking it over with Nikolay I realized that since we're doing the commit
through the flushing this doesn't actually happen anymore.
may_commit_transaction() returns -EGAIN if we already have a trans handle open,
so hooray I made it safer by accident!  But we definitely should follow up and
add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
that's bad.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12 11:58 ` Josef Bacik
@ 2019-09-12 13:17   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2019-09-12 13:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 12, 2019 at 12:59 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
>
> This brings up a good point, we should probably not allow the commit in this
> case, or add an ASSERT(use_count == 1) or something, cause this would be bad.
> Thanks,

Yes, that should be separate change however.
I had warn_on(trans->use_count > 1) in the commit locally during
testing (didn't trigger in 4 days of fstests).

Thanks.
>
> Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12 12:18 ` Josef Bacik
@ 2019-09-12 13:19   ` Filipe Manana
  2019-09-12 15:14     ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2019-09-12 13:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
> >    eviction completes, the fsync logging path will be using a transaction
> >    handle that refers to an already committed transaction. What happens
> >    when using such a stale transaction can be unpredictable, we are at
> >    least having a use-after-free on the transaction handle itself, since
> >    the transaction commit will call kmem_cache_free() against the handle
> >    regardless of its ->use_count value, or we can end up silently losing
> >    all the updates to the log tree after that iput() in the logging path,
> >    or using a transaction handle that in the meanwhile was allocated to
> >    another task for a new transaction, etc, pretty much unpredictable
> >    what can happen.
> >
>
> And talking it over with Nikolay I realized that since we're doing the commit
> through the flushing this doesn't actually happen anymore.
> may_commit_transaction() returns -EGAIN if we already have a trans handle open,
> so hooray I made it safer by accident!  But we definitely should follow up and
> add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
> that's bad.  Thanks,

Yep, but problem 1) (assertion failure) is still valid after your
recent changes in misc-next.
Either way, this fix has to be added to stable releases to fix both
problems there.

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction
  2019-09-12 13:19   ` Filipe Manana
@ 2019-09-12 15:14     ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2019-09-12 15:14 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs

On Thu, Sep 12, 2019 at 02:19:55PM +0100, Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > > when we need to do that we acquire a reference on the inodes and then drop
> > > that reference using iput() after logging them.
> > >
> > > That generally is not a problem except if we end up doing the final iput()
> > > (dropping the last reference) on the inode and that inode has a link count
> > > of 0, which can happen in a very short time window if the logging path
> > > gets a reference on the inode while it's being unlinked.
> > >
> > > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > > invoked through the iput() call chain which needs to drop all of the
> > > inode's items from its subvolume btree, and in order to do that, it needs
> > > to join a transaction at the helper function evict_refill_and_join().
> > > However because the task previously started a transaction at the fsync
> > > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > > to a transaction handle and therefore evict_refill_and_join() will get
> > > that transaction handle from btrfs_join_transaction(). From this point on,
> > > two different problems can happen:
> > >
> > > 1) evict_refill_and_join() will often change the transaction handle's
> > >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> > >    value greater than 0. If evict_refill_and_join() never commits the
> > >    transaction, the eviction handler ends up decreasing the reference
> > >    count (->use_count) of the transaction handle through the call to
> > >    btrfs_end_transaction(), and after that point we have a transaction
> > >    handle with a NULL ->block_rsv (which is the value prior to the
> > >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> > >    value greater than 0. If after the eviction/iput completes the inode
> > >    logging path hits an error or it decides that it must fallback to a
> > >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> > >    non-zero value from btrfs_log_dentry_safe(), and because of that
> > >    non-zero value it tries to commit the transaction using a handle with
> > >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> > >    the transaction commit hit an assertion failure at
> > >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> > >    the ->block_rsv is NULL. The produced stack trace for that is like the
> > >    following:
> > >
> > >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> > >    [192922.917553] ------------[ cut here ]------------
> > >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> > >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> > >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> > >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> > >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> > >    (...)
> > >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> > >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> > >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> > >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> > >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> > >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> > >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> > >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> > >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >    [192922.925105] Call Trace:
> > >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> > >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> > >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> > >    [192922.926731]  do_fsync+0x38/0x60
> > >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> > >    [192922.927543]  do_syscall_64+0x60/0x1c0
> > >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >    (...)
> > >    [192922.934077] ---[ end trace f00808b12068168f ]---
> > >
> > > 2) If evict_refill_and_join() decides to commit the transaction, it will
> > >    be able to do it, since the nested transaction join only increments the
> > >    transaction handle's ->use_count reference counter and it does not
> > >    prevent the transaction from getting committed. This means that after
> > >    eviction completes, the fsync logging path will be using a transaction
> > >    handle that refers to an already committed transaction. What happens
> > >    when using such a stale transaction can be unpredictable, we are at
> > >    least having a use-after-free on the transaction handle itself, since
> > >    the transaction commit will call kmem_cache_free() against the handle
> > >    regardless of its ->use_count value, or we can end up silently losing
> > >    all the updates to the log tree after that iput() in the logging path,
> > >    or using a transaction handle that in the meanwhile was allocated to
> > >    another task for a new transaction, etc, pretty much unpredictable
> > >    what can happen.
> > >
> >
> > And talking it over with Nikolay I realized that since we're doing the commit
> > through the flushing this doesn't actually happen anymore.
> > may_commit_transaction() returns -EGAIN if we already have a trans handle open,
> > so hooray I made it safer by accident!  But we definitely should follow up and
> > add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
> > that's bad.  Thanks,
> 
> Yep, but problem 1) (assertion failure) is still valid after your
> recent changes in misc-next.
> Either way, this fix has to be added to stable releases to fix both
> problems there.
> 

Agreed, just adding context for the current changes.  This patch is fine as it
is, just some follow up stuff to catch transaction commits under nested trans
handles.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana
2019-09-11 16:04 ` Josef Bacik
2019-09-12  7:30 ` Sasha Levin
2019-09-12  8:02   ` Filipe Manana
2019-09-12 11:10 ` Nikolay Borisov
2019-09-12 11:24   ` Filipe Manana
2019-09-12 11:35     ` Nikolay Borisov
2019-09-12 11:58 ` Josef Bacik
2019-09-12 13:17   ` Filipe Manana
2019-09-12 12:18 ` Josef Bacik
2019-09-12 13:19   ` Filipe Manana
2019-09-12 15:14     ` Josef Bacik

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox