* [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction @ 2019-09-10 14:26 fdmanana 2019-09-11 16:04 ` Josef Bacik ` (4 more replies) 0 siblings, 5 replies; 14+ messages in thread From: fdmanana @ 2019-09-10 14:26 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/tree-log.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6c8297bcfeb7..1bfd7e34f31e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, BTRFS_I(inode), LOG_OTHER_INODE_ALL, 0, LLONG_MAX, ctx); - iput(inode); + btrfs_add_delayed_iput(inode); } } continue; @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_OTHER_INODE, 0, LLONG_MAX, ctx); if (ret) { - iput(inode); + btrfs_add_delayed_iput(inode); continue; } @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) { - iput(inode); + btrfs_add_delayed_iput(inode); continue; } @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, } path->slots[0]++; } - iput(inode); + btrfs_add_delayed_iput(inode); } return ret; @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, } if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); break; } @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, if (!ret && btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) ret = 1; - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, BTRFS_I(dir_inode), ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_INODE_EXISTS, 0, LLONG_MAX, ctx); - iput(inode); + btrfs_add_delayed_iput(inode); if (ret) return ret; -- 2.11.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana @ 2019-09-11 16:04 ` Josef Bacik 2019-09-12 7:30 ` Sasha Levin ` (3 subsequent siblings) 4 siblings, 0 replies; 14+ messages in thread From: Josef Bacik @ 2019-09-11 16:04 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Sometimes when fsync'ing a file we need to log that other inodes exist and > when we need to do that we acquire a reference on the inodes and then drop > that reference using iput() after logging them. > > That generally is not a problem except if we end up doing the final iput() > (dropping the last reference) on the inode and that inode has a link count > of 0, which can happen in a very short time window if the logging path > gets a reference on the inode while it's being unlinked. > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > invoked through the iput() call chain which needs to drop all of the > inode's items from its subvolume btree, and in order to do that, it needs > to join a transaction at the helper function evict_refill_and_join(). > However because the task previously started a transaction at the fsync > handler, btrfs_sync_file(), it has current->journal_info already pointing > to a transaction handle and therefore evict_refill_and_join() will get > that transaction handle from btrfs_join_transaction(). From this point on, > two different problems can happen: > > 1) evict_refill_and_join() will often change the transaction handle's > block reserve (->block_rsv) and set its ->bytes_reserved field to a > value greater than 0. If evict_refill_and_join() never commits the > transaction, the eviction handler ends up decreasing the reference > count (->use_count) of the transaction handle through the call to > btrfs_end_transaction(), and after that point we have a transaction > handle with a NULL ->block_rsv (which is the value prior to the > transaction join from evict_refill_and_join()) and a ->bytes_reserved > value greater than 0. If after the eviction/iput completes the inode > logging path hits an error or it decides that it must fallback to a > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > non-zero value from btrfs_log_dentry_safe(), and because of that > non-zero value it tries to commit the transaction using a handle with > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > the transaction commit hit an assertion failure at > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > the ->block_rsv is NULL. The produced stack trace for that is like the > following: > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > [192922.917553] ------------[ cut here ]------------ > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > (...) > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [192922.925105] Call Trace: > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > [192922.926731] do_fsync+0x38/0x60 > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > [192922.927543] do_syscall_64+0x60/0x1c0 > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > (...) > [192922.934077] ---[ end trace f00808b12068168f ]--- > > 2) If evict_refill_and_join() decides to commit the transaction, it will > be able to do it, since the nested transaction join only increments the > transaction handle's ->use_count reference counter and it does not > prevent the transaction from getting committed. This means that after > eviction completes, the fsync logging path will be using a transaction > handle that refers to an already committed transaction. What happens > when using such a stale transaction can be unpredictable, we are at > least having a use-after-free on the transaction handle itself, since > the transaction commit will call kmem_cache_free() against the handle > regardless of its ->use_count value, or we can end up silently losing > all the updates to the log tree after that iput() in the logging path, > or using a transaction handle that in the meanwhile was allocated to > another task for a new transaction, etc, pretty much unpredictable > what can happen. > > In order to fix both of them, instead of using iput() during logging, use > btrfs_add_delayed_iput(), so that the logging path of fsync never drops > the last reference on an inode, that step is offloaded to a safe context > (usually the cleaner kthread). > > The assertion failure issue was sporadically triggered by the test case > generic/475 from fstests, which loads the dm error target while fsstress > is running, which lead to fsync failing while logging inodes with -EIO > errors and then trying later to commit the transaction, triggering the > assertion failure. > > CC: stable@vger.kernel.org # 4.4+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana 2019-09-11 16:04 ` Josef Bacik @ 2019-09-12 7:30 ` Sasha Levin 2019-09-12 8:02 ` Filipe Manana 2019-09-16 12:59 ` Filipe Manana 2019-09-12 11:10 ` Nikolay Borisov ` (2 subsequent siblings) 4 siblings, 2 replies; 14+ messages in thread From: Sasha Levin @ 2019-09-12 7:30 UTC (permalink / raw) To: Sasha Levin, Filipe Manana, linux-btrfs; +Cc: stable Hi, [This is an automated email] This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 4.4+ The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192. v5.2.14: Build OK! v4.19.72: Failed to apply! Possible dependencies: 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") v4.14.143: Failed to apply! Possible dependencies: 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") 8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks") a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") v4.9.192: Failed to apply! Possible dependencies: 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info") db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere") v4.4.192: Failed to apply! Possible dependencies: 0132761017e0 ("btrfs: fix string and comment grammatical issues and typos") 09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros") 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") 0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations") 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") 52db400fcd50 ("pmem, dax: clean up clear_pmem()") 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") 781feef7e6be ("Btrfs: fix lockdep warning about log_mutex") a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()") bb7ab3b92e46 ("btrfs: Fix misspellings in comments.") cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") d1a5f2b4d8a1 ("block: use DAX for partition table reads") db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") NOTE: The patch will not be queued to stable trees until it is upstream. How should we proceed with this patch? -- Thanks, Sasha ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 7:30 ` Sasha Levin @ 2019-09-12 8:02 ` Filipe Manana 2019-09-16 12:59 ` Filipe Manana 1 sibling, 0 replies; 14+ messages in thread From: Filipe Manana @ 2019-09-12 8:02 UTC (permalink / raw) To: Sasha Levin; +Cc: Filipe Manana, linux-btrfs, stable On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote: > > Hi, > > [This is an automated email] > > This commit has been processed because it contains a -stable tag. > The stable tag indicates that it's relevant for the following trees: 4.4+ > > The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192. > > v5.2.14: Build OK! > v4.19.72: Failed to apply! Possible dependencies: > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > v4.14.143: Failed to apply! Possible dependencies: > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > 8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > v4.9.192: Failed to apply! Possible dependencies: > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info") > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere") > > v4.4.192: Failed to apply! Possible dependencies: > 0132761017e0 ("btrfs: fix string and comment grammatical issues and typos") > 09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros") > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > 52db400fcd50 ("pmem, dax: clean up clear_pmem()") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > 781feef7e6be ("Btrfs: fix lockdep warning about log_mutex") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()") > bb7ab3b92e46 ("btrfs: Fix misspellings in comments.") > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > d1a5f2b4d8a1 ("block: use DAX for partition table reads") > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > > > NOTE: The patch will not be queued to stable trees until it is upstream. > > How should we proceed with this patch? After it lands on Linus' tree, I'll try to send patch versions that apply to different kernel releases. Thanks. > > -- > Thanks, > Sasha -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 7:30 ` Sasha Levin 2019-09-12 8:02 ` Filipe Manana @ 2019-09-16 12:59 ` Filipe Manana 2019-09-16 15:13 ` Greg KH 1 sibling, 1 reply; 14+ messages in thread From: Filipe Manana @ 2019-09-16 12:59 UTC (permalink / raw) To: Sasha Levin; +Cc: Filipe Manana, linux-btrfs, stable [-- Attachment #1: Type: text/plain, Size: 4148 bytes --] On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote: > > Hi, > > [This is an automated email] > > This commit has been processed because it contains a -stable tag. > The stable tag indicates that it's relevant for the following trees: 4.4+ > > The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192. > > v5.2.14: Build OK! > v4.19.72: Failed to apply! Possible dependencies: > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > v4.14.143: Failed to apply! Possible dependencies: > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > 8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > v4.9.192: Failed to apply! Possible dependencies: > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info") > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere") > > v4.4.192: Failed to apply! Possible dependencies: > 0132761017e0 ("btrfs: fix string and comment grammatical issues and typos") > 09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros") > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > 0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations") > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > 52db400fcd50 ("pmem, dax: clean up clear_pmem()") > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > 781feef7e6be ("Btrfs: fix lockdep warning about log_mutex") > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()") > bb7ab3b92e46 ("btrfs: Fix misspellings in comments.") > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > d1a5f2b4d8a1 ("block: use DAX for partition table reads") > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > > > NOTE: The patch will not be queued to stable trees until it is upstream. > > How should we proceed with this patch? So here follows, as attachments, patches for the following stable versions: 4.4.193, 4.9.193, 4.14.144 and 4.19.73. Thanks. > > -- > Thanks, > Sasha -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” [-- Attachment #2: 4.4.193-Btrfs-fix-assertion-failure-during-fsync-and-use-of-.patch --] [-- Type: text/x-patch, Size: 7226 bytes --] From 842266d7437f7714a67ddb34680546f0268f27c9 Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana@suse.com> Date: Tue, 10 Sep 2019 15:26:49 +0100 Subject: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> --- fs/btrfs/tree-log.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 57a46093656a..f9c3907bf159 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5133,7 +5133,7 @@ process_leaf: } if (btrfs_inode_in_log(di_inode, trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); continue; } @@ -5143,7 +5143,7 @@ process_leaf: btrfs_release_path(path); ret = btrfs_log_inode(trans, root, di_inode, log_mode, 0, LLONG_MAX, ctx); - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5281,7 +5281,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, root, dir_inode, LOG_INODE_ALL, 0, LLONG_MAX, ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } -- 2.16.4 [-- Attachment #3: 4.9.193-Btrfs-fix-assertion-failure-during-fsync-and-use-of-.patch --] [-- Type: text/x-patch, Size: 7626 bytes --] From ba94b074ad2b346f59a43fb62e908e4f8bb943e6 Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana@suse.com> Date: Tue, 10 Sep 2019 15:26:49 +0100 Subject: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> --- fs/btrfs/tree-log.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 02bb7b52cb36..65e1eaa5df84 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4846,7 +4846,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, err = btrfs_log_inode(trans, root, other_inode, LOG_OTHER_INODE, 0, LLONG_MAX, ctx); - iput(other_inode); + btrfs_add_delayed_iput(other_inode); if (err) goto out_unlock; else @@ -5264,7 +5264,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, } if (btrfs_inode_in_log(di_inode, trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); break; } @@ -5276,7 +5276,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, if (!ret && btrfs_must_commit_transaction(trans, di_inode)) ret = 1; - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5422,7 +5422,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, dir_inode, ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } -- 2.16.4 [-- Attachment #4: 4.14.144-Btrfs-fix-assertion-failure-during-fsync-and-use-of-.patch --] [-- Type: text/x-patch, Size: 7616 bytes --] From ca5970ba21c15efd6f1e2a66ca3cd1e7c3332761 Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana@suse.com> Date: Tue, 10 Sep 2019 15:26:49 +0100 Subject: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> --- fs/btrfs/tree-log.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index bfd7c89c8d92..f8ef2e3fbf63 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5122,7 +5122,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, BTRFS_I(other_inode), LOG_OTHER_INODE, 0, LLONG_MAX, ctx); - iput(other_inode); + btrfs_add_delayed_iput(other_inode); if (err) goto out_unlock; else @@ -5539,7 +5539,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, } if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); break; } @@ -5551,7 +5551,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, if (!ret && btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) ret = 1; - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5698,7 +5698,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, BTRFS_I(dir_inode), ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } -- 2.16.4 [-- Attachment #5: 4.19.73-Btrfs-fix-assertion-failure-during-fsync-and-use-of-.patch --] [-- Type: text/x-patch, Size: 7616 bytes --] From 958cb34cec8138e2adf330ebcbfbff19639a8981 Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana@suse.com> Date: Tue, 10 Sep 2019 15:26:49 +0100 Subject: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> --- fs/btrfs/tree-log.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 08c5afa06aee..e561eb475339 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5107,7 +5107,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, BTRFS_I(other_inode), LOG_OTHER_INODE, 0, LLONG_MAX, ctx); - iput(other_inode); + btrfs_add_delayed_iput(other_inode); if (err) goto out_unlock; else @@ -5519,7 +5519,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, } if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); break; } @@ -5531,7 +5531,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, if (!ret && btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) ret = 1; - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5678,7 +5678,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, BTRFS_I(dir_inode), ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } -- 2.16.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-16 12:59 ` Filipe Manana @ 2019-09-16 15:13 ` Greg KH 0 siblings, 0 replies; 14+ messages in thread From: Greg KH @ 2019-09-16 15:13 UTC (permalink / raw) To: Filipe Manana; +Cc: Sasha Levin, Filipe Manana, linux-btrfs, stable On Mon, Sep 16, 2019 at 01:59:47PM +0100, Filipe Manana wrote: > On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote: > > > > Hi, > > > > [This is an automated email] > > > > This commit has been processed because it contains a -stable tag. > > The stable tag indicates that it's relevant for the following trees: 4.4+ > > > > The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192. > > > > v5.2.14: Build OK! > > v4.19.72: Failed to apply! Possible dependencies: > > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > > > v4.14.143: Failed to apply! Possible dependencies: > > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > > 8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks") > > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > > b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") > > > > v4.9.192: Failed to apply! Possible dependencies: > > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info") > > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > > fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere") > > > > v4.4.192: Failed to apply! Possible dependencies: > > 0132761017e0 ("btrfs: fix string and comment grammatical issues and typos") > > 09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros") > > 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables") > > 0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation") > > 0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations") > > 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination") > > 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") > > 4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode") > > 52db400fcd50 ("pmem, dax: clean up clear_pmem()") > > 6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files") > > 781feef7e6be ("Btrfs: fix lockdep warning about log_mutex") > > a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") > > b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()") > > bb7ab3b92e46 ("btrfs: Fix misspellings in comments.") > > cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") > > d1a5f2b4d8a1 ("block: use DAX for partition table reads") > > db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode") > > de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info") > > > > > > NOTE: The patch will not be queued to stable trees until it is upstream. > > > > How should we proceed with this patch? > > So here follows, as attachments, patches for the following stable > versions: 4.4.193, 4.9.193, 4.14.144 and 4.19.73. Thanks for these backports, now queued up! greg k-h ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana 2019-09-11 16:04 ` Josef Bacik 2019-09-12 7:30 ` Sasha Levin @ 2019-09-12 11:10 ` Nikolay Borisov 2019-09-12 11:24 ` Filipe Manana 2019-09-12 11:58 ` Josef Bacik 2019-09-12 12:18 ` Josef Bacik 4 siblings, 1 reply; 14+ messages in thread From: Nikolay Borisov @ 2019-09-12 11:10 UTC (permalink / raw) To: fdmanana, linux-btrfs On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Sometimes when fsync'ing a file we need to log that other inodes exist and > when we need to do that we acquire a reference on the inodes and then drop > that reference using iput() after logging them. > > That generally is not a problem except if we end up doing the final iput() > (dropping the last reference) on the inode and that inode has a link count > of 0, which can happen in a very short time window if the logging path > gets a reference on the inode while it's being unlinked. > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > invoked through the iput() call chain which needs to drop all of the > inode's items from its subvolume btree, and in order to do that, it needs > to join a transaction at the helper function evict_refill_and_join(). > However because the task previously started a transaction at the fsync > handler, btrfs_sync_file(), it has current->journal_info already pointing > to a transaction handle and therefore evict_refill_and_join() will get > that transaction handle from btrfs_join_transaction(). From this point on, > two different problems can happen: > > 1) evict_refill_and_join() will often change the transaction handle's > block reserve (->block_rsv) and set its ->bytes_reserved field to a > value greater than 0. If evict_refill_and_join() never commits the > transaction, the eviction handler ends up decreasing the reference > count (->use_count) of the transaction handle through the call to > btrfs_end_transaction(), and after that point we have a transaction > handle with a NULL ->block_rsv (which is the value prior to the > transaction join from evict_refill_and_join()) and a ->bytes_reserved > value greater than 0. If after the eviction/iput completes the inode > logging path hits an error or it decides that it must fallback to a > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > non-zero value from btrfs_log_dentry_safe(), and because of that > non-zero value it tries to commit the transaction using a handle with > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > the transaction commit hit an assertion failure at > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > the ->block_rsv is NULL. The produced stack trace for that is like the > following: > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > [192922.917553] ------------[ cut here ]------------ > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > (...) > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [192922.925105] Call Trace: > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > [192922.926731] do_fsync+0x38/0x60 > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > [192922.927543] do_syscall_64+0x60/0x1c0 > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > (...) > [192922.934077] ---[ end trace f00808b12068168f ]--- > > 2) If evict_refill_and_join() decides to commit the transaction, it will evict_refill_and_join only ever calls btrfs_join_transaction so it cannot ever commit the transaction. Let's look into its sole caller, btrfs_evict_inode, I see 3 cases where transactions handle is involved in that function: 1. btrfs_commit_inode_delayed_inode - iot calls trans_join + end_transaction => use_count = 2 -> no commit 2. btrfs_truncate_inode_items -> never calls end/commit transaction 3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a handle that's obtained via btrfs_join_transaction, meaning that trans handle's ->use_count will be 2. This in turn will hit: if (refcount_read(&trans->use_count) > 1) { refcount_dec(&trans->use_count); trans->block_rsv = trans->orig_rsv; return 0; } in __btrfs_end_transaction. That code is in direct contradiction with what you describe next? Am I missing something here? If my analysis is correct this implies case 1) above also cannot happen because trans->block_rsv is set to NULL in __btrfs_end_transaction only if use_count == 1. > be able to do it, since the nested transaction join only increments the > transaction handle's ->use_count reference counter and it does not > prevent the transaction from getting committed. This means that after > eviction completes, the fsync logging path will be using a transaction > handle that refers to an already committed transaction. What happens > when using such a stale transaction can be unpredictable, we are at > least having a use-after-free on the transaction handle itself, since > the transaction commit will call kmem_cache_free() against the handle > regardless of its ->use_count value, or we can end up silently losing > all the updates to the log tree after that iput() in the logging path, > or using a transaction handle that in the meanwhile was allocated to > another task for a new transaction, etc, pretty much unpredictable > what can happen. > > In order to fix both of them, instead of using iput() during logging, use > btrfs_add_delayed_iput(), so that the logging path of fsync never drops > the last reference on an inode, that step is offloaded to a safe context > (usually the cleaner kthread). > > The assertion failure issue was sporadically triggered by the test case > generic/475 from fstests, which loads the dm error target while fsstress > is running, which lead to fsync failing while logging inodes with -EIO > errors and then trying later to commit the transaction, triggering the > assertion failure. > > CC: stable@vger.kernel.org # 4.4+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/tree-log.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index 6c8297bcfeb7..1bfd7e34f31e 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > BTRFS_I(inode), > LOG_OTHER_INODE_ALL, > 0, LLONG_MAX, ctx); > - iput(inode); > + btrfs_add_delayed_iput(inode); > } > } > continue; > @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > ret = btrfs_log_inode(trans, root, BTRFS_I(inode), > LOG_OTHER_INODE, 0, LLONG_MAX, ctx); > if (ret) { > - iput(inode); > + btrfs_add_delayed_iput(inode); > continue; > } > > @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > key.offset = 0; > ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > if (ret < 0) { > - iput(inode); > + btrfs_add_delayed_iput(inode); > continue; > } > > @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > } > path->slots[0]++; > } > - iput(inode); > + btrfs_add_delayed_iput(inode); > } > > return ret; > @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, > } > > if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { > - iput(di_inode); > + btrfs_add_delayed_iput(di_inode); > break; > } > > @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, > if (!ret && > btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) > ret = 1; > - iput(di_inode); > + btrfs_add_delayed_iput(di_inode); > if (ret) > goto next_dir_inode; > if (ctx->log_new_dentries) { > @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, > if (!ret && ctx && ctx->log_new_dentries) > ret = log_new_dir_dentries(trans, root, > BTRFS_I(dir_inode), ctx); > - iput(dir_inode); > + btrfs_add_delayed_iput(dir_inode); > if (ret) > goto out; > } > @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, > ret = btrfs_log_inode(trans, root, BTRFS_I(inode), > LOG_INODE_EXISTS, > 0, LLONG_MAX, ctx); > - iput(inode); > + btrfs_add_delayed_iput(inode); > if (ret) > return ret; > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 11:10 ` Nikolay Borisov @ 2019-09-12 11:24 ` Filipe Manana 2019-09-12 11:35 ` Nikolay Borisov 0 siblings, 1 reply; 14+ messages in thread From: Filipe Manana @ 2019-09-12 11:24 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote: > > > > On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Sometimes when fsync'ing a file we need to log that other inodes exist and > > when we need to do that we acquire a reference on the inodes and then drop > > that reference using iput() after logging them. > > > > That generally is not a problem except if we end up doing the final iput() > > (dropping the last reference) on the inode and that inode has a link count > > of 0, which can happen in a very short time window if the logging path > > gets a reference on the inode while it's being unlinked. > > > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > > invoked through the iput() call chain which needs to drop all of the > > inode's items from its subvolume btree, and in order to do that, it needs > > to join a transaction at the helper function evict_refill_and_join(). > > However because the task previously started a transaction at the fsync > > handler, btrfs_sync_file(), it has current->journal_info already pointing > > to a transaction handle and therefore evict_refill_and_join() will get > > that transaction handle from btrfs_join_transaction(). From this point on, > > two different problems can happen: > > > > 1) evict_refill_and_join() will often change the transaction handle's > > block reserve (->block_rsv) and set its ->bytes_reserved field to a > > value greater than 0. If evict_refill_and_join() never commits the > > transaction, the eviction handler ends up decreasing the reference > > count (->use_count) of the transaction handle through the call to > > btrfs_end_transaction(), and after that point we have a transaction > > handle with a NULL ->block_rsv (which is the value prior to the > > transaction join from evict_refill_and_join()) and a ->bytes_reserved > > value greater than 0. If after the eviction/iput completes the inode > > logging path hits an error or it decides that it must fallback to a > > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > > non-zero value from btrfs_log_dentry_safe(), and because of that > > non-zero value it tries to commit the transaction using a handle with > > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > > the transaction commit hit an assertion failure at > > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > > the ->block_rsv is NULL. The produced stack trace for that is like the > > following: > > > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > > [192922.917553] ------------[ cut here ]------------ > > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > > (...) > > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [192922.925105] Call Trace: > > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > > [192922.926731] do_fsync+0x38/0x60 > > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > > [192922.927543] do_syscall_64+0x60/0x1c0 > > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > (...) > > [192922.934077] ---[ end trace f00808b12068168f ]--- > > > > 2) If evict_refill_and_join() decides to commit the transaction, it will > evict_refill_and_join only ever calls btrfs_join_transaction so it > cannot ever commit the transaction. It can: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399 > Let's look into its sole caller, > btrfs_evict_inode, I see 3 cases where transactions handle is involved > in that function: > > 1. btrfs_commit_inode_delayed_inode - iot calls trans_join + > end_transaction => use_count = 2 -> no commit > > 2. btrfs_truncate_inode_items -> never calls end/commit transaction > > 3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a > handle that's obtained via btrfs_join_transaction, meaning that trans > handle's ->use_count will be 2. This in turn will hit: > > if (refcount_read(&trans->use_count) > 1) { > refcount_dec(&trans->use_count); > > trans->block_rsv = trans->orig_rsv; > > return 0; > > } > > in __btrfs_end_transaction. That code is in direct contradiction with > what you describe next? No, it isn't. I explicitly mention it in point 1): "... the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0." > Am I missing something here? If my analysis is > correct this implies case 1) above also cannot happen because > trans->block_rsv is set to NULL in __btrfs_end_transaction only if > use_count == 1. Again, go read the change log and the code more carefully. Thanks. > > > be able to do it, since the nested transaction join only increments the > > transaction handle's ->use_count reference counter and it does not > > prevent the transaction from getting committed. This means that after > > eviction completes, the fsync logging path will be using a transaction > > handle that refers to an already committed transaction. What happens > > when using such a stale transaction can be unpredictable, we are at > > least having a use-after-free on the transaction handle itself, since > > the transaction commit will call kmem_cache_free() against the handle > > regardless of its ->use_count value, or we can end up silently losing > > all the updates to the log tree after that iput() in the logging path, > > or using a transaction handle that in the meanwhile was allocated to > > another task for a new transaction, etc, pretty much unpredictable > > what can happen. > > > > In order to fix both of them, instead of using iput() during logging, use > > btrfs_add_delayed_iput(), so that the logging path of fsync never drops > > the last reference on an inode, that step is offloaded to a safe context > > (usually the cleaner kthread). > > > > The assertion failure issue was sporadically triggered by the test case > > generic/475 from fstests, which loads the dm error target while fsstress > > is running, which lead to fsync failing while logging inodes with -EIO > > errors and then trying later to commit the transaction, triggering the > > assertion failure. > > > > CC: stable@vger.kernel.org # 4.4+ > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > fs/btrfs/tree-log.c | 16 ++++++++-------- > > 1 file changed, 8 insertions(+), 8 deletions(-) > > > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > > index 6c8297bcfeb7..1bfd7e34f31e 100644 > > --- a/fs/btrfs/tree-log.c > > +++ b/fs/btrfs/tree-log.c > > @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > > BTRFS_I(inode), > > LOG_OTHER_INODE_ALL, > > 0, LLONG_MAX, ctx); > > - iput(inode); > > + btrfs_add_delayed_iput(inode); > > } > > } > > continue; > > @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > > ret = btrfs_log_inode(trans, root, BTRFS_I(inode), > > LOG_OTHER_INODE, 0, LLONG_MAX, ctx); > > if (ret) { > > - iput(inode); > > + btrfs_add_delayed_iput(inode); > > continue; > > } > > > > @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > > key.offset = 0; > > ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > > if (ret < 0) { > > - iput(inode); > > + btrfs_add_delayed_iput(inode); > > continue; > > } > > > > @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, > > } > > path->slots[0]++; > > } > > - iput(inode); > > + btrfs_add_delayed_iput(inode); > > } > > > > return ret; > > @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, > > } > > > > if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { > > - iput(di_inode); > > + btrfs_add_delayed_iput(di_inode); > > break; > > } > > > > @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, > > if (!ret && > > btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) > > ret = 1; > > - iput(di_inode); > > + btrfs_add_delayed_iput(di_inode); > > if (ret) > > goto next_dir_inode; > > if (ctx->log_new_dentries) { > > @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, > > if (!ret && ctx && ctx->log_new_dentries) > > ret = log_new_dir_dentries(trans, root, > > BTRFS_I(dir_inode), ctx); > > - iput(dir_inode); > > + btrfs_add_delayed_iput(dir_inode); > > if (ret) > > goto out; > > } > > @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, > > ret = btrfs_log_inode(trans, root, BTRFS_I(inode), > > LOG_INODE_EXISTS, > > 0, LLONG_MAX, ctx); > > - iput(inode); > > + btrfs_add_delayed_iput(inode); > > if (ret) > > return ret; > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 11:24 ` Filipe Manana @ 2019-09-12 11:35 ` Nikolay Borisov 0 siblings, 0 replies; 14+ messages in thread From: Nikolay Borisov @ 2019-09-12 11:35 UTC (permalink / raw) To: Filipe Manana; +Cc: linux-btrfs On 12.09.19 г. 14:24 ч., Filipe Manana wrote: > On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote: >> >> >> >> On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote: >>> From: Filipe Manana <fdmanana@suse.com> >>> >>> Sometimes when fsync'ing a file we need to log that other inodes exist and >>> when we need to do that we acquire a reference on the inodes and then drop >>> that reference using iput() after logging them. >>> >>> That generally is not a problem except if we end up doing the final iput() >>> (dropping the last reference) on the inode and that inode has a link count >>> of 0, which can happen in a very short time window if the logging path >>> gets a reference on the inode while it's being unlinked. >>> >>> In that case we end up getting the eviction callback, btrfs_evict_inode(), >>> invoked through the iput() call chain which needs to drop all of the >>> inode's items from its subvolume btree, and in order to do that, it needs >>> to join a transaction at the helper function evict_refill_and_join(). >>> However because the task previously started a transaction at the fsync >>> handler, btrfs_sync_file(), it has current->journal_info already pointing >>> to a transaction handle and therefore evict_refill_and_join() will get >>> that transaction handle from btrfs_join_transaction(). From this point on, >>> two different problems can happen: >>> >>> 1) evict_refill_and_join() will often change the transaction handle's >>> block reserve (->block_rsv) and set its ->bytes_reserved field to a >>> value greater than 0. If evict_refill_and_join() never commits the >>> transaction, the eviction handler ends up decreasing the reference >>> count (->use_count) of the transaction handle through the call to >>> btrfs_end_transaction(), and after that point we have a transaction >>> handle with a NULL ->block_rsv (which is the value prior to the >>> transaction join from evict_refill_and_join()) and a ->bytes_reserved >>> value greater than 0. If after the eviction/iput completes the inode >>> logging path hits an error or it decides that it must fallback to a >>> transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a >>> non-zero value from btrfs_log_dentry_safe(), and because of that >>> non-zero value it tries to commit the transaction using a handle with >>> a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes >>> the transaction commit hit an assertion failure at >>> btrfs_trans_release_metadata() because ->bytes_reserved is not zero but >>> the ->block_rsv is NULL. The produced stack trace for that is like the >>> following: >>> >>> [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 >>> [192922.917553] ------------[ cut here ]------------ >>> [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! >>> [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI >>> [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 >>> [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 >>> [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] >>> (...) >>> [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 >>> [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 >>> [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 >>> [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 >>> [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 >>> [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 >>> [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 >>> [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 >>> [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [192922.925105] Call Trace: >>> [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] >>> [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] >>> [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] >>> [192922.926731] do_fsync+0x38/0x60 >>> [192922.927138] __x64_sys_fdatasync+0x13/0x20 >>> [192922.927543] do_syscall_64+0x60/0x1c0 >>> [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe >>> (...) >>> [192922.934077] ---[ end trace f00808b12068168f ]--- >>> >>> 2) If evict_refill_and_join() decides to commit the transaction, it will >> evict_refill_and_join only ever calls btrfs_join_transaction so it >> cannot ever commit the transaction. > > It can: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399 Right, in 5.3 branch yes but in misc-next evict_and_join looks completely different, that's where the confusion stems from. Thanks <snip> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana ` (2 preceding siblings ...) 2019-09-12 11:10 ` Nikolay Borisov @ 2019-09-12 11:58 ` Josef Bacik 2019-09-12 13:17 ` Filipe Manana 2019-09-12 12:18 ` Josef Bacik 4 siblings, 1 reply; 14+ messages in thread From: Josef Bacik @ 2019-09-12 11:58 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Sometimes when fsync'ing a file we need to log that other inodes exist and > when we need to do that we acquire a reference on the inodes and then drop > that reference using iput() after logging them. > > That generally is not a problem except if we end up doing the final iput() > (dropping the last reference) on the inode and that inode has a link count > of 0, which can happen in a very short time window if the logging path > gets a reference on the inode while it's being unlinked. > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > invoked through the iput() call chain which needs to drop all of the > inode's items from its subvolume btree, and in order to do that, it needs > to join a transaction at the helper function evict_refill_and_join(). > However because the task previously started a transaction at the fsync > handler, btrfs_sync_file(), it has current->journal_info already pointing > to a transaction handle and therefore evict_refill_and_join() will get > that transaction handle from btrfs_join_transaction(). From this point on, > two different problems can happen: > > 1) evict_refill_and_join() will often change the transaction handle's > block reserve (->block_rsv) and set its ->bytes_reserved field to a > value greater than 0. If evict_refill_and_join() never commits the > transaction, the eviction handler ends up decreasing the reference > count (->use_count) of the transaction handle through the call to > btrfs_end_transaction(), and after that point we have a transaction > handle with a NULL ->block_rsv (which is the value prior to the > transaction join from evict_refill_and_join()) and a ->bytes_reserved > value greater than 0. If after the eviction/iput completes the inode > logging path hits an error or it decides that it must fallback to a > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > non-zero value from btrfs_log_dentry_safe(), and because of that > non-zero value it tries to commit the transaction using a handle with > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > the transaction commit hit an assertion failure at > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > the ->block_rsv is NULL. The produced stack trace for that is like the > following: > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > [192922.917553] ------------[ cut here ]------------ > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > (...) > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [192922.925105] Call Trace: > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > [192922.926731] do_fsync+0x38/0x60 > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > [192922.927543] do_syscall_64+0x60/0x1c0 > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > (...) > [192922.934077] ---[ end trace f00808b12068168f ]--- > > 2) If evict_refill_and_join() decides to commit the transaction, it will > be able to do it, since the nested transaction join only increments the > transaction handle's ->use_count reference counter and it does not > prevent the transaction from getting committed. This means that after This brings up a good point, we should probably not allow the commit in this case, or add an ASSERT(use_count == 1) or something, cause this would be bad. Thanks, Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 11:58 ` Josef Bacik @ 2019-09-12 13:17 ` Filipe Manana 0 siblings, 0 replies; 14+ messages in thread From: Filipe Manana @ 2019-09-12 13:17 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Thu, Sep 12, 2019 at 12:59 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Sometimes when fsync'ing a file we need to log that other inodes exist and > > when we need to do that we acquire a reference on the inodes and then drop > > that reference using iput() after logging them. > > > > That generally is not a problem except if we end up doing the final iput() > > (dropping the last reference) on the inode and that inode has a link count > > of 0, which can happen in a very short time window if the logging path > > gets a reference on the inode while it's being unlinked. > > > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > > invoked through the iput() call chain which needs to drop all of the > > inode's items from its subvolume btree, and in order to do that, it needs > > to join a transaction at the helper function evict_refill_and_join(). > > However because the task previously started a transaction at the fsync > > handler, btrfs_sync_file(), it has current->journal_info already pointing > > to a transaction handle and therefore evict_refill_and_join() will get > > that transaction handle from btrfs_join_transaction(). From this point on, > > two different problems can happen: > > > > 1) evict_refill_and_join() will often change the transaction handle's > > block reserve (->block_rsv) and set its ->bytes_reserved field to a > > value greater than 0. If evict_refill_and_join() never commits the > > transaction, the eviction handler ends up decreasing the reference > > count (->use_count) of the transaction handle through the call to > > btrfs_end_transaction(), and after that point we have a transaction > > handle with a NULL ->block_rsv (which is the value prior to the > > transaction join from evict_refill_and_join()) and a ->bytes_reserved > > value greater than 0. If after the eviction/iput completes the inode > > logging path hits an error or it decides that it must fallback to a > > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > > non-zero value from btrfs_log_dentry_safe(), and because of that > > non-zero value it tries to commit the transaction using a handle with > > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > > the transaction commit hit an assertion failure at > > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > > the ->block_rsv is NULL. The produced stack trace for that is like the > > following: > > > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > > [192922.917553] ------------[ cut here ]------------ > > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > > (...) > > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [192922.925105] Call Trace: > > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > > [192922.926731] do_fsync+0x38/0x60 > > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > > [192922.927543] do_syscall_64+0x60/0x1c0 > > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > (...) > > [192922.934077] ---[ end trace f00808b12068168f ]--- > > > > 2) If evict_refill_and_join() decides to commit the transaction, it will > > be able to do it, since the nested transaction join only increments the > > transaction handle's ->use_count reference counter and it does not > > prevent the transaction from getting committed. This means that after > > This brings up a good point, we should probably not allow the commit in this > case, or add an ASSERT(use_count == 1) or something, cause this would be bad. > Thanks, Yes, that should be separate change however. I had warn_on(trans->use_count > 1) in the commit locally during testing (didn't trigger in 4 days of fstests). Thanks. > > Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana ` (3 preceding siblings ...) 2019-09-12 11:58 ` Josef Bacik @ 2019-09-12 12:18 ` Josef Bacik 2019-09-12 13:19 ` Filipe Manana 4 siblings, 1 reply; 14+ messages in thread From: Josef Bacik @ 2019-09-12 12:18 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Sometimes when fsync'ing a file we need to log that other inodes exist and > when we need to do that we acquire a reference on the inodes and then drop > that reference using iput() after logging them. > > That generally is not a problem except if we end up doing the final iput() > (dropping the last reference) on the inode and that inode has a link count > of 0, which can happen in a very short time window if the logging path > gets a reference on the inode while it's being unlinked. > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > invoked through the iput() call chain which needs to drop all of the > inode's items from its subvolume btree, and in order to do that, it needs > to join a transaction at the helper function evict_refill_and_join(). > However because the task previously started a transaction at the fsync > handler, btrfs_sync_file(), it has current->journal_info already pointing > to a transaction handle and therefore evict_refill_and_join() will get > that transaction handle from btrfs_join_transaction(). From this point on, > two different problems can happen: > > 1) evict_refill_and_join() will often change the transaction handle's > block reserve (->block_rsv) and set its ->bytes_reserved field to a > value greater than 0. If evict_refill_and_join() never commits the > transaction, the eviction handler ends up decreasing the reference > count (->use_count) of the transaction handle through the call to > btrfs_end_transaction(), and after that point we have a transaction > handle with a NULL ->block_rsv (which is the value prior to the > transaction join from evict_refill_and_join()) and a ->bytes_reserved > value greater than 0. If after the eviction/iput completes the inode > logging path hits an error or it decides that it must fallback to a > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > non-zero value from btrfs_log_dentry_safe(), and because of that > non-zero value it tries to commit the transaction using a handle with > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > the transaction commit hit an assertion failure at > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > the ->block_rsv is NULL. The produced stack trace for that is like the > following: > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > [192922.917553] ------------[ cut here ]------------ > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > (...) > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [192922.925105] Call Trace: > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > [192922.926731] do_fsync+0x38/0x60 > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > [192922.927543] do_syscall_64+0x60/0x1c0 > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > (...) > [192922.934077] ---[ end trace f00808b12068168f ]--- > > 2) If evict_refill_and_join() decides to commit the transaction, it will > be able to do it, since the nested transaction join only increments the > transaction handle's ->use_count reference counter and it does not > prevent the transaction from getting committed. This means that after > eviction completes, the fsync logging path will be using a transaction > handle that refers to an already committed transaction. What happens > when using such a stale transaction can be unpredictable, we are at > least having a use-after-free on the transaction handle itself, since > the transaction commit will call kmem_cache_free() against the handle > regardless of its ->use_count value, or we can end up silently losing > all the updates to the log tree after that iput() in the logging path, > or using a transaction handle that in the meanwhile was allocated to > another task for a new transaction, etc, pretty much unpredictable > what can happen. > And talking it over with Nikolay I realized that since we're doing the commit through the flushing this doesn't actually happen anymore. may_commit_transaction() returns -EGAIN if we already have a trans handle open, so hooray I made it safer by accident! But we definitely should follow up and add an assert in btrfs_commit_transaction() to catch this case, cause holy shit that's bad. Thanks, Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 12:18 ` Josef Bacik @ 2019-09-12 13:19 ` Filipe Manana 2019-09-12 15:14 ` Josef Bacik 0 siblings, 1 reply; 14+ messages in thread From: Filipe Manana @ 2019-09-12 13:19 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Sometimes when fsync'ing a file we need to log that other inodes exist and > > when we need to do that we acquire a reference on the inodes and then drop > > that reference using iput() after logging them. > > > > That generally is not a problem except if we end up doing the final iput() > > (dropping the last reference) on the inode and that inode has a link count > > of 0, which can happen in a very short time window if the logging path > > gets a reference on the inode while it's being unlinked. > > > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > > invoked through the iput() call chain which needs to drop all of the > > inode's items from its subvolume btree, and in order to do that, it needs > > to join a transaction at the helper function evict_refill_and_join(). > > However because the task previously started a transaction at the fsync > > handler, btrfs_sync_file(), it has current->journal_info already pointing > > to a transaction handle and therefore evict_refill_and_join() will get > > that transaction handle from btrfs_join_transaction(). From this point on, > > two different problems can happen: > > > > 1) evict_refill_and_join() will often change the transaction handle's > > block reserve (->block_rsv) and set its ->bytes_reserved field to a > > value greater than 0. If evict_refill_and_join() never commits the > > transaction, the eviction handler ends up decreasing the reference > > count (->use_count) of the transaction handle through the call to > > btrfs_end_transaction(), and after that point we have a transaction > > handle with a NULL ->block_rsv (which is the value prior to the > > transaction join from evict_refill_and_join()) and a ->bytes_reserved > > value greater than 0. If after the eviction/iput completes the inode > > logging path hits an error or it decides that it must fallback to a > > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > > non-zero value from btrfs_log_dentry_safe(), and because of that > > non-zero value it tries to commit the transaction using a handle with > > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > > the transaction commit hit an assertion failure at > > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > > the ->block_rsv is NULL. The produced stack trace for that is like the > > following: > > > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > > [192922.917553] ------------[ cut here ]------------ > > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > > (...) > > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [192922.925105] Call Trace: > > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > > [192922.926731] do_fsync+0x38/0x60 > > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > > [192922.927543] do_syscall_64+0x60/0x1c0 > > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > (...) > > [192922.934077] ---[ end trace f00808b12068168f ]--- > > > > 2) If evict_refill_and_join() decides to commit the transaction, it will > > be able to do it, since the nested transaction join only increments the > > transaction handle's ->use_count reference counter and it does not > > prevent the transaction from getting committed. This means that after > > eviction completes, the fsync logging path will be using a transaction > > handle that refers to an already committed transaction. What happens > > when using such a stale transaction can be unpredictable, we are at > > least having a use-after-free on the transaction handle itself, since > > the transaction commit will call kmem_cache_free() against the handle > > regardless of its ->use_count value, or we can end up silently losing > > all the updates to the log tree after that iput() in the logging path, > > or using a transaction handle that in the meanwhile was allocated to > > another task for a new transaction, etc, pretty much unpredictable > > what can happen. > > > > And talking it over with Nikolay I realized that since we're doing the commit > through the flushing this doesn't actually happen anymore. > may_commit_transaction() returns -EGAIN if we already have a trans handle open, > so hooray I made it safer by accident! But we definitely should follow up and > add an assert in btrfs_commit_transaction() to catch this case, cause holy shit > that's bad. Thanks, Yep, but problem 1) (assertion failure) is still valid after your recent changes in misc-next. Either way, this fix has to be added to stable releases to fix both problems there. Thanks. > > Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction 2019-09-12 13:19 ` Filipe Manana @ 2019-09-12 15:14 ` Josef Bacik 0 siblings, 0 replies; 14+ messages in thread From: Josef Bacik @ 2019-09-12 15:14 UTC (permalink / raw) To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs On Thu, Sep 12, 2019 at 02:19:55PM +0100, Filipe Manana wrote: > On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote: > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > Sometimes when fsync'ing a file we need to log that other inodes exist and > > > when we need to do that we acquire a reference on the inodes and then drop > > > that reference using iput() after logging them. > > > > > > That generally is not a problem except if we end up doing the final iput() > > > (dropping the last reference) on the inode and that inode has a link count > > > of 0, which can happen in a very short time window if the logging path > > > gets a reference on the inode while it's being unlinked. > > > > > > In that case we end up getting the eviction callback, btrfs_evict_inode(), > > > invoked through the iput() call chain which needs to drop all of the > > > inode's items from its subvolume btree, and in order to do that, it needs > > > to join a transaction at the helper function evict_refill_and_join(). > > > However because the task previously started a transaction at the fsync > > > handler, btrfs_sync_file(), it has current->journal_info already pointing > > > to a transaction handle and therefore evict_refill_and_join() will get > > > that transaction handle from btrfs_join_transaction(). From this point on, > > > two different problems can happen: > > > > > > 1) evict_refill_and_join() will often change the transaction handle's > > > block reserve (->block_rsv) and set its ->bytes_reserved field to a > > > value greater than 0. If evict_refill_and_join() never commits the > > > transaction, the eviction handler ends up decreasing the reference > > > count (->use_count) of the transaction handle through the call to > > > btrfs_end_transaction(), and after that point we have a transaction > > > handle with a NULL ->block_rsv (which is the value prior to the > > > transaction join from evict_refill_and_join()) and a ->bytes_reserved > > > value greater than 0. If after the eviction/iput completes the inode > > > logging path hits an error or it decides that it must fallback to a > > > transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a > > > non-zero value from btrfs_log_dentry_safe(), and because of that > > > non-zero value it tries to commit the transaction using a handle with > > > a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes > > > the transaction commit hit an assertion failure at > > > btrfs_trans_release_metadata() because ->bytes_reserved is not zero but > > > the ->block_rsv is NULL. The produced stack trace for that is like the > > > following: > > > > > > [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 > > > [192922.917553] ------------[ cut here ]------------ > > > [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! > > > [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI > > > [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 > > > [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 > > > [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] > > > (...) > > > [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 > > > [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 > > > [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 > > > [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 > > > [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 > > > [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 > > > [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 > > > [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 > > > [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > [192922.925105] Call Trace: > > > [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] > > > [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] > > > [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] > > > [192922.926731] do_fsync+0x38/0x60 > > > [192922.927138] __x64_sys_fdatasync+0x13/0x20 > > > [192922.927543] do_syscall_64+0x60/0x1c0 > > > [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > (...) > > > [192922.934077] ---[ end trace f00808b12068168f ]--- > > > > > > 2) If evict_refill_and_join() decides to commit the transaction, it will > > > be able to do it, since the nested transaction join only increments the > > > transaction handle's ->use_count reference counter and it does not > > > prevent the transaction from getting committed. This means that after > > > eviction completes, the fsync logging path will be using a transaction > > > handle that refers to an already committed transaction. What happens > > > when using such a stale transaction can be unpredictable, we are at > > > least having a use-after-free on the transaction handle itself, since > > > the transaction commit will call kmem_cache_free() against the handle > > > regardless of its ->use_count value, or we can end up silently losing > > > all the updates to the log tree after that iput() in the logging path, > > > or using a transaction handle that in the meanwhile was allocated to > > > another task for a new transaction, etc, pretty much unpredictable > > > what can happen. > > > > > > > And talking it over with Nikolay I realized that since we're doing the commit > > through the flushing this doesn't actually happen anymore. > > may_commit_transaction() returns -EGAIN if we already have a trans handle open, > > so hooray I made it safer by accident! But we definitely should follow up and > > add an assert in btrfs_commit_transaction() to catch this case, cause holy shit > > that's bad. Thanks, > > Yep, but problem 1) (assertion failure) is still valid after your > recent changes in misc-next. > Either way, this fix has to be added to stable releases to fix both > problems there. > Agreed, just adding context for the current changes. This patch is fine as it is, just some follow up stuff to catch transaction commits under nested trans handles. Thanks, Josef ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2019-09-16 15:13 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-09-10 14:26 [PATCH] Btrfs: fix assertion failure during fsync and use of stale transaction fdmanana 2019-09-11 16:04 ` Josef Bacik 2019-09-12 7:30 ` Sasha Levin 2019-09-12 8:02 ` Filipe Manana 2019-09-16 12:59 ` Filipe Manana 2019-09-16 15:13 ` Greg KH 2019-09-12 11:10 ` Nikolay Borisov 2019-09-12 11:24 ` Filipe Manana 2019-09-12 11:35 ` Nikolay Borisov 2019-09-12 11:58 ` Josef Bacik 2019-09-12 13:17 ` Filipe Manana 2019-09-12 12:18 ` Josef Bacik 2019-09-12 13:19 ` Filipe Manana 2019-09-12 15:14 ` Josef Bacik
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).