On Wed, Apr 17, 2019 at 09:22:44AM +0000, Filipe Manana wrote: > On Wed, Apr 17, 2019 at 1:08 AM Zygo Blaxell > wrote: > > > > On Mon, Apr 15, 2019 at 09:29:00AM +0100, fdmanana@kernel.org wrote: > > > From: Filipe Manana > > > > > > During fiemap, for regular extents (non inline) we need to check if they > > > are shared and if they are, set the shared bit. Checking if an extent is > > > shared requires checking the delayed references of the currently running > > > transaction, since some reference might have not yet hit the extent tree > > > and be only in the in-memory delayed references. > > > > > > However we were using a transaction join for this, which creates a new > > > transaction when there is no transaction currently running. That means > > > that two more potential failures can happen: creating the transaction and > > > committing it. Further, if no write activity is currently happening in the > > > system, and fiemap calls keep being done, we end up creating and > > > committing transactions that do nothing. > > > > Cool! > > > > Any chance we can do this for the LOGICAL_INO ioctl? Last time I checked > > (I admit that was a while ago), LOGICAL_INO cost about the same as > > fsync(), because it creates transactions with btrfs_join_transaction a > > few levels deep in the call stack, and gets blocked waiting for them > > to complete. > > So iterate_extent_inodes() should definitly use the attach API and not join for > the same reasons. And I can fix that separately. Uhhh...should it? I thought there were some callers of iterate_extent_inodes that definitely do need up-to-date data from the current transaction for their iterator function...like maybe record_extent_backrefs? Then again I'm not sure what record_extent_backrefs actually does. > > It looks like btrfs_ioctl_logical_to_ino can set: > > > > path->search_commit_root = 1; > > > > just before calling iterate_inodes_from_logical, but I tried that once, > > and my test filesystem blew up a few days later, so there might be > > something subtle that I missed. Or maybe my test filesystem was going > > to blow up that day anyway--I just assumed that I don't know what I'm > > doing, and didn't repeat the test. > > Blew up? That's a very vague term... What do you mean with it? What > happened exactly? Every few dozen hours of uptime with that change in the test kernel, btrfs was forcing the filesystem to remount RO. After a week of that, there were parent transid verify failures, and my test filesystem stopped being mountable. > Setting search_commit_root should not cause any problems, the only > thing that can happen is giving you non-accurate results > because there might be relevant changes in the current transaction and > therefore they are not visible yet from the commit roots. That's what I thought too, but my one data point said otherwise. I can repeat the test if that will be helpful. > > bees spends about 30% of its time stuck in LOGICAL_INO with a stack > > trace like this: > > > > [] btrfs_async_run_delayed_refs+0x118/0x140 > > [] __btrfs_end_transaction+0x1da/0x2e0 > > [] iterate_extent_inodes+0x159/0x3a0 > > [] iterate_inodes_from_logical+0x96/0xb0 > > [] btrfs_ioctl_logical_to_ino+0xe8/0x190 > > [] btrfs_ioctl+0xc5d/0x26a0 > > [] do_vfs_ioctl+0x92/0x6b0 > > [] SyS_ioctl+0x74/0x80 > > [] do_syscall_64+0x76/0x180 > > [] entry_SYSCALL_64_after_hwframe+0x42/0xb7 > > [] 0xffffffffffffffff > > > > so if it's at all possible to do LOGICAL_INO without joining a > > transaction, the benefits should be significant. > > So that is not the same problem. That means that some other task has > already started a transaction, the transaction > was not created by the logical_ino ioctl. It just happened that > logical_ino ended up flushing delayed refs, created by some other > task(s), > when releasing its transaction handle. There is also a lot of time without the delayed_refs call on 4.14 (the top of the stack is __btrfs_end_transaction). > We don't flush delayed refs when releasing a transaction handle > anymore, as of commit: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=db2462a6ad3dc490ac33250042e728226ef3ba00 On 5.0.6 (which has that commit), btrfs_ioctl_logical_to_ino is still floating between 30% and 40% of the total bees runtime, with a trace like: [<0>] wait_current_trans+0xc3/0x100 [<0>] start_transaction+0x311/0x4d0 [<0>] iterate_extent_inodes+0x96/0x3c0 [<0>] iterate_inodes_from_logical+0xa6/0xd0 [<0>] btrfs_ioctl_logical_to_ino+0xe8/0x190 [<0>] btrfs_ioctl+0xcf7/0x2cb0 [<0>] do_vfs_ioctl+0xa2/0x6f0 [<0>] ksys_ioctl+0x70/0x80 [<0>] __x64_sys_ioctl+0x16/0x20 [<0>] do_syscall_64+0x60/0x190 [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<0>] 0xffffffffffffffff > > > In some extreme cases this can result in the commit of the transaction > > > created by fiemap to fail with ENOSPC when updating the root item of a > > > subvolume tree because a join does not reserve any space, leading to a > > > trace like the following: > > > > > > heisenberg kernel: ------------[ cut here ]------------ > > > heisenberg kernel: BTRFS: Transaction aborted (error -28) > > > heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] > > > (...) > > > heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 > > > heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 > > > heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] > > > (...) > > > heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 > > > heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 > > > heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 > > > heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 > > > heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 > > > heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 > > > heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 > > > heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 > > > heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > heisenberg kernel: Call Trace: > > > heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] > > > heisenberg kernel: ? _cond_resched+0x15/0x30 > > > heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] > > > heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] > > > heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] > > > heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] > > > heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] > > > heisenberg kernel: kthread+0x112/0x130 > > > heisenberg kernel: ? kthread_bind+0x30/0x30 > > > heisenberg kernel: ret_from_fork+0x35/0x40 > > > heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- > > > > > > Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do > > > a transaction join to avoid the overhead of creating a new transaction (if > > > there is currently no running transaction) and introducing a potential > > > point of failure when the new transaction gets committed, instead use a > > > transaction attach to grab a handle for the currently running transaction > > > if any. > > > > > > Reported-by: Christoph Anton Mitterer > > > Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ > > > Fixes: afce772e87c36c ("btrfs: fix check_shared for fiemap ioctl") > > > Signed-off-by: Filipe Manana > > > --- > > > fs/btrfs/backref.c | 11 ++++++++--- > > > 1 file changed, 8 insertions(+), 3 deletions(-) > > > > > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c > > > index 11459fe84a29..876e6bb93797 100644 > > > --- a/fs/btrfs/backref.c > > > +++ b/fs/btrfs/backref.c > > > @@ -1460,8 +1460,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans, > > > * callers (such as fiemap) which want to know whether the extent is > > > * shared but do not need a ref count. > > > * > > > - * This attempts to allocate a transaction in order to account for > > > - * delayed refs, but continues on even when the alloc fails. > > > + * This attempts to attach to the running transaction in order to account for > > > + * delayed refs, but continues on even when no running transaction exists. > > > * > > > * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error. > > > */ > > > @@ -1489,8 +1489,12 @@ int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr) > > > return -ENOMEM; > > > } > > > > > > - trans = btrfs_join_transaction(root); > > > + trans = btrfs_attach_transaction(root); > > > if (IS_ERR(trans)) { > > > + if (PTR_ERR(trans) != -ENOENT) { > > > + ret = PTR_ERR(trans); > > > + goto out; > > > + } > > > trans = NULL; > > > down_read(&fs_info->commit_root_sem); > > > } else { > > > @@ -1523,6 +1527,7 @@ int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr) > > > } else { > > > up_read(&fs_info->commit_root_sem); > > > } > > > +out: > > > ulist_free(tmp); > > > ulist_free(roots); > > > return ret; > > > -- > > > 2.11.0 > > > >