From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:55879 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751999AbcDZTsu (ORCPT ); Tue, 26 Apr 2016 15:48:50 -0400 Subject: Re: [PATCH v2 2/2] Btrfs: don't do unnecessary delalloc flushes when relocating To: Filipe Manana References: <1461685179-3957-1-git-send-email-fdmanana@kernel.org> <1461685179-3957-3-git-send-email-fdmanana@kernel.org> <1ccc296c-0db4-a165-4eb4-0f57e2c4b2a5@fb.com> CC: "linux-btrfs@vger.kernel.org" From: Josef Bacik Message-ID: <614f2c09-8ad7-e74b-b104-f13a80bfc9a9@fb.com> Date: Tue, 26 Apr 2016 15:48:41 -0400 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: On 04/26/2016 12:09 PM, Filipe Manana wrote: > On Tue, Apr 26, 2016 at 5:02 PM, Josef Bacik wrote: >> On 04/26/2016 11:39 AM, fdmanana@kernel.org wrote: >>> >>> From: Filipe Manana >>> >>> Before we start the actual relocation process of a block group, we do >>> calls to flush delalloc of all inodes and then wait for ordered extents >>> to complete. However we do these flush calls just to make sure we don't >>> race with concurrent tasks that have actually already started to run >>> delalloc and have allocated an extent from the block group we want to >>> relocate, right before we set it to readonly mode, but have not yet >>> created the respective ordered extents. The flush calls make us wait >>> for such concurrent tasks because they end up calling >>> filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> >>> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> >>> btrfs_run_delalloc_work()) which ends up serializing us with those tasks >>> due to attempts to lock the same pages (and the delalloc flush procedure >>> calls the allocator and creates the ordered extents before unlocking the >>> pages). >>> >>> These flushing calls not only make us waste time (cpu, IO) but also reduce >>> the chances of writing larger extents (applications might be writing to >>> contiguous ranges and we flush before they finish dirtying the whole >>> ranges). >>> >>> So make sure we don't flush delalloc and just wait for concurrent tasks >>> that have already started flushing delalloc and have allocated an extent >>> from the block group we are about to relocate. >>> >>> This change also ends up fixing a race with direct IO writes that makes >>> relocation not wait for direct IO ordered extents. This race is >>> illustrated by the following diagram: >>> >>> CPU 1 CPU 2 >>> >>> btrfs_relocate_block_group(bg X) >>> >>> starts direct IO write, >>> target inode currently has no >>> ordered extents ongoing nor >>> dirty pages (delalloc regions), >>> therefore the root for our >>> inode >>> is not in the list >>> fs_info->ordered_roots >>> >>> btrfs_direct_IO() >>> __blockdev_direct_IO() >>> btrfs_get_blocks_direct() >>> >>> btrfs_lock_extent_direct() >>> locks range in the io >>> tree >>> btrfs_new_extent_direct() >>> btrfs_reserve_extent() >>> --> extent allocated >>> from bg X >>> >>> btrfs_inc_block_group_ro(bg X) >>> >>> btrfs_start_delalloc_roots() >>> __start_delalloc_inodes() >>> --> does nothing, no dealloc ranges >>> in the inode's io tree so the >>> inode's root is not in the list >>> fs_info->delalloc_roots >>> >>> btrfs_wait_ordered_roots() >>> --> does not find the inode's root in the >>> list fs_info->ordered_roots >>> >>> --> ends up not waiting for the direct IO >>> write started by the task at CPU 2 >>> >>> relocate_block_group(rc->stage == >>> MOVE_DATA_EXTENTS) >>> >>> prepare_to_relocate() >>> btrfs_commit_transaction() >>> >>> iterates the extent tree, using its >>> commit root and moves extents into new >>> locations >>> >>> >>> btrfs_add_ordered_extent_dio() >>> --> now a ordered >>> extent is >>> created and added >>> to the >>> list >>> root->ordered_extents >>> and the root >>> added to the >>> list >>> fs_info->ordered_roots >>> --> this is too late >>> and the >>> task at CPU 1 >>> already >>> started the >>> relocation >>> >>> btrfs_commit_transaction() >>> >>> >>> btrfs_finish_ordered_io() >>> >>> btrfs_alloc_reserved_file_extent() >>> --> adds delayed >>> data reference >>> for the extent >>> allocated >>> from bg X >>> >>> relocate_block_group(rc->stage == >>> UPDATE_DATA_PTRS) >>> >>> prepare_to_relocate() >>> btrfs_commit_transaction() >>> --> delayed refs are run, so an extent >>> item for the allocated extent from >>> bg X is added to extent tree >>> --> commit roots are switched, so the >>> next scan in the extent tree will >>> see the extent item >>> >>> sees the extent in the extent tree >>> >>> When this happens the relocation produces the following warning when it >>> finishes: >>> >>> [ 7260.832836] ------------[ cut here ]------------ >>> [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 >>> btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() >>> [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq >>> psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core >>> pcspkr parport_pc >>> [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted >>> 4.5.0-rc6-btrfs-next-28+ #1 >>> [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> by qemu-project.org 04/01/2014 >>> [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 >>> 0000000000000000 >>> [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 >>> ffffffffa03c1b2d >>> [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 >>> ffff8800399dd000 >>> [ 7260.852998] Call Trace: >>> [ 7260.852998] [] dump_stack+0x67/0x90 >>> [ 7260.852998] [] warn_slowpath_common+0x99/0xb2 >>> [ 7260.852998] [] ? >>> btrfs_relocate_block_group+0x245/0x2a1 [btrfs] >>> [ 7260.852998] [] warn_slowpath_null+0x1a/0x1c >>> [ 7260.852998] [] >>> btrfs_relocate_block_group+0x245/0x2a1 [btrfs] >>> [ 7260.852998] [] >>> btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] >>> [ 7260.852998] [] btrfs_balance+0xde1/0xe4e [btrfs] >>> [ 7260.852998] [] ? debug_smp_processor_id+0x17/0x19 >>> [ 7260.852998] [] btrfs_ioctl_balance+0x255/0x2d3 >>> [btrfs] >>> [ 7260.852998] [] btrfs_ioctl+0x11e0/0x1dff [btrfs] >>> [ 7260.852998] [] ? handle_mm_fault+0x443/0xd63 >>> [ 7260.852998] [] ? _raw_spin_unlock+0x31/0x44 >>> [ 7260.852998] [] ? arch_local_irq_save+0x9/0xc >>> [ 7260.852998] [] vfs_ioctl+0x18/0x34 >>> [ 7260.852998] [] do_vfs_ioctl+0x550/0x5be >>> [ 7260.852998] [] ? __fget_light+0x4d/0x71 >>> [ 7260.852998] [] SyS_ioctl+0x57/0x79 >>> [ 7260.852998] [] entry_SYSCALL_64_fastpath+0x12/0x6b >>> [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- >>> >>> This is because at the end of the first stage, in relocate_block_group(), >>> we commit the current transaction, which makes delayed refs run, the >>> commit roots are switched and so the second stage will find the extent >>> item that the ordered extent added to the delayed refs. But this extent >>> was not moved (ordered extent completed after first stage finished), so >>> at the end of the relocation our block group item still has a positive >>> used bytes counter, triggering a warning at the end of >>> btrfs_relocate_block_group(). Later on when trying to read the extent >>> contents from disk we hit a BUG_ON() due to the inability to map a block >>> with a logical address that belongs to the block group we relocated and >>> is no longer valid, resulting in the following trace: >>> >>> [ 7344.885290] BTRFS critical (device sdi): unable to find logical >>> 12845056 len 4096 >>> [ 7344.887518] ------------[ cut here ]------------ >>> [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! >>> [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >>> [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq >>> psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core >>> pcspkr parport_pc >>> [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W >>> 4.5.0-rc6-btrfs-next-28+ #1 >>> [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> by qemu-project.org 04/01/2014 >>> [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: >>> ffff880204684000 >>> [ 7344.888431] RIP: 0010:[] [] >>> btrfs_merge_bio_hook+0x54/0x6b [btrfs] >>> [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 >>> [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: >>> 0000000000000001 >>> [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: >>> 00000000ffffffff >>> [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: >>> 0000000000000000 >>> [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: >>> 0000000000001000 >>> [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: >>> ffff88006cd199b0 >>> [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) >>> knlGS:0000000000000000 >>> [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: >>> 00000000000006f0 >>> [ 7344.888431] Stack: >>> [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 >>> ffff880204687950 >>> [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 >>> 0000000000001000 >>> [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 >>> 0000000000000000 >>> [ 7344.888431] Call Trace: >>> [ 7344.888431] [] submit_extent_page+0xf5/0x16f [btrfs] >>> [ 7344.888431] [] __do_readpage+0x4a0/0x4f1 [btrfs] >>> [ 7344.888431] [] ? btrfs_create_repair_bio+0xcb/0xcb >>> [btrfs] >>> [ 7344.888431] [] ? >>> btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >>> [ 7344.888431] [] ? trace_hardirqs_on+0xd/0xf >>> [ 7344.888431] [] >>> __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] >>> [ 7344.888431] [] ? >>> btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >>> [ 7344.888431] [] >>> __extent_readpages.constprop.25+0xed/0x100 [btrfs] >>> [ 7344.888431] [] ? lru_cache_add+0xe/0x10 >>> [ 7344.888431] [] extent_readpages+0x160/0x1aa [btrfs] >>> [ 7344.888431] [] ? >>> btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >>> [ 7344.888431] [] ? alloc_pages_current+0xa9/0xcd >>> [ 7344.888431] [] btrfs_readpages+0x1f/0x21 [btrfs] >>> [ 7344.888431] [] __do_page_cache_readahead+0x168/0x1fc >>> [ 7344.888431] [] ondemand_readahead+0x1f6/0x207 >>> [ 7344.888431] [] ? ondemand_readahead+0x1f6/0x207 >>> [ 7344.888431] [] ? pagecache_get_page+0x2b/0x154 >>> [ 7344.888431] [] page_cache_sync_readahead+0x3d/0x3f >>> [ 7344.888431] [] generic_file_read_iter+0x197/0x4e1 >>> [ 7344.888431] [] __vfs_read+0x79/0x9d >>> [ 7344.888431] [] vfs_read+0x8f/0xd2 >>> [ 7344.888431] [] SyS_read+0x50/0x7e >>> [ 7344.888431] [] entry_SYSCALL_64_fastpath+0x12/0x6b >>> [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b >>> 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 >>> <0f> 0b 4c 0 >>> [ 7344.888431] RIP [] btrfs_merge_bio_hook+0x54/0x6b >>> [btrfs] >>> [ 7344.888431] RSP >>> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- >>> >>> Signed-off-by: Filipe Manana >>> --- >>> fs/btrfs/ctree.h | 14 ++++++++++++ >>> fs/btrfs/extent-tree.c | 58 >>> ++++++++++++++++++++++++++++++++++++++++++++++++-- >>> fs/btrfs/inode.c | 8 +++++++ >>> fs/btrfs/relocation.c | 6 +----- >>> 4 files changed, 79 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >>> index 84a6a5b..90e70e2 100644 >>> --- a/fs/btrfs/ctree.h >>> +++ b/fs/btrfs/ctree.h >>> @@ -1408,6 +1408,17 @@ struct btrfs_block_group_cache { >>> >>> struct btrfs_io_ctl io_ctl; >>> >>> + /* >>> + * Incremented when doing extent allocations and holding a read >>> lock >>> + * on the space_info's groups_sem semaphore. >>> + * Decremented when an ordered extent that represents an IO >>> against this >>> + * block group's range is created (after it's added to its inode's >>> + * root's list of ordered extents) or immediately after the >>> allocation >>> + * if it's a metadata extent or fallocate extent (for these cases >>> we >>> + * don't create ordered extents). >>> + */ >>> + atomic_t reservations; >>> + >>> /* Lock for free space tree operations. */ >>> struct mutex free_space_lock; >>> >>> @@ -3499,6 +3510,9 @@ int btrfs_should_throttle_delayed_refs(struct >>> btrfs_trans_handle *trans, >>> struct btrfs_root *root); >>> int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, >>> struct btrfs_root *root); >>> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, >>> + const u64 start); >>> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache >>> *bg); >>> void btrfs_put_block_group(struct btrfs_block_group_cache *cache); >>> int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, >>> struct btrfs_root *root, unsigned long count); >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 0544f70..35e3ea9 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -6175,6 +6175,57 @@ int btrfs_exclude_logged_extents(struct btrfs_root >>> *log, >>> return 0; >>> } >>> >>> +static void >>> +btrfs_inc_block_group_reservations(struct btrfs_block_group_cache *bg) >>> +{ >>> + atomic_inc(&bg->reservations); >>> +} >>> + >>> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, >>> + const u64 start) >>> +{ >>> + struct btrfs_block_group_cache *bg; >>> + >>> + bg = btrfs_lookup_block_group(fs_info, start); >>> + BUG_ON(!bg); >> >> >> ASSERT(bg) instead please. >> >> >>> + if (atomic_dec_and_test(&bg->reservations)) >>> + wake_up_atomic_t(&bg->reservations); >>> + btrfs_put_block_group(bg); >>> +} >>> + >>> +static int btrfs_wait_bg_reservations_atomic_t(atomic_t *a) >>> +{ >>> + schedule(); >>> + return 0; >>> +} >>> + >>> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache >>> *bg) >>> +{ >>> + struct btrfs_space_info *space_info = bg->space_info; >>> + >>> + ASSERT(bg->ro); >>> + >>> + if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA)) >>> + return; >>> + >>> + /* >>> + * Our block group is read only but before we set it to read only, >>> + * some task might have had allocated an extent from it already, >>> but it >>> + * has not yet created a respective ordered extent (and added it >>> to a >>> + * root's list of ordered extents). >>> + * Therefore wait for any task currently allocating extents, since >>> the >>> + * block group's reservations counter is incremented while a read >>> lock >>> + * on the groups' semaphore is held and decremented after >>> releasing >>> + * the read access on that semaphore and creating the ordered >>> extent. >>> + */ >>> + down_write(&space_info->groups_sem); >>> + up_write(&space_info->groups_sem); >>> + >>> + wait_on_atomic_t(&bg->reservations, >>> + btrfs_wait_bg_reservations_atomic_t, >>> + TASK_UNINTERRUPTIBLE); >>> +} >>> + >>> /** >>> * btrfs_update_reserved_bytes - update the block_group and space info >>> counters >>> * @cache: The cache we are manipulating >>> @@ -7434,6 +7485,7 @@ checks: >>> btrfs_add_free_space(block_group, offset, >>> num_bytes); >>> goto loop; >>> } >>> + btrfs_inc_block_group_reservations(block_group); >>> >>> /* we are all good, lets return */ >>> ins->objectid = search_start; >>> @@ -7615,8 +7667,10 @@ again: >>> WARN_ON(num_bytes < root->sectorsize); >>> ret = find_free_extent(root, num_bytes, empty_size, hint_byte, >>> ins, >>> flags, delalloc); >>> - >>> - if (ret == -ENOSPC) { >>> + if (!ret && !is_data) { >>> + btrfs_dec_block_group_reservations(root->fs_info, >>> + ins->objectid); >>> + } else if (ret == -ENOSPC) { >>> if (!final_tried && ins->offset) { >>> num_bytes = min(num_bytes >> 1, ins->offset); >>> num_bytes = round_down(num_bytes, >>> root->sectorsize); >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 41a5688..0085899 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -824,6 +824,7 @@ retry: >>> async_extent->ram_size - >>> 1, 0); >>> goto out_free_reserve; >>> } >>> + btrfs_dec_block_group_reservations(root->fs_info, >>> ins.objectid); >>> >>> /* >>> * clear dirty, set writeback and unlock the pages. >>> @@ -861,6 +862,7 @@ retry: >>> } >>> return; >>> out_free_reserve: >>> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >>> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); >>> out_free: >>> extent_clear_unlock_delalloc(inode, async_extent->start, >>> @@ -1038,6 +1040,8 @@ static noinline int cow_file_range(struct inode >>> *inode, >>> goto out_drop_extent_cache; >>> } >>> >>> + btrfs_dec_block_group_reservations(root->fs_info, >>> ins.objectid); >>> + >>> if (disk_num_bytes < cur_alloc_size) >>> break; >>> >>> @@ -1066,6 +1070,7 @@ out: >>> out_drop_extent_cache: >>> btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); >>> out_reserve: >>> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >>> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); >>> out_unlock: >>> extent_clear_unlock_delalloc(inode, start, end, locked_page, >>> @@ -7162,6 +7167,8 @@ static struct extent_map >>> *btrfs_new_extent_direct(struct inode *inode, >>> return ERR_PTR(ret); >>> } >>> >>> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >>> + >>> em = create_pinned_em(inode, start, ins.offset, start, >>> ins.objectid, >>> ins.offset, ins.offset, ins.offset, 0); >>> if (IS_ERR(em)) { >>> @@ -9942,6 +9949,7 @@ static int __btrfs_prealloc_file_range(struct inode >>> *inode, int mode, >>> btrfs_end_transaction(trans, root); >>> break; >>> } >>> + btrfs_dec_block_group_reservations(root->fs_info, >>> ins.objectid); >>> >>> last_alloc = ins.offset; >>> ret = insert_reserved_file_extent(trans, inode, >> >> >> Instead of doing all this, why not just add the dec to the end of >> __btrfs_add_ordered_extents? It seems like it would be much cleaner. > > Indeed, and it was something I tried first. > The reason I didn't end up doing it like that is that we still need to > do it in error paths (before we call the add_ordered_extent variants), > so I ended up preferring to leave the dec operation always to the > responsibility of any extent allocation caller (so they do it for both > error and success paths). > What do you think? > We talked about this on irc, just updating here so nobody is confused. You are right, this is probably the best thing for now. Thanks, Josef