All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <jbacik@fb.com>
To: <fdmanana@kernel.org>, <linux-btrfs@vger.kernel.org>
Cc: Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH v3 2/2] Btrfs: don't do unnecessary delalloc flushes when relocating
Date: Wed, 27 Apr 2016 10:54:55 -0400	[thread overview]
Message-ID: <6c258f28-95bd-8a9c-bb97-729042fa6797@fb.com> (raw)
In-Reply-To: <1461692724-3923-1-git-send-email-fdmanana@kernel.org>

On 04/26/2016 01:45 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Before we start the actual relocation process of a block group, we do
> calls to flush delalloc of all inodes and then wait for ordered extents
> to complete. However we do these flush calls just to make sure we don't
> race with concurrent tasks that have actually already started to run
> delalloc and have allocated an extent from the block group we want to
> relocate, right before we set it to readonly mode, but have not yet
> created the respective ordered extents. The flush calls make us wait
> for such concurrent tasks because they end up calling
> filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
> btrfs_run_delalloc_work()) which ends up serializing us with those tasks
> due to attempts to lock the same pages (and the delalloc flush procedure
> calls the allocator and creates the ordered extents before unlocking the
> pages).
>
> These flushing calls not only make us waste time (cpu, IO) but also reduce
> the chances of writing larger extents (applications might be writing to
> contiguous ranges and we flush before they finish dirtying the whole
> ranges).
>
> So make sure we don't flush delalloc and just wait for concurrent tasks
> that have already started flushing delalloc and have allocated an extent
> from the block group we are about to relocate.
>
> This change also ends up fixing a race with direct IO writes that makes
> relocation not wait for direct IO ordered extents. This race is
> illustrated by the following diagram:
>
>         CPU 1                                       CPU 2
>
>  btrfs_relocate_block_group(bg X)
>
>                                            starts direct IO write,
>                                            target inode currently has no
>                                            ordered extents ongoing nor
>                                            dirty pages (delalloc regions),
>                                            therefore the root for our inode
>                                            is not in the list
>                                            fs_info->ordered_roots
>
>                                            btrfs_direct_IO()
>                                              __blockdev_direct_IO()
>                                                btrfs_get_blocks_direct()
>                                                  btrfs_lock_extent_direct()
>                                                    locks range in the io tree
>                                                  btrfs_new_extent_direct()
>                                                    btrfs_reserve_extent()
>                                                      --> extent allocated
>                                                          from bg X
>
>    btrfs_inc_block_group_ro(bg X)
>
>    btrfs_start_delalloc_roots()
>      __start_delalloc_inodes()
>        --> does nothing, no dealloc ranges
>            in the inode's io tree so the
>            inode's root is not in the list
>            fs_info->delalloc_roots
>
>    btrfs_wait_ordered_roots()
>      --> does not find the inode's root in the
>          list fs_info->ordered_roots
>
>      --> ends up not waiting for the direct IO
>          write started by the task at CPU 2
>
>    relocate_block_group(rc->stage ==
>      MOVE_DATA_EXTENTS)
>
>      prepare_to_relocate()
>        btrfs_commit_transaction()
>
>      iterates the extent tree, using its
>      commit root and moves extents into new
>      locations
>
>                                                    btrfs_add_ordered_extent_dio()
>                                                      --> now a ordered extent is
>                                                          created and added to the
>                                                          list root->ordered_extents
>                                                          and the root added to the
>                                                          list fs_info->ordered_roots
>                                                      --> this is too late and the
>                                                          task at CPU 1 already
>                                                          started the relocation
>
>      btrfs_commit_transaction()
>
>                                                    btrfs_finish_ordered_io()
>                                                      btrfs_alloc_reserved_file_extent()
>                                                        --> adds delayed data reference
>                                                            for the extent allocated
>                                                            from bg X
>
>    relocate_block_group(rc->stage ==
>      UPDATE_DATA_PTRS)
>
>      prepare_to_relocate()
>        btrfs_commit_transaction()
>          --> delayed refs are run, so an extent
>              item for the allocated extent from
>              bg X is added to extent tree
>          --> commit roots are switched, so the
>              next scan in the extent tree will
>              see the extent item
>
>      sees the extent in the extent tree
>
> When this happens the relocation produces the following warning when it
> finishes:
>
> [ 7260.832836] ------------[ cut here ]------------
> [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
> [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
> [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
> [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
> [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
> [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
> [ 7260.852998] Call Trace:
> [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
> [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
> [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
> [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
> [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
> [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
> [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
> [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
> [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
> [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
> [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
> [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
> [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
> [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
> [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
> [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
> [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
> [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
>
> This is because at the end of the first stage, in relocate_block_group(),
> we commit the current transaction, which makes delayed refs run, the
> commit roots are switched and so the second stage will find the extent
> item that the ordered extent added to the delayed refs. But this extent
> was not moved (ordered extent completed after first stage finished), so
> at the end of the relocation our block group item still has a positive
> used bytes counter, triggering a warning at the end of
> btrfs_relocate_block_group(). Later on when trying to read the extent
> contents from disk we hit a BUG_ON() due to the inability to map a block
> with a logical address that belongs to the block group we relocated and
> is no longer valid, resulting in the following trace:
>
> [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
> [ 7344.887518] ------------[ cut here ]------------
> [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
> [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
> [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
> [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
> [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
> [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
> [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
> [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
> [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
> [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
> [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
> [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
> [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
> [ 7344.888431] Stack:
> [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
> [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
> [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
> [ 7344.888431] Call Trace:
> [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
> [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
> [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
> [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
> [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
> [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
> [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
> [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
> [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
> [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
> [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
> [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
> [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
> [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
> [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
> [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
> [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
> [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
> [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
> [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
> [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
> [ 7344.888431]  RSP <ffff8802046878f0>
> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef


  reply	other threads:[~2016-04-27 14:55 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-26 15:39 [PATCH v2 0/2] Fix for race in relocation and avoid start and wait for unrelated IO fdmanana
2016-04-26 15:39 ` [PATCH v2 1/2] Btrfs: don't wait for unrelated IO to finish before relocation fdmanana
2016-04-26 15:55   ` Josef Bacik
2016-05-10 19:23   ` Liu Bo
2016-04-26 15:39 ` [PATCH v2 2/2] Btrfs: don't do unnecessary delalloc flushes when relocating fdmanana
2016-04-26 16:02   ` Josef Bacik
2016-04-26 16:09     ` Filipe Manana
2016-04-26 19:48       ` Josef Bacik
2016-04-26 17:45   ` [PATCH v3 " fdmanana
2016-04-27 14:54     ` Josef Bacik [this message]
2016-05-10 19:31     ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6c258f28-95bd-8a9c-bb97-729042fa6797@fb.com \
    --to=jbacik@fb.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.