From: Josef Bacik <josef@toxicpanda.com>
To: Naohiro Aota <naohiro.aota@wdc.com>
Cc: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
Chris Mason <clm@fb.com>, Nikolay Borisov <nborisov@suse.com>,
Damien Le Moal <damien.lemoal@wdc.com>,
Johannes Thumshirn <jthumshirn@suse.de>,
Hannes Reinecke <hare@suse.com>,
Anand Jain <anand.jain@oracle.com>,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
Date: Thu, 19 Dec 2019 09:01:35 -0500
Message-ID: <ce94fc27-0167-087e-28f1-17e885ff5ddb@toxicpanda.com>
In-Reply-To: <20191219065457.rhd4wcycylii33c3@naota.dhcp.fujisawa.hgst.com>

On 12/19/19 1:54 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To preserve sequential write pattern on the drives, we must serialize
>>> allocation and submit_bio. This commit adds a per-block group mutex
>>> "zone_io_lock", and find_free_extent_zoned() holds the lock. The lock is kept
>>> even after returning from find_free_extent(). It is released when submitting
>>> the IOs corresponding to the allocation is completed.
>>>
>>> Implementing such behavior under __extent_writepage_io() is almost
>>> impossible because once pages are unlocked we cannot be sure whether submitting
>>> IOs for an allocated region has finished. Instead, this commit adds
>>> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>> extent_write_locked_range(). After the write, we can call
>>> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>> allocation.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Have you actually tested these patches with lock debugging on? The
>> submit_compressed_extents stuff is async, so the unlocker will not be
>> the lock owner, and that'll make all sorts of things blow up. This is just
>> straight up broken.
>
> Yes, I have run xfstests on this patch series with lockdep and
> KASAN. There was no problem with that.
>
> For non-compressed writes, both allocation and submit are done in
> run_delalloc_zoned(). Allocation is done in cow_file_range() and
> submit is done in extent_write_locked_range(), so both locking and
> unlocking are done by the same execution context.
>
> For compressed writes, again, allocation/lock is done under
> cow_file_range(), submit is done in extent_write_locked_range(), and
> the unlock is done, all in submit_compressed_extents() (this is called
> after compression), so they are all in the same context and the lock
> owner does the unlock.
>
>> I would really rather see a hmzoned block scheduler that just doesn't submit
>> the bios until they are aligned with the WP, that way this intelligence
>> doesn't have to be dealt with at the file system layer. I get allocating in
>> line with the WP, but this whole forcing us to allocate and submit the bio in
>> lock step is just nuts, and broken in your subsequent patches. This whole
>> approach needs to be reworked. Thanks,
>>
>> Josef
>
> We tried this approach by modifying mq-deadline to wait if the first
> queued request is not aligned at the write pointer of a zone. However,
> running btrfs without the allocate+submit lock with this modified IO
> scheduler did not work well at all. With write intensive workloads, we
> observed that a very long wait time was very often necessary to get a
> fully sequential stream of requests starting at the write pointer of a
> zone. The wait time we observed was sometimes longer than 60 seconds,
> at which point we gave up.

This is because we will only write out the pages we've been handed but do
cow_file_range() for a possibly larger delalloc range, so as you say there can
be a large gap in time between writing one part of the range and writing the
next part.

You actually solve this with your patch, by doing the cow_file_range() and then
following it up with the extent_write_locked_range() for the range you just cow'ed.

There is no need for the locking in this case, you could simply do that and then
have a modified block scheduler that keeps the bios in the correct order. I
imagine if you just did this with your original block layer approach it would
work fine. Thanks,

Josef
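
To make the ordering idea above concrete, here is a minimal userspace sketch of
a dispatcher that holds requests back until they line up with a zone's write
pointer. It is only a toy model under stated assumptions: it is not mq-deadline
and not any kernel API, and the names zreq, zone_queue, zq_submit and
zq_dispatch are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical model of one write request covering [start, start + len). */
struct zreq {
    uint64_t start;
    uint64_t len;
};

/* Hypothetical model of one sequential-write zone and its held-back requests. */
struct zone_queue {
    uint64_t wp;                       /* current write pointer */
    struct zreq pending[MAX_PENDING];  /* requests not yet aligned with wp */
    size_t npending;
};

/* Queue a request; returns 0 on success, -1 if the queue is full. */
static int zq_submit(struct zone_queue *z, uint64_t start, uint64_t len)
{
    if (z->npending == MAX_PENDING)
        return -1;
    z->pending[z->npending].start = start;
    z->pending[z->npending].len = len;
    z->npending++;
    return 0;
}

/*
 * Dispatch every request that is (or becomes) aligned with the write pointer.
 * The start of each dispatched request is stored in out[], up to out_cap
 * entries. Returns the number of requests dispatched.
 */
static size_t zq_dispatch(struct zone_queue *z, uint64_t *out, size_t out_cap)
{
    size_t ndone = 0;
    int progress = 1;

    assert(z->npending <= MAX_PENDING);

    while (progress && ndone < out_cap) {
        progress = 0;
        for (size_t i = 0; i < z->npending; i++) {
            if (z->pending[i].start != z->wp)
                continue;
            /* Aligned with the write pointer: issue it and advance wp. */
            out[ndone++] = z->pending[i].start;
            z->wp += z->pending[i].len;
            /* Remove the entry by moving the last one into its slot. */
            z->pending[i] = z->pending[--z->npending];
            progress = 1;
            break;
        }
    }
    return ndone;
}
```

In this model, a request that arrives ahead of the write pointer simply waits
in pending[] until the request that fills the gap shows up. If the whole
cow'ed range is submitted right after allocation, that wait should stay short,
which is the point made above.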