Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs

From: Josef Bacik <josef@toxicpanda.com>
To: Naohiro Aota <naohiro.aota@wdc.com>
Cc: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
	Chris Mason <clm@fb.com>, Nikolay Borisov <nborisov@suse.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Hannes Reinecke <hare@suse.com>,
	Anand Jain <anand.jain@oracle.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
Date: Thu, 19 Dec 2019 09:01:35 -0500	[thread overview]
Message-ID: <ce94fc27-0167-087e-28f1-17e885ff5ddb@toxicpanda.com> (raw)
In-Reply-To: <20191219065457.rhd4wcycylii33c3@naota.dhcp.fujisawa.hgst.com>

On 12/19/19 1:54 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To preserve sequential write pattern on the drives, we must serialize
>>> allocation and submit_bio. This commit add per-block group mutex
>>> "zone_io_lock" and find_free_extent_zoned() hold the lock. The lock is kept
>>> even after returning from find_free_extent(). It is released when submiting
>>> IOs corresponding to the allocation is completed.
>>>
>>> Implementing such behavior under __extent_writepage_io() is almost
>>> impossible because once pages are unlocked we are not sure when submiting
>>> IOs for an allocated region is finished or not. Instead, this commit add
>>> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>> extent_write_locked_rage(). After the write, we can call
>>> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>> allocation.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Have you actually tested these patches with lock debugging on?  The 
>> submit_compressed_extents stuff is async, so the unlocker owner will not be 
>> the lock owner, and that'll make all sorts of things blow up. This is just 
>> straight up broken.
> 
> Yes, I have ran xfstests on this patch series with lockdeps and
> KASAN. There was no problem with that.
> 
> For non-compressed writes, both allocation and submit is done in
> run_delalloc_zoned(). Allocation is done in cow_file_range() and
> submit is done in extent_write_locked_range(), so both are in the same
> context, so both locking and unlocking are done by the same execution
> context.
> 
> For compressed writes, again, allocation/lock is done under
> cow_file_range() and submit is done in extent_write_locked_range() and
> unlocked all in submit_compressed_extents() (this is called after
> compression), so they are all in the same context and the lock owner
> does the unlock.
> 
>> I would really rather see a hmzoned block scheduler that just doesn't submit 
>> the bio's until they are aligned with the WP, that way this intellligence 
>> doesn't have to be dealt with at the file system layer. I get allocating in 
>> line with the WP, but this whole forcing us to allocate and submit the bio in 
>> lock step is just nuts, and broken in your subsequent patches.  This whole 
>> approach needs to be reworked. Thanks,
>>
>> Josef
> 
> We tried this approach by modifying mq-deadline to wait if the first
> queued request is not aligned at the write pointer of a zone. However,
> running btrfs without the allocate+submit lock with this modified IO
> scheduler did not work well at all. With write intensive workloads, we
> observed that a very long wait time was very often necessary to get a
> fully sequential stream of requests starting at the write pointer of a
> zone. The wait time we observed was sometimes in larger than 60 seconds,
> at which point we gave up.

This is because we will only write out the pages we've been handed but do 
cow_file_range() for a possibly larger delalloc range, so as you say there can 
be a large gap in time between writing one part of the range and writing the 
next part.

You actually solve this with your patch, by doing the cow_file_range and then 
following it up with the extent_write_locked_range() for the range you just cow'ed.

There is no need for the locking in this case, you could simply do that and then 
have a modified block scheduler that keeps the bio's in the correct order.  I 
imagine if you just did this with your original block layer approach it would 
work fine.  Thanks,

Josef