Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs

From: Naohiro Aota <naohiro.aota@wdc.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
	Chris Mason <clm@fb.com>, Nikolay Borisov <nborisov@suse.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Hannes Reinecke <hare@suse.com>,
	Anand Jain <anand.jain@oracle.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
Date: Thu, 19 Dec 2019 15:54:57 +0900
Message-ID: <20191219065457.rhd4wcycylii33c3@naota.dhcp.fujisawa.hgst.com>
In-Reply-To: <b11ca55e-adb6-6aa7-4494-cffafedb487f@toxicpanda.com>

On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To preserve sequential write pattern on the drives, we must serialize
>>allocation and submit_bio. This commit add per-block group mutex
>>"zone_io_lock" and find_free_extent_zoned() hold the lock. The lock is kept
>>even after returning from find_free_extent(). It is released when submiting
>>IOs corresponding to the allocation is completed.
>>
>>Implementing such behavior under __extent_writepage_io() is almost
>>impossible because once pages are unlocked we are not sure when submiting
>>IOs for an allocated region is finished or not. Instead, this commit add
>>run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>extent_write_locked_rage(). After the write, we can call
>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>allocation.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Have you actually tested these patches with lock debugging on?  The 
>submit_compressed_extents stuff is async, so the unlocker owner will 
>not be the lock owner, and that'll make all sorts of things blow up.  
>This is just straight up broken.

Yes, I have run xfstests on this patch series with lockdep and
KASAN enabled. There was no problem with that.

For non-compressed writes, both allocation and submission are done in
run_delalloc_hmzoned(): allocation happens in cow_file_range() and
submission in extent_write_locked_range(). Both run in the same
execution context, so the lock is taken and released by the same
owner.

For compressed writes, again, allocation and locking are done under
cow_file_range(), submission is done in extent_write_locked_range(),
and the unlock follows, all within submit_compressed_extents() (which
is called after compression). So they are all in the same context and
the lock owner does the unlock.
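
To make the ownership argument concrete, here is a minimal userspace
sketch. It is not the kernel code: the toy_* names, the integer context
id and the hand-rolled owner check are all assumptions made for
illustration, standing in for the per-block group lock and for the
owner check that lock debugging performs. It only shows that when
allocation, submission and unlock run in one context, the unlocker is
the lock owner and writes land at the write pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_NO_OWNER (-1)

/* Hypothetical stand-in for a block group backed by one zone. */
struct toy_block_group {
	int lock_owner;		/* context id holding the lock, or TOY_NO_OWNER */
	uint64_t alloc_offset;	/* next byte to hand out to an allocation */
	uint64_t write_pointer;	/* next byte the zone accepts */
};

static void toy_bg_init(struct toy_block_group *bg)
{
	bg->lock_owner = TOY_NO_OWNER;
	bg->alloc_offset = 0;
	bg->write_pointer = 0;
}

/* Take the lock for context ctx; fails instead of sleeping in this toy. */
static int toy_lock(struct toy_block_group *bg, int ctx)
{
	if (bg->lock_owner != TOY_NO_OWNER)
		return -EBUSY;
	bg->lock_owner = ctx;
	return 0;
}

/* Release the lock; a context that is not the owner is rejected. */
static int toy_unlock(struct toy_block_group *bg, int ctx)
{
	if (bg->lock_owner != ctx)
		return -EPERM;
	bg->lock_owner = TOY_NO_OWNER;
	return 0;
}

/* Allocate, submit and unlock, all in the same context ctx. */
static int toy_alloc_and_submit(struct toy_block_group *bg, int ctx,
				uint64_t len)
{
	uint64_t start;
	int ret, uret;

	ret = toy_lock(bg, ctx);
	if (ret)
		return ret;

	/* Allocation: sequential, so it starts where the last one ended. */
	start = bg->alloc_offset;
	bg->alloc_offset += len;

	/* Submission: the zone only accepts a write at its write pointer. */
	if (start == bg->write_pointer)
		bg->write_pointer += len;
	else
		ret = -EIO;
	assert(bg->write_pointer <= bg->alloc_offset);

	/* Unlock by the same context that took the lock. */
	uret = toy_unlock(bg, ctx);
	return ret ? ret : uret;
}
```

In this model, two back-to-back calls from different contexts both
succeed and advance the write pointer in order, while an unlock
attempted by a context that did not take the lock returns -EPERM.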

>I would really rather see a hmzoned block scheduler that just doesn't 
>submit the bio's until they are aligned with the WP, that way this 
>intellligence doesn't have to be dealt with at the file system layer.  
>I get allocating in line with the WP, but this whole forcing us to 
>allocate and submit the bio in lock step is just nuts, and broken in 
>your subsequent patches.  This whole approach needs to be reworked.  
>Thanks,
>
>Josef

We tried this approach by modifying mq-deadline to wait if the first
queued request is not aligned with the write pointer of a zone. However,
running btrfs without the allocate+submit lock on this modified IO
scheduler did not work well at all. With write-intensive workloads, we
observed that a very long wait was very often necessary to get a
fully sequential stream of requests starting at the write pointer of a
zone. The wait we observed was sometimes longer than 60 seconds,
at which point we gave up.
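
As a rough illustration of why such a scheduler can stall, here is a
small userspace model. It is not the modified mq-deadline code: struct
toy_req and toy_drain() are hypothetical names, and the model ignores
timing entirely. It only shows that requests queued ahead of the one
that starts at the write pointer cannot be dispatched until that
request arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued write request inside one zone. */
struct toy_req {
	uint64_t start;		/* byte offset of the write in the zone */
	uint64_t len;		/* length of the write in bytes */
	int dispatched;		/* nonzero once sent to the device */
};

/*
 * Dispatch every request among the first n queued ones that is, or
 * becomes, contiguous with the write pointer *wp. Returns the number
 * of requests dispatched by this call.
 */
static size_t toy_drain(struct toy_req *q, size_t n, uint64_t *wp)
{
	size_t done = 0;
	int progress = 1;

	assert(q != NULL && wp != NULL);

	while (progress) {
		progress = 0;
		for (size_t i = 0; i < n; i++) {
			if (!q[i].dispatched && q[i].start == *wp) {
				q[i].dispatched = 1;
				*wp += q[i].len;
				done++;
				progress = 1;
			}
		}
	}
	return done;
}
```

With three 4096-byte requests arriving in the order 8192, 4096, 0,
toy_drain() dispatches nothing while only the first one or two are
queued; all three go out only once the request at offset 0 shows up.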

While we did not dig extensively into the fundamental root cause,
these potentially long wait times can have many causes: page cache
writeback behavior, kernel process scheduling, device IO congestion
and writeback throttling, sync, and btrfs transaction commit. cgroup
use could make everything even worse. In the worst case, a number of
out-of-order requests could get stuck in the IO scheduler, preventing
forward progress during memory reclaim writeback and causing the OOM
killer to start happily killing application processes. Furthermore, IO
error handling becomes a nightmare, as the block layer scheduler would
need to issue report zones commands to re-sync the zone WP after a
write error. That is in addition to having to track the other zone
commands that change a zone WP, such as reset zone and finish zone.

Considering all this, handling the sequential write constraint at the
file system layer by ensuring that write BIOs are issued in the correct
order starting from a zone WP is far simpler and removes dependencies on
other features such as cgroup, congestion control and other throttling
mechanisms. The IO scheduler can always dispatch to the device the
requests it received without any waiting time, ensuring forward progress.

The mq-deadline IO scheduler supports both regular and zoned block
devices, and it is the default scheduler for the latter. Schedulers
that are not zone compliant cannot be selected (one cannot switch to
kyber or bfq). This ensures that the default system behavior will be
correct as long as the user (the FS) respects the sequential write
rule.

The previous approach I proposed, using a btrfs request reordering
stage, was indeed very invasive and, like the block layer scheduler
changes, could cause problems with cgroups etc. The new approach in
this patch uses locking to make allocation and bio issuing atomic,
which results in per-zone sequential write patterns no matter what
happens around it. It is less invasive and relies on the sequential
allocation of blocks for the ordering of write IOs, so there is no
explicit reordering and no additional overhead. The f2fs
implementation has used a similar approach since kernel 4.10 and has
proven to be very solid.

In light of these arguments and explanations, do you still think the
allocate zone locking approach is not acceptable?