Re: [PATCH v13 22/42] btrfs: split ordered extent when bio is sent

From: Josef Bacik <josef@toxicpanda.com>
To: Naohiro Aota <naohiro.aota@wdc.com>,
	linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	kernel test robot <lkp@intel.com>
Subject: Re: [PATCH v13 22/42] btrfs: split ordered extent when bio is sent
Date: Fri, 22 Jan 2021 10:22:10 -0500	[thread overview]
Message-ID: <e265540c-9613-9473-f7e6-0f55d455b18e@toxicpanda.com> (raw)
In-Reply-To: <25b86d9571b1af386f1711d0d0ae626ae6a86b35.1611295439.git.naohiro.aota@wdc.com>

On 1/22/21 1:21 AM, Naohiro Aota wrote:
> For a zone append write, the device decides the location the data is
> written to. Therefore we cannot ensure that two bios are written
> consecutively on the device. In order to ensure that a ordered extent maps
> to a contiguous region on disk, we need to maintain a "one bio == one
> ordered extent" rule.
> 
> This commit implements the splitting of an ordered extent and extent map
> on bio submission to adhere to the rule.
> 
> [testbot] made extract_ordered_extent static
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/inode.c        | 95 +++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/ordered-data.c | 85 ++++++++++++++++++++++++++++++++++++
>   fs/btrfs/ordered-data.h |  2 +
>   3 files changed, 182 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2e1c1f37b3f6..ab97d4349515 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2217,6 +2217,92 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
>   	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
>   }
>   
> +static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
> +					   struct bio *bio, loff_t file_offset)
> +{
> +	struct btrfs_ordered_extent *ordered;
> +	struct extent_map *em = NULL, *em_new = NULL;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +	u64 len = bio->bi_iter.bi_size;
> +	u64 end = start + len;
> +	u64 ordered_end;
> +	u64 pre, post;
> +	int ret = 0;
> +
> +	ordered = btrfs_lookup_ordered_extent(inode, file_offset);
> +	if (WARN_ON_ONCE(!ordered))
> +		return BLK_STS_IOERR;
> +
> +	/* No need to split */
> +	if (ordered->disk_num_bytes == len)
> +		goto out;
> +
> +	/* We cannot split once end_bio'd ordered extent */
> +	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We cannot split a compressed ordered extent */
> +	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We cannot split a waited ordered extent */
> +	if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

How is this not a problem?  We can have any arbitrary waiter on an ordered 
extent at any given time?  Write to an area with memory pressure and then fsync 
immediately so we have to wait on an ordered extent that may need to be split, 
bam you get this warning and fail to write out.  This seems like a bad side effect.

> +
> +	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
> +	/* bio must be in one ordered extent */
> +	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Checksum list should be empty */
> +	if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	pre = start - ordered->disk_bytenr;
> +	post = ordered_end - end;
> +
> +	ret = btrfs_split_ordered_extent(ordered, pre, post);
> +	if (ret)
> +		goto out;
> +
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
> +	if (!em) {
> +		read_unlock(&em_tree->lock);
> +		ret = -EIO;
> +		goto out;
> +	}
> +	read_unlock(&em_tree->lock);
> +
> +	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> +	em_new = create_io_em(inode, em->start + pre, len,
> +			      em->start + pre, em->block_start + pre, len,
> +			      len, len, BTRFS_COMPRESS_NONE,
> +			      BTRFS_ORDERED_REGULAR);

This bit confuses me, the io_em is just so we have a mapping to an area that's 
being written to, and this is created at ordered extent time.  I get why we need 
to split up the ordered extent, but the existing io_em should be fine, right? 
Thanks,

Josef