All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate
Date: Tue, 15 Mar 2022 21:16:27 +0800	[thread overview]
Message-ID: <20220315211627.BD66.409509F4@e16-tech.com> (raw)
In-Reply-To: <dde255ba71e1bfe38daf88c1b7ef51df15d5c150.1647346287.git.fdmanana@suse.com>

Hi,

> TODO: test case when the file with/without snapshot
> 
> From: Filipe Manana <fdmanana@suse.com>
> 
> During a plain fallocate, we always start by reserving an amount of data
> space that matches the length of the range passed to fallocate. When we
> already have extents allocated in that range, we may end up trying to
> reserve a lot more data space then we need, which can result in two
> undesired behaviours:

Need we check whether the allocated extents are exclusive  or shared?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/03/15

> 
> 1) We fail with -ENOSPC. For example the passed range has a length
>    of 1G, but there's only one hole with a size of 1M in that range;
> 
> 2) We temporarily reserve excessive data space that could be used by
>    other operations happening concurrently;
> 
> 3) By reserving much more data space then we need, we can end up
>    doing expensive things like triggering dellaloc for other inodes,
>    waiting for the ordered extents to complete, trigger transaction
>    commits, allocate new block groups, etc.
> 
> Example:
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdj
>   MNT=/mnt/sdj
> 
>   mkfs.btrfs -f -b 1g $DEV
>   mount $DEV $MNT
> 
>   # Create a file with a size of 600M and two holes, one at [200M, 201M[
>   # and another at [401M, 402M[
>   xfs_io -f -c "pwrite -S 0xab 0 200M" \
>             -c "pwrite -S 0xcd 201M 200M" \
>             -c "pwrite -S 0xef 402M 198M" \
>             $MNT/foobar
> 
>   # Now call fallocate against the whole file range, see if it fails
>   # with -ENOSPC or not - it shouldn't since we only need to allocate
>   # 2M of data space.
>   xfs_io -c "falloc 0 600M" $MNT/foobar
> 
>   umount $MNT
> 
>   $ ./test.sh
>   (...)
>   wrote 209715200/209715200 bytes at offset 0
>   200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
>   wrote 209715200/209715200 bytes at offset 210763776
>   200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
>   wrote 207618048/207618048 bytes at offset 421527552
>   198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
>   fallocate: No space left on device
>   $
> 
> So fix this by not allocating an amount of data space that matches the
> length of the range passed to fallocate. Instead allocate an amount of
> data space that corresponds to the sum of the sizes of each hole found
> in the range. This reservation now happens after we have locked the file
> range, which is safe since we know at this point there's no delalloc
> in the range because we've taken the inode's VFS lock in exclusive mode,
> we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
> delalloc and waited for all ordered extents in the range to complete.
> 
> This type of failure actually seems to happen in pratice with systemd,
> and we had at least one report about this in a very long thread which
> is referenced by the Link tag below.
> 
> Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 69 ++++++++++++++++++++++++++-----------------------
>  1 file changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 380054c94e4b..b7c0db1000cd 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	u64 alloc_hint = 0;
>  	u64 locked_end;
>  	u64 actual_end = 0;
> +	u64 data_space_needed = 0;
> +	u64 data_space_reserved = 0;
> +	u64 qgroup_reserved = 0;
>  	struct extent_map *em;
>  	int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode));
>  	int ret;
> @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	if (mode & FALLOC_FL_PUNCH_HOLE)
>  		return btrfs_punch_hole(file, offset, len);
>  
> -	/*
> -	 * Only trigger disk allocation, don't trigger qgroup reserve
> -	 *
> -	 * For qgroup space, it will be checked later.
> -	 */
> -	if (!(mode & FALLOC_FL_ZERO_RANGE)) {
> -		ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> -						      alloc_end - alloc_start);
> -		if (ret < 0)
> -			return ret;
> -	}
> -
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode,
>  		if (em->block_start == EXTENT_MAP_HOLE ||
>  		    (cur_offset >= inode->i_size &&
>  		     !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +			const u64 range_len = last_byte - cur_offset;
> +
>  			ret = add_falloc_range(&reserve_list, cur_offset,
> -					       last_byte - cur_offset);
> +					       range_len);
>  			if (ret < 0) {
>  				free_extent_map(em);
>  				break;
>  			}
>  			ret = btrfs_qgroup_reserve_data(BTRFS_I(inode),
>  					&data_reserved, cur_offset,
> -					last_byte - cur_offset);
> +					range_len);
>  			if (ret < 0) {
> -				cur_offset = last_byte;
>  				free_extent_map(em);
>  				break;
>  			}
> -		} else {
> -			/*
> -			 * Do not need to reserve unwritten extent for this
> -			 * range, free reserved data space first, otherwise
> -			 * it'll result in false ENOSPC error.
> -			 */
> -			btrfs_free_reserved_data_space(BTRFS_I(inode),
> -				data_reserved, cur_offset,
> -				last_byte - cur_offset);
> +			qgroup_reserved += range_len;
> +			data_space_needed += range_len;
>  		}
>  		free_extent_map(em);
>  		cur_offset = last_byte;
>  	}
>  
> +	if (!ret && data_space_needed > 0) {
> +		/*
> +		 * We are safe to reserve space here as we can't have delalloc
> +		 * in the range, see above.
> +		 */
> +		ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> +						      data_space_needed);
> +		if (!ret)
> +			data_space_reserved = data_space_needed;
> +	}
> +
>  	/*
>  	 * If ret is still 0, means we're OK to fallocate.
>  	 * Or just cleanup the list and exit.
>  	 */
>  	list_for_each_entry_safe(range, tmp, &reserve_list, list) {
> -		if (!ret)
> +		if (!ret) {
>  			ret = btrfs_prealloc_file_range(inode, mode,
>  					range->start,
>  					range->len, i_blocksize(inode),
>  					offset + len, &alloc_hint);
> -		else
> +			/*
> +			 * btrfs_prealloc_file_range() releases space even
> +			 * if it returns an error.
> +			 */
> +			data_space_reserved -= range->len;
> +			qgroup_reserved -= range->len;
> +		} else if (data_space_reserved > 0) {
>  			btrfs_free_reserved_data_space(BTRFS_I(inode),
> -					data_reserved, range->start,
> -					range->len);
> +					       data_reserved, range->start,
> +					       range->len);
> +			data_space_reserved -= range->len;
> +			qgroup_reserved -= range->len;
> +		} else if (qgroup_reserved > 0) {
> +			btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved,
> +					       range->start, range->len);
> +			qgroup_reserved -= range->len;
> +		}
>  		list_del(&range->list);
>  		kfree(range);
>  	}
> @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>  			     &cached_state);
>  out:
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
> -	/* Let go of our reservation. */
> -	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
> -		btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
> -				cur_offset, alloc_end - cur_offset);
>  	extent_changeset_free(data_reserved);
>  	return ret;
>  }
> -- 
> 2.33.0



  reply	other threads:[~2022-03-15 13:16 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 10:47 ` [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 10:47 ` [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 10:47 ` [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 10:47 ` [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 10:47 ` [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 12:18   ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 13:16     ` Wang Yugui [this message]
2022-03-15 13:45       ` Filipe Manana
2022-03-15 12:18   ` [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 12:18   ` [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 12:18   ` [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 12:18   ` [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana
2022-03-15 12:18   ` [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 12:18   ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 15:22   ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 15:22   ` [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 15:22   ` [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 15:22   ` [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 15:22   ` [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana
2022-03-15 15:22   ` [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 15:22   ` [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-16 16:29   ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220315211627.BD66.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.