linux-btrfs.vger.kernel.org archive mirror
From: Nikolay Borisov <nborisov@suse.com>
To: Josef Bacik <josef@toxicpanda.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Cc: Josef Bacik <jbacik@fb.com>
Subject: Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv
Date: Fri, 7 Dec 2018 16:45:37 +0200	[thread overview]
Message-ID: <b9d58b65-cccd-4e66-ae1c-15847f947a62@suse.com> (raw)
In-Reply-To: <20181203152038.21388-6-josef@toxicpanda.com>



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if it's a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in how the space is managed. Essentially, for the
delayed refs rsv you only ever increase ->size, and ->reserved is bumped
only when you have to refill. This is the opposite of how other metadata
space is managed, i.e. via use_block_rsv, which subtracts from
->reserved every time a block has to be CoW'ed. Why the difference?
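To make the contrast concrete, here is a minimal userspace model of the
two accounting styles (this is not btrfs code; struct and function names
are illustrative only):

```c
#include <assert.h>

struct rsv {
	unsigned long long size;     /* how much space we want reserved */
	unsigned long long reserved; /* how much we actually hold */
};

/* Style 1, use_block_rsv-like: consume ->reserved on every use. */
static int use_style_consume(struct rsv *r, unsigned long long bytes)
{
	if (r->reserved < bytes)
		return -1; /* caller would need to flush/refill first */
	r->reserved -= bytes;
	r->size -= bytes;
	return 0;
}

/* Style 2, delayed_refs_rsv-like: grow ->size as work is tracked,
 * bump ->reserved only when a refill actually happens. */
static void grow_style_track(struct rsv *r, unsigned long long bytes)
{
	r->size += bytes; /* record that we will need this much */
}

static unsigned long long grow_style_refill(struct rsv *r,
					    unsigned long long avail)
{
	unsigned long long need = r->size - r->reserved;
	unsigned long long take = need < avail ? need : avail;

	r->reserved += take;
	return take;
}
```

In style 1 both fields shrink on use; in style 2 the gap between ->size
and ->reserved is what later drives the refill.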


> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/ctree.h       |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 ++++++++--
>  fs/btrfs/disk-io.c     |   4 +
>  fs/btrfs/extent-tree.c | 212 +++++++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/transaction.c |  37 ++++++++-
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 

<snip>

> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_block_rsv *src,
> +				       u64 num_bytes)

This function is currently used only during transaction start and seems
rather specific to the delayed refs, so I'd suggest making it private to
transaction.c.

> +{
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> +	u64 to_free = 0;
> +
> +	spin_lock(&src->lock);
> +	src->reserved -= num_bytes;
> +	src->size -= num_bytes;
> +	spin_unlock(&src->lock);
> +
> +	spin_lock(&delayed_refs_rsv->lock);
> +	if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> +		u64 delta = delayed_refs_rsv->size -
> +			delayed_refs_rsv->reserved;
> +		if (num_bytes > delta) {
> +			to_free = num_bytes - delta;
> +			num_bytes = delta;
> +		}
> +	} else {
> +		to_free = num_bytes;
> +		num_bytes = 0;
> +	}
> +
> +	if (num_bytes)
> +		delayed_refs_rsv->reserved += num_bytes;
> +	if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> +		delayed_refs_rsv->full = 1;
> +	spin_unlock(&delayed_refs_rsv->lock);
> +
> +	if (num_bytes)
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, num_bytes, 1);
> +	if (to_free)
> +		space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +					 to_free);
> +}
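For reference, the transfer math above can be modeled in userspace like
this (a sketch only, with illustrative names; the locking and tracing
are omitted):

```c
#include <assert.h>

struct rsv {
	unsigned long long size;
	unsigned long long reserved;
};

/* Move up to num_bytes into the target rsv; return the excess that
 * should be handed back to the space_info. */
static unsigned long long migrate_bytes(struct rsv *target,
					unsigned long long num_bytes)
{
	unsigned long long to_free = 0;

	if (target->size > target->reserved) {
		unsigned long long delta = target->size - target->reserved;

		if (num_bytes > delta) {
			/* more than needed: keep delta, free the rest */
			to_free = num_bytes - delta;
			num_bytes = delta;
		}
	} else {
		/* target already full: everything goes back */
		to_free = num_bytes;
		num_bytes = 0;
	}
	target->reserved += num_bytes;
	return to_free;
}
```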
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_reserve_flush_enum flush)
> +{
> +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> +	u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> +	u64 num_bytes = 0;
> +	int ret = -ENOSPC;
> +
> +	spin_lock(&block_rsv->lock);
> +	if (block_rsv->reserved < block_rsv->size) {
> +		num_bytes = block_rsv->size - block_rsv->reserved;
> +		num_bytes = min(num_bytes, limit);
> +	}
> +	spin_unlock(&block_rsv->lock);
> +
> +	if (!num_bytes)
> +		return 0;
> +
> +	ret = reserve_metadata_bytes(fs_info->extent_root, block_rsv,
> +				     num_bytes, flush);
> +	if (ret)
> +		return ret;
> +	block_rsv_add_bytes(block_rsv, num_bytes, 0);
> +	trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +				      0, num_bytes, 1);
> +	return 0;
> +}

This is also called from a single place in transaction.c, so it can be
made static.
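The amount-to-refill computation in btrfs_delayed_refs_rsv_refill boils
down to clamping the outstanding gap to one item's worth of metadata.
A userspace sketch (illustrative names, not btrfs code):

```c
#include <assert.h>

struct rsv {
	unsigned long long size;
	unsigned long long reserved;
};

/* How much to reserve: the size/reserved gap, capped at limit
 * (in the patch, limit is btrfs_calc_trans_metadata_size(fs_info, 1)). */
static unsigned long long refill_amount(const struct rsv *r,
					unsigned long long limit)
{
	unsigned long long need = 0;

	if (r->reserved < r->size)
		need = r->size - r->reserved;
	return need < limit ? need : limit;
}
```

A zero result means the rsv is already full and no reservation call is
needed at all.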

> +
> +
>  /*
>   * This is for space we already have accounted in space_info->bytes_may_use, so
>   * basically when we're returning space from block_rsv's.
> @@ -5709,6 +5809,31 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  	return ret;
>  }
>  
> +static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_block_rsv *block_rsv,
> +				     u64 num_bytes, u64 *qgroup_to_release)
> +{
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> +	struct btrfs_block_rsv *target = delayed_rsv;
> +
> +	if (target->full || target == block_rsv)
> +		target = global_rsv;
> +
> +	if (block_rsv->space_info != target->space_info)
> +		target = NULL;
> +
> +	return block_rsv_release_bytes(fs_info, block_rsv, target, num_bytes,
> +				       qgroup_to_release);
> +}
> +
> +void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> +			     struct btrfs_block_rsv *block_rsv,
> +			     u64 num_bytes)
> +{
> +	__btrfs_block_rsv_release(fs_info, block_rsv, num_bytes, NULL);
> +}

Again, I think the changelog should mention that one of the major
behavior changes here is that you essentially prioritize refilling the
delayed refs rsv over refilling the global rsv, by "hijacking" all
block rsv releases to go through __btrfs_block_rsv_release.
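The release-target policy in __btrfs_block_rsv_release above can be
summarized as follows (a hedged sketch with made-up names, not the
actual btrfs code):

```c
#include <assert.h>

enum target { TO_DELAYED_REFS, TO_GLOBAL, TO_SPACE_INFO };

/* Excess bytes go to the delayed refs rsv first, fall back to the
 * global rsv if it is full (or is itself the rsv being released),
 * and go back to the space_info if the space_infos don't match. */
static enum target pick_release_target(int delayed_full,
				       int releasing_delayed_itself,
				       int same_space_info)
{
	int use_global = delayed_full || releasing_delayed_itself;

	if (!same_space_info)
		return TO_SPACE_INFO;
	return use_global ? TO_GLOBAL : TO_DELAYED_REFS;
}
```

This is exactly the prioritization the changelog should spell out: the
delayed refs rsv now sits in front of the global rsv in the release
path.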

> +
>  /**
>   * btrfs_inode_rsv_release - release any excessive reservation.
>   * @inode - the inode we need to release from.
> @@ -5723,7 +5848,6 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>  	u64 released = 0;
>  	u64 qgroup_to_release = 0;
> @@ -5733,8 +5857,8 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  	 * are releasing 0 bytes, and then we'll just get the reservation over
>  	 * the size free'd.
>  	 */
> -	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0,
> -					   &qgroup_to_release);
> +	released = __btrfs_block_rsv_release(fs_info, block_rsv, 0,
> +					     &qgroup_to_release);

This change requires a bit more explanation. Now, when releasing the
bytes for an inode, we execute the additional logic in
__btrfs_block_rsv_release to decide whether to refill the global rsv or
the delayed refs rsv, based on whether the delayed refs rsv is full.
Before, the global rsv was refilled with whatever the excess amount
was; now, if the delayed refs rsv is not full, it is refilled instead.

>  	if (released > 0)
>  		trace_btrfs_space_reservation(fs_info, "delalloc",
>  					      btrfs_ino(inode), released, 0);
> @@ -5745,16 +5869,26 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  						   qgroup_to_release);
>  }
>  
> -void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> -			     struct btrfs_block_rsv *block_rsv,
> -			     u64 num_bytes)
> +/**
> + * btrfs_delayed_refs_rsv_release - release a ref head's reservation.
> + * @fs_info - the fs_info for our fs.
> + * @nr - the number of items to drop.
> + *
> + * This drops the delayed ref head's count from the delayed refs rsv and frees
> + * any excess reservation we had.
> + */
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr)
>  {
> +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	u64 num_bytes = btrfs_calc_trans_metadata_size(fs_info, nr);
> +	u64 released = 0;
>  
> -	if (global_rsv == block_rsv ||
> -	    block_rsv->space_info != global_rsv->space_info)
> -		global_rsv = NULL;
> -	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes, NULL);
> +	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv,
> +					   num_bytes, NULL);
> +	if (released)
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, released, 0);
>  }
>  
>  static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
> @@ -5819,9 +5953,10 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	fs_info->trans_block_rsv.space_info = space_info;
>  	fs_info->empty_block_rsv.space_info = space_info;
>  	fs_info->delayed_block_rsv.space_info = space_info;
> +	fs_info->delayed_refs_rsv.space_info = space_info;
>  
> -	fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
> -	fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
> +	fs_info->extent_root->block_rsv = &fs_info->delayed_refs_rsv;
> +	fs_info->csum_root->block_rsv = &fs_info->delayed_refs_rsv;

nit: Some whitespace damage

>  	fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
>  	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
>  	if (fs_info->quota_root)

<snip>
