Linux-BTRFS Archive on lore.kernel.org
From: Dennis Zhou <dennis@kernel.org>
To: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>,
	Omar Sandoval <osandov@osandov.com>,
	kernel-team@fb.com, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
Date: Fri, 11 Oct 2019 12:15:49 -0400
Message-ID: <20191011161549.GB29672@dennisz-mbp>
In-Reply-To: <20191010134035.khr6ifhfu37afrav@MacBook-Pro-91.local>

On Thu, Oct 10, 2019 at 09:40:37AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 06:38:10PM -0400, Dennis Zhou wrote:
> > On Mon, Oct 07, 2019 at 04:37:28PM -0400, Josef Bacik wrote:
> > > On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > > > Async discard will use the free space cache as backing knowledge for
> > > > which extents to discard. This patch plumbs knowledge about which
> > > > extents need to be discarded into the free space cache from
> > > > unpin_extent_range().
> > > > 
> > > > An untrimmed extent can merge with everything as this is a new region.
> > > > Absorbing trimmed extents is a tradeoff for greater coalescing, which
> > > > makes life better for find_free_extent(). Additionally, it seems the
> > > > size of a trim isn't as problematic as the trim io itself.
> > > > 
> > > > When reading in the free space cache from disk, if sync is set, mark all
> > > > extents as trimmed. The current code ensures at transaction commit that
> > > > all free space is trimmed when sync is set, so this reflects that.
> > > > 
> > > > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > > > ---
> > > >  fs/btrfs/extent-tree.c      | 15 ++++++++++-----
> > > >  fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
> > > >  fs/btrfs/free-space-cache.h | 10 +++++++++-
> > > >  fs/btrfs/inode-map.c        | 13 +++++++------
> > > >  4 files changed, 57 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > > index 77a5904756c5..b9e3bedad878 100644
> > > > --- a/fs/btrfs/extent-tree.c
> > > > +++ b/fs/btrfs/extent-tree.c
> > > > @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
> > > >  }
> > > >  
> > > >  static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > > -			      u64 start, u64 end,
> > > > +			      u64 start, u64 end, u32 fsc_flags,
> > > >  			      const bool return_free_space)
> > > >  {
> > > >  	struct btrfs_block_group_cache *cache = NULL;
> > > > @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > >  		if (start < cache->last_byte_to_unpin) {
> > > >  			len = min(len, cache->last_byte_to_unpin - start);
> > > >  			if (return_free_space)
> > > > -				btrfs_add_free_space(cache, start, len);
> > > > +				__btrfs_add_free_space(fs_info,
> > > > +						       cache->free_space_ctl,
> > > > +						       start, len, fsc_flags);
> > > >  		}
> > > >  
> > > >  		start += len;
> > > > @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > > >  
> > > >  	while (!trans->aborted) {
> > > >  		struct extent_state *cached_state = NULL;
> > > > +		u32 fsc_flags = 0;
> > > >  
> > > >  		mutex_lock(&fs_info->unused_bg_unpin_mutex);
> > > >  		ret = find_first_extent_bit(unpin, 0, &start, &end,
> > > > @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
> > > >  			ret = btrfs_discard_extent(fs_info, start,
> > > >  						   end + 1 - start, NULL);
> > > > +			fsc_flags |= BTRFS_FSC_TRIMMED;
> > > > +		}
> > > >  
> > > >  		clear_extent_dirty(unpin, start, end, &cached_state);
> > > > -		unpin_extent_range(fs_info, start, end, true);
> > > > +		unpin_extent_range(fs_info, start, end, fsc_flags, true);
> > > >  		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> > > >  		free_extent_state(cached_state);
> > > >  		cond_resched();
> > > > @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
> > > >  int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > >  				   u64 start, u64 end)
> > > >  {
> > > > -	return unpin_extent_range(fs_info, start, end, false);
> > > > +	return unpin_extent_range(fs_info, start, end, 0, false);
> > > >  }
> > > >  
> > > >  /*
> > > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > > > index d54dcd0ab230..f119895292b8 100644
> > > > --- a/fs/btrfs/free-space-cache.c
> > > > +++ b/fs/btrfs/free-space-cache.c
> > > > @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
> > > >  			goto free_cache;
> > > >  		}
> > > >  
> > > > +		/*
> > > > +		 * Sync discard ensures that the free space cache is always
> > > > +		 * trimmed.  So when reading this in, the state should reflect
> > > > +		 * that.
> > > > +		 */
> > > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > > +			e->flags |= BTRFS_FSC_TRIMMED;
> > > > +
> > > >  		if (!e->bytes) {
> > > >  			kmem_cache_free(btrfs_free_space_cachep, e);
> > > >  			goto free_cache;
> > > > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	bool merged = false;
> > > >  	u64 offset = info->offset;
> > > >  	u64 bytes = info->bytes;
> > > > +	bool is_trimmed = btrfs_free_space_trimmed(info);
> > > >  
> > > >  	/*
> > > >  	 * first we want to see if there is free space adjacent to the range we
> > > > @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	else
> > > >  		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
> > > >  
> > > > -	if (right_info && !right_info->bitmap) {
> > > > +	if (right_info && !right_info->bitmap &&
> > > > +	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
> > > >  		if (update_stat)
> > > >  			unlink_free_space(ctl, right_info);
> > > >  		else
> > > > @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	}
> > > >  
> > > >  	if (left_info && !left_info->bitmap &&
> > > > -	    left_info->offset + left_info->bytes == offset) {
> > > > +	    left_info->offset + left_info->bytes == offset &&
> > > > +	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
> > > 
> > > So we allow merging if we haven't trimmed this entry, or if the adjacent entry
> > > is already trimmed?  This means we'll merge if we trimmed the new entry
> > > > regardless of the adjacent entry's status, or if the new entry is dirty.  Why is
> > > that?  Thanks,
> > > 
> > 
> > This is the tradeoff I called out above here:
> > 
> > > > Absorbing trimmed extents is a tradeoff for greater coalescing, which
> > > > makes life better for find_free_extent(). Additionally, it seems the
> > > > size of a trim isn't as problematic as the trim io itself.
> > 
> > A problematic example case:
> > 
> > |----trimmed----|/////X/////|-----trimmed-----|
> > 
> > If region X gets freed and returned to the free space cache, we end up
> > with the following:
> > 
> > |----trimmed----|-untrimmed-|-----trimmed-----|
> > 
> > This isn't great because now we need to teach find_free_extent() to span
> > multiple btrfs_free_space entries, something I didn't want to do. So the
> > other option is to overtrim, trading that for a simpler find_free_extent().
> > Then the above becomes:
> > 
> > |-------------------trimmed-------------------|
> > 
> > It makes the assumption that if we're inserting, it's generally free
> > space being returned rather than something we needed to slice out of the
> > middle of a block. It does still have degenerate cases, but it's better than
> > the above. The merging also allows for stuff to come out of bitmaps more
> > proactively too.
> > 
> > Also, from what I can tell, a discard operation is quite costly
> > relative to the amount you're discarding (1 larger discard is better than
> > several smaller discards), as it will clog up the device too.
> 
> 
> OOOOOh I fucking get it now.  That's going to need a comment, because it's not
> obvious at all.
> 
> However I still wonder if this is right.  Your above examples are legitimate,
> but say you have
> 
> | 512MiB adding back that isn't trimmed |------- 512MiB trimmed ------|
> 
> we'll merge these two, but really we should probably trim that 512MiB chunk
> we're adding, right?  Thanks,
> 

So that's the crux of the problem. I'm not sure it's right to build
heuristics around this and have merging thresholds, because that makes the
code trickier and not necessarily correct. A contrived case would be
one where we go through a few iterations of merging because we
pulled stuff out of the bitmaps, and that then let us merge more
free space. How do you know what the right balance is for merging extents?

I kind of favor the overeager approach for now because it is always
correct to rediscard regions, but forgetting about regions means they may
go undiscarded for some unbounded time in the future.  This also
makes life the easiest for find_free_extent().
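
To make the rule concrete, here is a minimal standalone sketch of the merge
check. It is not the patch: the struct, the flag, and every sketch_* name are
made up for illustration. Only the predicate mirrors the hunk quoted above,
and the sketch assumes, as the diff suggests, that the merged entry keeps the
flags of the entry being inserted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for BTRFS_FSC_TRIMMED and struct btrfs_free_space. */
#define SKETCH_FSC_TRIMMED (1u << 0)

struct sketch_free_space {
	uint64_t offset;
	uint64_t bytes;
	uint32_t flags;
};

static bool sketch_trimmed(const struct sketch_free_space *e)
{
	return (e->flags & SKETCH_FSC_TRIMMED) != 0;
}

/*
 * Same predicate as the quoted hunk: an untrimmed entry may absorb any
 * neighbor, a trimmed entry may only absorb trimmed neighbors.
 */
static bool sketch_can_merge(const struct sketch_free_space *info,
			     const struct sketch_free_space *neighbor)
{
	return !sketch_trimmed(info) || sketch_trimmed(neighbor);
}

/* Absorb a left-adjacent neighbor; info keeps its own flags (assumption). */
static bool sketch_merge_left(struct sketch_free_space *info,
			      const struct sketch_free_space *left)
{
	if (left->offset + left->bytes != info->offset)
		return false;
	if (!sketch_can_merge(info, left))
		return false;
	info->offset = left->offset;
	info->bytes += left->bytes;
	return true;
}

/* Absorb a right-adjacent neighbor; info keeps its own flags (assumption). */
static bool sketch_merge_right(struct sketch_free_space *info,
			       const struct sketch_free_space *right)
{
	if (info->offset + info->bytes != right->offset)
		return false;
	if (!sketch_can_merge(info, right))
		return false;
	info->bytes += right->bytes;
	return true;
}

/* |--trimmed--|//X//|--trimmed--| with X freed and returned untrimmed. */
static struct sketch_free_space sketch_overtrim_example(void)
{
	struct sketch_free_space left  = { 0,    4096, SKETCH_FSC_TRIMMED };
	struct sketch_free_space x     = { 4096, 4096, 0 };
	struct sketch_free_space right = { 8192, 4096, SKETCH_FSC_TRIMMED };
	bool ok;

	ok = sketch_merge_right(&x, &right) && sketch_merge_left(&x, &left);
	assert(ok);
	(void)ok;
	return x;	/* one untrimmed entry covering all three regions */
}
```

In this sketch the example ends up as a single untrimmed entry spanning all
three regions, so the whole range would be discarded again later. That is the
overtrim cost being traded for one entry that find_free_extent() can use. The
reverse case (a trimmed entry next to an untrimmed neighbor) is refused, so
untrimmed space is never silently marked as trimmed.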

As I said, I'm not sure what the right thing to do is, so I favored
being accurate.  This is something I'm happy to change depending on
discussion and on further data I collect.

I added a comment. I might need to make it more in-depth, but it's a
start (I'll revisit before v2).

Thanks,
Dennis
