From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 541A7ECE58C for ; Fri, 11 Oct 2019 16:52:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 1AC0C21835 for ; Fri, 11 Oct 2019 16:52:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1570812776; bh=C9cnmb5Gkmhq6feX/Oydu2MuiZ1T8mZlOXxgXfBVq8E=; h=Date:From:To:Cc:Subject:References:In-Reply-To:List-ID:From; b=Cm1LKtvARfdjboJiHQnziER7C9z/oEdQgToVEyQx8B94Aq1rd4ObW97YwB5uqj8hx 8v7AxAvnz4Upb2bVBWlPW7Cskf1q69zIdH28S46v49me0lx4cPzh+3w19/R+De+MoJ 7AB+AHm64xMQ1gwHTvd7LDe8eqUj6CEpsqL24rEg= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728437AbfJKQwz (ORCPT ); Fri, 11 Oct 2019 12:52:55 -0400 Received: from mail-qt1-f196.google.com ([209.85.160.196]:33978 "EHLO mail-qt1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727984AbfJKQwz (ORCPT ); Fri, 11 Oct 2019 12:52:55 -0400 Received: by mail-qt1-f196.google.com with SMTP id 3so14851044qta.1 for ; Fri, 11 Oct 2019 09:52:52 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=1kTrsX4npaKbH5SQ+3k/KzRPK1Mqej8DQr5Gz8ejLyw=; b=I4bQ4cSE2S6u5XXVP/ccDpCyRfjVe809g9GK+cdCnb7+dzNChGijfZkyBrdKwt3+aQ Ak9hy+WDGL609YglQKmHE8+vb3LUYGCmRx641Ix2yR/HqDakQbZTxTVZ9KNtyLpBLisc Xok4piwoRcSuhLbxfHsEVDo63uLsUWi2yJ8whCkH/+tzBVj1OYTkGzn2FOmQnhiWZiAj A6zxbeLDz7kdv58gSYQmMmKe/jqMyBQvMGLWoSfCy6eT17/et3AO9eU7eiox1Kk2y+dR FvkHkvIrRJMtsgEvZIgyhmJ+yQwVAnh0m245ktLiBMZnJ6mYKenV4JjeoWgXGTHkrNzy 2slQ== X-Gm-Message-State: APjAAAXAl3HoslPa68X4ytjvJU4w50OwZq+TmzxEGUJYzRSmzclYvaC2 pnBb0qmc6eTT0ljdIs0cCjc= X-Google-Smtp-Source: APXvYqxtnCWtPvba40b+lqFhhsDCn3NnVqV6pMZjJLiVQhqoKphrmuqUgO3SVG/u0+nRoHFyiiYjuA== X-Received: by 2002:ac8:b42:: with SMTP id m2mr17911432qti.174.1570812772199; Fri, 11 Oct 2019 09:52:52 -0700 (PDT) Received: from dennisz-mbp ([2620:10d:c091:500::2:985b]) by smtp.gmail.com with ESMTPSA id o14sm6435496qtk.52.2019.10.11.09.52.50 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Oct 2019 09:52:51 -0700 (PDT) Date: Fri, 11 Oct 2019 12:52:49 -0400 From: Dennis Zhou To: Josef Bacik Cc: David Sterba , Chris Mason , Omar Sandoval , kernel-team@fb.com, linux-btrfs@vger.kernel.org Subject: Re: [PATCH 06/19] btrfs: handle empty block_group removal Message-ID: <20191011165249.GD29672@dennisz-mbp> References: <20191010150041.ly725ebipb73rc7b@macbook-pro-91.dhcp.thefacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191010150041.ly725ebipb73rc7b@macbook-pro-91.dhcp.thefacebook.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Thu, Oct 10, 2019 at 11:00:42AM -0400, Josef Bacik wrote: > On Mon, Oct 07, 2019 at 04:17:37PM -0400, Dennis Zhou wrote: > > block_group removal is a little tricky. It can race with the extent > > allocator, the cleaner thread, and balancing. The current path is for a > > block_group to be added to the unused_bgs list. Then, when the cleaner > > thread comes around, it starts a transaction and then proceeds with > > removing the block_group. Extents that are pinned are subsequently > > removed from the pinned trees and then eventually a discard is issued > > for the entire block_group. > > > > Async discard introduces another player into the game, the discard > > workqueue. While it has none of the racing issues, the new problem is > > ensuring we don't leave free space untrimmed prior to forgetting the > > block_group. This is handled by placing fully free block_groups on a > > separate discard queue. This is necessary to maintain discarding order > > as in the future we will slowly trim even fully free block_groups. The > > ordering helps us make progress on the same block_group rather than say > > the last fully freed block_group or needing to search through the fully > > freed block groups at the beginning of a list and insert after. > > > > The new order of events is a fully freed block group gets placed on the > > discard queue first. Once it's processed, it will be placed on the > > unusued_bgs list and then the original sequence of events will happen, > > just without the final whole block_group discard. > > > > The mount flags can change when processing unused_bgs, so when flipping > > from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the > > discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt > > free block groups on the discard_list to the unused_bg queue which will > > do the final discard for us. > > > > Signed-off-by: Dennis Zhou > > --- > > fs/btrfs/block-group.c | 39 ++++++++++++++++++--- > > fs/btrfs/ctree.h | 2 +- > > fs/btrfs/discard.c | 68 ++++++++++++++++++++++++++++++++++++- > > fs/btrfs/discard.h | 11 +++++- > > fs/btrfs/free-space-cache.c | 33 ++++++++++++++++++ > > fs/btrfs/free-space-cache.h | 1 + > > fs/btrfs/scrub.c | 7 +++- > > 7 files changed, 153 insertions(+), 8 deletions(-) > > > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c > > index 8bbbe7488328..73e5a9384491 100644 > > --- a/fs/btrfs/block-group.c > > +++ b/fs/btrfs/block-group.c > > @@ -1251,6 +1251,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) > > struct btrfs_block_group_cache *block_group; > > struct btrfs_space_info *space_info; > > struct btrfs_trans_handle *trans; > > + bool async_trim_enabled = btrfs_test_opt(fs_info, DISCARD_ASYNC); > > int ret = 0; > > > > if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags)) > > @@ -1260,6 +1261,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) > > while (!list_empty(&fs_info->unused_bgs)) { > > u64 start, end; > > int trimming; > > + bool async_trimmed; > > > > block_group = list_first_entry(&fs_info->unused_bgs, > > struct btrfs_block_group_cache, > > @@ -1281,10 +1283,20 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) > > /* Don't want to race with allocators so take the groups_sem */ > > down_write(&space_info->groups_sem); > > spin_lock(&block_group->lock); > > + > > + /* async discard requires block groups to be fully trimmed */ > > + async_trimmed = (!btrfs_test_opt(fs_info, DISCARD_ASYNC) || > > + btrfs_is_free_space_trimmed(block_group)); > > + > > if (block_group->reserved || block_group->pinned || > > btrfs_block_group_used(&block_group->item) || > > block_group->ro || > > - list_is_singular(&block_group->list)) { > > + list_is_singular(&block_group->list) || > > + !async_trimmed) { > > + /* requeue if we failed because of async discard */ > > + if (!async_trimmed) > > + btrfs_discard_queue_work(&fs_info->discard_ctl, > > + block_group); > > /* > > * We want to bail if we made new allocations or have > > * outstanding allocations in this block group. We do > > @@ -1367,6 +1379,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) > > spin_unlock(&block_group->lock); > > spin_unlock(&space_info->lock); > > > > + if (!async_trim_enabled && > > + btrfs_test_opt(fs_info, DISCARD_ASYNC)) > > + goto flip_async; > > + > > This took me a minute to grok, please add a comment indicating that this is > meant to catch the case that we flipped from no async to async and thus need to > kick off the async trim work now before removing the unused bg. > Added a comment explaining what's going on here. > > /* DISCARD can flip during remount */ > > trimming = btrfs_test_opt(fs_info, DISCARD_SYNC); > > > > @@ -1411,6 +1427,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) > > spin_lock(&fs_info->unused_bgs_lock); > > } > > spin_unlock(&fs_info->unused_bgs_lock); > > + return; > > + > > +flip_async: > > + btrfs_end_transaction(trans); > > + mutex_unlock(&fs_info->delete_unused_bgs_mutex); > > + btrfs_put_block_group(block_group); > > + btrfs_discard_punt_unused_bgs_list(fs_info); > > } > > > > void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg) > > @@ -1618,6 +1641,8 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache( > > cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start); > > set_free_space_tree_thresholds(cache); > > > > + cache->discard_index = 1; > > + > > atomic_set(&cache->count, 1); > > spin_lock_init(&cache->lock); > > init_rwsem(&cache->data_rwsem); > > @@ -1829,7 +1854,11 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) > > inc_block_group_ro(cache, 1); > > } else if (btrfs_block_group_used(&cache->item) == 0) { > > ASSERT(list_empty(&cache->bg_list)); > > - btrfs_mark_bg_unused(cache); > > + if (btrfs_test_opt(info, DISCARD_ASYNC)) > > + btrfs_add_to_discard_free_list( > > + &info->discard_ctl, cache); > > + else > > + btrfs_mark_bg_unused(cache); > > } > > } > > > > @@ -2724,8 +2753,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans, > > * dirty list to avoid races between cleaner kthread and space > > * cache writeout. > > */ > > - if (!alloc && old_val == 0) > > - btrfs_mark_bg_unused(cache); > > + if (!alloc && old_val == 0) { > > + if (!btrfs_test_opt(info, DISCARD_ASYNC)) > > + btrfs_mark_bg_unused(cache); > > + } > > > > btrfs_put_block_group(cache); > > total -= num_bytes; > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 419445868909..c328d2e85e4d 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree { > > }; > > > > /* discard control */ > > -#define BTRFS_NR_DISCARD_LISTS 1 > > +#define BTRFS_NR_DISCARD_LISTS 2 > > > > struct btrfs_discard_ctl { > > struct workqueue_struct *discard_workers; > > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c > > index 6df124639e55..fb92b888774d 100644 > > --- a/fs/btrfs/discard.c > > +++ b/fs/btrfs/discard.c > > @@ -29,8 +29,11 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, > > > > spin_lock(&discard_ctl->lock); > > > > - if (list_empty(&cache->discard_list)) > > + if (list_empty(&cache->discard_list) || !cache->discard_index) { > > + if (!cache->discard_index) > > + cache->discard_index = 1; > > Need a #define for this so it's clear what our intention is, I hate magic > numbers. > Sounds good, done. > > cache->discard_delay = now + BTRFS_DISCARD_DELAY; > > + } > > > > list_move_tail(&cache->discard_list, > > btrfs_get_discard_list(discard_ctl, cache)); > > @@ -38,6 +41,23 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, > > spin_unlock(&discard_ctl->lock); > > } > > > > +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl, > > + struct btrfs_block_group_cache *cache) > > +{ > > + u64 now = ktime_get_ns(); > > + > > + spin_lock(&discard_ctl->lock); > > + > > + if (!list_empty(&cache->discard_list)) > > + list_del_init(&cache->discard_list); > > + > > + cache->discard_index = 0; > > + cache->discard_delay = now; > > + list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]); > > + > > + spin_unlock(&discard_ctl->lock); > > +} > > + > > static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl, > > struct btrfs_block_group_cache *cache) > > { > > @@ -161,10 +181,52 @@ static void btrfs_discard_workfn(struct work_struct *work) > > btrfs_block_group_end(cache), 0); > > > > remove_from_discard_list(discard_ctl, cache); > > + if (btrfs_is_free_space_trimmed(cache)) > > + btrfs_mark_bg_unused(cache); > > + else if (cache->free_space_ctl->free_space == cache->key.offset) > > + btrfs_add_to_discard_free_list(discard_ctl, cache); > > This needs to be > > else if (btrfs_block_group_used(cache) == 0) > > because we just exclude super mirrors from the free space cache, so completely > empty free_space != cache->key.offset. > Ah cool I didn't know that. This is fixed now. > > > > btrfs_discard_schedule_work(discard_ctl, false); > > } > > > > +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info) > > +{ > > + struct btrfs_block_group_cache *cache, *next; > > + > > + /* we enabled async discard, so punt all to the queue */ > > + spin_lock(&fs_info->unused_bgs_lock); > > + > > + list_for_each_entry_safe(cache, next, &fs_info->unused_bgs, bg_list) { > > + list_del_init(&cache->bg_list); > > + btrfs_add_to_discard_free_list(&fs_info->discard_ctl, cache); > > + } > > + > > + spin_unlock(&fs_info->unused_bgs_lock); > > +} > > + > > +static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl) > > +{ > > + struct btrfs_block_group_cache *cache, *next; > > + int i; > > + > > + spin_lock(&discard_ctl->lock); > > + > > + for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) { > > + list_for_each_entry_safe(cache, next, > > + &discard_ctl->discard_list[i], > > + discard_list) { > > + list_del_init(&cache->discard_list); > > + spin_unlock(&discard_ctl->lock); > > + if (cache->free_space_ctl->free_space == > > + cache->key.offset) > > + btrfs_mark_bg_unused(cache); > > + spin_lock(&discard_ctl->lock); > > + } > > + } > > + > > + spin_unlock(&discard_ctl->lock); > > +} > > + > > void btrfs_discard_resume(struct btrfs_fs_info *fs_info) > > { > > if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) { > > @@ -172,6 +234,8 @@ void btrfs_discard_resume(struct btrfs_fs_info *fs_info) > > return; > > } > > > > + btrfs_discard_punt_unused_bgs_list(fs_info); > > + > > set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags); > > } > > > > @@ -197,4 +261,6 @@ void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info) > > { > > btrfs_discard_stop(fs_info); > > cancel_delayed_work_sync(&fs_info->discard_ctl.work); > > + > > + btrfs_discard_purge_list(&fs_info->discard_ctl); > > } > > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h > > index 6d7805bb0eb7..55f79b624943 100644 > > --- a/fs/btrfs/discard.h > > +++ b/fs/btrfs/discard.h > > @@ -10,9 +10,14 @@ > > #include > > > > #include "ctree.h" > > +#include "block-group.h" > > +#include "free-space-cache.h" > > > > void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, > > struct btrfs_block_group_cache *cache); > > +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl, > > + struct btrfs_block_group_cache *cache); > > +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info); > > > > void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl, > > struct btrfs_block_group_cache *cache); > > @@ -41,7 +46,11 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl, > > if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC)) > > return; > > > > - btrfs_add_to_discard_list(discard_ctl, cache); > > + if (cache->free_space_ctl->free_space == cache->key.offset) > > + btrfs_add_to_discard_free_list(discard_ctl, cache); > > Same here. > > > + else > > + btrfs_add_to_discard_list(discard_ctl, cache); > > + > > if (!delayed_work_pending(&discard_ctl->work)) > > btrfs_discard_schedule_work(discard_ctl, false); > > } > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > > index 54ff1bc97777..ed0e7ee4c78d 100644 > > --- a/fs/btrfs/free-space-cache.c > > +++ b/fs/btrfs/free-space-cache.c > > @@ -2653,6 +2653,31 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group) > > > > } > > > > +bool btrfs_is_free_space_trimmed(struct btrfs_block_group_cache *cache) > > +{ > > + struct btrfs_free_space_ctl *ctl = cache->free_space_ctl; > > + struct btrfs_free_space *info; > > + struct rb_node *node; > > + bool ret = true; > > + > > + spin_lock(&ctl->tree_lock); > > + node = rb_first(&ctl->free_space_offset); > > + > > + while (node) { > > + info = rb_entry(node, struct btrfs_free_space, offset_index); > > + > > + if (!btrfs_free_space_trimmed(info)) { > > + ret = false; > > + break; > > + } > > + > > + node = rb_next(node); > > + } > > + > > + spin_unlock(&ctl->tree_lock); > > + return ret; > > +} > > + > > u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group, > > u64 offset, u64 bytes, u64 empty_size, > > u64 *max_extent_size) > > @@ -2739,6 +2764,9 @@ int btrfs_return_cluster_to_free_space( > > ret = __btrfs_return_cluster_to_free_space(block_group, cluster); > > spin_unlock(&ctl->tree_lock); > > > > + btrfs_discard_queue_work(&block_group->fs_info->discard_ctl, > > + block_group); > > + > > /* finally drop our ref */ > > btrfs_put_block_group(block_group); > > return ret; > > @@ -3097,6 +3125,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group, > > u64 min_bytes; > > u64 cont1_bytes; > > int ret; > > + bool found_cluster = false; > > > > /* > > * Choose the minimum extent size we'll require for this > > @@ -3149,6 +3178,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group, > > list_del_init(&entry->list); > > > > if (!ret) { > > + found_cluster = true; > > atomic_inc(&block_group->count); > > list_add_tail(&cluster->block_group_list, > > &block_group->cluster_list); > > @@ -3160,6 +3190,9 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group, > > spin_unlock(&cluster->lock); > > spin_unlock(&ctl->tree_lock); > > > > + if (found_cluster) > > + btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group); > > + > > return ret; > > } > > This bit here seems unrelated to the rest of the change. Why do we want to > cancel the discard work if we find a cluster? And what does it have to do with > unused bgs? If it's the allocation part that makes you want to cancel then that > sort of makes sense in the unused bg context, but this is happening no matter > what. It should probably be in its own patch with an explanation, or at least > in a different patch. Thanks, > > Josef I think this got lost in my rebasing over time. I'll move this or axe it, I'm not sure yet. The idea is later on we don't care about metadata block groups and they are the only ones that use clustering. In that case don't worry about fully trimming as the only way it'll start trimming is if it's a fully free metadata block group. Thanks, Dennis