From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C11FC433E0 for ; Mon, 4 Jan 2021 23:36:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2C3642251E for ; Mon, 4 Jan 2021 23:36:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726791AbhADXgU (ORCPT ); Mon, 4 Jan 2021 18:36:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726258AbhADXgT (ORCPT ); Mon, 4 Jan 2021 18:36:19 -0500 Received: from mail-qk1-x733.google.com (mail-qk1-x733.google.com [IPv6:2607:f8b0:4864:20::733]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 67A13C061574 for ; Mon, 4 Jan 2021 15:35:39 -0800 (PST) Received: by mail-qk1-x733.google.com with SMTP id v126so25035585qkd.11 for ; Mon, 04 Jan 2021 15:35:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=9kaO32pPDJ3OEOTjnPmN0nz67kfmlGnTac+B0s30JRE=; b=OaVvia2lpd+WEWgBuAOI5DAB/ZC87OuK/epDCG6U5bbynnY+3RK0Zk/7AO868O55se KavCTLZ2jYkILJfA5Piiq2KZY0wA7HGq4iDASF040QaND9Z2on1JXJmk2/vpnNebcXVa Hg7t5wjUt18UQk6s9IF79uhe1jFGbE86EFdzQQ5IQN2idNcEc/QSB1lNx6TYypNyKVWy yq4wKjRO2wFtojZYtO4z5Vgr/cFFL61pGburTGKXoGmRbjyCcVXJc5psWzFZUZzgVqQE xf4beN967pL9dzOo8Yi83OKWtG6dK1WAXY//U01EEwYcu1Tz6S9QjsrQYc0DvIBEgqx0 fRwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=9kaO32pPDJ3OEOTjnPmN0nz67kfmlGnTac+B0s30JRE=; b=C1ZJOX9Jocshh5ZeoufiiCxouSNAJOcCkz0ORGpz2ttWtsO9foL9+iClp2ESJAFWJl J7BCO0qZVpdpQWb7BJOUk1iidLwBpfN97n7zn+zXwRCrnQhlH8y/wAPtt7THNl2vQGuQ fShaC1Pdo5Z0GY+PUQ4OUiolYjR8/Bltryse/DO2hr4DBJK20DsqAE6BAxV2nRPkmR6I M1IVCAZdR5PUWQkaJ2G/0ojYbilRObNdnKgDv9DdPHfgpSC8RwCQCO93VtXP/Lpmd8DX /0m0enAf5/pSprAnQUTyUNtKywPlraCQr0rJ4EzdPUsDFvHEbKdgODMDrO4UT3R5nbLr ptSQ== X-Gm-Message-State: AOAM531EMtnW6WC8/IraFJu0xquTPFNQNycw+b5uPs+TWHLqzMFw0+V2 MwbCG7hsn6xvmvp95SRkf14G4wX8Lb+a66dO X-Google-Smtp-Source: ABdhPJxx+oswHXDBXkAbtm2DyIN+wXBomq9xkiTZBu0iQfw9SZObvJhVYqoxDd2odh2OXogJFZax1Q== X-Received: by 2002:ac8:5b94:: with SMTP id a20mr72401649qta.366.1609792791902; Mon, 04 Jan 2021 12:39:51 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id f24sm37521886qkk.89.2021.01.04.12.39.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Jan 2021 12:39:51 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Cc: stable@vger.kernel.org, =?UTF-8?q?Ren=C3=A9=20Rebe?= Subject: [PATCH v2] btrfs: shrink delalloc pages instead of full inodes Date: Mon, 4 Jan 2021 15:39:50 -0500 Message-Id: <34d0a7e07cf75ca4b44cf54693cd74477649fb88.1609792754.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing some infrastructure we have in place to flush inodes that we use for device replace and snapshot. However this introduced a pretty serious performance regression. To reproduce the user untarred the source tarball of Firefox, and would see it take anywhere from 5 to 20 times as long to untar in 5.10 compared to 5.9. The root cause is because before we would generally use the normal writeback path to reclaim delalloc space, and for this we would provide it with the number of pages we wanted to flush. The referenced commit changed this to flush that many inodes, which drastically increased the amount of space we were flushing in certain cases, which severely affected performance. We cannot revert this patch unfortunately, because Filipe has another fix that requires the ability to skip flushing inodes that are being cloned in certain scenarios, which means we need to keep using our flushing infrastructure or risk re-introducing the deadlock. Instead to fix this problem we can go back to providing btrfs_start_delalloc_roots with a number of pages to flush, and then set up a writeback_control and utilize sync_inode() to handle the flushing for us. This gives us the same behavior we had prior to the fix, while still allowing us to avoid the deadlock that was fixed by Filipe. I redid the users original test and got the following results 5.9 0m54.258s 5.10 1m26.212s Patched 0m38.800s We are significantly faster because of the work I did around improving ENOSPC flushing in 5.10 and 5.11, so reverting to the previous write out behavior gave us a pretty big boost. CC: stable@vger.kernel.org # 5.10 Reported-by: René Rebe Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") Signed-off-by: Josef Bacik --- v1->v2: - Explicitly state what the regression was in the commit message. fs/btrfs/inode.c | 60 +++++++++++++++++++++++++++++++------------ fs/btrfs/space-info.c | 4 ++- 2 files changed, 46 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 070716650df8..a8e0a6b038d3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9390,7 +9390,8 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode * some fairly slow code that needs optimization. This walks the list * of all the inodes with pending delalloc and forces them to disk. */ -static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot, +static int start_delalloc_inodes(struct btrfs_root *root, + struct writeback_control *wbc, bool snapshot, bool in_reclaim_context) { struct btrfs_inode *binode; @@ -9399,6 +9400,7 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot struct list_head works; struct list_head splice; int ret = 0; + bool full_flush = wbc->nr_to_write == LONG_MAX; INIT_LIST_HEAD(&works); INIT_LIST_HEAD(&splice); @@ -9427,18 +9429,24 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot if (snapshot) set_bit(BTRFS_INODE_SNAPSHOT_FLUSH, &binode->runtime_flags); - work = btrfs_alloc_delalloc_work(inode); - if (!work) { - iput(inode); - ret = -ENOMEM; - goto out; - } - list_add_tail(&work->list, &works); - btrfs_queue_work(root->fs_info->flush_workers, - &work->work); - if (*nr != U64_MAX) { - (*nr)--; - if (*nr == 0) + if (full_flush) { + work = btrfs_alloc_delalloc_work(inode); + if (!work) { + iput(inode); + ret = -ENOMEM; + goto out; + } + list_add_tail(&work->list, &works); + btrfs_queue_work(root->fs_info->flush_workers, + &work->work); + } else { + ret = sync_inode(inode, wbc); + if (!ret && + test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, + &BTRFS_I(inode)->runtime_flags)) + ret = sync_inode(inode, wbc); + btrfs_add_delayed_iput(inode); + if (ret || wbc->nr_to_write <= 0) goto out; } cond_resched(); @@ -9464,18 +9472,29 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot int btrfs_start_delalloc_snapshot(struct btrfs_root *root) { + struct writeback_control wbc = { + .nr_to_write = LONG_MAX, + .sync_mode = WB_SYNC_NONE, + .range_start = 0, + .range_end = LLONG_MAX, + }; struct btrfs_fs_info *fs_info = root->fs_info; - u64 nr = U64_MAX; if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state)) return -EROFS; - return start_delalloc_inodes(root, &nr, true, false); + return start_delalloc_inodes(root, &wbc, true, false); } int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr, bool in_reclaim_context) { + struct writeback_control wbc = { + .nr_to_write = (nr == U64_MAX) ? LONG_MAX : (unsigned long)nr, + .sync_mode = WB_SYNC_NONE, + .range_start = 0, + .range_end = LLONG_MAX, + }; struct btrfs_root *root; struct list_head splice; int ret; @@ -9489,6 +9508,13 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr, spin_lock(&fs_info->delalloc_root_lock); list_splice_init(&fs_info->delalloc_roots, &splice); while (!list_empty(&splice) && nr) { + /* + * Reset nr_to_write here so we know that we're doing a full + * flush. + */ + if (nr == U64_MAX) + wbc.nr_to_write = LONG_MAX; + root = list_first_entry(&splice, struct btrfs_root, delalloc_root); root = btrfs_grab_root(root); @@ -9497,9 +9523,9 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr, &fs_info->delalloc_roots); spin_unlock(&fs_info->delalloc_root_lock); - ret = start_delalloc_inodes(root, &nr, false, in_reclaim_context); + ret = start_delalloc_inodes(root, &wbc, false, in_reclaim_context); btrfs_put_root(root); - if (ret < 0) + if (ret < 0 || wbc.nr_to_write <= 0) goto out; spin_lock(&fs_info->delalloc_root_lock); } diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index 67e55c5479b8..e8347461c8dd 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -532,7 +532,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, loops = 0; while ((delalloc_bytes || dio_bytes) && loops < 3) { - btrfs_start_delalloc_roots(fs_info, items, true); + u64 nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT; + + btrfs_start_delalloc_roots(fs_info, nr_pages, true); loops++; if (wait_ordered && !trans) { -- 2.26.2