stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Josef Bacik <josef@toxicpanda.com>,
	Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>,
	clm@fb.com, linux-btrfs@vger.kernel.org, bpf@vger.kernel.org
Subject: [PATCH AUTOSEL 6.5 13/41] btrfs: do not block starts waiting on previous transaction commit
Date: Sun, 24 Sep 2023 09:15:01 -0400	[thread overview]
Message-ID: <20230924131529.1275335-13-sashal@kernel.org> (raw)
In-Reply-To: <20230924131529.1275335-1-sashal@kernel.org>

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 77d20c685b6baeb942606a93ed861c191381b73e ]

Internally I got a report of very long stalls on normal operations like
creating a new file when auto relocation was running.  The reporter used
the 'bpf offcputime' tracer to show that we would get stuck in
start_transaction for 5 to 30 seconds, and were always being woken up by
the transaction commit.

Using my timing-everything script, which times how long a function takes
and what percentage of that total time is taken up by its children, I
saw several traces like this

1083 took 32812902424 ns
        29929002926 ns 91.2110% wait_for_commit_duration
        25568 ns 7.7920e-05% commit_fs_roots_duration
        1007751 ns 0.00307% commit_cowonly_roots_duration
        446855602 ns 1.36182% btrfs_run_delayed_refs_duration
        271980 ns 0.00082% btrfs_run_delayed_items_duration
        2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
        9656 ns 2.9427e-05% switch_commit_roots_duration
        1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
        4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration

Here I was only tracing functions that happen where we are between
START_COMMIT and UNBLOCKED in order to see what would be keeping us
blocked for so long.  The wait_for_commit() we do is where we wait for a
previous transaction that hasn't completed it's commit.  This can
include all of the unpin work and other cleanups, which tends to be the
longest part of our transaction commit.

There is no reason we should be blocking new things from entering the
transaction at this point, it just adds to random latency spikes for no
reason.

Fix this by adding a PREP stage.  This allows us to properly deal with
multiple committers coming in at the same time, we retain the behavior
that the winner waits on the previous transaction and the losers all
wait for this transaction commit to occur.  Nothing else is blocked
during the PREP stage, and then once the wait is complete we switch to
COMMIT_START and all of the same behavior as before is maintained.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/disk-io.c     |  8 ++++----
 fs/btrfs/locking.h     |  2 +-
 fs/btrfs/transaction.c | 39 ++++++++++++++++++++++++---------------
 fs/btrfs/transaction.h |  1 +
 4 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4494883a19abc..93554f8aef8cf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1552,7 +1552,7 @@ static int transaction_kthread(void *arg)
 
 		delta = ktime_get_seconds() - cur->start_time;
 		if (!test_and_clear_bit(BTRFS_FS_COMMIT_TRANS, &fs_info->flags) &&
-		    cur->state < TRANS_STATE_COMMIT_START &&
+		    cur->state < TRANS_STATE_COMMIT_PREP &&
 		    delta < fs_info->commit_interval) {
 			spin_unlock(&fs_info->trans_lock);
 			delay -= msecs_to_jiffies((delta - 1) * 1000);
@@ -2689,8 +2689,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_lockdep_init_map(fs_info, btrfs_trans_num_extwriters);
 	btrfs_lockdep_init_map(fs_info, btrfs_trans_pending_ordered);
 	btrfs_lockdep_init_map(fs_info, btrfs_ordered_extent);
-	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_start,
-				     BTRFS_LOCKDEP_TRANS_COMMIT_START);
+	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_prep,
+				     BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_unblocked,
 				     BTRFS_LOCKDEP_TRANS_UNBLOCKED);
 	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_super_committed,
@@ -4892,7 +4892,7 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
 	while (!list_empty(&fs_info->trans_list)) {
 		t = list_first_entry(&fs_info->trans_list,
 				     struct btrfs_transaction, list);
-		if (t->state >= TRANS_STATE_COMMIT_START) {
+		if (t->state >= TRANS_STATE_COMMIT_PREP) {
 			refcount_inc(&t->use_count);
 			spin_unlock(&fs_info->trans_lock);
 			btrfs_wait_for_commit(fs_info, t->transid);
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index edb9b4a0dba15..7d6ee1e609bf2 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -79,7 +79,7 @@ enum btrfs_lock_nesting {
 };
 
 enum btrfs_lockdep_trans_states {
-	BTRFS_LOCKDEP_TRANS_COMMIT_START,
+	BTRFS_LOCKDEP_TRANS_COMMIT_PREP,
 	BTRFS_LOCKDEP_TRANS_UNBLOCKED,
 	BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED,
 	BTRFS_LOCKDEP_TRANS_COMPLETED,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5bbd288b9cb54..942cfa906ed48 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -56,12 +56,17 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
  * |  Call btrfs_commit_transaction() on any trans handle attached to
  * |  transaction N
  * V
- * Transaction N [[TRANS_STATE_COMMIT_START]]
+ * Transaction N [[TRANS_STATE_COMMIT_PREP]]
+ * |
+ * | If there are simultaneous calls to btrfs_commit_transaction() one will win
+ * | the race and the rest will wait for the winner to commit the transaction.
+ * |
+ * | The winner will wait for previous running transaction to completely finish
+ * | if there is one.
  * |
- * | Will wait for previous running transaction to completely finish if there
- * | is one
+ * Transaction N [[TRANS_STATE_COMMIT_START]]
  * |
- * | Then one of the following happes:
+ * | Then one of the following happens:
  * | - Wait for all other trans handle holders to release.
  * |   The btrfs_commit_transaction() caller will do the commit work.
  * | - Wait for current transaction to be committed by others.
@@ -112,6 +117,7 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
  */
 static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
 	[TRANS_STATE_RUNNING]		= 0U,
+	[TRANS_STATE_COMMIT_PREP]	= 0U,
 	[TRANS_STATE_COMMIT_START]	= (__TRANS_START | __TRANS_ATTACH),
 	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_START |
 					   __TRANS_ATTACH |
@@ -1980,7 +1986,7 @@ void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans)
 	 * Wait for the current transaction commit to start and block
 	 * subsequent transaction joins
 	 */
-	btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
+	btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 	wait_event(fs_info->transaction_blocked_wait,
 		   cur_trans->state >= TRANS_STATE_COMMIT_START ||
 		   TRANS_ABORTED(cur_trans));
@@ -2127,7 +2133,7 @@ static void add_pending_snapshot(struct btrfs_trans_handle *trans)
 		return;
 
 	lockdep_assert_held(&trans->fs_info->trans_lock);
-	ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_START);
+	ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_PREP);
 
 	list_add(&trans->pending_snapshot->list, &cur_trans->pending_snapshots);
 }
@@ -2151,7 +2157,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	ktime_t interval;
 
 	ASSERT(refcount_read(&trans->use_count) == 1);
-	btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
+	btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 
 	clear_bit(BTRFS_FS_NEED_TRANS_COMMIT, &fs_info->flags);
 
@@ -2211,7 +2217,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	}
 
 	spin_lock(&fs_info->trans_lock);
-	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+	if (cur_trans->state >= TRANS_STATE_COMMIT_PREP) {
 		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
 
 		add_pending_snapshot(trans);
@@ -2223,7 +2229,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 			want_state = TRANS_STATE_SUPER_COMMITTED;
 
 		btrfs_trans_state_lockdep_release(fs_info,
-						  BTRFS_LOCKDEP_TRANS_COMMIT_START);
+						  BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 		ret = btrfs_end_transaction(trans);
 		wait_for_commit(cur_trans, want_state);
 
@@ -2235,9 +2241,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		return ret;
 	}
 
-	cur_trans->state = TRANS_STATE_COMMIT_START;
+	cur_trans->state = TRANS_STATE_COMMIT_PREP;
 	wake_up(&fs_info->transaction_blocked_wait);
-	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
+	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 
 	if (cur_trans->list.prev != &fs_info->trans_list) {
 		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
@@ -2258,11 +2264,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 			btrfs_put_transaction(prev_trans);
 			if (ret)
 				goto lockdep_release;
-		} else {
-			spin_unlock(&fs_info->trans_lock);
+			spin_lock(&fs_info->trans_lock);
 		}
 	} else {
-		spin_unlock(&fs_info->trans_lock);
 		/*
 		 * The previous transaction was aborted and was already removed
 		 * from the list of transactions at fs_info->trans_list. So we
@@ -2270,11 +2274,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		 * corrupt state (pointing to trees with unwritten nodes/leafs).
 		 */
 		if (BTRFS_FS_ERROR(fs_info)) {
+			spin_unlock(&fs_info->trans_lock);
 			ret = -EROFS;
 			goto lockdep_release;
 		}
 	}
 
+	cur_trans->state = TRANS_STATE_COMMIT_START;
+	wake_up(&fs_info->transaction_blocked_wait);
+	spin_unlock(&fs_info->trans_lock);
+
 	/*
 	 * Get the time spent on the work done by the commit thread and not
 	 * the time spent waiting on a previous commit
@@ -2584,7 +2593,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	goto cleanup_transaction;
 
 lockdep_trans_commit_start_release:
-	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
+	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
 	btrfs_end_transaction(trans);
 	return ret;
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 8e9fa23bd7fed..6b309f8a99a86 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -14,6 +14,7 @@
 
 enum btrfs_trans_state {
 	TRANS_STATE_RUNNING,
+	TRANS_STATE_COMMIT_PREP,
 	TRANS_STATE_COMMIT_START,
 	TRANS_STATE_COMMIT_DOING,
 	TRANS_STATE_UNBLOCKED,
-- 
2.40.1


  parent reply	other threads:[~2023-09-24 13:16 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-24 13:14 [PATCH AUTOSEL 6.5 01/41] nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid() Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 02/41] parisc: sba: Fix compile warning wrt list of SBA devices Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 03/41] parisc: sba-iommu: Fix sparse warnigs Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 04/41] parisc: ccio-dma: Fix sparse warnings Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 05/41] parisc: iosapic.c: " Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 06/41] parisc: drivers: Fix sparse warning Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 07/41] parisc: irq: Make irq_stack_union static to avoid " Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 08/41] scsi: qedf: Add synchronization between I/O completions and abort Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 09/41] scsi: ufs: core: Move __ufshcd_send_uic_cmd() outside host_lock Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 10/41] scsi: ufs: core: Poll HCS.UCRDY before issuing a UIC command Sasha Levin
2023-09-24 13:14 ` [PATCH AUTOSEL 6.5 11/41] selftests/ftrace: Correctly enable event in instance-event.tc Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 12/41] ring-buffer: Avoid softlockup in ring_buffer_resize() Sasha Levin
2023-09-24 13:15 ` Sasha Levin [this message]
2023-09-25 13:01   ` [PATCH AUTOSEL 6.5 13/41] btrfs: do not block starts waiting on previous transaction commit David Sterba
2023-09-25 17:47     ` Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 14/41] btrfs: improve error message after failure to add delayed dir index item Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 15/41] btrfs: assert delayed node locked when removing delayed item Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 16/41] selftests: fix dependency checker script Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 17/41] ring-buffer: Do not attempt to read past "commit" Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 18/41] net/smc: bugfix for smcr v2 server connect success statistic Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 19/41] ata: sata_mv: Fix incorrect string length computation in mv_dump_mem() Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 20/41] efi/x86: Ensure that EFI_RUNTIME_MAP is enabled for kexec Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 21/41] platform/mellanox: mlxbf-bootctl: add NET dependency into Kconfig Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 22/41] platform/x86: asus-wmi: Support 2023 ROG X16 tablet mode Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 23/41] thermal/of: add missing of_node_put() Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 24/41] drm/amdgpu: Store CU info from all XCCs for GFX v9.4.3 Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 25/41] drm/amdkfd: Update cache info reporting " Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 26/41] drm/amdkfd: Update CU masking for GFX 9.4.3 Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 27/41] drm/amd/display: Add dirty rect support for Replay Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 28/41] drm/amd/display: Don't check registers, if using AUX BL control Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 29/41] drm/amdgpu/soc21: don't remap HDP registers for SR-IOV Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 30/41] drm/amdgpu/nbio4.3: set proper rmmio_remap.reg_offset " Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 31/41] drm/amdgpu: fallback to old RAS error message for aqua_vanjaram Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 32/41] drm/amdkfd: Checkpoint and restore queues on GFX11 Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 33/41] drm/amdgpu: Handle null atom context in VBIOS info ioctl Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 34/41] objtool: Fix _THIS_IP_ detection for cold functions Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 35/41] nvme-pci: do not set the NUMA node of device if it has none Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 36/41] riscv: errata: fix T-Head dcache.cva encoding Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 37/41] scsi: pm80xx: Use phy-specific SAS address when sending PHY_START command Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 38/41] scsi: pm80xx: Avoid leaking tags when processing OPC_INB_SET_CONTROLLER_CONFIG command Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 39/41] smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 40/41] ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset() Sasha Levin
2023-09-24 13:15 ` [PATCH AUTOSEL 6.5 41/41] ata: libata-eh: do not thaw the port twice " Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230924131529.1275335-13-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).