linux-kernel.vger.kernel.org archive mirror
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Eric Whitney <enwlinux@gmail.com>, Theodore Ts'o <tytso@mit.edu>,
	Sasha Levin <sashal@kernel.org>,
	linux-ext4@vger.kernel.org
Subject: [PATCH AUTOSEL 5.10 02/39] ext4: shrink race window in ext4_should_retry_alloc()
Date: Thu, 25 Mar 2021 07:25:21 -0400
Message-ID: <20210325112558.1927423-2-sashal@kernel.org>
In-Reply-To: <20210325112558.1927423-1-sashal@kernel.org>

From: Eric Whitney <enwlinux@gmail.com>

[ Upstream commit efc61345274d6c7a46a0570efbc916fcbe3e927b ]

When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
fails at significant rates on the two test scenarios that disable
delayed allocation (ext3conv and data_journal) and force actual block
allocation for the fallocate and pwrite functions in the test.  The
failure rate on 5.10 for both ext3conv and data_journal on one test
system typically runs about 85%.  On 5.11, the failure rate on ext3conv
sometimes drops to as low as 1% while the rate on data_journal
increases to nearly 100%.

The observed failures are largely due to ext4_should_retry_alloc()
cutting off block allocation retries when s_mb_free_pending (used to
indicate that a transaction in progress will free blocks) is 0.
However, free space is usually available when this occurs during runs
of generic/371.  It appears that a thread attempting to allocate
blocks is just missing transaction commits in other threads that
increase the free cluster count and reset s_mb_free_pending while
the allocating thread isn't running.  Explicitly testing for free space
availability avoids this race.
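
As an illustration only, the difference in behavior can be modeled in
user space.  The sketch below is not kernel code: struct fs_state and
both function names are hypothetical stand-ins, with has_free_clusters
standing in for ext4_has_free_clusters() and free_pending for
s_mb_free_pending.  It assumes the journal commit itself has no effect
on the returned decision.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the retry check looks at. */
struct fs_state {
	bool has_journal;          /* models sbi->s_journal != NULL */
	bool has_free_clusters;    /* models ext4_has_free_clusters(sbi, 1, 0) */
	unsigned int free_pending; /* models sbi->s_mb_free_pending */
};

/* Model of the check before this patch. */
static bool old_decision(const struct fs_state *s, int *retries)
{
	if (!s->has_free_clusters || (*retries)++ > 1 || !s->has_journal)
		return false;
	/* Gives up here even though free space may be available. */
	if (s->free_pending == 0)
		return false;
	return true;
}

/* Model of the check after this patch. */
static bool new_decision(const struct fs_state *s, int *retries)
{
	if (!s->has_journal)
		return false;
	if (++(*retries) > 3)
		return false;
	/* Nothing pending: retry only if free space already exists. */
	if (s->free_pending == 0)
		return s->has_free_clusters;
	return true;
}
```

With free space available and nothing pending, the state a thread sees
after missing a commit, old_decision() refuses a retry on the first
call while new_decision() allows one.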

The current code uses a post-increment operator in the conditional
expression that determines whether the retry limit has been exceeded.
This means that the conditional expression uses the value of the
retry counter before it's increased, resulting in an extra retry cycle.
The current code actually retries twice before hitting its retry limit
rather than once.
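
The counting effect can be reproduced in isolation.  The following is
a minimal user-space sketch, not code from the patch; the helper names
are hypothetical, and only the retry counter comparison is modeled
(the free-space and journal checks are left out).

```c
#include <stdbool.h>

/* Old check: post-increment compares the value before it is increased. */
static bool old_should_retry(int *retries)
{
	return !((*retries)++ > 1);
}

/* New check: pre-increment compares the value after it is increased. */
static bool new_should_retry(int *retries)
{
	return !(++(*retries) > 3);
}

/* Count how many retries a caller is granted, starting from zero. */
static int count_retries(bool (*should_retry)(int *))
{
	int retries = 0;
	int granted = 0;

	while (should_retry(&retries))
		granted++;
	return granted;
}
```

Here count_retries(old_should_retry) yields 2 and
count_retries(new_should_retry) yields 3, matching the retry counts
described in this commit message.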

Increasing the retry limit to 3 from the current actual maximum retry
count of 2 in combination with the change described above reduces the
observed failure rate to less than 0.1% on both ext3conv and
data_journal with what should be limited impact on users sensitive to
the overhead caused by retries.

A per filesystem percpu counter exported via sysfs is added to allow
users or developers to track the number of times the retry limit is
exceeded without resorting to debugging methods.  This should provide
some insight into worst case retry behavior.
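
A user-space reader for the new attribute might look like the sketch
below.  The helper name read_counter() is hypothetical, and the
location is an assumption based on where ext4 normally places its
per-filesystem attributes: /sys/fs/ext4/<device>/sra_exceeded_retry_limit.
The value is printed by the kernel as a single decimal number followed
by a newline.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a sysfs-style file.
 * Returns 0 and stores the value in *value on success, -1 on failure.
 */
static int read_counter(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Sampling the counter before and after a test run and comparing the two
values shows how many times the retry limit was exceeded in between.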

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/ext4/balloc.c | 38 ++++++++++++++++++++++++++------------
 fs/ext4/ext4.h   |  1 +
 fs/ext4/super.c  |  5 +++++
 fs/ext4/sysfs.c  |  7 +++++++
 4 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d640b145637..1afd60fcd772 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -626,27 +626,41 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
 
 /**
  * ext4_should_retry_alloc() - check if a block allocation should be retried
- * @sb:			super block
- * @retries:		number of attemps has been made
+ * @sb:			superblock
+ * @retries:		number of retry attempts made so far
  *
- * ext4_should_retry_alloc() is called when ENOSPC is returned, and if
- * it is profitable to retry the operation, this function will wait
- * for the current or committing transaction to complete, and then
- * return TRUE.  We will only retry once.
+ * ext4_should_retry_alloc() is called when ENOSPC is returned while
+ * attempting to allocate blocks.  If there's an indication that a pending
+ * journal transaction might free some space and allow another attempt to
+ * succeed, this function will wait for the current or committing transaction
+ * to complete and then return TRUE.
  */
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
-	if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) ||
-	    (*retries)++ > 1 ||
-	    !EXT4_SB(sb)->s_journal)
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!sbi->s_journal)
 		return 0;
 
-	smp_mb();
-	if (EXT4_SB(sb)->s_mb_free_pending == 0)
+	if (++(*retries) > 3) {
+		percpu_counter_inc(&sbi->s_sra_exceeded_retry_limit);
 		return 0;
+	}
 
+	/*
+	 * if there's no indication that blocks are about to be freed it's
+	 * possible we just missed a transaction commit that did so
+	 */
+	smp_mb();
+	if (sbi->s_mb_free_pending == 0)
+		return ext4_has_free_clusters(sbi, 1, 0);
+
+	/*
+	 * it's possible we've just missed a transaction commit here,
+	 * so ignore the returned status
+	 */
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-	jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	(void) jbd2_journal_force_commit_nested(sbi->s_journal);
 	return 1;
 }
 
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 65ecaf96d0a4..51e665585ecc 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1474,6 +1474,7 @@ struct ext4_sb_info {
 	struct percpu_counter s_freeinodes_counter;
 	struct percpu_counter s_dirs_counter;
 	struct percpu_counter s_dirtyclusters_counter;
+	struct percpu_counter s_sra_exceeded_retry_limit;
 	struct blockgroup_lock *s_blockgroup_lock;
 	struct proc_dir_entry *s_proc;
 	struct kobject s_kobj;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e30bf8f342c2..594300d315ef 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1226,6 +1226,7 @@ static void ext4_put_super(struct super_block *sb)
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+	percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
 	percpu_free_rwsem(&sbi->s_writepages_rwsem);
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < EXT4_MAXQUOTAS; i++)
@@ -5019,6 +5020,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (!err)
 		err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
 					  GFP_KERNEL);
+	if (!err)
+		err = percpu_counter_init(&sbi->s_sra_exceeded_retry_limit, 0,
+					  GFP_KERNEL);
 	if (!err)
 		err = percpu_init_rwsem(&sbi->s_writepages_rwsem);
 
@@ -5131,6 +5135,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+	percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
 	percpu_free_rwsem(&sbi->s_writepages_rwsem);
 failed_mount5:
 	ext4_ext_release(sb);
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 4e27fe6ed3ae..f24bef3be48a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -24,6 +24,7 @@ typedef enum {
 	attr_session_write_kbytes,
 	attr_lifetime_write_kbytes,
 	attr_reserved_clusters,
+	attr_sra_exceeded_retry_limit,
 	attr_inode_readahead,
 	attr_trigger_test_error,
 	attr_first_error_time,
@@ -208,6 +209,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
 EXT4_ATTR_FUNC(session_write_kbytes, 0444);
 EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
 EXT4_ATTR_FUNC(reserved_clusters, 0644);
+EXT4_ATTR_FUNC(sra_exceeded_retry_limit, 0444);
 
 EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
 		 ext4_sb_info, s_inode_readahead_blks);
@@ -257,6 +259,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(session_write_kbytes),
 	ATTR_LIST(lifetime_write_kbytes),
 	ATTR_LIST(reserved_clusters),
+	ATTR_LIST(sra_exceeded_retry_limit),
 	ATTR_LIST(inode_readahead_blks),
 	ATTR_LIST(inode_goal),
 	ATTR_LIST(mb_stats),
@@ -380,6 +383,10 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
 		return snprintf(buf, PAGE_SIZE, "%llu\n",
 				(unsigned long long)
 				atomic64_read(&sbi->s_resv_clusters));
+	case attr_sra_exceeded_retry_limit:
+		return snprintf(buf, PAGE_SIZE, "%llu\n",
+				(unsigned long long)
+			percpu_counter_sum(&sbi->s_sra_exceeded_retry_limit));
 	case attr_inode_readahead:
 	case attr_pointer_ui:
 		if (!ptr)
-- 
2.30.1

