linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Eric Whitney <enwlinux@gmail.com>,
	Theodore Tso <tytso@mit.edu>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 06/74] ext4: shrink race window in ext4_should_retry_alloc()
Date: Mon,  5 Apr 2021 10:53:30 +0200	[thread overview]
Message-ID: <20210405085024.918760035@linuxfoundation.org> (raw)
In-Reply-To: <20210405085024.703004126@linuxfoundation.org>

From: Eric Whitney <enwlinux@gmail.com>

[ Upstream commit efc61345274d6c7a46a0570efbc916fcbe3e927b ]

When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
fails at significant rates on the two test scenarios that disable
delayed allocation (ext3conv and data_journal) and force actual block
allocation for the fallocate and pwrite functions in the test.  The
failure rate on 5.10 for both ext3conv and data_journal on one test
system typically runs about 85%.  On 5.11, the failure rate on ext3conv
sometimes drops to as low as 1% while the rate on data_journal
increases to nearly 100%.

The observed failures are largely due to ext4_should_retry_alloc()
cutting off block allocation retries when s_mb_free_pending (used to
indicate that a transaction in progress will free blocks) is 0.
However, free space is usually available when this occurs during runs
of generic/371.  It appears that a thread attempting to allocate
blocks is just missing transaction commits in other threads that
increase the free cluster count and reset s_mb_free_pending while
the allocating thread isn't running.  Explicitly testing for free space
availability avoids this race.

The current code uses a post-increment operator in the conditional
expression that determines whether the retry limit has been exceeded.
This means that the conditional expression uses the value of the
retry counter before it's increased, resulting in an extra retry cycle.
The current code actually retries twice before hitting its retry limit
rather than once.

Increasing the retry limit to 3 from the current actual maximum retry
count of 2 in combination with the change described above reduces the
observed failure rate to less that 0.1% on both ext3conv and
data_journal with what should be limited impact on users sensitive to
the overhead caused by retries.

A per filesystem percpu counter exported via sysfs is added to allow
users or developers to track the number of times the retry limit is
exceeded without resorting to debugging methods.  This should provide
some insight into worst case retry behavior.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/ext4/balloc.c | 38 ++++++++++++++++++++++++++------------
 fs/ext4/ext4.h   |  1 +
 fs/ext4/super.c  |  5 +++++
 fs/ext4/sysfs.c  |  7 +++++++
 4 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 5aba67a504cf..031ff3f19018 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -612,27 +612,41 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
 
 /**
  * ext4_should_retry_alloc() - check if a block allocation should be retried
- * @sb:			super block
- * @retries:		number of attemps has been made
+ * @sb:			superblock
+ * @retries:		number of retry attempts made so far
  *
- * ext4_should_retry_alloc() is called when ENOSPC is returned, and if
- * it is profitable to retry the operation, this function will wait
- * for the current or committing transaction to complete, and then
- * return TRUE.  We will only retry once.
+ * ext4_should_retry_alloc() is called when ENOSPC is returned while
+ * attempting to allocate blocks.  If there's an indication that a pending
+ * journal transaction might free some space and allow another attempt to
+ * succeed, this function will wait for the current or committing transaction
+ * to complete and then return TRUE.
  */
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
-	if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) ||
-	    (*retries)++ > 1 ||
-	    !EXT4_SB(sb)->s_journal)
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!sbi->s_journal)
 		return 0;
 
-	smp_mb();
-	if (EXT4_SB(sb)->s_mb_free_pending == 0)
+	if (++(*retries) > 3) {
+		percpu_counter_inc(&sbi->s_sra_exceeded_retry_limit);
 		return 0;
+	}
 
+	/*
+	 * if there's no indication that blocks are about to be freed it's
+	 * possible we just missed a transaction commit that did so
+	 */
+	smp_mb();
+	if (sbi->s_mb_free_pending == 0)
+		return ext4_has_free_clusters(sbi, 1, 0);
+
+	/*
+	 * it's possible we've just missed a transaction commit here,
+	 * so ignore the returned status
+	 */
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-	jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	(void) jbd2_journal_force_commit_nested(sbi->s_journal);
 	return 1;
 }
 
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1c558b554788..bf3eaa903033 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1420,6 +1420,7 @@ struct ext4_sb_info {
 	struct percpu_counter s_freeinodes_counter;
 	struct percpu_counter s_dirs_counter;
 	struct percpu_counter s_dirtyclusters_counter;
+	struct percpu_counter s_sra_exceeded_retry_limit;
 	struct blockgroup_lock *s_blockgroup_lock;
 	struct proc_dir_entry *s_proc;
 	struct kobject s_kobj;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 06568467b0c2..2ecf4594a20d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1017,6 +1017,7 @@ static void ext4_put_super(struct super_block *sb)
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+	percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
 	percpu_free_rwsem(&sbi->s_writepages_rwsem);
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < EXT4_MAXQUOTAS; i++)
@@ -4597,6 +4598,9 @@ no_journal:
 	if (!err)
 		err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
 					  GFP_KERNEL);
+	if (!err)
+		err = percpu_counter_init(&sbi->s_sra_exceeded_retry_limit, 0,
+					  GFP_KERNEL);
 	if (!err)
 		err = percpu_init_rwsem(&sbi->s_writepages_rwsem);
 
@@ -4699,6 +4703,7 @@ failed_mount6:
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+	percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
 	percpu_free_rwsem(&sbi->s_writepages_rwsem);
 failed_mount5:
 	ext4_ext_release(sb);
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index eb1efad0e20a..9394360ff137 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -23,6 +23,7 @@ typedef enum {
 	attr_session_write_kbytes,
 	attr_lifetime_write_kbytes,
 	attr_reserved_clusters,
+	attr_sra_exceeded_retry_limit,
 	attr_inode_readahead,
 	attr_trigger_test_error,
 	attr_first_error_time,
@@ -176,6 +177,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
 EXT4_ATTR_FUNC(session_write_kbytes, 0444);
 EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
 EXT4_ATTR_FUNC(reserved_clusters, 0644);
+EXT4_ATTR_FUNC(sra_exceeded_retry_limit, 0444);
 
 EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
 		 ext4_sb_info, s_inode_readahead_blks);
@@ -207,6 +209,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(session_write_kbytes),
 	ATTR_LIST(lifetime_write_kbytes),
 	ATTR_LIST(reserved_clusters),
+	ATTR_LIST(sra_exceeded_retry_limit),
 	ATTR_LIST(inode_readahead_blks),
 	ATTR_LIST(inode_goal),
 	ATTR_LIST(mb_stats),
@@ -308,6 +311,10 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
 		return snprintf(buf, PAGE_SIZE, "%llu\n",
 				(unsigned long long)
 				atomic64_read(&sbi->s_resv_clusters));
+	case attr_sra_exceeded_retry_limit:
+		return snprintf(buf, PAGE_SIZE, "%llu\n",
+				(unsigned long long)
+			percpu_counter_sum(&sbi->s_sra_exceeded_retry_limit));
 	case attr_inode_readahead:
 	case attr_pointer_ui:
 		if (!ptr)
-- 
2.30.1




  parent reply	other threads:[~2021-04-05  9:05 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-05  8:53 [PATCH 5.4 00/74] 5.4.110-rc1 review Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 01/74] selinux: vsock: Set SID for socket returned by accept() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 02/74] ipv6: weaken the v4mapped source check Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 03/74] module: merge repetitive strings in module_sig_check() Greg Kroah-Hartman
2021-04-05 13:35   ` Sergey Shtylyov
2021-04-05 13:40     ` Greg Kroah-Hartman
2021-04-05 14:11       ` Sergey Shtylyov
2021-04-05 16:14         ` Sasha Levin
2021-04-05  8:53 ` [PATCH 5.4 04/74] module: avoid *goto*s " Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 05/74] module: harden ELF info handling Greg Kroah-Hartman
2021-04-05  8:53 ` Greg Kroah-Hartman [this message]
2021-04-05  8:53 ` [PATCH 5.4 07/74] ext4: fix bh ref count on error paths Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 08/74] fs: nfsd: fix kconfig dependency warning for NFSD_V4 Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 09/74] rpc: fix NULL dereference on kmalloc failure Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 10/74] iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activate Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 11/74] ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10 Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 12/74] ASoC: rt5651: " Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 13/74] ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 14/74] ASoC: es8316: Simplify adc_pga_gain_tlv table Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 15/74] ASoC: cs42l42: Fix Bitclock polarity inversion Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 16/74] ASoC: cs42l42: Fix channel width support Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 17/74] ASoC: cs42l42: Fix mixer volume control Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 18/74] ASoC: cs42l42: Always wait at least 3ms after reset Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 19/74] NFSD: fix error handling in NFSv4.0 callbacks Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 20/74] powerpc: Force inlining of cpu_has_feature() to avoid build failure Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 21/74] vhost: Fix vhost_vq_reset() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 22/74] scsi: st: Fix a use after free in st_open() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 23/74] scsi: qla2xxx: Fix broken #endif placement Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 24/74] staging: comedi: cb_pcidas: fix request_irq() warn Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 25/74] staging: comedi: cb_pcidas64: " Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 26/74] ASoC: rt5659: Update MCLK rate in set_sysclk() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 27/74] thermal/core: Add NULL pointer check before using cooling device stats Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 28/74] locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 29/74] ext4: do not iput inode under running transaction in ext4_rename() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 30/74] net: mvpp2: fix interrupt mask/unmask skip condition Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 31/74] flow_dissector: fix TTL and TOS dissection on IPv4 fragments Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 32/74] can: dev: move driver related infrastructure into separate subdir Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 33/74] net: introduce CAN specific pointer in the struct net_device Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 34/74] can: tcan4x5x: fix max register value Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 5.4 35/74] brcmfmac: clear EAP/association status bits on linkdown events Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 36/74] ath10k: hold RCU lock when calling ieee80211_find_sta_by_ifaddr() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 37/74] net: ethernet: aquantia: Handle error cleanup of start on open Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 38/74] appletalk: Fix skb allocation size in loopback case Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 39/74] net: wan/lmc: unregister device when no matching device is found Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 40/74] bpf: Remove MTU check in __bpf_skb_max_len Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 41/74] ALSA: usb-audio: Apply sample rate quirk to Logitech Connect Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 42/74] ALSA: hda: Re-add dropped snd_poewr_change_state() calls Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 43/74] ALSA: hda: Add missing sanity checks in PM prepare/complete callbacks Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 44/74] ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 45/74] ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 46/74] xtensa: move coprocessor_flush to the .text section Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 47/74] PM: runtime: Fix race getting/putting suppliers at probe Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 48/74] PM: runtime: Fix ordering in pm_runtime_get_suppliers() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 49/74] tracing: Fix stack trace event size Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 50/74] mm: fix race by making init_zero_pfn() early_initcall Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 51/74] drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 52/74] drm/amdgpu: check alignment on CPU page for bo map Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 53/74] reiserfs: update reiserfs_xattrs_initialized() condition Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 54/74] drm/tegra: sor: Grab runtime PM reference across reset Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 55/74] vfio/nvlink: Add missing SPAPR_TCE_IOMMU depends Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 56/74] pinctrl: rockchip: fix restore error in resume Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 57/74] extcon: Add stubs for extcon_register_notifier_all() functions Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 58/74] extcon: Fix error handling in extcon_dev_register Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 59/74] firewire: nosy: Fix a use-after-free bug in nosy_ioctl() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 60/74] usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 61/74] USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 62/74] usb: musb: Fix suspend with devices connected for a64 Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 63/74] usb: xhci-mtk: fix broken streams issue on 0.96 xHCI Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 64/74] cdc-acm: fix BREAK rx code path adding necessary calls Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 65/74] USB: cdc-acm: untangle a circular dependency between callback and softint Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 66/74] USB: cdc-acm: downgrade message to debug Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 67/74] USB: cdc-acm: fix double free on probe failure Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 68/74] USB: cdc-acm: fix use-after-free after " Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 69/74] usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 70/74] usb: dwc2: Fix HPRT0.PrtSusp bit setting for HiKey 960 board Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 71/74] usb: dwc2: Prevent core suspend when port connection flag is 0 Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 72/74] staging: rtl8192e: Fix incorrect source in memcpy() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 73/74] staging: rtl8192e: Change state information from u16 to u8 Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 5.4 74/74] drivers: video: fbcon: fix NULL dereference in fbcon_cursor() Greg Kroah-Hartman
2021-04-05 16:51 ` [PATCH 5.4 00/74] 5.4.110-rc1 review Florian Fainelli
2021-04-05 17:58 ` Guenter Roeck
2021-04-06  0:29 ` Shuah Khan
2021-04-06  3:42 ` Naresh Kamboju
2021-04-06  7:12 ` Samuel Zou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210405085024.918760035@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=enwlinux@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).