From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org, hch@infradead.org
Subject: [PATCH] xfs: inodegc needs to stop before freeze
Date: Wed, 4 Aug 2021 20:03:28 +1000	[thread overview]
Message-ID: <20210804100328.GK2757197@dread.disaster.area> (raw)
In-Reply-To: <20210804032030.GT3601443@magnolia>

On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> For everyone else following along at home, I've posted the current draft
> version of this whole thing in:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
> 
> Here's Dave's patch reworked slightly to fix a deadlock between icreate
> and inactivation; conversion to use m_opstate and related macro stamping
> goodness; and some code reorganization to make it easier to add the
> throttling bits in the back two thirds of the series.
> 
> IOWs, I like this patch.  The runtime for my crazy deltree benchmark
> dropped from ~27 minutes to ~17 when the VM has 560M of RAM, and there's
> no observable drop in performance when the VM has 16G of RAM.  I also
> finally got it to run with 512M of RAM, whereas current TOT OOMs.
> 
> (Note: My crazy deltree benchmark is: I have a mdrestored sparse image
> with 10m files that I use dm-snapshot so that I can repeatedly write to
> it without needing to restore the image.  Then I mount the dm snapshot,
> and try to delete every file in the fs.)
....

Ok, so xfs/517 fails with a freeze assert:

XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861

> @@ -718,6 +729,25 @@ xfs_fs_sync_fs(
>  		flush_delayed_work(&mp->m_log->l_work);
>  	}
>  
> +	/*
> +	 * Flush all deferred inode inactivation work so that the free space
> +	 * counters will reflect recent deletions.  Do not force the log again
> +	 * because log recovery can restart the inactivation from the info that
> +	 * we just wrote into the ondisk log.
> +	 *
> +	 * For regular operation this isn't strictly necessary since we aren't
> +	 * required to guarantee that unlinking frees space immediately, but
> +	 * that is how XFS historically behaved.
> +	 *
> +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> +	 * last chance to complete the inactivation work before the filesystem
> +	 * freezes and the log is quiesced.  The background worker will not
> +	 * activate again until the fs is thawed because the VFS won't evict
> +	 * any more inodes until freeze_super drops s_umount and we disable the
> +	 * worker in xfs_fs_freeze.
> +	 */
> +	xfs_inodegc_flush(mp);

How does s_umount prevent __fput() from dropping the last reference
to an unlinked inode and putting it through evict() and hence adding
it to the deferred list that way?

Remember that the flush does not guarantee the per-cpu queues are
empty when it returns, just that whatever was in each per-cpu queue
when the per-cpu work ran has been completed.  We haven't yet
gone to SB_FREEZE_FS, so the transaction subsystem won't be frozen
at this point. Hence I can't see anything that would prevent unlinks
racing with this flush and queueing work after the flush work drains
the queues and starts processing the inodes it drained.
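
To make that window concrete, the queue/flush pattern boils down to
something like this - a simplified sketch built on stock llist/workqueue
primitives, not the actual xfs_icache.c code, and all of the structure
and function names here are illustrative:

#include <linux/llist.h>
#include <linux/workqueue.h>

/* Illustrative stand-ins for the real per-cpu inodegc state. */
struct gc_queue {
	struct llist_head	list;	/* lockless list of deferred inodes */
	struct work_struct	work;	/* worker that inactivates them */
};

struct gc_inode {
	struct llist_node	gclist;
	/* ... */
};

/* evict() path: defer the inode and kick the worker. */
static void gc_queue_inode(struct gc_queue *gc, struct gc_inode *ip)
{
	llist_add(&ip->gclist, &gc->list);
	queue_work(system_unbound_wq, &gc->work);
}

/*
 * "Flush" only waits for the work item as it is currently queued.
 * An unlink that calls gc_queue_inode() after flush_work() has
 * sampled the pending work is not waited for - that is the window
 * between ->sync_fs and SB_FREEZE_FS described above.
 */
static void gc_flush(struct gc_queue *gc)
{
	flush_work(&gc->work);
}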

> +
>  	return 0;
>  }
>  
> @@ -832,6 +862,17 @@ xfs_fs_freeze(
>  	 */
>  	flags = memalloc_nofs_save();
>  	xfs_blockgc_stop(mp);
> +
> +	/*
> +	 * Stop the inodegc background worker.  freeze_super already flushed
> +	 * all pending inodegc work when it sync'd the filesystem after setting
> +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> +	 * If the filesystem is read-write, inactivated inodes will queue but
> +	 * the worker will not run until the filesystem thaws or unmounts.
> +	 */
> +	xfs_inodegc_stop(mp);

.... and so we end up with this flush blocked waiting on the background
worker that tripped the assert and BUG()d:

[  219.511172] task:xfs_io          state:D stack:14208 pid:10238 ppid:  9089 flags:0x00004004
[  219.513126] Call Trace:
[  219.513768]  __schedule+0x310/0x9f0
[  219.514628]  schedule+0x67/0xe0
[  219.515405]  schedule_timeout+0x114/0x160
[  219.516404]  ? _raw_spin_unlock_irqrestore+0x12/0x40
[  219.517622]  ? do_raw_spin_unlock+0x57/0xb0
[  219.518655]  __wait_for_common+0xc0/0x160
[  219.519638]  ? usleep_range+0xa0/0xa0
[  219.520545]  wait_for_completion+0x24/0x30
[  219.521544]  flush_work+0x58/0x70
[  219.522357]  ? flush_workqueue_prep_pwqs+0x140/0x140
[  219.523553]  xfs_inodegc_flush+0x88/0x100
[  219.524524]  xfs_inodegc_stop+0x28/0xb0
[  219.525514]  xfs_fs_freeze+0x40/0x70
[  219.526401]  freeze_super+0xd8/0x140
[  219.527277]  do_vfs_ioctl+0x784/0x890
[  219.528146]  __x64_sys_ioctl+0x6f/0xc0
[  219.529062]  do_syscall_64+0x35/0x80
[  219.529974]  entry_SYSCALL_64_after_hwframe+0x44/0xae

At this point, we are at SB_FREEZE_FS and the transaction system is
shut down, so this is a hard fail.
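
For reference, the ordering that bites here is set by the VFS:
->sync_fs runs while the superblock is at SB_FREEZE_PAGEFAULT and
->freeze_fs only runs once it has reached SB_FREEZE_FS, so
inactivations queued between the two can be processed after the
freeze state has already reached SB_FREEZE_FS. Roughly - this is
paraphrased from fs/super.c of this era with locking and error
handling elided, not a verbatim copy:

int freeze_super(struct super_block *sb)
{
	sb->s_writers.frozen = SB_FREEZE_WRITE;
	sb_wait_write(sb, SB_FREEZE_WRITE);	/* block new writers */

	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);	/* block page faults */

	sync_filesystem(sb);			/* -> xfs_fs_sync_fs() */

	sb->s_writers.frozen = SB_FREEZE_FS;
	sb_wait_write(sb, SB_FREEZE_FS);	/* block internal writers */

	return sb->s_op->freeze_fs(sb);		/* -> xfs_fs_freeze() */
}

Hence the flush in xfs_fs_freeze() ends up waiting on a worker that
has already tripped the assert.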

ISTR a discussion about this in the past - I think we need to hook
->freeze_super() and run xfs_inodegc_stop() before we run
freeze_super(). That way we end up just queuing pending
inactivations while the freeze runs and completes.

The patch below does this (applied on top of your entire stack) and
it seems to fix the xfs/517 failure (0 failures in 50 runs vs a 100%
failure rate without the patch).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: inodegc needs to stop before freeze

From: Dave Chinner <dchinner@redhat.com>

There is no way to safely stop background inactivation
during a freeze without racing. We can't rely on flushes during
->sync_fs to empty the queues because the freeze hasn't yet frozen
the transaction subsystem. Hence unlinks can race with freeze
resulting in inactivations being processed after the transaction
system has been frozen. That leads to assert failures like:

XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861

So, shut down inodegc before the freeze machinery starts by hooking
->freeze_super(). This ensures that no inactivation takes place
while we are freezing and hence we avoid the freeze vs inactivation
races altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c | 47 +++++++++++++++++------------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ec0165513c60..c251679e8514 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -750,26 +750,6 @@ xfs_fs_sync_fs(
 		 */
 		flush_delayed_work(&mp->m_log->l_work);
 	}
-
-	/*
-	 * Flush all deferred inode inactivation work so that the free space
-	 * counters will reflect recent deletions.  Do not force the log again
-	 * because log recovery can restart the inactivation from the info that
-	 * we just wrote into the ondisk log.
-	 *
-	 * For regular operation this isn't strictly necessary since we aren't
-	 * required to guarantee that unlinking frees space immediately, but
-	 * that is how XFS historically behaved.
-	 *
-	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
-	 * last chance to complete the inactivation work before the filesystem
-	 * freezes and the log is quiesced.  The background worker will not
-	 * activate again until the fs is thawed because the VFS won't evict
-	 * any more inodes until freeze_super drops s_umount and we disable the
-	 * worker in xfs_fs_freeze.
-	 */
-	xfs_inodegc_flush(mp);
-
 	return 0;
 }
 
@@ -888,16 +868,6 @@ xfs_fs_freeze(
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
 
-	/*
-	 * Stop the inodegc background worker.  freeze_super already flushed
-	 * all pending inodegc work when it sync'd the filesystem after setting
-	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
-	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
-	 * If the filesystem is read-write, inactivated inodes will queue but
-	 * the worker will not run until the filesystem thaws or unmounts.
-	 */
-	xfs_inodegc_stop(mp);
-
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -927,6 +897,22 @@ xfs_fs_unfreeze(
 	return 0;
 }
 
+static int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+	int			error;
+
+	xfs_inodegc_stop(mp);
+	error = freeze_super(sb);
+	if (error) {
+		if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+			xfs_inodegc_start(mp);
+	}
+	return error;
+}
+
 /*
  * This function fills in xfs_mount_t fields based on mount args.
  * Note: the superblock _has_ now been read in.
@@ -1130,6 +1116,7 @@ static const struct super_operations xfs_super_operations = {
 	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
 	.sync_fs		= xfs_fs_sync_fs,
+	.freeze_super		= xfs_fs_freeze_super,
 	.freeze_fs		= xfs_fs_freeze,
 	.unfreeze_fs		= xfs_fs_unfreeze,
 	.statfs			= xfs_fs_statfs,
