linux-xfs.vger.kernel.org archive mirror
* xfs: log recovery hang fixes
@ 2022-03-07  5:32 Dave Chinner
  2022-03-07  5:32 ` [PATCH 1/3] xfs: log worker needs to start before intent/unlink recovery Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-07  5:32 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy

Hi folks,

Willy reported generic/530 had started hanging on his test machines
and I've tried to reproduce the problem he reported. While I haven't
reproduced the exact hang he's been having, I've found a couple of
others while running g/530 in a tight loop on a couple of test
machines.

The first two patches are a result of a hang documented in patch 1.
The change to run the log worker earlier is defensive, but serves to
break generic log space deadlocks during intent and unlinked inode
recovery as it does at normal runtime. This doesn't fix the problem,
just adds a layer of protection that means stuff that gets stuck on
pinned buffers, push hangs, etc. only stays hung up for 30s at most.

The second patch fixes the hang that results from delwri buffer
pushing racing with modifications that pin the buffer (i.e.
transaction commit) and then require access to it again soon after.
The buffer is locked by delwri submission that is waiting for it to
be unpinned, but the processes that might be able to trigger an
unpin are blocked waiting for the buffer lock itself. This happens
during log recovery when processing unlinked inodes that hit the
same inode cluster buffer.

The third patch is for log recovery hangs I've been seeing that
occur after unlinked inode recovery has completed and the AIL is
being pushed out. The trigger may be unique to the highly modified
kernel I was running (and is mitigated to a 30s delay in log recovery
completion in g/530 by the first patch in the series), but I have
occasionally seen periodic hangs in xfs_ail_push_all_sync() in the
past where the AIL has not been fully emptied but it is sleeping
without making progress. Hence I think the problem is a real one,
it's just that I don't have a way of reproducing it reliably on an
unmodified kernel.

Willy, can you see if these patches fix the problem you are seeing?
If not, I still think they stand alone as necessary fixes, but I'll
have to keep digging to find out why you are seeing hangs in g/530.

Cheers,

Dave.



* [PATCH 1/3] xfs: log worker needs to start before intent/unlink recovery
  2022-03-07  5:32 xfs: log recovery hang fixes Dave Chinner
@ 2022-03-07  5:32 ` Dave Chinner
  2022-03-07  5:32 ` [PATCH 2/3] xfs: check buffer pin state after locking in delwri_submit Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-07  5:32 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

After 963 iterations of generic/530, it deadlocked during recovery
on a pinned inode cluster buffer like so:

XFS (pmem1): Starting recovery (logdev: internal)
INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
      Not tainted 5.17.0-rc6-dgc+ #975
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iunlink_map_ino+0x66/0xd0
 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
 xfs_iunlink_remove_inode+0xf2/0x1d0
 xfs_inactive_ifree+0x1a3/0x900
 xfs_inode_unlink+0xcc/0x210
 xfs_inodegc_worker+0x1ac/0x2f0
 process_one_work+0x1ac/0x390
 worker_thread+0x56/0x3c0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30
 </TASK>
task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iget+0x300/0xb40
 xlog_recover_process_one_iunlink+0x4c/0x170
 xlog_recover_process_iunlinks.isra.0+0xee/0x130
 xlog_recover_finish+0x57/0x110
 xfs_log_mount_finish+0xfc/0x1e0
 xfs_mountfs+0x540/0x910
 xfs_fs_fill_super+0x495/0x850
 get_tree_bdev+0x171/0x270
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x304/0xba0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>
task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 io_schedule+0x4b/0x80
 xfs_buf_wait_unpin+0x9e/0xf0
 __xfs_buf_submit+0x14a/0x230
 xfs_buf_delwri_submit_buffers+0x107/0x280
 xfs_buf_delwri_submit_nowait+0x10/0x20
 xfsaild+0x27e/0x9d0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30

We have the mount process waiting on an inode cluster buffer read,
inodegc doing unlink waiting on the same inode cluster buffer, and
the AIL push thread blocked in writeback waiting for the inode
cluster buffer to become unpinned.

What has happened here is that the AIL push thread has raced with
the inodegc process modifying, committing and pinning the inode
cluster buffer at this point in xfs_buf_delwri_submit_buffers():

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>
			if (!xfs_buf_trylock(bp))
				continue;

Basically, the AIL has found the buffer wasn't pinned and got the
lock without blocking, but then the buffer was pinned. This implies
the processing here was pre-empted between the pin check and the
lock, because the pin count can only be increased while holding the
buffer locked. Hence when it has gone to submit the IO, it has
blocked waiting for the buffer to be unpinned.
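
Spelled out as a rough sketch (schematic only, not actual code; the
function names are the ones visible in the traces above), the
interleaving is:

    xfsaild (delwri submission)          inodegc (transaction commit)
    ---------------------------          -----------------------------
    xfs_buf_ispinned(bp) -> false
        <preempted>
                                         xfs_buf_lock(bp)
                                         <modify and commit, pinning bp>
                                         xfs_buf_unlock(bp)
    xfs_buf_trylock(bp) -> succeeds
    __xfs_buf_submit(bp)
      xfs_buf_wait_unpin(bp)             xfs_buf_lock(bp) <- blocks on
        <- sleeps, holding bp locked        the lock the xfsaild holds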

With all executing threads now waiting on the buffer to be unpinned,
we normally get out of situations like this via the background log
worker issuing a log force which will unpin stuck buffers like
this. But at this point in recovery, we haven't started the log
worker. In fact, the first thing we do after processing intents and
unlinked inodes is *start the log worker*. IOWs, we start it too
late to have it break deadlocks like this.

Avoid this and any other similar deadlock vectors in intent and
unlinked inode recovery by starting the log worker before we recover
intents and unlinked inodes. This part of recovery runs as though
the filesystem is fully active, so we really should have the same
infrastructure running as we normally do at runtime.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 4196655e2ef4..8585305d1769 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -835,10 +835,9 @@ xfs_log_mount_finish(
 	 * mount failure occurs.
 	 */
 	mp->m_super->s_flags |= SB_ACTIVE;
+	xfs_log_work_queue(mp);
 	if (xlog_recovery_needed(log))
 		error = xlog_recover_finish(log);
-	if (!error)
-		xfs_log_work_queue(mp);
 	mp->m_super->s_flags &= ~SB_ACTIVE;
 	evict_inodes(mp->m_super);
 
-- 
2.33.0



* [PATCH 2/3] xfs: check buffer pin state after locking in delwri_submit
  2022-03-07  5:32 xfs: log recovery hang fixes Dave Chinner
  2022-03-07  5:32 ` [PATCH 1/3] xfs: log worker needs to start before intent/unlink recovery Dave Chinner
@ 2022-03-07  5:32 ` Dave Chinner
  2022-03-07  5:32 ` [PATCH 3/3] xfs: xfs_ail_push_all_sync() stalls when racing with updates Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-07  5:32 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

AIL flushing can get stuck here:

[316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
[316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
[316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
[316649.014112] Call Trace:
[316649.014841]  <TASK>
[316649.015492]  __schedule+0x30d/0x9e0
[316649.017745]  schedule+0x55/0xd0
[316649.018681]  io_schedule+0x4b/0x80
[316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
[316649.021850]  __xfs_buf_submit+0x14a/0x230
[316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
[316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
[316649.025931]  xfsaild+0x27e/0x9d0
[316649.028283]  kthread+0xf6/0x120
[316649.030602]  ret_from_fork+0x1f/0x30

in the situation where flushing gets preempted between the unpin
check and the buffer trylock under nowait conditions:

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>
			if (!xfs_buf_trylock(bp))
				continue;

This means submission is stuck until something else triggers a log
force to unpin the buffer.

To get onto the delwri list to begin with, the buffer pin state has
already been checked, and hence it's relatively rare that we get a
race between flushing and encountering a pinned buffer in delwri
submission. Further, to increase the pin count the
buffer has to be locked, so the only way we can hit this race
without failing the trylock is to be preempted between the pincount
check seeing zero and the trylock being run.

Hence to avoid this problem, just invert the order of trylock vs
pin check. We shouldn't hit that many pinned buffers here, so
optimising away the trylock for pinned buffers should not matter for
performance at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 18a0b0b96071..63f27d20cf45 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2125,12 +2125,13 @@ xfs_buf_delwri_submit_buffers(
 	blk_start_plug(&plug);
 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
 		if (!wait_list) {
+			if (!xfs_buf_trylock(bp))
+				continue;
 			if (xfs_buf_ispinned(bp)) {
+				xfs_buf_unlock(bp);
 				pinned++;
 				continue;
 			}
-			if (!xfs_buf_trylock(bp))
-				continue;
 		} else {
 			xfs_buf_lock(bp);
 		}
-- 
2.33.0



* [PATCH 3/3] xfs: xfs_ail_push_all_sync() stalls when racing with updates
  2022-03-07  5:32 xfs: log recovery hang fixes Dave Chinner
  2022-03-07  5:32 ` [PATCH 1/3] xfs: log worker needs to start before intent/unlink recovery Dave Chinner
  2022-03-07  5:32 ` [PATCH 2/3] xfs: check buffer pin state after locking in delwri_submit Dave Chinner
@ 2022-03-07  5:32 ` Dave Chinner
  2022-03-07 17:43 ` xfs: log recovery hang fixes Matthew Wilcox
  2022-03-07 23:18 ` [PATCH 4/3] xfs: async CIL flushes need pending pushes to be made stable Dave Chinner
  4 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-07  5:32 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

xfs_ail_push_all_sync() has a loop like this:

while max_ail_lsn {
	prepare_to_wait(ail_empty)
	target = max_ail_lsn
	wake_up(ail_task);
	schedule()
}

This loop is designed to sleep until the AIL is emptied. When
xfs_ail_finish_update() moves the tail of the log, it does:

	if (list_empty(&ailp->ail_head))
		wake_up_all(&ailp->ail_empty);

So it will only wake up the sync push waiter when the AIL goes
empty. If, by the time the push waiter has woken, the AIL has more
in it, it will reset the target, wake the push task and go back to
sleep.

The problem here is that if the AIL is having items added to it
when xfs_ail_push_all_sync() is called, then they may get inserted
into the AIL at an LSN higher than the target LSN. At this point,
xfsaild_push() will see that the target is X, the item LSNs are
(X+N) and skip over them, hence never pushing them out.

The result of this is that the AIL will not get emptied by the AIL
push thread, hence xfs_ail_finish_update() will never see the AIL
being empty even if it moves the tail. Hence xfs_ail_push_all_sync()
never gets woken and so cannot update the push target to capture the
items beyond the current target LSN.
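
As a rough sketch of the sequence (illustrative only):

    1. xfs_ail_push_all_sync() samples max_ail_lsn = X, sets the push
       target to X, wakes the xfsaild and sleeps on ail_empty.
    2. Concurrent transaction commits insert new items into the AIL at
       LSNs beyond X.
    3. xfsaild_push() pushes out everything up to X and then stops,
       skipping the items beyond the target.
    4. The AIL never goes empty, so xfs_ail_finish_update() never wakes
       the ail_empty waiter and the target is never moved past X.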

This is a TOCTOU type of issue so the way to avoid it is to not
use the push target at all for sync pushes. We know that a sync push
is being requested by the fact the ail_empty wait queue is active,
hence the xfsaild can just set the target to max_ail_lsn on every
push where we see the wait queue active. Hence we will no longer
leave items on the AIL that are beyond the LSN sampled at the start
of a sync push.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 2a8c8dc54c95..1b52952097c1 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -448,10 +448,22 @@ xfsaild_push(
 
 	spin_lock(&ailp->ail_lock);
 
-	/* barrier matches the ail_target update in xfs_ail_push() */
-	smp_rmb();
-	target = ailp->ail_target;
-	ailp->ail_target_prev = target;
+	/*
+	 * If we have a sync push waiter, we always have to push till the AIL is
+	 * empty. Update the target to point to the end of the AIL so that we
+	 * capture updates that occur after the sync push waiter has gone to
+	 * sleep.
+	 */
+	if (waitqueue_active(&ailp->ail_empty)) {
+		lip = xfs_ail_max(ailp);
+		if (lip)
+			target = lip->li_lsn;
+	} else {
+		/* barrier matches the ail_target update in xfs_ail_push() */
+		smp_rmb();
+		target = ailp->ail_target;
+		ailp->ail_target_prev = target;
+	}
 
 	/* we're done if the AIL is empty or our push has reached the end */
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn);
@@ -724,7 +736,6 @@ xfs_ail_push_all_sync(
 	spin_lock(&ailp->ail_lock);
 	while ((lip = xfs_ail_max(ailp)) != NULL) {
 		prepare_to_wait(&ailp->ail_empty, &wait, TASK_UNINTERRUPTIBLE);
-		ailp->ail_target = lip->li_lsn;
 		wake_up_process(ailp->ail_task);
 		spin_unlock(&ailp->ail_lock);
 		schedule();
-- 
2.33.0



* Re: xfs: log recovery hang fixes
  2022-03-07  5:32 xfs: log recovery hang fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2022-03-07  5:32 ` [PATCH 3/3] xfs: xfs_ail_push_all_sync() stalls when racing with updates Dave Chinner
@ 2022-03-07 17:43 ` Matthew Wilcox
  2022-03-07 21:18   ` Dave Chinner
  2022-03-07 23:18 ` [PATCH 4/3] xfs: async CIL flushes need pending pushes to be made stable Dave Chinner
  4 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2022-03-07 17:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 07, 2022 at 04:32:49PM +1100, Dave Chinner wrote:
> Willy reported generic/530 had started hanging on his test machines
> and I've tried to reproduce the problem he reported. While I haven't
> reproduced the exact hang he's been having, I've found a couple of
> others while running g/530 in a tight loop on a couple of test
> machines.
[...]
> 
> Willy, can you see if these patches fix the problem you are seeing?
> If not, I still think they stand alone as necessary fixes, but I'll
> have to keep digging to find out why you are seeing hangs in g/530.

I no longer see hangs, but I do see an interesting pattern in runtime
of g/530.  I was seeing hangs after only a few minutes of running g/530,
and I was using 15 minutes of success to say "git bisect good".  Now at 45
minutes of runtime with no hangs.  Specifically, I'm testing 0020a190cf3e
("xfs: AIL needs asynchronous CIL forcing"), plus these three patches.
If you're interested, I can see which of these three patches actually
fixes my hang.  I should also test these three patches on top of current
5.17-rc, but I wanted to check they were backportable to current stable
first.

Of the 120 times g/530 has run, I see 30 occurrences of the test taking
32-35 seconds.  I see one occurrence of the test taking 63 seconds.
Usually it takes 2-3s.  This smacks to me of a 30s timeout expiring.
Let me know if you want me to try to track down which one it is.


* Re: xfs: log recovery hang fixes
  2022-03-07 17:43 ` xfs: log recovery hang fixes Matthew Wilcox
@ 2022-03-07 21:18   ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-07 21:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Mon, Mar 07, 2022 at 05:43:18PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 07, 2022 at 04:32:49PM +1100, Dave Chinner wrote:
> > Willy reported generic/530 had started hanging on his test machines
> > and I've tried to reproduce the problem he reported. While I haven't
> > reproduced the exact hang he's been having, I've found a couple of
> > others while running g/530 in a tight loop on a couple of test
> > machines.
> [...]
> > 
> > Willy, can you see if these patches fix the problem you are seeing?
> > If not, I still think they stand alone as necessary fixes, but I'll
> > have to keep digging to find out why you are seeing hangs in g/530.
> 
> I no longer see hangs, but I do see an interesting pattern in runtime
> of g/530.  I was seeing hangs after only a few minutes of running g/530,
> and I was using 15 minutes of success to say "git bisect good".  Now at 45
> minutes of runtime with no hangs.  Specifically, I'm testing 0020a190cf3e
> ("xfs: AIL needs asynchronous CIL forcing"), plus these three patches.
> If you're interested, I can see which of these three patches actually
> fixes my hang.  I should also test these three patches on top of current
> 5.17-rc, but I wanted to check they were backportable to current stable
> first.
> 
> Of the 120 times g/530 has run, I see 30 occurrences of the test taking
> 32-35 seconds.  I see one occurrence of the test taking 63 seconds.
> Usually it takes 2-3s.  This smacks to me of a 30s timeout expiring.
> Let me know if you want me to try to track down which one it is.

That'll be the log worker triggering a log force after 30s, and that
gets it unstuck. So you're still seeing the problem, only now the
watchdog kicks everything back into life.

Can you run a trace for me that captures one of those 30-60s runs
so I can see what might be happening? Something like:

# trace-cmd record -e xlog\* -e xfs_ail\* -e xfs_log\* -e xfs_inodegc\* -e printk ./check generic/530

I don't need all the XFS tracepoints - I'm mainly interested in log
and AIL interactions and what is stuck on them and when...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* [PATCH 4/3] xfs: async CIL flushes need pending pushes to be made stable
  2022-03-07  5:32 xfs: log recovery hang fixes Dave Chinner
                   ` (3 preceding siblings ...)
  2022-03-07 17:43 ` xfs: log recovery hang fixes Matthew Wilcox
@ 2022-03-07 23:18 ` Dave Chinner
  2022-03-08  6:12   ` [PATCH 4/3 v2] " Dave Chinner
  4 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2022-03-07 23:18 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy


From: Dave Chinner <dchinner@redhat.com>

When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.

generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications. As a
result, the first inode unlinked in a buffer can pin the tail of the
log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
the cil->xc_push_commit_stable flag was not set. This results in
the log/AIL effectively deadlocking until something triggers the
commit record iclog to be pushed to stable storage (i.e. the
periodic log worker calling xfs_log_force()).

The fix is simple - always set the cil->xc_push_commit_stable when
xlog_cil_flush() is called, regardless of whether there is already a
pending push or not.

Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 05cee602406c..9785ac0351fe 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1490,11 +1490,21 @@ xlog_cil_push_now(
 	if (!async)
 		flush_workqueue(cil->xc_push_wq);
 
+	spin_lock(&cil->xc_push_lock);
+
+	/*
+	 * If this is an async flush request, we always need to set the
+	 * xc_push_commit_stable flag even if something else has already queued
+	 * a push. The flush caller is asking for the CIL to be on stable
+	 * storage when the next push completes, so regardless of who has queued
+	 * the push, the flush requires stable semantics from it.
+	 */
+	cil->xc_push_commit_stable = async;
+
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
-	 * there's no work we need to do.
+	 * there's no more work that we need to do.
 	 */
-	spin_lock(&cil->xc_push_lock);
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
 	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
@@ -1502,7 +1512,6 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
-	cil->xc_push_commit_stable = async;
 	queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }


* [PATCH 4/3 v2] xfs: async CIL flushes need pending pushes to be made stable
  2022-03-07 23:18 ` [PATCH 4/3] xfs: async CIL flushes need pending pushes to be made stable Dave Chinner
@ 2022-03-08  6:12   ` Dave Chinner
  2022-03-08 13:52     ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2022-03-08  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.

generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications. As a
result, the first inode unlinked in a buffer can pin the tail of the
log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
the cil->xc_push_commit_stable flag was not set. This results in
the log/AIL effectively deadlocking until something triggers the
commit record iclog to be pushed to stable storage (i.e. the
periodic log worker calling xfs_log_force()).

The fix is two-fold - first we should always set the
cil->xc_push_commit_stable when xlog_cil_flush() is called,
regardless of whether there is already a pending push or not.

Second, if the CIL is empty, we should trigger an iclog flush to
ensure that all the iclogs of the last checkpoint have actually been
submitted to disk as that checkpoint may not have been run under
stable completion constraints.

Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- if the CIL is empty we need to push on the iclogs to ensure that
  the previous CIL flush did actually make it to stable storage.
  This closes a race condition where the AIL doesn't actually
  trigger the last CIL push and so the CIL is already empty by the
  time it tries to flush it to unpin pinned items.

Note: this is against the XFS CIL scalability patchset. The
XLOG_CIL_EMPTY check on a mainline kernel will be:

	if (list_empty(&log->l_cilp->xc_cil))
		xfs_log_force(log->l_mp, 0);

 fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 05cee602406c..de2108822d5b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1490,11 +1490,21 @@ xlog_cil_push_now(
 	if (!async)
 		flush_workqueue(cil->xc_push_wq);
 
+	spin_lock(&cil->xc_push_lock);
+
+	/*
+	 * If this is an async flush request, we always need to set the
+	 * xc_push_commit_stable flag even if something else has already queued
+	 * a push. The flush caller is asking for the CIL to be on stable
+	 * storage when the next push completes, so regardless of who has queued
+	 * the push, the flush requires stable semantics from it.
+	 */
+	cil->xc_push_commit_stable = async;
+
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
-	 * there's no work we need to do.
+	 * there's no more work that we need to do.
 	 */
-	spin_lock(&cil->xc_push_lock);
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
 	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
@@ -1502,7 +1512,6 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
-	cil->xc_push_commit_stable = async;
 	queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1600,6 +1609,13 @@ xlog_cil_flush(
 
 	trace_xfs_log_force(log->l_mp, seq, _RET_IP_);
 	xlog_cil_push_now(log, seq, true);
+
+	/*
+	 * If the CIL is empty, make sure that any previous checkpoint that may
+	 * still be in an active iclog is pushed to stable storage.
+	 */
+	if (test_bit(XLOG_CIL_EMPTY, &log->l_cilp->xc_flags))
+		xfs_log_force(log->l_mp, 0);
 }
 
 /*


* Re: [PATCH 4/3 v2] xfs: async CIL flushes need pending pushes to be made stable
  2022-03-08  6:12   ` [PATCH 4/3 v2] " Dave Chinner
@ 2022-03-08 13:52     ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2022-03-08 13:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 08, 2022 at 05:12:08PM +1100, Dave Chinner wrote:
> The fix is two-fold - first we should always set the
> cil->xc_push_commit_stable when xlog_cil_flush() is called,
> regardless of whether there is already a pending push or not.
> 
> Second, if the CIL is empty, we should trigger an iclog flush to
> ensure that all the iclogs of the last checkpoint have actually been
> submitted to disk as that checkpoint may not have been run under
> stable completion constraints.
> 
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - if the CIL is empty we need to push on the iclogs to ensure that
>   the previous CIL flush did actually make it to stable storage.
>   This closes a race condition where the AIL doesn't actually
>   trigger the last CIL push and so the CIL is already empty by the
>   time it tries to flush it to unpin pinned items.
> 
> Note: this is against the XFS CIL scalability patchset. The
> XLOG_CIL_EMPTY check on a mainline kernel will be:
> 
> 	if (list_empty(&log->l_cilp->xc_cil))
> 		xfs_log_force(log->l_mp, 0);

This is it.  100 runs of g/530 all completed in 2 or 3 seconds.  Thanks!

Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Just to be clear, my testing was done on b08968f196d4 (in Linus' tree),
plus my fs-folio patches, plus patches 1-3 from this series, plus this
(whitespace damaged):

+++ b/fs/xfs/xfs_log_cil.c
@@ -1243,18 +1243,27 @@ xlog_cil_push_now(
        if (!async)
                flush_workqueue(cil->xc_push_wq);

+       spin_lock(&cil->xc_push_lock);
+
+       /*
+        * If this is an async flush request, we always need to set the
+        * xc_push_commit_stable flag even if something else has already queued
+        * a push. The flush caller is asking for the CIL to be on stable
+        * storage when the next push completes, so regardless of who has queued
+        * the push, the flush requires stable semantics from it.
+        */
+       cil->xc_push_commit_stable = async;
+
        /*
         * If the CIL is empty or we've already pushed the sequence then
-        * there's no work we need to do.
+        * there's no more work that we need to do.
         */
-       spin_lock(&cil->xc_push_lock);
        if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
                spin_unlock(&cil->xc_push_lock);
                return;
        }

        cil->xc_push_seq = push_seq;
-       cil->xc_push_commit_stable = async;
        queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
        spin_unlock(&cil->xc_push_lock);
 }
@@ -1352,6 +1361,8 @@ xlog_cil_flush(

        trace_xfs_log_force(log->l_mp, seq, _RET_IP_);
        xlog_cil_push_now(log, seq, true);
+        if (list_empty(&log->l_cilp->xc_cil))
+                xfs_log_force(log->l_mp, 0);
 }

 /*


