* [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
@ 2022-03-18 13:46 Brian Foster
  2022-03-18 16:11 ` Brian Foster
  2022-03-18 21:48 ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2022-03-18 13:46 UTC (permalink / raw)
  To: linux-xfs

Hi,

I'm not sure if this is known and/or fixed already, but it didn't look
familiar so here is a report. I hit a splat when testing Willy's
prospective folio bookmark change and it turns out it replicates on
Linus' current master (551acdc3c3d2). This initially reproduced on
xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
xfs/006, but when I attempted to reproduce the latter a second time I
hit what looks like the same problem as xfs/264. Both tests seem to
involve some form of error injection, so possibly the same underlying
problem. The GPF splat from xfs/264 is below.

Brian

--- 8< ---

general protection fault, probably for non-canonical address 0x102e31d0105f07d: 0000 [#1] PREEMPT SMP NOPTI
CPU: 24 PID: 1647 Comm: kworker/24:1H Tainted: G S                5.17.0-rc8+ #14
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
RIP: 0010:native_queued_spin_lock_slowpath+0x1a4/0x1e0
Code: f3 90 48 8b 0a 48 85 c9 74 f6 eb c5 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 60 b9 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 0a
RSP: 0018:ff407350cf917b48 EFLAGS: 00010206
RAX: 0102e31d0105f07d RBX: ff1a52eeb16dd1c0 RCX: 0000000000002c5a
RDX: ff1a53123fb30d40 RSI: ffffffffb95826f6 RDI: ffffffffb9554147
RBP: ff1a53123fb30d40 R08: ff1a52d3c8684028 R09: 0000000000000121
R10: 00000000000000bf R11: 0000000000000b65 R12: 0000000000640000
R13: 0000000000000008 R14: ff1a52d3d0899000 R15: ff1a52d5bdf07800
FS:  0000000000000000(0000) GS:ff1a53123fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556b2ed15f44 CR3: 000000019a4c0005 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 _raw_spin_lock+0x2c/0x30
 xfs_trans_ail_delete+0x27/0xd0 [xfs]
 xfs_buf_item_done+0x22/0x30 [xfs]
 xfs_buf_ioend+0x71/0x5e0 [xfs]
 xfs_trans_committed_bulk+0x167/0x2c0 [xfs]
 ? enqueue_entity+0x121/0x4d0
 ? enqueue_task_fair+0x417/0x530
 ? resched_curr+0x23/0xc0
 ? check_preempt_curr+0x3f/0x70
 ? _raw_spin_unlock_irqrestore+0x1f/0x31
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_committed+0x29c/0x2d0 [xfs]
 ? _raw_spin_unlock_irqrestore+0x1f/0x31
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_process_committed+0x69/0x80 [xfs]
 xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
 xlog_force_shutdown+0xd0/0x110 [xfs]
 xfs_do_force_shutdown+0x5f/0x150 [xfs]
 xlog_ioend_work+0x71/0x80 [xfs]
 process_one_work+0x1c5/0x390
 ? process_one_work+0x390/0x390
 worker_thread+0x30/0x350
 ? process_one_work+0x390/0x390
 kthread+0xe6/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi intel_rapl_msr scsi_transport_iscsi intel_rapl_common ib_umad i10nm_edac x86_pkg_temp_thermal intel_powerclamp ch
 dm_log dm_mod
---[ end trace 0000000000000000 ]---
RIP: 0010:native_queued_spin_lock_slowpath+0x1a4/0x1e0
Code: f3 90 48 8b 0a 48 85 c9 74 f6 eb c5 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 60 b9 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 0a
RSP: 0018:ff407350cf917b48 EFLAGS: 00010206
RAX: 0102e31d0105f07d RBX: ff1a52eeb16dd1c0 RCX: 0000000000002c5a
RDX: ff1a53123fb30d40 RSI: ffffffffb95826f6 RDI: ffffffffb9554147
RBP: ff1a53123fb30d40 R08: ff1a52d3c8684028 R09: 0000000000000121
R10: 00000000000000bf R11: 0000000000000b65 R12: 0000000000640000
R13: 0000000000000008 R14: ff1a52d3d0899000 R15: ff1a52d5bdf07800
FS:  0000000000000000(0000) GS:ff1a53123fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556b2ed15f44 CR3: 000000019a4c0005 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x37200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---



* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster
@ 2022-03-18 16:11 ` Brian Foster
  2022-03-18 21:42   ` Dave Chinner
  2022-03-18 21:48 ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Brian Foster @ 2022-03-18 16:11 UTC (permalink / raw)
  To: linux-xfs

On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> Hi,
> 
> I'm not sure if this is known and/or fixed already, but it didn't look
> familiar so here is a report. I hit a splat when testing Willy's
> prospective folio bookmark change and it turns out it replicates on
> Linus' current master (551acdc3c3d2). This initially reproduced on
> xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> xfs/006, but when I attempted to reproduce the latter a second time I
> hit what looks like the same problem as xfs/264. Both tests seem to
> involve some form of error injection, so possibly the same underlying
> problem. The GPF splat from xfs/264 is below.
> 

Darrick pointed out this [1] series on IRC (particularly the final
patch) so I gave that a try. I _think_ that addresses the GPF issue
given it was nearly 100% reproducible before and I didn't see it in a
few iterations, but once I started a test loop for a longer test I ran
into the aforementioned soft lockup again. A snippet of that one is
below [2]. When this occurs, the task appears to be stuck (i.e. the
warning repeats) indefinitely.

Brian

[1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
[2] Soft lockup warning from xfs/264 with patches from [1] applied:

watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
 dm_log dm_mod
CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 _raw_spin_lock+0x2c/0x30
 xfs_trans_ail_delete+0x2a/0xd0 [xfs]
 xfs_buf_item_done+0x22/0x30 [xfs]
 xfs_buf_ioend+0x71/0x5e0 [xfs]
 xfs_trans_committed_bulk+0x167/0x2c0 [xfs]
 ? enqueue_entity+0x121/0x4d0
 ? enqueue_task_fair+0x417/0x530
 ? resched_curr+0x23/0xc0
 ? check_preempt_curr+0x3f/0x70
 ? _raw_spin_unlock_irqrestore+0x1f/0x31
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_committed+0x29c/0x2d0 [xfs]
 ? _raw_spin_unlock_irqrestore+0x1f/0x31
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_process_committed+0x69/0x80 [xfs]
 xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
 xlog_force_shutdown+0xd0/0x110 [xfs]
 xfs_do_force_shutdown+0x5f/0x150 [xfs]
 xlog_ioend_work+0x71/0x80 [xfs]
 process_one_work+0x1c5/0x390
 ? process_one_work+0x390/0x390
 worker_thread+0x30/0x350
 ? process_one_work+0x390/0x390
 kthread+0xe6/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>



* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 16:11 ` Brian Foster
@ 2022-03-18 21:42   ` Dave Chinner
  2022-03-21 18:35     ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-18 21:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > Hi,
> > 
> > I'm not sure if this is known and/or fixed already, but it didn't look
> > familiar so here is a report. I hit a splat when testing Willy's
> > prospective folio bookmark change and it turns out it replicates on
> > Linus' current master (551acdc3c3d2). This initially reproduced on
> > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > xfs/006, but when I attempted to reproduce the latter a second time I
> > hit what looks like the same problem as xfs/264. Both tests seem to
> > involve some form of error injection, so possibly the same underlying
> > problem. The GPF splat from xfs/264 is below.
> > 
> 
> Darrick pointed out this [1] series on IRC (particularly the final
> patch) so I gave that a try. I _think_ that addresses the GPF issue
> given it was nearly 100% reproducible before and I didn't see it in a
> few iterations, but once I started a test loop for a longer test I ran
> into the aforementioned soft lockup again. A snippet of that one is
> below [2]. When this occurs, the task appears to be stuck (i.e. the
> warning repeats) indefinitely.
> 
> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> 
> watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
> Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
>  dm_log dm_mod
> CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
> Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
> Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
> RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
> RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
> RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
> R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
> R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
> FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  _raw_spin_lock+0x2c/0x30
>  xfs_trans_ail_delete+0x2a/0xd0 [xfs]

So what is running around in a tight circle holding the AIL lock?

Or what assert failed before this while holding the AIL lock?

>  xfs_buf_item_done+0x22/0x30 [xfs]
>  xfs_buf_ioend+0x71/0x5e0 [xfs]
>  xfs_trans_committed_bulk+0x167/0x2c0 [xfs]
>  ? enqueue_entity+0x121/0x4d0
>  ? enqueue_task_fair+0x417/0x530
>  ? resched_curr+0x23/0xc0
>  ? check_preempt_curr+0x3f/0x70
>  ? _raw_spin_unlock_irqrestore+0x1f/0x31
>  ? __wake_up_common_lock+0x87/0xc0
>  xlog_cil_committed+0x29c/0x2d0 [xfs]
>  ? _raw_spin_unlock_irqrestore+0x1f/0x31
>  ? __wake_up_common_lock+0x87/0xc0
>  xlog_cil_process_committed+0x69/0x80 [xfs]
>  xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
>  xlog_force_shutdown+0xd0/0x110 [xfs]

The stack trace here looks mangled - it's missing functions between
xfs_trans_committed_bulk() and xfs_buf_ioend()

xfs_trans_committed_bulk(abort = true)
  xfs_trans_committed_bulk
    lip->li_iop->iop_unpin(remove = true)
      xfs_buf_item_unpin()
        xfs_buf_ioend_fail()
	  xfs_buf_ioend()
	    xfs_buf_item_done()

Unless, of course, the xfs_buf_ioend symbol is wrongly detected
because it's the last function call in xfs_buf_item_unpin(). That
would give a stack of

xfs_trans_committed_bulk(abort = true)
  xfs_trans_committed_bulk
    lip->li_iop->iop_unpin(remove = true)
      xfs_buf_item_unpin()
        xfs_buf_item_done()

Which is the stale inode buffer release path. Which has a problem
as I mention here:

https://lore.kernel.org/linux-xfs/20220317053907.164160-8-david@fromorbit.com/

@@ -720,6 +721,17 @@ xfs_iflush_ail_updates(
 		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
 
+		/*
+		 * dgc: Not sure how this happens, but it happens very
+		 * occassionaly via generic/388.  xfs_iflush_abort() also
+		 * silently handles this same "under writeback but not in AIL at
+		 * shutdown" condition via xfs_trans_ail_delete().
+		 */
+		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
+			ASSERT(xlog_is_shutdown(lip->li_log));
+			continue;
+		}
+

The symptoms that this worked around were double AIL unlocks, AIL
list corruptions, and bp->b_li_list corruptions leading to
xfs_buf_inode_iodone() getting stuck in an endless loop whilst
holding the AIL lock, leading to soft lockups exactly like this one.

I now know what is causing this problem - it is xfs_iflush_abort()
being called from xfs_reclaim_inode() that removes the inode from
the buffer list without holding the buffer lock.

Hence a traversal in xfs_iflush_cluster or xfs_buf_inode_iodone
can store the next inode in the list in n, n then gets aborted by
reclaim and removed from the list, and then the list traversal moves
onto n, and it's now an empty list because it was removed. Hence
the list traversal gets stuck forever on n because n->next = n.....
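
To illustrate the trap (a minimal sketch, not code from the patch below -
the walk is the existing traversal of bp->b_li_list as in
xfs_iflush_cluster(), and the unlocked removal is what the reclaim-side
abort effectively does):

  /* walker, e.g. under the AIL lock in the iodone path */
  list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
          /* ... process lip ... */
  }

  /* reclaim, aborting the inode behind 'n' without holding bp->b_lock */
  list_del_init(&iip->ili_item.li_bio_list);

list_del_init() leaves the removed entry pointing at itself, so once the
walker's saved 'n' is that entry the next iteration sets lip = n, the
"&lip->li_bio_list != &bp->b_li_list" termination check never fires, and
n = list_next_entry(n, li_bio_list) == n from then on, so the walk never
advances.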

If this sort of thing happens in xfs_iflush_ail_updates(), we can
either do a double AIL removal which fires an assert with the AIL
lock held, or we get stuck spinning on n with
the AIL lock held. Either way, they both lead to softlockups on the
AIL lock like this one.

'echo l > sysrq-trigger' is your friend in these situations - you'll
see if there's a process spinning with the lock held on some other
CPU...

This situation is a regression introduced in the async inode reclaim
patch series:

https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/

And is a locking screwup with xfs_iflush_abort() being called
without holding the inode cluster buffer lock. It was a thinko
w.r.t. list removal and traversal using the inode item lock. The bug
has been there since June 2020, and it's only now that we have
peeled back the shutdown onion a couple of layers further that it is
manifesting.

I have a prototype patch (below) to fix this - the locking is not
pretty, but the AIL corruptions and soft lockups have gone away in
my testing only to be replaced with a whole new set of g/388
failures *I have never seen before*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: locking on b_io_list is broken for inodes

From: Dave Chinner <dchinner@redhat.com>

Most buffer io list operations are run with the bp->b_lock held, but
xfs_iflush_abort() can be called without the buffer lock being held
resulting in inodes being removed from the buffer list while other
list operations are occurring. This causes problems with corrupted
bp->b_io_list inode lists during filesystem shutdown, leading to
traversals that never end, double removals from the AIL, etc.

Fix this by passing the buffer to xfs_iflush_abort() if we have
it locked. If the inode is attached to the buffer, we're going to
have to remove it from the buffer list and we'd have to get the
buffer off the inode log item to do that anyway.

If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
then we can determine if the inode has a log item and if it is
attached to a buffer before we do anything else. If it does have an
attached buffer, we can lock it safely (because the inode has a
reference to it) and then perform the inode abort.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c     |   2 +-
 fs/xfs/xfs_inode.c      |   4 +-
 fs/xfs/xfs_inode_item.c | 123 ++++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_inode_item.h |   2 +-
 4 files changed, 87 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4148cdf7ce4a..ec907be2d5b1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -883,7 +883,7 @@ xfs_reclaim_inode(
 	 */
 	if (xlog_is_shutdown(ip->i_mount->m_log)) {
 		xfs_iunpin_wait(ip);
-		xfs_iflush_abort(ip);
+		xfs_iflush_abort(ip, NULL);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index aab55a06ece7..de8815211a7a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3612,7 +3612,7 @@ xfs_iflush_cluster(
 
 	/*
 	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
-	 * can remove itself from the list.
+	 * will remove itself from the list.
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = (struct xfs_inode_log_item *)lip;
@@ -3662,7 +3662,7 @@ xfs_iflush_cluster(
 		 */
 		if (xlog_is_shutdown(mp->m_log)) {
 			xfs_iunpin_wait(ip);
-			xfs_iflush_abort(ip);
+			xfs_iflush_abort(ip, bp);
 			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			error = -EIO;
 			continue;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 11158fa81a09..89fa1fd9ed5b 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -721,17 +721,6 @@ xfs_iflush_ail_updates(
 		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
 
-		/*
-		 * dgc: Not sure how this happens, but it happens very
-		 * occassionaly via generic/388.  xfs_iflush_abort() also
-		 * silently handles this same "under writeback but not in AIL at
-		 * shutdown" condition via xfs_trans_ail_delete().
-		 */
-		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
-			ASSERT(xlog_is_shutdown(lip->li_log));
-			continue;
-		}
-
 		lsn = xfs_ail_delete_one(ailp, lip);
 		if (!tail_lsn && lsn)
 			tail_lsn = lsn;
@@ -799,7 +788,7 @@ xfs_buf_inode_iodone(
 		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 
 		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			xfs_iflush_abort(iip->ili_inode);
+			xfs_iflush_abort(iip->ili_inode, bp);
 			continue;
 		}
 		if (!iip->ili_last_fields)
@@ -834,44 +823,98 @@ xfs_buf_inode_io_fail(
 }
 
 /*
- * This is the inode flushing abort routine.  It is called when
- * the filesystem is shutting down to clean up the inode state.  It is
- * responsible for removing the inode item from the AIL if it has not been
- * re-logged and clearing the inode's flush state.
+ * Abort flushing the inode.
+ *
+ * There are two cases where this is called. The first is when the inode cluster
+ * buffer has been removed and the inodes attached to it have been marked
+ * XFS_ISTALE. Inode cluster buffer IO completion will be called on the buffer to mark the stale
+ * inodes clean and remove them from the AIL without doing IO on them. The inode
+ * should always have a log item attached if it is ISTALE, and we should always
+ * be passed the locked buffer the inodes are attached to.
+ *
+ * The second case is log shutdown. When the log has been shut down, we need
+ * to abort any flush that is in progress, mark the inode clean and remove it
+ * from the AIL. We may get passed clean inodes without log items, as well as
+ * clean inodes with log items that aren't attached to cluster buffers. And
+ * depending on where we are called from, we may or may not have a locked
+ * buffer passed to us.
+ *
+ * If we don't have a locked buffer, we try to get it from the inode log item.
+ * If there is a buffer attached to the ili, then we have a reference to the
+ * buffer and we can safely lock it, then remove the inode from the buffer.
  */
 void
 xfs_iflush_abort(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_buf		*locked_bp)
 {
 	struct xfs_inode_log_item *iip = ip->i_itemp;
-	struct xfs_buf		*bp = NULL;
+	struct xfs_buf		*ibp;
 
-	if (iip) {
-		/*
-		 * Clear the failed bit before removing the item from the AIL so
-		 * xfs_trans_ail_delete() doesn't try to clear and release the
-		 * buffer attached to the log item before we are done with it.
-		 */
-		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
-		xfs_trans_ail_delete(&iip->ili_item, 0);
+	if (!iip) {
+		/* clean inode, nothing to do */
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		return;
+	}
 
-		/*
-		 * Clear the inode logging fields so no more flushes are
-		 * attempted.
-		 */
-		spin_lock(&iip->ili_lock);
-		iip->ili_last_fields = 0;
-		iip->ili_fields = 0;
-		iip->ili_fsync_fields = 0;
-		iip->ili_flush_lsn = 0;
-		bp = iip->ili_item.li_buf;
-		iip->ili_item.li_buf = NULL;
-		list_del_init(&iip->ili_item.li_bio_list);
+	/*
+	 * Capture the associated buffer and lock it if the caller didn't
+	 * pass us the locked buffer to begin with.
+	 */
+	spin_lock(&iip->ili_lock);
+	ibp = iip->ili_item.li_buf;
+	if (!locked_bp && ibp) {
+		xfs_buf_hold(ibp);
 		spin_unlock(&iip->ili_lock);
+		xfs_buf_lock(ibp);
+		spin_lock(&iip->ili_lock);
+		if (!iip->ili_item.li_buf) {
+			/*
+			 * Raced with another removal, hold the only reference
+			 * to ibp now.
+			 */
+			ASSERT(list_empty(&iip->ili_item.li_bio_list));
+		} else {
+			/*
+			 * Got two references to ibp, drop one now. The other
+			 * gets dropped when we are done.
+			 */
+			ASSERT(iip->ili_item.li_buf == ibp);
+			xfs_buf_rele(ibp);
+		}
+	} else {
+		ASSERT(!ibp || ibp == locked_bp);
 	}
+
+	/*
+	 * Clear the inode logging fields so no more flushes are attempted.
+	 * If we are on a buffer list, it is now safe to remove it because
+	 * the buffer is guaranteed to be locked.
+	 */
+	iip->ili_last_fields = 0;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	iip->ili_flush_lsn = 0;
+	iip->ili_item.li_buf = NULL;
+	list_del_init(&iip->ili_item.li_bio_list);
+	spin_unlock(&iip->ili_lock);
+
+	/*
+	 * Clear the failed bit before removing the item from the AIL so
+	 * xfs_trans_ail_delete() doesn't try to clear and release the buffer
+	 * attached to the log item before we are done with it.
+	 */
+	clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
+	xfs_trans_ail_delete(&iip->ili_item, 0);
+
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
-	if (bp)
-		xfs_buf_rele(bp);
+
+	/* we can now release the buffer reference the inode log item held. */
+	if (ibp) {
+		if (!locked_bp)
+			xfs_buf_unlock(ibp);
+		xfs_buf_rele(ibp);
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 1a302000d604..01e5845c7f3d 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -43,7 +43,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip)
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
-extern void xfs_iflush_abort(struct xfs_inode *);
+extern void xfs_iflush_abort(struct xfs_inode *, struct xfs_buf *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
 


* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster
  2022-03-18 16:11 ` Brian Foster
@ 2022-03-18 21:48 ` Dave Chinner
  2022-03-18 21:51   ` Darrick J. Wong
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-18 21:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> Hi,
> 
> I'm not sure if this is known and/or fixed already, but it didn't look
> familiar so here is a report. I hit a splat when testing Willy's
> prospective folio bookmark change and it turns out it replicates on
> Linus' current master (551acdc3c3d2). This initially reproduced on
> xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> xfs/006, but when I attempted to reproduce the latter a second time I
> hit what looks like the same problem as xfs/264. Both tests seem to
> involve some form of error injection, so possibly the same underlying
> problem. The GPF splat from xfs/264 is below.

On a side note, I'm wondering if we should add xfs/006 and xfs/264
to the recoveryloop group - they do a shutdown under load and a
followup mount to ensure the filesystem gets recovered before
the test ends and the fs is checked, so while they don't explicitly
test recovery, they do exercise it....

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 21:48 ` Dave Chinner
@ 2022-03-18 21:51   ` Darrick J. Wong
  2022-03-18 22:39     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-03-18 21:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Sat, Mar 19, 2022 at 08:48:31AM +1100, Dave Chinner wrote:
> On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > Hi,
> > 
> > I'm not sure if this is known and/or fixed already, but it didn't look
> > familiar so here is a report. I hit a splat when testing Willy's
> > prospective folio bookmark change and it turns out it replicates on
> > Linus' current master (551acdc3c3d2). This initially reproduced on
> > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > xfs/006, but when I attempted to reproduce the latter a second time I
> > hit what looks like the same problem as xfs/264. Both tests seem to
> > involve some form of error injection, so possibly the same underlying
> > problem. The GPF splat from xfs/264 is below.
> 
> On a side note, I'm wondering if we should add xfs/006 and xfs/264
> to the recoveryloop group - they do a shutdown under load and a
> followup mount to ensure the filesystem gets recovered before
> the test ends and the fs is checked, so while they don't explicitly
> test recovery, they do exercise it....
> 
> Thoughts?

Someone else asked about this the other day, and I proposed a 'recovery'
group for tests that don't run in a loop.

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 21:51   ` Darrick J. Wong
@ 2022-03-18 22:39     ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2022-03-18 22:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Fri, Mar 18, 2022 at 02:51:33PM -0700, Darrick J. Wong wrote:
> On Sat, Mar 19, 2022 at 08:48:31AM +1100, Dave Chinner wrote:
> > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > > Hi,
> > > 
> > > I'm not sure if this is known and/or fixed already, but it didn't look
> > > familiar so here is a report. I hit a splat when testing Willy's
> > > prospective folio bookmark change and it turns out it replicates on
> > > Linus' current master (551acdc3c3d2). This initially reproduced on
> > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > > xfs/006, but when I attempted to reproduce the latter a second time I
> > > hit what looks like the same problem as xfs/264. Both tests seem to
> > > involve some form of error injection, so possibly the same underlying
> > > problem. The GPF splat from xfs/264 is below.
> > 
> > On a side note, I'm wondering if we should add xfs/006 and xfs/264
> > to the recoveryloop group - they do a shutdown under load and a
> > followup mount to ensure the filesystem gets recovered before
> > the test ends and the fs is checked, so while they don't explicitly
> > test recovery, they do exercise it....
> > 
> > Thoughts?
> 
> Someone else asked about this the other day, and I proposed a 'recovery'
> group for tests that don't run in a loop.

That distinction is largely meaningless to me.

I tend to think of "recoveryloop" as the recovery tests I want to
run in a long running loop via iteration, e.g. something like
'check -I 250 -g recoveryloop'. I don't really care if the tests
loop internally doing multiple recoveries - I'm wanting to run the
recovery tests that reproduce problems frequently in a tight loop
repeatedly.

Hence I think we should just lump the shutdown+recovery tests all in
one group so that when we want to exercise shutdown/recovery we just
have one single group to run repeatedly in a loop. Whether that
group is named 'recovery' or 'recoveryloop' is largely irrelevant to
me.
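
As a concrete (purely illustrative) example - the group list below is an
assumption apart from 'recoveryloop' itself, since fstests now declares
group membership in each test's _begin_fstest line:

  # tests/xfs/264 (and similarly xfs/006): tag the test with the group
  _begin_fstest auto quick ... recoveryloop

  # then exercise just that group in a tight loop
  ./check -I 250 -g recoveryloop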

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 21:42   ` Dave Chinner
@ 2022-03-21 18:35     ` Brian Foster
  2022-03-21 22:14       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2022-03-21 18:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote:
> On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > > Hi,
> > > 
> > > I'm not sure if this is known and/or fixed already, but it didn't look
> > > familiar so here is a report. I hit a splat when testing Willy's
> > > prospective folio bookmark change and it turns out it replicates on
> > > Linus' current master (551acdc3c3d2). This initially reproduced on
> > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > > xfs/006, but when I attempted to reproduce the latter a second time I
> > > hit what looks like the same problem as xfs/264. Both tests seem to
> > > involve some form of error injection, so possibly the same underlying
> > > problem. The GPF splat from xfs/264 is below.
> > > 
> > 
> > Darrick pointed out this [1] series on IRC (particularly the final
> > patch) so I gave that a try. I _think_ that addresses the GPF issue
> > given it was nearly 100% reproducible before and I didn't see it in a
> > few iterations, but once I started a test loop for a longer test I ran
> > into the aforementioned soft lockup again. A snippet of that one is
> > below [2]. When this occurs, the task appears to be stuck (i.e. the
> > warning repeats) indefinitely.
> > 
> > Brian
> > 
> > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> > [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> > 
> > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
> > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
> >  dm_log dm_mod
> > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
> > Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
> > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
> > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
> > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
> > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
> > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
> > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
> > FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> >  <TASK>
> >  _raw_spin_lock+0x2c/0x30
> >  xfs_trans_ail_delete+0x2a/0xd0 [xfs]
> 
> So what is running around in a tight circle holding the AIL lock?
> 
> Or what assert failed before this while holding the AIL lock?
> 

I don't have much information beyond the test and resulting bug. There
are no assert failures before the bug occurs. An active CPU task dump
shows the stack from the soft lockup warning, the task running the dump
itself, and all other (94/96) CPUs appear idle. I tried the appended
patch on top of latest for-next (which now includes the other log
shutdown fix) and the problem still occurs.

Brian

> >  xfs_buf_item_done+0x22/0x30 [xfs]
> >  xfs_buf_ioend+0x71/0x5e0 [xfs]
> >  xfs_trans_committed_bulk+0x167/0x2c0 [xfs]
> >  ? enqueue_entity+0x121/0x4d0
> >  ? enqueue_task_fair+0x417/0x530
> >  ? resched_curr+0x23/0xc0
> >  ? check_preempt_curr+0x3f/0x70
> >  ? _raw_spin_unlock_irqrestore+0x1f/0x31
> >  ? __wake_up_common_lock+0x87/0xc0
> >  xlog_cil_committed+0x29c/0x2d0 [xfs]
> >  ? _raw_spin_unlock_irqrestore+0x1f/0x31
> >  ? __wake_up_common_lock+0x87/0xc0
> >  xlog_cil_process_committed+0x69/0x80 [xfs]
> >  xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
> >  xlog_force_shutdown+0xd0/0x110 [xfs]
> 
> The stack trace here looks mangled - it's missing functions between
> xfs_trans_committed_bulk() and xfs_buf_ioend()
> 
> xfs_trans_committed_bulk(abort = true)
>   xfs_trans_committed_bulk
>     lip->li_iop->iop_unpin(remove = true)
>       xfs_buf_item_unpin()
>         xfs_buf_ioend_fail()
> 	  xfs_buf_ioend()
> 	    xfs_buf_item_done()
> 
> Unless, of course, the xfs_buf_ioend symbol is wrongly detected
> because it's the last function call in xfs_buf_item_unpin(). That
> would give a stack of
> 
> xfs_trans_committed_bulk(abort = true)
>   xfs_trans_committed_bulk
>     lip->li_iop->iop_unpin(remove = true)
>       xfs_buf_item_unpin()
>         xfs_buf_item_done()
> 
> Which is the stale inode buffer release path. Which has a problem
> as I mention here:
> 
> https://lore.kernel.org/linux-xfs/20220317053907.164160-8-david@fromorbit.com/
> 
> @@ -720,6 +721,17 @@ xfs_iflush_ail_updates(
>  		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
>  			continue;
>  
> +		/*
> +		 * dgc: Not sure how this happens, but it happens very
> +		 * occassionaly via generic/388.  xfs_iflush_abort() also
> +		 * silently handles this same "under writeback but not in AIL at
> +		 * shutdown" condition via xfs_trans_ail_delete().
> +		 */
> +		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
> +			ASSERT(xlog_is_shutdown(lip->li_log));
> +			continue;
> +		}
> +
> 
> The symptoms that this worked around were double AIL unlocks, AIL
> list corruptions, and bp->b_li_list corruptions leading to
> xfs_buf_inode_iodone() getting stuck in an endless loop whilst
> holding the AIL lock, leading to soft lockups exactly like this one.
> 
> I now know what is causing this problem - it is xfs_iflush_abort()
> being called from xfs_reclaim_inode() that removes the inode from
> the buffer list without holding the buffer lock.
> 
> Hence a traversal in xfs_iflush_cluster or xfs_buf_inode_iodone
> can store the next inode in the list in n, n then gets aborted by
> reclaim and removed from the list, and then the list traversal moves
> onto n, and it's now an empty list because it was removed. Hence
> the list traversal gets stuck forever on n because n->next = n.....
> 
> If this sort of thing happens in xfs_iflush_ail_updates(), we can
> either do a double AIL removal which fires an assert with the AIL
> lock held, or we get stuck spinning on n with
> the AIL lock held. Either way, they both lead to softlockups on the
> AIL lock like this one.
> 
> 'echo l > sysrq-trigger' is your friend in these situations - you'll
> see if there's a process spinning with the lock held on some other
> CPU...
> 
> This situation is a regression introduced in the async inode reclaim
> patch series:
> 
> https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/
> 
> And is a locking screwup with xfs_iflush_abort() being called
> without holding the inode cluster buffer lock. It was a thinko
> w.r.t. list removal and traversal using the inode item lock. The bug
> has been there since June 2020, and it's only now that we have
> peeled back the shutdown onion a couple of layers further that it is
> manifesting.
> 
> I have a prototype patch (below) to fix this - the locking is not
> pretty, but the AIL corruptions and soft lockups have gone away in
> my testing only to be replaced with a whole new set of g/388
> failures *I have never seen before*.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: locking on b_io_list is broken for inodes
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Most buffer io list operations are run with the bp->b_lock held, but
> xfs_iflush_abort() can be called without the buffer lock being held
> resulting in inodes being removed from the buffer list while other
> list operations are occurring. This causes problems with corrupted
> bp->b_io_list inode lists during filesystem shutdown, leading to
> traversals that never end, double removals from the AIL, etc.
> 
> Fix this by passing the buffer to xfs_iflush_abort() if we have
> it locked. If the inode is attached to the buffer, we're going to
> have to remove it from the buffer list and we'd have to get the
> buffer off the inode log item to do that anyway.
> 
> If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
> then we can determine if the inode has a log item and if it is
> attached to a buffer before we do anything else. If it does have an
> attached buffer, we can lock it safely (because the inode has a
> reference to it) and then perform the inode abort.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c     |   2 +-
>  fs/xfs/xfs_inode.c      |   4 +-
>  fs/xfs/xfs_inode_item.c | 123 ++++++++++++++++++++++++++++++++----------------
>  fs/xfs/xfs_inode_item.h |   2 +-
>  4 files changed, 87 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 4148cdf7ce4a..ec907be2d5b1 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -883,7 +883,7 @@ xfs_reclaim_inode(
>  	 */
>  	if (xlog_is_shutdown(ip->i_mount->m_log)) {
>  		xfs_iunpin_wait(ip);
> -		xfs_iflush_abort(ip);
> +		xfs_iflush_abort(ip, NULL);
>  		goto reclaim;
>  	}
>  	if (xfs_ipincount(ip))
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index aab55a06ece7..de8815211a7a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3612,7 +3612,7 @@ xfs_iflush_cluster(
>  
>  	/*
>  	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
> -	 * can remove itself from the list.
> +	 * will remove itself from the list.
>  	 */
>  	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
>  		iip = (struct xfs_inode_log_item *)lip;
> @@ -3662,7 +3662,7 @@ xfs_iflush_cluster(
>  		 */
>  		if (xlog_is_shutdown(mp->m_log)) {
>  			xfs_iunpin_wait(ip);
> -			xfs_iflush_abort(ip);
> +			xfs_iflush_abort(ip, bp);
>  			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  			error = -EIO;
>  			continue;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 11158fa81a09..89fa1fd9ed5b 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -721,17 +721,6 @@ xfs_iflush_ail_updates(
>  		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
>  			continue;
>  
> -		/*
> -		 * dgc: Not sure how this happens, but it happens very
> -		 * occassionaly via generic/388.  xfs_iflush_abort() also
> -		 * silently handles this same "under writeback but not in AIL at
> -		 * shutdown" condition via xfs_trans_ail_delete().
> -		 */
> -		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
> -			ASSERT(xlog_is_shutdown(lip->li_log));
> -			continue;
> -		}
> -
>  		lsn = xfs_ail_delete_one(ailp, lip);
>  		if (!tail_lsn && lsn)
>  			tail_lsn = lsn;
> @@ -799,7 +788,7 @@ xfs_buf_inode_iodone(
>  		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
>  
>  		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> -			xfs_iflush_abort(iip->ili_inode);
> +			xfs_iflush_abort(iip->ili_inode, bp);
>  			continue;
>  		}
>  		if (!iip->ili_last_fields)
> @@ -834,44 +823,98 @@ xfs_buf_inode_io_fail(
>  }
>  
>  /*
> - * This is the inode flushing abort routine.  It is called when
> - * the filesystem is shutting down to clean up the inode state.  It is
> - * responsible for removing the inode item from the AIL if it has not been
> - * re-logged and clearing the inode's flush state.
> + * Abort flushing the inode.
> + *
> + * There are two cases where this is called. The first is when the inode cluster
> + * buffer has been removed and the inodes attached to it have been marked
> + * XFS_ISTALE. Inode cluster buffer IO completion will be called on the buffer to mark the stale
> + * inodes clean and remove them from the AIL without doing IO on them. The inode
> + * should always have a log item attached if it is ISTALE, and we should always
> + * be passed the locked buffer the inodes are attached to.
> + *
> + * The second case is log shutdown. When the log has been shut down, we need
> + * to abort any flush that is in progress, mark the inode clean and remove it
> + * from the AIL. We may get passed clean inodes without log items, as well as
> + * clean inodes with log items that aren't attached to cluster buffers. And
> + * depending on where we are called from, we may or may not have a locked
> + * buffer passed to us.
> + *
> + * If we don't have a locked buffer, we try to get it from the inode log item.
> + * If there is a buffer attached to the ili, then we have a reference to the
> + * buffer and we can safely lock it, then remove the inode from the buffer.
>   */
>  void
>  xfs_iflush_abort(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	struct xfs_buf		*locked_bp)
>  {
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
> -	struct xfs_buf		*bp = NULL;
> +	struct xfs_buf		*ibp;
>  
> -	if (iip) {
> -		/*
> -		 * Clear the failed bit before removing the item from the AIL so
> -		 * xfs_trans_ail_delete() doesn't try to clear and release the
> -		 * buffer attached to the log item before we are done with it.
> -		 */
> -		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
> -		xfs_trans_ail_delete(&iip->ili_item, 0);
> +	if (!iip) {
> +		/* clean inode, nothing to do */
> +		xfs_iflags_clear(ip, XFS_IFLUSHING);
> +		return;
> +	}
>  
> -		/*
> -		 * Clear the inode logging fields so no more flushes are
> -		 * attempted.
> -		 */
> -		spin_lock(&iip->ili_lock);
> -		iip->ili_last_fields = 0;
> -		iip->ili_fields = 0;
> -		iip->ili_fsync_fields = 0;
> -		iip->ili_flush_lsn = 0;
> -		bp = iip->ili_item.li_buf;
> -		iip->ili_item.li_buf = NULL;
> -		list_del_init(&iip->ili_item.li_bio_list);
> +	/*
> +	 * Capture the associated buffer and lock it if the caller didn't
> +	 * pass us the locked buffer to begin with.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	ibp = iip->ili_item.li_buf;
> +	if (!locked_bp && ibp) {
> +		xfs_buf_hold(ibp);
>  		spin_unlock(&iip->ili_lock);
> +		xfs_buf_lock(ibp);
> +		spin_lock(&iip->ili_lock);
> +		if (!iip->ili_item.li_buf) {
> +			/*
> +			 * Raced with another removal, hold the only reference
> +			 * to ibp now.
> +			 */
> +			ASSERT(list_empty(&iip->ili_item.li_bio_list));
> +		} else {
> +			/*
> +			 * Got two references to ibp, drop one now. The other
> +			 * gets dropped when we are done.
> +			 */
> +			ASSERT(iip->ili_item.li_buf == ibp);
> +			xfs_buf_rele(ibp);
> +		}
> +	} else {
> +		ASSERT(!ibp || ibp == locked_bp);
>  	}
> +
> +	/*
> +	 * Clear the inode logging fields so no more flushes are attempted.
> +	 * If we are on a buffer list, it is now safe to remove it because
> +	 * the buffer is guaranteed to be locked.
> +	 */
> +	iip->ili_last_fields = 0;
> +	iip->ili_fields = 0;
> +	iip->ili_fsync_fields = 0;
> +	iip->ili_flush_lsn = 0;
> +	iip->ili_item.li_buf = NULL;
> +	list_del_init(&iip->ili_item.li_bio_list);
> +	spin_unlock(&iip->ili_lock);
> +
> +	/*
> +	 * Clear the failed bit before removing the item from the AIL so
> +	 * xfs_trans_ail_delete() doesn't try to clear and release the buffer
> +	 * attached to the log item before we are done with it.
> +	 */
> +	clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
> +	xfs_trans_ail_delete(&iip->ili_item, 0);
> +
>  	xfs_iflags_clear(ip, XFS_IFLUSHING);
> -	if (bp)
> -		xfs_buf_rele(bp);
> +
> +	/* we can now release the buffer reference the inode log item held. */
> +	if (ibp) {
> +		if (!locked_bp)
> +			xfs_buf_unlock(ibp);
> +		xfs_buf_rele(ibp);
> +	}
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 1a302000d604..01e5845c7f3d 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -43,7 +43,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip)
>  
>  extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
>  extern void xfs_inode_item_destroy(struct xfs_inode *);
> -extern void xfs_iflush_abort(struct xfs_inode *);
> +extern void xfs_iflush_abort(struct xfs_inode *, struct xfs_buf *);
>  extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
>  					 struct xfs_inode_log_format *);
>  
> 



* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-21 18:35     ` Brian Foster
@ 2022-03-21 22:14       ` Dave Chinner
  2022-03-22 14:33         ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-21 22:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote:
> On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote:
> > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > > > Hi,
> > > > 
> > > > I'm not sure if this is known and/or fixed already, but it didn't look
> > > > familiar so here is a report. I hit a splat when testing Willy's
> > > > prospective folio bookmark change and it turns out it replicates on
> > > > Linus' current master (551acdc3c3d2). This initially reproduced on
> > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > > > xfs/006, but when I attempted to reproduce the latter a second time I
> > > > hit what looks like the same problem as xfs/264. Both tests seem to
> > > > involve some form of error injection, so possibly the same underlying
> > > > problem. The GPF splat from xfs/264 is below.
> > > > 
> > > 
> > > Darrick pointed out this [1] series on IRC (particularly the final
> > > patch) so I gave that a try. I _think_ that addresses the GPF issue
> > > given it was nearly 100% reproducible before and I didn't see it in a
> > > few iterations, but once I started a test loop for a longer test I ran
> > > into the aforementioned soft lockup again. A snippet of that one is
> > > below [2]. When this occurs, the task appears to be stuck (i.e. the
> > > warning repeats) indefinitely.
> > > 
> > > Brian
> > > 
> > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> > > [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> > > 
> > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
> > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
> > >  dm_log dm_mod
> > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
> > > Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
> > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
> > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
> > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
> > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
> > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
> > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
> > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
> > > FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > PKRU: 55555554
> > > Call Trace:
> > >  <TASK>
> > >  _raw_spin_lock+0x2c/0x30
> > >  xfs_trans_ail_delete+0x2a/0xd0 [xfs]
> > 
> > So what is running around in a tight circle holding the AIL lock?
> > 
> > Or what assert failed before this while holding the AIL lock?
> > 
> 
> I don't have much information beyond the test and resulting bug. There
> are no assert failures before the bug occurs. An active CPU task dump
> shows the stack from the soft lockup warning, the task running the dump
> itself, and all other (94/96) CPUs appear idle. I tried the appended
> patch on top of latest for-next (which now includes the other log
> shutdown fix) and the problem still occurs.

Yeah, I got another assert fail in xfs_ail_check() last night from:

  xfs_ail_check+0xa8/0x180
  xfs_ail_delete_one+0x3b/0xf0
  xfs_buf_inode_iodone+0x329/0x3f0
  xfs_buf_ioend+0x1f8/0x530
  xfs_buf_ioend_work+0x15/0x20

Finding an item that didn't have IN_AIL set on it. I think I've
found another mount vs log shutdown case that can result in dirty
aborted inodes that aren't in the AIL being flushed and bad things
happen when we then try to remove them from the AIL and they aren't
there...

Whether that is this problem or not, I don't know, but the assert
failures do end up with other threads spinning on the AIL lock,
because the assert fires while the AIL lock is held...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-21 22:14       ` Dave Chinner
@ 2022-03-22 14:33         ` Brian Foster
  2022-03-22 21:41           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2022-03-22 14:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 22, 2022 at 09:14:33AM +1100, Dave Chinner wrote:
> On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote:
> > On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> > > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > > > > Hi,
> > > > > 
> > > > > I'm not sure if this is known and/or fixed already, but it didn't look
> > > > > familiar so here is a report. I hit a splat when testing Willy's
> > > > > prospective folio bookmark change and it turns out it replicates on
> > > > > Linus' current master (551acdc3c3d2). This initially reproduced on
> > > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > > > > xfs/006, but when I attempted to reproduce the latter a second time I
> > > > > hit what looks like the same problem as xfs/264. Both tests seem to
> > > > > involve some form of error injection, so possibly the same underlying
> > > > > problem. The GPF splat from xfs/264 is below.
> > > > > 
> > > > 
> > > > Darrick pointed out this [1] series on IRC (particularly the final
> > > > patch) so I gave that a try. I _think_ that addresses the GPF issue
> > > > given it was nearly 100% reproducible before and I didn't see it in a
> > > > few iterations, but once I started a test loop for a longer test I ran
> > > > into the aforementioned soft lockup again. A snippet of that one is
> > > > below [2]. When this occurs, the task appears to be stuck (i.e. the
> > > > warning repeats) indefinitely.
> > > > 
> > > > Brian
> > > > 
> > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> > > > [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> > > > 
> > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
> > > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
> > > >  dm_log dm_mod
> > > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
> > > > Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> > > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> > > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
> > > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
> > > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
> > > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
> > > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
> > > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
> > > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
> > > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
> > > > FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > PKRU: 55555554
> > > > Call Trace:
> > > >  <TASK>
> > > >  _raw_spin_lock+0x2c/0x30
> > > >  xfs_trans_ail_delete+0x2a/0xd0 [xfs]
> > > 
> > > So what is running around in a tight circle holding the AIL lock?
> > > 
> > > Or what assert failed before this while holding the AIL lock?
> > > 
> > 
> > I don't have much information beyond the test and resulting bug. There
> > are no assert failures before the bug occurs. An active CPU task dump
> > shows the stack from the soft lockup warning, the task running the dump
> > itself, and all other (94/96) CPUs appear idle. I tried the appended
> > patch on top of latest for-next (which now includes the other log
> > shutdown fix) and the problem still occurs.
> 
> Yeah, I got another assert fail in xfs_ail_check() last night from:
> 
>   xfs_ail_check+0xa8/0x180
>   xfs_ail_delete_one+0x3b/0xf0
>   xfs_buf_inode_iodone+0x329/0x3f0
>   xfs_buf_ioend+0x1f8/0x530
>   xfs_buf_ioend_work+0x15/0x20
> 
> Finding an item that didn't have IN_AIL set on it. I think I've
> found another mount vs log shutdown case that can result in dirty
> aborted inodes that aren't in the AIL being flushed and bad things
> happen when we then try to remove them from the AIL and they aren't
> there...
> 
> Whether that is this problem or not, I don't know, but the assert
> failures do end up with other threads spinning on the AIL lock
> because of the assert failures under the AIL lock...
> 

Some updates.. I tried to reproduce with lock debugging and whatnot
enabled but the problem was no longer reproducible, probably due to
disruption of timing. When I went back to a reproducing kernel, I ended
up seeing a page fault variant crash via xfs/006 instead of the soft
lockup. This occurred shortly after the unmount attempt started, so I
retried again with KASAN enabled but ran into the same heisenbug
behavior. From there, I replaced the generic debug mechanisms with a
custom mount flag that is set in xfs_trans_ail_destroy() immediately
after freeing the xfs_ail object, plus an assert check on entry to
xfs_trans_ail_delete(). The next xfs/006 failure produced the splat
below [2]. I suspect that is the smoking gun [1] and perhaps we've
somehow broken serialization between in-core object teardown and
outstanding log I/O completion after the filesystem happens to shut
down.
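
Roughly, the instrumentation looks like this (a reconstructed sketch
rather than the exact diff; m_freed_ail is the debug-only flag added
to struct xfs_mount for this experiment):

void
xfs_trans_ail_destroy(
        struct xfs_mount        *mp)
{
        struct xfs_ail          *ailp = mp->m_ail;

        kthread_stop(ailp->ail_task);
        kmem_free(ailp);
        mp->m_freed_ail = true;         /* AIL is gone from here on */
}

void
xfs_trans_ail_delete(
        struct xfs_log_item     *lip,
        int                     shutdown_type)
{
        /* debug-only check; the rest of the function is unchanged */
        ASSERT(!lip->li_mountp->m_freed_ail);
        ...
}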

Brian

[1] I've not reproduced the soft lockup variant with this hack in place,
but if the problem is a UAF then I'd expect the resulting behavior to be
somewhat erratic/unpredictable.

[2] xfs/006 custom assert and BUG splat:

XFS: Assertion failed: !mp->m_freed_ail, file: fs/xfs/xfs_trans_ail.c, line: 879
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1289 at fs/xfs/xfs_message.c:97 asswarn+0x1a/0x1d [xfs]
Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth
 dm_log dm_mod
CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S                5.17.0-rc6+ #29
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
RIP: 0010:asswarn+0x1a/0x1d [xfs]
Code: e8 e8 13 8c c4 48 83 c4 60 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 d8 c0 e8 19 fd ff ff <0f> 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0
RSP: 0018:ff5757be8f37fb70 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ff4847238f4b2940 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc0d7e076
RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 0000000200002600
R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940
FS:  0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f55c21f0f70 CR3: 0000000142c9c002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 xfs_trans_ail_delete+0x102/0x130 [xfs]
 xfs_buf_item_done+0x22/0x30 [xfs]
 xfs_buf_ioend+0x73/0x4d0 [xfs]
 xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
 ? enqueue_task_fair+0x91/0x680
 ? remove_entity_load_avg+0x2e/0x70
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_committed+0x2a9/0x300 [xfs]
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_process_committed+0x69/0x80 [xfs]
 xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
 xlog_force_shutdown+0xdf/0x150 [xfs]
 xfs_do_force_shutdown+0x5f/0x150 [xfs]
 xlog_ioend_work+0x71/0x80 [xfs]
 process_one_work+0x1c5/0x390
 ? process_one_work+0x390/0x390
 worker_thread+0x30/0x350
 ? process_one_work+0x390/0x390
 kthread+0xd7/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
BUG: unable to handle page fault for address: 0000000000030840
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 281cd3067 P4D 0 
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S      W         5.17.0-rc6+ #29
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
RIP: 0010:native_queued_spin_lock_slowpath+0x173/0x1b0
Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 08 03 00 48 03 04 f5 e0 5a ec 85 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
RSP: 0018:ff5757be8f37fb68 EFLAGS: 00010202
RAX: 0000000000030840 RBX: ff4847238f4b2940 RCX: 00000000000c0000
RDX: ff484761ffc70800 RSI: 0000000000000759 RDI: ff4847341d6a0fc0
RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ff4847341d6a0fc0
R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940
FS:  0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000030840 CR3: 0000000142c9c002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 _raw_spin_lock+0x21/0x30
 xfs_trans_ail_delete+0x38/0x130 [xfs]
 xfs_buf_item_done+0x22/0x30 [xfs]
 xfs_buf_ioend+0x73/0x4d0 [xfs]
 xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
 ? enqueue_task_fair+0x91/0x680
 ? remove_entity_load_avg+0x2e/0x70
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_committed+0x2a9/0x300 [xfs]
 ? __wake_up_common_lock+0x87/0xc0
 xlog_cil_process_committed+0x69/0x80 [xfs]
 xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
 xlog_force_shutdown+0xdf/0x150 [xfs]
 xfs_do_force_shutdown+0x5f/0x150 [xfs]
 xlog_ioend_work+0x71/0x80 [xfs]
 process_one_work+0x1c5/0x390
 ? process_one_work+0x390/0x390
 worker_thread+0x30/0x350
 ? process_one_work+0x390/0x390
 kthread+0xd7/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth
 dm_log dm_mod
CR2: 0000000000030840
---[ end trace 0000000000000000 ]---
RIP: 0010:native_queued_spin_lock_slowpath+0x173/0x1b0
Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 08 03 00 48 03 04 f5 e0 5a ec 85 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
RSP: 0018:ff5757be8f37fb68 EFLAGS: 00010202
RAX: 0000000000030840 RBX: ff4847238f4b2940 RCX: 00000000000c0000
RDX: ff484761ffc70800 RSI: 0000000000000759 RDI: ff4847341d6a0fc0
RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ff4847341d6a0fc0
R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940
FS:  0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000030840 CR3: 0000000142c9c002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x3c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-22 14:33         ` Brian Foster
@ 2022-03-22 21:41           ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2022-03-22 21:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 22, 2022 at 10:33:00AM -0400, Brian Foster wrote:
> On Tue, Mar 22, 2022 at 09:14:33AM +1100, Dave Chinner wrote:
> > On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote:
> > > On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> > > > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I'm not sure if this is known and/or fixed already, but it didn't look
> > > > > > familiar so here is a report. I hit a splat when testing Willy's
> > > > > > prospective folio bookmark change and it turns out it replicates on
> > > > > > Linus' current master (551acdc3c3d2). This initially reproduced on
> > > > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > > > > > xfs/006, but when I attempted to reproduce the latter a second time I
> > > > > > hit what looks like the same problem as xfs/264. Both tests seem to
> > > > > > involve some form of error injection, so possibly the same underlying
> > > > > > problem. The GPF splat from xfs/264 is below.
> > > > > > 
> > > > > 
> > > > > Darrick pointed out this [1] series on IRC (particularly the final
> > > > > patch) so I gave that a try. I _think_ that addresses the GPF issue
> > > > > given it was nearly 100% reproducible before and I didn't see it in a
> > > > > few iterations, but once I started a test loop for a longer test I ran
> > > > > into the aforementioned soft lockup again. A snippet of that one is
> > > > > below [2]. When this occurs, the task appears to be stuck (i.e. the
> > > > > warning repeats) indefinitely.
> > > > > 
> > > > > Brian
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> > > > > [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> > > > > 
> > > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881]
> > > > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash
> > > > >  dm_log dm_mod
> > > > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S           L    5.17.0-rc8+ #17
> > > > > Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> > > > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> > > > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0
> > > > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
> > > > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246
> > > > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719
> > > > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147
> > > > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc
> > > > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000
> > > > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400
> > > > > FS:  0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > PKRU: 55555554
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  _raw_spin_lock+0x2c/0x30
> > > > >  xfs_trans_ail_delete+0x2a/0xd0 [xfs]
> > > > 
> > > > So what is running around in a tight circle holding the AIL lock?
> > > > 
> > > > Or what assert failed before this while holding the AIL lock?
> > > > 
> > > 
> > > I don't have much information beyond the test and resulting bug. There
> > > are no assert failures before the bug occurs. An active CPU task dump
> > > shows the stack from the soft lockup warning, the task running the dump
> > > itself, and all other (94/96) CPUs appear idle. I tried the appended
> > > patch on top of latest for-next (which now includes the other log
> > > shutdown fix) and the problem still occurs.
> > 
> > Yeah, I got another assert fail in xfs_ail_check() last night from:
> > 
> >   xfs_ail_check+0xa8/0x180
> >   xfs_ail_delete_one+0x3b/0xf0
> >   xfs_buf_inode_iodone+0x329/0x3f0
> >   xfs_buf_ioend+0x1f8/0x530
> >   xfs_buf_ioend_work+0x15/0x20
> > 
> > Finding an item that didn't have IN_AIL set on it. I think I've
> > found another mount vs log shutdown case that can result in dirty
> > aborted inodes that aren't in the AIL being flushed and bad things
> > happen when we then try to remove them from the AIL and they aren't
> > there...
> > 
> > Whether that is this problem or not, I don't know, but the assert
> > failures do end up with other threads spinning on the AIL lock
> > because of the assert failures under the AIL lock...
> > 
> 
> Some updates.. I tried to reproduce with lock debugging and whatnot
> enabled but the problem was no longer reproducible, probably due to
> disruption of timing. When I went back to a reproducing kernel, I ended
> up seeing a page fault variant crash via xfs/006 instead of the soft
> lockup. This occurred shortly after the unmount attempt started, so I
> retried again with KASAN enabled but ran into the same heisenbug
> behavior. From there, I replaced the generic debug mechanisms with a
> custom mount flag that is set in xfs_trans_ail_destroy() immediately
> after freeing the xfs_ail object, plus an assert check on entry to
> xfs_trans_ail_delete(). The next xfs/006 failure produced the splat
> below [2]. I suspect that is the smoking gun [1] and perhaps we've
> somehow broken serialization between in-core object teardown and
> outstanding log I/O completion after the filesystem happens to shut
> down.

OK, so the AIL has been torn down, then we are getting log IO
completing and being failed, trying to remove a buffer item from the
AIL?

xfs_log_unmount() does:

xfs_log_unmount(
        struct xfs_mount        *mp)
{
        xfs_log_clean(mp);

        xfs_buftarg_drain(mp->m_ddev_targp);

        xfs_trans_ail_destroy(mp);

And so the AIL cannot be destroyed if there are any active buffers
still in the system (e.g. being logged) - it will block until
xfs_buftarg_drain() returns with an empty buffer cache.

Which means this buffer is likely the superblock buffer - the only
buffer in the filesystem that isn't cached in the buffer cache but
is logged.

So I suspect we have:

xfs_log_unmount()
  xfs_log_clean()
    xfs_log_quiesce()
      xfs_log_cover()
        xfs_sync_sb()
        xfs_ail_push_all_sync()

failing to push the superblock buffer into the AIL, so
xfs_log_cover() returns while the log IO is still in progress.
We then write an unmount record, which can silently fail to force
the iclog and so we get through xfs_log_clean() while the superblock
buffer is still attached to an iclog...

The buffer cache is empty, and we then tear down the AIL, only to
have iclog callbacks run, fail the superblock buffer item that is
still attached to the iclog, and try to remove the superblock
buffer from the AIL as the last reference to the buf log item is
released...

This is convoluted, but I think it can happen if we get an iclog IO
error on the log force from xfs_sync_sb(). I think the key thing
is how the log is shut down and how wakeups are processed in
xlog_force_shutdown()....

> [1] I've not reproduced the soft lockup variant with this hack in place,
> but if the problem is a UAF then I'd expect the resulting behavior to be
> somewhat erratic/unpredictable.
> 
> [2] xfs/006 custom assert and BUG splat:
> 
> XFS: Assertion failed: !mp->m_freed_ail, file: fs/xfs/xfs_trans_ail.c, line: 879
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 1289 at fs/xfs/xfs_message.c:97 asswarn+0x1a/0x1d [xfs]
> Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth
>  dm_log dm_mod
> CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S                5.17.0-rc6+ #29
> Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
> Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs]
> RIP: 0010:asswarn+0x1a/0x1d [xfs]
> Code: e8 e8 13 8c c4 48 83 c4 60 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 d8 c0 e8 19 fd ff ff <0f> 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0
> RSP: 0018:ff5757be8f37fb70 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ff4847238f4b2940 RCX: 0000000000000000
> RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc0d7e076
> RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000
> R10: 000000000000000a R11: f000000000000000 R12: 0000000200002600
> R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940
> FS:  0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f55c21f0f70 CR3: 0000000142c9c002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  xfs_trans_ail_delete+0x102/0x130 [xfs]
>  xfs_buf_item_done+0x22/0x30 [xfs]
>  xfs_buf_ioend+0x73/0x4d0 [xfs]
>  xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
>  ? enqueue_task_fair+0x91/0x680
>  ? remove_entity_load_avg+0x2e/0x70
>  ? __wake_up_common_lock+0x87/0xc0
>  xlog_cil_committed+0x2a9/0x300 [xfs]
>  ? __wake_up_common_lock+0x87/0xc0
>  xlog_cil_process_committed+0x69/0x80 [xfs]
>  xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
>  xlog_force_shutdown+0xdf/0x150 [xfs]
>  xfs_do_force_shutdown+0x5f/0x150 [xfs]
>  xlog_ioend_work+0x71/0x80 [xfs]

OK, so this is processing an EIO error to a log write, and it's
triggering a force shutdown. This causes the log to be shut down,
and then it runs the attached iclog callbacks from the shutdown
context. That means the fs and log have already been marked as
xfs_is_shutdown/xlog_is_shutdown and so high level code will abort
(e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error
because of shutdown. That's the first thing we need for
xfs_sync_sb() to exit without waiting for the superblock buffer to
be committed to disk above.
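
For reference, xfs_sync_sb() is just a synchronous superblock commit,
roughly like this (paraphrased from the 5.17 code, comments mine):

int
xfs_sync_sb(
        struct xfs_mount        *mp,
        bool                    wait)
{
        struct xfs_trans        *tp;
        int                     error;

        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0,
                        XFS_TRANS_NO_WRITECOUNT, &tp);
        if (error)
                return error;

        xfs_log_sb(tp);
        if (wait)
                xfs_trans_set_sync(tp);

        /*
         * Once the log is shut down this aborts with an error instead
         * of waiting for the log force, so xfs_log_cover() keeps going.
         */
        return xfs_trans_commit(tp);
}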

The second is xlog_state_shutdown_callbacks() doing this:

static void
xlog_state_shutdown_callbacks(
        struct xlog             *log)
{
        struct xlog_in_core     *iclog;
        LIST_HEAD(cb_list);

        spin_lock(&log->l_icloglock);
        iclog = log->l_iclog;
        do {
                if (atomic_read(&iclog->ic_refcnt)) {
                        /* Reference holder will re-run iclog callbacks. */
                        continue;
                }
                list_splice_init(&iclog->ic_callbacks, &cb_list);
>>>>>>           wake_up_all(&iclog->ic_write_wait);
>>>>>>           wake_up_all(&iclog->ic_force_wait);
        } while ((iclog = iclog->ic_next) != log->l_iclog);

        wake_up_all(&log->l_flush_wait);
        spin_unlock(&log->l_icloglock);

>>>>>>  xlog_cil_process_committed(&cb_list);
}

It wakes force waiters before shutdown processing has run all the
pending callbacks.  That means the xfs_sync_sb() call waiting on a
sync transaction in xfs_log_force() on iclog->ic_force_wait will get
woken before the callbacks attached to that iclog are run. Normally
this is just fine because the force waiter has nothing to do with
AIL operations. But in the case of this unmount path, the log force
waiter goes on to tear down the AIL because the log is now shut down
and nothing ever blocks it again from the wait point in
xfs_log_cover().

Hence it's a race to see who gets to the AIL first - the unmount
code or xlog_cil_process_committed() killing the superblock buffer.
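
Putting the two threads side by side (an illustrative timeline only,
call sites approximate):

/*
 *   log I/O completion (shutdown)        unmount (xfs_log_unmount)
 *   -----------------------------        -------------------------
 *   xlog_ioend_work()
 *     xfs_do_force_shutdown()
 *       xlog_force_shutdown()
 *         xlog_state_shutdown_callbacks()
 *           wake_up_all(ic_force_wait) -> xfs_log_force() returns (shutdown)
 *                                         xfs_log_cover() returns
 *                                         unmount record write silently fails
 *                                         xfs_buftarg_drain() (cache empty)
 *                                         xfs_trans_ail_destroy() frees AIL
 *           xlog_cil_process_committed()
 *             xlog_cil_committed()
 *               xfs_buf_ioend()
 *                 xfs_buf_item_done()
 *                   xfs_trans_ail_delete() <- use-after-free of the AIL
 */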

So that's the bug, and it has nothing to do with any of the other
shutdown issues I'm trying to sort out right now. I'll add it to the
list.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-03-22 21:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster
2022-03-18 16:11 ` Brian Foster
2022-03-18 21:42   ` Dave Chinner
2022-03-21 18:35     ` Brian Foster
2022-03-21 22:14       ` Dave Chinner
2022-03-22 14:33         ` Brian Foster
2022-03-22 21:41           ` Dave Chinner
2022-03-18 21:48 ` Dave Chinner
2022-03-18 21:51   ` Darrick J. Wong
2022-03-18 22:39     ` Dave Chinner
