* generic/068 crash on 5.18-rc2?
From: Darrick J. Wong @ 2022-04-13  3:34 UTC
  To: xfs, Matthew Wilcox; +Cc: fstests, linux-fsdevel

Hmm.  Two nights in a row I've seen the following crash.  Has anyone
else seen this, or should I keep digging?  This is a fairly boring
x86_64 VM with an XFS v5 filesystem + rmapbt.

mm/filemap.c:1653 is the BUG in:

void folio_end_writeback(struct folio *folio)
{
	/*
	 * folio_test_clear_reclaim() could be used here but it is an
	 * atomic operation and overkill in this particular case.
	 * Failing to shuffle a folio marked for immediate reclaim is
	 * too mild a gain to justify taking an atomic operation penalty
	 * at the end of every folio writeback.
	 */
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		folio_rotate_reclaimable(folio);
	}

	/*
	 * Writeback does not hold a folio reference of its own, relying
	 * on truncation to wait for the clearing of PG_writeback.
	 * But here we must make sure that the folio is not freed and
	 * reused before the folio_wake().
	 */
	folio_get(folio);
	if (!__folio_end_writeback(folio))
>>>>		BUG();

	smp_mb__after_atomic();
	folio_wake(folio, PG_writeback);
	acct_reclaim_writeback(folio);
	folio_put(folio);
}
EXPORT_SYMBOL(folio_end_writeback);


--D

run fstests generic/068 at 2022-04-12 17:57:11
XFS (sda3): Mounting V5 Filesystem
XFS (sda3): Ending clean mount
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Ending clean mount
------------[ cut here ]------------
kernel BUG at mm/filemap.c:1653!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
RIP: 0010:folio_end_writeback+0x79/0x80
Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
Call Trace:
 <TASK>
 iomap_finish_ioend+0x19e/0x560
 iomap_finish_ioends+0x69/0x100
 xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
 xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
 process_one_work+0x1df/0x3c0
 ? rescuer_thread+0x3b0/0x3b0
 worker_thread+0x53/0x3b0
 ? rescuer_thread+0x3b0/0x3b0
 kthread+0xea/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:folio_end_writeback+0x79/0x80
Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0



* Re: generic/068 crash on 5.18-rc2?
From: Matthew Wilcox @ 2022-04-13 14:50 UTC
  To: Darrick J. Wong; +Cc: xfs, fstests, linux-fsdevel

On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> else seen this, or should I keep digging?  This is a fairly boring
> x86_64 VM with a XFS v5 filesystem + rmapbt.

I have not seen this before.  I test with:
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc

Maybe I should try a 4096 byte block size.
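That would presumably just be the same config with the block size
bumped, e.g.:
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=4096 /dev/sdc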

> mm/filemap.c:1653 is the BUG in:
> 
> void folio_end_writeback(struct folio *folio)
> {
> 	/*
> 	 * folio_test_clear_reclaim() could be used here but it is an
> 	 * atomic operation and overkill in this particular case.
> 	 * Failing to shuffle a folio marked for immediate reclaim is
> 	 * too mild a gain to justify taking an atomic operation penalty
> 	 * at the end of every folio writeback.
> 	 */
> 	if (folio_test_reclaim(folio)) {
> 		folio_clear_reclaim(folio);
> 		folio_rotate_reclaimable(folio);
> 	}
> 
> 	/*
> 	 * Writeback does not hold a folio reference of its own, relying
> 	 * on truncation to wait for the clearing of PG_writeback.
> 	 * But here we must make sure that the folio is not freed and
> 	 * reused before the folio_wake().
> 	 */
> 	folio_get(folio);
> 	if (!__folio_end_writeback(folio))
> >>>>		BUG();

Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
information about the folio (like whether it has an iop, or what order
the folio is).  Can you make that change and try to reproduce?
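
Something like this, I mean (with CONFIG_DEBUG_VM=y, VM_BUG_ON_FOLIO()
dumps the folio state before BUG()ing; without it the check compiles
away entirely):

	folio_get(folio);
	if (!__folio_end_writeback(folio))
		/* dump the folio (flags, refcount, order) before dying */
		VM_BUG_ON_FOLIO(1, folio);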

What's going on here is that we've called folio_end_writeback() on a
folio which does not have the writeback flag set.  That _should_ be
impossible, hence the use of BUG().  Either we've called
folio_end_writeback() twice on the same folio, or we neglected to set
the writeback flag on the folio.  I don't immediately see why either
of those two things would happen.

> 
> 
> --D
> 
> run fstests generic/068 at 2022-04-12 17:57:11
> XFS (sda3): Mounting V5 Filesystem
> XFS (sda3): Ending clean mount
> XFS (sda4): Mounting V5 Filesystem
> XFS (sda4): Ending clean mount
> ------------[ cut here ]------------
> kernel BUG at mm/filemap.c:1653!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> RIP: 0010:folio_end_writeback+0x79/0x80
> Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> Call Trace:
>  <TASK>
>  iomap_finish_ioend+0x19e/0x560
>  iomap_finish_ioends+0x69/0x100
>  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
>  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
>  process_one_work+0x1df/0x3c0
>  ? rescuer_thread+0x3b0/0x3b0
>  worker_thread+0x53/0x3b0
>  ? rescuer_thread+0x3b0/0x3b0
>  kthread+0xea/0x110
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
> Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:folio_end_writeback+0x79/0x80
> Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> 


* Re: generic/068 crash on 5.18-rc2?
From: Darrick J. Wong @ 2022-04-13 16:23 UTC
  To: Matthew Wilcox; +Cc: xfs, fstests, linux-fsdevel

On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > else seen this, or should I keep digging?  This is a fairly boring
> > x86_64 VM with a XFS v5 filesystem + rmapbt.
> 
> I have not seen this before.  I test with:
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> 
> Maybe I should try a 4096 byte block size.
> 
> > mm/filemap.c:1653 is the BUG in:
> > 
> > void folio_end_writeback(struct folio *folio)
> > {
> > 	/*
> > 	 * folio_test_clear_reclaim() could be used here but it is an
> > 	 * atomic operation and overkill in this particular case.
> > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > 	 * too mild a gain to justify taking an atomic operation penalty
> > 	 * at the end of every folio writeback.
> > 	 */
> > 	if (folio_test_reclaim(folio)) {
> > 		folio_clear_reclaim(folio);
> > 		folio_rotate_reclaimable(folio);
> > 	}
> > 
> > 	/*
> > 	 * Writeback does not hold a folio reference of its own, relying
> > 	 * on truncation to wait for the clearing of PG_writeback.
> > 	 * But here we must make sure that the folio is not freed and
> > 	 * reused before the folio_wake().
> > 	 */
> > 	folio_get(folio);
> > 	if (!__folio_end_writeback(folio))
> > >>>>		BUG();
> 
> Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> information about the folio (like whether it has an iop, or what order
> the folio is).  Can you make that change and try to reproduce?
> 
> What's going on here is that we've called folio_end_writeback() on a
> folio which does not have the writeback flag set.  That _should_ be
> impossible, hence the use of BUG().  Either we've called
> folio_end_writeback() twice on the same folio, or we neglected to set
> the writeback flag on the folio.  I don't immediately see why either
> of those two things would happen.

Ok, will do.

An ARM VM also tripped over this last night (64k pages, 4k fsblocksize)
and it had even more to say:

run fstests generic/068 at 2022-04-12 20:49:17
spectre-v4 mitigation disabled by command-line option
XFS (sda2): Mounting V5 Filesystem
XFS (sda2): Ending clean mount
XFS (sda3): Mounting V5 Filesystem
XFS (sda3): Ending clean mount
XFS (sda3): Quotacheck needed: Please wait.
XFS (sda3): Quotacheck: Done.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1214302 at fs/iomap/buffered-io.c:1020 iomap_finish_ioend+0x29c/0x37c
Modules linked in: dm_zero ext2 dm_delay xfs dm_snapshot ext4 mbcache jbd2 dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink crct10dif_ce bfq ip6table_filter ip6_tables iptable_filter sch_fq_codel efivarfs ip_tables x_tables overlay nfsv4 [last unloaded: scsi_debug]
CPU: 0 PID: 1214302 Comm: 0:4 Tainted: G        W         5.18.0-rc2-djwa #rc2 541bf598c49d4450e32c1bfc9b8fb32b7009548e
Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
Workqueue: xfs-conv/sda3 xfs_end_io [xfs]
pstate: 20401005 (nzCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : iomap_finish_ioend+0x29c/0x37c
lr : iomap_finish_ioend+0x19c/0x37c
sp : fffffe000f08fc20
x29: fffffe000f08fc20 x28: 0000000000040000 x27: 0000000000010000
x26: ffffffff00775040 x25: 0000000000000000 x24: 0000000000000001
x23: fffffc00e1014cc0 x22: fffffc01a44fac40 x21: 0000000000000000
x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
x17: 620000006b290000 x16: 4dae6b4802000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000001ae0 x9 : fffffe00088ebff4
x8 : fffffe01f6840000 x7 : fffffe0008fe92f0 x6 : 0000000000000387
x5 : 00000000f0000000 x4 : 0000000000000000 x3 : 0000000000010000
x2 : fffffc00e2623e80 x1 : 000000000000000d x0 : 0000000000000008
Call trace:
 iomap_finish_ioend+0x29c/0x37c
 iomap_finish_ioends+0x80/0x130
 xfs_end_ioend+0x68/0x164 [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
 xfs_end_io+0xcc/0x12c [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
 process_one_work+0x1e8/0x480
 worker_thread+0x7c/0x430
 kthread+0x108/0x114
 ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------

I tracked that assertion to:

static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
		size_t len, int error)
{
	struct iomap_page *iop = to_iomap_page(folio);

	if (error) {
		folio_set_error(folio);
		mapping_set_error(inode->i_mapping, error);
	}

>>>>	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);

	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
		folio_end_writeback(folio);
}

...before it also tripped over the same BUG at mm/filemap.c:1653.
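
(For context, i_blocks_per_folio(inode, folio) is just

	return folio_size(folio) >> inode->i_blkbits;

so with 4k fs blocks even a single 64k page is 16 blocks, i.e. the WARN
fires when a multi-block folio no longer has an iop attached.)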

--D

> > 
> > 
> > --D
> > 
> > run fstests generic/068 at 2022-04-12 17:57:11
> > XFS (sda3): Mounting V5 Filesystem
> > XFS (sda3): Ending clean mount
> > XFS (sda4): Mounting V5 Filesystem
> > XFS (sda4): Ending clean mount
> > ------------[ cut here ]------------
> > kernel BUG at mm/filemap.c:1653!
> > invalid opcode: 0000 [#1] PREEMPT SMP
> > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > RIP: 0010:folio_end_writeback+0x79/0x80
> > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > Call Trace:
> >  <TASK>
> >  iomap_finish_ioend+0x19e/0x560
> >  iomap_finish_ioends+0x69/0x100
> >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> >  process_one_work+0x1df/0x3c0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  worker_thread+0x53/0x3b0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  kthread+0xea/0x110
> >  ? kthread_complete_and_exit+0x20/0x20
> >  ret_from_fork+0x1f/0x30
> >  </TASK>
> > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > ---[ end trace 0000000000000000 ]---
> > RIP: 0010:folio_end_writeback+0x79/0x80
> > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > 


* Re: generic/068 crash on 5.18-rc2?
From: Matthew Wilcox @ 2022-04-13 16:35 UTC
  To: Darrick J. Wong; +Cc: xfs, fstests, linux-fsdevel

On Wed, Apr 13, 2022 at 09:23:51AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > else seen this, or should I keep digging?  This is a fairly boring
> > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > 
> > I have not seen this before.  I test with:
> > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > 
> > Maybe I should try a 4096 byte block size.
> > 
> > > mm/filemap.c:1653 is the BUG in:
> > > 
> > > void folio_end_writeback(struct folio *folio)
> > > {
> > > 	/*
> > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > 	 * atomic operation and overkill in this particular case.
> > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > 	 * at the end of every folio writeback.
> > > 	 */
> > > 	if (folio_test_reclaim(folio)) {
> > > 		folio_clear_reclaim(folio);
> > > 		folio_rotate_reclaimable(folio);
> > > 	}
> > > 
> > > 	/*
> > > 	 * Writeback does not hold a folio reference of its own, relying
> > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > 	 * But here we must make sure that the folio is not freed and
> > > 	 * reused before the folio_wake().
> > > 	 */
> > > 	folio_get(folio);
> > > 	if (!__folio_end_writeback(folio))
> > > >>>>		BUG();
> > 
> > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > information about the folio (like whether it has an iop, or what order
> > the folio is).  Can you make that change and try to reproduce?
> > 
> > What's going on here is that we've called folio_end_writeback() on a
> > folio which does not have the writeback flag set.  That _should_ be
> > impossible, hence the use of BUG().  Either we've called
> > folio_end_writeback() twice on the same folio, or we neglected to set
> > the writeback flag on the folio.  I don't immediately see why either
> > of those two things would happen.
> 
> Ok, will do.
> 
> An ARM VM also tripped over this last night (64k pages, 4k fsblocksize)
> and it had even more to say:
> 
> run fstests generic/068 at 2022-04-12 20:49:17
> spectre-v4 mitigation disabled by command-line option
> XFS (sda2): Mounting V5 Filesystem
> XFS (sda2): Ending clean mount
> XFS (sda3): Mounting V5 Filesystem
> XFS (sda3): Ending clean mount
> XFS (sda3): Quotacheck needed: Please wait.
> XFS (sda3): Quotacheck: Done.
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1214302 at fs/iomap/buffered-io.c:1020 iomap_finish_ioend+0x29c/0x37c
> Modules linked in: dm_zero ext2 dm_delay xfs dm_snapshot ext4 mbcache jbd2 dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink crct10dif_ce bfq ip6table_filter ip6_tables iptable_filter sch_fq_codel efivarfs ip_tables x_tables overlay nfsv4 [last unloaded: scsi_debug]
> CPU: 0 PID: 1214302 Comm: 0:4 Tainted: G        W         5.18.0-rc2-djwa #rc2 541bf598c49d4450e32c1bfc9b8fb32b7009548e
> Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
> Workqueue: xfs-conv/sda3 xfs_end_io [xfs]
> pstate: 20401005 (nzCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
> pc : iomap_finish_ioend+0x29c/0x37c
> lr : iomap_finish_ioend+0x19c/0x37c
> sp : fffffe000f08fc20
> x29: fffffe000f08fc20 x28: 0000000000040000 x27: 0000000000010000
> x26: ffffffff00775040 x25: 0000000000000000 x24: 0000000000000001
> x23: fffffc00e1014cc0 x22: fffffc01a44fac40 x21: 0000000000000000
> x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
> x17: 620000006b290000 x16: 4dae6b4802000000 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000001ae0 x9 : fffffe00088ebff4
> x8 : fffffe01f6840000 x7 : fffffe0008fe92f0 x6 : 0000000000000387
> x5 : 00000000f0000000 x4 : 0000000000000000 x3 : 0000000000010000
> x2 : fffffc00e2623e80 x1 : 000000000000000d x0 : 0000000000000008
> Call trace:
>  iomap_finish_ioend+0x29c/0x37c
>  iomap_finish_ioends+0x80/0x130
>  xfs_end_ioend+0x68/0x164 [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
>  xfs_end_io+0xcc/0x12c [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
>  process_one_work+0x1e8/0x480
>  worker_thread+0x7c/0x430
>  kthread+0x108/0x114
>  ret_from_fork+0x10/0x20
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> 
> I tracked that assertion to:
> 
> static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
> 		size_t len, int error)
> {
> 	struct iomap_page *iop = to_iomap_page(folio);
> 
> 	if (error) {
> 		folio_set_error(folio);
> 		mapping_set_error(inode->i_mapping, error);
> 	}
> 
> >>>>	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);

Oho, that's interesting.  I wonder if we have something that's stripping
the iop off the folio while writes are in progress?  Although we should
catch that:

        WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
        WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
        WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
                        folio_test_uptodate(folio));

> 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
> 
> 	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
> 		folio_end_writeback(folio);
> }
> 
> ...before it also tripped over the same BUG at mm/filemap.c:1653.
> 
> --D
> 
> > > 
> > > 
> > > --D
> > > 
> > > run fstests generic/068 at 2022-04-12 17:57:11
> > > XFS (sda3): Mounting V5 Filesystem
> > > XFS (sda3): Ending clean mount
> > > XFS (sda4): Mounting V5 Filesystem
> > > XFS (sda4): Ending clean mount
> > > ------------[ cut here ]------------
> > > kernel BUG at mm/filemap.c:1653!
> > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > Call Trace:
> > >  <TASK>
> > >  iomap_finish_ioend+0x19e/0x560
> > >  iomap_finish_ioends+0x69/0x100
> > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  process_one_work+0x1df/0x3c0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  worker_thread+0x53/0x3b0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  kthread+0xea/0x110
> > >  ? kthread_complete_and_exit+0x20/0x20
> > >  ret_from_fork+0x1f/0x30
> > >  </TASK>
> > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > ---[ end trace 0000000000000000 ]---
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > 


* Re: generic/068 crash on 5.18-rc2?
From: Darrick J. Wong @ 2022-04-18 17:47 UTC
  To: Matthew Wilcox; +Cc: xfs, fstests, linux-fsdevel

On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > else seen this, or should I keep digging?  This is a fairly boring
> > x86_64 VM with a XFS v5 filesystem + rmapbt.
> 
> I have not seen this before.  I test with:
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> 
> Maybe I should try a 4096 byte block size.
> 
> > mm/filemap.c:1653 is the BUG in:
> > 
> > void folio_end_writeback(struct folio *folio)
> > {
> > 	/*
> > 	 * folio_test_clear_reclaim() could be used here but it is an
> > 	 * atomic operation and overkill in this particular case.
> > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > 	 * too mild a gain to justify taking an atomic operation penalty
> > 	 * at the end of every folio writeback.
> > 	 */
> > 	if (folio_test_reclaim(folio)) {
> > 		folio_clear_reclaim(folio);
> > 		folio_rotate_reclaimable(folio);
> > 	}
> > 
> > 	/*
> > 	 * Writeback does not hold a folio reference of its own, relying
> > 	 * on truncation to wait for the clearing of PG_writeback.
> > 	 * But here we must make sure that the folio is not freed and
> > 	 * reused before the folio_wake().
> > 	 */
> > 	folio_get(folio);
> > 	if (!__folio_end_writeback(folio))
> > >>>>		BUG();
> 
> Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> information about the folio (like whether it has an iop, or what order
> the folio is).  Can you make that change and try to reproduce?

> What's going on here is that we've called folio_end_writeback() on a
> folio which does not have the writeback flag set.  That _should_ be
> impossible, hence the use of BUG().  Either we've called
> folio_end_writeback() twice on the same folio, or we neglected to set
> the writeback flag on the folio.  I don't immediately see why either
> of those two things would happen.

Well, I made that change and rebased to -rc3 to see if reverting that
ZERO_PAGE thing would produce better results, and I think I just got the
same crash.  Curiously, the only VM that died this time was the one
running the realtime configuration, but it's still generic/068:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 oci-mtr28 5.18.0-rc3-djwx #rc3 SMP PREEMPT_DYNAMIC Sun Apr 17 14:42:49 PDT 2022
MKFS_OPTIONS  -- -f -rrtdev=/dev/sdb4 -llogdev=/dev/sdb2 -m reflink=0,rmapbt=0, -d rtinherit=1, /dev/sda4
MOUNT_OPTIONS -- -ortdev=/dev/sdb4 -ologdev=/dev/sdb2 /dev/sda4 /opt

I don't know if it'll help, but here's the sequence of tests that we
were running just prior to crashing:

generic/445      3s
generic/225      76s
xfs/306  22s
xfs/290  3s
generic/155     [not run] Reflink not supported by test filesystem type: xfs
generic/525      6s
generic/269      89s
generic/1206    [not run] xfs_io swapext -v vfs -s 64k -l 64k ioctl support is missing
xfs/504  198s
xfs/192 [not run] Reflink not supported by scratch filesystem type: xfs
xfs/303  1s
generic/346      6s
generic/512      5s
xfs/227  308s
generic/147     [not run] Reflink not supported by test filesystem type: xfs
generic/230     [not run] Quotas not supported on realtime test device
generic/008      4s
generic/108      4s
xfs/264  12s
generic/200     [not run] Reflink not supported by scratch filesystem type: xfs
generic/493     [not run] Dedupe not supported by scratch filesystem type: xfs
xfs/021  5s
generic/672     [not run] Reflink not supported by scratch filesystem type: xfs
xfs/493  5s
xfs/146  13s
xfs/315 [not run] Reflink not supported by scratch filesystem type: xfs
generic/068     

And the dmesg output:

run fstests generic/068 at 2022-04-17 16:57:16
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Ending clean mount
page:ffffea0004a39c40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x128e71
flags: 0x17ff80000000000(node=0|zone=2|lastcpupid=0xfff)
raw: 017ff80000000000 0000000000000000 ffffffff00000203 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1164!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 1094085 Comm: 3:0 Tainted: G        W         5.18.0-rc3-djwx #rc3 0a707744ee7c555d54e50726c5b02515710a6aae
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
RIP: 0010:folio_end_writeback+0xd0/0x110
Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
Call Trace:
 <TASK>
 iomap_finish_ioend+0x1ee/0x6a0
 iomap_finish_ioends+0x69/0x100
 xfs_end_ioend+0x5a/0x160 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
 xfs_end_io+0xb1/0xf0 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
 process_one_work+0x1df/0x3c0
 ? rescuer_thread+0x3b0/0x3b0
 worker_thread+0x53/0x3b0
 ? rescuer_thread+0x3b0/0x3b0
 kthread+0xea/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: xfs dm_zero btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_delay dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:folio_end_writeback+0xd0/0x110
Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0

--D

> 
> > 
> > 
> > --D
> > 
> > run fstests generic/068 at 2022-04-12 17:57:11
> > XFS (sda3): Mounting V5 Filesystem
> > XFS (sda3): Ending clean mount
> > XFS (sda4): Mounting V5 Filesystem
> > XFS (sda4): Ending clean mount
> > ------------[ cut here ]------------
> > kernel BUG at mm/filemap.c:1653!
> > invalid opcode: 0000 [#1] PREEMPT SMP
> > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > RIP: 0010:folio_end_writeback+0x79/0x80
> > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > Call Trace:
> >  <TASK>
> >  iomap_finish_ioend+0x19e/0x560
> >  iomap_finish_ioends+0x69/0x100
> >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> >  process_one_work+0x1df/0x3c0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  worker_thread+0x53/0x3b0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  kthread+0xea/0x110
> >  ? kthread_complete_and_exit+0x20/0x20
> >  ret_from_fork+0x1f/0x30
> >  </TASK>
> > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > ---[ end trace 0000000000000000 ]---
> > RIP: 0010:folio_end_writeback+0x79/0x80
> > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > 


* Re: generic/068 crash on 5.18-rc2?
From: Darrick J. Wong @ 2022-04-18 18:44 UTC
  To: Matthew Wilcox; +Cc: xfs, fstests, linux-fsdevel

On Wed, Apr 13, 2022 at 05:35:31PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 13, 2022 at 09:23:51AM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > > else seen this, or should I keep digging?  This is a fairly boring
> > > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > > 
> > > I have not seen this before.  I test with:
> > > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > > 
> > > Maybe I should try a 4096 byte block size.
> > > 
> > > > mm/filemap.c:1653 is the BUG in:
> > > > 
> > > > void folio_end_writeback(struct folio *folio)
> > > > {
> > > > 	/*
> > > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > > 	 * atomic operation and overkill in this particular case.
> > > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > > 	 * at the end of every folio writeback.
> > > > 	 */
> > > > 	if (folio_test_reclaim(folio)) {
> > > > 		folio_clear_reclaim(folio);
> > > > 		folio_rotate_reclaimable(folio);
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * Writeback does not hold a folio reference of its own, relying
> > > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > > 	 * But here we must make sure that the folio is not freed and
> > > > 	 * reused before the folio_wake().
> > > > 	 */
> > > > 	folio_get(folio);
> > > > 	if (!__folio_end_writeback(folio))
> > > > >>>>		BUG();
> > > 
> > > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > > information about the folio (like whether it has an iop, or what order
> > > the folio is).  Can you make that change and try to reproduce?
> > > 
> > > What's going on here is that we've called folio_end_writeback() on a
> > > folio which does not have the writeback flag set.  That _should_ be
> > > impossible, hence the use of BUG().  Either we've called
> > > folio_end_writeback() twice on the same folio, or we neglected to set
> > > the writeback flag on the folio.  I don't immediately see why either
> > > of those two things would happen.
> > 
> > Ok, will do.
> > 
> > An ARM VM also tripped over this last night (64k pages, 4k fsblocksize)
> > and it had even more to say:
> > 
> > run fstests generic/068 at 2022-04-12 20:49:17
> > spectre-v4 mitigation disabled by command-line option
> > XFS (sda2): Mounting V5 Filesystem
> > XFS (sda2): Ending clean mount
> > XFS (sda3): Mounting V5 Filesystem
> > XFS (sda3): Ending clean mount
> > XFS (sda3): Quotacheck needed: Please wait.
> > XFS (sda3): Quotacheck: Done.
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 1214302 at fs/iomap/buffered-io.c:1020 iomap_finish_ioend+0x29c/0x37c
> > Modules linked in: dm_zero ext2 dm_delay xfs dm_snapshot ext4 mbcache jbd2 dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink crct10dif_ce bfq ip6table_filter ip6_tables iptable_filter sch_fq_codel efivarfs ip_tables x_tables overlay nfsv4 [last unloaded: scsi_debug]
> > CPU: 0 PID: 1214302 Comm: 0:4 Tainted: G        W         5.18.0-rc2-djwa #rc2 541bf598c49d4450e32c1bfc9b8fb32b7009548e
> > Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
> > Workqueue: xfs-conv/sda3 xfs_end_io [xfs]
> > pstate: 20401005 (nzCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
> > pc : iomap_finish_ioend+0x29c/0x37c
> > lr : iomap_finish_ioend+0x19c/0x37c
> > sp : fffffe000f08fc20
> > x29: fffffe000f08fc20 x28: 0000000000040000 x27: 0000000000010000
> > x26: ffffffff00775040 x25: 0000000000000000 x24: 0000000000000001
> > x23: fffffc00e1014cc0 x22: fffffc01a44fac40 x21: 0000000000000000
> > x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
> > x17: 620000006b290000 x16: 4dae6b4802000000 x15: 0000000000000000
> > x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > x11: 0000000000000000 x10: 0000000000001ae0 x9 : fffffe00088ebff4
> > x8 : fffffe01f6840000 x7 : fffffe0008fe92f0 x6 : 0000000000000387
> > x5 : 00000000f0000000 x4 : 0000000000000000 x3 : 0000000000010000
> > x2 : fffffc00e2623e80 x1 : 000000000000000d x0 : 0000000000000008
> > Call trace:
> >  iomap_finish_ioend+0x29c/0x37c
> >  iomap_finish_ioends+0x80/0x130
> >  xfs_end_ioend+0x68/0x164 [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
> >  xfs_end_io+0xcc/0x12c [xfs ccff30bab1b631f6755d8bbcebc428122f4b51e0]
> >  process_one_work+0x1e8/0x480
> >  worker_thread+0x7c/0x430
> >  kthread+0x108/0x114
> >  ret_from_fork+0x10/0x20
> > ---[ end trace 0000000000000000 ]---
> > ------------[ cut here ]------------
> > 
> > I tracked that assertion to:
> > 
> > static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
> > 		size_t len, int error)
> > {
> > 	struct iomap_page *iop = to_iomap_page(folio);
> > 
> > 	if (error) {
> > 		folio_set_error(folio);
> > 		mapping_set_error(inode->i_mapping, error);
> > 	}
> > 
> > >>>>	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
> 
> Oho, that's interesting.  I wonder if we have something that's stripping
> the iop off the folio while writes are in progress?  Although we should
> catch that:
> 
>         WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>         WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
>         WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
>                         folio_test_uptodate(folio));

Hmm.  Today's splat looked like this:

run fstests xfs/313 at 2022-04-17 15:04:37
spectre-v4 mitigation disabled by command-line option
XFS (sda2): Mounting V5 Filesystem
XFS (sda2): Ending clean mount
XFS (sda3): Mounting V5 Filesystem
XFS (sda3): Ending clean mount
XFS (sda3): Quotacheck needed: Please wait.
XFS (sda3): Quotacheck: Done.
XFS (sda3): Unmounting Filesystem
XFS (sda3): Mounting V5 Filesystem
XFS (sda3): Ending clean mount
XFS (sda3): Quotacheck needed: Please wait.
XFS (sda3): Quotacheck: Done.
XFS (sda3): Injecting error (false) at file fs/xfs/libxfs/xfs_refcount.c, line 1162, on filesystem "sda3"
XFS (sda3): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x580/0x88c [xfs] (fs/xfs/libxfs/xfs_defer.c:533).  Shutting down filesystem.
XFS (sda3): Please unmount the filesystem and rectify the problem(s)
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1670507 at fs/iomap/buffered-io.c:73 iomap_page_release+0x120/0x14c

I chased that down to:

static void iomap_page_release(struct folio *folio)
{
	struct iomap_page *iop = folio_detach_private(folio);
	struct inode *inode = folio->mapping->host;
	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

	if (!iop)
		return;
	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>>>>	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
			folio_test_uptodate(folio));
	kfree(iop);
}

Modules linked in: dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq crct10dif_ce sch_fq_codel efivarfs ip_toverlay nfsv4
CPU: 0 PID: 1670507 Comm: xfs_io Not tainted 5.18.0-rc3-djwa #rc3 ec810768ab31b0e167ac04d847f3cdf695827dae
Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
pstate: 80401005 (Nzcv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : iomap_page_release+0x120/0x14c
lr : iomap_page_release+0x24/0x14c
sp : fffffe00213af870
x29: fffffe00213af870 x28: fffffe00213afb78 x27: fffffe00213af9e8
x26: 000000000000000c x25: fffffc0105552ac0 x24: fffffc0105552c00
x23: 000000000000000c x22: 000000000000000c x21: fffffc00e3ffac00
x20: 0000000000000040 x19: ffffffff007cbc00 x18: 0000000000000030
x17: 656c626f72702065 x16: 6874207966697463 x15: 657220646e61206d
x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
x11: fffffe00213af470 x10: fffffe0001658ea0 x9 : fffffe000836b8c0
x8 : fffffc00f0ed8400 x7 : 00000000ffffffc0 x6 : 000000000038fc58
x5 : 00000000fffffffb x4 : fffffc01ff5ee480 x3 : fffffe00015a7d44
x2 : 0000000000000002 x1 : 0000000000010000 x0 : 0000000000020000
Call trace:
 iomap_page_release+0x120/0x14c
 iomap_invalidate_folio+0x138/0x1c4
 xfs_discard_folio+0xcc/0x26c [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 iomap_do_writepage+0x83c/0x980
 write_cache_pages+0x200/0x4a0
 iomap_writepages+0x38/0x60
 xfs_vm_writepages+0x98/0xe0 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 do_writepages+0x94/0x1d0
 filemap_fdatawrite_wbc+0x8c/0xc0
 file_write_and_wait_range+0xc0/0x100
 xfs_file_fsync+0x64/0x260 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 vfs_fsync_range+0x44/0x90
 do_fsync+0x44/0x90
 __arm64_sys_fsync+0x24/0x34
 invoke_syscall.constprop.0+0x58/0xf0
 do_el0_svc+0x5c/0x160
 el0_svc+0x3c/0x184
 el0t_64_sync_handler+0x1a8/0x1b0
 el0t_64_sync+0x18c/0x190
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1670507 at fs/iomap/buffered-io.c:1020 iomap_finish_ioend+0x3a4/0x550

And then this one is the same crash as last week:

static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
		size_t len, int error)
{
	struct iomap_page *iop = to_iomap_page(folio);

	if (error) {
		folio_set_error(folio);
		mapping_set_error(inode->i_mapping, error);
	}

>>>>	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
	VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);
	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);

	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
		folio_end_writeback(folio);
}

(Note that I added a VM_BUG_ON_FOLIO for fun!)

I think iomap_add_to_ioend created a bio for writeback, added a sub-page
block to the bio, and then the fs went down.
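
(Adding that block also bumps the iop's write_bytes_pending;
paraphrasing 5.18's iomap_add_to_ioend(), roughly

	if (iop)
		atomic_add(len, &iop->write_bytes_pending);

with the matching atomic_sub_and_test() in iomap_finish_folio_write()
above.)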

Unfortunately, iomap_writepage_map called ->map_blocks to schedule
writeback for the first block on the same page.  That fails because the
fs went down, so we call ->discard_folio with @pos set to the start of
the folio.  xfs_discard_folio tries to discard the entire folio, which,
because offset_in_folio(@pos) was zero, means we tear down the iop.
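
(Specifically, it's the full-folio branch of iomap_invalidate_folio(),
paraphrasing 5.18, that frees it:

	if (offset == 0 && len == folio_size(folio)) {
		WARN_ON_ONCE(folio_test_writeback(folio));
		folio_cancel_dirty(folio);
		iomap_page_release(folio);	/* kfree()s the iop */
	}

which matches the iomap_page_release() frame in the trace above.)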

Regrettably, the iop still has write_bytes_pending>0, so we blow up.
I guess this means we started writeback on a block that wasn't the first
one in the folio, and then writeback failed on block zero?

Modules linked in: dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq crct10dif_ce sch_fq_codel efivarfs ip_toverlay nfsv4
CPU: 0 PID: 1670507 Comm: xfs_io Tainted: G        W         5.18.0-rc3-djwa #rc3 ec810768ab31b0e167ac04d847f3cdf695827dae
Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
pstate: 20401005 (nzCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : iomap_finish_ioend+0x3a4/0x550
lr : iomap_finish_ioend+0x26c/0x550
sp : fffffe00213afa00
x29: fffffe00213afa00 x28: 0000000000000000 x27: ffffffff007cbc00
x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
x23: 00000000fffffffb x22: fffffc0105552c00 x21: fffffc00e23f3dc0
x20: 0000000000020000 x19: 0000000000020000 x18: 0000000000000030
x17: 656c626f72702065 x16: 6874207966697463 x15: 657220646e61206d
x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
x11: fffffe00213af470 x10: fffffe0001658ea0 x9 : fffffe000836f6f0
x8 : 0000000000000238 x7 : 0000000000011000 x6 : 000000000000000c
x5 : 0000000000000040 x4 : 0000000000000005 x3 : fffffc00f1b11524
x2 : 0000000000000002 x1 : 0000000000010000 x0 : 1ffe000000018116
Call trace:
 iomap_finish_ioend+0x3a4/0x550
 iomap_writepage_end_bio+0x38/0x300
 bio_endio+0x174/0x214
 iomap_submit_ioend+0x94/0xb0
 iomap_writepages+0x4c/0x60
 xfs_vm_writepages+0x98/0xe0 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 do_writepages+0x94/0x1d0
 filemap_fdatawrite_wbc+0x8c/0xc0
 file_write_and_wait_range+0xc0/0x100
 xfs_file_fsync+0x64/0x260 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 vfs_fsync_range+0x44/0x90
 do_fsync+0x44/0x90
 __arm64_sys_fsync+0x24/0x34
 invoke_syscall.constprop.0+0x58/0xf0
 do_el0_svc+0x5c/0x160
 el0_svc+0x3c/0x184
 el0t_64_sync_handler+0x1a8/0x1b0
 el0t_64_sync+0x18c/0x190
---[ end trace 0000000000000000 ]---
page:ffffffff007cbc00 refcount:4 mapcount:0 mapping:fffffc0105552da8 index:0x8 pfn:0x232f0
head:ffffffff007cbc00 order:2 compound_mapcount:0 compound_pincount:0
memcg:fffffc00f24de000
aops:xfs_address_space_operations [xfs] ino:87 dentry name:"file2"
flags: 0x1ffe000000018116(error|referenced|uptodate|lru|writeback|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 1ffe000000018116 ffffffff004cf408 ffffffff007b4c08 fffffc0105552da8
raw: 0000000000000008 0000000000000000 00000004ffffffff fffffc00f24de000
page dumped because: VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop)
------------[ cut here ]------------
kernel BUG at fs/iomap/buffered-io.c:1021!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in: dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq crct10dif_ce sch_fq_codel efivarfs ip_toverlay nfsv4
CPU: 0 PID: 1670507 Comm: xfs_io Tainted: G        W         5.18.0-rc3-djwa #rc3 ec810768ab31b0e167ac04d847f3cdf695827dae
Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : iomap_finish_ioend+0x3c0/0x550
lr : iomap_finish_ioend+0x3c0/0x550
sp : fffffe00213afa00
x29: fffffe00213afa00 x28: 0000000000000000 x27: ffffffff007cbc00
x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
x23: 00000000fffffffb x22: fffffc0105552c00 x21: fffffc00e23f3dc0
x20: 0000000000020000 x19: 0000000000020000 x18: 0000000000000030
x17: 66202c65646f6e69 x16: 286f696c6f665f72 x15: 65705f736b636f6c
x14: 625f69284f494c4f x13: 29706f6921202626 x12: 2031203e20296f69
x11: 000000010000025e x10: 000000010000025e x9 : fffffe00080ca428
x8 : 00000000000008d0 x7 : c00000010000025e x6 : 0000000000000468
x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000
x2 : 0000000000000000 x1 : fffffc00e60d1940 x0 : 0000000000000052
Call trace:
 iomap_finish_ioend+0x3c0/0x550
 iomap_writepage_end_bio+0x38/0x300
 bio_endio+0x174/0x214
 iomap_submit_ioend+0x94/0xb0
 iomap_writepages+0x4c/0x60
 xfs_vm_writepages+0x98/0xe0 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 do_writepages+0x94/0x1d0
 filemap_fdatawrite_wbc+0x8c/0xc0
 file_write_and_wait_range+0xc0/0x100
 xfs_file_fsync+0x64/0x260 [xfs e4acf345e615d04c4712ff9d0ad149b9f1983c65]
 vfs_fsync_range+0x44/0x90
 do_fsync+0x44/0x90
 __arm64_sys_fsync+0x24/0x34
 invoke_syscall.constprop.0+0x58/0xf0
 do_el0_svc+0x5c/0x160
 el0_svc+0x3c/0x184
 el0t_64_sync_handler+0x1a8/0x1b0
 el0t_64_sync+0x18c/0x190
Code: aa1b03e0 90003f61 91212021 97fb3939 (d4210000) 
---[ end trace 0000000000000000 ]---
note: xfs_io[1670507] exited with preempt_count 1
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:624 rcu_eqs_enter.constprop.0+0x88/0x90
Modules linked in: dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq crct10dif_ce sch_fq_codel efivarfs ip_toverlay nfsv4
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D W         5.18.0-rc3-djwa #rc3 ec810768ab31b0e167ac04d847f3cdf695827dae
Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
pstate: 204010c5 (nzCv daIF +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : rcu_eqs_enter.constprop.0+0x88/0x90
lr : rcu_eqs_enter.constprop.0+0x14/0x90
sp : fffffe0008fefd30
x29: fffffe0008fefd30 x28: 0000000040c80018 x27: fffffe0008b3d238
x26: fffffe000900e140 x25: 0000000000000000 x24: 0000000000000000
x23: fffffe0008b22688 x22: fffffe0008dc7100 x21: fffffe00090059f0
x20: fffffe0009005a98 x19: fffffc01ff5e9180 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000001 x13: 0000000000000000 x12: 0000000000000003
x11: 0000000000000000 x10: 0000000000001ae0 x9 : fffffe0008903184
x8 : fffffe000900fc80 x7 : 0000000100012faa x6 : 000027a0835a48ce
x5 : 00ffffffffffffff x4 : 0000000000000001 x3 : fffffe0008d40008
x2 : 4000000000000000 x1 : fffffe0009009310 x0 : 4000000000000002
Call trace:
 rcu_eqs_enter.constprop.0+0x88/0x90
 rcu_idle_enter+0x1c/0x30
 default_idle_call+0x34/0x188
 do_idle+0x230/0x2a4
 cpu_startup_entry+0x30/0x3c
 rest_init+0xec/0x100
 arch_call_rest_init+0x1c/0x28
 start_kernel+0x6ec/0x72c
 __primary_switched+0xa4/0xac
---[ end trace 0000000000000000 ]---
hrtimer: interrupt took 16422000 ns

--D


> > 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
> > 
> > 	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
> > 		folio_end_writeback(folio);
> > }
> > 
> > ...before it also tripped over the same BUG at mm/filemap.c:1653.
> > 
> > --D
> > 
> > > > 
> > > > 
> > > > --D
> > > > 
> > > > run fstests generic/068 at 2022-04-12 17:57:11
> > > > XFS (sda3): Mounting V5 Filesystem
> > > > XFS (sda3): Ending clean mount
> > > > XFS (sda4): Mounting V5 Filesystem
> > > > XFS (sda4): Ending clean mount
> > > > ------------[ cut here ]------------
> > > > kernel BUG at mm/filemap.c:1653!
> > > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > > Call Trace:
> > > >  <TASK>
> > > >  iomap_finish_ioend+0x19e/0x560
> > > >  iomap_finish_ioends+0x69/0x100
> > > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > >  process_one_work+0x1df/0x3c0
> > > >  ? rescuer_thread+0x3b0/0x3b0
> > > >  worker_thread+0x53/0x3b0
> > > >  ? rescuer_thread+0x3b0/0x3b0
> > > >  kthread+0xea/0x110
> > > >  ? kthread_complete_and_exit+0x20/0x20
> > > >  ret_from_fork+0x1f/0x30
> > > >  </TASK>
> > > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > > Dumping ftrace buffer:
> > > >    (ftrace buffer empty)
> > > > ---[ end trace 0000000000000000 ]---
> > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-18 17:47   ` Darrick J. Wong
@ 2022-04-20  0:37     ` Darrick J. Wong
  2022-04-22 21:59     ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2022-04-20  0:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: xfs, fstests, linux-fsdevel

On Mon, Apr 18, 2022 at 10:47:47AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > else seen this, or should I keep digging?  This is a fairly boring
> > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > 
> > I have not seen this before.  I test with:
> > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > 
> > Maybe I should try a 4096 byte block size.
> > 
> > > mm/filemap.c:1653 is the BUG in:
> > > 
> > > void folio_end_writeback(struct folio *folio)
> > > {
> > > 	/*
> > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > 	 * atomic operation and overkill in this particular case.
> > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > 	 * at the end of every folio writeback.
> > > 	 */
> > > 	if (folio_test_reclaim(folio)) {
> > > 		folio_clear_reclaim(folio);
> > > 		folio_rotate_reclaimable(folio);
> > > 	}
> > > 
> > > 	/*
> > > 	 * Writeback does not hold a folio reference of its own, relying
> > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > 	 * But here we must make sure that the folio is not freed and
> > > 	 * reused before the folio_wake().
> > > 	 */
> > > 	folio_get(folio);
> > > 	if (!__folio_end_writeback(folio))
> > > >>>>		BUG();
> > 
> > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > information about the folio (like whether it has an iop, or what order
> > the folio is).  Can you make that change and try to reproduce?
> 
> > What's going on here is that we've called folio_end_writeback() on a
> > folio which does not have the writeback flag set.  That _should_ be
> > impossible, hence the use of BUG().  Either we've called
> > folio_end_writeback() twice on the same folio, or we neglected to set
> > the writeback flag on the folio.  I don't immediately see why either
> > of those two things would happen.
> 
> Well, I made that change and rebased to -rc3 to see if reverting that
> ZERO_PAGE thing would produce better results, I think I just got the
> same crash.  Curiously, the only VM that died this time was the one
> running the realtime configuration, but it's still generic/068:
> 
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 oci-mtr28 5.18.0-rc3-djwx #rc3 SMP PREEMPT_DYNAMIC Sun Apr 17 14:42:49 PDT 2022
> MKFS_OPTIONS  -- -f -rrtdev=/dev/sdb4 -llogdev=/dev/sdb2 -m reflink=0,rmapbt=0, -d rtinherit=1, /dev/sda4
> MOUNT_OPTIONS -- -ortdev=/dev/sdb4 -ologdev=/dev/sdb2 /dev/sda4 /opt
> 
> I don't know if it'll help, but here's the sequence of tests that we
> were running just prior to crashing:
> 
> generic/445      3s
> generic/225      76s
> xfs/306  22s
> xfs/290  3s
> generic/155     [not run] Reflink not supported by test filesystem type: xfs
> generic/525      6s
> generic/269      89s
> generic/1206    [not run] xfs_io swapext -v vfs -s 64k -l 64k ioctl support is missing
> xfs/504  198s
> xfs/192 [not run] Reflink not supported by scratch filesystem type: xfs
> xfs/303  1s
> generic/346      6s
> generic/512      5s
> xfs/227  308s
> generic/147     [not run] Reflink not supported by test filesystem type: xfs
> generic/230     [not run] Quotas not supported on realtime test device
> generic/008      4s
> generic/108      4s
> xfs/264  12s
> generic/200     [not run] Reflink not supported by scratch filesystem type: xfs
> generic/493     [not run] Dedupe not supported by scratch filesystem type: xfs
> xfs/021  5s
> generic/672     [not run] Reflink not supported by scratch filesystem type: xfs
> xfs/493  5s
> xfs/146  13s
> xfs/315 [not run] Reflink not supported by scratch filesystem type: xfs
> generic/068     
> 
> And the dmesg output:
> 
> run fstests generic/068 at 2022-04-17 16:57:16
> XFS (sda4): Mounting V5 Filesystem
> XFS (sda4): Ending clean mount
> page:ffffea0004a39c40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x128e71
> flags: 0x17ff80000000000(node=0|zone=2|lastcpupid=0xfff)
> raw: 017ff80000000000 0000000000000000 ffffffff00000203 0000000000000000
> raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1164!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 3 PID: 1094085 Comm: 3:0 Tainted: G        W         5.18.0-rc3-djwx #rc3 0a707744ee7c555d54e50726c5b02515710a6aae
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> RIP: 0010:folio_end_writeback+0xd0/0x110
> Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> Call Trace:
>  <TASK>
>  iomap_finish_ioend+0x1ee/0x6a0
>  iomap_finish_ioends+0x69/0x100
>  xfs_end_ioend+0x5a/0x160 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
>  xfs_end_io+0xb1/0xf0 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
>  process_one_work+0x1df/0x3c0
>  ? rescuer_thread+0x3b0/0x3b0
>  worker_thread+0x53/0x3b0
>  ? rescuer_thread+0x3b0/0x3b0
>  kthread+0xea/0x110
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
> Modules linked in: xfs dm_zero btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_delay dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:folio_end_writeback+0xd0/0x110
> Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> 
> --D


Today I managed to capture stack traces of all the D state processes on
a system with the hang described above:

/proc/41/comm = khugepaged
/proc/41/stack : [<0>] flush_work+0x5c/0x80
[<0>] __lru_add_drain_all+0x134/0x1c0
[<0>] khugepaged+0x6c/0x2460
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30

Not sure if this is involved, but it might be?
These systems /do/ have KSM and HUGEPAGE_ALWAYS turned on.

/proc/411335/comm = u8:1+events_unbound
/proc/411335/stack : [<0>] __synchronize_srcu.part.0+0x86/0xf0
[<0>] fsnotify_mark_destroy_workfn+0x7c/0x110
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30
/proc/973682/comm = u8:0+flush-8:0
/proc/973682/stack : [<0>] down+0x54/0x70
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x356/0x880 [xfs]
[<0>] xfs_buf_get_map+0x46/0x3b0 [xfs]
[<0>] xfs_buf_read_map+0x52/0x2f0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1bb/0x4a0 [xfs]
[<0>] xfs_btree_read_buf_block.constprop.0+0x96/0xd0 [xfs]
[<0>] xfs_btree_lookup_get_block+0x97/0x170 [xfs]
[<0>] xfs_btree_lookup+0xdd/0x540 [xfs]
[<0>] xfs_rmap_map+0xd0/0x860 [xfs]
[<0>] xfs_rmap_finish_one+0x243/0x300 [xfs]
[<0>] xfs_rmap_update_finish_item+0x37/0x70 [xfs]
[<0>] xfs_defer_finish_noroll+0x20a/0x6f0 [xfs]
[<0>] __xfs_trans_commit+0x153/0x3e0 [xfs]
[<0>] xfs_bmapi_convert_delalloc+0x495/0x5e0 [xfs]
[<0>] xfs_map_blocks+0x1ed/0x540 [xfs]
[<0>] iomap_do_writepage+0x2a3/0xae0
[<0>] write_cache_pages+0x224/0x6f0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x7a/0xb0 [xfs]
[<0>] do_writepages+0xcc/0x1c0
[<0>] __writeback_single_inode+0x41/0x340
[<0>] writeback_sb_inodes+0x207/0x4a0
[<0>] __writeback_inodes_wb+0x4c/0xe0
[<0>] wb_writeback+0x1da/0x2c0
[<0>] wb_workfn+0x28c/0x500
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30

Writeback is stuck in an rmapbt update...

/proc/1139791/comm = u8:2+events_unbound
/proc/1139791/stack : [<0>] __synchronize_srcu.part.0+0x86/0xf0
[<0>] fsnotify_connector_destroy_workfn+0x3c/0x60
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30
/proc/1274752/comm = u8:3+xfs-blockgc/sda4
/proc/1274752/stack : [<0>] down+0x54/0x70
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x356/0x880 [xfs]
[<0>] xfs_buf_get_map+0x46/0x3b0 [xfs]
[<0>] xfs_buf_read_map+0x52/0x2f0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1bb/0x4a0 [xfs]
[<0>] xfs_read_agf+0xb2/0x1b0 [xfs]
[<0>] xfs_alloc_read_agf+0x54/0x370 [xfs]
[<0>] xfs_alloc_fix_freelist+0x38e/0x510 [xfs]
[<0>] xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
[<0>] xfs_rmap_finish_one+0xd9/0x300 [xfs]
[<0>] xfs_rmap_update_finish_item+0x37/0x70 [xfs]
[<0>] xfs_defer_finish_noroll+0x20a/0x6f0 [xfs]
[<0>] xfs_defer_finish+0x11/0xa0 [xfs]
[<0>] xfs_itruncate_extents_flags+0x14b/0x4b0 [xfs]
[<0>] xfs_free_eofblocks+0xe9/0x150 [xfs]
[<0>] xfs_icwalk_ag+0x4a7/0x800 [xfs]
[<0>] xfs_blockgc_worker+0x31/0x110 [xfs]
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30

Stuck waiting for the AGF to do an rmapbt update...

/proc/1315282/comm = fstest
/proc/1315282/stack : [<0>] folio_wait_bit_common+0x148/0x460
[<0>] folio_wait_writeback+0x22/0x80
[<0>] truncate_inode_pages_range+0x3fe/0x6f0
[<0>] truncate_pagecache+0x44/0x60
[<0>] xfs_setattr_size+0x163/0x4d0 [xfs]
[<0>] xfs_vn_setattr+0x75/0x180 [xfs]
[<0>] notify_change+0x306/0x500
[<0>] do_truncate+0x7d/0xd0
[<0>] path_openat+0xc60/0x1060
[<0>] do_filp_open+0xa9/0x150
[<0>] do_sys_openat2+0x97/0x160
[<0>] __x64_sys_openat+0x54/0x90
[<0>] do_syscall_64+0x35/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
/proc/1315283/comm = fstest
/proc/1315283/stack : [<0>] folio_wait_bit_common+0x148/0x460
[<0>] folio_wait_writeback+0x22/0x80
[<0>] __filemap_fdatawait_range+0x8c/0x250
[<0>] filemap_write_and_wait_range+0x85/0xf0
[<0>] xfs_setattr_size+0x103/0x4d0 [xfs]
[<0>] xfs_vn_setattr+0x75/0x180 [xfs]
[<0>] notify_change+0x306/0x500
[<0>] do_truncate+0x7d/0xd0
[<0>] path_openat+0xc60/0x1060
[<0>] do_filp_open+0xa9/0x150
[<0>] do_sys_openat2+0x97/0x160
[<0>] __x64_sys_openat+0x54/0x90
[<0>] do_syscall_64+0x35/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
/proc/1315287/comm = fsstress
/proc/1315287/stack : [<0>] folio_wait_bit_common+0x148/0x460
[<0>] folio_wait_writeback+0x22/0x80
[<0>] __filemap_fdatawait_range+0x8c/0x250
[<0>] filemap_write_and_wait_range+0x85/0xf0
[<0>] xfs_setattr_size+0x2da/0x4d0 [xfs]
[<0>] xfs_vn_setattr+0x75/0x180 [xfs]
[<0>] notify_change+0x306/0x500
[<0>] do_truncate+0x7d/0xd0
[<0>] vfs_truncate+0x108/0x140
[<0>] do_sys_truncate.part.0+0x8a/0xa0
[<0>] do_syscall_64+0x35/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Clearly stuck on writeback, not sure if it's the same file as the one
that tripped the assertion...

/proc/1315478/comm = u8:4+xfs-blockgc/sda4
/proc/1315478/stack : [<0>] down+0x54/0x70
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x356/0x880 [xfs]
[<0>] xfs_buf_get_map+0x46/0x3b0 [xfs]
[<0>] xfs_buf_read_map+0x52/0x2f0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1bb/0x4a0 [xfs]
[<0>] xfs_imap_to_bp+0x4e/0x70 [xfs]
[<0>] xfs_trans_log_inode+0x1da/0x350 [xfs]
[<0>] __xfs_bunmapi+0x7e8/0xd30 [xfs]
[<0>] xfs_itruncate_extents_flags+0x13d/0x4b0 [xfs]
[<0>] xfs_free_eofblocks+0xe9/0x150 [xfs]
[<0>] xfs_icwalk_ag+0x4a7/0x800 [xfs]
[<0>] xfs_blockgc_worker+0x31/0x110 [xfs]
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30
/proc/1315479/comm = u8:5+xfs-blockgc/sda4
/proc/1315479/stack : [<0>] down+0x54/0x70
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x356/0x880 [xfs]
[<0>] xfs_buf_get_map+0x46/0x3b0 [xfs]
[<0>] xfs_buf_read_map+0x52/0x2f0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1bb/0x4a0 [xfs]
[<0>] xfs_imap_to_bp+0x4e/0x70 [xfs]
[<0>] xfs_trans_log_inode+0x1da/0x350 [xfs]
[<0>] __xfs_bunmapi+0x7e8/0xd30 [xfs]
[<0>] xfs_itruncate_extents_flags+0x13d/0x4b0 [xfs]
[<0>] xfs_free_eofblocks+0xe9/0x150 [xfs]
[<0>] xfs_icwalk_ag+0x4a7/0x800 [xfs]
[<0>] xfs_blockgc_worker+0x31/0x110 [xfs]
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30

Both of these threads are stuck on the inode cluster buffer.  If I had
to bet, I'd guess it's the one underlying the inode being processed by
pid 973682.

/proc/1315480/comm = u8:6+xfs-blockgc/sda4
/proc/1315480/stack : [<0>] down+0x54/0x70
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x356/0x880 [xfs]
[<0>] xfs_buf_get_map+0x46/0x3b0 [xfs]
[<0>] xfs_buf_read_map+0x52/0x2f0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1bb/0x4a0 [xfs]
[<0>] xfs_read_agf+0xb2/0x1b0 [xfs]
[<0>] xfs_alloc_read_agf+0x54/0x370 [xfs]
[<0>] xfs_alloc_fix_freelist+0x38e/0x510 [xfs]
[<0>] xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
[<0>] xfs_rmap_finish_one+0xd9/0x300 [xfs]
[<0>] xfs_rmap_update_finish_item+0x37/0x70 [xfs]
[<0>] xfs_defer_finish_noroll+0x20a/0x6f0 [xfs]
[<0>] xfs_defer_finish+0x11/0xa0 [xfs]
[<0>] xfs_itruncate_extents_flags+0x14b/0x4b0 [xfs]
[<0>] xfs_free_eofblocks+0xe9/0x150 [xfs]
[<0>] xfs_icwalk_ag+0x4a7/0x800 [xfs]
[<0>] xfs_blockgc_worker+0x31/0x110 [xfs]
[<0>] process_one_work+0x1df/0x3c0
[<0>] worker_thread+0x53/0x3b0
[<0>] kthread+0xea/0x110
[<0>] ret_from_fork+0x1f/0x30

Speculative preallocation garbage collection also got stuck waiting for
the AGF to do an rmapbt update.
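
(For reference, the per-process dumps above came from /proc/<pid>/stack.
A minimal standalone sketch along the lines of the program below, run as
root, produces the same kind of output; it's only an illustration, not
the exact tool used to collect the traces above.)

/*
 * dstacks.c: a minimal sketch (hypothetical name/tool, for illustration).
 * Walks /proc and prints the kernel stack of every task in D state.
 * Build: cc -o dstacks dstacks.c; run as root so /proc/<pid>/stack is
 * readable.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;

	if (!proc)
		return 1;

	while ((de = readdir(proc)) != NULL) {
		char path[64], comm[64] = "", line[256];
		char state = 0;
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		/* field 3 of /proc/<pid>/stat is the task state */
		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%*d (%63[^)]) %c", comm, &state) != 2)
			state = 0;
		fclose(f);
		if (state != 'D')	/* only uninterruptible sleepers */
			continue;

		printf("/proc/%s/comm = %s\n", de->d_name, comm);

		snprintf(path, sizeof(path), "/proc/%s/stack", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		printf("/proc/%s/stack : ", de->d_name);
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
	}
	closedir(proc);
	return 0;
}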

--D

> 
> > 
> > > 
> > > 
> > > --D
> > > 
> > > run fstests generic/068 at 2022-04-12 17:57:11
> > > XFS (sda3): Mounting V5 Filesystem
> > > XFS (sda3): Ending clean mount
> > > XFS (sda4): Mounting V5 Filesystem
> > > XFS (sda4): Ending clean mount
> > > ------------[ cut here ]------------
> > > kernel BUG at mm/filemap.c:1653!
> > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > Call Trace:
> > >  <TASK>
> > >  iomap_finish_ioend+0x19e/0x560
> > >  iomap_finish_ioends+0x69/0x100
> > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  process_one_work+0x1df/0x3c0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  worker_thread+0x53/0x3b0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  kthread+0xea/0x110
> > >  ? kthread_complete_and_exit+0x20/0x20
> > >  ret_from_fork+0x1f/0x30
> > >  </TASK>
> > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > ---[ end trace 0000000000000000 ]---
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-18 17:47   ` Darrick J. Wong
  2022-04-20  0:37     ` Darrick J. Wong
@ 2022-04-22 21:59     ` Darrick J. Wong
  2022-04-28 15:53       ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2022-04-22 21:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: xfs, fstests, linux-fsdevel

On Mon, Apr 18, 2022 at 10:47:47AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > else seen this, or should I keep digging?  This is a fairly boring
> > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > 
> > I have not seen this before.  I test with:
> > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > 
> > Maybe I should try a 4096 byte block size.
> > 
> > > mm/filemap.c:1653 is the BUG in:
> > > 
> > > void folio_end_writeback(struct folio *folio)
> > > {
> > > 	/*
> > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > 	 * atomic operation and overkill in this particular case.
> > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > 	 * at the end of every folio writeback.
> > > 	 */
> > > 	if (folio_test_reclaim(folio)) {
> > > 		folio_clear_reclaim(folio);
> > > 		folio_rotate_reclaimable(folio);
> > > 	}
> > > 
> > > 	/*
> > > 	 * Writeback does not hold a folio reference of its own, relying
> > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > 	 * But here we must make sure that the folio is not freed and
> > > 	 * reused before the folio_wake().
> > > 	 */
> > > 	folio_get(folio);
> > > 	if (!__folio_end_writeback(folio))
> > > >>>>		BUG();
> > 
> > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > information about the folio (like whether it has an iop, or what order
> > the folio is).  Can you make that change and try to reproduce?
> 
> > What's going on here is that we've called folio_end_writeback() on a
> > folio which does not have the writeback flag set.  That _should_ be
> > impossible, hence the use of BUG().  Either we've called
> > folio_end_writeback() twice on the same folio, or we neglected to set
> > the writeback flag on the folio.  I don't immediately see why either
> > of those two things would happen.
> 
> Well, I made that change and rebased to -rc3 to see if reverting that
> ZERO_PAGE thing would produce better results, I think I just got the
> same crash.  Curiously, the only VM that died this time was the one
> running the realtime configuration, but it's still generic/068:
> 
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 oci-mtr28 5.18.0-rc3-djwx #rc3 SMP PREEMPT_DYNAMIC Sun Apr 17 14:42:49 PDT 2022
> MKFS_OPTIONS  -- -f -rrtdev=/dev/sdb4 -llogdev=/dev/sdb2 -m reflink=0,rmapbt=0, -d rtinherit=1, /dev/sda4
> MOUNT_OPTIONS -- -ortdev=/dev/sdb4 -ologdev=/dev/sdb2 /dev/sda4 /opt
> 
> I don't know if it'll help, but here's the sequence of tests that we
> were running just prior to crashing:
> 
> generic/445      3s
> generic/225      76s
> xfs/306  22s
> xfs/290  3s
> generic/155     [not run] Reflink not supported by test filesystem type: xfs
> generic/525      6s
> generic/269      89s
> generic/1206    [not run] xfs_io swapext -v vfs -s 64k -l 64k ioctl support is missing
> xfs/504  198s
> xfs/192 [not run] Reflink not supported by scratch filesystem type: xfs
> xfs/303  1s
> generic/346      6s
> generic/512      5s
> xfs/227  308s
> generic/147     [not run] Reflink not supported by test filesystem type: xfs
> generic/230     [not run] Quotas not supported on realtime test device
> generic/008      4s
> generic/108      4s
> xfs/264  12s
> generic/200     [not run] Reflink not supported by scratch filesystem type: xfs
> generic/493     [not run] Dedupe not supported by scratch filesystem type: xfs
> xfs/021  5s
> generic/672     [not run] Reflink not supported by scratch filesystem type: xfs
> xfs/493  5s
> xfs/146  13s
> xfs/315 [not run] Reflink not supported by scratch filesystem type: xfs
> generic/068     
> 
> And the dmesg output:
> 
> run fstests generic/068 at 2022-04-17 16:57:16
> XFS (sda4): Mounting V5 Filesystem
> XFS (sda4): Ending clean mount
> page:ffffea0004a39c40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x128e71
> flags: 0x17ff80000000000(node=0|zone=2|lastcpupid=0xfff)
> raw: 017ff80000000000 0000000000000000 ffffffff00000203 0000000000000000
> raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1164!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 3 PID: 1094085 Comm: 3:0 Tainted: G        W         5.18.0-rc3-djwx #rc3 0a707744ee7c555d54e50726c5b02515710a6aae
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> RIP: 0010:folio_end_writeback+0xd0/0x110
> Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> Call Trace:
>  <TASK>
>  iomap_finish_ioend+0x1ee/0x6a0
>  iomap_finish_ioends+0x69/0x100
>  xfs_end_ioend+0x5a/0x160 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
>  xfs_end_io+0xb1/0xf0 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
>  process_one_work+0x1df/0x3c0
>  ? rescuer_thread+0x3b0/0x3b0
>  worker_thread+0x53/0x3b0
>  ? rescuer_thread+0x3b0/0x3b0
>  kthread+0xea/0x110
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
> Modules linked in: xfs dm_zero btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_delay dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:folio_end_writeback+0xd0/0x110
> Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0

Hmm.  I think you might be on to something about this being some
mis-interaction when multi-page folios get truncated or possibly just
split?  The VM_BUG_ON points to pfn 0x206a79, which I /think/ is the
second page of the four-page folio starting with pfn=0x206a78?
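(If that folio really is four pages, it covers pfns 0x206a78 through
0x206a7b, which would make 0x206a79 its first tail page rather than a
folio head.)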

    <...>-5263      3..... 276242022us : page_ref_mod: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
    <...>-5263      3..... 276242023us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
    <...>-5263      3..... 276242023us : page_ref_mod: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
    <...>-5263      3..... 276242024us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
    <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
   fstest-12612     2..... 276242082us : page_ref_mod_unless: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242084us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242084us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
   fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
   fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
   fstest-12612     2..... 276242110us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
   fstest-12612     2..... 276242111us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
   fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242112us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242119us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
    <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
    <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79

In which case, that output reduces to:

    <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
   fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
   fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
   fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
   fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
    <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
    <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79

Clearly the mapping gets torn off of this folio before time index
276242102us while it still has refcount==6.  It's curious that it lost
the writeback bit at the first line.

Unfortunately, the VM_BUG_ON means that the kernel dumps the ftrace
buffer and crashes, so I can't actually do anything with the incomplete
trace-cmd files that were recorded earlier.

--D

> --D
> 
> > 
> > > 
> > > 
> > > --D
> > > 
> > > run fstests generic/068 at 2022-04-12 17:57:11
> > > XFS (sda3): Mounting V5 Filesystem
> > > XFS (sda3): Ending clean mount
> > > XFS (sda4): Mounting V5 Filesystem
> > > XFS (sda4): Ending clean mount
> > > ------------[ cut here ]------------
> > > kernel BUG at mm/filemap.c:1653!
> > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > Call Trace:
> > >  <TASK>
> > >  iomap_finish_ioend+0x19e/0x560
> > >  iomap_finish_ioends+0x69/0x100
> > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > >  process_one_work+0x1df/0x3c0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  worker_thread+0x53/0x3b0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  kthread+0xea/0x110
> > >  ? kthread_complete_and_exit+0x20/0x20
> > >  ret_from_fork+0x1f/0x30
> > >  </TASK>
> > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > ---[ end trace 0000000000000000 ]---
> > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-22 21:59     ` Darrick J. Wong
@ 2022-04-28 15:53       ` Brian Foster
  2022-04-30  3:10         ` Darrick J. Wong
  2022-04-30  3:44         ` Matthew Wilcox
  0 siblings, 2 replies; 22+ messages in thread
From: Brian Foster @ 2022-04-28 15:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, xfs, fstests, linux-fsdevel

On Fri, Apr 22, 2022 at 02:59:43PM -0700, Darrick J. Wong wrote:
> On Mon, Apr 18, 2022 at 10:47:47AM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > > else seen this, or should I keep digging?  This is a fairly boring
> > > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > > 
> > > I have not seen this before.  I test with:
> > > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > > 
> > > Maybe I should try a 4096 byte block size.
> > > 
> > > > mm/filemap.c:1653 is the BUG in:
> > > > 
> > > > void folio_end_writeback(struct folio *folio)
> > > > {
> > > > 	/*
> > > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > > 	 * atomic operation and overkill in this particular case.
> > > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > > 	 * at the end of every folio writeback.
> > > > 	 */
> > > > 	if (folio_test_reclaim(folio)) {
> > > > 		folio_clear_reclaim(folio);
> > > > 		folio_rotate_reclaimable(folio);
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * Writeback does not hold a folio reference of its own, relying
> > > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > > 	 * But here we must make sure that the folio is not freed and
> > > > 	 * reused before the folio_wake().
> > > > 	 */
> > > > 	folio_get(folio);
> > > > 	if (!__folio_end_writeback(folio))
> > > > >>>>		BUG();
> > > 
> > > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > > information about the folio (like whether it has an iop, or what order
> > > the folio is).  Can you make that change and try to reproduce?
> > 
> > > What's going on here is that we've called folio_end_writeback() on a
> > > folio which does not have the writeback flag set.  That _should_ be
> > > impossible, hence the use of BUG().  Either we've called
> > > folio_end_writeback() twice on the same folio, or we neglected to set
> > > the writeback flag on the folio.  I don't immediately see why either
> > > of those two things would happen.
> > 
> > Well, I made that change and rebased to -rc3 to see if reverting that
> > ZERO_PAGE thing would produce better results, I think I just got the
> > same crash.  Curiously, the only VM that died this time was the one
> > running the realtime configuration, but it's still generic/068:
> > 
> > FSTYP         -- xfs (debug)
> > PLATFORM      -- Linux/x86_64 oci-mtr28 5.18.0-rc3-djwx #rc3 SMP PREEMPT_DYNAMIC Sun Apr 17 14:42:49 PDT 2022
> > MKFS_OPTIONS  -- -f -rrtdev=/dev/sdb4 -llogdev=/dev/sdb2 -m reflink=0,rmapbt=0, -d rtinherit=1, /dev/sda4
> > MOUNT_OPTIONS -- -ortdev=/dev/sdb4 -ologdev=/dev/sdb2 /dev/sda4 /opt
> > 
> > I don't know if it'll help, but here's the sequence of tests that we
> > were running just prior to crashing:
> > 
> > generic/445      3s
> > generic/225      76s
> > xfs/306  22s
> > xfs/290  3s
> > generic/155     [not run] Reflink not supported by test filesystem type: xfs
> > generic/525      6s
> > generic/269      89s
> > generic/1206    [not run] xfs_io swapext -v vfs -s 64k -l 64k ioctl support is missing
> > xfs/504  198s
> > xfs/192 [not run] Reflink not supported by scratch filesystem type: xfs
> > xfs/303  1s
> > generic/346      6s
> > generic/512      5s
> > xfs/227  308s
> > generic/147     [not run] Reflink not supported by test filesystem type: xfs
> > generic/230     [not run] Quotas not supported on realtime test device
> > generic/008      4s
> > generic/108      4s
> > xfs/264  12s
> > generic/200     [not run] Reflink not supported by scratch filesystem type: xfs
> > generic/493     [not run] Dedupe not supported by scratch filesystem type: xfs
> > xfs/021  5s
> > generic/672     [not run] Reflink not supported by scratch filesystem type: xfs
> > xfs/493  5s
> > xfs/146  13s
> > xfs/315 [not run] Reflink not supported by scratch filesystem type: xfs
> > generic/068     
> > 
> > And the dmesg output:
> > 
> > run fstests generic/068 at 2022-04-17 16:57:16
> > XFS (sda4): Mounting V5 Filesystem
> > XFS (sda4): Ending clean mount
> > page:ffffea0004a39c40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x128e71
> > flags: 0x17ff80000000000(node=0|zone=2|lastcpupid=0xfff)
> > raw: 017ff80000000000 0000000000000000 ffffffff00000203 0000000000000000
> > raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> > page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
> > ------------[ cut here ]------------
> > kernel BUG at include/linux/mm.h:1164!
> > invalid opcode: 0000 [#1] PREEMPT SMP
> > CPU: 3 PID: 1094085 Comm: 3:0 Tainted: G        W         5.18.0-rc3-djwx #rc3 0a707744ee7c555d54e50726c5b02515710a6aae
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > RIP: 0010:folio_end_writeback+0xd0/0x110
> > Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> > RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> > RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> > RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> > R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> > R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> > FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> > Call Trace:
> >  <TASK>
> >  iomap_finish_ioend+0x1ee/0x6a0
> >  iomap_finish_ioends+0x69/0x100
> >  xfs_end_ioend+0x5a/0x160 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
> >  xfs_end_io+0xb1/0xf0 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
> >  process_one_work+0x1df/0x3c0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  worker_thread+0x53/0x3b0
> >  ? rescuer_thread+0x3b0/0x3b0
> >  kthread+0xea/0x110
> >  ? kthread_complete_and_exit+0x20/0x20
> >  ret_from_fork+0x1f/0x30
> >  </TASK>
> > Modules linked in: xfs dm_zero btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_delay dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > ---[ end trace 0000000000000000 ]---
> > RIP: 0010:folio_end_writeback+0xd0/0x110
> > Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> > RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> > RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> > RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> > R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> > R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> > FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> 
> Hmm.  I think you might be on to something about this being some
> mis-interaction when multi-page folios get truncated or possibly just
> split?  The VM_BUG_ON points to pfn 0x206a79, which I /think/ is the
> second page of the four-page folio starting with pfn=0x206a78?
> 

The above is the variant of generic/068 failure I was reproducing and
used to bisect [1]. With some additional tracing added to ioend
completion, what I'm seeing is that the bio_for_each_folio_all() bvec
iteration basically seems to go off the rails. What happens more
specifically is that at some point during the loop, bio_next_folio()
actually lands into the second page of the just processed folio instead
of the actual next folio (i.e. as if it's walking to the next page from
the head page of the folio instead of to the next 16k folio). I suspect
completion is racing with some form of truncation/reclaim/invalidation
here (what exactly, I don't know) that perhaps breaks down the folio and
renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
that theory, I open coded and modified the loop to something like the
following:

                /* advance the iterator before the current folio's
                 * writeback state is cleared (and the folio possibly
                 * released) */
                for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
                        f = fi.folio;
                        l = fi.length;
                        bio_next_folio(&fi, bio);
                        iomap_finish_folio_write(inode, f, l, error);
                        folio_count++;
                }

... to avoid accessing folio metadata after writeback is cleared on it,
and this seems to make the problem disappear (so far; I'll need to let
this spin for a while longer to be completely confident in that).
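
For contrast, here's roughly what the stock bio_for_each_folio_all()
iteration expands to, per my reading of the 5.18-era bio_first_folio()
and bio_next_folio() helpers (a sketch only; the comments mark where I
suspect the window is):

                for (bio_first_folio(&fi, bio, 0); fi.folio;
                     bio_next_folio(&fi, bio)) {
                        /*
                         * This may clear the writeback bit that a racing
                         * truncate is waiting on for this folio...
                         */
                        iomap_finish_folio_write(inode, fi.folio, fi.length,
                                        error);
                        /*
                         * ...so by the time the loop increment calls
                         * folio_next(fi.folio) (effectively fi.folio plus
                         * folio_nr_pages(fi.folio)), the folio may already
                         * have been split or freed, and the "next" folio
                         * can end up being the second page of the one we
                         * just finished instead of the next 16k folio.
                         */
                }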

Brian

[1] https://lore.kernel.org/linux-xfs/YmlOfJljvI49sZyW@bfoster/

>     <...>-5263      3..... 276242022us : page_ref_mod: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
>     <...>-5263      3..... 276242023us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>     <...>-5263      3..... 276242023us : page_ref_mod: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
>     <...>-5263      3..... 276242024us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>     <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
>    fstest-12612     2..... 276242082us : page_ref_mod_unless: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242084us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242084us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
>    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
>    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
>    fstest-12612     2..... 276242110us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
>    fstest-12612     2..... 276242111us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
>    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242112us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242119us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>     <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
>     <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79
> 
> In which case, that output reduces to:
> 
>     <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
>    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
>    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
>    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
>    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
>     <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
>     <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79
> 
> Clearly the mapping gets torn off of this folio before time index
> 276242102us while it still has refcount==6.  It's curious that it lost
> the writeback bit at the first line.
> 
> Unfortunately, the VM_BUG_ON means that the kernel dumps the ftrace
> buffer and crashes, so I can't actually do anything with the incomplete
> trace-cmd files that were recorded earlier.
> 
> --D
> 
> > --D
> > 
> > > 
> > > > 
> > > > 
> > > > --D
> > > > 
> > > > run fstests generic/068 at 2022-04-12 17:57:11
> > > > XFS (sda3): Mounting V5 Filesystem
> > > > XFS (sda3): Ending clean mount
> > > > XFS (sda4): Mounting V5 Filesystem
> > > > XFS (sda4): Ending clean mount
> > > > ------------[ cut here ]------------
> > > > kernel BUG at mm/filemap.c:1653!
> > > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > > Call Trace:
> > > >  <TASK>
> > > >  iomap_finish_ioend+0x19e/0x560
> > > >  iomap_finish_ioends+0x69/0x100
> > > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > >  process_one_work+0x1df/0x3c0
> > > >  ? rescuer_thread+0x3b0/0x3b0
> > > >  worker_thread+0x53/0x3b0
> > > >  ? rescuer_thread+0x3b0/0x3b0
> > > >  kthread+0xea/0x110
> > > >  ? kthread_complete_and_exit+0x20/0x20
> > > >  ret_from_fork+0x1f/0x30
> > > >  </TASK>
> > > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > > Dumping ftrace buffer:
> > > >    (ftrace buffer empty)
> > > > ---[ end trace 0000000000000000 ]---
> > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > > 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-28 15:53       ` Brian Foster
@ 2022-04-30  3:10         ` Darrick J. Wong
  2022-04-30  3:44         ` Matthew Wilcox
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2022-04-30  3:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: Matthew Wilcox, xfs, fstests, linux-fsdevel

On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> On Fri, Apr 22, 2022 at 02:59:43PM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 18, 2022 at 10:47:47AM -0700, Darrick J. Wong wrote:
> > > On Wed, Apr 13, 2022 at 03:50:32PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Apr 12, 2022 at 08:34:25PM -0700, Darrick J. Wong wrote:
> > > > > Hmm.  Two nights in a row I've seen the following crash.  Has anyone
> > > > > else seen this, or should I keep digging?  This is a fairly boring
> > > > > x86_64 VM with a XFS v5 filesystem + rmapbt.
> > > > 
> > > > I have not seen this before.  I test with:
> > > > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> > > > 
> > > > Maybe I should try a 4096 byte block size.
> > > > 
> > > > > mm/filemap.c:1653 is the BUG in:
> > > > > 
> > > > > void folio_end_writeback(struct folio *folio)
> > > > > {
> > > > > 	/*
> > > > > 	 * folio_test_clear_reclaim() could be used here but it is an
> > > > > 	 * atomic operation and overkill in this particular case.
> > > > > 	 * Failing to shuffle a folio marked for immediate reclaim is
> > > > > 	 * too mild a gain to justify taking an atomic operation penalty
> > > > > 	 * at the end of every folio writeback.
> > > > > 	 */
> > > > > 	if (folio_test_reclaim(folio)) {
> > > > > 		folio_clear_reclaim(folio);
> > > > > 		folio_rotate_reclaimable(folio);
> > > > > 	}
> > > > > 
> > > > > 	/*
> > > > > 	 * Writeback does not hold a folio reference of its own, relying
> > > > > 	 * on truncation to wait for the clearing of PG_writeback.
> > > > > 	 * But here we must make sure that the folio is not freed and
> > > > > 	 * reused before the folio_wake().
> > > > > 	 */
> > > > > 	folio_get(folio);
> > > > > 	if (!__folio_end_writeback(folio))
> > > > > >>>>		BUG();
> > > > 
> > > > Grr, that should have been a VM_BUG_ON_FOLIO(1, folio) so we get useful
> > > > information about the folio (like whether it has an iop, or what order
> > > > the folio is).  Can you make that change and try to reproduce?
> > > 
> > > > What's going on here is that we've called folio_end_writeback() on a
> > > > folio which does not have the writeback flag set.  That _should_ be
> > > > impossible, hence the use of BUG().  Either we've called
> > > > folio_end_writeback() twice on the same folio, or we neglected to set
> > > > the writeback flag on the folio.  I don't immediately see why either
> > > > of those two things would happen.
> > > 
> > > Well, I made that change and rebased to -rc3 to see if reverting that
> > > ZERO_PAGE thing would produce better results, and I think I just got the
> > > same crash.  Curiously, the only VM that died this time was the one
> > > running the realtime configuration, but it's still generic/068:
> > > 
> > > FSTYP         -- xfs (debug)
> > > PLATFORM      -- Linux/x86_64 oci-mtr28 5.18.0-rc3-djwx #rc3 SMP PREEMPT_DYNAMIC Sun Apr 17 14:42:49 PDT 2022
> > > MKFS_OPTIONS  -- -f -rrtdev=/dev/sdb4 -llogdev=/dev/sdb2 -m reflink=0,rmapbt=0, -d rtinherit=1, /dev/sda4
> > > MOUNT_OPTIONS -- -ortdev=/dev/sdb4 -ologdev=/dev/sdb2 /dev/sda4 /opt
> > > 
> > > I don't know if it'll help, but here's the sequence of tests that we
> > > were running just prior to crashing:
> > > 
> > > generic/445      3s
> > > generic/225      76s
> > > xfs/306  22s
> > > xfs/290  3s
> > > generic/155     [not run] Reflink not supported by test filesystem type: xfs
> > > generic/525      6s
> > > generic/269      89s
> > > generic/1206    [not run] xfs_io swapext -v vfs -s 64k -l 64k ioctl support is missing
> > > xfs/504  198s
> > > xfs/192 [not run] Reflink not supported by scratch filesystem type: xfs
> > > xfs/303  1s
> > > generic/346      6s
> > > generic/512      5s
> > > xfs/227  308s
> > > generic/147     [not run] Reflink not supported by test filesystem type: xfs
> > > generic/230     [not run] Quotas not supported on realtime test device
> > > generic/008      4s
> > > generic/108      4s
> > > xfs/264  12s
> > > generic/200     [not run] Reflink not supported by scratch filesystem type: xfs
> > > generic/493     [not run] Dedupe not supported by scratch filesystem type: xfs
> > > xfs/021  5s
> > > generic/672     [not run] Reflink not supported by scratch filesystem type: xfs
> > > xfs/493  5s
> > > xfs/146  13s
> > > xfs/315 [not run] Reflink not supported by scratch filesystem type: xfs
> > > generic/068     
> > > 
> > > And the dmesg output:
> > > 
> > > run fstests generic/068 at 2022-04-17 16:57:16
> > > XFS (sda4): Mounting V5 Filesystem
> > > XFS (sda4): Ending clean mount
> > > page:ffffea0004a39c40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x128e71
> > > flags: 0x17ff80000000000(node=0|zone=2|lastcpupid=0xfff)
> > > raw: 017ff80000000000 0000000000000000 ffffffff00000203 0000000000000000
> > > raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> > > page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
> > > ------------[ cut here ]------------
> > > kernel BUG at include/linux/mm.h:1164!
> > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > CPU: 3 PID: 1094085 Comm: 3:0 Tainted: G        W         5.18.0-rc3-djwx #rc3 0a707744ee7c555d54e50726c5b02515710a6aae
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > RIP: 0010:folio_end_writeback+0xd0/0x110
> > > Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> > > RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> > > RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> > > RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> > > RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> > > R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> > > R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> > > FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> > > Call Trace:
> > >  <TASK>
> > >  iomap_finish_ioend+0x1ee/0x6a0
> > >  iomap_finish_ioends+0x69/0x100
> > >  xfs_end_ioend+0x5a/0x160 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
> > >  xfs_end_io+0xb1/0xf0 [xfs e8251de1111d7958449fd159d84af12a2afc12f2]
> > >  process_one_work+0x1df/0x3c0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  worker_thread+0x53/0x3b0
> > >  ? rescuer_thread+0x3b0/0x3b0
> > >  kthread+0xea/0x110
> > >  ? kthread_complete_and_exit+0x20/0x20
> > >  ret_from_fork+0x1f/0x30
> > >  </TASK>
> > > Modules linked in: xfs dm_zero btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_delay dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > ---[ end trace 0000000000000000 ]---
> > > RIP: 0010:folio_end_writeback+0xd0/0x110
> > > Code: 80 60 02 fb 48 89 ef e8 5e 6d 01 00 8b 45 34 83 c0 7f 83 f8 7f 0f 87 6a ff ff ff 48 c7 c6 40 c7 e2 81 48 89 ef e8 30 69 04 00 <0f> 0b 48 89 ee e8 b6 51 02 00 eb 9a 48 c7 c6 c0 ad e5 81 48 89 ef
> > > RSP: 0018:ffffc900084f3d48 EFLAGS: 00010246
> > > RAX: 000000000000005c RBX: 0000000000001000 RCX: 0000000000000000
> > > RDX: 0000000000000000 RSI: ffffffff81e56da3 RDI: 00000000ffffffff
> > > RBP: ffffea0004a39c40 R08: 0000000000000000 R09: ffffffff8205fe40
> > > R10: 0000000000017578 R11: 00000000000175f0 R12: 0000000000004000
> > > R13: ffff88814dc5cd40 R14: 000000000000002e R15: ffffea0004a39c40
> > > FS:  0000000000000000(0000) GS:ffff88843fd80000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f2b0ea47010 CR3: 000000043f00c000 CR4: 00000000001506a0
> > 
> > Hmm.  I think you might be on to something about this being some
> > mis-interaction when multi-page folios get truncated or possibly just
> > split?  The VM_BUG_ON points to pfn 0x206a79, which I /think/ is the
> > second page of the four-page folio starting with pfn=0x206a78?
> > 
> 
> The above is the variant of generic/068 failure I was reproducing and
> used to bisect [1]. With some additional tracing added to ioend
> completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> iteration basically seems to go off the rails. What happens more
> specifically is that at some point during the loop, bio_next_folio()
> actually lands into the second page of the just processed folio instead
> of the actual next folio (i.e. as if it's walking to the next page from
> the head page of the folio instead of to the next 16k folio). I suspect
> completion is racing with some form of truncation/reclaim/invalidation
> here, what exactly I don't know, that perhaps breaks down the folio and
> renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> that theory, I open coded and modified the loop to something like the
> following:
> 
>                 for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
>                         f = fi.folio;
>                         l = fi.length;
>                         bio_next_folio(&fi, bio);
>                         iomap_finish_folio_write(inode, f, l, error);
>                         folio_count++;
>                 }
> 
> ... to avoid accessing folio metadata after writeback is cleared on it
> and this seems to make the problem disappear (so far, I'll need to let
> this spin for a while longer to be completely confident in that).

Hmm.  I did the same, and fstests started passing again!  Excellent tip!
:)

Given the ftrace output I captured earlier, I agree that it looks like
it's something related to truncation breaking down multipage folios
while they're still undergoing writeback.
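
In case it helps anyone else chasing this, here's roughly what that
open-coded loop looks like written out as a self-contained helper.  The
function name is invented and the declarations are filled in here purely
for illustration (it assumes it sits in fs/iomap/buffered-io.c next to
iomap_finish_folio_write()); the only point is the ordering, i.e. the
iterator is advanced while the current folio still has PG_writeback set,
so truncate cannot yet split or free it:

/* Sketch only, not the actual iomap code. */
static void iomap_finish_ioend_folios_sketch(struct inode *inode,
		struct bio *bio, int error)
{
	struct folio_iter fi;

	bio_first_folio(&fi, bio, 0);
	while (fi.folio) {
		struct folio *folio = fi.folio;
		size_t len = fi.length;

		/* Step past this folio while writeback still blocks truncate. */
		bio_next_folio(&fi, bio);

		/* Only now end writeback; the folio may be split or freed at
		 * any point after this, so it must not be touched again. */
		iomap_finish_folio_write(inode, folio, len, error);
	}
}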

--D

> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/YmlOfJljvI49sZyW@bfoster/
> 
> >     <...>-5263      3..... 276242022us : page_ref_mod: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
> >     <...>-5263      3..... 276242023us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >     <...>-5263      3..... 276242023us : page_ref_mod: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
> >     <...>-5263      3..... 276242024us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >     <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
> >    fstest-12612     2..... 276242082us : page_ref_mod_unless: pfn=0x206a70 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a74 flags=referenced|uptodate|lru|active|private|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242084us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242084us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a70 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
> >    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a74 flags=locked|referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
> >    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
> >    fstest-12612     2..... 276242110us : page_ref_mod_and_test: pfn=0x206a70 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
> >    fstest-12612     2..... 276242111us : page_ref_mod_and_test: pfn=0x206a74 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
> >    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242112us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=5 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242119us : page_ref_mod_unless: pfn=0x206a7c flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >     <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
> >     <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79
> > 
> > In which case, that output reduces to:
> > 
> >     <...>-5263      3..... 276242025us : page_ref_mod: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|writeback|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1
> >    fstest-12612     2..... 276242083us : page_ref_mod_unless: pfn=0x206a78 flags=referenced|uptodate|lru|active|private|head count=7 mapcount=0 mapping=ffff888125e5d868 mt=1 val=1 ret=1
> >    fstest-12612     2..... 276242094us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=6 mapcount=0 mapping=ffff888125e5d868 mt=1 val=-1 ret=0
> >    fstest-12612     2..... 276242102us : page_ref_mod_and_test: pfn=0x206a78 flags=locked|referenced|uptodate|lru|active|head count=2 mapcount=0 mapping=0000000000000000 mt=1 val=-4 ret=0
> >    fstest-12612     2..... 276242112us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=1 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=0
> >     <...>-5263      3..... 276242194us : page_ref_mod_and_test: pfn=0x206a78 flags=referenced|uptodate|lru|active|head count=0 mapcount=0 mapping=0000000000000000 mt=1 val=-1 ret=1
> >     <...>-5263      3d..1. 276242208us : console: [  309.491317] page:ffffea00081a9e40 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x206a79
> > 
> > Clearly the mapping gets torn off of this folio before time index
> > 276242102us while it still has refcount==6.  It's curious that it lost
> > the writeback bit at the first line.
> > 
> > Unfortunately, the VM_BUG_ON means that the kernel dumps the ftrace
> > buffer and crashes, so I can't actually do anything with the incomplete
> > trace-cmd files that were recorded earlier.
> > 
> > --D
> > 
> > > --D
> > > 
> > > > 
> > > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > run fstests generic/068 at 2022-04-12 17:57:11
> > > > > XFS (sda3): Mounting V5 Filesystem
> > > > > XFS (sda3): Ending clean mount
> > > > > XFS (sda4): Mounting V5 Filesystem
> > > > > XFS (sda4): Ending clean mount
> > > > > ------------[ cut here ]------------
> > > > > kernel BUG at mm/filemap.c:1653!
> > > > > invalid opcode: 0000 [#1] PREEMPT SMP
> > > > > CPU: 0 PID: 1349866 Comm: 0:116 Tainted: G        W         5.18.0-rc2-djwx #rc2 19cc48221d47ada6c8e5859639b6a0946c9a3777
> > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > > > > Workqueue: xfs-conv/sda4 xfs_end_io [xfs]
> > > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f5b067d0000 CR3: 000000010d1bb000 CR4: 00000000001506b0
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  iomap_finish_ioend+0x19e/0x560
> > > > >  iomap_finish_ioends+0x69/0x100
> > > > >  xfs_end_ioend+0x5a/0x160 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > > >  xfs_end_io+0xb1/0xf0 [xfs 513857e2ae300a835ce1fbd8065a84dc5382e649]
> > > > >  process_one_work+0x1df/0x3c0
> > > > >  ? rescuer_thread+0x3b0/0x3b0
> > > > >  worker_thread+0x53/0x3b0
> > > > >  ? rescuer_thread+0x3b0/0x3b0
> > > > >  kthread+0xea/0x110
> > > > >  ? kthread_complete_and_exit+0x20/0x20
> > > > >  ret_from_fork+0x1f/0x30
> > > > >  </TASK>
> > > > > Modules linked in: dm_snapshot dm_bufio dm_zero dm_flakey xfs libcrc32c xt_REDIRECT iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 auth_rpcgss oid_registry xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
> > > > > Dumping ftrace buffer:
> > > > >    (ftrace buffer empty)
> > > > > ---[ end trace 0000000000000000 ]---
> > > > > RIP: 0010:folio_end_writeback+0x79/0x80
> > > > > Code: d2 75 1d f0 ff 4d 34 74 0e 5d c3 f0 80 67 02 fb e8 ac 29 01 00 eb ad 48 89 ef 5d e9 a1 0f 01 00 48 89 ee e8 b9 e8 01 00 eb d9 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 53 48 89 fb e8 62 f7 ff ff 48
> > > > > RSP: 0018:ffffc9000286fd50 EFLAGS: 00010246
> > > > > RAX: 0000000000000000 RBX: ffffea0007376840 RCX: 000000000000000c
> > > > > RDX: ffff88810d2de000 RSI: ffffffff81e55f0b RDI: ffff88810d2de000
> > > > > RBP: ffffea0007376840 R08: ffffea000b82c308 R09: ffffea000b82c308
> > > > > R10: 0000000000000001 R11: 000000000000000c R12: 0000000000000000
> > > > > R13: 000000000000c000 R14: 0000000000000005 R15: 0000000000000001
> > > > > FS:  0000000000000000(0000) GS:ffff88843fc00000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f4b94008278 CR3: 0000000101ac9000 CR4: 00000000001506b0
> > > > > 
> > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-28 15:53       ` Brian Foster
  2022-04-30  3:10         ` Darrick J. Wong
@ 2022-04-30  3:44         ` Matthew Wilcox
  2022-04-30 21:40           ` Matthew Wilcox
  2022-05-02 12:18           ` Brian Foster
  1 sibling, 2 replies; 22+ messages in thread
From: Matthew Wilcox @ 2022-04-30  3:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs, fstests, linux-fsdevel

On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> The above is the variant of generic/068 failure I was reproducing and
> used to bisect [1]. With some additional tracing added to ioend
> completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> iteration basically seems to go off the rails. What happens more
> specifically is that at some point during the loop, bio_next_folio()
> actually lands into the second page of the just processed folio instead
> of the actual next folio (i.e. as if it's walking to the next page from
> the head page of the folio instead of to the next 16k folio). I suspect
> completion is racing with some form of truncation/reclaim/invalidation
> here, what exactly I don't know, that perhaps breaks down the folio and
> renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> that theory, I open coded and modified the loop to something like the
> following:
> 
>                 for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
>                         f = fi.folio;
>                         l = fi.length;
>                         bio_next_folio(&fi, bio);
>                         iomap_finish_folio_write(inode, f, l, error);
>                         folio_count++;
>                 }
> 
> ... to avoid accessing folio metadata after writeback is cleared on it
> and this seems to make the problem disappear (so far, I'll need to let
> this spin for a while longer to be completely confident in that).

_Oh_.

It's not even a terribly weird race, then.  It's just this:

CPU 0				CPU 1
				truncate_inode_partial_folio()
				folio_wait_writeback();
bio_next_folio(&fi, bio)
iomap_finish_folio_write(fi.folio)
folio_end_writeback(folio)
				split_huge_page()
bio_next_folio()
... oops, now we only walked forward one page instead of the entire folio.

So ... I think we can fix this with:

+++ b/include/linux/bio.h
@@ -290,7 +290,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 {
        fi->_seg_count -= fi->length;
        if (fi->_seg_count) {
-               fi->folio = folio_next(fi->folio);
+               fi->folio = (struct folio *)folio_page(fi->folio,
+                               (fi->offset + fi->length) / PAGE_SIZE);
                fi->offset = 0;
                fi->length = min(folio_size(fi->folio), fi->_seg_count);
        } else if (fi->_i + 1 < bio->bi_vcnt) {

(I do not love this, have not even compiled it; it's late.  We may be
better off just storing next_folio inside the folio_iter).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-30  3:44         ` Matthew Wilcox
@ 2022-04-30 21:40           ` Matthew Wilcox
  2022-05-02 12:20             ` Brian Foster
  2022-05-02 12:18           ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2022-04-30 21:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs, fstests, linux-fsdevel

On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> (I do not love this, have not even compiled it; it's late.  We may be
> better off just storing next_folio inside the folio_iter).

Does anyone have a preference for fixing this between Option A:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 49eff01fb829..55e2499beff6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -269,6 +269,7 @@ struct folio_iter {
        size_t offset;
        size_t length;
        /* private: for use by the iterator */
+       struct folio *_next;
        size_t _seg_count;
        int _i;
 };
@@ -280,19 +281,23 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,

        fi->folio = page_folio(bvec->bv_page);
        fi->offset = bvec->bv_offset +
-                       PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+                       PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
        fi->_seg_count = bvec->bv_len;
        fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
        fi->_i = i;
+       if (fi->_seg_count > fi->length)
+               fi->_next = folio_next(fi->folio);
 }

 static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 {
        fi->_seg_count -= fi->length;
        if (fi->_seg_count) {
-               fi->folio = folio_next(fi->folio);
+               fi->folio = fi->_next;
                fi->offset = 0;
                fi->length = min(folio_size(fi->folio), fi->_seg_count);
+               if (fi->_seg_count > fi->length)
+                       fi->_next = folio_next(fi->folio);
        } else if (fi->_i + 1 < bio->bi_vcnt) {
                bio_first_folio(fi, bio, fi->_i + 1);
        } else {


and Option B:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 49eff01fb829..554f5fce060c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -290,7 +290,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 {
        fi->_seg_count -= fi->length;
        if (fi->_seg_count) {
-               fi->folio = folio_next(fi->folio);
+               fi->folio = __folio_next(fi->folio,
+                               (fi->offset + fi->length) / PAGE_SIZE);
                fi->offset = 0;
                fi->length = min(folio_size(fi->folio), fi->_seg_count);
        } else if (fi->_i + 1 < bio->bi_vcnt) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index de32c0383387..9c5547af8d0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1642,6 +1642,12 @@ static inline long folio_nr_pages(struct folio *folio)
        return compound_nr(&folio->page);
 }

+static inline struct folio *__folio_next(struct folio *folio,
+               unsigned long nr_pages)
+{
+       return (struct folio *)folio_page(folio, nr_pages);
+}
+
 /**
  * folio_next - Move to the next physical folio.
  * @folio: The folio we're currently operating on.
@@ -1658,7 +1664,7 @@ static inline long folio_nr_pages(struct folio *folio)
  */
 static inline struct folio *folio_next(struct folio *folio)
 {
-       return (struct folio *)folio_page(folio, folio_nr_pages(folio));
+       return __folio_next(folio, folio_nr_pages(folio));
 }

 /**


Currently running Option A through its paces.
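
For reference, the consumer side needs no change with either option; the
look-ahead stays hidden inside the iterator.  A minimal sketch of a
caller, assuming the current bio_for_each_folio_all() definition
(bio_first_folio / fi.folio / bio_next_folio) and with an invented helper
name standing in for the loop in iomap_finish_ioend():

static void iomap_complete_bio_folios_sketch(struct inode *inode,
		struct bio *bio, int error)
{
	struct folio_iter fi;

	/*
	 * With the look-ahead stored in the iterator, the pointer to the
	 * following folio is computed while the current one is still under
	 * writeback, so ending writeback in this loop body is safe even if
	 * truncate then splits or frees the folio.
	 */
	bio_for_each_folio_all(fi, bio)
		iomap_finish_folio_write(inode, fi.folio, fi.length, error);
}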

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-30  3:44         ` Matthew Wilcox
  2022-04-30 21:40           ` Matthew Wilcox
@ 2022-05-02 12:18           ` Brian Foster
  2022-05-02 13:00             ` Matthew Wilcox
  1 sibling, 1 reply; 22+ messages in thread
From: Brian Foster @ 2022-05-02 12:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, xfs, fstests, linux-fsdevel

On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> > The above is the variant of generic/068 failure I was reproducing and
> > used to bisect [1]. With some additional tracing added to ioend
> > completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> > iteration basically seems to go off the rails. What happens more
> > specifically is that at some point during the loop, bio_next_folio()
> > actually lands into the second page of the just processed folio instead
> > of the actual next folio (i.e. as if it's walking to the next page from
> > the head page of the folio instead of to the next 16k folio). I suspect
> > completion is racing with some form of truncation/reclaim/invalidation
> > here, what exactly I don't know, that perhaps breaks down the folio and
> > renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> > that theory, I open coded and modified the loop to something like the
> > following:
> > 
> >                 for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
> >                         f = fi.folio;
> >                         l = fi.length;
> >                         bio_next_folio(&fi, bio);
> >                         iomap_finish_folio_write(inode, f, l, error);
> >                         folio_count++;
> >                 }
> > 
> > ... to avoid accessing folio metadata after writeback is cleared on it
> > and this seems to make the problem disappear (so far, I'll need to let
> > this spin for a while longer to be completely confident in that).
> 
> _Oh_.
> 
> It's not even a terribly weird race, then.  It's just this:
> 
> CPU 0				CPU 1
> 				truncate_inode_partial_folio()
> 				folio_wait_writeback();
> bio_next_folio(&fi, bio)
> iomap_finish_folio_write(fi.folio)
> folio_end_writeback(folio)
> 				split_huge_page()
> bio_next_folio()
> ... oops, now we only walked forward one page instead of the entire folio.
> 

Yep, though once I noticed and turned on the mm_page_free tracepoint, it
looked like it was actually the I/O completion path breaking down the
compound folio:

   kworker/10:1-440     [010] .....   355.369899: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 192 fi (00000000dc8c03bd 0 16384 32768 27)
   ...
    kworker/10:1-440     [010] .....   355.369905: mm_page_free: page=00000000dc8c03bd pfn=0x182190 order=2
    kworker/10:1-440     [010] .....   355.369907: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 1 fi (00000000f8b5d9b3 0 4096 16384 27)

I take that to mean the truncate path executes while the completion side
holds a reference; folio_end_writeback() then ends up dropping the last
reference, falls into the free/split path, and the iteration breaks from
there.  Same idea either way, I think.
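
To map those three trace points onto the code: this is the pre-fix shape
of the iteration, expanded by hand from bio_for_each_folio_all() purely
for illustration (not the verbatim iomap code, and the helper name is
made up):

/* Sketch only: pre-fix completion loop, annotated with the trace above. */
static void iomap_finish_ioend_folios_before_fix(struct inode *inode,
		struct bio *bio, int error)
{
	struct folio_iter fi;

	bio_first_folio(&fi, bio, 0);
	while (fi.folio) {
		/* 355.369899: the debug trace above logs the 16k folio here
		 * (offset 0, length 16384) */
		iomap_finish_folio_write(inode, fi.folio, fi.length, error);

		/* 355.369905: folio_end_writeback() inside the call above drops
		 * the last reference, so mm_page_free fires for the order-2
		 * compound page */
		bio_next_folio(&fi, bio);

		/* 355.369907: folio_next() read stale compound metadata, so the
		 * "next" folio is really just the next page of the folio that
		 * was freed (length 4096 instead of the next 16k folio) */
	}
}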

Brian

> So ... I think we can fix this with:
> 
> +++ b/include/linux/bio.h
> @@ -290,7 +290,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  {
>         fi->_seg_count -= fi->length;
>         if (fi->_seg_count) {
> -               fi->folio = folio_next(fi->folio);
> +               fi->folio = (struct folio *)folio_page(fi->folio,
> +                               (fi->offset + fi->length) / PAGE_SIZE);
>                 fi->offset = 0;
>                 fi->length = min(folio_size(fi->folio), fi->_seg_count);
>         } else if (fi->_i + 1 < bio->bi_vcnt) {
> 
> (I do not love this, have not even compiled it; it's late.  We may be
> better off just storing next_folio inside the folio_iter).
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-04-30 21:40           ` Matthew Wilcox
@ 2022-05-02 12:20             ` Brian Foster
  2022-05-03  3:25               ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2022-05-02 12:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, xfs, fstests, linux-fsdevel

On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > (I do not love this, have not even compiled it; it's late.  We may be
> > better off just storing next_folio inside the folio_iter).
> 
> Does anyone have a preference for fixing this between Option A:
> 

After seeing the trace in my previous mail and several thousand
successful iterations of the test hack, I had reworked it into this
(which survived weekend testing until it ran into some other XFS problem
that looks unrelated):

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 278cc81cc1e7..aa820e09978e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -269,6 +269,7 @@ struct folio_iter {
 	size_t offset;
 	size_t length;
 	/* private: for use by the iterator */
+	struct folio *_next;
 	size_t _seg_count;
 	int _i;
 };
@@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
 	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
 
 	fi->folio = page_folio(bvec->bv_page);
+	fi->_next = folio_next(fi->folio);
 	fi->offset = bvec->bv_offset +
 			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
 	fi->_seg_count = bvec->bv_len;
@@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 {
 	fi->_seg_count -= fi->length;
 	if (fi->_seg_count) {
-		fi->folio = folio_next(fi->folio);
+		fi->folio = fi->_next;
+		fi->_next = folio_next(fi->folio);
 		fi->offset = 0;
 		fi->length = min(folio_size(fi->folio), fi->_seg_count);
 	} else if (fi->_i + 1 < bio->bi_vcnt) {
 		bio_first_folio(fi, bio, fi->_i + 1);
 	} else {
 		fi->folio = NULL;
+		fi->_next = NULL;
 	}
 }

So FWIW, that is just to say that I find option A to be cleaner and more
readable.

Brian

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 49eff01fb829..55e2499beff6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -269,6 +269,7 @@ struct folio_iter {
>         size_t offset;
>         size_t length;
>         /* private: for use by the iterator */
> +       struct folio *_next;
>         size_t _seg_count;
>         int _i;
>  };
> @@ -280,19 +281,23 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> 
>         fi->folio = page_folio(bvec->bv_page);
>         fi->offset = bvec->bv_offset +
> -                       PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> +                       PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
>         fi->_seg_count = bvec->bv_len;
>         fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
>         fi->_i = i;
> +       if (fi->_seg_count > fi->length)
> +               fi->_next = folio_next(fi->folio);
>  }
> 
>  static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  {
>         fi->_seg_count -= fi->length;
>         if (fi->_seg_count) {
> -               fi->folio = folio_next(fi->folio);
> +               fi->folio = fi->_next;
>                 fi->offset = 0;
>                 fi->length = min(folio_size(fi->folio), fi->_seg_count);
> +               if (fi->_seg_count > fi->length)
> +                       fi->_next = folio_next(fi->folio);
>         } else if (fi->_i + 1 < bio->bi_vcnt) {
>                 bio_first_folio(fi, bio, fi->_i + 1);
>         } else {
> 
> 
> and Option B:
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 49eff01fb829..554f5fce060c 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -290,7 +290,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  {
>         fi->_seg_count -= fi->length;
>         if (fi->_seg_count) {
> -               fi->folio = folio_next(fi->folio);
> +               fi->folio = __folio_next(fi->folio,
> +                               (fi->offset + fi->length) / PAGE_SIZE);
>                 fi->offset = 0;
>                 fi->length = min(folio_size(fi->folio), fi->_seg_count);
>         } else if (fi->_i + 1 < bio->bi_vcnt) {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index de32c0383387..9c5547af8d0e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1642,6 +1642,12 @@ static inline long folio_nr_pages(struct folio *folio)
>         return compound_nr(&folio->page);
>  }
> 
> +static inline struct folio *__folio_next(struct folio *folio,
> +               unsigned long nr_pages)
> +{
> +       return (struct folio *)folio_page(folio, nr_pages);
> +}
> +
>  /**
>   * folio_next - Move to the next physical folio.
>   * @folio: The folio we're currently operating on.
> @@ -1658,7 +1664,7 @@ static inline long folio_nr_pages(struct folio *folio)
>   */
>  static inline struct folio *folio_next(struct folio *folio)
>  {
> -       return (struct folio *)folio_page(folio, folio_nr_pages(folio));
> +       return __folio_next(folio, folio_nr_pages(folio));
>  }
> 
>  /**
> 
> 
> Currently running Option A through its paces.
> 


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-05-02 12:18           ` Brian Foster
@ 2022-05-02 13:00             ` Matthew Wilcox
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2022-05-02 13:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs, fstests, linux-fsdevel

On Mon, May 02, 2022 at 08:18:24AM -0400, Brian Foster wrote:
> On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> > > The above is the variant of generic/068 failure I was reproducing and
> > > used to bisect [1]. With some additional tracing added to ioend
> > > completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> > > iteration basically seems to go off the rails. What happens more
> > > specifically is that at some point during the loop, bio_next_folio()
> > > actually lands into the second page of the just processed folio instead
> > > of the actual next folio (i.e. as if it's walking to the next page from
> > > the head page of the folio instead of to the next 16k folio). I suspect
> > > completion is racing with some form of truncation/reclaim/invalidation
> > > here, what exactly I don't know, that perhaps breaks down the folio and
> > > renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> > > that theory, I open coded and modified the loop to something like the
> > > following:
> > > 
> > >                 for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
> > >                         f = fi.folio;
> > >                         l = fi.length;
> > >                         bio_next_folio(&fi, bio);
> > >                         iomap_finish_folio_write(inode, f, l, error);
> > >                         folio_count++;
> > >                 }
> > > 
> > > ... to avoid accessing folio metadata after writeback is cleared on it
> > > and this seems to make the problem disappear (so far, I'll need to let
> > > this spin for a while longer to be completely confident in that).
> > 
> > _Oh_.
> > 
> > It's not even a terribly weird race, then.  It's just this:
> > 
> > CPU 0				CPU 1
> > 				truncate_inode_partial_folio()
> > 				folio_wait_writeback();
> > bio_next_folio(&fi, bio)
> > iomap_finish_folio_write(fi.folio)
> > folio_end_writeback(folio)
> > 				split_huge_page()
> > bio_next_folio()
> > ... oops, now we only walked forward one page instead of the entire folio.
> > 
> 
> Yep, though once I noticed and turned on the mm_page_free tracepoint, it
> looked like it was actually the I/O completion path breaking down the
> compound folio:
> 
>    kworker/10:1-440     [010] .....   355.369899: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 192 fi (00000000dc8c03bd 0 16384 32768 27)
>    ...
>     kworker/10:1-440     [010] .....   355.369905: mm_page_free: page=00000000dc8c03bd pfn=0x182190 order=2
>     kworker/10:1-440     [010] .....   355.369907: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 1 fi (00000000f8b5d9b3 0 4096 16384 27)
> 
> I take that to mean the truncate path executes while the completion side
> holds a reference; folio_end_writeback() then ends up dropping the last
> reference, falls into the free/split path, and the iteration breaks from
> there.  Same idea either way, I think.

Absolutely.  That's probably the more common path anyway; we truncate
an entire folio instead of a partial one, so it could be:

truncate_inode_partial_folio():
        folio_wait_writeback(folio);
        if (length == folio_size(folio)) {
                truncate_inode_folio(folio->mapping, folio);

or basically the same code in truncate_inode_pages_range()
or invalidate_inode_pages2_range().


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-05-02 12:20             ` Brian Foster
@ 2022-05-03  3:25               ` Darrick J. Wong
  2022-05-03  4:31                 ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2022-05-03  3:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: Matthew Wilcox, xfs, fstests, linux-fsdevel

On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > (I do not love this, have not even compiled it; it's late.  We may be
> > > better off just storing next_folio inside the folio_iter).
> > 
> > Does anyone have a preference for fixing this between Option A:
> > 
> 
> After seeing the trace in my previous mail and several thousand
> successful iterations of the test hack, I had reworked it into this
> (which survived weekend testing until it ran into some other XFS problem
> that looks unrelated):
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 278cc81cc1e7..aa820e09978e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -269,6 +269,7 @@ struct folio_iter {
>  	size_t offset;
>  	size_t length;
>  	/* private: for use by the iterator */
> +	struct folio *_next;
>  	size_t _seg_count;
>  	int _i;
>  };
> @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
>  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
>  
>  	fi->folio = page_folio(bvec->bv_page);
> +	fi->_next = folio_next(fi->folio);
>  	fi->offset = bvec->bv_offset +
>  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
>  	fi->_seg_count = bvec->bv_len;
> @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  {
>  	fi->_seg_count -= fi->length;
>  	if (fi->_seg_count) {
> -		fi->folio = folio_next(fi->folio);
> +		fi->folio = fi->_next;
> +		fi->_next = folio_next(fi->folio);
>  		fi->offset = 0;
>  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
>  	} else if (fi->_i + 1 < bio->bi_vcnt) {
>  		bio_first_folio(fi, bio, fi->_i + 1);
>  	} else {
>  		fi->folio = NULL;
> +		fi->_next = NULL;
>  	}
>  }
> 
> So FWIW, that is just to say that I find option A to be cleaner and more
> readable.

Me too.  I'll queue up the usual nightly tests with that patch added and
we'll see how that does.

--D

> Brian
> 
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 49eff01fb829..55e2499beff6 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -269,6 +269,7 @@ struct folio_iter {
> >         size_t offset;
> >         size_t length;
> >         /* private: for use by the iterator */
> > +       struct folio *_next;
> >         size_t _seg_count;
> >         int _i;
> >  };
> > @@ -280,19 +281,23 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > 
> >         fi->folio = page_folio(bvec->bv_page);
> >         fi->offset = bvec->bv_offset +
> > -                       PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > +                       PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
> >         fi->_seg_count = bvec->bv_len;
> >         fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
> >         fi->_i = i;
> > +       if (fi->_seg_count > fi->length)
> > +               fi->_next = folio_next(fi->folio);
> >  }
> > 
> >  static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> >  {
> >         fi->_seg_count -= fi->length;
> >         if (fi->_seg_count) {
> > -               fi->folio = folio_next(fi->folio);
> > +               fi->folio = fi->_next;
> >                 fi->offset = 0;
> >                 fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > +               if (fi->_seg_count > fi->length)
> > +                       fi->_next = folio_next(fi->folio);
> >         } else if (fi->_i + 1 < bio->bi_vcnt) {
> >                 bio_first_folio(fi, bio, fi->_i + 1);
> >         } else {
> > 
> > 
> > and Option B:
> > 
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 49eff01fb829..554f5fce060c 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -290,7 +290,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> >  {
> >         fi->_seg_count -= fi->length;
> >         if (fi->_seg_count) {
> > -               fi->folio = folio_next(fi->folio);
> > +               fi->folio = __folio_next(fi->folio,
> > +                               (fi->offset + fi->length) / PAGE_SIZE);
> >                 fi->offset = 0;
> >                 fi->length = min(folio_size(fi->folio), fi->_seg_count);
> >         } else if (fi->_i + 1 < bio->bi_vcnt) {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index de32c0383387..9c5547af8d0e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1642,6 +1642,12 @@ static inline long folio_nr_pages(struct folio *folio)
> >         return compound_nr(&folio->page);
> >  }
> > 
> > +static inline struct folio *__folio_next(struct folio *folio,
> > +               unsigned long nr_pages)
> > +{
> > +       return (struct folio *)folio_page(folio, nr_pages);
> > +}
> > +
> >  /**
> >   * folio_next - Move to the next physical folio.
> >   * @folio: The folio we're currently operating on.
> > @@ -1658,7 +1664,7 @@ static inline long folio_nr_pages(struct folio *folio)
> >   */
> >  static inline struct folio *folio_next(struct folio *folio)
> >  {
> > -       return (struct folio *)folio_page(folio, folio_nr_pages(folio));
> > +       return __folio_next(folio, folio_nr_pages(folio));
> >  }
> > 
> >  /**
> > 
> > 
> > Currently running Option A through its paces.
> > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-05-03  3:25               ` Darrick J. Wong
@ 2022-05-03  4:31                 ` Matthew Wilcox
  2022-05-03 17:25                   ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2022-05-03  4:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > better off just storing next_folio inside the folio_iter).
> > > 
> > > Does anyone have a preference for fixing this between Option A:
> > > 
> > 
> > After seeing the trace in my previous mail and several thousand
> > successful iterations of the test hack, I had reworked it into this
> > (which survived weekend testing until it ran into some other XFS problem
> > that looks unrelated):
> > 
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 278cc81cc1e7..aa820e09978e 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -269,6 +269,7 @@ struct folio_iter {
> >  	size_t offset;
> >  	size_t length;
> >  	/* private: for use by the iterator */
> > +	struct folio *_next;
> >  	size_t _seg_count;
> >  	int _i;
> >  };
> > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> >  
> >  	fi->folio = page_folio(bvec->bv_page);
> > +	fi->_next = folio_next(fi->folio);
> >  	fi->offset = bvec->bv_offset +
> >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> >  	fi->_seg_count = bvec->bv_len;
> > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> >  {
> >  	fi->_seg_count -= fi->length;
> >  	if (fi->_seg_count) {
> > -		fi->folio = folio_next(fi->folio);
> > +		fi->folio = fi->_next;
> > +		fi->_next = folio_next(fi->folio);
> >  		fi->offset = 0;
> >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> >  		bio_first_folio(fi, bio, fi->_i + 1);
> >  	} else {
> >  		fi->folio = NULL;
> > +		fi->_next = NULL;
> >  	}
> >  }
> > 
> > So FWIW, that is just to say that I find option A to be cleaner and more
> > readable.
> 
> Me too.  I'll queue up the usual nightly tests with that patch added and
> we'll see how that does.

I've just pushed essentially that patch to my for-next tree in case
anybody does any testing with that.  I'll give it a couple of days
before creating a folio-5.18f tag and asking Linus to pull the first two
commits on

git://git.infradead.org/users/willy/pagecache.git for-next

That is, commits

1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
095099da208b ("mm/readahead: Fix readahead with large folios")

(more than happy to update anything about those patches)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: generic/068 crash on 5.18-rc2?
  2022-05-03  4:31                 ` Matthew Wilcox
@ 2022-05-03 17:25                   ` Darrick J. Wong
  2022-05-05  2:40                     ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2022-05-03 17:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Tue, May 03, 2022 at 05:31:01AM +0100, Matthew Wilcox wrote:
> On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> > On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > > better off just storing next_folio inside the folio_iter).
> > > > 
> > > > Does anyone have a preference for fixing this between Option A:
> > > > 
> > > 
> > > After seeing the trace in my previous mail and several thousand
> > > successful iterations of the test hack, I had reworked it into this
> > > (which survived weekend testing until it ran into some other XFS problem
> > > that looks unrelated):
> > > 
> > > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > > index 278cc81cc1e7..aa820e09978e 100644
> > > --- a/include/linux/bio.h
> > > +++ b/include/linux/bio.h
> > > @@ -269,6 +269,7 @@ struct folio_iter {
> > >  	size_t offset;
> > >  	size_t length;
> > >  	/* private: for use by the iterator */
> > > +	struct folio *_next;
> > >  	size_t _seg_count;
> > >  	int _i;
> > >  };
> > > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> > >  
> > >  	fi->folio = page_folio(bvec->bv_page);
> > > +	fi->_next = folio_next(fi->folio);
> > >  	fi->offset = bvec->bv_offset +
> > >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > >  	fi->_seg_count = bvec->bv_len;
> > > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> > >  {
> > >  	fi->_seg_count -= fi->length;
> > >  	if (fi->_seg_count) {
> > > -		fi->folio = folio_next(fi->folio);
> > > +		fi->folio = fi->_next;
> > > +		fi->_next = folio_next(fi->folio);
> > >  		fi->offset = 0;
> > >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> > >  		bio_first_folio(fi, bio, fi->_i + 1);
> > >  	} else {
> > >  		fi->folio = NULL;
> > > +		fi->_next = NULL;
> > >  	}
> > >  }
> > > 
> > > So FWIW, that is just to say that I find option A to be cleaner and more
> > > readable.
> > 
> > Me too.  I'll queue up the usual nightly tests with that patch added and
> > we'll see how that does.
> 
> I've just pushed essentially that patch to my for-next tree in case
> anybody does any testing with that.  I'll give it a couple of days
> before creating a folio-5.18f tag and asking Linus to pull the first two
> commits on
> 
> git://git.infradead.org/users/willy/pagecache.git for-next
> 
> That is, commits
> 
> 1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
> 095099da208b ("mm/readahead: Fix readahead with large folios")

Hmm.  Well, I added 1a4c97 to my tree last night, and it seems to have
cleared up all but two of the problems I saw with the for-next branch.

generic/388 still fails (40 minutes in) with:

WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);

Which I think is the same problem where the fs goes down, XFS throws an
error back to iomap_do_writepages, and it tries to discard a folio that
already had writeback running on it.
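
A toy model of that failure mode, for illustration only -- the structure and
helpers below are made up, not the iomap ones; it just shows per-folio
writeback accounting being torn down while a previously submitted chunk is
still in flight:

/*
 * Toy model of the ordering problem described above.  "toy_iop" and the
 * helper names are hypothetical, not the iomap structures or functions.
 */
#include <assert.h>
#include <stdatomic.h>

struct toy_iop {
        atomic_int write_bytes_pending;
};

static void submit_chunk(struct toy_iop *iop, int len)
{
        /* mimics bumping the pending count when a bio is queued */
        atomic_fetch_add(&iop->write_bytes_pending, len);
}

static void io_complete(struct toy_iop *iop, int len)
{
        atomic_fetch_sub(&iop->write_bytes_pending, len);
}

static void discard_folio_state(struct toy_iop *iop)
{
        /*
         * Stand-in for the WARN_ON_ONCE quoted above: teardown expects
         * no writeback to be outstanding on this folio.
         */
        assert(atomic_load(&iop->write_bytes_pending) == 0);
}

int main(void)
{
        struct toy_iop iop = { .write_bytes_pending = 0 };

        submit_chunk(&iop, 4096);       /* first block already under writeback */
        /* fs shuts down, mapping the next block fails ... */
        discard_folio_state(&iop);      /* fires: 4096 bytes still pending */
        io_complete(&iop, 4096);        /* completion arrives too late */
        return 0;
}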

There's also the same problem I reported a few days ago in xfs/501
on a 64k-page ARM64 VM where:

run fstests xfs/501 at 2022-05-02 21:17:31
XFS: Assertion failed: IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)), file: fs/xfs/xfs_log_cil.c, line: 430
XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137

But I think that's a new bug that came in with all the log buffer
alignment changes in the 5.19 branch.
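
For context on what those assertions test: IS_ALIGNED(x, a) is, as far as I
recall, a simple power-of-two mask check, so the log code is insisting that
each buffer start on an 8-byte boundary.  A standalone illustration (the
macro below mirrors my reading of the kernel one, it is not copied from the
tree):

#include <stdint.h>
#include <stdio.h>

#define TOY_IS_ALIGNED(x, a)    (((x) & ((a) - 1)) == 0)

int main(void)
{
        uintptr_t aligned = 0x1000;             /* multiple of 8: passes */
        uintptr_t misaligned = 0x1000 + 4;      /* 4-byte offset: would trip the assert */

        printf("%#lx aligned to 8: %d\n", (unsigned long)aligned,
               TOY_IS_ALIGNED(aligned, sizeof(uint64_t)));
        printf("%#lx aligned to 8: %d\n", (unsigned long)misaligned,
               TOY_IS_ALIGNED(misaligned, sizeof(uint64_t)));
        return 0;
}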

Oh.  My tree still had the "disable large folios" patch in it.  I guess
the "successful" results are mostly invalid then.

--D

> (more than happy to update anything about those patches)


* Re: generic/068 crash on 5.18-rc2?
  2022-05-03 17:25                   ` Darrick J. Wong
@ 2022-05-05  2:40                     ` Darrick J. Wong
  2022-05-05  4:18                       ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2022-05-05  2:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Tue, May 03, 2022 at 10:25:32AM -0700, Darrick J. Wong wrote:
> On Tue, May 03, 2022 at 05:31:01AM +0100, Matthew Wilcox wrote:
> > On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> > > On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > > > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > > > better off just storing next_folio inside the folio_iter).
> > > > > 
> > > > > Does anyone have a preference for fixing this between Option A:
> > > > > 
> > > > 
> > > > After seeing the trace in my previous mail and several thousand
> > > > successful iterations of the test hack, I had reworked it into this
> > > > (which survived weekend testing until it ran into some other XFS problem
> > > > that looks unrelated):
> > > > 
> > > > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > > > index 278cc81cc1e7..aa820e09978e 100644
> > > > --- a/include/linux/bio.h
> > > > +++ b/include/linux/bio.h
> > > > @@ -269,6 +269,7 @@ struct folio_iter {
> > > >  	size_t offset;
> > > >  	size_t length;
> > > >  	/* private: for use by the iterator */
> > > > +	struct folio *_next;
> > > >  	size_t _seg_count;
> > > >  	int _i;
> > > >  };
> > > > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > > >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> > > >  
> > > >  	fi->folio = page_folio(bvec->bv_page);
> > > > +	fi->_next = folio_next(fi->folio);
> > > >  	fi->offset = bvec->bv_offset +
> > > >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > > >  	fi->_seg_count = bvec->bv_len;
> > > > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> > > >  {
> > > >  	fi->_seg_count -= fi->length;
> > > >  	if (fi->_seg_count) {
> > > > -		fi->folio = folio_next(fi->folio);
> > > > +		fi->folio = fi->_next;
> > > > +		fi->_next = folio_next(fi->folio);
> > > >  		fi->offset = 0;
> > > >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > > >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> > > >  		bio_first_folio(fi, bio, fi->_i + 1);
> > > >  	} else {
> > > >  		fi->folio = NULL;
> > > > +		fi->_next = NULL;
> > > >  	}
> > > >  }
> > > > 
> > > > So FWIW, that is just to say that I find option A to be cleaner and more
> > > > readable.
> > > 
> > > Me too.  I'll queue up the usual nightly tests with that patch added and
> > > we'll see how that does.
> > 
> > I've just pushed essentially that patch to my for-next tree in case
> > anybody does any testing with that.  I'll give it a couple of days
> > before creating a folio-5.18f tag and asking Linus to pull the first two
> > commits on
> > 
> > git://git.infradead.org/users/willy/pagecache.git for-next
> > 
> > That is, commits
> > 
> > 1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
> > 095099da208b ("mm/readahead: Fix readahead with large folios")
> 
> Hmm.  Well, I added 1a4c97 to my tree last night, and it seems to have
> cleared up all but two of the problems I saw with the for-next branch.
> 
> generic/388 still fails (40 minutes in) with:
> 
> WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);
> 
> Which I think is the same problem where the fs goes down, XFS throws an
> error back to iomap_do_writepages, and it tries to discard a folio that
> already had writeback running on it.
> 
> There's also the same problem I reported a few days ago in xfs/501
> on a 64k-page ARM64 VM where:
> 
> run fstests xfs/501 at 2022-05-02 21:17:31
> XFS: Assertion failed: IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)), file: fs/xfs/xfs_log_cil.c, line: 430
> XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> 
> But I think that's a new bug that came in with all the log buffer
> alignment changes in the 5.19 branch.
> 
> Oh.  My tree still had the "disable large folios" patch in it.  I guess
> the "successful" results are mostly invalid then.

Well... with large folios turned back on and those two patches added to
the branch, *most* of the problems go away.  The generic/388 problem
persists, and last night's run showed that the weird xfs_dquot leak that
I"ve occasionally seen on 5.18 with xfs/43[46] also exists in 5.17.

--D

> --D
> 
> > (more than happy to update anything about those patches)


* Re: generic/068 crash on 5.18-rc2?
  2022-05-05  2:40                     ` Darrick J. Wong
@ 2022-05-05  4:18                       ` Matthew Wilcox
  2022-05-05  4:24                         ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2022-05-05  4:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Wed, May 04, 2022 at 07:40:12PM -0700, Darrick J. Wong wrote:
> On Tue, May 03, 2022 at 10:25:32AM -0700, Darrick J. Wong wrote:
> > On Tue, May 03, 2022 at 05:31:01AM +0100, Matthew Wilcox wrote:
> > > On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> > > > On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > > > > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > > > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > > > > better off just storing next_folio inside the folio_iter).
> > > > > > 
> > > > > > Does anyone have a preference for fixing this between Option A:
> > > > > > 
> > > > > 
> > > > > After seeing the trace in my previous mail and several thousand
> > > > > successful iterations of the test hack, I had reworked it into this
> > > > > (which survived weekend testing until it ran into some other XFS problem
> > > > > that looks unrelated):
> > > > > 
> > > > > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > > > > index 278cc81cc1e7..aa820e09978e 100644
> > > > > --- a/include/linux/bio.h
> > > > > +++ b/include/linux/bio.h
> > > > > @@ -269,6 +269,7 @@ struct folio_iter {
> > > > >  	size_t offset;
> > > > >  	size_t length;
> > > > >  	/* private: for use by the iterator */
> > > > > +	struct folio *_next;
> > > > >  	size_t _seg_count;
> > > > >  	int _i;
> > > > >  };
> > > > > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > > > >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> > > > >  
> > > > >  	fi->folio = page_folio(bvec->bv_page);
> > > > > +	fi->_next = folio_next(fi->folio);
> > > > >  	fi->offset = bvec->bv_offset +
> > > > >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > > > >  	fi->_seg_count = bvec->bv_len;
> > > > > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> > > > >  {
> > > > >  	fi->_seg_count -= fi->length;
> > > > >  	if (fi->_seg_count) {
> > > > > -		fi->folio = folio_next(fi->folio);
> > > > > +		fi->folio = fi->_next;
> > > > > +		fi->_next = folio_next(fi->folio);
> > > > >  		fi->offset = 0;
> > > > >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > > > >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> > > > >  		bio_first_folio(fi, bio, fi->_i + 1);
> > > > >  	} else {
> > > > >  		fi->folio = NULL;
> > > > > +		fi->_next = NULL;
> > > > >  	}
> > > > >  }
> > > > > 
> > > > > So FWIW, that is just to say that I find option A to be cleaner and more
> > > > > readable.
> > > > 
> > > > Me too.  I'll queue up the usual nightly tests with that patch added and
> > > > we'll see how that does.
> > > 
> > > I've just pushed essentially that patch to my for-next tree in case
> > > anybody does any testing with that.  I'll give it a couple of days
> > > before creating a folio-5.18f tag and asking Linus to pull the first two
> > > commits on
> > > 
> > > git://git.infradead.org/users/willy/pagecache.git for-next
> > > 
> > > That is, commits
> > > 
> > > 1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
> > > 095099da208b ("mm/readahead: Fix readahead with large folios")
> > 
> > Hmm.  Well, I added 1a4c97 to my tree last night, and it seems to have
> > cleared up all but two of the problems I saw with the for-next branch.
> > 
> > generic/388 still fails (40 minutes in) with:
> > 
> > WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> > VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);
> > 
> > Which I think is the same problem where the fs goes down, XFS throws an
> > error back to iomap_do_writepages, and it tries to discard a folio that
> > already had writeback running on it.
> > 
> > There's also the same problem I reported a few days ago in xfs/501
> > on a 64k-page ARM64 VM where:
> > 
> > run fstests xfs/501 at 2022-05-02 21:17:31
> > XFS: Assertion failed: IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)), file: fs/xfs/xfs_log_cil.c, line: 430
> > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > 
> > But I think that's a new bug that came in with all the log buffer
> > alignment changes in the 5.19 branch.
> > 
> > Oh.  My tree still had the "disable large folios" patch in it.  I guess
> > the "successful" results are mostly invalid then.
> 
> Well... with large folios turned back on and those two patches added to
> the branch, *most* of the problems go away.  The generic/388 problem
> persists, and last night's run showed that the weird xfs_dquot leak that
> I"ve occasionally seen on 5.18 with xfs/43[46] also exists in 5.17.

OK, so let me just restate what I understand here ... these two patches
cure some, but not all of the currently observed problems with 5.18.
The problems that remain with 5.18 were also present in either 5.17
or in 5.18 with large folios disabled, so at this point we know of no
functional regressions that large folios can be blamed for?

I'll send these patches to Linus tomorrow then, since they fix problems
that have been observed, and I don't think there's any reason to keep
them from him.


* Re: generic/068 crash on 5.18-rc2?
  2022-05-05  4:18                       ` Matthew Wilcox
@ 2022-05-05  4:24                         ` Darrick J. Wong
  2022-05-06 17:03                           ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2022-05-05  4:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Thu, May 05, 2022 at 05:18:27AM +0100, Matthew Wilcox wrote:
> On Wed, May 04, 2022 at 07:40:12PM -0700, Darrick J. Wong wrote:
> > On Tue, May 03, 2022 at 10:25:32AM -0700, Darrick J. Wong wrote:
> > > On Tue, May 03, 2022 at 05:31:01AM +0100, Matthew Wilcox wrote:
> > > > On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> > > > > On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > > > > > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > > > > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > > > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > > > > > better off just storing next_folio inside the folio_iter).
> > > > > > > 
> > > > > > > Does anyone have a preference for fixing this between Option A:
> > > > > > > 
> > > > > > 
> > > > > > After seeing the trace in my previous mail and several thousand
> > > > > > successful iterations of the test hack, I had reworked it into this
> > > > > > (which survived weekend testing until it ran into some other XFS problem
> > > > > > that looks unrelated):
> > > > > > 
> > > > > > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > > > > > index 278cc81cc1e7..aa820e09978e 100644
> > > > > > --- a/include/linux/bio.h
> > > > > > +++ b/include/linux/bio.h
> > > > > > @@ -269,6 +269,7 @@ struct folio_iter {
> > > > > >  	size_t offset;
> > > > > >  	size_t length;
> > > > > >  	/* private: for use by the iterator */
> > > > > > +	struct folio *_next;
> > > > > >  	size_t _seg_count;
> > > > > >  	int _i;
> > > > > >  };
> > > > > > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > > > > >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> > > > > >  
> > > > > >  	fi->folio = page_folio(bvec->bv_page);
> > > > > > +	fi->_next = folio_next(fi->folio);
> > > > > >  	fi->offset = bvec->bv_offset +
> > > > > >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > > > > >  	fi->_seg_count = bvec->bv_len;
> > > > > > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> > > > > >  {
> > > > > >  	fi->_seg_count -= fi->length;
> > > > > >  	if (fi->_seg_count) {
> > > > > > -		fi->folio = folio_next(fi->folio);
> > > > > > +		fi->folio = fi->_next;
> > > > > > +		fi->_next = folio_next(fi->folio);
> > > > > >  		fi->offset = 0;
> > > > > >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > > > > >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> > > > > >  		bio_first_folio(fi, bio, fi->_i + 1);
> > > > > >  	} else {
> > > > > >  		fi->folio = NULL;
> > > > > > +		fi->_next = NULL;
> > > > > >  	}
> > > > > >  }
> > > > > > 
> > > > > > So FWIW, that is just to say that I find option A to be cleaner and more
> > > > > > readable.
> > > > > 
> > > > > Me too.  I'll queue up the usual nightly tests with that patch added and
> > > > > we'll see how that does.
> > > > 
> > > > I've just pushed essentially that patch to my for-next tree in case
> > > > anybody does any testing with that.  I'll give it a couple of days
> > > > before creating a folio-5.18f tag and asking Linus to pull the first two
> > > > commits on
> > > > 
> > > > git://git.infradead.org/users/willy/pagecache.git for-next
> > > > 
> > > > That is, commits
> > > > 
> > > > 1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
> > > > 095099da208b ("mm/readahead: Fix readahead with large folios")
> > > 
> > > Hmm.  Well, I added 1a4c97 to my tree last night, and it seems to have
> > > cleared up all but two of the problems I saw with the for-next branch.
> > > 
> > > generic/388 still fails (40 minutes in) with:
> > > 
> > > WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> > > VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);
> > > 
> > > Which I think is the same problem where the fs goes down, XFS throws an
> > > error back to iomap_do_writepages, and it tries to discard a folio that
> > > already had writeback running on it.
> > > 
> > > There's also the same problem I reported a few days ago in xfs/501
> > > on a 64k-page ARM64 VM where:
> > > 
> > > run fstests xfs/501 at 2022-05-02 21:17:31
> > > XFS: Assertion failed: IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)), file: fs/xfs/xfs_log_cil.c, line: 430
> > > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > > 
> > > But I think that's a new bug that came in with all the log buffer
> > > alignment changes in the 5.19 branch.
> > > 
> > > Oh.  My tree still had the "disable large folios" patch in it.  I guess
> > > the "successful" results are mostly invalid then.
> > 
> > Well... with large folios turned back on and those two patches added to
> > the branch, *most* of the problems go away.  The generic/388 problem
> > persists, and last night's run showed that the weird xfs_dquot leak that
> > I"ve occasionally seen on 5.18 with xfs/43[46] also exists in 5.17.
> 
> OK, so let me just restate what I understand here ... these two patches
> cure some, but not all of the currently observed problems with 5.18.
> The problems that remain with 5.18 were also present in either 5.17
> or in 5.18 with large folios disabled, so at this point we know of no
> functional regressions that large folios can be blamed for?

Not quite -- the generic/388 one is definitely new as of 5.18-rc1.
Frustratingly, the problems "go away" if you enable KASAN, so I might
try KFENCE (or whatever the new, less-molasses memory debugger is named)
next.

I suspect the xfs/43[46] problems are probably more of the ongoing log
abend whackamole, since every time we change something in the logging
code, weird latent bugs from 20 years ago start pouring out like spiders
fleeing the first winter rains.

> I'll send these patches to Linus tomorrow then, since they fix problems
> that have been observed, and I don't think there's any reason to keep
> them from him.

Right, I don't think there's a reason to hang on to those any longer.

--D


* Re: generic/068 crash on 5.18-rc2?
  2022-05-05  4:24                         ` Darrick J. Wong
@ 2022-05-06 17:03                           ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2022-05-06 17:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Brian Foster, xfs, fstests, linux-fsdevel

On Wed, May 04, 2022 at 09:24:05PM -0700, Darrick J. Wong wrote:
> On Thu, May 05, 2022 at 05:18:27AM +0100, Matthew Wilcox wrote:
> > On Wed, May 04, 2022 at 07:40:12PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 03, 2022 at 10:25:32AM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 03, 2022 at 05:31:01AM +0100, Matthew Wilcox wrote:
> > > > > On Mon, May 02, 2022 at 08:25:34PM -0700, Darrick J. Wong wrote:
> > > > > > On Mon, May 02, 2022 at 08:20:00AM -0400, Brian Foster wrote:
> > > > > > > On Sat, Apr 30, 2022 at 10:40:31PM +0100, Matthew Wilcox wrote:
> > > > > > > > On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > > > > > > > > (I do not love this, have not even compiled it; it's late.  We may be
> > > > > > > > > better off just storing next_folio inside the folio_iter).
> > > > > > > > 
> > > > > > > > Does anyone have a preference for fixing this between Option A:
> > > > > > > > 
> > > > > > > 
> > > > > > > After seeing the trace in my previous mail and several thousand
> > > > > > > successful iterations of the test hack, I had reworked it into this
> > > > > > > (which survived weekend testing until it ran into some other XFS problem
> > > > > > > that looks unrelated):
> > > > > > > 
> > > > > > > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > > > > > > index 278cc81cc1e7..aa820e09978e 100644
> > > > > > > --- a/include/linux/bio.h
> > > > > > > +++ b/include/linux/bio.h
> > > > > > > @@ -269,6 +269,7 @@ struct folio_iter {
> > > > > > >  	size_t offset;
> > > > > > >  	size_t length;
> > > > > > >  	/* private: for use by the iterator */
> > > > > > > +	struct folio *_next;
> > > > > > >  	size_t _seg_count;
> > > > > > >  	int _i;
> > > > > > >  };
> > > > > > > @@ -279,6 +280,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
> > > > > > >  	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> > > > > > >  
> > > > > > >  	fi->folio = page_folio(bvec->bv_page);
> > > > > > > +	fi->_next = folio_next(fi->folio);
> > > > > > >  	fi->offset = bvec->bv_offset +
> > > > > > >  			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> > > > > > >  	fi->_seg_count = bvec->bv_len;
> > > > > > > @@ -290,13 +292,15 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> > > > > > >  {
> > > > > > >  	fi->_seg_count -= fi->length;
> > > > > > >  	if (fi->_seg_count) {
> > > > > > > -		fi->folio = folio_next(fi->folio);
> > > > > > > +		fi->folio = fi->_next;
> > > > > > > +		fi->_next = folio_next(fi->folio);
> > > > > > >  		fi->offset = 0;
> > > > > > >  		fi->length = min(folio_size(fi->folio), fi->_seg_count);
> > > > > > >  	} else if (fi->_i + 1 < bio->bi_vcnt) {
> > > > > > >  		bio_first_folio(fi, bio, fi->_i + 1);
> > > > > > >  	} else {
> > > > > > >  		fi->folio = NULL;
> > > > > > > +		fi->_next = NULL;
> > > > > > >  	}
> > > > > > >  }
> > > > > > > 
> > > > > > > So FWIW, that is just to say that I find option A to be cleaner and more
> > > > > > > readable.
> > > > > > 
> > > > > > Me too.  I'll queue up the usual nightly tests with that patch added and
> > > > > > we'll see how that does.
> > > > > 
> > > > > I've just pushed essentially that patch to my for-next tree in case
> > > > > anybody does any testing with that.  I'll give it a couple of days
> > > > > before creating a folio-5.18f tag and asking Linus to pull the first two
> > > > > commits on
> > > > > 
> > > > > git://git.infradead.org/users/willy/pagecache.git for-next
> > > > > 
> > > > > That is, commits
> > > > > 
> > > > > 1a4c97e2dd5b ("block: Do not call folio_next() on an unreferenced folio")
> > > > > 095099da208b ("mm/readahead: Fix readahead with large folios")
> > > > 
> > > > Hmm.  Well, I added 1a4c97 to my tree last night, and it seems to have
> > > > cleared up all but two of the problems I saw with the for-next branch.
> > > > 
> > > > generic/388 still fails (40 minutes in) with:
> > > > 
> > > > WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> > > > VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop, folio);
> > > > 
> > > > Which I think is the same problem where the fs goes down, XFS throws an
> > > > error back to iomap_do_writepages, and it tries to discard a folio that
> > > > already had writeback running on it.
> > > > 
> > > > There's also the same problem I reported a few days ago in xfs/501
> > > > on a 64k-page ARM64 VM where:
> > > > 
> > > > run fstests xfs/501 at 2022-05-02 21:17:31
> > > > XFS: Assertion failed: IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)), file: fs/xfs/xfs_log_cil.c, line: 430
> > > > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > > > XFS: Assertion failed: IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)), file: fs/xfs/xfs_log.c, line: 137
> > > > 
> > > > But I think that's a new bug that came in with all the log buffer
> > > > alignment changes in the 5.19 branch.
> > > > 
> > > > Oh.  My tree still had the "disable large folios" patch in it.  I guess
> > > > the "successful" results are mostly invalid then.
> > > 
> > > Well... with large folios turned back on and those two patches added to
> > > the branch, *most* of the problems go away.  The generic/388 problem
> > > persists, and last night's run showed that the weird xfs_dquot leak that
> > > I"ve occasionally seen on 5.18 with xfs/43[46] also exists in 5.17.
> > 
> > OK, so let me just restate what I understand here ... these two patches
> > cure some, but not all of the currently observed problems with 5.18.
> > The problems that remain with 5.18 were also present in either 5.17
> > or in 5.18 with large folios disabled, so at this point we know of no
> > functional regressions that large folios can be blamed for?
> 
> Not quite -- the generic/388 one is definitely new as of 5.18-rc1.
> Frustratingly, the problems "go away" if you enable KASAN, so I might
> try KFENCE (or whatever the new, less-molasses memory debugger is named)
> next.

I have some semi-good news for willy -- I've created an iomap-5.19-merge
candidate branch with Andreas' iomap fixes and willy's two folio
patches, and it tests cleanly on x64.

Unfortunately, I still see arm64 VMs tripping over:

WARN_ON(i_blocks_per_folio(inode, folio) > 1 && !iop)

with the same weird pattern where we try to discard an entire folio
even though the (n>1)th block in the folio is under writeback.  I added
even more debug info, and captured this:

ino 0x87 ibpf 0x40 mapping 0xfffffc0164c8c0e8 index 0x8
page:ffffffff004a0f00 refcount:4 mapcount:0 mapping:fffffc0164c8c0e8 index:0x8 pfn:0x1683c
head:ffffffff004a0f00 order:2 compound_mapcount:0 compound_pincount:0
memcg:fffffc0106c7c000
aops:xfs_address_space_operations [xfs] ino:87 dentry name:"file2"
flags: 0x1ffe000000018116(error|referenced|uptodate|lru|writeback|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 1ffe000000018116 ffffffff00399808 ffffffff003bc408 fffffc0164c8c0e8
raw: 0000000000000008 0000000000000000 00000004ffffffff fffffc0106c7c000
page dumped because: VM_BUG_ON_FOLIO(i_blocks_per_folio(inode, folio) > 1 && !iop)

Looks like we have a single-page folio experiencing this problem.  I'm
not sure if we've simply been tripping over this all along and I just
never noticed until I turned the WARN_ON into a VM_BUG_ON, which
actually takes down the system... but I suspect this is a problem with
writeback (albeit on a shut down filesystem) that has been around for a
while.
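
For anyone reproducing the arithmetic in that debug line: i_blocks_per_folio()
boils down, as far as I can tell, to the folio size shifted down by the
inode's block-size bits, so "ibpf 0x40" means 64 blocks in the folio and the
"> 1 && !iop" condition is easily reachable.  A quick sketch of that
computation (the helper below is a local stand-in, not the kernel's
definition):

#include <stdio.h>

/* folio bytes >> block-size bits, mirroring my reading of the helper */
static unsigned int blocks_per_folio(size_t folio_bytes, unsigned int blkbits)
{
        return folio_bytes >> blkbits;
}

int main(void)
{
        /* 4 KiB blocks (blkbits = 12) in a 64 KiB folio -> 16 blocks */
        printf("%u\n", blocks_per_folio(64 * 1024, 12));
        /* two combinations that both give the 0x40 seen above */
        printf("%#x\n", blocks_per_folio(256 * 1024, 12));     /* 256 KiB folio, 4 KiB blocks */
        printf("%#x\n", blocks_per_folio(64 * 1024, 10));      /* 64 KiB folio, 1 KiB blocks */
        return 0;
}

Either a 256 KiB folio of 4 KiB blocks or a 64 KiB folio of 1 KiB blocks
produces that 0x40, so the dump alone doesn't pin down the geometry.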

I also have some not so good news for Dave -- I think this implies that
something in the xfs 5.19 merge branch might be behind all of these
weird dquot leaks in xfs/43[46] and the blowups in generic/388.  I'll
try to get to bisecting that next week.

--D

> I suspect the xfs/43[46] problems are probably more of the ongoing log
> abend whackamole, since every time we change something in the logging
> code, weird latent bugs from 20 years ago start pouring out like spiders
> fleeing the first winter rains.
> 
> > I'll send these patches to Linus tomorrow then, since they fix problems
> > that have been observed, and I don't think there's any reason to keep
> > them from him.
> 
> Right, I don't think there's a reason to hang on to those any longer.
> 
> --D


end of thread, other threads:[~2022-05-06 17:03 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
2022-04-13  3:34 generic/068 crash on 5.18-rc2? Darrick J. Wong
2022-04-13 14:50 ` Matthew Wilcox
2022-04-13 16:23   ` Darrick J. Wong
2022-04-13 16:35     ` Matthew Wilcox
2022-04-18 18:44       ` Darrick J. Wong
2022-04-18 17:47   ` Darrick J. Wong
2022-04-20  0:37     ` Darrick J. Wong
2022-04-22 21:59     ` Darrick J. Wong
2022-04-28 15:53       ` Brian Foster
2022-04-30  3:10         ` Darrick J. Wong
2022-04-30  3:44         ` Matthew Wilcox
2022-04-30 21:40           ` Matthew Wilcox
2022-05-02 12:20             ` Brian Foster
2022-05-03  3:25               ` Darrick J. Wong
2022-05-03  4:31                 ` Matthew Wilcox
2022-05-03 17:25                   ` Darrick J. Wong
2022-05-05  2:40                     ` Darrick J. Wong
2022-05-05  4:18                       ` Matthew Wilcox
2022-05-05  4:24                         ` Darrick J. Wong
2022-05-06 17:03                           ` Darrick J. Wong
2022-05-02 12:18           ` Brian Foster
2022-05-02 13:00             ` Matthew Wilcox
