All of lore.kernel.org
 help / color / mirror / Atom feed
* moar weird metadata corruptions, this time on arm64
@ 2022-11-22  0:16 Darrick J. Wong
  2022-11-22  1:58 ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-11-22  0:16 UTC (permalink / raw)
  To: xfs

Hi all,

I've been running near-continuous integration testing of online fsck,
and I've noticed that once a day, one of the ARM VMs will fail the test
with out of order records in the data fork.

xfs/804 races fsstress with online scrub (aka scan but do not change
anything), so I think this might be a bug in the core xfs code.  This
also only seems to trigger if one runs the test for more than ~6 minutes
via TIME_FACTOR=13 or something.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf

I added a debugging patch to the kernel to check the data fork extents
after taking the ILOCK, before dropping ILOCK, and before and after each
bmapping operation.  So far I've narrowed it down to the delalloc code
inserting a record in the wrong place in the iext tree:

xfs_bmap_add_extent_hole_delay, near line 2691:

	case 0:
		/*
		 * New allocation is not contiguous with another
		 * delayed allocation.
		 * Insert a new entry.
		 */
		oldlen = newlen = 0;
		xfs_iunlock_check_datafork(ip);		<-- ok here
		xfs_iext_insert(ip, icur, new, state);
		xfs_iunlock_check_datafork(ip);		<-- bad here
		break;
	}

But I haven't dug far enough to figure out if the insertion does
anything fancy like rebalance the iext tree nodes.  Will add that
tonight.  Also, curiously, so far this has /only/ reproduced on arm64
with 64k pages.  Regrettably, I also have not yet stood up any long term
soak VMs for ARM64, so I don't even know if this affects TOT 6.1-rcX, or
only djwong-wtf.

Anyway, persisting this to the mailing list in case this rings a bell
for anyone else.

--D

run fstests xfs/804 at 2022-11-21 00:59:48
spectre-v4 mitigation disabled by command-line option
XFS (sda2): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
XFS (sda2): Mounting V5 Filesystem a82f60e2-c283-4008-baf7-617a68397795
XFS (sda2): Ending clean mount
XFS (sda2): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (sda3): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
XFS (sda3): Mounting V5 Filesystem 849fc538-5171-40ae-94bc-542b7236eb9e
XFS (sda3): Ending clean mount
XFS (sda3): Quotacheck needed: Please wait.
XFS (sda3): Quotacheck: Done.
XFS (sda3): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (sda3): ino 0x6095c72 nr 0x4 offset 0x6a nextoff 0x85
XFS: Assertion failed: got.br_startoff >= nextoff, file: fs/xfs/xfs_inode.c, line: 136
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2897659 at fs/xfs/xfs_message.c:104 assfail+0x4c/0x5c [xfs]
Modules linked in: xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac nf_tables libcrc32c bfq crct10dif_ce sch_fq_codel fuse configfs efivarfs ip_tables x_tables overlay nfsv4
CPU: 0 PID: 2897659 Comm: fsstress Not tainted 6.1.0-rc6-xfsa #rc6 3e319380b68cffd23e45920c8e84d5a5bad7f2aa
Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : assfail+0x4c/0x5c [xfs]
lr : assfail+0x3c/0x5c [xfs]
sp : fffffe000fe4f7d0
x29: fffffe000fe4f7d0 x28: 0000000000000001 x27: fffffe0001650928
x26: fffffe0001650960 x25: 0000000000000a83 x24: fffffe00016367d0
x23: 0000000000000003 x22: 0000000000000004 x21: 0000000000000085
x20: fffffc001dacfa00 x19: fffffc001dacfa40 x18: 0000000000000000
x17: 08000000a57b0800 x16: 8bc8553800000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000010000 x12: 00000000000004dc
x11: fffffe000fe4f700 x10: fffffe00016519d8 x9 : fffffe00816519d7
x8 : 000000000000000a x7 : 00000000ffffffc0 x6 : 0000000000000021
x5 : fffffe00016519d9 x4 : 00000000ffffffca x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 assfail+0x4c/0x5c [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 __xfs_iunlock_check_datafork+0x150/0x29c [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 xfs_bmap_add_extent_hole_delay.constprop.0+0x14c/0x5b4 [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 xfs_bmapi_reserve_delalloc+0x1f4/0x390 [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 xfs_buffered_write_iomap_begin+0x414/0x97c [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 iomap_iter+0x134/0x360
 iomap_file_buffered_write+0x224/0x2d0
 xfs_file_buffered_write+0xc0/0x2f0 [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 xfs_file_write_iter+0x124/0x2c0 [xfs 906731d4aa511f3820f146284d4b72ed26f09c78]
 vfs_write+0x270/0x370
 ksys_write+0x70/0x100
 __arm64_sys_write+0x24/0x30
 do_el0_svc+0x88/0x190
 el0_svc+0x40/0x190
 el0t_64_sync_handler+0xbc/0x140
 el0t_64_sync+0x18c/0x190
---[ end trace 0000000000000000 ]---
XFS (sda3): ino 0x6095c72 func xfs_bmap_add_extent_hole_delay line 2691 data fork:
XFS (sda3):    ino 0x6095c72 nr 0x0 nr_real 0x0 offset 0x26 blockcount 0x4 startblock 0xc119c4 state 0
XFS (sda3):    ino 0x6095c72 nr 0x1 nr_real 0x1 offset 0x2a blockcount 0x26 startblock 0xcc457e state 1
XFS (sda3):    ino 0x6095c72 nr 0x2 nr_real 0x2 offset 0x58 blockcount 0x12 startblock 0xcc45ac state 1
XFS (sda3):    ino 0x6095c72 nr 0x3 nr_real 0x3 offset 0x70 blockcount 0x15 startblock 0xffffffffe0007 state 0
XFS (sda3):    ino 0x6095c72 nr 0x4 nr_real 0x3 offset 0x6a blockcount 0x6 startblock 0xcc45be state 0
XFS (sda3):    ino 0x6095c72 nr 0x5 nr_real 0x4 offset 0xa7 blockcount 0x19 startblock 0x17ff88 state 0
XFS (sda3):    ino 0x6095c72 nr 0x6 nr_real 0x5 offset 0xe8 blockcount 0x8 startblock 0x18004e state 0
XFS (sda3):    ino 0x6095c72 nr 0x7 nr_real 0x6 offset 0x195 blockcount 0x2 startblock 0x410f0e state 0
XFS (sda3):    ino 0x6095c72 nr 0x8 nr_real 0x7 offset 0x1ac blockcount 0x2 startblock 0x41e169 state 0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-22  0:16 moar weird metadata corruptions, this time on arm64 Darrick J. Wong
@ 2022-11-22  1:58 ` Dave Chinner
  2022-11-23 20:00   ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2022-11-22  1:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> Hi all,
> 
> I've been running near-continuous integration testing of online fsck,
> and I've noticed that once a day, one of the ARM VMs will fail the test
> with out of order records in the data fork.
> 
> xfs/804 races fsstress with online scrub (aka scan but do not change
> anything), so I think this might be a bug in the core xfs code.  This
> also only seems to trigger if one runs the test for more than ~6 minutes
> via TIME_FACTOR=13 or something.
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> 
> I added a debugging patch to the kernel to check the data fork extents
> after taking the ILOCK, before dropping ILOCK, and before and after each
> bmapping operation.  So far I've narrowed it down to the delalloc code
> inserting a record in the wrong place in the iext tree:
> 
> xfs_bmap_add_extent_hole_delay, near line 2691:
> 
> 	case 0:
> 		/*
> 		 * New allocation is not contiguous with another
> 		 * delayed allocation.
> 		 * Insert a new entry.
> 		 */
> 		oldlen = newlen = 0;
> 		xfs_iunlock_check_datafork(ip);		<-- ok here
> 		xfs_iext_insert(ip, icur, new, state);
> 		xfs_iunlock_check_datafork(ip);		<-- bad here
> 		break;
.....
> XFS (sda3): ino 0x6095c72 nr 0x4 offset 0x6a nextoff 0x85
> XFS: Assertion failed: got.br_startoff >= nextoff, file: fs/xfs/xfs_inode.c, line: 136
....
> XFS (sda3): ino 0x6095c72 func xfs_bmap_add_extent_hole_delay line 2691 data fork:
> XFS (sda3):    ino 0x6095c72 nr 0x0 nr_real 0x0 offset 0x26 blockcount 0x4 startblock 0xc119c4 state 0
> XFS (sda3):    ino 0x6095c72 nr 0x1 nr_real 0x1 offset 0x2a blockcount 0x26 startblock 0xcc457e state 1
> XFS (sda3):    ino 0x6095c72 nr 0x2 nr_real 0x2 offset 0x58 blockcount 0x12 startblock 0xcc45ac state 1
> XFS (sda3):    ino 0x6095c72 nr 0x3 nr_real 0x3 offset 0x70 blockcount 0x15 startblock 0xffffffffe0007 state 0
> XFS (sda3):    ino 0x6095c72 nr 0x4 nr_real 0x3 offset 0x6a blockcount 0x6 startblock 0xcc45be state 0
> XFS (sda3):    ino 0x6095c72 nr 0x5 nr_real 0x4 offset 0xa7 blockcount 0x19 startblock 0x17ff88 state 0

So icur prior to insertion should point to record 0x5, offset 0xa7
(right).  Prev (left) should point to record 0x4, offset 0x6a.

This makes both left and right valid, and while left is adjacent,
it's a different type so isn't contiguous.

So falling through to "case 0" is correct.

But then it inserts it at index 0x3 before record 0x4, not
at index 0x4 before record 0x5.

From xfs_iext_insert():

        for (i = nr_entries; i > cur->pos; i--)
		cur->leaf->recs[i] = cur->leaf->recs[i - 1];
	xfs_iext_set(cur_rec(cur), irec);

This implies cur->pos is wrong. i.e. it made a hole cur->pos = 0x3
and inserted there, not at cur->pos = 0x4.

Can you add debug to trace the iext cursor as
xfs_buffered_write_iomap_begin() ->
  xfs_bmapi_reserve_delalloc() ->
    xfs_bmap_add_extent_hole_delay()->
      xfs_iext_insert()

runs? The iext cursor could have been wrong for some time before
this insert tripped over it, so this may just be the messenger that
something has just smashed the stack (icur is a stack variable).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-22  1:58 ` Dave Chinner
@ 2022-11-23 20:00   ` Darrick J. Wong
  2022-11-24  4:40     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-11-23 20:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote:
> On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> > Hi all,
> > 
> > I've been running near-continuous integration testing of online fsck,
> > and I've noticed that once a day, one of the ARM VMs will fail the test
> > with out of order records in the data fork.
> > 
> > xfs/804 races fsstress with online scrub (aka scan but do not change
> > anything), so I think this might be a bug in the core xfs code.  This
> > also only seems to trigger if one runs the test for more than ~6 minutes
> > via TIME_FACTOR=13 or something.
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> > 
> > I added a debugging patch to the kernel to check the data fork extents
> > after taking the ILOCK, before dropping ILOCK, and before and after each
> > bmapping operation.  So far I've narrowed it down to the delalloc code
> > inserting a record in the wrong place in the iext tree:
> > 
> > xfs_bmap_add_extent_hole_delay, near line 2691:
> > 
> > 	case 0:
> > 		/*
> > 		 * New allocation is not contiguous with another
> > 		 * delayed allocation.
> > 		 * Insert a new entry.
> > 		 */
> > 		oldlen = newlen = 0;
> > 		xfs_iunlock_check_datafork(ip);		<-- ok here
> > 		xfs_iext_insert(ip, icur, new, state);
> > 		xfs_iunlock_check_datafork(ip);		<-- bad here
> > 		break;
> .....
> > XFS (sda3): ino 0x6095c72 nr 0x4 offset 0x6a nextoff 0x85
> > XFS: Assertion failed: got.br_startoff >= nextoff, file: fs/xfs/xfs_inode.c, line: 136
> ....
> > XFS (sda3): ino 0x6095c72 func xfs_bmap_add_extent_hole_delay line 2691 data fork:
> > XFS (sda3):    ino 0x6095c72 nr 0x0 nr_real 0x0 offset 0x26 blockcount 0x4 startblock 0xc119c4 state 0
> > XFS (sda3):    ino 0x6095c72 nr 0x1 nr_real 0x1 offset 0x2a blockcount 0x26 startblock 0xcc457e state 1
> > XFS (sda3):    ino 0x6095c72 nr 0x2 nr_real 0x2 offset 0x58 blockcount 0x12 startblock 0xcc45ac state 1
> > XFS (sda3):    ino 0x6095c72 nr 0x3 nr_real 0x3 offset 0x70 blockcount 0x15 startblock 0xffffffffe0007 state 0
> > XFS (sda3):    ino 0x6095c72 nr 0x4 nr_real 0x3 offset 0x6a blockcount 0x6 startblock 0xcc45be state 0
> > XFS (sda3):    ino 0x6095c72 nr 0x5 nr_real 0x4 offset 0xa7 blockcount 0x19 startblock 0x17ff88 state 0
> 
> So icur prior to insertion should point to record 0x5, offset 0xa7
> (right).  Prev (left) should point to record 0x4, offset 0x6a.
> 
> This makes both left and right valid, and while left is adjacent,
> it's a different type so isn't contiguous.
> 
> So falling through to "case 0" is correct.
> 
> But then it inserts it at index 0x3 before record 0x4, not
> at index 0x4 before record 0x5.
> 
> From xfs_iext_insert():
> 
>         for (i = nr_entries; i > cur->pos; i--)
> 		cur->leaf->recs[i] = cur->leaf->recs[i - 1];
> 	xfs_iext_set(cur_rec(cur), irec);
> 
> This implies cur->pos is wrong. i.e. it made a hole cur->pos = 0x3
> and inserted there, not at cur->pos = 0x4.
> 
> Can you add debug to trace the iext cursor as
> xfs_buffered_write_iomap_begin() ->
>   xfs_bmapi_reserve_delalloc() ->
>     xfs_bmap_add_extent_hole_delay()->
>       xfs_iext_insert()
> 
> runs? The iext cursor could have been wrong for some time before
> this insert tripped over it, so this may just be the messenger that
> something has just smashed the stack (icur is a stack variable).

In the latest run, I got this trace data:

ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e
ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork:
    ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0
    ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
    ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
    ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
    ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0
    ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
    ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0
    ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1
    ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
    ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1

Here we again see that a delalloc extent was inserted into the wrong
position in the iext leaf, same as last time.  The extra trace data I
collected are as follows:

ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c
    ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0
    ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0
    ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0
    ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680
    ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800

The first line shows that xfs_bmapi_reserve_delalloc was called with
whichfork=XFS_DATA_FORK, off=0x318, len=0x6, prealloc=0.

The second line shows the contents of @got at the beginning of the call,
which are the results of the first iext lookup in
xfs_buffered_write_iomap_begin.

"freshgot" (line 3) are the results of a new xfs_iext_get_extent right
before the call to xfs_bmap_add_extent_hole_delay.  Totally garbage.

"nowgot" (line 4) is contents of @got after the
xfs_bmap_add_extent_hole_delay call.

Line 5 is a copy of @icur at the beginning fo the call, and line 6 is
the contents of @icur after the xfs_bmap_add_extent_hole_delay call.
Notice that the cursor positions are the same, but the leaf pointers are
different.  I suspect that leaf ~d8f4680 has been freed, and this is the
reason why freshgot is totally garbage.  I wonder if the leaf pointers
being different is the result of an iext btree splitting into 2 objects
and then being recombined into one?

I augmented the xfs_iext_* functions to check the ILOCK state in all
functions that are passed an xfs_inode.  None of them tripped across the
entire fstests cloud run last night, so there's no obvious problem
there.  The buffered write path takes ILOCK_EXCL and keeps it right up
to where the debug splat happens, so there's no locking problem there.

So I started looking for things that could shift the extent count by one.
Looking for semi-adjacent records, I noticed this:

nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1

0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone
punching 0x10 blocks.

ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0

The incorrect delalloc reservation notwithstanding, these two records
are logically and physically adjacent, with the only difference being
that one is unwritten and the other is not.  Someone could have
converted an unwritten extent to written, possibly as a result of a
post-write conversion?

ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1

The 0xc70ded here is familiar -- I wonder if this got mapped here as a
result of an FIEXCHANGE'd with offset 0x140?  Extent 9 is adjacent with
extent 8, except for the state difference.  Hmm.  I guess I better go
check the FIEXCHANGE code...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-23 20:00   ` Darrick J. Wong
@ 2022-11-24  4:40     ` Dave Chinner
  2022-11-24  6:46       ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2022-11-24  4:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote:
> > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> > > Hi all,
> > > 
> > > I've been running near-continuous integration testing of online fsck,
> > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > with out of order records in the data fork.
.....
> In the latest run, I got this trace data:
> 
> ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e
> ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork:
>     ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0
>     ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
>     ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
>     ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
>     ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0
>     ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
>     ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0
>     ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1
>     ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
>     ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> 
> Here we again see that a delalloc extent was inserted into the wrong
> position in the iext leaf, same as last time.  The extra trace data I
> collected are as follows:
> 
> ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c
>     ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0
>     ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0
>     ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0
>     ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680
>     ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800

.....

> Line 5 is a copy of @icur at the beginning fo the call, and line 6 is
> the contents of @icur after the xfs_bmap_add_extent_hole_delay call.
> Notice that the cursor positions are the same, but the leaf pointers are
> different.  I suspect that leaf ~d8f4680 has been freed, and this is the
> reason why freshgot is totally garbage. 

That implies that a extent tree modification is being made whilst
the delalloc function is holding the ILOCK_EXCL. Either rwsems on
ARM are broken (entirely possible given the subtle memory ordering
of the slow paths has caused this sort of problem on x64-64 multiple
times in the past), or something else isn't holding the ILOCK_EXCL
while modifying the iext tree.

> I wonder if the leaf pointers
> being different is the result of an iext btree splitting into 2 objects
> and then being recombined into one?

That implies multiple operations occurred - the single leaf won't
split until it is full - that's when we add the 15th record to the
tree. We're nowhere near that. And it won't attempt a rebalance that
may merge the the two leaf nodes until a remove occurs and the
number of entries in that leaf drops to half full (7 entries) and
the combined total is less that a full leaf.

Further, this is the left-most leaf, so any split will allocate a
new right sibling and move the entries right in to the new node. If
either the new right or the left node then rebalance, they will
always merge to the left-most leaf and free the right leaf. IOWs,
a grow-split-shrink-merge on the left most leaf in the tree will not
change the address of that left-most leaf - the right leave will get
allocated then freed...

So I'm not sure that even a grow-split-shrink-merge has occurred
concurrently here.....

Oh.... inserting into the root block of the tree results in
xfs_iext_realloc_root() being called, and krealloc() is called to
grow the root leaf block. that points cur->leaf at the newly
reallocated chunk of memory.

Ok, so the change in cursor is to be expected. There's nothing wrong
with the cursor, or that we have a reallocated the root leaf.

That leaves something not holding the right lock when inserting
a new extent,, or rwsems are broken on ARM.

> I augmented the xfs_iext_* functions to check the ILOCK state in all
> functions that are passed an xfs_inode.  None of them tripped across the
> entire fstests cloud run last night, so there's no obvious problem
> there.  The buffered write path takes ILOCK_EXCL and keeps it right up
> to where the debug splat happens, so there's no locking problem there.
> 
> So I started looking for things that could shift the extent count by one.
> Looking for semi-adjacent records, I noticed this:
> 
> nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> 
> 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone
> punching 0x10 blocks.
> 
> ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> 
> The incorrect delalloc reservation notwithstanding, these two records
> are logically and physically adjacent, with the only difference being
> that one is unwritten and the other is not.  Someone could have
> converted an unwritten extent to written, possibly as a result of a
> post-write conversion?
> 
> ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> 
> The 0xc70ded here is familiar -- I wonder if this got mapped here as a
> result of an FIEXCHANGE'd with offset 0x140?  Extent 9 is adjacent with
> extent 8, except for the state difference.  Hmm.  I guess I better go
> check the FIEXCHANGE code...

And there's no such thing a FIEXCHANGE in the upstream current code
base, so that would be a good place to look for shenanigans.... :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-24  4:40     ` Dave Chinner
@ 2022-11-24  6:46       ` Darrick J. Wong
  2022-11-24 16:44         ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-11-24  6:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Nov 24, 2022 at 03:40:23PM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote:
> > On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote:
> > > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> > > > Hi all,
> > > > 
> > > > I've been running near-continuous integration testing of online fsck,
> > > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > > with out of order records in the data fork.
> .....
> > In the latest run, I got this trace data:
> > 
> > ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e
> > ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork:
> >     ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0
> >     ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> >     ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> >     ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> >     ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0
> >     ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> >     ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0
> >     ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1
> >     ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> >     ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> > 
> > Here we again see that a delalloc extent was inserted into the wrong
> > position in the iext leaf, same as last time.  The extra trace data I
> > collected are as follows:
> > 
> > ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c
> >     ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0
> >     ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0
> >     ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0
> >     ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680
> >     ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800
> 
> .....
> 
> > Line 5 is a copy of @icur at the beginning fo the call, and line 6 is
> > the contents of @icur after the xfs_bmap_add_extent_hole_delay call.
> > Notice that the cursor positions are the same, but the leaf pointers are
> > different.  I suspect that leaf ~d8f4680 has been freed, and this is the
> > reason why freshgot is totally garbage. 
> 
> That implies that a extent tree modification is being made whilst
> the delalloc function is holding the ILOCK_EXCL. Either rwsems on
> ARM are broken (entirely possible given the subtle memory ordering
> of the slow paths has caused this sort of problem on x64-64 multiple
> times in the past), or something else isn't holding the ILOCK_EXCL
> while modifying the iext tree.

Hmm, maybe I'll try turning on lockdep for arm64 and see if it has
anythign interesting (ha!) to say.  Well, it /is/ a long holiday
weekend....

> > I wonder if the leaf pointers
> > being different is the result of an iext btree splitting into 2 objects
> > and then being recombined into one?
> 
> That implies multiple operations occurred - the single leaf won't
> split until it is full - that's when we add the 15th record to the
> tree. We're nowhere near that. And it won't attempt a rebalance that
> may merge the the two leaf nodes until a remove occurs and the
> number of entries in that leaf drops to half full (7 entries) and
> the combined total is less that a full leaf.
> 
> Further, this is the left-most leaf, so any split will allocate a
> new right sibling and move the entries right in to the new node. If
> either the new right or the left node then rebalance, they will
> always merge to the left-most leaf and free the right leaf. IOWs,
> a grow-split-shrink-merge on the left most leaf in the tree will not
> change the address of that left-most leaf - the right leave will get
> allocated then freed...
> 
> So I'm not sure that even a grow-split-shrink-merge has occurred
> concurrently here.....
> 
> Oh.... inserting into the root block of the tree results in
> xfs_iext_realloc_root() being called, and krealloc() is called to
> grow the root leaf block. that points cur->leaf at the newly
> reallocated chunk of memory.
> 
> Ok, so the change in cursor is to be expected. There's nothing wrong
> with the cursor, or that we have a reallocated the root leaf.

Ah, ok.

> That leaves something not holding the right lock when inserting
> a new extent,, or rwsems are broken on ARM.

That reminded me that a month or two back I also saw this crash when
alwasycow is turned on and we start runnng generic/522:

stack segment: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 430863 Comm: fsstress Not tainted 6.0.0-xfsx #6.0.0 ae8d0391a7a281b411e9d54eb9b1c5c85ef7dbc0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs]
Code: 41 89 5c 24 08 74 33 85 db 41 8b 55 14 78 38 83 fa 01 b8 0f 00 00 00 74 4c 39 c3 7d 1d 48 63 f3 48 83 fe 0f 0f 87 c4 00 00 00 <48> 83 7d 00 00 74 09 5b 5d 41 5c 41 5d 4
d 10 85 db
RSP: 0018:ffffc9000207fab8 EFLAGS: 00010297
RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c
RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff88815598d6c0
RBP: 000229e31f0000e9 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000001000 R12: ffffc9000207fb10
R13: ffff88815598d6c0 R14: 000229e31f000001 R15: 000ffffffffe0000
FS:  00007f46c03c4740(0000) GS:ffff88842d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f46c03c3000 CR3: 000000014b16c000 CR4: 00000000003506f0
Call Trace:
 <TASK>
 xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
 ? __filemap_add_folio+0x1d9/0x590
 ? xfs_ilock+0xb5/0x1f0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
 xfs_buffered_write_iomap_begin+0xa66/0xbc0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
 ? __filemap_get_folio+0x15c/0x330
 ? xfs_buffered_write_iomap_end+0x63/0x190 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
 iomap_iter+0x122/0x2c0
 iomap_file_buffered_write+0x92/0x360
 xfs_file_buffered_write+0xb1/0x330 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f46c04dba37
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fffd7ea9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000001192f RCX: 00007f46c04dba37
RDX: 000000000001192f RSI: 0000558ce0a4f600 RDI: 0000000000000004
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000558ce0a4f600
R10: 00007f46c05e1430 R11: 0000000000000246 R12: 00000000000000e8
R13: 000000000005c0fe R14: 0000558ce0a4f600 R15: 0000000000000000

So I wonder if this is the same kind of thing -- someone kreallocates
the iext leaf block without taking ILOCK, and now our poor thread here
walks off the dead leaf pointer in the cursor.

Or maybe this is a totally separate problem, who knows.  I've only seen
it twice in the last 6 months.

> > I augmented the xfs_iext_* functions to check the ILOCK state in all
> > functions that are passed an xfs_inode.  None of them tripped across the
> > entire fstests cloud run last night, so there's no obvious problem
> > there.  The buffered write path takes ILOCK_EXCL and keeps it right up
> > to where the debug splat happens, so there's no locking problem there.
> > 
> > So I started looking for things that could shift the extent count by one.
> > Looking for semi-adjacent records, I noticed this:
> > 
> > nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> > nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> > 
> > 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone
> > punching 0x10 blocks.
> > 
> > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> > 
> > The incorrect delalloc reservation notwithstanding, these two records
> > are logically and physically adjacent, with the only difference being
> > that one is unwritten and the other is not.  Someone could have
> > converted an unwritten extent to written, possibly as a result of a
> > post-write conversion?
> > 
> > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> > 
> > The 0xc70ded here is familiar -- I wonder if this got mapped here as a
> > result of an FIEXCHANGE'd with offset 0x140?  Extent 9 is adjacent with
> > extent 8, except for the state difference.  Hmm.  I guess I better go
> > check the FIEXCHANGE code...
> 
> And there's no such thing a FIEXCHANGE in the upstream current code
> base, so that would be a good place to look for shenanigans.... :)

Hm.  I looked over the code for that, and it takes ILOCK_EXCL on both
inodes involved in the file contents exchange, and holds it past the
final transaction commit + defer chain processing.  So I don't think
it's there.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-24  6:46       ` Darrick J. Wong
@ 2022-11-24 16:44         ` Darrick J. Wong
  2022-11-24 20:44           ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-11-24 16:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Nov 23, 2022 at 10:46:13PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 24, 2022 at 03:40:23PM +1100, Dave Chinner wrote:
> > On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote:
> > > On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> > > > > Hi all,
> > > > > 
> > > > > I've been running near-continuous integration testing of online fsck,
> > > > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > > > with out of order records in the data fork.
> > .....
> > > In the latest run, I got this trace data:
> > > 
> > > ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e
> > > ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork:
> > >     ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0
> > >     ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> > >     ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> > >     ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> > >     ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0
> > >     ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> > >     ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0
> > >     ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1
> > >     ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> > >     ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> > > 
> > > Here we again see that a delalloc extent was inserted into the wrong
> > > position in the iext leaf, same as last time.  The extra trace data I
> > > collected are as follows:
> > > 
> > > ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c
> > >     ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0
> > >     ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0
> > >     ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0
> > >     ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680
> > >     ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800
> > 
> > .....
> > 
> > > Line 5 is a copy of @icur at the beginning fo the call, and line 6 is
> > > the contents of @icur after the xfs_bmap_add_extent_hole_delay call.
> > > Notice that the cursor positions are the same, but the leaf pointers are
> > > different.  I suspect that leaf ~d8f4680 has been freed, and this is the
> > > reason why freshgot is totally garbage. 
> > 
> > That implies that a extent tree modification is being made whilst
> > the delalloc function is holding the ILOCK_EXCL. Either rwsems on
> > ARM are broken (entirely possible given the subtle memory ordering
> > of the slow paths has caused this sort of problem on x64-64 multiple
> > times in the past), or something else isn't holding the ILOCK_EXCL
> > while modifying the iext tree.
> 
> Hmm, maybe I'll try turning on lockdep for arm64 and see if it has
> anythign interesting (ha!) to say.  Well, it /is/ a long holiday
> weekend....
> 
> > > I wonder if the leaf pointers
> > > being different is the result of an iext btree splitting into 2 objects
> > > and then being recombined into one?
> > 
> > That implies multiple operations occurred - the single leaf won't
> > split until it is full - that's when we add the 15th record to the
> > tree. We're nowhere near that. And it won't attempt a rebalance that
> > may merge the the two leaf nodes until a remove occurs and the
> > number of entries in that leaf drops to half full (7 entries) and
> > the combined total is less that a full leaf.
> > 
> > Further, this is the left-most leaf, so any split will allocate a
> > new right sibling and move the entries right in to the new node. If
> > either the new right or the left node then rebalance, they will
> > always merge to the left-most leaf and free the right leaf. IOWs,
> > a grow-split-shrink-merge on the left most leaf in the tree will not
> > change the address of that left-most leaf - the right leave will get
> > allocated then freed...
> > 
> > So I'm not sure that even a grow-split-shrink-merge has occurred
> > concurrently here.....
> > 
> > Oh.... inserting into the root block of the tree results in
> > xfs_iext_realloc_root() being called, and krealloc() is called to
> > grow the root leaf block. that points cur->leaf at the newly
> > reallocated chunk of memory.
> > 
> > Ok, so the change in cursor is to be expected. There's nothing wrong
> > with the cursor, or that we have a reallocated the root leaf.
> 
> Ah, ok.
> 
> > That leaves something not holding the right lock when inserting
> > a new extent,, or rwsems are broken on ARM.
> 
> That reminded me that a month or two back I also saw this crash when
> alwasycow is turned on and we start runnng generic/522:
> 
> stack segment: 0000 [#1] PREEMPT SMP
> CPU: 0 PID: 430863 Comm: fsstress Not tainted 6.0.0-xfsx #6.0.0 ae8d0391a7a281b411e9d54eb9b1c5c85ef7dbc0
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs]
> Code: 41 89 5c 24 08 74 33 85 db 41 8b 55 14 78 38 83 fa 01 b8 0f 00 00 00 74 4c 39 c3 7d 1d 48 63 f3 48 83 fe 0f 0f 87 c4 00 00 00 <48> 83 7d 00 00 74 09 5b 5d 41 5c 41 5d 4
> d 10 85 db
> RSP: 0018:ffffc9000207fab8 EFLAGS: 00010297
> RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c
> RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff88815598d6c0
> RBP: 000229e31f0000e9 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000001000 R12: ffffc9000207fb10
> R13: ffff88815598d6c0 R14: 000229e31f000001 R15: 000ffffffffe0000
> FS:  00007f46c03c4740(0000) GS:ffff88842d000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f46c03c3000 CR3: 000000014b16c000 CR4: 00000000003506f0
> Call Trace:
>  <TASK>
>  xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
>  ? __filemap_add_folio+0x1d9/0x590
>  ? xfs_ilock+0xb5/0x1f0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
>  xfs_buffered_write_iomap_begin+0xa66/0xbc0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
>  ? __filemap_get_folio+0x15c/0x330
>  ? xfs_buffered_write_iomap_end+0x63/0x190 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
>  iomap_iter+0x122/0x2c0
>  iomap_file_buffered_write+0x92/0x360
>  xfs_file_buffered_write+0xb1/0x330 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15]
>  vfs_write+0x2eb/0x410
>  ksys_write+0x65/0xe0
>  do_syscall_64+0x2b/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f46c04dba37
> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007fffd7ea9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 000000000001192f RCX: 00007f46c04dba37
> RDX: 000000000001192f RSI: 0000558ce0a4f600 RDI: 0000000000000004
> RBP: 0000000000000004 R08: 0000000000000000 R09: 0000558ce0a4f600
> R10: 00007f46c05e1430 R11: 0000000000000246 R12: 00000000000000e8
> R13: 000000000005c0fe R14: 0000558ce0a4f600 R15: 0000000000000000
> 
> So I wonder if this is the same kind of thing -- someone kreallocates
> the iext leaf block without taking ILOCK, and now our poor thread here
> walks off the dead leaf pointer in the cursor.
> 
> Or maybe this is a totally separate problem, who knows.  I've only seen
> it twice in the last 6 months.
> 
> > > I augmented the xfs_iext_* functions to check the ILOCK state in all
> > > functions that are passed an xfs_inode.  None of them tripped across the
> > > entire fstests cloud run last night, so there's no obvious problem
> > > there.  The buffered write path takes ILOCK_EXCL and keeps it right up
> > > to where the debug splat happens, so there's no locking problem there.
> > > 
> > > So I started looking for things that could shift the extent count by one.
> > > Looking for semi-adjacent records, I noticed this:
> > > 
> > > nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> > > nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> > > 
> > > 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone
> > > punching 0x10 blocks.
> > > 
> > > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> > > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> > > 
> > > The incorrect delalloc reservation notwithstanding, these two records
> > > are logically and physically adjacent, with the only difference being
> > > that one is unwritten and the other is not.  Someone could have
> > > converted an unwritten extent to written, possibly as a result of a
> > > post-write conversion?
> > > 
> > > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> > > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> > > 
> > > The 0xc70ded here is familiar -- I wonder if this got mapped here as a
> > > result of an FIEXCHANGE'd with offset 0x140?  Extent 9 is adjacent with
> > > extent 8, except for the state difference.  Hmm.  I guess I better go
> > > check the FIEXCHANGE code...
> > 
> > And there's no such thing a FIEXCHANGE in the upstream current code
> > base, so that would be a good place to look for shenanigans.... :)
> 
> Hm.  I looked over the code for that, and it takes ILOCK_EXCL on both
> inodes involved in the file contents exchange, and holds it past the
> final transaction commit + defer chain processing.  So I don't think
> it's there.

Also, last night's run produced this:

ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
    ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
    ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
    ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0
ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
    ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
    ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
    ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
    ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
    ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
    ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
    ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
    ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00

The old/fresh/nowgot have the same meaning as yesterday.  "crapgot" is
the results of duplicating the cursor at the start of the body of
xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off.
I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
didn't find anything, so we filled in imap with "fake hole until the
end".  At the time of the first lookup, I suspect that there's only one
32-block unwritten extent in the mapping (hence oldicurpos==1) but by
the time we get to recording crapgot, crapicurpos==2.

IOWS, I think I can safely eliminate FIEXCHANGE shenanigans and
concentrate on finding an unlocked unwritten -> written extent
conversion.  Or possibly a written -> unwritten extent conversion?

Anyway, long holiday weekend, so I won't get back to this until Monday.
Just wanted to persist my notes to the mailing list so I can move on to
testing the write race fixes with djwong-dev.

--D

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-24 16:44         ` Darrick J. Wong
@ 2022-11-24 20:44           ` Dave Chinner
  2022-11-27 18:45             ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2022-11-24 20:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Nov 24, 2022 at 08:44:39AM -0800, Darrick J. Wong wrote:
> Also, last night's run produced this:
> 
> ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
>     ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
>     ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
>     ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0
> ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
>     ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
>     ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
>     ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
>     ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
>     ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
>     ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
>     ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
>     ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00
> 
> The old/fresh/nowgot have the same meaning as yesterday.  "crapgot" is
> the results of duplicating the cursor at the start of the body of
> xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off.
> I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
> didn't find anything, so we filled in imap with "fake hole until the
> end".  At the time of the first lookup, I suspect that there's only one
> 32-block unwritten extent in the mapping (hence oldicurpos==1) but by
> the time we get to recording crapgot, crapicurpos==2.

Ok, that's much simpler to reason about, and implies the smoke is
coming from xfs_buffered_write_iomap_begin() or
xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
of stuff with the ILOCK_EXCL held.....

.... including calling xfs_qm_dqattach_locked().

xfs_buffered_write_iomap_begin
  ILOCK_EXCL
  look up icur
  xfs_qm_dqattach_locked
    xfs_qm_dqattach_one
      xfs_qm_dqget_inode
        dquot cache miss
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
        xfs_ilock(ip, XFS_ILOCK_EXCL);
  ....
  xfs_bmapi_reserve_delalloc(icur)

Yup, that's what is letting the magic smoke out -
xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
goes downhill from there.

> IOWS, I think I can safely eliminate FIEXCHANGE shenanigans and
> concentrate on finding an unlocked unwritten -> written extent
> conversion.  Or possibly a written -> unwritten extent conversion?
> 
> Anyway, long holiday weekend, so I won't get back to this until Monday.
> Just wanted to persist my notes to the mailing list so I can move on to
> testing the write race fixes with djwong-dev.

And I'm on PTO for the next couple of working days, too, so I'm not
going to write a patch for it right now, either.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: moar weird metadata corruptions, this time on arm64
  2022-11-24 20:44           ` Dave Chinner
@ 2022-11-27 18:45             ` Darrick J. Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2022-11-27 18:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Nov 25, 2022 at 07:44:55AM +1100, Dave Chinner wrote:
> On Thu, Nov 24, 2022 at 08:44:39AM -0800, Darrick J. Wong wrote:
> > Also, last night's run produced this:
> > 
> > ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
> >     ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
> >     ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
> >     ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0
> > ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
> >     ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
> >     ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
> >     ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
> >     ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
> >     ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
> >     ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
> >     ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
> >     ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00
> > 
> > The old/fresh/nowgot have the same meaning as yesterday.  "crapgot" is
> > the results of duplicating the cursor at the start of the body of
> > xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off.
> > I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
> > didn't find anything, so we filled in imap with "fake hole until the
> > end".  At the time of the first lookup, I suspect that there's only one
> > 32-block unwritten extent in the mapping (hence oldicurpos==1) but by
> > the time we get to recording crapgot, crapicurpos==2.
> 
> Ok, that's much simpler to reason about, and implies the smoke is
> coming from xfs_buffered_write_iomap_begin() or
> xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
> of stuff with the ILOCK_EXCL held.....
> 
> .... including calling xfs_qm_dqattach_locked().
> 
> xfs_buffered_write_iomap_begin
>   ILOCK_EXCL
>   look up icur
>   xfs_qm_dqattach_locked
>     xfs_qm_dqattach_one
>       xfs_qm_dqget_inode
>         dquot cache miss
>         xfs_iunlock(ip, XFS_ILOCK_EXCL);
>         error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
>         xfs_ilock(ip, XFS_ILOCK_EXCL);
>   ....
>   xfs_bmapi_reserve_delalloc(icur)
> 
> Yup, that's what is letting the magic smoke out -
> xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
> can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
> goes downhill from there.
> 
> > IOWS, I think I can safely eliminate FIEXCHANGE shenanigans and
> > concentrate on finding an unlocked unwritten -> written extent
> > conversion.  Or possibly a written -> unwritten extent conversion?
> > 
> > Anyway, long holiday weekend, so I won't get back to this until Monday.
> > Just wanted to persist my notes to the mailing list so I can move on to
> > testing the write race fixes with djwong-dev.
> 
> And I'm on PTO for the next couple of working days, too, so I'm not
> going to write a patch for it right now, either.

No worries, I made a simple patch and pasted in the email discussion.
After three days of rerunning testing I haven't seen any of the
backtraces that I reported in this thread.  Thanks for your help!

Patch:
https://lore.kernel.org/linux-xfs/Y4OuLTwPVdiHMBGi@magnolia/T/#u

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-11-27 18:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-22  0:16 moar weird metadata corruptions, this time on arm64 Darrick J. Wong
2022-11-22  1:58 ` Dave Chinner
2022-11-23 20:00   ` Darrick J. Wong
2022-11-24  4:40     ` Dave Chinner
2022-11-24  6:46       ` Darrick J. Wong
2022-11-24 16:44         ` Darrick J. Wong
2022-11-24 20:44           ` Dave Chinner
2022-11-27 18:45             ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.