* Regular FS shutdown while rsync is running
@ 2018-11-26 13:29 Lucas Stach
  2018-11-26 15:32 ` Brian Foster
From: Lucas Stach @ 2018-11-26 13:29 UTC (permalink / raw)
  To: linux-xfs

Hi all,

we have an XFS filesystem which is used to back up data from other
servers via rsync. While rsync is running, the filesystem regularly
shuts itself down. Sometimes the FS lasts for 3 days before the
shutdown happens, sometimes it's after only 3 hours.

The issue was spotted on a 4.17 kernel, but I'm not sure that's the
first version with this issue, as that was the kernel version used when
setting up the backup system. We also checked that the issue is still
present in a 4.19.4 vanilla kernel; a backtrace from that kernel
version is provided below.

The block stack below XFS is LVM on a software RAID10.

I can provide more information as needed, but I'm not sure at this
point which information would be appropriate to further debug this
issue. Any pointers appreciated.

Please keep me in CC, as I'm not subscribed to the XFS list.

Regards,
Lucas



[50013.593883] XFS (dm-2): Internal error xfs_btree_check_sblock at line 179 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_lastrec+0x41/0x90 [xfs]
[50013.594365] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
[50013.594365] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
[50013.594366] Call Trace:
[50013.594372]  dump_stack+0x5c/0x7b
[50013.594391]  xfs_btree_check_sblock+0x5a/0xb0 [xfs]
[50013.594409]  xfs_btree_lastrec+0x41/0x90 [xfs]
[50013.594427]  xfs_btree_delrec+0xa48/0xdf0 [xfs]
[50013.594446]  ? xfs_inobt_get_maxrecs+0x20/0x20 [xfs]
[50013.594463]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
[50013.594480]  xfs_btree_delete+0x43/0x110 [xfs]
[50013.594500]  xfs_dialloc_ag+0x16a/0x290 [xfs]
[50013.594540]  xfs_dialloc+0x5b/0x270 [xfs]
[50013.594562]  xfs_ialloc+0x6c/0x5b0 [xfs]
[50013.594583]  xfs_dir_ialloc+0x68/0x1d0 [xfs]
[50013.594615]  xfs_create+0x3df/0x5e0 [xfs]
[50013.594633]  xfs_generic_create+0x241/0x2e0 [xfs]
[50013.594636]  vfs_mkdir+0x10c/0x1a0
[50013.594638]  do_mkdirat+0xd3/0x110
[50013.594641]  do_syscall_64+0x55/0xf0
[50013.594644]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[50013.594645] RIP: 0033:0x7fda0e896447
[50013.594647] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
[50013.594647] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
[50013.594649] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
[50013.594649] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
[50013.594650] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
[50013.594661] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[50013.594662] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
[50013.594681] XFS (dm-2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3889 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_delete+0x43/0x110 [xfs]
[50013.595138] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
[50013.595138] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
[50013.595138] Call Trace:
[50013.595141]  dump_stack+0x5c/0x7b
[50013.595157]  xfs_btree_delrec+0xc1c/0xdf0 [xfs]
[50013.595174]  ? xfs_inobt_get_maxrecs+0x20/0x20 [xfs]
[50013.595188]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
[50013.595203]  xfs_btree_delete+0x43/0x110 [xfs]
[50013.595220]  xfs_dialloc_ag+0x16a/0x290 [xfs]
[50013.595236]  xfs_dialloc+0x5b/0x270 [xfs]
[50013.595255]  xfs_ialloc+0x6c/0x5b0 [xfs]
[50013.595274]  xfs_dir_ialloc+0x68/0x1d0 [xfs]
[50013.595291]  xfs_create+0x3df/0x5e0 [xfs]
[50013.595309]  xfs_generic_create+0x241/0x2e0 [xfs]
[50013.595311]  vfs_mkdir+0x10c/0x1a0
[50013.595313]  do_mkdirat+0xd3/0x110
[50013.595314]  do_syscall_64+0x55/0xf0
[50013.595316]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[50013.595317] RIP: 0033:0x7fda0e896447
[50013.595318] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
[50013.595319] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
[50013.595320] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
[50013.595320] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
[50013.595321] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
[50013.595322] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[50013.595322] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
[50013.595365] XFS (dm-2): Internal error xfs_trans_cancel at line 1041 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x3f1/0x5e0 [xfs]
[50013.595750] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
[50013.595751] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
[50013.595751] Call Trace:
[50013.595753]  dump_stack+0x5c/0x7b
[50013.595774]  xfs_trans_cancel+0x133/0x160 [xfs]
[50013.595793]  xfs_create+0x3f1/0x5e0 [xfs]
[50013.595812]  xfs_generic_create+0x241/0x2e0 [xfs]
[50013.595814]  vfs_mkdir+0x10c/0x1a0
[50013.595815]  do_mkdirat+0xd3/0x110
[50013.595817]  do_syscall_64+0x55/0xf0
[50013.595819]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[50013.595820] RIP: 0033:0x7fda0e896447
[50013.595820] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
[50013.595821] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
[50013.595822] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
[50013.595823] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
[50013.595824] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
[50013.595824] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[50013.595825] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
[50013.595827] XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 1042 of file fs/xfs/xfs_trans.c.  Return address = 00000000fd4adc21
[50013.667040] XFS (dm-2): Corruption of in-memory data detected.  Shutting down filesystem
[50013.667303] XFS (dm-2): Please umount the filesystem and rectify the problem(s)
[50013.671838] XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[50013.671842] XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 3403 of file fs/xfs/xfs_inode.c.  Return address = 0000000062cd5dba


* Re: Regular FS shutdown while rsync is running
  2018-11-26 13:29 Regular FS shutdown while rsync is running Lucas Stach
@ 2018-11-26 15:32 ` Brian Foster
  2019-01-21 10:41   ` Lucas Stach
From: Brian Foster @ 2018-11-26 15:32 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Mon, Nov 26, 2018 at 02:29:49PM +0100, Lucas Stach wrote:
> Hi all,
> 
> we have an XFS filesystem which is used to back up data from other
> servers via rsync. While rsync is running, the filesystem regularly
> shuts itself down. Sometimes the FS lasts for 3 days before the
> shutdown happens, sometimes it's after only 3 hours.
> 
> The issue was spotted on a 4.17 kernel, but I'm not sure that's the
> first version with this issue, as that was the kernel version used when
> setting up the backup system. We also checked that the issue is still
> present in a 4.19.4 vanilla kernel; a backtrace from that kernel
> version is provided below.
> 
> The block stack below XFS is LVM on a software RAID10.
> 
> I can provide more information as needed, but I'm not sure at this
> point which information would be appropriate to further debug this
> issue. Any pointers appreciated.
> 
> Please keep me in CC, as I'm not subscribed to the XFS list.
> 
> Regards,
> Lucas
> 
> 
> 
> [50013.593883] XFS (dm-2): Internal error xfs_btree_check_sblock at line 179 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_lastrec+0x41/0x90 [xfs]
> [50013.594365] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
> [50013.594365] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
> [50013.594366] Call Trace:
> [50013.594372]  dump_stack+0x5c/0x7b
> [50013.594391]  xfs_btree_check_sblock+0x5a/0xb0 [xfs]
> [50013.594409]  xfs_btree_lastrec+0x41/0x90 [xfs]
> [50013.594427]  xfs_btree_delrec+0xa48/0xdf0 [xfs]
> [50013.594446]  ? xfs_inobt_get_maxrecs+0x20/0x20 [xfs]
> [50013.594463]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
> [50013.594480]  xfs_btree_delete+0x43/0x110 [xfs]
> [50013.594500]  xfs_dialloc_ag+0x16a/0x290 [xfs]
> [50013.594540]  xfs_dialloc+0x5b/0x270 [xfs]
> [50013.594562]  xfs_ialloc+0x6c/0x5b0 [xfs]
> [50013.594583]  xfs_dir_ialloc+0x68/0x1d0 [xfs]
> [50013.594615]  xfs_create+0x3df/0x5e0 [xfs]
> [50013.594633]  xfs_generic_create+0x241/0x2e0 [xfs]
> [50013.594636]  vfs_mkdir+0x10c/0x1a0
> [50013.594638]  do_mkdirat+0xd3/0x110
> [50013.594641]  do_syscall_64+0x55/0xf0
> [50013.594644]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [50013.594645] RIP: 0033:0x7fda0e896447
> [50013.594647] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
> [50013.594647] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
> [50013.594649] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
> [50013.594649] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
> [50013.594650] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
> [50013.594661] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> [50013.594662] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff

This looks like a free inode btree corruption of some sort. The error
report doesn't tell us exactly what's wrong with the btree block, but we
can surmise finobt from the record deletion call via the inode
allocation path.

Can you provide xfs_info for the fs and details of your storage, CPU and
RAM configuration? Also, what typically happens after this crash? Can
the filesystem mount again or does log recovery fail and require a
repair? If the filesystem mounts, have you tried running a
non-destructive 'xfs_repair -n <dev>' after log recovery to check for
any latent problems? Would you be able to provide an xfs_metadump image
of this filesystem for closer inspection?
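
For anyone following along, a rough sketch of the commands being asked
for here; the device and mount point are placeholders rather than the
exact paths on this system:

  # geometry of the mounted filesystem
  xfs_info <mountpoint>

  # read-only check, after mounting and unmounting once so log recovery runs
  xfs_repair -n <device>

  # metadata-only image for offline inspection; -g prints progress and
  # file names are obfuscated by default
  xfs_metadump -g <device> md.img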

Can you characterize the rsync workload in any more detail? For example,
are files added/removed across runs (perhaps whatever rsync flags are
used would help)? Are files of consistent or varying size/content? Are
many rsync processes running in parallel? Generally, anything you can
describe that might help recreate this problem with a simulated workload
is potentially useful.

Brian

> [50013.594681] XFS (dm-2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3889 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_delete+0x43/0x110 [xfs]
> [50013.595138] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
> [50013.595138] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
> [50013.595138] Call Trace:
> [50013.595141]  dump_stack+0x5c/0x7b
> [50013.595157]  xfs_btree_delrec+0xc1c/0xdf0 [xfs]
> [50013.595174]  ? xfs_inobt_get_maxrecs+0x20/0x20 [xfs]
> [50013.595188]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
> [50013.595203]  xfs_btree_delete+0x43/0x110 [xfs]
> [50013.595220]  xfs_dialloc_ag+0x16a/0x290 [xfs]
> [50013.595236]  xfs_dialloc+0x5b/0x270 [xfs]
> [50013.595255]  xfs_ialloc+0x6c/0x5b0 [xfs]
> [50013.595274]  xfs_dir_ialloc+0x68/0x1d0 [xfs]
> [50013.595291]  xfs_create+0x3df/0x5e0 [xfs]
> [50013.595309]  xfs_generic_create+0x241/0x2e0 [xfs]
> [50013.595311]  vfs_mkdir+0x10c/0x1a0
> [50013.595313]  do_mkdirat+0xd3/0x110
> [50013.595314]  do_syscall_64+0x55/0xf0
> [50013.595316]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [50013.595317] RIP: 0033:0x7fda0e896447
> [50013.595318] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
> [50013.595319] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
> [50013.595320] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
> [50013.595320] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
> [50013.595321] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
> [50013.595322] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> [50013.595322] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
> [50013.595365] XFS (dm-2): Internal error xfs_trans_cancel at line 1041 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x3f1/0x5e0 [xfs]
> [50013.595750] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
> [50013.595751] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
> [50013.595751] Call Trace:
> [50013.595753]  dump_stack+0x5c/0x7b
> [50013.595774]  xfs_trans_cancel+0x133/0x160 [xfs]
> [50013.595793]  xfs_create+0x3f1/0x5e0 [xfs]
> [50013.595812]  xfs_generic_create+0x241/0x2e0 [xfs]
> [50013.595814]  vfs_mkdir+0x10c/0x1a0
> [50013.595815]  do_mkdirat+0xd3/0x110
> [50013.595817]  do_syscall_64+0x55/0xf0
> [50013.595819]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [50013.595820] RIP: 0033:0x7fda0e896447
> [50013.595820] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
> [50013.595821] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
> [50013.595822] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
> [50013.595823] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
> [50013.595824] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
> [50013.595824] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> [50013.595825] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
> [50013.595827] XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 1042 of file fs/xfs/xfs_trans.c.  Return address = 00000000fd4adc21
> [50013.667040] XFS (dm-2): Corruption of in-memory data detected.  Shutting down filesystem
> [50013.667303] XFS (dm-2): Please umount the filesystem and rectify the problem(s)
> [50013.671838] XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
> [50013.671842] XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 3403 of file fs/xfs/xfs_inode.c.  Return address = 0000000062cd5dba


* Re: Regular FS shutdown while rsync is running
  2018-11-26 15:32 ` Brian Foster
@ 2019-01-21 10:41   ` Lucas Stach
  2019-01-21 13:01     ` Brian Foster
  2019-01-21 21:18     ` Dave Chinner
From: Lucas Stach @ 2019-01-21 10:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Hi Brian,

sorry for taking so long to respond. It's been hard to gather good
debug data here.

On Monday, 26.11.2018 at 10:32 -0500, Brian Foster wrote:
> On Mon, Nov 26, 2018 at 02:29:49PM +0100, Lucas Stach wrote:
> > Hi all,
> > 
> > we have an XFS filesystem which is used to back up data from other
> > servers via rsync. While rsync is running, the filesystem regularly
> > shuts itself down. Sometimes the FS lasts for 3 days before the
> > shutdown happens, sometimes it's after only 3 hours.
> > 
> > The issue was spotted on a 4.17 kernel, but I'm not sure that's the
> > first version with this issue, as that was the kernel version used when
> > setting up the backup system. We also checked that the issue is still
> > present in a 4.19.4 vanilla kernel; a backtrace from that kernel
> > version is provided below.
> > 
> > The block stack below XFS is LVM on a software RAID10.
> > 
> > I can provide more information as needed, but I'm not sure at this
> > point which information would be appropriate to further debug this
> > issue. Any pointers appreciated.
> > 
> > Please keep me in CC, as I'm not subscribed to the XFS list.
> > 
> > Regards,
> > Lucas
> > 
> > 
> > 
> > [50013.593883] XFS (dm-2): Internal error xfs_btree_check_sblock at line 179 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_lastrec+0x41/0x90 [xfs]
> > [50013.594365] CPU: 4 PID: 31839 Comm: rsync Not tainted 4.19.4-holodeck10 #1
> > [50013.594365] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
> > [50013.594366] Call Trace:
> > [50013.594372]  dump_stack+0x5c/0x7b
> > [50013.594391]  xfs_btree_check_sblock+0x5a/0xb0 [xfs]
> > [50013.594409]  xfs_btree_lastrec+0x41/0x90 [xfs]
> > [50013.594427]  xfs_btree_delrec+0xa48/0xdf0 [xfs]
> > [50013.594446]  ? xfs_inobt_get_maxrecs+0x20/0x20 [xfs]
> > [50013.594463]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
> > [50013.594480]  xfs_btree_delete+0x43/0x110 [xfs]
> > [50013.594500]  xfs_dialloc_ag+0x16a/0x290 [xfs]
> > [50013.594540]  xfs_dialloc+0x5b/0x270 [xfs]
> > [50013.594562]  xfs_ialloc+0x6c/0x5b0 [xfs]
> > [50013.594583]  xfs_dir_ialloc+0x68/0x1d0 [xfs]
> > [50013.594615]  xfs_create+0x3df/0x5e0 [xfs]
> > [50013.594633]  xfs_generic_create+0x241/0x2e0 [xfs]
> > [50013.594636]  vfs_mkdir+0x10c/0x1a0
> > [50013.594638]  do_mkdirat+0xd3/0x110
> > [50013.594641]  do_syscall_64+0x55/0xf0
> > [50013.594644]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [50013.594645] RIP: 0033:0x7fda0e896447
> > [50013.594647] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
> > [50013.594647] RSP: 002b:00007ffdedc1a118 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
> > [50013.594649] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda0e896447
> > [50013.594649] RDX: 0000000000000008 RSI: 00000000000001c0 RDI: 00007ffdedc1c470
> > [50013.594650] RBP: 0000560fd94b5c70 R08: 0000000000000080 R09: 00000000ffffffff
> > [50013.594661] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> > [50013.594662] R13: 0000000000000000 R14: 00007ffdedc1c470 R15: 00000000ffffffff
> 
> This looks like a free inode btree corruption of some sort. The error
> report doesn't tell us exactly what's wrong with the btree block, but we
> can surmise finobt from the record deletion call via the inode
> allocation path.
> 
> Can you provide xfs_info for the fs and details of your storage, CPU and
> RAM configuration?

root@XXX:~# xfs_info /srv/
meta-data=/dev/mapper/XXX-backup isize=512    agcount=33, agsize=183123968 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=5860389888, imaxpct=15
         =                       sunit=16     swidth=48 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The storage is LVM on top of md raid10 across 8 disks.

The system has an Intel Xeon E5-2630 CPU with 64GB of ECC RAM and
reports no RAM errors.

>  Also, what typically happens after this crash? Can
> the filesystem mount again or does log recovery fail and require a
> repair? If the filesystem mounts, have you tried running a
> non-destructive 'xfs_repair -n <dev>' after log recovery to check for
> any latent problems?

We are able to unmount and mount the filesystem again. I don't think
there has been a non-destructive run of xfs_repair yet, just a regular
one. Would a non-destructive run make more sense?

root@XXX:/mnt/metadump# xfs_repair  /dev/XXX
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
bad magic # 0x49414233 in inobt block 5/7831662
        - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 23:06:50: scanning agi unlinked lists - 33 of 33 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0 
        - agno = 15
        - agno = 30                                                                              
        - agno = 31                         
        - agno = 32                                                                  
        - agno = 16                    
        - agno = 1                           
        - agno = 17                                                                   
        - agno = 2                                     
        - agno = 18
        - agno = 3 
        - agno = 19
        - agno = 4 
        - agno = 5 
        - agno = 20
        - agno = 21
        - agno = 6 
        - agno = 22
        - agno = 7 
        - agno = 23
        - agno = 8 
        - agno = 24
        - agno = 9
        - agno = 25
        - agno = 26
        - agno = 10
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 05:28:46: process known inodes and inode discovery - 402627456 of 402627456 inodes done
        - process newly discovered inodes...
        - 05:28:46: process newly discovered inodes - 33 of 33 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 05:28:48: setting up duplicate extent list - 33 of 33 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0 
        - agno = 15
        - agno = 30                                                                                
        - agno = 31                      
        - agno = 32                                                               
        - agno = 1           
        - agno = 16                  
        - agno = 17                                               
        - agno = 2                 
        - agno = 18             
        - agno = 3                                    
        - agno = 19                        
        - agno = 4                                                                  
        - agno = 5
        - agno = 20
        - agno = 6  
        - agno = 21
        - agno = 7
        - agno = 22           
        - agno = 8
        - agno = 23
        - agno = 9
        - agno = 24
        - agno = 25
        - agno = 10
        - agno = 26
        - agno = 27
        - agno = 11
        - agno = 28
        - agno = 29
        - agno = 12
        - agno = 13
        - agno = 14
        - 17:57:31: check for inodes claiming duplicate blocks - 402627456 of 402627456 inodes done
Phase 5 - rebuild AG headers and trees...
        - 17:57:47: rebuild AG headers and trees - 33 of 33 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
done

>  Would you be able to provide an xfs_metadump image
> of this filesystem for closer inspection?

This filesystem is really metadata heavy, so an xfs_metadump ended up
being around 400GB of data. Not sure if this is something you would be
willing to look into?

> Can you characterize the rsync workload in any more detail? For example,
> are files added/removed across runs (perhaps whatever rsync flags are
> used would help)? Are files of consistent or varying size/content? Are
> many rsync processes running in parallel? Generally, anything you can
> describe that might help recreate this problem with a simulated workload
> is potentially useful.

There are typically about 2 or 3 rsync processes running in parallel.
The workload syncs a lot of small files and uses hardlinks to avoid
storing files that didn't change.
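
We haven't included the exact rsync invocation, but a hardlink-based
run of this kind looks roughly like the following (paths are
placeholders):

  rsync -aH --delete \
        --link-dest=/backup/<host>/<previous-run> \
        <host>:/data/ /backup/<host>/<current-run>/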

All the corruption issues we've seen so far have the same call trace
starting with a do_mkdirat syscall. We've never seen it happen in other
calls.

As a shot in the dark to aid debugging, we've added the following
patch on top of 4.19.16:

------------------------------------->8--------------------------------
 fs/xfs/libxfs/xfs_btree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 34c6d7bd4d18..5771df0db414 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -176,7 +176,8 @@ xfs_btree_check_sblock(
 			XFS_ERRTAG_BTREE_CHECK_SBLOCK))) {
 		if (bp)
 			trace_xfs_btree_corrupt(bp, _RET_IP_);
-		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, mp);
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW,
+				     mp, block, sizeof(*block));
 		return -EFSCORRUPTED;
 	}
 	return 0;
------------------------------------->8--------------------------------

With that applied we were able to collect the following dump:

[215922.475666] 00000000d471c70c: 49 41 42 33 00 01 01 0c 00 07 58 ab ff ff ff ff  IAB3......X.....
[215922.475954] 000000001be9cc59: 00 00 00 01 b5 db 40 b8 00 00 00 00 00 00 00 00  ......@.........
[215922.476241] 000000005b12f32e: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
[215922.476556] 00000000e6e67a65: 00 00 00 05 2a bb 5c 47 36 85 a2 40 36 8e f9 00  ....*.\G6..@6...
[215922.476840] 00000000e4807aa6: 36 96 3e 80 36 a4 61 80                          6.>.6.a.
[215922.477172] XFS (dm-3): Internal error xfs_btree_check_sblock at line 180 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_lastrec+0x41/0x90 [xfs]
[215922.477617] CPU: 8 PID: 23979 Comm: rsync Not tainted 4.19.16-XXX+ #20190118.1
[215922.477619] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.6.0 10/26/2017
[215922.477620] Call Trace:
[215922.477628]  dump_stack+0x5c/0x7b
[215922.477685]  xfs_corruption_error+0x83/0x90 [xfs]
[215922.477731]  ? xfs_btree_lastrec+0x41/0x90 [xfs]
[215922.477776]  ? xfs_btree_lastrec+0x41/0x90 [xfs]
[215922.477819]  xfs_btree_check_sblock+0x68/0xd0 [xfs]
[215922.477864]  ? xfs_btree_lastrec+0x41/0x90 [xfs]
[215922.477909]  xfs_btree_lastrec+0x41/0x90 [xfs]
[215922.477954]  xfs_btree_delrec+0xb12/0xfd0 [xfs]
[215922.477998]  ? xfs_lookup_get_search_key+0x49/0x60 [xfs]
[215922.478043]  xfs_btree_delete+0x43/0x110 [xfs]
[215922.478092]  xfs_dialloc_ag+0x19b/0x320 [xfs]
[215922.478147]  xfs_dialloc+0x5b/0x290 [xfs]
[215922.478204]  xfs_ialloc+0x6c/0x650 [xfs]
[215922.478264]  ? xlog_grant_head_check+0x6c/0x170 [xfs]
[215922.478319]  xfs_dir_ialloc+0x87/0x250 [xfs]
[215922.478372]  xfs_create+0x3df/0x610 [xfs]
[215922.478424]  xfs_generic_create+0x241/0x2e0 [xfs]
[215922.478429]  ? lookup_dcache+0x17/0x60
[215922.478433]  vfs_mkdir+0x10c/0x1a0
[215922.478437]  do_mkdirat+0xd3/0x110
[215922.478443]  do_syscall_64+0x55/0xf0
[215922.478450]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[215922.478453] RIP: 0033:0x7f2a4ff4c447
[215922.478456] Code: 00 b8 ff ff ff ff c3 0f 1f 40 00 48 8b 05 49 da 2b 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 da 2b 00 f7 d8 64 89 01 48
[215922.478458] RSP: 002b:00007fffd0f80048 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
[215922.478461] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2a4ff4c447
[215922.478463] RDX: 000000000000c818 RSI: 00000000000001c0 RDI: 00007fffd0f823a0
[215922.478465] RBP: 00007f2a509f4e70 R08: 0000000000000006 R09: 00000000ffffffff
[215922.478466] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[215922.478468] R13: 0000000000000000 R14: 00007fffd0f823a0 R15: 00000000ffffffff
[215922.478472] XFS (dm-3): Corruption detected. Unmount and run xfs_repair

Does this make any sense to you? Aside from the xfs_metadump, is there
any other debug info we could provide to help you figure this out?

Regards,
Lucas


* Re: Regular FS shutdown while rsync is running
  2019-01-21 10:41   ` Lucas Stach
@ 2019-01-21 13:01     ` Brian Foster
  2019-01-21 16:23       ` Lucas Stach
  2019-01-21 21:18     ` Dave Chinner
From: Brian Foster @ 2019-01-21 13:01 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Mon, Jan 21, 2019 at 11:41:57AM +0100, Lucas Stach wrote:
> Hi Brian,
> 
> sorry for taking so long to respond. It's been hard to gather good
> debug data here.
> 
> On Monday, 26.11.2018 at 10:32 -0500, Brian Foster wrote:
> > On Mon, Nov 26, 2018 at 02:29:49PM +0100, Lucas Stach wrote:
...
> > 
> > Can you provide xfs_info for the fs and details of your storage, CPU and
> > RAM configuration?
> 
> root@XXX:~# xfs_info /srv/
> meta-data=/dev/mapper/XXX-backup isize=512    agcount=33, agsize=183123968 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=5860389888, imaxpct=15
>          =                       sunit=16     swidth=48 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The storage is LVM on top of md raid10 across 8 disks.
> 
> The system has an Intel Xeon E5-2630 CPU with 64GB of ECC RAM and
> reports no RAM errors.
> 

Thanks, looks fairly standard to me.

> >  Also, what typically happens after this crash? Can
> > the filesystem mount again or does log recovery fail and require a
> > repair? If the filesystem mounts, have you tried running a
> > non-destructive 'xfs_repair -n <dev>' after log recovery to check for
> > any latent problems?
> 
> We are able to unmount and mount the filesystem again. I don't think
> there has been a non-destructive run of xfs_repair yet, just a regular
> one. Would a non-destructive run make more sense?
> 

Not really, I just suggest that in general because 1. xfs_repair mutates
the fs even when no corruption is found and 2. we don't always know how
xfs_repair is going to react to a corrupted fs. I generally prefer to
run 'xfs_repair -n' to find out what the repair will expect to do when
run without -n, how long it might take, etc.

> root@XXX:/mnt/metadump# xfs_repair  /dev/XXX
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
> bad magic # 0x49414233 in inobt block 5/7831662

Hmm, so this looks like a very isolated corruption. It's complaining
about a magic number (internal filesystem value stamped to metadata
blocks for verification/sanity purposes) being wrong on a finobt block
and nothing else seems to be wrong in the fs. (I guess repair should
probably print out 'finobt' here instead of 'inobt,' but that's a
separate issue..).

The finobt uses a magic value of XFS_FIBT_CRC_MAGIC (0x46494233, 'FIB3')
whereas this block has a magic value of 0x49414233. The latter is
'IAB3' (XFS_IBT_CRC_MAGIC), which is the magic value for regular inode
btree blocks.
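
As a quick sanity check of that decoding (using bash's printf):

  $ printf '\x49\x41\x42\x33\n'   # XFS_IBT_CRC_MAGIC, inode btree
  IAB3
  $ printf '\x46\x49\x42\x33\n'   # XFS_FIBT_CRC_MAGIC, free inode btree
  FIB3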

>         - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
>         - found root inode chunk
> Phase 3 - for each AG...
...
> Phase 7 - verify and correct link counts...
>         - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
> done
> 
> >  Would you be able to provide an xfs_metadump image
> > of this filesystem for closer inspection?
> 
> This filesystem is really metadata heavy, so an xfs_metadump ended up
> being around 400GB of data. Not sure if this is something you would be
> willing to look into?
> 

Ok, it might be difficult to get ahold of that. Does the image happen to
compress well?

In the meantime, given that the corruption appears to be so isolated you
might be able to provide enough information from the metadump without
having to transfer it. The first thing is probably to take a look at the
block in question..

First, restore the metadump somewhere:

xfs_mdrestore -g ./md.img <destination>

You'll need somewhere with enough space for that 400G or so. Note that
you can restore to a file and mount/inspect that file as if it were the
original fs. I'd also mount/unmount the restored metadump and run an
'xfs_repair -n' on it just to double check that the corruption was
captured properly and there are no other issues with the metadump. -n is
important here as otherwise repair will fix the metadump and remove the
corruption.
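
Concretely, that might look something like the following (restoring to
a file; paths are placeholders):

  xfs_mdrestore -g ./md.img /scratch/md.restored
  # mount and unmount once so log recovery runs, then check read-only
  mount -o loop /scratch/md.restored /mnt/check && umount /mnt/check
  xfs_repair -n /scratch/md.restored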

Next, use xfs_db to dump the contents of the suspect block. Run 'xfs_db
<metadump image>' to open the fs and try the following sequence of
commands.

- Convert to a global fsb: 'convert agno 5 agbno 7831662 fsb'
- Jump to the fsb: 'fsb <output of prev cmd>'
- Set the block type: 'type finobt'
- Print the block: 'print'

... and copy/paste the output.
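
The same sequence can also be run non-interactively with xfs_db's -c
option against the restored image (read-only with -r), e.g.:

  xfs_db -r -c 'convert agno 5 agbno 7831662 fsb' /scratch/md.restored
  # then plug the printed fsb value into:
  xfs_db -r -c 'fsb <fsb-from-above>' -c 'type finobt' -c 'print' /scratch/md.restored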

> > Can you characterize the rsync workload in any more detail? For example,
> > are files added/removed across runs (perhaps whatever rsync flags are
> > used would help)? Are files of consistent or varying size/content? Are
> > many rsync processes running in parallel? Generally, anything you can
> > describe that might help recreate this problem with a simulated workload
> > is potentially useful.
> 
> There are typically about 2 or 3 rsync processes running in parallel.
> The workload syncs a lot of small files and uses hardlinks to avoid
> storing files that didn't change.
> 
> All the corruption issues we've seen so far have the same call trace
> starting with a do_mkdirat syscall. We've never seen it happen in other
> calls.
> 
> As a shot in the dark to aid debugging, we've added the following
> patch on top of 4.19.16:
> 
...
> 
> With that applied we were able to collect the following dump:
> 
> [215922.475666] 00000000d471c70c: 49 41 42 33 00 01 01 0c 00 07 58 ab ff ff ff ff  IAB3......X.....
> [215922.475954] 000000001be9cc59: 00 00 00 01 b5 db 40 b8 00 00 00 00 00 00 00 00  ......@.........
> [215922.476241] 000000005b12f32e: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
> [215922.476556] 00000000e6e67a65: 00 00 00 05 2a bb 5c 47 36 85 a2 40 36 8e f9 00  ....*.\G6..@6...
> [215922.476840] 00000000e4807aa6: 36 96 3e 80 36 a4 61 80                          6.>.6.a.
> [215922.477172] XFS (dm-3): Internal error xfs_btree_check_sblock at line 180 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_lastrec+0x41/0x90 [xfs]
...
> 
> Does this make any sense to you? Aside from the xfs_metadump, is there
> any other debug info we could provide to help you figure this out?
> 

This pretty much replicates what repair told us. Perhaps there are more
debug checks we could add to help narrow down the cause of this problem,
but I'd start with getting a look at the block as described above.

Brian

> Regards,
> Lucas
> 


* Re: Regular FS shutdown while rsync is running
  2019-01-21 13:01     ` Brian Foster
@ 2019-01-21 16:23       ` Lucas Stach
  2019-01-21 18:11         ` Brian Foster
From: Lucas Stach @ 2019-01-21 16:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Monday, 21.01.2019 at 08:01 -0500, Brian Foster wrote:
[...]
> > root@XXX:/mnt/metadump# xfs_repair  /dev/XXX
> > Phase 1 - find and verify superblock...
> >         - reporting progress in intervals of 15 minutes
> > Phase 2 - using internal log
> >         - zero log...
> >         - scan filesystem freespace and inode maps...
> > bad magic # 0x49414233 in inobt block 5/7831662
> 
> Hmm, so this looks like a very isolated corruption. It's complaining
> about a magic number (internal filesystem value stamped to metadata
> blocks for verification/sanity purposes) being wrong on a finobt block
> and nothing else seems to be wrong in the fs. (I guess repair should
> probably print out 'finobt' here instead of 'inobt,' but that's a
> separate issue..).
> 
> The finobt uses a magic value of XFS_FIBT_CRC_MAGIC (0x46494233, 'FIB3')
> whereas this block has a magic value of 0x49414233. The latter is
> 'IAB3' (XFS_IBT_CRC_MAGIC), which is the magic value for regular inode
> btree blocks.
> 
> >         - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
> >         - found root inode chunk
> > Phase 3 - for each AG...
> 
> ...
> > Phase 7 - verify and correct link counts...
> >         - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
> > done
> > 
> > >  Would you be able to provide an xfs_metadump image
> > > of this filesystem for closer inspection?
> > 
> > This filesystem is really metadata heavy, so an xfs_metadump ended up
> > being around 400GB of data. Not sure if this is something you would be
> > willing to look into?
> > 
> 
> Ok, it might be difficult to get ahold of that. Does the image happen to
> compress well?

I'll see how well it compresses, but this might take a while...

> In the meantime, given that the corruption appears to be so isolated you
> might be able to provide enough information from the metadump without
> having to transfer it. The first thing is probably to take a look at the
> block in question..
> 
> First, restore the metadump somewhere:
> 
> xfs_mdrestore -g ./md.img <destination>
> 
> You'll need somewhere with enough space for that 400G or so. Note that
> you can restore to a file and mount/inspect that file as if it were the
> original fs. I'd also mount/unmount the restored metadump and run an
> 'xfs_repair -n' on it just to double check that the corruption was
> captured properly and there are no other issues with the metadump. -n is
> important here as otherwise repair will fix the metadump and remove the
> corruption.
> 
> Next, use xfs_db to dump the contents of the suspect block. Run 'xfs_db
> <metadump image>' to open the fs and try the following sequence of
> commands.
> 
> - Convert to a global fsb: 'convert agno 5 agbno 7831662 fsb'
> - Jump to the fsb: 'fsb <output of prev cmd>'
> - Set the block type: 'type finobt'
> - Print the block: 'print'
> 
> ... and copy/paste the output.

So for the moment, here's the output of the above sequence.

xfs_db> convert agno 5 agbno 7831662 fsb
0x5077806e (1350008942)
xfs_db> fsb 0x5077806e
xfs_db> type finobt
xfs_db> print
magic = 0x49414233
level = 1
numrecs = 335
leftsib = 7810856
rightsib = null
bno = 7387612016
lsn = 0x6671003d9700
uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
owner = 5
crc = 0xe12b19b2 (correct)
keys[1-335] = [startino] 
1:[958766208] 
2:[960203904] 
3:[961707648] 
4:[963003264] 
5:[964566528] 
6:[965274432] 
7:[966643456] 
8:[967151808] 
9:[967937920] 
10:[968462976] 
11:[969040768] 
12:[970681856] 
13:[971156992] 
14:[972031232] 
15:[972772608] 
16:[973479296] 
17:[974016512] 
18:[974475776] 
19:[974999552] 
20:[975452160] 
21:[975836160] 
22:[976584576] 
23:[977213824] 
24:[982036096] 
25:[987294592] 
26:[988030720] 
27:[988549888] 
28:[989288000] 
29:[990621440] 
30:[991238016] 
31:[991789824] 
32:[992320128] 
33:[993190848] 
34:[994163712] 
35:[995337856] 
36:[997400448] 
37:[997834496] 
38:[998261504] 
39:[998809472] 
40:[1000180608] 
41:[1002622976] 
42:[1003701376] 
43:[1004380288] 
44:[1006392384] 
45:[1007378048] 
46:[1008899264] 
47:[1010753920] 
48:[1011690944] 
49:[1012841856] 
50:[1014790016] 
51:[1018967104] 
52:[1021546752] 
53:[1022964608] 
54:[1025270016] 
55:[1026458240] 
56:[1027689088] 
57:[1029101632] 
58:[1030116032] 
59:[1030642176] 
60:[1032072704] 
61:[1032929664] 
62:[1033577728] 
63:[1034240256] 
64:[1035221504] 
65:[1036707584] 
66:[1037945472] 
67:[1039025408] 
68:[1039866752] 
69:[1040149888] 
70:[1040465664] 
71:[1041233024] 
72:[1042081024] 
73:[1042587776] 
74:[1044256448] 
75:[1044670336] 
76:[1045996928] 
77:[1047547904] 
78:[1048474880] 
79:[1049188224] 
80:[1049874432] 
81:[1049919104] 
82:[1050624128] 
83:[1051385088] 
84:[1052172288] 
85:[1052987968] 
86:[1054098688] 
87:[1054993728] 
88:[1055481984] 
89:[1055968320] 
90:[1056760704] 
91:[1057265408] 
92:[1057744896] 
93:[1058394752] 
94:[1058656000] 
95:[1059149824] 
96:[1059732928] 
97:[1061290624] 
98:[1062338624] 
99:[1062885248] 
100:[1063539328] 
101:[1063939200] 
102:[1065515904] 
103:[1067503232] 
104:[1068582144] 
105:[1069002880] 
106:[1069577664] 
107:[1069960128] 
108:[1070688000] 
109:[1071114368] 
110:[1072531328] 
111:[1073282944] 
112:[1074234496] 
113:[1075912320] 
114:[1076997888] 
115:[1077965440] 
116:[1080531584] 
117:[1081580672] 
118:[1082154112] 
119:[1085020288] 
120:[1085956480] 
121:[1087185664] 
122:[1088890368] 
123:[1088965760] 
124:[1094695424] 
125:[1098562560] 
126:[1101387776] 
127:[1101951488] 
128:[1102746752] 
129:[1103568128] 
130:[1105154048] 
131:[1105477632] 
132:[1106437248] 
133:[1108366720] 
134:[1109182336] 
135:[1110930880] 
136:[1113504384] 
137:[1115613056] 
138:[1116307456] 
139:[1118841984] 
140:[1121370752] 
141:[1122338816] 
142:[1122727488] 
143:[1123963712] 
144:[1126139520] 
145:[1126974208] 
146:[1129443648] 
147:[1130809408] 
148:[1132503552] 
149:[1135562240] 
150:[1137630720] 
151:[1139159872] 
152:[1139976256] 
153:[1140282880] 
154:[1141660160] 
155:[1144208320] 
156:[1144265344] 
157:[1144312192] 
158:[1145091072] 
159:[1147749760] 
160:[1148044352] 
161:[1148957952] 
162:[1149682112] 
163:[1150844800] 
164:[1152461568] 
165:[1152877824] 
166:[1154310464] 
167:[1162018240] 
168:[1163022208] 
169:[1178935936] 
170:[1190226752] 
171:[1191987584] 
172:[1199930368] 
173:[1201136576] 
174:[1202296320] 
175:[1202753792] 
176:[1202950656] 
177:[1203254016] 
178:[1203438464] 
179:[1203862016] 
180:[1204373888] 
181:[1205211648] 
182:[1205716992] 
183:[1206647424] 
184:[1207043840] 
185:[1207310976] 
186:[1207687680] 
187:[1208031744] 
188:[1208310912] 
189:[1208601280] 
190:[1208916608] 
191:[1209711104] 
192:[1210376128] 
193:[1211742528] 
194:[1212674304] 
195:[1212952896] 
196:[1213629312] 
197:[1214420480] 
198:[1214658560] 
199:[1214882368] 
200:[1215331264] 
201:[1215529856] 
202:[1215810560] 
203:[1216206592] 
204:[1216486912] 
205:[1216681344] 
206:[1216914816] 
207:[1217146880] 
208:[1217465984] 
209:[1217724544] 
210:[1218293440] 
211:[1218903424] 
212:[1227517312] 
213:[1228301056] 
214:[1228486528] 
215:[1249407040] 
216:[1249770752] 
217:[1250043968] 
218:[1250634048] 
219:[1250931392] 
220:[1251296128] 
221:[1251928448] 
222:[1252114816] 
223:[1252309888] 
224:[1253285504] 
225:[1253446080] 
226:[1254226432] 
227:[1254858624] 
228:[1255042112] 
229:[1255288064] 
230:[1255453120] 
231:[1256473984] 
232:[1256852480] 
233:[1257881088] 
234:[1259645184] 
235:[1260438912] 
236:[1261217280] 
237:[1261577536] 
238:[1262093696] 
239:[1262966688] 
240:[1263311744] 
241:[1264222016] 
242:[1265448192] 
243:[1266591360] 
244:[1268133376] 
245:[1268988928] 
246:[1269833024] 
247:[1271247296] 
248:[1272420224] 
249:[1273266752] 
250:[1274076224] 
251:[1275203072] 
252:[1277416320] 
253:[1278511232] 
254:[1281360640] 
255:[1302046720] 
256:[1309009024] 
257:[1325006208] 
258:[1330860928] 
259:[1332440704] 
260:[1332907520] 
261:[1333996288] 
262:[1334853248] 
263:[1335306240] 
264:[1336120640] 
265:[1336960192] 
266:[1338423040] 
267:[1339761216] 
268:[1341126208] 
269:[1342171904] 
270:[1343213568] 
271:[1350140672] 
272:[1350661184] 
273:[1351507264] 
274:[1352978240] 
275:[1381651968] 
276:[1384242688] 
277:[1385959936] 
278:[1386452096] 
279:[1388349312] 
280:[1389233664] 
281:[1393009344] 
282:[1396065408] 
283:[1396865152] 
284:[1397867520] 
285:[1398936704] 
286:[1400073472] 
287:[1402718912] 
288:[1406069440] 
289:[1416367744] 
290:[1422916608] 
291:[1424121600] 
292:[1424982976] 
293:[1426681920] 
294:[1428046592] 
295:[1428927040] 
296:[1430388736] 
297:[1430967456] 
298:[1431724224] 
299:[1434083328] 
300:[1434882048] 
301:[1435227904] 
302:[1435628160] 
303:[1436132224] 
304:[1436509568] 
305:[1436820160] 
306:[1437206080] 
307:[1437950720] 
308:[1438352896] 
309:[1438715008] 
310:[1439823552] 
311:[1440158976] 
312:[1440351616] 
313:[1440654528] 
314:[1440928384] 
315:[1441604736] 
316:[1441856576] 
317:[1442484736] 
318:[1442851584] 
319:[1445870976] 
320:[1447073152] 
321:[1448021696] 
322:[1448938112] 
323:[1449341824] 
324:[1449566464] 
325:[1449835520] 
326:[1450147584] 
327:[1453411072] 
328:[1456621952] 
329:[1457883264] 
330:[1460340352] 
331:[1461539840] 
332:[1462206336] 
333:[1463179136] 
334:[1463538944] 
335:[1463832064]
ptrs[1-335] = 1:7831434 2:7831454 3:7831469 4:7831515 5:7831581 6:7831612 7:7831624 8:7831628 9:7831655 10:7831685 11:7831752 12:7831761 13:7831791 14:7831812 15:7831840 16:7831850 17:7831863 18:7831880 19:7831888 20:7831914 21:7831918 22:7831938 23:7831948 24:7832184 25:7832191 26:7832205 27:7832217 28:7832236 29:7832259 30:7832269 31:7832319 32:7832328 33:7832394 34:7832401 35:7832411 36:7832479 37:7832489 38:7832495 39:7832616 40:7832649 41:7832674 42:7832687 43:7832697 44:7832729 45:7832744 46:7832885 47:7832920 48:7832939 49:7832946 50:7833037 51:7833279 52:7833312 53:7833388 54:7833399 55:7833424 56:7833560 57:7833566 58:7833742 59:7833788 60:7833791 61:7833822 62:7834001 63:7834032 64:7834039 65:7834056 66:7834084 67:7834230 68:7834298 69:7834679 70:7834690 71:7834756 72:7834914 73:7834959 74:7835107 75:7835225 76:7835243 77:7835388 78:7835497 79:7835508 80:7835737 81:7835759 82:7835769 83:7835775 84:7836319 85:7836472 86:7836555 87:7836557 88:7836607 89:7836628 90:7836702 91:7836714 92:7838820 93:7839025 94:7839030 95:7839034 96:7839159 97:7839210 98:7839220 99:7839243 100:7839317 101:7839967 102:7839970 103:7840045 104:7840232 105:7840262 106:7840340 107:7840363 108:7840427 109:7840492 110:7840527 111:7840540 112:7840584 113:7840651 114:7840698 115:7840806 116:7840835 117:7841438 118:7841500 119:7841631 120:7841695 121:7841704 122:7841864 123:7841882 124:7842286 125:7842367 126:7842385 127:7842408 128:7842456 129:7842519 130:7842555 131:7842587 132:7842590 133:7842607 134:7842620 135:7842632 136:7842700 137:7842725 138:7844503 139:7844559 140:7844598 141:7844794 142:7844814 143:7844872 144:7845233 145:7845237 146:7845388 147:7845404 148:7845407 149:7845502 150:7845512 151:7845514 152:7845516 153:7845518 154:7845678 155:7845832 156:7845841 157:7846015 158:7846121 159:7846205 160:7846344 161:7846346 162:7846447 163:7846560 164:7846571 165:7846718 166:7846744 167:7846776 168:7846815 169:7846830 170:7846879 171:7846888 172:7846893 173:7846904 174:7846942 175:7846958 176:7846973 177:7847002 178:7847006 179:7847017 180:7847045 181:7847077 182:7847085 183:7847097 184:7847117 185:7847133 186:7847178 187:7847203 188:7847241 189:7847245 190:7847267 191:7847304 192:7847373 193:7847401 194:7847576 195:7847588 196:7847592 197:7847663 198:7847679 199:7847689 200:7847716 201:7847729 202:7847836 203:7847859 204:7847879 205:7847883 206:7847904 207:7847908 208:7847920 209:7847937 210:7847967 211:7847980 212:7847995 213:7848019 214:7848029 215:7848044 216:7848079 217:7848088 218:7848107 219:7848111 220:7848195 221:7848202 222:7848209 223:7848238 224:7848259 225:7848267 226:7848274 227:7848376 228:7848381 229:7848398 230:7848436 231:7848449 232:7848488 233:7848520 234:7848601 235:7848621 236:7848682 237:7848684 238:7849805 239:7849864 240:7849882 241:7849959 242:7850079 243:7850189 244:7850201 245:7850469 246:7850817 247:7850832 248:7850838 249:7850856 250:7850961 251:7851311 252:7851329 253:7851438 254:7851839 255:7851848 256:7852670 257:7853114 258:7853150 259:7853182 260:7853210 261:7853308 262:7853312 263:7853405 264:7853412 265:7853418 266:7853422 267:7853600 268:7853602 269:7853652 270:7853673 271:7853805 272:7854192 273:7854207 274:7854320 275:7854358 276:7854404 277:7854432 278:7854456 279:7854487 280:7854542 281:7854578 282:7854623 283:7854654 284:7854673 285:7854702 286:7854708 287:7854713 288:7854722 289:7854775 290:7854779 291:7854793 292:7854799 293:7854816 294:7854818 295:7854829 296:7854865 297:7854942 298:7854952 299:7854979 300:7855040 301:7855251 302:7855257 303:7855262 304:7855362 
305:7855656 306:7855759 307:7856074 308:7856095 309:7856111 310:7856131 311:7856153 312:7856157 313:7856181 314:7856252 315:7856255 316:7856297 317:7856467 318:7856519 319:7856762 320:7856767 321:7856785 322:7856787 323:7856800 324:7856827 325:7856833 326:7856891 327:7857097 328:7857180 329:7857343 330:7857530 331:7857787 332:7857791 333:7857856 334:7857860 335:7857900


* Re: Regular FS shutdown while rsync is running
  2019-01-21 16:23       ` Lucas Stach
@ 2019-01-21 18:11         ` Brian Foster
  2019-01-21 18:21           ` Lucas Stach
  2019-01-22 10:39           ` Lucas Stach
From: Brian Foster @ 2019-01-21 18:11 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Mon, Jan 21, 2019 at 05:23:44PM +0100, Lucas Stach wrote:
> On Monday, 21.01.2019 at 08:01 -0500, Brian Foster wrote:
> [...]
> > > root@XXX:/mnt/metadump# xfs_repair  /dev/XXX
> > > Phase 1 - find and verify superblock...
> > >         - reporting progress in intervals of 15 minutes
> > > Phase 2 - using internal log
> > >         - zero log...
> > >         - scan filesystem freespace and inode maps...
> > > bad magic # 0x49414233 in inobt block 5/7831662
> > 
> > Hmm, so this looks like a very isolated corruption. It's complaining
> > about a magic number (internal filesystem value stamped to metadata
> > blocks for verification/sanity purposes) being wrong on a finobt block
> > and nothing else seems to be wrong in the fs. (I guess repair should
> > probably print out 'finobt' here instead of 'inobt,' but that's a
> > separate issue..).
> > 
> > The finobt uses a magic value of XFS_FIBT_CRC_MAGIC (0x46494233, 'FIB3')
> > whereas this block has a magic value of 0x49414233. The latter is
> > 'IAB3' (XFS_IBT_CRC_MAGIC), which is the magic value for regular inode
> > btree blocks.
> > 
> > >         - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
> > >         - found root inode chunk
> > > Phase 3 - for each AG...
> > 
> > ...
> > > Phase 7 - verify and correct link counts...
> > >         - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
> > > done
> > > 
> > > >  Would you be able to provide an xfs_metadump image
> > > > of this filesystem for closer inspection?
> > > 
> > > This filesystem is really metadata heavy, so an xfs_metadump ended up
> > > being around 400GB of data. Not sure if this is something you would be
> > > willing to look into?
> > > 
> > 
> > Ok, it might be difficult to get ahold of that. Does the image happen to
> > compress well?
> 
> I'll see how well it compresses, but this might take a while...
> 
> > In the meantime, given that the corruption appears to be so isolated you
> > might be able to provide enough information from the metadump without
> > having to transfer it. The first thing is probably to take a look at the
> > block in question..
> > 
> > First, restore the metadump somewhere:
> > 
> > xfs_mdrestore -g ./md.img <destination>
> > 
> > You'll need somewhere with enough space for that 400G or so. Note that
> > you can restore to a file and mount/inspect that file as if it were the
> > original fs. I'd also mount/unmount the restored metadump and run an
> > 'xfs_repair -n' on it just to double check that the corruption was
> > captured properly and there are no other issues with the metadump. -n is
> > important here as otherwise repair will fix the metadump and remove the
> > corruption.
> > 
> > Next, use xfs_db to dump the contents of the suspect block. Run 'xfs_db
> > <metadump image>' to open the fs and try the following sequence of
> > commands.
> > 
> > - Convert to a global fsb: 'convert agno 5 agbno 7831662 fsb'
> > - Jump to the fsb: 'fsb <output of prev cmd>'
> > - Set the block type: 'type finobt'
> > - Print the block: 'print'
> > 
> > ... and copy/paste the output.
> 
> So for the moment, here's the output of the above sequence.
> 
> xfs_db> convert agno 5 agbno 7831662 fsb
> 0x5077806e (1350008942)
> xfs_db> fsb 0x5077806e
> xfs_db> type finobt
> xfs_db> print
> magic = 0x49414233
> level = 1
> numrecs = 335
> leftsib = 7810856
> rightsib = null
> bno = 7387612016
> lsn = 0x6671003d9700
> uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> owner = 5
> crc = 0xe12b19b2 (correct)

As expected, we have the inobt magic. Interesting that this is a fairly
full intermediate (level > 0) node. There is no right sibling, which
means we're at the far right end of the tree. I wouldn't mind poking
around a bit more at the tree, but that might be easier with access to
the metadump. I also think that xfs_repair would have complained were
something more significant wrong with the tree.

Hmm, I wonder if the (lightly tested) diff below would help us catch
anything. It basically just splits up the currently combined inobt and
finobt I/O verifiers to expect the appropriate magic number (rather than
accepting either magic for both trees). Could you give that a try?
Unless we're doing something like using the wrong type of cursor for a
particular tree, I'd think this would catch wherever we happen to put a
bad magic on disk. Note that this assumes the underlying filesystem has
been repaired so as to try and detect the next time an on-disk
corruption is introduced.

You'll also need to turn up the XFS error level to make sure this prints
out a stack trace if/when a verifier failure triggers:

echo 5 > /proc/sys/fs/xfs/error_level
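
The same knob is also available via sysctl, if that's more convenient:

  sysctl fs.xfs.error_level=5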

I guess we also shouldn't rule out hardware issues or whatnot. I did
notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
distro kernel? Has it been modified from upstream in any way? If so, I'd
strongly suggest trying to confirm whether this is reproducible with an
upstream kernel.

Brian

--- 8< ---

diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 9b25e7a0df47..c493a37730cb 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -272,13 +272,11 @@ xfs_inobt_verify(
 	 */
 	switch (block->bb_magic) {
 	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
-	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
 		fa = xfs_btree_sblock_v5hdr_verify(bp);
 		if (fa)
 			return fa;
 		/* fall through */
 	case cpu_to_be32(XFS_IBT_MAGIC):
-	case cpu_to_be32(XFS_FIBT_MAGIC):
 		break;
 	default:
 		return __this_address;
@@ -333,6 +331,86 @@ const struct xfs_buf_ops xfs_inobt_buf_ops = {
 	.verify_struct = xfs_inobt_verify,
 };
 
+static xfs_failaddr_t
+xfs_finobt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	/*
+	 * During growfs operations, we can't verify the exact owner as the
+	 * perag is not fully initialised and hence not attached to the buffer.
+	 *
+	 * Similarly, during log recovery we will have a perag structure
+	 * attached, but the agi information will not yet have been initialised
+	 * from the on disk AGI. We don't currently use any of this information,
+	 * but beware of the landmine (i.e. need to check pag->pagi_init) if we
+	 * ever do.
+	 */
+	switch (block->bb_magic) {
+	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
+		fa = xfs_btree_sblock_v5hdr_verify(bp);
+		if (fa)
+			return fa;
+		/* fall through */
+	case cpu_to_be32(XFS_FIBT_MAGIC):
+		break;
+	default:
+		return __this_address;
+	}
+
+	/* level verification */
+	level = be16_to_cpu(block->bb_level);
+	if (level >= mp->m_in_maxlevels)
+		return __this_address;
+
+	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
+}
+
+static void
+xfs_finobt_read_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa;
+
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_verifier_error(bp, -EFSBADCRC, __this_address);
+	else {
+		fa = xfs_finobt_verify(bp);
+		if (fa)
+			xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+	}
+
+	if (bp->b_error)
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+}
+
+static void
+xfs_finobt_write_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa;
+
+	fa = xfs_finobt_verify(bp);
+	if (fa) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+
+}
+
+const struct xfs_buf_ops xfs_finobt_buf_ops = {
+	.name = "xfs_finobt",
+	.verify_read = xfs_finobt_read_verify,
+	.verify_write = xfs_finobt_write_verify,
+	.verify_struct = xfs_finobt_verify,
+};
+
 STATIC int
 xfs_inobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -389,7 +467,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
-	.buf_ops		= &xfs_inobt_buf_ops,
+	.buf_ops		= &xfs_finobt_buf_ops,
 	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-21 18:11         ` Brian Foster
@ 2019-01-21 18:21           ` Lucas Stach
  2019-01-22 10:39           ` Lucas Stach
  1 sibling, 0 replies; 22+ messages in thread
From: Lucas Stach @ 2019-01-21 18:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> On Mon, Jan 21, 2019 at 05:23:44PM +0100, Lucas Stach wrote:
> > Am Montag, den 21.01.2019, 08:01 -0500 schrieb Brian Foster:
> > [...]
> > > > root@XXX:/mnt/metadump# xfs_repair  /dev/XXX
> > > > Phase 1 - find and verify superblock...
> > > >         - reporting progress in intervals of 15 minutes
> > > > Phase 2 - using internal log
> > > >         - zero log...
> > > >         - scan filesystem freespace and inode maps...
> > > > bad magic # 0x49414233 in inobt block 5/7831662
> > > 
> > > Hmm, so this looks like a very isolated corruption. It's complaining
> > > about a magic number (internal filesystem value stamped to metadata
> > > blocks for verification/sanity purposes) being wrong on a finobt block
> > > and nothing else seems to be wrong in the fs. (I guess repair should
> > > probably print out 'finobt' here instead of 'inobt,' but that's a
> > > separate issue..).
> > > 
> > > The finobt uses a magic value of XFS_FIBT_CRC_MAGIC (0x46494233, 'FIB3')
> > > whereas this block has a magic value of 0x49414233. The latter is
> > > 'IAB3' (XFS_IBT_CRC_MAGIC), which is the magic value for regular inode
> > > btree blocks.
> > > 
> > > >         - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
> > > >         - found root inode chunk
> > > > Phase 3 - for each AG...
> > > 
> > > ...
> > > > Phase 7 - verify and correct link counts...
> > > >         - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
> > > > done
> > > > 
> > > > >  Would you be able to provide an xfs_metadump image
> > > > > of this filesystem for closer inspection?
> > > > 
> > > > This filesystem is really metadata heavy, so an xfs_metadump ended up
> > > > being around 400GB of data. Not sure if this is something you would be
> > > > willing to look into?
> > > > 
> > > 
> > > Ok, it might be difficult to get ahold of that. Does the image happen to
> > > compress well?
> > 
> > I'll see how well it compresses, but this might take a while...
> > 
> > > In the meantime, given that the corruption appears to be so isolated you
> > > might be able to provide enough information from the metadump without
> > > having to transfer it. The first thing is probably to take a look at the
> > > block in question..
> > > 
> > > First, restore the metadump somewhere:
> > > 
> > > xfs_mdrestore -g ./md.img <destination>
> > > 
> > > You'll need somewhere with enough space for that 400G or so. Note that
> > > you can restore to a file and mount/inspect that file as if it were the
> > > original fs. I'd also mount/unmount the restored metadump and run an
> > > 'xfs_repair -n' on it just to double check that the corruption was
> > > captured properly and there are no other issues with the metadump. -n is
> > > important here as otherwise repair will fix the metadump and remove the
> > > corruption.
> > > 
> > > Next, use xfs_db to dump the contents of the suspect block. Run 'xfs_db
> > > <metadump image>' to open the fs and try the following sequence of
> > > commands.
> > > 
> > > - Convert to a global fsb: 'convert agno 5 agbno 7831662 fsb'
> > > - Jump to the fsb: 'fsb <output of prev cmd>'
> > > - Set the block type: 'type finobt'
> > > - Print the block: 'print'
> > > 
> > > ... and copy/paste the output.
> > 
> > So for the moment, here's the output of the above sequence.
> > 
> > xfs_db> convert agno 5 agbno 7831662 fsb
> > 0x5077806e (1350008942)
> > xfs_db> fsb 0x5077806e
> > xfs_db> type finobt
> > xfs_db> print
> > magic = 0x49414233
> > level = 1
> > numrecs = 335
> > leftsib = 7810856
> > rightsib = null
> > bno = 7387612016
> > lsn = 0x6671003d9700
> > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > owner = 5
> > crc = 0xe12b19b2 (correct)
> 
> As expected, we have the inobt magic. Interesting that this is a fairly
> full intermediate (level > 0) node. There is no right sibling, which
> means we're at the far right end of the tree. I wouldn't mind poking
> around a bit more at the tree, but that might be easier with access to
> the metadump. I also think that xfs_repair would have complained were
> something more significant wrong with the tree.
> 
> Hmm, I wonder if the (lightly tested) diff below would help us catch
> anything. It basically just splits up the currently combined inobt and
> finobt I/O verifiers to expect the appropriate magic number (rather than
> accepting either magic for both trees). Could you give that a try?
> Unless we're doing something like using the wrong type of cursor for a
> particular tree, I'd think this would catch wherever we happen to put a
> bad magic on disk. Note that this assumes the underlying filesystem has
> been repaired so as to try and detect the next time an on-disk
> corruption is introduced.
> 
> You'll also need to turn up the XFS error level to make sure this prints
> out a stack trace if/when a verifier failure triggers:
> 
> echo 5 > /proc/sys/fs/xfs/error_level

Thanks, we'll try this and see if it digs up something.

> I guess we also shouldn't rule out hardware issues or whatnot. I did
> notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> distro kernel? Has it been modified from upstream in any way? If so, I'd
> strongly suggest to try and confirm whether this is reproducible with an
> upstream kernel.

This is a custom built kernel, but really just the vanilla stable
kernel built with the Debian kernel config, specifically to rule out
any distro kernel issues and to get closer to upstream than what the
distro provides.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-21 10:41   ` Lucas Stach
  2019-01-21 13:01     ` Brian Foster
@ 2019-01-21 21:18     ` Dave Chinner
  2019-01-22  9:15       ` Lucas Stach
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2019-01-21 21:18 UTC (permalink / raw)
  To: Lucas Stach; +Cc: Brian Foster, linux-xfs

On Mon, Jan 21, 2019 at 11:41:57AM +0100, Lucas Stach wrote:
> > Can you provide xfs_info for the fs and details of your storage, CPU and
> > RAM configuration?
> 
> root@XXX:~# xfs_info /srv/
> meta-data=/dev/mapper/XXX-backup isize=512    agcount=33, agsize=183123968 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=5860389888, imaxpct=15
>          =                       sunit=16     swidth=48 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Just recreated a similar fs here to do some math with xfs_db...

> bad magic # 0x49414233 in inobt block 5/7831662

So repair tripped over this bad block at AG 5, agbno 7831662.
Let's check that against the info in the block that the kernel
reported as corrupt:

> With that applied we were able to collect the following dump:
> 
> [215922.475666] 00000000d471c70c: 49 41 42 33 00 01 01 0c 00 07 58 ab ff ff ff ff  IAB3......X.....
> [215922.475954] 000000001be9cc59: 00 00 00 01 b5 db 40 b8 00 00 00 00 00 00 00 00  ......@.........
                                    ^^^^^^^^^^^^^^^^^^^^^^^
				    daddr of the block that was read.

xfs_db> convert daddr 0x1b5db40b8 agno
0x5 (5)
xfs_db> convert daddr 0x1b5db40b8 agbno
0x282817 (2631703)
xfs_db> 

That doesn't match the agbno that xfs repair reported.

xfs_db> convert agno 5 agbno 7831662 daddr
0x1b8560370 (7387612016)
xfs_db> convert daddr 0x1b8560370 agbno
0x77806e (7831662)
xfs_db> 

The agbno isn't even close to being correct. We may have a
misdirected write here. So, we really need to start to walk the tree
structure to determine if this really is in the correct place.  So
what we really need to look at is the left sibling block of the
bad block (agbno 0x758ab) and determine what agbno it points to
(i.e. if it points to the agbno that repair told us about or the
agbno the kernel thinks it has read).

i.e. run these commands and paste the output:

xfs_db> convert agno 5 agbno 0x758ab fsb
0x500758ab (1342658731)
xfs_db> fsb 0x500758ab
xfs_db> type data
xfs_db> p
[hexdump output we need]
xfs_db> type finobt
xfs_db> p
[same info but decoded as finobt structure]
xfs_db> type inobt
xfs_db> p
[same info but decoded as inobt structure]

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-21 21:18     ` Dave Chinner
@ 2019-01-22  9:15       ` Lucas Stach
  2019-01-22 21:41         ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Lucas Stach @ 2019-01-22  9:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

Hi Dave,

Am Dienstag, den 22.01.2019, 08:18 +1100 schrieb Dave Chinner:
> On Mon, Jan 21, 2019 at 11:41:57AM +0100, Lucas Stach wrote:
> > > Can you provide xfs_info for the fs and details of your storage, CPU and
> > > RAM configuration?
> > 
> > root@XXX:~# xfs_info /srv/
> > meta-data=/dev/mapper/XXX-backup isize=512    agcount=33, agsize=183123968 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=5860389888, imaxpct=15
> >          =                       sunit=16     swidth=48 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Just recreated a similar fs here to do some math with xfs_db...
> 
> > bad magic # 0x49414233 in inobt block 5/7831662
> 
> So repair tripped over this bad block at AG 5, agbno 7831662.
> Let's check that against the info in the block that the kernel
> reported as corrupt:
> 
> > With that applied we were able to collect the following dump:
> > 
> > [215922.475666] 00000000d471c70c: 49 41 42 33 00 01 01 0c 00 07 58 ab ff ff ff ff  IAB3......X.....
> > [215922.475954] 000000001be9cc59: 00 00 00 01 b5 db 40 b8 00 00 00 00 00 00 00 00  ......@.........
> 
>                                     ^^^^^^^^^^^^^^^^^^^^^^^
>                                     daddr of the block that was read.
> 
> xfs_db> convert daddr 0x1b5db40b8 agno
> 0x5 (5)
> xfs_db> convert daddr 0x1b5db40b8 agbno
> 0x282817 (2631703)
> xfs_db> 
> 
> That doesn't match the agbno that xfs repair reported.
> 
> xfs_db> convert agno 5 agbno 7831662 daddr
> 0x1b8560370 (7387612016)
> xfs_db> convert daddr 0x1b8560370 agbno
> 0x77806e (7831662)
> xfs_db> 
> 
> The agbno isn't even close to being correct. We may have a
> misdirected write here.

Sorry for causing confusion here. The kernel splat is not from the same
crash from which we captured the metadump and which the
xfs_repair output references. It is from a later run, where we didn't
store a complete metadump. All it does is prove that the bug is still
present in a later 4.19 stable release.

I guess we should concentrate on the exact crash where we have the
metadump from to do any analysis on.

Just to rule out the misdirected write theory, I did the following:

xfs_db> convert agno 5 agbno 7831662 fsb
0x5077806e (1350008942)
xfs_db> fsb 0x5077806e
xfs_db> type data
xfs_db> p
000: 49414233 0001014f 00772f28 ffffffff 00000001 b8560370 00006671 003d9700
020: 026711cc 25c744b9 89aa0aac 496edfec 00000005 e12b19b2 39259c80 393b8c80
[... snip ...]
xfs_db> convert	daddr 0x1b8560370 agno
0x5 (5)
xfs_db> convert daddr 0x1b8560370 agbno       
0x77806e (7831662)

So it seems we are looking at a finobt node that is exactly where it is
supposed to be, but carries the wrong magic. We are still waiting to
get back results from a run with Brian's verifier changes applied.

> So, we really need to start to walk the tree
> structure to determine if this really is in the correct place.  So
> what we really need is to look at is the left sibling block of the
> bad block (agbno 0x758ab) and determine what agbno it points to
> (i.e. if it points to the agbno that repair told us about or the
> agbno the kernel thinks it has read).
> 
> i.e. run these commands and paste the output:
> 
> xfs_db> convert agno 5 agbno 0x758ab fsb
> 0x500758ab (1342658731)
> xfs_db> fsb 0x500758ab
> xfs_db> type data
> xfs_db> p
> [hexdump output we need]
> xfs_db> type finobt
> xfs_db> p
> [same info but decoded as finobt structure]
> xfs_db> type inobt
> xfs_db> p
> [same info but decoded as inobt structure]

Just for completeness I did a bit of the tree walk to look at the left
sibling:

xfs_db> convert agno 5 agbno 7810856 fsb
0x50772f28 (1349988136)
xfs_db> fsb 0x50772f28
xfs_db> type finobt 
xfs_db> p
magic = 0x46494233
level = 1
numrecs = 252
leftsib = null
rightsib = 7831662
bno = 7387445568
lsn = 0x66bc003f28a8
uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
owner = 5
crc = 0xe5e78504 (correct)
[... snip ...]

It seems interesting that this node doesn't have a left sibling. Does
this mean the finobt is just those 2 nodes at that point?

Regards,
Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-21 18:11         ` Brian Foster
  2019-01-21 18:21           ` Lucas Stach
@ 2019-01-22 10:39           ` Lucas Stach
  2019-01-22 13:02             ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Lucas Stach @ 2019-01-22 10:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Hi Brian,

Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
[...]
> > So for the moment, here's the output of the above sequence.
> > 
> > xfs_db> convert agno 5 agbno 7831662 fsb
> > 0x5077806e (1350008942)
> > xfs_db> fsb 0x5077806e
> > xfs_db> type finobt
> > xfs_db> print
> > magic = 0x49414233
> > level = 1
> > numrecs = 335
> > leftsib = 7810856
> > rightsib = null
> > bno = 7387612016
> > lsn = 0x6671003d9700
> > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > owner = 5
> > crc = 0xe12b19b2 (correct)
> 
> As expected, we have the inobt magic. Interesting that this is a fairly
> full intermediate (level > 0) node. There is no right sibling, which
> means we're at the far right end of the tree. I wouldn't mind poking
> around a bit more at the tree, but that might be easier with access to
> the metadump. I also think that xfs_repair would have complained were
> something more significant wrong with the tree.
> 
> Hmm, I wonder if the (lightly tested) diff below would help us catch
> anything. It basically just splits up the currently combined inobt and
> finobt I/O verifiers to expect the appropriate magic number (rather than
> accepting either magic for both trees). Could you give that a try?
> Unless we're doing something like using the wrong type of cursor for a
> particular tree, I'd think this would catch wherever we happen to put a
> bad magic on disk. Note that this assumes the underlying filesystem has
> been repaired so as to try and detect the next time an on-disk
> corruption is introduced.
> 
> You'll also need to turn up the XFS error level to make sure this prints
> out a stack trace if/when a verifier failure triggers:
> 
> echo 5 > /proc/sys/fs/xfs/error_level
> 
> I guess we also shouldn't rule out hardware issues or whatnot. I did
> notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> distro kernel? Has it been modified from upstream in any way? If so, I'd
> strongly suggest to try and confirm whether this is reproducible with an
> upstream kernel.

With the finobt verifier changes applied we are unable to mount the FS,
even after running xfs_repair.

xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
over a buffer at a different daddr though:

[   73.237007] XFS (dm-3): Mounting V5 Filesystem
[   73.456481] XFS (dm-3): Ending clean mount
[   74.132671] XFS (dm-3): Metadata corruption detected at xfs_finobt_verify+0x50/0x90 [xfs], xfs_finobt block 0x1b5df7d50 
[   74.133028] XFS (dm-3): Unmount and run xfs_repair
[   74.133184] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
[   74.133395] 00000000e44dfb87: 49 41 42 33 00 01 01 50 00 07 53 58 ff ff ff ff  IAB3...P..SX....
[   74.133679] 000000009f21b317: 00 00 00 01 b5 df 7d 50 00 00 00 00 00 00 00 00  ......}P........
[   74.133964] 000000003429321b: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
[   74.134272] 00000000fe79b835: 00 00 00 05 24 52 54 c6 32 dc 7d 00 32 e9 b9 a0  ....$RT.2.}.2...
[   74.134554] 00000000d1e887dc: 32 f4 97 80 33 01 36 80 33 09 ca 80 33 1e b7 80  2...3.6.3...3...
[   74.134852] 00000000612879d2: 33 2f 50 00 33 33 e8 80 33 40 a9 c0 33 4c 08 80  3/P.33..3@..3L..
[   74.135140] 00000000e63fd33a: 33 64 d7 80 33 79 34 40 33 8f 08 80 33 a7 be c0  3d..3y4@3...3...
[   74.135427] 00000000d1c405d7: 33 b6 10 80 33 bf 1e c0 33 d0 99 00 33 df cd 00  3...3...3...3...
[   74.135871] XFS (dm-3): Metadata corruption detected at xfs_finobt_verify+0x50/0x90 [xfs], xfs_finobt block 0x1b5df7d50 
[   74.136231] XFS (dm-3): Unmount and run xfs_repair
[   74.136390] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
[   74.136604] 00000000e44dfb87: 49 41 42 33 00 01 01 50 00 07 53 58 ff ff ff ff  IAB3...P..SX....
[   74.136887] 000000009f21b317: 00 00 00 01 b5 df 7d 50 00 00 00 00 00 00 00 00  ......}P........
[   74.137174] 000000003429321b: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
[   74.137463] 00000000fe79b835: 00 00 00 05 24 52 54 c6 32 dc 7d 00 32 e9 b9 a0  ....$RT.2.}.2...
[   74.137750] 00000000d1e887dc: 32 f4 97 80 33 01 36 80 33 09 ca 80 33 1e b7 80  2...3.6.3...3...
[   74.138035] 00000000612879d2: 33 2f 50 00 33 33 e8 80 33 40 a9 c0 33 4c 08 80  3/P.33..3@..3L..
[   74.138358] 00000000e63fd33a: 33 64 d7 80 33 79 34 40 33 8f 08 80 33 a7 be c0  3d..3y4@3...3...
[   74.138639] 00000000d1c405d7: 33 b6 10 80 33 bf 1e c0 33 d0 99 00 33 df cd 00  3...3...3...3...
[   74.138964] XFS (dm-3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x1b5df7d50 len 8 error 117
[   78.489686] XFS (dm-3): Error -117 reserving per-AG metadata reserve pool.
[   78.489691] XFS (dm-3): xfs_do_force_shutdown(0x8) called from line 548 of file fs/xfs/xfs_fsops.c.  Return address = 00000000b2beb4b0
[   78.489697] XFS (dm-3): Corruption of in-memory data detected.  Shutting down filesystem
[   78.489955] XFS (dm-3): Please umount the filesystem and rectify the problem(s)

Is this a real issue, or a false positive due to things working
differently during early mount?

Regards,
Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-22 10:39           ` Lucas Stach
@ 2019-01-22 13:02             ` Brian Foster
  2019-01-23 11:14               ` Lucas Stach
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-01-22 13:02 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> Hi Brian,
> 
> Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> [...]
> > > So for the moment, here's the output of the above sequence.
> > > 
> > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > 0x5077806e (1350008942)
> > > xfs_db> fsb 0x5077806e
> > > xfs_db> type finobt
> > > xfs_db> print
> > > magic = 0x49414233
> > > level = 1
> > > numrecs = 335
> > > leftsib = 7810856
> > > rightsib = null
> > > bno = 7387612016
> > > lsn = 0x6671003d9700
> > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > owner = 5
> > > crc = 0xe12b19b2 (correct)
> > 
> > As expected, we have the inobt magic. Interesting that this is a fairly
> > full intermediate (level > 0) node. There is no right sibling, which
> > means we're at the far right end of the tree. I wouldn't mind poking
> > around a bit more at the tree, but that might be easier with access to
> > the metadump. I also think that xfs_repair would have complained were
> > something more significant wrong with the tree.
> > 
> > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > anything. It basically just splits up the currently combined inobt and
> > finobt I/O verifiers to expect the appropriate magic number (rather than
> > accepting either magic for both trees). Could you give that a try?
> > Unless we're doing something like using the wrong type of cursor for a
> > particular tree, I'd think this would catch wherever we happen to put a
> > bad magic on disk. Note that this assumes the underlying filesystem has
> > been repaired so as to try and detect the next time an on-disk
> > corruption is introduced.
> > 
> > You'll also need to turn up the XFS error level to make sure this prints
> > out a stack trace if/when a verifier failure triggers:
> > 
> > echo 5 > /proc/sys/fs/xfs/error_level
> > 
> > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > strongly suggest to try and confirm whether this is reproducible with an
> > upstream kernel.
> 
> With the finobt verifier changes applied we are unable to mount the FS,
> even after running xfs_repair.
> 
> xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> over a buffer at a different daddr though:
> 

So the mount failed, you ran repair and discovered the bad magic..? That
suggests there was still an issue with the fs on-disk. Could you run
'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
errors before it is mounted? Note that xfs_repair unconditionally
regenerates certain metadata structures (like the finobt) from scratch
so there is always the possibility that xfs_repair itself is introducing
some problem in the fs.

I'm not quite sure what to make of the daddr discrepancy atm. As
described in Dave's mail, it might be interesting to spot check each
supposedly corrupted daddr in xfs_db and see which one is actually
busted. That essentially could mean to convert to fsb, load the fsb,
print the (decoded) block, visit a left or right sibling and check the
magic of the sibling and whether the sibling correctly points back to
the original block. Note that the sibling pointers are AG relative
blocks, so you'd need to find the agno from the daddr and do a "convert
agno <agno> agbno <ptr> fsb" to get the actual fsb to visit.
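
For one suspect block, that check might look something like the
following (the daddr here is the one from your mount splat; the
bracketed lines stand in for whatever xfs_db actually prints):

xfs_db> convert daddr 0x1b5df7d50 agno
[agno of the suspect block]
xfs_db> convert daddr 0x1b5df7d50 agbno
[agbno of the suspect block]
xfs_db> convert agno <agno> agbno <agbno> fsb
[fsb of the suspect block]
xfs_db> fsb <fsb from above>
xfs_db> type finobt
xfs_db> print
[note the magic, the level and the leftsib/rightsib pointers]
xfs_db> convert agno <agno> agbno <leftsib> fsb
xfs_db> fsb <fsb of the sibling>
xfs_db> type finobt
xfs_db> print
[sibling's magic, and whether its rightsib points back at the
 original agbno]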

The intent of this is to try and get an idea of whether one of these
might be a misdirected read of an inobt block when trying to read the
finobt, and thus the magic on disk is actually correct. IOW, if we load
the fsb and see it has left/right siblings which also have inobt magics,
that might suggest we are actually looking at the inobt (I may want to
confirm that slice of the tree is actually active via a lookup, but one
thing at a time...).

> [   73.237007] XFS (dm-3): Mounting V5 Filesystem
> [   73.456481] XFS (dm-3): Ending clean mount
> [   74.132671] XFS (dm-3): Metadata corruption detected at xfs_finobt_verify+0x50/0x90 [xfs], xfs_finobt block 0x1b5df7d50 
> [   74.133028] XFS (dm-3): Unmount and run xfs_repair
> [   74.133184] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> [   74.133395] 00000000e44dfb87: 49 41 42 33 00 01 01 50 00 07 53 58 ff ff ff ff  IAB3...P..SX....
> [   74.133679] 000000009f21b317: 00 00 00 01 b5 df 7d 50 00 00 00 00 00 00 00 00  ......}P........
> [   74.133964] 000000003429321b: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
> [   74.134272] 00000000fe79b835: 00 00 00 05 24 52 54 c6 32 dc 7d 00 32 e9 b9 a0  ....$RT.2.}.2...
> [   74.134554] 00000000d1e887dc: 32 f4 97 80 33 01 36 80 33 09 ca 80 33 1e b7 80  2...3.6.3...3...
> [   74.134852] 00000000612879d2: 33 2f 50 00 33 33 e8 80 33 40 a9 c0 33 4c 08 80  3/P.33..3@..3L..
> [   74.135140] 00000000e63fd33a: 33 64 d7 80 33 79 34 40 33 8f 08 80 33 a7 be c0  3d..3y4@3...3...
> [   74.135427] 00000000d1c405d7: 33 b6 10 80 33 bf 1e c0 33 d0 99 00 33 df cd 00  3...3...3...3...
> [   74.135871] XFS (dm-3): Metadata corruption detected at xfs_finobt_verify+0x50/0x90 [xfs], xfs_finobt block 0x1b5df7d50 
> [   74.136231] XFS (dm-3): Unmount and run xfs_repair
> [   74.136390] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> [   74.136604] 00000000e44dfb87: 49 41 42 33 00 01 01 50 00 07 53 58 ff ff ff ff  IAB3...P..SX....
> [   74.136887] 000000009f21b317: 00 00 00 01 b5 df 7d 50 00 00 00 00 00 00 00 00  ......}P........
> [   74.137174] 000000003429321b: 02 67 11 cc 25 c7 44 b9 89 aa 0a ac 49 6e df ec  .g..%.D.....In..
> [   74.137463] 00000000fe79b835: 00 00 00 05 24 52 54 c6 32 dc 7d 00 32 e9 b9 a0  ....$RT.2.}.2...
> [   74.137750] 00000000d1e887dc: 32 f4 97 80 33 01 36 80 33 09 ca 80 33 1e b7 80  2...3.6.3...3...
> [   74.138035] 00000000612879d2: 33 2f 50 00 33 33 e8 80 33 40 a9 c0 33 4c 08 80  3/P.33..3@..3L..
> [   74.138358] 00000000e63fd33a: 33 64 d7 80 33 79 34 40 33 8f 08 80 33 a7 be c0  3d..3y4@3...3...
> [   74.138639] 00000000d1c405d7: 33 b6 10 80 33 bf 1e c0 33 d0 99 00 33 df cd 00  3...3...3...3...
> [   74.138964] XFS (dm-3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x1b5df7d50 len 8 error 117
> [   78.489686] XFS (dm-3): Error -117 reserving per-AG metadata reserve pool.
> [   78.489691] XFS (dm-3): xfs_do_force_shutdown(0x8) called from line 548 of file fs/xfs/xfs_fsops.c.  Return address = 00000000b2beb4b0
> [   78.489697] XFS (dm-3): Corruption of in-memory data detected.  Shutting down filesystem
> [   78.489955] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
> 
> Is this a real issue, or a false positive due to things working
> differently during early mount?
> 

The mount isn't behaving any differently here. The internal block
reservation mechanism runs a scan of the finobt on mount for internal
accounting purposes. This is a read only scan, but now that we're
enforcing that finobt blocks must have finobt magic this appears to act
as a mount time scrub of the issue we're looking for. I think that if
the filesystem is clear of this state at mount time, we'd get through
this phase and the mount sequence without incident.

BTW, did you bump the error_level? Was a stack trace not printed with
this error report?

Brian

> Regards,
> Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-22  9:15       ` Lucas Stach
@ 2019-01-22 21:41         ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2019-01-22 21:41 UTC (permalink / raw)
  To: Lucas Stach; +Cc: Brian Foster, linux-xfs

On Tue, Jan 22, 2019 at 10:15:53AM +0100, Lucas Stach wrote:
> 
> Am Dienstag, den 22.01.2019, 08:18 +1100 schrieb Dave Chinner:
> > That doesn't match the agbno that xfs repair reported.
> > 
> > xfs_db> convert agno 5 agbno 7831662 daddr
> > 0x1b8560370 (7387612016)
> > xfs_db> convert daddr 0x1b8560370 agbno
> > 0x77806e (7831662)
> > xfs_db> 
> > 
> > The agbno isn't even close to being correct. We may have a
> > misdirected write here.
> 
> Sorry for causing confusion here. The kernel splat is not from the same
> crash from which we captured the metadump and which the
> xfs_repair output references. It is from a later run, where we didn't
> store a complete metadump. All it does is prove that the bug is still
> present in a later 4.19 stable release.

Well, all it proves is that there is still a problem on disk :/

> Just to rule out the misdirected write theory, I did the following:
> 
> xfs_db> convert agno 5 agbno 7831662 fsb
> 0x5077806e (1350008942)
> xfs_db> fsb 0x5077806e
> xfs_db> type data
> xfs_db> p
> 000: 49414233 0001014f 00772f28 ffffffff 00000001 b8560370 00006671 003d9700
> 020: 026711cc 25c744b9 89aa0aac 496edfec 00000005 e12b19b2 39259c80 393b8c80
> [... snip ...]
> xfs_db> convert	daddr 0x1b8560370 agno
> 0x5 (5)
> xfs_db> convert daddr 0x1b8560370 agbno       
> 0x77806e (7831662)
> 
> So it seems we are looking at a finobt node that is exactly where it is
> supposed to be, but carries the wrong magic. We are still waiting to
> get back results from a run with Brian's verifier changes applied.

Well, we don't know yet if it's an inobt block incorrectly linked to
the finobt or whether it's a finobt block with the wrong magic....

> > So, we really need to start to walk the tree
> > structure to determine if this really is in the correct place.  So
> > what we really need to look at is the left sibling block of the
> > bad block (agbno 0x758ab) and determine what agbno it points to
> > (i.e. if it points to the agbno that repair told us about or the
> > agbno the kernel thinks it has read).
> > 
> > i.e. run these commands and paste the output:
> > 
> > xfs_db> convert agno 5 agbno 0x758ab fsb
> > 0x500758ab (1342658731)
> > xfs_db> fsb 0x500758ab
> > xfs_db> type data
> > xfs_db> p
> > [hexdump output we need]
> > xfs_db> type finobt
> > xfs_db> p
> > [same info but decoded as finobt structure]
> > xfs_db> type inobt
> > xfs_db> p
> > [same info but decoded as inobt structure]
> 
> Just for completeness I did a bit of the tree walk to look at the left
> sibling:
> 
> xfs_db> convert agno 5 agbno 7810856 fsb
> 0x50772f28 (1349988136)
> xfs_db> fsb 0x50772f28
> xfs_db> type finobt 
> xfs_db> p
> magic = 0x46494233
> level = 1
> numrecs = 252
> leftsib = null
> rightsib = 7831662
> bno = 7387445568
> lsn = 0x66bc003f28a8
> uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> owner = 5
> crc = 0xe5e78504 (correct)
> [... snip ...]
> 
> It seems interesting that this node doesn't have a left sibling. Does
> this mean the finobt is just those 2 nodes at that point?

No, it means level 1 of the tree only has two blocks. That means
there should be a single level 2 block (the tree root) and up to
several hundred level zero leaf blocks with free inode records in
them.

Can you dump the AGI and the tree roots, walking down the finobt to
the level 1 blocks, dumping the tree blocks as you go? i.e.

xfs_db> agi 5
xfs_db> a root
xfs_db> p
[inobt root block info]
xfs_db> ring
type    bblock  bblen    fsbno     inode
* 2: inobt         24     8        3        -1
  1: agi            2     1        0        -1
....
xfs_db> ring 1
xfs_db> a free_root
xfs_db> p
[finobt root block info]
xfs_db> a ptrs[1]
xfs_db> p
[finobt level one left block]
type    bblock  bblen    fsbno     inode
* 3: finobt         .....
  2: finobt         .....
  1: agi             2     1        0        -1
....
xfs_db> ring 2
xfs_db> a ptrs[2]
xfs_db> p
[finobt level one right block]
xfs_db>

basically, this is a walk of the finobt from the top down to
determine if the finobt points to the level 1 block with the bad
magic number.

It may also be worthwhile doing a similar walk on the inobt to see
what block is at the right edge of level 1. Essentially that is
walking the last pointer in each block from the root (the right
edge) until you are at the level 1 block....
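
A rough sketch of that right-edge walk, with N standing for whatever
numrecs the previous 'p' printed:

xfs_db> agi 5
xfs_db> a root
xfs_db> p
[inobt root block info - note level and numrecs N]
xfs_db> a ptrs[N]
xfs_db> p
[right-most child one level down; repeat 'a ptrs[N]' and 'p' until
 level = 1]

... and then compare that level 1 block's agbno and magic against
whatever the finobt walk above lands on.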

Cheers,

Dave.


> 
> Regards,
> Lucas
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-22 13:02             ` Brian Foster
@ 2019-01-23 11:14               ` Lucas Stach
  2019-01-23 12:11                 ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Lucas Stach @ 2019-01-23 11:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > Hi Brian,
> > 
> > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > [...]
> > > > So for the moment, here's the output of the above sequence.
> > > > 
> > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > 0x5077806e (1350008942)
> > > > xfs_db> fsb 0x5077806e
> > > > xfs_db> type finobt
> > > > xfs_db> print
> > > > magic = 0x49414233
> > > > level = 1
> > > > numrecs = 335
> > > > leftsib = 7810856
> > > > rightsib = null
> > > > bno = 7387612016
> > > > lsn = 0x6671003d9700
> > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > owner = 5
> > > > crc = 0xe12b19b2 (correct)
> > > 
> > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > full intermediate (level > 0) node. There is no right sibling, which
> > > means we're at the far right end of the tree. I wouldn't mind poking
> > > around a bit more at the tree, but that might be easier with access to
> > > the metadump. I also think that xfs_repair would have complained were
> > > something more significant wrong with the tree.
> > > 
> > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > anything. It basically just splits up the currently combined inobt and
> > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > accepting either magic for both trees). Could you give that a try?
> > > Unless we're doing something like using the wrong type of cursor for a
> > > particular tree, I'd think this would catch wherever we happen to put a
> > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > been repaired so as to try and detect the next time an on-disk
> > > corruption is introduced.
> > > 
> > > You'll also need to turn up the XFS error level to make sure this prints
> > > out a stack trace if/when a verifier failure triggers:
> > > 
> > > echo 5 > /proc/sys/fs/xfs/error_level
> > > 
> > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > strongly suggest to try and confirm whether this is reproducible with an
> > > upstream kernel.
> > 
> > With the finobt verifier changes applied we are unable to mount the FS,
> > even after running xfs_repair.
> > 
> > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > over a buffer at a different daddr though:
> > 
> 
> So the mount failed, you ran repair and discovered the bad magic..? That
> suggests there was still an issue with the fs on-disk. Could you run
> 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> errors before it is mounted? Note that xfs_repair unconditionally
> regenerates certain metadata structures (like the finobt) from scratch
> so there is always the possibility that xfs_repair itself is introducing
> some problem in the fs.

The sequence was:
1. Trying to mount with the debug kernel, which didn't work due to the
finobt verifier rejecting a bad btree node.
2. Full run of mutating xfs_repair (which did find a finobt node with
bad magic)
3. Try to mount again, still fails due to finobt verifier rejecting a
bad btree node, but at a different daddr than what was flagged by
xfs_repair.
4. xfs_repair -n finds a bad magic finobt node at the same position as
the mount splat.

This leads me to wonder if xfs_repair was the poison rather than the
cure in this case. After digging into the xfs_repair code I found the
following, which looks suspicious, but I don't know enough about XFS
internals to tell if this can actually happen:

build_ino_tree(), which is shared between XFS_BTNUM_INO and
XFS_BTNUM_FINO btree rebuilds contains the following in one of the
loops:

if (lptr->num_recs_pb > 0)
		prop_ino_cursor(mp, agno, btree_curs,
				ino_rec->ino_startnum, 0);

prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
parameter being a fixed XFS_BTNUM_INO.

So if it's possible to get into this code path while rebuilding the
finobt, xfs_repair will generate an otherwise valid btree node with the
wrong magic, which matches what we see.

Sadly we have no more debug information available about the state of
the FS on the first crash, which led us to run xfs_repair in the first
place.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 11:14               ` Lucas Stach
@ 2019-01-23 12:11                 ` Brian Foster
  2019-01-23 13:03                   ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-01-23 12:11 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > Hi Brian,
> > > 
> > > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > > [...]
> > > > > So for the moment, here's the output of the above sequence.
> > > > > 
> > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > 0x5077806e (1350008942)
> > > > > xfs_db> fsb 0x5077806e
> > > > > xfs_db> type finobt
> > > > > xfs_db> print
> > > > > magic = 0x49414233
> > > > > level = 1
> > > > > numrecs = 335
> > > > > leftsib = 7810856
> > > > > rightsib = null
> > > > > bno = 7387612016
> > > > > lsn = 0x6671003d9700
> > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > owner = 5
> > > > > crc = 0xe12b19b2 (correct)
> > > > 
> > > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > > full intermediate (level > 0) node. There is no right sibling, which
> > > > means we're at the far right end of the tree. I wouldn't mind poking
> > > > around a bit more at the tree, but that might be easier with access to
> > > > the metadump. I also think that xfs_repair would have complained were
> > > > something more significant wrong with the tree.
> > > > 
> > > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > > anything. It basically just splits up the currently combined inobt and
> > > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > > accepting either magic for both trees). Could you give that a try?
> > > > Unless we're doing something like using the wrong type of cursor for a
> > > > particular tree, I'd think this would catch wherever we happen to put a
> > > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > > been repaired so as to try and detect the next time an on-disk
> > > > corruption is introduced.
> > > > 
> > > > You'll also need to turn up the XFS error level to make sure this prints
> > > > out a stack trace if/when a verifier failure triggers:
> > > > 
> > > > echo 5 > /proc/sys/fs/xfs/error_level
> > > > 
> > > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > > strongly suggest to try and confirm whether this is reproducible with an
> > > > upstream kernel.
> > > 
> > > With the finobt verifier changes applied we are unable to mount the FS,
> > > even after running xfs_repair.
> > > 
> > > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > > over a buffer at a different daddr though:
> > > 
> > 
> > So the mount failed, you ran repair and discovered the bad magic..? That
> > suggests there was still an issue with the fs on-disk. Could you run
> > 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> > errors before it is mounted? Note that xfs_repair unconditionally
> > regenerates certain metadata structures (like the finobt) from scratch
> > so there is always the possibility that xfs_repair itself is introducing
> > some problem in the fs.
> 
> The sequence was:
> 1. Trying to mount with the debug kernel, which didn't work due to the
> finobt verifier rejecting a bad btree node.
> 2. Full run of mutating xfs_repair (which did find a finobt node with
> bad magic)
> 3. Try to mount again, still fails due to finobt verifier rejecting a
> bad btree node, but at a different daddr than what was flagged by
> xfs_repair.
> 4. xfs_repair -n finds a bad magic finobt node at the same position as
> the mount splat.
> 
> This leads me to wonder if xfs_repair was the poison rather than the
> cure in this case. After digging into the xfs_repair code I found the
> following, which looks suspicious, but I don't know enough about XFS
> internals to tell if this can actually happen:
> 

That does imply xfs_repair is the culprit. If you haven't, could you try
the xfs_repair -> xfs_repair -n sequence without a mount attempt in the
picture just to be sure? That may also explain why the block seems to
shift around. It wouldn't surprise me if it's the same relative block in
the tree, but IIRC the tree itself can be shifted around run to run.

> build_ino_tree(), which is shared between XFS_BTNUM_INO and
> XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> loops:
> 
> if (lptr->num_recs_pb > 0)
> 		prop_ino_cursor(mp, agno, btree_curs,
> 				ino_rec->ino_startnum, 0);
> 
> prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> parameter being a fixed XFS_BTNUM_INO.
> 

It's been a while since I've looked at the guts of this code but that
definitely does not look right. build_ino_tree() is where we abstract
construction of the different tree types. Nice catch!

If this is the problem, I'll have to take a closer look to figure out
why this problem isn't more prevalent..

> So if it's possible to get into this code path while rebuilding the
> finobt, xfs_repair will generate a otherwise valid btree node with the
> wrong magic, which matches what we see.
> 
> Sadly we have no more debug information available about the state of
> the FS on the first crash, which lead us to run xfs_repair in the first
> place.
> 

Naturally, the next question would have been what required repair in the
first place. :P

Anyways, if you wanted to give this a quick try to confirm, it looks like
a fairly straightforward change. See the appended xfsprogs patch. Note
again that you can (and should) run this against the restored metadump
that captures the corruption before running against the primary fs.
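
Roughly, with the patched xfsprogs build (paths here are just
placeholders), that could look like:

xfs_mdrestore -g ./md.img ./restored.img
<path to patched>/xfs_repair -f ./restored.img
<path to patched>/xfs_repair -n -f ./restored.img

where -f tells repair the target is an image file rather than a block
device, and the final -n pass should come back clean if the fix is
correct.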

Brian

--- 8< ---

diff --git a/repair/phase5.c b/repair/phase5.c
index 85d1f4fb..05333f11 100644
--- a/repair/phase5.c
+++ b/repair/phase5.c
@@ -983,7 +983,7 @@ init_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
 
 static void
 prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
-	xfs_agino_t startino, int level)
+	xfs_btnum_t btnum, xfs_agino_t startino, int level)
 {
 	struct xfs_btree_block	*bt_hdr;
 	xfs_inobt_key_t		*bt_key;
@@ -1005,7 +1005,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
 		 * first path up the left side of the tree
 		 * where the agbno's are already set up
 		 */
-		prop_ino_cursor(mp, agno, btree_curs, startino, level);
+		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
 	}
 
 	if (be16_to_cpu(bt_hdr->bb_numrecs) ==
@@ -1041,7 +1041,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
 		lptr->buf_p->b_ops = &xfs_inobt_buf_ops;
 		bt_hdr = XFS_BUF_TO_BLOCK(lptr->buf_p);
 		memset(bt_hdr, 0, mp->m_sb.sb_blocksize);
-		libxfs_btree_init_block(mp, lptr->buf_p, XFS_BTNUM_INO,
+		libxfs_btree_init_block(mp, lptr->buf_p, btnum,
 					level, 0, agno, 0);
 
 		bt_hdr->bb_u.s.bb_leftsib = cpu_to_be32(lptr->prev_agbno);
@@ -1049,7 +1049,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
 		/*
 		 * propagate extent record for first extent in new block up
 		 */
-		prop_ino_cursor(mp, agno, btree_curs, startino, level);
+		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
 	}
 	/*
 	 * add inode info to current block
@@ -1201,7 +1201,7 @@ build_ino_tree(xfs_mount_t *mp, xfs_agnumber_t agno,
 			lptr->modulo--;
 
 		if (lptr->num_recs_pb > 0)
-			prop_ino_cursor(mp, agno, btree_curs,
+			prop_ino_cursor(mp, agno, btree_curs, btnum,
 					ino_rec->ino_startnum, 0);
 
 		bt_rec = (xfs_inobt_rec_t *)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 12:11                 ` Brian Foster
@ 2019-01-23 13:03                   ` Brian Foster
  2019-01-23 18:58                     ` Brian Foster
  2019-01-23 20:45                     ` Dave Chinner
  0 siblings, 2 replies; 22+ messages in thread
From: Brian Foster @ 2019-01-23 13:03 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs, Dave Chinner

On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> > > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > > Hi Brian,
> > > > 
> > > > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > > > [...]
> > > > > > So for the moment, here's the output of the above sequence.
> > > > > > 
> > > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > > 0x5077806e (1350008942)
> > > > > > xfs_db> fsb 0x5077806e
> > > > > > xfs_db> type finobt
> > > > > > xfs_db> print
> > > > > > magic = 0x49414233
> > > > > > level = 1
> > > > > > numrecs = 335
> > > > > > leftsib = 7810856
> > > > > > rightsib = null
> > > > > > bno = 7387612016
> > > > > > lsn = 0x6671003d9700
> > > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > > owner = 5
> > > > > > crc = 0xe12b19b2 (correct)
> > > > > 
> > > > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > > > full intermediate (level > 0) node. There is no right sibling, which
> > > > > means we're at the far right end of the tree. I wouldn't mind poking
> > > > > around a bit more at the tree, but that might be easier with access to
> > > > > the metadump. I also think that xfs_repair would have complained were
> > > > > something more significant wrong with the tree.
> > > > > 
> > > > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > > > anything. It basically just splits up the currently combined inobt and
> > > > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > > > accepting either magic for both trees). Could you give that a try?
> > > > > Unless we're doing something like using the wrong type of cursor for a
> > > > > particular tree, I'd think this would catch wherever we happen to put a
> > > > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > > > been repaired so as to try and detect the next time an on-disk
> > > > > corruption is introduced.
> > > > > 
> > > > > You'll also need to turn up the XFS error level to make sure this prints
> > > > > out a stack trace if/when a verifier failure triggers:
> > > > > 
> > > > > echo 5 > /proc/sys/fs/xfs/error_level
> > > > > 
> > > > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > > > strongly suggest to try and confirm whether this is reproducible with an
> > > > > upstream kernel.
> > > > 
> > > > With the finobt verifier changes applied we are unable to mount the FS,
> > > > even after running xfs_repair.
> > > > 
> > > > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > > > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > > > over a buffer at a different daddr though:
> > > > 
> > > 
> > > So the mount failed, you ran repair and discovered the bad magic..? That
> > > suggests there was still an issue with the fs on-disk. Could you run
> > > 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> > > errors before it is mounted? Note that xfs_repair unconditionally
> > > regenerates certain metadata structures (like the finobt) from scratch
> > > so there is always the possibility that xfs_repair itself is introducing
> > > some problem in the fs.
> > 
> > The sequence was:
> > 1. Trying to mount with the debug kernel, which didn't work due to the
> > finobt verifier rejecting a bad btree node.
> > 2. Full run of mutating xfs_repair (which did find a finobt node with
> > bad magic)
> > 3. Try to mount again, still fails due to finobt verifier rejecting a
> > bad btree node, but at a different daddr than what was flagged by
> > xfs_repair.
> > 4. xfs_repair -n finds a bad magic finobt node at the same position as
> > the mount splat.
> > 
> > This leads me to wonder if xfs_repair was the poison rather than the
> > cure in this case. After digging into the xfs_repair code I found the
> > following, which looks suspicious, but I don't know enough about XFS
> > internals to tell if this can actually happen:
> > 
> 
> That does imply xfs_repair is the culprit. If you haven't, could you try
> the xfs_repair -> xfs_repair -n sequence without a mount attempt in the
> picture just to be sure? That may also explain why the block seems to
> shift around. It wouldn't surprise me if it's the same relative block in
> the tree, but IIRC the tree itself can be shifted around run to run.
> 
> > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > loops:
> > 
> > if (lptr->num_recs_pb > 0)
> > 		prop_ino_cursor(mp, agno, btree_curs,
> > 				ino_rec->ino_startnum, 0);
> > 
> > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > parameter being a fixed XFS_BTNUM_INO.
> > 
> 
> It's been a while since I've looked at the guts of this code but that
> definitely does not look right. build_ino_tree() is where we abstract
> construction of the different tree types. Nice catch!
> 
> If this is the problem, I'll have to take a closer look to figure out
> why this problem isn't more prevalent..
> 

Hmm... so it looks like this is the code responsible for _growing_ the
interior nodes of the tree (i.e., level > 0). If I'm following the code
correctly, the high level algorithm is to first set up a block for each
level of the tree. This occurs early in build_ino_tree() and uses the
correct btnum. We then start populating the leaf nodes with the finobt
records that have previously been built up in-core. For each leaf block,
we populate the higher level block with the key based on the first
record in the leaf then fill in the remaining records in the leaf. This
repeats until the block at the next level up is full, at which point we
grab a new level > 0 block and initialize it with the inobt magic.

I haven't confirmed, but I think this basically means that we'd
initialize every intermediate block after the left-most block at each
level in the tree with the inobt magic. I think Dave already recognized
in the other subthread that you have a 2 block level 1 tree. I suspect
the only reasons we haven't run into this until now are that the finobt
is intended to remain fairly small by only tracking free inode records
and thus prioritizing allocation from those records (removing them from
the tree) and this is limited to users who repair a filesystem in such a
state.

I'm guessing you have a large enough set of inodes and a sustained
enough deletion pattern in your workload that you eventually free enough
inodes across a large enough set of records to populate a multi-block
level 1 tree. Then as soon as you try to allocate from that side of the
tree the finer grained checks in __xfs_btree_check_[l|s]block() discover
the wrong magic and generate the error.

The positive here is that this isn't really a coherency problem with the
filesystem. The finobt itself is coherent, consistent with the inobt and
the magic value doesn't really dictate how the trees are used/updated at
runtime. This means that the only real corruption should be the magic
value.

I am wondering whether this warrants a special case in the magic
checking to convert the runtime internal error into more of a warning to
unmount and run xfs_repair (similar to how we detected the agfl size
mismatch problem). Dave, any thoughts on that?

Brian

> > So if it's possible to get into this code path while rebuilding the
> > finobt, xfs_repair will generate an otherwise valid btree node with the
> > wrong magic, which matches what we see.
> > 
> > Sadly we have no more debug information available about the state of
> > the FS on the first crash, which led us to run xfs_repair in the first
> > place.
> > 
> 
> Naturally, the next question would have been what required repair in the
> first place. :P
> 
> Anyways, if you wanted to give this a quick try to confirm, it looks like
> a fairly straightforward change. See the appended xfsprogs patch. Note
> again that you can (and should) run this against the restored metadump
> that captures the corruption before running against the primary fs.
> 
> Brian
> 
> --- 8< ---
> 
> diff --git a/repair/phase5.c b/repair/phase5.c
> index 85d1f4fb..05333f11 100644
> --- a/repair/phase5.c
> +++ b/repair/phase5.c
> @@ -983,7 +983,7 @@ init_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  
>  static void
>  prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> -	xfs_agino_t startino, int level)
> +	xfs_btnum_t btnum, xfs_agino_t startino, int level)
>  {
>  	struct xfs_btree_block	*bt_hdr;
>  	xfs_inobt_key_t		*bt_key;
> @@ -1005,7 +1005,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		 * first path up the left side of the tree
>  		 * where the agbno's are already set up
>  		 */
> -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
>  	}
>  
>  	if (be16_to_cpu(bt_hdr->bb_numrecs) ==
> @@ -1041,7 +1041,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		lptr->buf_p->b_ops = &xfs_inobt_buf_ops;
>  		bt_hdr = XFS_BUF_TO_BLOCK(lptr->buf_p);
>  		memset(bt_hdr, 0, mp->m_sb.sb_blocksize);
> -		libxfs_btree_init_block(mp, lptr->buf_p, XFS_BTNUM_INO,
> +		libxfs_btree_init_block(mp, lptr->buf_p, btnum,
>  					level, 0, agno, 0);
>  
>  		bt_hdr->bb_u.s.bb_leftsib = cpu_to_be32(lptr->prev_agbno);
> @@ -1049,7 +1049,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		/*
>  		 * propagate extent record for first extent in new block up
>  		 */
> -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
>  	}
>  	/*
>  	 * add inode info to current block
> @@ -1201,7 +1201,7 @@ build_ino_tree(xfs_mount_t *mp, xfs_agnumber_t agno,
>  			lptr->modulo--;
>  
>  		if (lptr->num_recs_pb > 0)
> -			prop_ino_cursor(mp, agno, btree_curs,
> +			prop_ino_cursor(mp, agno, btree_curs, btnum,
>  					ino_rec->ino_startnum, 0);
>  
>  		bt_rec = (xfs_inobt_rec_t *)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 13:03                   ` Brian Foster
@ 2019-01-23 18:58                     ` Brian Foster
  2019-01-24  8:53                       ` Lucas Stach
  2019-01-23 20:45                     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-01-23 18:58 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs, Dave Chinner

On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> > > > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > > > Hi Brian,
> > > > > 
> > > > > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > > > > [...]
> > > > > > > So for the moment, here's the output of the above sequence.
> > > > > > > 
> > > > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > > > 0x5077806e (1350008942)
> > > > > > > xfs_db> fsb 0x5077806e
> > > > > > > xfs_db> type finobt
> > > > > > > xfs_db> print
> > > > > > > magic = 0x49414233
> > > > > > > level = 1
> > > > > > > numrecs = 335
> > > > > > > leftsib = 7810856
> > > > > > > rightsib = null
> > > > > > > bno = 7387612016
> > > > > > > lsn = 0x6671003d9700
> > > > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > > > owner = 5
> > > > > > > crc = 0xe12b19b2 (correct)
> > > > > > 
> > > > > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > > > > full intermediate (level > 0) node. There is no right sibling, which
> > > > > > means we're at the far right end of the tree. I wouldn't mind poking
> > > > > > around a bit more at the tree, but that might be easier with access to
> > > > > > the metadump. I also think that xfs_repair would have complained were
> > > > > > something more significant wrong with the tree.
> > > > > > 
> > > > > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > > > > anything. It basically just splits up the currently combined inobt and
> > > > > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > > > > accepting either magic for both trees). Could you give that a try?
> > > > > > Unless we're doing something like using the wrong type of cursor for a
> > > > > > particular tree, I'd think this would catch wherever we happen to put a
> > > > > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > > > > been repaired so as to try and detect the next time an on-disk
> > > > > > corruption is introduced.
> > > > > > 
> > > > > > You'll also need to turn up the XFS error level to make sure this prints
> > > > > > out a stack trace if/when a verifier failure triggers:
> > > > > > 
> > > > > > echo 5 > /proc/sys/fs/xfs/error_level
> > > > > > 
> > > > > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > > > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > > > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > > > > strongly suggest to try and confirm whether this is reproducible with an
> > > > > > upstream kernel.
> > > > > 
> > > > > With the finobt verifier changes applied we are unable to mount the FS,
> > > > > even after running xfs_repair.
> > > > > 
> > > > > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > > > > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > > > > over a buffer at a different daddr though:
> > > > > 
> > > > 
> > > > So the mount failed, you ran repair and discovered the bad magic..? That
> > > > suggests there was still an issue with the fs on-disk. Could you run
> > > > 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> > > > errors before it is mounted? Note that xfs_repair unconditionally
> > > > regenerates certain metadata structures (like the finobt) from scratch
> > > > so there is always the possibility that xfs_repair itself is introducing
> > > > some problem in the fs.
> > > 
> > > The sequence was:
> > > 1. Trying to mount with the debug kernel, which didn't work due to the
> > > finobt verifier rejecting a bad btree node.
> > > 2. Full run of mutating xfs_repair (which did find a finobt node with
> > > bad magic)
> > > 3. Try to mount again, still fails due to finobt verifier rejecting a
> > > bad btree node, but at a different daddr than what was flagged by
> > > xfs_repair.
> > > 4. xfs_repair -n finds a bad magic finobt node at the same position as
> > > the mount splat.
> > > 
> > > This leads me to wonder if xfs_repair was the poison rather than the
> > > cure in this case. After digging into the xfs_repair code I found the
> > > following, which looks suspicious, but I don't know enough about XFS
> > > internals to tell if this can actually happen:
> > > 
> > 
> > That does imply xfs_repair is the culprit. If you haven't, could you try
> > the xfs_repair -> xfs_repair -n sequence without a mount attempt in the
> > picture just to be sure? That may also explain why the block seems to
> > shift around. It wouldn't surprise me if it's the same relative block in
> > the tree, but IIRC the tree itself can be shifted around run to run.
> > 
> > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > > loops:
> > > 
> > > if (lptr->num_recs_pb > 0)
> > > 		prop_ino_cursor(mp, agno, btree_curs,
> > > 				ino_rec->ino_startnum, 0);
> > > 
> > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > parameter being a fixed XFS_BTNUM_INO.
> > > 
> > 
> > It's been a while since I've looked at the guts of this code but that
> > definitely does not look right. build_ino_tree() is where we abstract
> > construction of the different tree types. Nice catch!
> > 
> > If this is the problem, I'll have to take a closer look to figure out
> > why this problem isn't more prevalent..
> > 
> 
> Hmm... so it looks like this is the code responsible for _growing_ the
> interior nodes of the tree (i.e., level > 0). If I'm following the code
> correctly, the high level algorithm is to first set up a block for each
> level of the tree. This occurs early in build_ino_tree() and uses the
> correct btnum. We then start populating the leaf nodes with the finobt
> records that have previously been built up in-core. For each leaf block,
> we populate the higher level block with the key based on the first
> record in the leaf then fill in the remaining records in the leaf. This
> repeats until the block at the next level up is full, at which point we
> grab a new level > 0 block and initialize it with the inobt magic.
> 
> I haven't confirmed, but I think this basically means that we'd
> initialize every intermediate block after the left-most block at each
> level in the tree with the inobt magic. I think Dave already recognized
> in the other subthread that you have a 2 block level 1 tree. I suspect
> the only reasons we haven't run into this until now are that the finobt
> is intended to remain fairly small by only tracking free inode records
> and thus prioritizing allocation from those records (removing them from
> the tree) and this is limited to users who repair a filesystem in such a
> state.
> 

I created a large finobt filesystem, confirmed that xfs_repair stamps
every node as noted above with the inobt magic, and verified that the
patch below fixes the problem. I'll post it to the list shortly. Thanks
again.
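
For reference, the spot check itself is just the xfs_db sequence from
earlier in the thread run against the repaired image. Something like
the following, where the metadump/image names are made up and the
agno/agbno values are the ones from Lucas' report (use whatever your
own repair or mount failure points at):

xfs_mdrestore backup.metadump backup.img
xfs_repair -f backup.img
xfs_db -r backup.img
xfs_db> convert agno 5 agbno 7831662 fsb
xfs_db> fsb 0x5077806e
xfs_db> type finobt
xfs_db> print magic

With an unpatched repair the interior blocks after the left-most one at
each level come back with the inobt magic (0x49414233, "IAB3"); with
the patch they carry the finobt magic as expected.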

Brian

> I'm guessing you have a large enough set of inodes and a sustained
> enough deletion pattern in your workload that you eventually free enough
> inodes across a large enough set of records to populate a multi-block
> level 1 tree. Then as soon as you try to allocate from that side of the
> tree the finer grained checks in __xfs_btree_check_[l|s]block() discover
> the wrong magic and generate the error.
> 
> The positive here is that this isn't really a coherency problem with the
> filesystem. The finobt itself is coherent, consistent with the inobt and
> the magic value doesn't really dictate how the trees are used/updated at
> runtime. This means that the only real corruption should be the magic
> value.
> 
> I am wondering whether this warrants a special case in the magic
> checking to convert the runtime internal error into more of a warning to
> unmount and run xfs_repair (similar to how we detected the agfl size
> mismatch problem). Dave, any thoughts on that?
> 
> Brian
> 
> > > So if it's possible to get into this code path while rebuilding the
> > > finobt, xfs_repair will generate an otherwise valid btree node with the
> > > wrong magic, which matches what we see.
> > > 
> > > Sadly we have no more debug information available about the state of
> > > the FS on the first crash, which led us to run xfs_repair in the first
> > > place.
> > > 
> > 
> > Naturally, the next question would have been what required repair in the
> > first place. :P
> > 
> > Anyways, if you wanted to give this a quick try to confirm, it looks like
> > a fairly straightforward change. See the appended xfsprogs patch. Note
> > again that you can (and should) run this against the restored metadump
> > that captures the corruption before running against the primary fs.
> > 
> > Brian
> > 
> > --- 8< ---
> > 
> > diff --git a/repair/phase5.c b/repair/phase5.c
> > index 85d1f4fb..05333f11 100644
> > --- a/repair/phase5.c
> > +++ b/repair/phase5.c
> > @@ -983,7 +983,7 @@ init_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  
> >  static void
> >  prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> > -	xfs_agino_t startino, int level)
> > +	xfs_btnum_t btnum, xfs_agino_t startino, int level)
> >  {
> >  	struct xfs_btree_block	*bt_hdr;
> >  	xfs_inobt_key_t		*bt_key;
> > @@ -1005,7 +1005,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		 * first path up the left side of the tree
> >  		 * where the agbno's are already set up
> >  		 */
> > -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> > +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
> >  	}
> >  
> >  	if (be16_to_cpu(bt_hdr->bb_numrecs) ==
> > @@ -1041,7 +1041,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		lptr->buf_p->b_ops = &xfs_inobt_buf_ops;
> >  		bt_hdr = XFS_BUF_TO_BLOCK(lptr->buf_p);
> >  		memset(bt_hdr, 0, mp->m_sb.sb_blocksize);
> > -		libxfs_btree_init_block(mp, lptr->buf_p, XFS_BTNUM_INO,
> > +		libxfs_btree_init_block(mp, lptr->buf_p, btnum,
> >  					level, 0, agno, 0);
> >  
> >  		bt_hdr->bb_u.s.bb_leftsib = cpu_to_be32(lptr->prev_agbno);
> > @@ -1049,7 +1049,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		/*
> >  		 * propagate extent record for first extent in new block up
> >  		 */
> > -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> > +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
> >  	}
> >  	/*
> >  	 * add inode info to current block
> > @@ -1201,7 +1201,7 @@ build_ino_tree(xfs_mount_t *mp, xfs_agnumber_t agno,
> >  			lptr->modulo--;
> >  
> >  		if (lptr->num_recs_pb > 0)
> > -			prop_ino_cursor(mp, agno, btree_curs,
> > +			prop_ino_cursor(mp, agno, btree_curs, btnum,
> >  					ino_rec->ino_startnum, 0);
> >  
> >  		bt_rec = (xfs_inobt_rec_t *)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 13:03                   ` Brian Foster
  2019-01-23 18:58                     ` Brian Foster
@ 2019-01-23 20:45                     ` Dave Chinner
  2019-01-24 13:31                       ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2019-01-23 20:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: Lucas Stach, linux-xfs

On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > > loops:
> > > 
> > > if (lptr->num_recs_pb > 0)
> > > 		prop_ino_cursor(mp, agno, btree_curs,
> > > 				ino_rec->ino_startnum, 0);
> > > 
> > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > parameter being a fixed XFS_BTNUM_INO.

Nice find, Lucas!

> > It's been a while since I've looked at the guts of this code but that
> > definitely does not look right. build_ino_tree() is where we abstract
> > construction of the different tree types. Nice catch!
> > 
> > If this is the problem, I'll have to take a closer look to figure out
> > why this problem isn't more prevalent..
> > 
> 
> Hmm... so it looks like this is the code responsible for _growing_ the
> interior nodes of the tree (i.e., level > 0). If I'm following the code
> correctly, the high level algorithm is to first set up a block for each
> level of the tree. This occurs early in build_ino_tree() and uses the
> correct btnum. We then start populating the leaf nodes with the finobt
> records that have previously been built up in-core. For each leaf block,
> we populate the higher level block with the key based on the first
> record in the leaf then fill in the remaining records in the leaf. This
> repeats until the block at the next level up is full, at which point we
> grab a new level > 0 block and initialize it with the inobt magic.
> 
> I haven't confirmed, but I think this basically means that we'd
> initialize every intermediate block after the left-most block at each
> level in the tree with the inobt magic. I think Dave already recognized
> in the other subthread that you have a 2 block level 1 tree. I suspect
> the only reasons we haven't run into this until now are that the finobt
> is intended to remain fairly small by only tracking free inode records
> and thus prioritizing allocation from those records (removing them from
> the tree) and this is limited to users who repair a filesystem in such a
> state.
> 
> I'm guessing you have a large enough set of inodes and a sustained
> enough deletion pattern in your workload that you eventually free enough
> inodes across a large enough set of records to populate a multi-block
> level 1 tree. Then as soon as you try to allocate from that side of the
> tree the finer grained checks in __xfs_btree_check_[l|s]block() discover
> the wrong magic and generate the error.

Yeah, this is the "lots of sparse free inodes" sort of filesystem
that finobt is intended to really help with. :/

> The positive here is that this isn't really a coherency problem with the
> filesystem. The finobt itself is coherent, consistent with the inobt and
> the magic value doesn't really dictate how the trees are used/updated at
> runtime. This means that the only real corruption should be the magic
> value.

*nod*

Which means it should be fixable by a bit of xfs_db magic to find
and rewrite the magic numbers without needing a full repair to run.

> I am wondering whether this warrants a special case in the magic
> checking to convert the runtime internal error into more of a warning to
> unmount and run xfs_repair (similar to how we detected the agfl size
> mismatch problem). Dave, any thoughts on that?

I don't think it has the scope and breadth of the AGFL issue - there
isn't an actual fatal corruption on disk that will lead to loss or
corruption of user data.  It's just the wrong magic number being
placed in the btree block.

IMO, after fixing the repair bug, we should:

	a) make the inobt/finobt block verifier more robust to make
	sure we have the correct magic number for the btree type. In
	hindsight, if we did that in the first place then the
	repair bug would have not escaped into the wild.

	b) write a xfs_db finobt traversal command that rewrites the
	magic numbers of all the blocks in the tree. Probably with
	an xfs_admin wrapper to make it nice for users who need to
	run it.

	c) audit all the other verifiers to determine whether the
	combined magic number checks can/should be split so in
	future we detect this class of bug during development and
	testing.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 18:58                     ` Brian Foster
@ 2019-01-24  8:53                       ` Lucas Stach
  0 siblings, 0 replies; 22+ messages in thread
From: Lucas Stach @ 2019-01-24  8:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

Am Mittwoch, den 23.01.2019, 13:58 -0500 schrieb Brian Foster:
> On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> > On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > > Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> > > > > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > > > > Hi Brian,
> > > > > > 
> > > > > > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > > > > > [...]
> > > > > > > > So for the moment, here's the output of the above sequence.
> > > > > > > > 
> > > > > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > > > > 0x5077806e (1350008942)
> > > > > > > > xfs_db> fsb 0x5077806e
> > > > > > > > xfs_db> type finobt
> > > > > > > > xfs_db> print
> > > > > > > > magic = 0x49414233
> > > > > > > > level = 1
> > > > > > > > numrecs = 335
> > > > > > > > leftsib = 7810856
> > > > > > > > rightsib = null
> > > > > > > > bno = 7387612016
> > > > > > > > lsn = 0x6671003d9700
> > > > > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > > > > owner = 5
> > > > > > > > crc = 0xe12b19b2 (correct)
> > > > > > > 
> > > > > > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > > > > > full intermediate (level > 0) node. There is no right sibling, which
> > > > > > > means we're at the far right end of the tree. I wouldn't mind poking
> > > > > > > around a bit more at the tree, but that might be easier with access to
> > > > > > > the metadump. I also think that xfs_repair would have complained were
> > > > > > > something more significant wrong with the tree.
> > > > > > > 
> > > > > > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > > > > > anything. It basically just splits up the currently combined inobt and
> > > > > > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > > > > > accepting either magic for both trees). Could you give that a try?
> > > > > > > Unless we're doing something like using the wrong type of cursor for a
> > > > > > > particular tree, I'd think this would catch wherever we happen to put a
> > > > > > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > > > > > been repaired so as to try and detect the next time an on-disk
> > > > > > > corruption is introduced.
> > > > > > > 
> > > > > > > You'll also need to turn up the XFS error level to make sure this prints
> > > > > > > out a stack trace if/when a verifier failure triggers:
> > > > > > > 
> > > > > > > echo 5 > /proc/sys/fs/xfs/error_level
> > > > > > > 
> > > > > > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > > > > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > > > > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > > > > > strongly suggest to try and confirm whether this is reproducible with an
> > > > > > > upstream kernel.
> > > > > > 
> > > > > > With the finobt verifier changes applied we are unable to mount the FS,
> > > > > > even after running xfs_repair.
> > > > > > 
> > > > > > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > > > > > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > > > > > over a buffer at a different daddr though:
> > > > > > 
> > > > > 
> > > > > So the mount failed, you ran repair and discovered the bad magic..? That
> > > > > suggests there was still an issue with the fs on-disk. Could you run
> > > > > 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> > > > > errors before it is mounted? Note that xfs_repair unconditionally
> > > > > regenerates certain metadata structures (like the finobt) from scratch
> > > > > so there is always the possibility that xfs_repair itself is introducing
> > > > > some problem in the fs.
> > > > 
> > > > The sequence was:
> > > > 1. Trying to mount with the debug kernel, which didn't work due to the
> > > > finobt verifier rejecting a bad btree node.
> > > > 2. Full run of mutating xfs_repair (which did find a finobt node with
> > > > bad magic)
> > > > 3. Try to mount again, still fails due to finobt verifier rejecting a
> > > > bad btree node, but at a different daddr than what was flagged by
> > > > xfs_repair.
> > > > 4. xfs_repair -n finds a bad magic finobt node at the same position as
> > > > the mount splat.
> > > > 
> > > > This leads me to wonder if xfs_repair was the poison rather than the
> > > > cure in this case. After digging into the xfs_repair code I found the
> > > > following, which looks suspicious, but I don't know enough about XFS
> > > > internals to tell if this can actually happen:
> > > > 
> > > 
> > > That does imply xfs_repair is the culprit. If you haven't, could you try
> > > the xfs_repair -> xfs_repair -n sequence without a mount attempt in the
> > > picture just to be sure? That may also explain why the block seems to
> > > shift around. It wouldn't surprise me if it's the same relative block in
> > > the tree, but IIRC the tree itself can be shifted around run to run.
> > > 
> > > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > > > loops:
> > > > 
> > > > if (lptr->num_recs_pb > 0)
> > > > 		prop_ino_cursor(mp, agno, btree_curs,
> > > > 				ino_rec->ino_startnum, 0);
> > > > 
> > > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > > parameter being a fixed XFS_BTNUM_INO.
> > > > 
> > > 
> > > It's been a while since I've looked at the guts of this code but that
> > > definitely does not look right. build_ino_tree() is where we abstract
> > > construction of the different tree types. Nice catch!
> > > 
> > > If this is the problem, I'll have to take a closer look to figure out
> > > why this problem isn't more prevalent..
> > > 
> > 
> > Hmm... so it looks like this is the code responsible for _growing_ the
> > interior nodes of the tree (i.e., level > 0). If I'm following the code
> > correctly, the high level algorithm is to first set up a block for each
> > level of the tree. This occurs early in build_ino_tree() and uses the
> > correct btnum. We then start populating the leaf nodes with the finobt
> > records that have previously been built up in-core. For each leaf block,
> > we populate the higher level block with the key based on the first
> > record in the leaf then fill in the remaining records in the leaf. This
> > repeats until the block at the next level up is full, at which point we
> > grab a new level > 0 block and initialize it with the inobt magic.
> > 
> > I haven't confirmed, but I think this basically means that we'd
> > initialize every intermediate block after the left-most block at each
> > level in the tree with the inobt magic. I think Dave already recognized
> > in the other subthread that you have a 2 block level 1 tree. I suspect
> > the only reasons we haven't run into this until now are that the finobt
> > is intended to remain fairly small by only tracking free inode records
> > and thus prioritizing allocation from those records (removing them from
> > the tree) and this is limited to users who repair a filesystem in such a
> > state.
> > 
> 
> I created a large finobt filesystem, confirmed that xfs_repair stamps
> every node as noted above with the inobt magic, and verified that the
> patch below fixes the problem. I'll post it to the list shortly. Thanks
> again.

I can confirm that with the patch applied xfs_repair is able to fix the
corruption. A second xfs_repair run didn't find any corrupt btree nodes
anymore.

I can also confirm that manually stamping the right magic in the
corrupt finobt block via xfs_db brought our FS back to life without an
xfs_repair run.
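
For the record, what we did was roughly the following. The fsb is the
block from the earlier debug session and 0x46494233 is the v5 finobt
magic ("FIB3"), so both values are specific to our filesystem and need
to be double checked against your own xfs_repair/mount output before
writing anything:

xfs_db -x /dev/dm-2
xfs_db> convert agno 5 agbno 7831662 fsb
0x5077806e (1350008942)
xfs_db> fsb 0x5077806e
xfs_db> type finobt
xfs_db> write magic 0x46494233

As far as I understand, xfs_db run with -x takes care of recomputing
the CRC of the block on the write.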

Regards,
Lucas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-23 20:45                     ` Dave Chinner
@ 2019-01-24 13:31                       ` Brian Foster
  2019-01-24 19:11                         ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-01-24 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lucas Stach, linux-xfs

On Thu, Jan 24, 2019 at 07:45:28AM +1100, Dave Chinner wrote:
> On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> > On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > > > loops:
> > > > 
> > > > if (lptr->num_recs_pb > 0)
> > > > 		prop_ino_cursor(mp, agno, btree_curs,
> > > > 				ino_rec->ino_startnum, 0);
> > > > 
> > > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > > parameter being a fixed XFS_BTNUM_INO.
> 
> Nice find, Lucas!
> 
> > > It's been a while since I've looked at the guts of this code but that
> > > definitely does not look right. build_ino_tree() is where we abstract
> > > construction of the different tree types. Nice catch!
> > > 
> > > If this is the problem, I'll have to take a closer look to figure out
> > > why this problem isn't more prevalent..
> > > 
> > 
> > Hmm... so it looks like this is the code responsible for _growing_ the
> > interior nodes of the tree (i.e., level > 0). If I'm following the code
> > correctly, the high level algorithm is to first set up a block for each
> > level of the tree. This occurs early in build_ino_tree() and uses the
> > correct btnum. We then start populating the leaf nodes with the finobt
> > records that have previously been built up in-core. For each leaf block,
> > we populate the higher level block with the key based on the first
> > record in the leaf then fill in the remaining records in the leaf. This
> > repeats until the block at the next level up is full, at which point we
> > grab a new level > 0 block and initialize it with the inobt magic.
> > 
> > I haven't confirmed, but I think this basically means that we'd
> > initialize every intermediate block after the left-most block at each
> > level in the tree with the inobt magic. I think Dave already recognized
> > in the other subthread that you have a 2 block level 1 tree. I suspect
> > the only reasons we haven't run into this until now are that the finobt
> > is intended to remain fairly small by only tracking free inode records
> > and thus prioritizing allocation from those records (removing them from
> > the tree) and this is limited to users who repair a filesystem in such a
> > state.
> > 
> > I'm guessing you have a large enough set of inodes and a sustained
> > enough deletion pattern in your workload that you eventually free enough
> > inodes across a large enough set of records to populate a multi-block
> > level 1 tree. Then as soon as you try to allocate from that side of the
> > tree the finer grained checks in __xfs_btree_check_[l|s]block() discover
> > the wrong magic and generate the error.
> 
> Yeah, this is the "lots of sparse free inodes" sort of filesystem
> that finobt is intended to really help with. :/
> 
> > The positive here is that this isn't really a coherency problem with the
> > filesystem. The finobt itself is coherent, consistent with the inobt and
> > the magic value doesn't really dictate how the trees are used/updated at
> > runtime. This means that the only real corruption should be the magic
> > value.
> 
> *nod*
> 
> Which means it should be fixable by a bit of xfs_db magic to find
> and rewrite the magic numbers without needing a full repair to run.
> 
> > I am wondering whether this warrants a special case in the magic
> > checking to convert the runtime internal error into more of a warning to
> > unmount and run xfs_repair (similar to how we detected the agfl size
> > mismatch problem). Dave, any thoughts on that?
> 
> I don't think it has the scope and breadth of the AGFL issue - there
> isn't an actual fatal corruption on disk that will lead to loss or
> corruption of user data.  It's just the wrong magic number being
> placed in the btree block.
> 

Right, it's not a corruption that's going to have compound effects and
lead to broader inconsistencies and whatnot. We made a decision to nuke
the agfl at runtime to preserve behavior and allow the fs to continue
running (in spite of less severe corruption actually introduced by the
agfl reset).

This problem is still a fatal runtime error in some cases (and probably
all cases once we fix up the verifiers), however, and that's the
behavior I'm wondering whether we should mitigate because otherwise the
fs is fully capable of continuing to run (and the state may ultimately
clear itself).

> IMO, after fixing the repair bug, we should:
> 
> 	a) make the inobt/finobt block verifier more robust to make
> 	sure we have the correct magic number for the btree type. In
> 	hindsight, if we did that in the first place then the
> 	repair bug would have not escaped into the wild.
> 

Agreed. I had already sent Lucas a patch along these lines that pretty
much implicated xfs_repair as the cause by virtue of the mount time
finobt scan. I'll post that to the list separately.
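
For the archives, the shape of that change is roughly the following
(illustrative sketch only, not the actual patch; the structural helper
at the end is made up):

static xfs_failaddr_t
xfs_finobt_verify(
	struct xfs_buf		*bp)
{
	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);

	/* a finobt buffer must carry a finobt magic, nothing else */
	if (block->bb_magic != cpu_to_be32(XFS_FIBT_MAGIC) &&
	    block->bb_magic != cpu_to_be32(XFS_FIBT_CRC_MAGIC))
		return __this_address;

	/* the header/sibling/level checks stay as they are today */
	return xfs_inobt_verify_structure(bp);	/* made-up helper */
}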

> 	b) write a xfs_db finobt traversal command that rewrites the
> 	magic numbers of all the blocks in the tree. Probably with
> 	an xfs_admin wrapper to make it nice for users who need to
> 	run it.
> 

Hmm, I can't really think of any time I'd suggest a user run something
that swizzles on-disk data like this without running xfs_repair before
and after. I certainly wouldn't ever do otherwise myself. Of course such
a script means those xfs_repair runs could be non-destructive, but I'm
not sure that's much of an advantage.

Hmm, I was thinking a more targeted script to fix a single block based
on daddr might be reasonable, but I'm not sure I'd even trust that.

> 	c) audit all the other verifiers to determine whether the
> 	combined magic number checks can/should be split so in
> 	future we detect this class of bug during development and
> 	testing.
> 

Agree. TBH, I think the failure in this case was not thorough enough
testing of the xfs_repair changes more so than the kernel checks not
being granular enough. Otherwise an immediate follow up xfs_repair -n
run (which I tend to do) would have caught this much quicker than split
checks in the kernel verifiers. That said, the split checks definitely
facilitate analysis of this kind of error report and are worth
considering for that alone.

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-24 13:31                       ` Brian Foster
@ 2019-01-24 19:11                         ` Brian Foster
  2019-01-28 22:34                           ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-01-24 19:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lucas Stach, linux-xfs

On Thu, Jan 24, 2019 at 08:31:20AM -0500, Brian Foster wrote:
> On Thu, Jan 24, 2019 at 07:45:28AM +1100, Dave Chinner wrote:
> > On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> > > On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > > > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > > > XFS_BTNUM_FINO btree rebuilds contains the following in one of the
> > > > > loops:
> > > > > 
> > > > > if (lptr->num_recs_pb > 0)
> > > > > 		prop_ino_cursor(mp, agno, btree_curs,
> > > > > 				ino_rec->ino_startnum, 0);
> > > > > 
> > > > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > > > parameter being a fixed XFS_BTNUM_INO.
> > 
> > Nice find, Lucas!
> > 
> > > > It's been a while since I've looked at the guts of this code but that
> > > > definitely does not look right. build_ino_tree() is where we abstract
> > > > construction of the different tree types. Nice catch!
> > > > 
> > > > If this is the problem, I'll have to take a closer look to figure out
> > > > why this problem isn't more prevalent..
> > > > 
> > > 
> > > Hmm... so it looks like this is the code responsible for _growing_ the
> > > interior nodes of the tree (i.e., level > 0). If I'm following the code
> > > correctly, the high level algorithm is to first set up a block for each
> > > level of the tree. This occurs early in build_ino_tree() and uses the
> > > correct btnum. We then start populating the leaf nodes with the finobt
> > > records that have previously been built up in-core. For each leaf block,
> > > we populate the higher level block with the key based on the first
> > > record in the leaf then fill in the remaining records in the leaf. This
> > > repeats until the block at the next level up is full, at which point we
> > > grab a new level > 0 block and initialize it with the inobt magic.
> > > 
> > > I haven't confirmed, but I think this basically means that we'd
> > > initialize every intermediate block after the left-most block at each
> > > level in the tree with the inobt magic. I think Dave already recognized
> > > in the other subthread that you have a 2 block level 1 tree. I suspect
> > > the only reasons we haven't run into this until now are that the finobt
> > > is intended to remain fairly small by only tracking free inode records
> > > and thus prioritizing allocation from those records (removing them from
> > > the tree) and this is limited to users who repair a filesystem in such a
> > > state.
> > > 
> > > I'm guessing you have a large enough set of inodes and a sustained
> > > enough deletion pattern in your workload that you eventually free enough
> > > inodes across a large enough set of records to populate a multi-block
> > > level 1 tree. Then as soon as you try to allocate from that side of the
> > > tree the finer grained checks in __xfs_btree_check_[l|s]block() discover
> > > the wrong magic and generate the error.
> > 
> > Yeah, this is the "lots of sparse free inodes" sort of filesystem
> > that finobt is intended to really help with. :/
> > 
> > > The positive here is that this isn't really a coherency problem with the
> > > filesystem. The finobt itself is coherent, consistent with the inobt and
> > > the magic value doesn't really dictate how the trees are used/updated at
> > > runtime. This means that the only real corruption should be the magic
> > > value.
> > 
> > *nod*
> > 
> > Which means it should be fixable by a bit of xfs_db magic to find
> > and rewrite the magic numbers without needing a full repair to run.
> > 
> > > I am wondering whether this warrants a special case in the magic
> > > checking to convert the runtime internal error into more of a warning to
> > > unmount and run xfs_repair (similar to how we detected the agfl size
> > > mismatch problem). Dave, any thoughts on that?
> > 
> > I don't think it has the scope and breadth of the AGFL issue - there
> > isn't an actual fatal corruption on disk that will lead to loss or
> > corruption of user data.  It's just the wrong magic number being
> > placed in the btree block.
> > 
> 
> Right, it's not a corruption that's going to have compound effects and
> lead to broader inconsistencies and whatnot. We made a decision to nuke
> the agfl at runtime to preserve behavior and allow the fs to continue
> running (in spite of less severe corruption actually introduced by the
> agfl reset).
> 
> This problem is still a fatal runtime error in some cases (and probably
> all cases once we fix up the verifiers), however, and that's the
> behavior I'm wondering whether we should mitigate because otherwise the
> fs is fully capable of continuing to run (and the state may ultimately
> clear itself).
> 

Having looked more into this (see the RFC I sent earlier).. an explicit
verifier magic check plus the existing mount time finobt scan pretty
much means anybody affected by this goes from having a fully working fs
to an unmountable fs due to the verifier change. I don't think that's
appropriate at all. We need to figure out some way to handle this a
little more gracefully IMO.. thoughts?

Brian

> > IMO, after fixing the repair bug, we should:
> > 
> > 	a) make the inobt/finobt block verifier more robust to make
> > 	sure we have the correct magic number for the btree type. In
> > 	hindsight, if we did that in the first place then the
> > 	repair bug would have not escaped into the wild.
> > 
> 
> Agreed. I had already sent Lucas a patch along this lines that pretty
> much implicated xfs_repair as the cause by virtue of the mount time
> finobt scan. I'll post that to the list separately.
> 
> > 	b) write a xfs_db finobt traversal command that rewrites the
> > 	magic numbers of all the blocks in the tree. Probably with
> > 	an xfs_admin wrapper to make it nice for users who need to
> > 	run it.
> > 
> 
> Hmm, I can't really think of any time I'd suggest a user run something
> that swizzles on-disk data like this without running xfs_repair before
> and after. I certainly wouldn't ever do otherwise myself. Of course such
> a script means those xfs_repair runs could be non-destructive, but I'm
> not sure that's much of an advantage.
> 
> Hmm, I was thinking a more targeted script to fix a single block based
> on daddr might be reasonable, but I'm not sure I'd even trust that.
> 
> > 	c) audit all the other verifiers to determine whether the
> > 	combined magic number checks can/should be split so in
> > 	future we detect this class of bug during development and
> > 	testing.
> > 
> 
> Agree. TBH, I think the failure in this case was not thorough enough
> testing of the xfs_repair changes more so than the kernel checks not
> being granular enough. Otherwise an immediate follow up xfs_repair -n
> run (which I tend to do) would have caught this much quicker than split
> checks in the kernel verifiers. That said, the split checks definitely
> facilitate analysis of this kind of error report and are worth
> considering for that alone.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-24 19:11                         ` Brian Foster
@ 2019-01-28 22:34                           ` Dave Chinner
  2019-01-29 13:46                             ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2019-01-28 22:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: Lucas Stach, linux-xfs

On Thu, Jan 24, 2019 at 02:11:17PM -0500, Brian Foster wrote:
> On Thu, Jan 24, 2019 at 08:31:20AM -0500, Brian Foster wrote:
> > On Thu, Jan 24, 2019 at 07:45:28AM +1100, Dave Chinner wrote:
> > > On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> > > I don't think it has the scope and breadth of the AGFL issue - there
> > > isn't an actual fatal corruption on disk that will lead to loss or
> > > corruption of user data.  It's just the wrong magic number being
> > > placed in the btree block.
> > > 
> > 
> > Right, it's not a corruption that's going to have compound effects and
> > lead to broader inconsistencies and whatnot. We made a decision to nuke
> > the agfl at runtime to preserve behavior and allow the fs to continue
> > running (in spite of less severe corruption actually introduced by the
> > agfl reset).
> > 
> > This problem is still a fatal runtime error in some cases (and probably
> > all cases once we fix up the verifiers), however, and that's the
> > behavior I'm wondering whether we should mitigate because otherwise the
> > fs is fully capable of continuing to run (and the state may ultimately
> > clear itself).
> > 
> 
> Having looked more into this (see the RFC I sent earlier).. an explicit
> verifier magic check plus the existing mount time finobt scan pretty

I still seriously dislike that mount time scan just to count the
finobt blocks in use. That sort of thing should be maintained on the
fly like we do with the agf->agf_btreeblks for counting the number
of in-use freespace btree blocks. That needs to be fixed.

> much means anybody affected by this goes from having a fully working fs
> to an unmountable fs due to the verifier change. I don't think that's
> appropriate at all. We need to figure out some way to handle this a
> little more gracefully IMO.. thoughts?

I think that the number of people with a multi-level finobt who have
run repair can be counted on one hand. Until there's evidence that
it's a widespread problem, we should just make the code error out.
Get the repair code fixed and released, so the people who first hit
it have a solution. If it's demonstrated as an endemic problem, then
we may need to do something different.

> > > 	b) write a xfs_db finobt traversal command that rewrites the
> > > 	magic numbers of all the blocks in the tree. Probably with
> > > 	an xfs_admin wrapper to make it nice for users who need to
> > > 	run it.
> > > 
> > 
> > Hmm, I can't really think of any time I'd suggest a user run something
> > that swizzles on-disk data like this without running xfs_repair before
> > and after.

I've done exactly this sort of thing to fix corruptions many, many
times in the past. It's one of the functions xfs_db was intended
for....

> > I certainly wouldn't ever do otherwise myself. Of course such
> > a script means those xfs_repair runs could be non-destructive, but I'm
> > not sure that's much of an advantage.

When xfs_repair takes days to run and the xfs_db script takes a few
seconds, users with limited downtime are going to want the
"just rewrite the magic numbers in the finobt tree" xfs_db script.

We know that it's just magic numbers that are wrong. We can verify
the linkage of the tree in xfs_db before rewriting anything (i.e.
walk once to verify that the tree linkages are consistent, then run
a second pass to rewrite the magic numbers) and we will have fixed
the problem without requiring the downtime of running xfs_repair
(twice!).
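
As a strawman for the interface (the command name below is made up,
nothing like it exists yet):

# pass 1: walk the finobt and verify levels/keys/sibling linkage only
xfs_db -x -c "finobt_fixmagic -n" /dev/sdX

# pass 2: rewrite the magic in the otherwise valid interior blocks
xfs_db -x -c "finobt_fixmagic" /dev/sdX

with a thin xfs_admin wrapper doing the same two passes for users.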

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Regular FS shutdown while rsync is running
  2019-01-28 22:34                           ` Dave Chinner
@ 2019-01-29 13:46                             ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2019-01-29 13:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lucas Stach, linux-xfs

On Tue, Jan 29, 2019 at 09:34:58AM +1100, Dave Chinner wrote:
> On Thu, Jan 24, 2019 at 02:11:17PM -0500, Brian Foster wrote:
> > On Thu, Jan 24, 2019 at 08:31:20AM -0500, Brian Foster wrote:
> > > On Thu, Jan 24, 2019 at 07:45:28AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> > > > I don't think it has the scope and breadth of the AGFL issue - there
> > > > isn't an actual fatal corruption on disk that will lead to loss or
> > > > corruption of user data.  It's just the wrong magic number being
> > > > placed in the btree block.
> > > > 
> > > 
> > > Right, it's not a corruption that's going to have compound effects and
> > > lead to broader inconsistencies and whatnot. We made a decision to nuke
> > > the agfl at runtime to preserve behavior and allow the fs to continue
> > > running (in spite of less severe corruption actually introduced by the
> > > agfl reset).
> > > 
> > > This problem is still a fatal runtime error in some cases (and probably
> > > all cases once we fix up the verifiers), however, and that's the
> > > behavior I'm wondering whether we should mitigate because otherwise the
> > > fs is fully capable of continuing to run (and the state may ultimately
> > > clear itself).
> > > 
> > 
> > Having looked more into this (see the RFC I sent earlier).. an explicit
> > verifier magic check plus the existing mount time finobt scan pretty
> 
> I still seriously dislike that mount time scan just to count the
> finobt blocks in use. That sort of thing should be maintained on the
> fly like we do with the agf->agf_btreeblks for counting the number
> of in-use freespace btree blocks. That needs to be fixed.
> 

I've never really liked it either, but that's the current state of
things...

> > much means anybody affected by this goes from having a fully working fs
> > to an unmountable fs due to the verifier change. I don't think that's
> > appropriate at all. We need to figure out some way to handle this a
> > little more gracefully IMO.. thoughts?
> 
> I think that the number of people with a multi-level finobt who have
> run repair can be counted on one hand. Until there's evidence that
> it's a widespread problem, we should just make the code error out.
> Get the repair code fixed and released, so the people who first hit
> it have a solution. If it's demonstrated as an endemic problem, then
> we may need to do something different.
> 

Fair enough. I included a mitigation patch with the RFC because I at
least wanted to have something on the list for reference.

> > > > 	b) write a xfs_db finobt traversal command that rewrites the
> > > > 	magic numbers of all the blocks in the tree. Probably with
> > > > 	an xfs_admin wrapper to make it nice for users who need to
> > > > 	run it.
> > > > 
> > > 
> > > Hmm, I can't really think of any time I'd suggest a user run something
> > > that swizzles on-disk data like this without running xfs_repair before
> > > and after.
> 
> I've done exactly this sort of thing to fix corruptions many, many
> times in the past. It's one of the functions xfs_db was intended
> for....
> 

You aren't exactly the type of user I'm concerned about. ;)

> > > I certainly wouldn't ever do otherwise myself. Of course such
> > > a script means those xfs_repair runs could be non-destructive, but I'm
> > > not sure that's much of an advantage.
> 
> When xfs_repair takes days to run and the xfs_db script takes a few
> seconds, users with limited downtime are going to want the
> "just rewrite the magic numbers in the finobt tree" xfs_db script.
> 
> We know that it's just magic numbers that are wrong. We can verify
> the linkage of the tree in xfs_db before rewriting anything (i.e.
> walk once to verify that the tree linkages are consistent, then run
> a second pass to rewrite the magic numbers) and we will have fixed
> the problem without requiring the downtime of running xfs_repair
> (twice!).
> 

IMO, you can't confirm you've fixed the problem (without screwing
something up) unless you've run repair at least once after the fact. I
think the same logic that somewhat alleviates my concern for the finobt
mount time scan behavior alleviates the need for such a one off fixup
script... the user had to run xfs_repair to get into this situation in
the first place.

Eh, maybe it's worth taking the same deferred approach here as for the
mitigation. I've no issue with somebody tweaking an fs with xfs_db with
developer supervision if repair was painful enough because developer
supervision likely means we've carefully confirmed that the problem is
the magic value and not some more serious corruption. E.g., incorrectly
link an inobt block into the finobt and all of a sudden this fixup
script becomes a corruption propagation mechanism. I'd also prefer to
avoid having us lazily recommend such a script to the vast majority of
users who can probably deal with an xfs_repair run just because it's
there and probably faster.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2018-11-26 13:29 Regular FS shutdown while rsync is running Lucas Stach
2018-11-26 15:32 ` Brian Foster
2019-01-21 10:41   ` Lucas Stach
2019-01-21 13:01     ` Brian Foster
2019-01-21 16:23       ` Lucas Stach
2019-01-21 18:11         ` Brian Foster
2019-01-21 18:21           ` Lucas Stach
2019-01-22 10:39           ` Lucas Stach
2019-01-22 13:02             ` Brian Foster
2019-01-23 11:14               ` Lucas Stach
2019-01-23 12:11                 ` Brian Foster
2019-01-23 13:03                   ` Brian Foster
2019-01-23 18:58                     ` Brian Foster
2019-01-24  8:53                       ` Lucas Stach
2019-01-23 20:45                     ` Dave Chinner
2019-01-24 13:31                       ` Brian Foster
2019-01-24 19:11                         ` Brian Foster
2019-01-28 22:34                           ` Dave Chinner
2019-01-29 13:46                             ` Brian Foster
2019-01-21 21:18     ` Dave Chinner
2019-01-22  9:15       ` Lucas Stach
2019-01-22 21:41         ` Dave Chinner
