On Wed, Mar 15, 2023 at 10:20:37PM -0700, Darrick J. Wong wrote: > On Thu, Mar 16, 2023 at 10:16:02AM +0530, Ritesh Harjani wrote: > > "Darrick J. Wong" writes: > > > > Hi Darrick, > > > > Thanks for your analysis and quick help on this. > > > > >> > > >> Hi Darrick, > > >> > > >> Please find the information collected from the system. We added some > > >> debug logs and looks like it is exactly what is happening which you > > >> pointed out. > > >> > > >> We added a debug kernel patch to get more info from the system which > > >> you had requested [1] > > >> > > >> 1. We first breaked into emergency shell where root fs is first getting > > >> mounted on /sysroot as "ro" filesystem. Here are the logs. > > >> > > >> [ OK ] Started File System Check on /dev/mapper/rhel_ltcden3--lp1-root. > > >> Mounting /sysroot... > > >> [ 7.203990] SGI XFS with ACLs, security attributes, quota, no debug enabled > > >> [ 7.205835] XFS (dm-0): Mounting V5 Filesystem 7b801289-75a7-4d39-8cd3-24526e9e9da7 > > >> [ ***] A start job is running for /sysroot (15s / 1min 35s)[ 17.439377] XFS (dm-0): Starting recovery (logdev: internal) > > >> [ *** ] A start job is running for /sysroot (16s / 1min 35s)[ 17.771158] xfs_log_mount_finish: Recovery needed is set > > >> [ 17.771172] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:0 > > >> [ 17.771179] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:1 > > >> [ 17.771184] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:2 > > >> [ 17.771190] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:3 > > >> [ 17.771196] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:4 > > >> [ 17.771201] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:5 > > >> [ 17.801033] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:6 > > >> [ 17.801041] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:7 > > >> [ 17.801046] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:8 > > >> [ 17.801052] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:9 > > >> [ 17.801057] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:10 > > >> [ 17.801063] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:11 > > >> [ 17.801068] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:12 > > >> [ 17.801272] xlog_recover_iunlink_bucket: bucket: 13, agino: 3064909, ino: 3064909, iget ret: 0, previno:18446744073709551615, prev_agino:4294967295 > > >> > > >> is just <-1 %ull and -1 %u> in above. That's why > > >> the huge value. > > > > > > Ok, so log recovery finds 3064909 and clears it... > > > > > >> [ 17.801281] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:13 > > >> [ 17.801287] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:0, bucket:14 > > > > > > > > > > > >> [ 17.844910] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:3, bucket:62 > > >> [ 17.844916] xlog_recover_iunlink_ag: ran xlog_recover_iunlink_bucket for agi:3, bucket:63 > > >> [ 17.886079] XFS (dm-0): Ending recovery (logdev: internal) > > >> [ OK ] Mounted /sysroot. > > >> [ OK ] Reached target Initrd Root File System. > > >> > > >> > > >> 2. Then these are the logs from xfs_repair -n /dev/dm-0 > > >> Here you will notice the same agi 3064909 in bucket 13 (from phase-2) which got also > > >> printed in above xlog_recover_iunlink_ag() function. > > >> > > >> switch_root:/# xfs_repair -n /dev/dm-0 > > >> Phase 1 - find and verify superblock... > > >> Phase 2 - using internal log > > >> - zero log... > > >> - scan filesystem freespace and inode maps... > > >> agi unlinked bucket 13 is 3064909 in ag 0 (inode=3064909) > > > > > > ...yet here we find that 3064909 is still on the unlinked list? > > > > > > Just to confirm -- you ran xfs_repair -n after the successful recovery > > > above, right? > > > > > Yes, that's right. > > > > >> - found root inode chunk > > >> Phase 3 - for each AG... > > >> - scan (but don't clear) agi unlinked lists... > > >> - process known inodes and perform inode discovery... > > >> - agno = 0 > > >> - agno = 1 > > >> - agno = 2 > > >> - agno = 3 > > >> - process newly discovered inodes... > > >> Phase 4 - check for duplicate blocks... > > >> - setting up duplicate extent list... > > >> - check for inodes claiming duplicate blocks... > > >> - agno = 0 > > >> - agno = 2 > > >> - agno = 1 > > >> - agno = 3 > > >> No modify flag set, skipping phase 5 > > >> Phase 6 - check inode connectivity... > > >> - traversing filesystem ... > > >> - traversal finished ... > > >> - moving disconnected inodes to lost+found ... > > >> Phase 7 - verify link counts... > > >> would have reset inode 3064909 nlinks from 4294967291 to 2 > > > > > > Oh now that's interesting. Inode on unlinked list with grossly nonzero > > > (but probably underflowed) link count. That might explain why iunlink > > > recovery ignores the inode. Is inode 3064909 reachable via the > > > directory tree? > > > > > > Would you mind sending me a metadump to play with? metadump -ago would > > > be best, if filenames/xattrnames aren't sensitive customer data. > > > > Sorry about the delay. > > I am checking for any permissions part internally. > > Meanwhile - I can help out if you would like me to try anything. > > Ok. I'll try creating a filesystem with a weirdly high refcount > unlinked inode and I guess you can try it to see if you get the same > symptoms. I've finished with my parent pointers work for the time > being, so I might have some time tomorrow (after I kick the tires on > SETFSUUID) to simulate this and see if I can adapt the AGI repair code > to deal with this. If you uncompress and mdrestore the attached file to a blockdev, mount it, and run some creat() exerciser, do you get the same symptoms? I've figured out how to make online fsck deal with it. :) A possible solution for runtime would be to make it so that xfs_iunlink_lookup could iget the inode if it's not in cache at all. --D > --D > > > >> No modify flag set, skipping filesystem flush and exiting. > > >> > > >> > > >> 3. Then we exit from the shell for the system to continue booting. > > >> Here it will continue.. Just pasting the logs where the warning gets > > >> generated and some extra logs are getting printed for the same inode > > >> with our patch. > > >> > > >> > > >> it continues > > >> ================ > > >> [ 587.999113] ------------[ cut here ]------------ > > >> [ 587.999121] WARNING: CPU: 48 PID: 2026 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs] > > >> [ 587.999185] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink rfkill sunrpc xts pseries_rng vmx_crypto xfs libcrc32c sd_mod sg ibmvscsi ibmveth scsi_transport_srp nvme nvme_core t10_pi crc64_rocksoft crc64 dm_mirror dm_region_hash dm_log dm_mod > > >> [ 587.999215] CPU: 48 PID: 2026 Comm: in:imjournal Not tainted 6.2.0-rc8ssh+ #38 > > >> [ 587.999219] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1010.22 (NH1010_122) hv:phyp pSeries > > >> [ 587.999222] NIP: c00800000065fa80 LR: c00800000065fa4c CTR: c000000000ea4d40 > > >> [ 587.999226] REGS: c00000001aa83650 TRAP: 0700 Not tainted (6.2.0-rc8ssh+) > > >> [ 587.999228] MSR: 8000000000029033 CR: 24224842 XER: 00000000 > > >> [ 587.999236] CFAR: c00800000065fa54 IRQMASK: 0 > > >> [ 587.999236] GPR00: c00000004570b2c8 c00000001aa838f0 c008000000708300 0000000000000000 > > >> [ 587.999236] GPR04: 00000000002ec44d 0000000000000000 0000000000000000 c00000000413faf0 > > >> [ 587.999236] GPR08: 0000000000000000 c00000000413fba0 0000000000000000 fffffffffffffffd > > >> [ 587.999236] GPR12: 0000000000000040 c000004afeccd880 0000000000000000 0000000004000000 > > >> [ 587.999236] GPR16: c00000001aa83b38 c00000001aa83a38 c00000001aa83a68 c000000035264c00 > > >> [ 587.999236] GPR20: c00000004570b200 0000000000008000 c00000000886c400 00000000002ec44d > > >> [ 587.999236] GPR24: 000000000014040d 000000000000000d c000000051875400 00000000002ec44d > > >> [ 587.999236] GPR28: c000000035262c00 c00000004570b200 c00000008f2d8b90 000000000014040d > > >> [ 587.999272] NIP [c00800000065fa80] xfs_iunlink_lookup+0x58/0x80 [xfs] > > >> [ 587.999327] LR [c00800000065fa4c] xfs_iunlink_lookup+0x24/0x80 [xfs] > > >> [ 587.999379] Call Trace: > > >> [ 587.999381] [c00000001aa838f0] [c00000008f2d8b90] 0xc00000008f2d8b90 (unreliable) > > >> [ 587.999385] [c00000001aa83910] [c008000000660094] xfs_iunlink+0x1bc/0x2c0 [xfs] > > >> [ 587.999438] [c00000001aa839d0] [c008000000664804] xfs_rename+0x69c/0xd10 [xfs] > > >> [ 587.999491] [c00000001aa83b10] [c00800000065e020] xfs_vn_rename+0xf8/0x1f0 [xfs] > > >> [ 587.999544] [c00000001aa83ba0] [c000000000579efc] vfs_rename+0x9bc/0xdf0 > > >> [ 587.999549] [c00000001aa83c90] [c00000000058018c] do_renameat2+0x3dc/0x5c0 > > >> [ 587.999553] [c00000001aa83de0] [c000000000580520] sys_rename+0x60/0x80 > > >> [ 587.999557] [c00000001aa83e10] [c000000000033630] system_call_exception+0x150/0x3b0 > > >> [ 587.999562] [c00000001aa83e50] [c00000000000c554] system_call_common+0xf4/0x258 > > >> [ 587.999567] --- interrupt: c00 at 0x7fff96082e20 > > >> [ 587.999569] NIP: 00007fff96082e20 LR: 00007fff95c45e24 CTR: 0000000000000000 > > >> [ 587.999572] REGS: c00000001aa83e80 TRAP: 0c00 Not tainted (6.2.0-rc8ssh+) > > >> [ 587.999575] MSR: 800000000280f033 CR: 2a082202 XER: 00000000 > > >> [ 587.999584] IRQMASK: 0 > > >> [ 587.999584] GPR00: 0000000000000026 00007fff94f5d220 00007fff96207300 00007fff94f5d288 > > >> [ 587.999584] GPR04: 0000000172b76b70 0000000000000000 0600000000000000 0000000000000002 > > >> [ 587.999584] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > >> [ 587.999584] GPR12: 0000000000000000 00007fff94f666e0 0000000172b65930 0000000000000000 > > >> [ 587.999584] GPR16: 0000000132fe7648 00007fff95c47ce0 0000000000000000 000000000000004c > > >> [ 587.999584] GPR20: 0000000000000082 00007fff94f5e3a0 00007fff94f5e398 00007fff880029b0 > > >> [ 587.999584] GPR24: 0000000000000000 00007fff94f5e388 00007fff94f5e378 00007fff94f5e3b0 > > >> [ 587.999584] GPR28: 00007fff95c60000 00007fff95c605e0 00007fff88000c10 00007fff94f5d288 > > >> [ 587.999618] NIP [00007fff96082e20] 0x7fff96082e20 > > >> [ 587.999620] LR [00007fff95c45e24] 0x7fff95c45e24 > > >> [ 587.999622] --- interrupt: c00 > > >> [ 587.999624] Code: 2c230000 4182002c e9230020 2fa90000 419e0020 38210020 e8010010 7c0803a6 4e800020 60000000 60000000 60000000 <0fe00000> 60000000 60000000 60000000 > > >> [ 587.999637] ---[ end trace 0000000000000000 ]--- > > >> [ 587.999640] xfs_iunlink_update_backref: next_agino: 3064909 cannot be found > > >> [ 587.999643] xfs_iunlink_insert_inode: Cannot find backref for agino:1311757, ip->i_ino:1311757, next_agino: 3064909 agno:0 > > >> [ 587.999646] XFS (dm-0): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c. Caller xfs_rename+0x9cc/0xd10 [xfs] > > >> > > >> ^^^ There are the extra info printing the next_agino to be the same > > >> agino 3064909 for xfs_iunlink_lookup has failed. > > > > > > Yep, then runtime code encounters agi[0].unlinked[13] == 3064909, but > > > doesn't find an xfs_inode in the cache for it, and shuts down the > > > filesystem. > > > > > > > Sure, thanks for the info. > > > > > --D > > > > > >> > > >> [ 587.999701] CPU: 48 PID: 2026 Comm: in:imjournal Tainted: G W 6.2.0-rc8ssh+ #38 > > >> [ 587.999705] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1010.22 (NH1010_122) hv:phyp pSeries > > >> [ 587.999708] Call Trace: > > >> [ 587.999709] [c00000001aa838f0] [c000000000e87328] dump_stack_lvl+0x6c/0x9c (unreliable) > > >> [ 587.999715] [c00000001aa83920] [c008000000646a84] xfs_error_report+0x5c/0x80 [xfs] > > >> [ 587.999767] [c00000001aa83980] [c008000000676860] xfs_trans_cancel+0x178/0x1b0 [xfs] > > >> [ 587.999823] [c00000001aa839d0] [c008000000664b34] xfs_rename+0x9cc/0xd10 [xfs] > > >> [ 587.999876] [c00000001aa83b10] [c00800000065e020] xfs_vn_rename+0xf8/0x1f0 [xfs] > > >> [ 587.999929] [c00000001aa83ba0] [c000000000579efc] vfs_rename+0x9bc/0xdf0 > > >> [ 587.999933] [c00000001aa83c90] [c00000000058018c] do_renameat2+0x3dc/0x5c0 > > >> [ 587.999937] [c00000001aa83de0] [c000000000580520] sys_rename+0x60/0x80 > > >> [ 587.999941] [c00000001aa83e10] [c000000000033630] system_call_exception+0x150/0x3b0 > > >> [ 587.999945] [c00000001aa83e50] [c00000000000c554] system_call_common+0xf4/0x258 > > >> [ 587.999950] --- interrupt: c00 at 0x7fff96082e20 > > >> [ 587.999952] NIP: 00007fff96082e20 LR: 00007fff95c45e24 CTR: 0000000000000000 > > >> [ 587.999955] REGS: c00000001aa83e80 TRAP: 0c00 Tainted: G W (6.2.0-rc8ssh+) > > >> [ 587.999958] MSR: 800000000280f033 CR: 2a082202 XER: 00000000 > > >> [ 587.999967] IRQMASK: 0 > > >> [ 587.999967] GPR00: 0000000000000026 00007fff94f5d220 00007fff96207300 00007fff94f5d288 > > >> [ 587.999967] GPR04: 0000000172b76b70 0000000000000000 0600000000000000 0000000000000002 > > >> [ 587.999967] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > >> [ 587.999967] GPR12: 0000000000000000 00007fff94f666e0 0000000172b65930 0000000000000000 > > >> [ 587.999967] GPR16: 0000000132fe7648 00007fff95c47ce0 0000000000000000 000000000000004c > > >> [ 587.999967] GPR20: 0000000000000082 00007fff94f5e3a0 00007fff94f5e398 00007fff880029b0 > > >> [ 587.999967] GPR24: 0000000000000000 00007fff94f5e388 00007fff94f5e378 00007fff94f5e3b0 > > >> [ 587.999967] GPR28: 00007fff95c60000 00007fff95c605e0 00007fff88000c10 00007fff94f5d288 > > >> [ 588.000000] NIP [00007fff96082e20] 0x7fff96082e20 > > >> [ 588.000002] LR [00007fff95c45e24] 0x7fff95c45e24 > > >> [ 588.000004] --- interrupt: c00 > > >> [ 588.012398] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed > > >> [ 588.020328] XFS (dm-0): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x190/0x1b0 [xfs] (fs/xfs/xfs_trans.c:1098). Shutting down filesystem. > > >> [ 588.020388] XFS (dm-0): Please unmount the filesystem and rectify the problem(s) > > >> > > >> > > >> 4. Here is the patch diff which we used to collect the info. > > >> > > >> > > >> ============= > > >> > > >> root-> git diff > > >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > >> index 5808abab786c..86b8cab7f759 100644 > > >> --- a/fs/xfs/xfs_inode.c > > >> +++ b/fs/xfs/xfs_inode.c > > >> @@ -1859,8 +1859,10 @@ xfs_iunlink_update_backref( > > >> return 0; > > >> > > >> ip = xfs_iunlink_lookup(pag, next_agino); > > >> - if (!ip) > > >> + if (!ip) { > > >> + pr_err("%s: next_agino: %u cannot be found\n", __func__, next_agino); > > >> return -EFSCORRUPTED; > > >> + } > > >> ip->i_prev_unlinked = prev_agino; > > >> return 0; > > >> } > > >> @@ -1935,8 +1937,11 @@ xfs_iunlink_insert_inode( > > >> * inode. > > >> */ > > >> error = xfs_iunlink_update_backref(pag, agino, next_agino); > > >> - if (error) > > >> + if (error) { > > >> + pr_crit("%s: Cannot find backref for agino:%u, ip->i_ino:%llu, next_agino: %u agno:%u\n", > > >> + __func__, agino, ip->i_ino, next_agino, pag->pag_agno); > > >> return error; > > >> + } > > >> > > >> if (next_agino != NULLAGINO) { > > >> /* > > >> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > >> index fc61cc024023..035fc1eba871 100644 > > >> --- a/fs/xfs/xfs_log.c > > >> +++ b/fs/xfs/xfs_log.c > > >> @@ -825,6 +825,10 @@ xfs_log_mount_finish( > > >> */ > > >> mp->m_super->s_flags |= SB_ACTIVE; > > >> xfs_log_work_queue(mp); > > >> + if (xlog_recovery_needed(log)) > > >> + pr_crit("%s: Recovery needed is set\n", __func__); > > >> + else > > >> + pr_crit("%s: Recovery needed not set\n", __func__); > > >> if (xlog_recovery_needed(log)) > > >> error = xlog_recover_finish(log); > > >> mp->m_super->s_flags &= ~SB_ACTIVE; > > >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > > >> index 322eb2ee6c55..6caa8147b443 100644 > > >> --- a/fs/xfs/xfs_log_recover.c > > >> +++ b/fs/xfs/xfs_log_recover.c > > >> @@ -2696,8 +2696,13 @@ xlog_recover_iunlink_bucket( > > >> ASSERT(VFS_I(ip)->i_nlink == 0); > > >> ASSERT(VFS_I(ip)->i_mode != 0); > > >> xfs_iflags_clear(ip, XFS_IRECOVERY); > > >> - agino = ip->i_next_unlinked; > > >> > > >> + if (bucket == 13) { > > >> + pr_crit("%s: bucket: %d, agino: %u, ino: %llu, iget ret: %d, previno:%llu, prev_agino:%u\n", > > >> + __func__, bucket, agino, ip->i_ino, error, prev_ip ? prev_ip->i_ino : -1, prev_ip ? XFS_INO_TO_AGINO(mp, prev_ip->i_ino) : -1); > > >> + } > > >> + > > >> + agino = ip->i_next_unlinked; > > >> if (prev_ip) { > > >> ip->i_prev_unlinked = prev_agino; > > >> xfs_irele(prev_ip); > > >> @@ -2789,8 +2794,11 @@ xlog_recover_iunlink_ag( > > >> * bucket and remaining inodes on it unreferenced and > > >> * unfreeable. > > >> */ > > >> + pr_crit("%s: Failed in xlog_recover_iunlink_bucket %d\n", __func__, error); > > >> xfs_inodegc_flush(pag->pag_mount); > > >> xlog_recover_clear_agi_bucket(pag, bucket); > > >> + } else { > > >> + pr_crit("%s: ran xlog_recover_iunlink_bucket for agi:%u, bucket:%d\n", __func__, pag->pag_agno, bucket); > > >> } > > >> } > > >> > > >> > > >> > That evidence will guide us towards a kernel patch. > > >> > > >> I can spend some more time to debug and understand on how to fix this. > > >> But thought of sharing this info meanwhile and see if there are any > > >> pointers on how to fix this in kernel. > > >> > > >> Let me know if any other info is needed. We haven't yet run xfs_repair > > >> on the device w/o -n option. > > >> > > >> -ritesh