* XFS reports in-memory corruption and unmounts filesystem
@ 2018-04-16 19:13 Dheeraj Sangamkar
From: Dheeraj Sangamkar @ 2018-04-16 19:13 UTC
  To: linux-xfs

Hello,

I have a few linux boxes where I see xfs error messages when the
filesystem becomes full.
I saw quite a few reports of this kind of crash but none that had
exactly the same backtrace as the one I found. So, here it is..

The kernel log:

Jan 9 20:09:33 linux-box kernel: 1,1871,248971320,-;XFS (dm-17): Internal error xfs_trans_cancel at line 1005 of file /build/src/linux-4.9.51/fs/xfs/xfs_trans.c. Caller xfs_create+0x44d/0x6c0 [xfs]
Jan 9 20:09:33 linux-box kernel: 4,1872,248985454,-;CPU: 11 PID: 27044 Comm: xxxxxx Tainted: G O 4.9.0-4-amd64 #1 Debian 4.9.51-1+ntap1
Jan 9 20:09:33 linux-box kernel: 4,1873,248994971,-;Hardware name: ..........
Jan 9 20:09:33 linux-box kernel: 4,1874,249005526,-; 0000000000000000 ffffffff99729974 ffff95c11afaae80 0000000000000001
Jan 9 20:09:33 linux-box kernel: 4,1875,249012916,-; ffffffffc0a041ed ffff95c15b407800 ffff95c1c0949000 00000000ffffffe4
Jan 9 20:09:33 linux-box kernel: 4,1876,249020305,-; ffffffffc09f70fd 0000000000000001 ffffb23f2279bbf0 0000000000000000
Jan 9 20:09:33 linux-box kernel: 4,1877,249027694,-;Call Trace:
Jan 9 20:09:33 linux-box kernel: 4,1878,249030129,-; [<ffffffff99729974>] ? dump_stack+0x5c/0x78
Jan 9 20:09:33 linux-box kernel: 4,1879,249035474,-; [<ffffffffc0a041ed>] ? xfs_trans_cancel+0xad/0xd0 [xfs]
Jan 9 20:09:33 linux-box kernel: 4,1880,249041843,-; [<ffffffffc09f70fd>] ? xfs_create+0x44d/0x6c0 [xfs]
Jan 9 20:09:33 linux-box kernel: 4,1881,249047823,-; [<ffffffff99660000>] ? load_elf_binary+0x12c0/0x1640
Jan 9 20:09:33 linux-box kernel: 4,1882,249053930,-; [<ffffffffc09f41ec>] ? xfs_generic_create+0x23c/0x2e0 [xfs]
Jan 9 20:09:33 linux-box kernel: 4,1883,249060597,-; [<ffffffff99612888>] ? path_openat+0x1338/0x1440
Jan 9 20:09:33 linux-box kernel: 4,1884,249066314,-; [<ffffffff994f6264>] ? futex_wake+0x94/0x170
Jan 9 20:09:33 linux-box kernel: 4,1885,249071682,-; [<ffffffff99613c51>] ? do_filp_open+0x91/0x100
Jan 9 20:09:33 linux-box kernel: 4,1886,249077224,-; [<ffffffff995fedba>] ? __check_object_size+0xfa/0x1d8
Jan 9 20:09:33 linux-box kernel: 4,1887,249083370,-; [<ffffffff9960162e>] ? do_sys_open+0x12e/0x210
Jan 9 20:09:33 linux-box kernel: 4,1888,249088914,-; [<ffffffff99a085bb>] ? system_call_fast_compare_end+0xc/0x9b
Jan 9 20:09:33 linux-box kernel: 5,1889,249095715,-;XFS (dm-17): xfs_do_force_shutdown(0x8) called from line 1006 of file /build/src/linux-4.9.51/fs/xfs/xfs_trans.c. Return address = 0xffffffffc0a04206
Jan 9 20:09:33 linux-box kernel: 1,1890,249110179,-;XFS (dm-17): Corruption of in-memory data detected. Shutting down filesystem
Jan 9 20:09:33 linux-box kernel: 1,1891,249118348,-;XFS (dm-17): Please umount the filesystem and rectify the problem(s)

Upon running xfs repair, I see the following:

Output of xfs_repair on the rangedb device:
root@another-linux-box:/ # xfs_repair -n /dev/sdk
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
sb_icount 4710720, counted 4711168
sb_ifree 560, counted 0
sb_fdblocks 95850, counted 8321
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
root@another-linux-box:/

Remounting the volume makes the content accessible for a while.
However, eventually, some file lookup fails with ENOENT and the
filesystem is unmounted.

I am not able to create the problem at will.

Is this problem new/fixed?
Was the corruption only in memory or on disk as well?
Why did xfs_repair not detect the corruption?

-Dheeraj



* Re: XFS reports in-memory corruption and unmounts filesystem
  2018-04-16 19:13 XFS reports in-memory corruption and unmounts filesystem Dheeraj Sangamkar
@ 2018-04-16 19:57 ` Eric Sandeen
From: Eric Sandeen @ 2018-04-16 19:57 UTC
  To: Dheeraj Sangamkar, linux-xfs



On 4/16/18 2:13 PM, Dheeraj Sangamkar wrote:
> Hello,
> 
> I have a few linux boxes where I see xfs error messages when the
> filesystem becomes full.
> I saw quite a few reports of this kind of crash but none that had
> exactly the same backtrace as the one I found. So, here it is..
> 
> The kernel log:


> Jan 9 20:09:33 linux-box kernel: 1,1871,248971320,-;XFS (dm-17): Internal error xfs_trans_cancel at line 1005 of file /build/src/linux-4.9.51/fs/xfs/xfs_trans.c. Caller xfs_create+0x44d/0x6c0 [xfs]
> Jan 9 20:09:33 linux-box kernel: 4,1872,248985454,-;CPU: 11 PID: 27044 Comm: xxxxxx Tainted: G O 4.9.0-4-amd64 #1 Debian 4.9.51-1+ntap1

Can you reproduce this on an upstream kernel?

(and can you find a way to not wrap your emails so stuff like the below is readable) ;)

> Jan 9 20:09:33 linux-box kernel: 4,1873,248994971,-;Hardware name: ..........
> Jan 9 20:09:33 linux-box kernel: 4,1874,249005526,-; 0000000000000000 ffffffff99729974 ffff95c11afaae80 0000000000000001
> Jan 9 20:09:33 linux-box kernel: 4,1875,249012916,-; ffffffffc0a041ed ffff95c15b407800 ffff95c1c0949000 00000000ffffffe4
> Jan 9 20:09:33 linux-box kernel: 4,1876,249020305,-; ffffffffc09f70fd 0000000000000001 ffffb23f2279bbf0 0000000000000000
> Jan 9 20:09:33 linux-box kernel: 4,1877,249027694,-;Call Trace:
> Jan 9 20:09:33 linux-box kernel: 4,1878,249030129,-; [<ffffffff99729974>] ? dump_stack+0x5c/0x78
> Jan 9 20:09:33 linux-box kernel: 4,1879,249035474,-; [<ffffffffc0a041ed>] ? xfs_trans_cancel+0xad/0xd0 [xfs]
> Jan 9 20:09:33 linux-box kernel: 4,1880,249041843,-; [<ffffffffc09f70fd>] ? xfs_create+0x44d/0x6c0 [xfs]
> Jan 9 20:09:33 linux-box kernel: 4,1881,249047823,-; [<ffffffff99660000>] ? load_elf_binary+0x12c0/0x1640
> Jan 9 20:09:33 linux-box kernel: 4,1882,249053930,-; [<ffffffffc09f41ec>] ? xfs_generic_create+0x23c/0x2e0 [xfs]
> Jan 9 20:09:33 linux-box kernel: 4,1883,249060597,-; [<ffffffff99612888>] ? path_openat+0x1338/0x1440
> Jan 9 20:09:33 linux-box kernel: 4,1884,249066314,-; [<ffffffff994f6264>] ? futex_wake+0x94/0x170
> Jan 9 20:09:33 linux-box kernel: 4,1885,249071682,-; [<ffffffff99613c51>] ? do_filp_open+0x91/0x100
> Jan 9 20:09:33 linux-box kernel: 4,1886,249077224,-; [<ffffffff995fedba>] ? __check_object_size+0xfa/0x1d8
> Jan 9 20:09:33 linux-box kernel: 4,1887,249083370,-; [<ffffffff9960162e>] ? do_sys_open+0x12e/0x210
> Jan 9 20:09:33 linux-box kernel: 4,1888,249088914,-; [<ffffffff99a085bb>] ? system_call_fast_compare_end+0xc/0x9b
> Jan 9 20:09:33 linux-box kernel: 5,1889,249095715,-;XFS (dm-17): xfs_do_force_shutdown(0x8) called from line 1006 of file /build/src/linux-4.9.51/fs/xfs/xfs_trans.c. Return address = 0xffffffffc0a04206
> Jan 9 20:09:33 linux-box kernel: 1,1890,249110179,-;XFS (dm-17): Corruption of in-memory data detected. Shutting down filesystem
> Jan 9 20:09:33 linux-box kernel: 1,1891,249118348,-;XFS (dm-17): Please umount the filesystem and rectify the problem(s)

Ok, this is actually canceling a dirty transaction, which is the root of the problem.
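As a rough illustration of why cancelling a dirty transaction ends in a shutdown, here is a toy Python model. It is only a sketch: ToyMount, ToyTransaction and create_file are hypothetical names, not kernel interfaces, and the real logic lives in xfs_trans_cancel() in fs/xfs/xfs_trans.c.

```python
import errno


class ToyMount:
    """Stands in for a mounted filesystem that can be shut down."""

    def __init__(self):
        self.shut_down = False
        self.reason = None

    def force_shutdown(self, reason):
        self.shut_down = True
        self.reason = reason


class ToyTransaction:
    """Hypothetical, heavily simplified transaction."""

    def __init__(self, mount):
        self.mount = mount
        self.dirty = False

    def log_change(self):
        # Once any in-memory metadata has been modified, the transaction is dirty.
        self.dirty = True

    def cancel(self):
        # A clean transaction can simply be dropped. A dirty one has already
        # changed in-memory state that cannot be rolled back, so the model
        # stops the filesystem instead of continuing with inconsistent state.
        if self.dirty and not self.mount.shut_down:
            self.mount.force_shutdown("cancelled a dirty transaction")


def create_file(mount, fail_after_first_change):
    """Toy create: returns 0, or a negative errno if a later step fails."""
    tp = ToyTransaction(mount)
    tp.log_change()                  # e.g. the new inode has been set up
    if fail_after_first_change:
        tp.cancel()                  # a later step ran out of space
        return -errno.ENOSPC
    return 0
```

In this model a cancel is harmless only while nothing has been logged; the failure in the report corresponds to the second path in create_file.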

> Upon running xfs repair, I see the following:
> 
> Output of xfs_repair on the rangedb device:
> root@another-linux-box:/ # xfs_repair -n /dev/sdk
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
> sb_icount 4710720, counted 4711168
> sb_ifree 560, counted 0
> sb_fdblocks 95850, counted 8321

I'm going to guess that this might have a dirty log, and you should
mount/umount it before running repair, and that if you do so you'll
see no corruption here.
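A minimal Python sketch of that procedure, under stated assumptions: replay_then_check_commands and counter_mismatches are hypothetical helper names, and the mount point is whatever empty directory you choose. The first helper only builds the command lines (mount to replay the log, unmount, then a no-modify xfs_repair); the second tabulates any "sb_... N, counted M" lines so a before/after comparison is easy.

```python
import re

# Matches lines such as "sb_icount 4710720, counted 4711168".
COUNTER_RE = re.compile(r"^(sb_\w+) (\d+), counted (\d+)$")


def replay_then_check_commands(device, mountpoint):
    """Commands for: mount (replays the log), unmount, read-only check."""
    return [
        ["mount", device, mountpoint],
        ["umount", mountpoint],
        ["xfs_repair", "-n", device],   # -n: no-modify mode, report only
    ]


def counter_mismatches(repair_output):
    """Return {field: (superblock_value, counted_value, counted - superblock)}."""
    result = {}
    for line in repair_output.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            sb, counted = int(m.group(2)), int(m.group(3))
            result[m.group(1)] = (sb, counted, counted - sb)
    return result


# The three counter lines from the report above.
SAMPLE = """\
sb_icount 4710720, counted 4711168
sb_ifree 560, counted 0
sb_fdblocks 95850, counted 8321
"""
```

Each command list can be passed to subprocess.run() as root. If the guess about the dirty log is right, counter_mismatches() on the second xfs_repair -n run should come back empty.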

>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - agno = 4
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         - agno = 3
>         - agno = 2
>         - agno = 1
>         - agno = 4
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> No modify flag set, skipping filesystem flush and exiting.
> root@another-linux-box:/
> 
> Remounting the volume makes the content accessible for a while.
> However, eventually, some file lookup fails with ENOENT and the
> filesystem is unmounted.
> 
> I am not able to create the problem at will.
> 
> Is this problem new/fixed?

Maybe with 

commit f59cf5c29919d17b61913c3360a7bd29b72975c1
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Dec 4 17:32:55 2017 -0800

    xfs: remove "no-allocation" reservations for file creations
    
    If we create a new file we will need an inode, and usually some metadata
    in the parent directory.  Aiming for everything to go well despite the
    lack of a reservation leads to dirty transactions cancelled under a heavy
    create/delete load.  This patch removes those nospace transactions, which
    will lead to slightly earlier ENOSPC on some workloads, but instead
    prevent file system shutdowns due to cancelling dirty transactions for
    others.

but honestly, debugging 2-year-old kernels is more a question for your distro
than for upstream...
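The trade-off that commit message describes can be sketched with a toy Python model. Everything here is an assumption made for illustration: ToyFs, create and the allow_nospace_fallback switch are hypothetical, and the block counts are made up. With the fallback enabled, a create that cannot get its reservation proceeds anyway and may run out of space after it has already modified metadata. Without it, the create fails with ENOSPC before anything is touched.

```python
import errno


class ToyFs:
    """Hypothetical filesystem with a free-block counter."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.shut_down = False


def create(fs, inode_blocks, dir_blocks, allow_nospace_fallback):
    """Toy file create; returns 0 or a negative errno."""
    needed = inode_blocks + dir_blocks
    if fs.free_blocks >= needed:
        fs.free_blocks -= needed        # reservation succeeded up front
        return 0
    if not allow_nospace_fallback:
        return -errno.ENOSPC            # early failure, nothing modified yet

    # Fallback: carry on with no reservation and hope each step fits.
    dirty = False
    for step in (inode_blocks, dir_blocks):
        if step > fs.free_blocks:
            if dirty:
                fs.shut_down = True     # a dirty transaction had to be cancelled
            return -errno.ENOSPC
        fs.free_blocks -= step
        dirty = True                    # this step modified metadata
    return 0
```

With one free block and a create that needs two, the fallback path returns ENOSPC and shuts the toy filesystem down, while the no-fallback path returns ENOSPC and leaves it untouched.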

> Was the corruption only in memory or on disk as well?

It's not actually corruption; that's a poorly worded error message, TBH.
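To see where that wording comes from, the flag in the xfs_do_force_shutdown(0x8) line can be decoded. The sketch below is in Python, and decode_shutdown is a hypothetical helper. The flag names and values are written from memory of fs/xfs/xfs_mount.h in kernels of this era, so treat them as an assumption and check them against your own tree.

```python
import re

# Assumed values; verify against fs/xfs/xfs_mount.h for the kernel in use.
SHUTDOWN_FLAGS = {
    0x0001: "SHUTDOWN_META_IO_ERROR",
    0x0002: "SHUTDOWN_LOG_IO_ERROR",
    0x0004: "SHUTDOWN_FORCE_UMOUNT",
    0x0008: "SHUTDOWN_CORRUPT_INCORE",
    0x0010: "SHUTDOWN_REMOTE_REQ",
    0x0020: "SHUTDOWN_DEVICE_REQ",
}

SHUTDOWN_RE = re.compile(r"xfs_do_force_shutdown\((0x[0-9a-fA-F]+)\)")


def decode_shutdown(log_line):
    """Return the names of the shutdown flags found in a kernel log line."""
    m = SHUTDOWN_RE.search(log_line)
    if m is None:
        return []
    flags = int(m.group(1), 16)
    return [name for bit, name in sorted(SHUTDOWN_FLAGS.items()) if flags & bit]
```

Under those assumed values, 0x8 decodes to SHUTDOWN_CORRUPT_INCORE, the case for which the kernel prints "Corruption of in-memory data detected", whether or not anything on disk is damaged.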

> Why did xfs_repair not detect the corruption?

Because there is no corruption on the disk.

-Eric
