xfs_iflush_int: Bad inode, xfs_do_force_shutdown from xfs_inode.c during file copy

* xfs_iflush_int: Bad inode, xfs_do_force_shutdown from xfs_inode.c during file copy
@ 2014-05-03 19:49 Marcel Giannelia
  2014-05-04  0:17 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Marcel Giannelia @ 2014-05-03 19:49 UTC (permalink / raw)
  To: xfs

Hi,

I just had the following happen on a server that's never had any
previous issues with XFS:

I was copying an 8 GB file onto an XFS filesystem, when the copy
aborted with an I/O error message and I was forced to unmount and
xfs_repair the filesystem before it would mount again. Relevant dmesg
messages below.

Some other information that might be relevant:

- Distribution & kernel version: Debian 7, uname -a returns:

Linux hostname 3.2.0-4-686-pae #1 SMP Debian 3.2.41-2+deb7u2 i686 GNU/Linux

- This filesystem is on a software (mdadm) RAID 1 array, backed by
  2 disks.

- The drives underneath the RAID reported no errors, and there were no
  errors from either the RAID layer or the disk layer in dmesg. One hard
  drive shows no change in SMART status; the other one's reallocated
  sector count has increased by 1 since the last time I checked (1
  month ago), but there are no logged errors.

- The filesystem is 174 GB and contains 2.5 million (i.e. 2484996)
  files, and had 20 GB of free space. A day or two ago it contained
  around 4 million files and was almost completely full (2 GB free
  space), and the most recent activity before this problem was the
  deletion of about 1.5 million files. (Over 24 hours passed between
  deletion of these files and the attempt to copy the 8 GB file.)

- The file I was copying was a virtual disk image from a virtual
  machine that I had shut down about 5 seconds before starting the 
  copy, so it is likely that some of the file's data had not yet
  been committed to disk on the source filesystem, and it is also
  possible that the source file's metadata and contents
  (but not size) changed during the copy. The source filesystem is
  also xfs.

- A few weeks ago, one of the disks in the RAID 1 array failed and I
  replaced it with two disks, bringing the number of disks in the array
  up to 3. This seemed to cause performance issues (writes took too
  long to complete, causing disk timeouts in the virtual machines), so
  the third disk was removed and the array was changed back to 2 disks.

- I have never seen memory issues in this server before (I ran an
  overnight memtest when it was first set up, but that was a year ago),
  and after this error I ran the userspace program "memtester", as I
  cannot take the server offline. This is a fairly limited test (I was
  only able to test 1 GB of free memory), but it found no problems.

- The problem has not recurred -- after doing xfs_repair and remounting
  the filesystem, I was able to copy that same file without issues. I
  since have also copied another virtual machine disk of the same size,
  while the VM was running, without problems.

Output of xfs_info for the filesystem:

meta-data=/dev/md126             isize=256    agcount=4, agsize=11375416 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=45501664, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=22217, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

dmesg entries:

> Immediately after the cp command exited with "i/o error":

XFS (md126): xfs_iflush_int: Bad inode 939480132, ptr 0xd12fa080, magic number 0x494d
XFS (md126): xfs_do_force_shutdown(0x8) called from line 2606 of file /build/buildd-linux_3.2.41-2+deb7u2-i386-G4jjsr/linux-3.2.41/fs/xfs/xfs_inode.c.  Return address = 0xf88319b4
XFS (md126): Corruption of in-memory data detected.  Shutting down filesystem
XFS (md126): Please umount the filesystem and rectify the problem(s)
Buffer I/O error on device md126, logical block 5673596
lost page write due to I/O error on md126

(about 10 more of the "Buffer I/O error / lost page" messages)

> I then unmounted the filesystem, which caused these lines in dmesg:

XFS (md126): xfs_log_force: error 5 returned.
XFS (md126): xfs_log_force: error 5 returned.
XFS (md126): xfs_log_force: error 5 returned.
XFS (md126): xfs_do_force_shutdown(0x1) called from line 1033 of file /build/buildd-linux_3.2.41-2+deb7u2-i386-G4jjsr/linux-3.2.41/fs/xfs/xfs_buf.c.  Return address = 0xf87fd1a7
XFS (md126): xfs_log_force: error 5 returned.
XFS (md126): xfs_log_force: error 5 returned.

> I then tried to mount the filesystem again and got:

XFS (md126): Mounting Filesystem
XFS (md126): Starting recovery (logdev: internal)
XFS (md126): xlog_recover_inode_pass2: Bad inode log record, rec ptr 0xe153d720, ino 939480132
XFS (md126): Internal error xlog_recover_inode_pass2(2) at line 2259 of file /build/buildd-linux_3.2.41-2+deb7u2-i386-G4jjsr/linux-3.2.41/fs/xfs/xfs_log_recover.c.  Caller 0xf8835e66

Pid: 30323, comm: mount Tainted: G           O 3.2.0-4-686-pae #1 Debian 3.2.41-2+deb7u2
Call Trace:
 [<f8834fb9>] ? xlog_recover_inode_pass2+0x206/0x60a [xfs]
 [<f8835e66>] ? xlog_recover_commit_trans+0x5f/0x80 [xfs]
 [<f8835e66>] ? xlog_recover_commit_trans+0x5f/0x80 [xfs]
 [<f8835fb8>] ? xlog_recover_process_data+0x131/0x1bd [xfs]
 [<f8836277>] ? xlog_do_recovery_pass+0x233/0x57b [xfs]
 [<c1038fa5>] ? vprintk+0x3ae/0x3df
 [<f8836636>] ? xlog_do_log_recovery+0x77/0x92 [xfs]
 [<f8836663>] ? xlog_do_recover+0x12/0xab [xfs]
 [<f8836876>] ? xlog_recover+0x6d/0x79 [xfs]
 [<f883cb53>] ? xfs_log_mount+0xab/0x109 [xfs]
 [<f8838c30>] ? xfs_mountfs+0x2b6/0x501 [xfs]
 [<f8806ff8>] ? xfs_mru_cache_create+0xeb/0x112 [xfs]
 [<f8808839>] ? xfs_fs_fill_super+0x145/0x205 [xfs]
 [<c10ce843>] ? mount_bdev+0x11f/0x16d
 [<f88086f4>] ? xfs_parseargs+0x8da/0x8da [xfs]
 [<c10a0036>] ? move_active_pages_to_lru+0xab/0xde
 [<c10c1536>] ? __kmalloc_track_caller+0x9b/0xa7
 [<f88071f3>] ? xfs_fs_mount+0xe/0x11 [xfs]
 [<f88086f4>] ? xfs_parseargs+0x8da/0x8da [xfs]
 [<c10cef2b>] ? mount_fs+0x55/0x122
 [<c10df139>] ? vfs_kern_mount+0x4a/0x77
 [<c10df459>] ? do_kern_mount+0x2f/0xac
 [<c10e0763>] ? do_mount+0x5d0/0x61e
 [<c12c1631>] ? _cond_resched+0x5/0x18
 [<c10a6b88>] ? memdup_user+0x26/0x43
 [<c10e09ea>] ? sys_mount+0x67/0x96
 [<c12c6e1f>] ? sysenter_do_call+0x12/0x28
XFS (md126): log mount/recovery failed: error 117
XFS (md126): log mount failed

After this, I ran xfs_repair with -L. xfs_repair noted the same bad inode
number and deleted the file I had tried to copy, but otherwise made no changes
that I could see. After this, the filesystem mounted normally and there were no
further issues.

~Felix.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 4+ messages in thread