* Fw: 2.6.28.9: EXT3/NFS inodes corruption
@ 2009-04-22 21:24 Andrew Morton
  2009-04-22 22:44 ` Theodore Tso
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2009-04-22 21:24 UTC (permalink / raw)
  To: linux-ext4, linux-nfs; +Cc: Sylvain Rochet



Is it nfsd, or is it htree?


Begin forwarded message:

Date: Mon, 20 Apr 2009 18:20:18 +0200
From: Sylvain Rochet <gradator@gradator.net>
To: linux-kernel@vger.kernel.org
Subject: 2.6.28.9: EXT3/NFS inodes corruption


Hi,


We (TuxFamily) are having some inode corruption on an NFS server.


So, let's start with the facts.


==== NFS Server

Linux bazooka 2.6.28.9 #1 SMP Mon Mar 30 12:58:22 CEST 2009 x86_64 GNU/Linux

root@bazooka:/usr/src# grep EXT3 /boot/config-2.6.28.9 
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y

root@bazooka:/usr/src# grep NFS /boot/config-2.6.28.9 
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y

  ==> We upgraded from 2.6.26.5 to 2.6.28.9, problem's still here


/dev/md10 on /data type ext3 (rw,noatime,nodiratime,grpquota,commit=5,data=ordered)

  ==> We used data=writeback and fell back to data=ordered,
      problem's still here


# /etc/exports
/data                   *(rw,no_root_squash,async,no_subtree_check)
... and lots of exports of subdirs of /data, exported the same way


Processes related to NFS, on the NFS server.

root@bazooka:~# ps aux | grep -E '(nfsd]|lockd]|statd|mountd|idmapd|rquotad|portmap)$'
daemon    1226  0.0  0.0   4824   452 ?        Ss   Apr11   0:06 /sbin/portmap
root      1703  0.0  0.0      0     0 ?        S<   01:29   0:09 [lockd]
root      1704  0.3  0.0      0     0 ?        D<   01:29   3:29 [nfsd]
root      1705  0.3  0.0      0     0 ?        S<   01:29   3:34 [nfsd]
root      1706  0.3  0.0      0     0 ?        S<   01:29   3:32 [nfsd]
root      1707  0.3  0.0      0     0 ?        S<   01:29   3:30 [nfsd]
root      1708  0.3  0.0      0     0 ?        D<   01:29   3:43 [nfsd]
root      1709  0.3  0.0      0     0 ?        D<   01:29   3:43 [nfsd]
root      1710  0.3  0.0      0     0 ?        D<   01:29   3:39 [nfsd]
root      1711  0.3  0.0      0     0 ?        D<   01:29   3:42 [nfsd]
root      1715  0.0  0.0   5980   576 ?        Ss   01:29   0:00 /usr/sbin/rpc.mountd
statd     1770  0.0  0.0   8072   648 ?        Ss   Apr11   0:00 /sbin/rpc.statd
root      1776  0.0  0.0  23180   536 ?        Ss   Apr11   0:00 /usr/sbin/rpc.idmapd
root      1785  0.0  0.0   6148   552 ?        Ss   Apr11   0:00 /usr/sbin/rpc.rquotad

  ==> We used to run tens of nfsd daemons; we fell back to 8,
      the default, problem's still here
  ==> Some processes are in 'D' state because of a running data-check


Block device health:

Apr  3 00:28:20 bazooka kernel: md: data-check of RAID array md10
Apr  3 05:11:59 bazooka kernel: md: md10: data-check done.

Apr  5 01:06:01 bazooka kernel: md: data-check of RAID array md10
Apr  5 05:49:42 bazooka kernel: md: md10: data-check done.

Apr 20 16:27:33 bazooka kernel: md: data-check of RAID array md10

md10 : active raid6 sda[0] sdl[11] sdk[10] sdj[9] sdi[8] sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1]
      1433738880 blocks level 6, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [======>..............]  check = 30.1% (43176832/143373888) finish=208.1min speed=8020K/sec

  ==> Everything seems fine


# df -m
/dev/md10              1378166     87170   1290997   7% /data

# df -i
/dev/md10            179224576 3454822 175769754    2% /data



==== NFS Clients

6x Linux cognac 2.6.28.9-grsec #1 SMP Sun Apr 12 13:06:49 CEST 2009 i686 GNU/Linux
5x Linux martini 2.6.28.9-grsec #1 SMP Tue Apr 14 00:01:30 UTC 2009 i686 GNU/Linux
2x Linux armagnac 2.6.28.9 #1 SMP Tue Apr 14 08:59:12 CEST 2009 i686 GNU/Linux

grad@armagnac:~$ grep NFS /boot/config-2.6.28.9 
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y

  ==> We upgraded from 2.6.23.16 and 2.6.24.2 (yeah, the vmsplice
      upgrade ;-) to 2.6.28.9, problem's still here


x.x.x.x:/data/... on /data/... type nfs (rw,noexec,nosuid,nodev,async,hard,nfsvers=3,udp,intr,rsize=32768,wsize=32768,timeo=20,addr=x.x.x.x)

  ==> All NFS exports are mounted this way, sometimes with the 'sync' 
      option (e.g. web sessions).
  ==> Those are often mounted from outside of chroots into chroots, 
      probably a useless detail


Processes related to NFS, on the NFS clients.

root@cognac:~# ps aux | grep -E '(nfsd]|lockd]|statd|mountd|idmapd|rquotad|portmap)$'
daemon     349  0.0  0.0   1904   536 ?        Ss   Apr12   0:00 /sbin/portmap
statd      360  0.0  0.1   3452  1152 ?        Ss   Apr12   0:00 /sbin/rpc.statd
root      1190  0.0  0.0      0     0 ?        S<   Apr12   0:00 [lockd]



==== So, now, going into the problem

The kernel log is not really nice with us, here on the NFS Server:

Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.
Mar 22 06:47:16 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
Mar 22 06:47:16 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.
Mar 22 06:47:19 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
Mar 22 06:47:19 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.
Mar 22 06:47:19 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
Mar 22 06:47:19 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.
Mar 22 06:47:19 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
And so on...

And more recently...
Apr  2 22:19:01 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (40780223), 0
Apr  2 22:19:02 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (40491685), 0
Apr 11 07:23:02 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (174301379), 0
Apr 20 08:13:32 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (54942021), 0


Not much stuff in the kernel logs of the NFS clients (the history is 
mostly lost), but we did catch a few of them:

....................: NFS: Buggy server - nlink == 0!


== Going deeper into the problem

Something like that is quite common:

root@bazooka:/data/...# ls -la
total xxx
drwxrwx--- 2 xx    xx        4096 2009-04-20 03:48 .
drwxr-xr-x 7 root  root      4096 2007-01-21 13:15 ..
-rw-r--r-- 1 root  root         0 2009-04-20 03:48 access.log
-rw-r--r-- 1 root  root  70784145 2009-04-20 00:11 access.log.0
-rw-r--r-- 1 root  root   6347007 2009-04-10 00:07 access.log.10.gz
-rw-r--r-- 1 root  root   6866097 2009-04-09 00:08 access.log.11.gz
-rw-r--r-- 1 root  root   6410119 2009-04-08 00:07 access.log.12.gz
-rw-r--r-- 1 root  root   6488274 2009-04-07 00:08 access.log.13.gz
?--------- ?    ?     ?         ?                ? access.log.14.gz
?--------- ?    ?     ?         ?                ? access.log.15.gz
?--------- ?    ?     ?         ?                ? access.log.16.gz
?--------- ?    ?     ?         ?                ? access.log.17.gz
-rw-r--r-- 1 root  root   6950626 2009-04-02 00:07 access.log.18.gz
?--------- ?    ?     ?         ?                ? access.log.19.gz
-rw-r--r-- 1 root  root   6635884 2009-04-19 00:11 access.log.1.gz
?--------- ?    ?     ?         ?                ? access.log.20.gz
?--------- ?    ?     ?         ?                ? access.log.21.gz
?--------- ?    ?     ?         ?                ? access.log.22.gz
?--------- ?    ?     ?         ?                ? access.log.23.gz
?--------- ?    ?     ?         ?                ? access.log.24.gz
?--------- ?    ?     ?         ?                ? access.log.25.gz
?--------- ?    ?     ?         ?                ? access.log.26.gz
-rw-r--r-- 1 root  root   6616546 2009-03-24 00:07 access.log.27.gz
?--------- ?    ?     ?         ?                ? access.log.28.gz
?--------- ?    ?     ?         ?                ? access.log.29.gz
-rw-r--r-- 1 root  root   6671875 2009-04-18 00:12 access.log.2.gz
?--------- ?    ?     ?         ?                ? access.log.30.gz
-rw-r--r-- 1 root  root   6347518 2009-04-17 00:10 access.log.3.gz
-rw-r--r-- 1 root  root   6569714 2009-04-16 00:12 access.log.4.gz
-rw-r--r-- 1 root  root   7170750 2009-04-15 00:11 access.log.5.gz
-rw-r--r-- 1 root  root   6676518 2009-04-14 00:12 access.log.6.gz
-rw-r--r-- 1 root  root   6167458 2009-04-13 00:11 access.log.7.gz
-rw-r--r-- 1 root  root   5856576 2009-04-12 00:10 access.log.8.gz
-rw-r--r-- 1 root  root   6644142 2009-04-11 00:07 access.log.9.gz


root@bazooka:/data/...# cat *      # output filtered, only errors
cat: access.log.14.gz: Stale NFS file handle
cat: access.log.15.gz: Stale NFS file handle
cat: access.log.16.gz: Stale NFS file handle
cat: access.log.17.gz: Stale NFS file handle
cat: access.log.19.gz: Stale NFS file handle
cat: access.log.20.gz: Stale NFS file handle
cat: access.log.21.gz: Stale NFS file handle
cat: access.log.22.gz: Stale NFS file handle
cat: access.log.23.gz: Stale NFS file handle
cat: access.log.24.gz: Stale NFS file handle
cat: access.log.25.gz: Stale NFS file handle
cat: access.log.26.gz: Stale NFS file handle
cat: access.log.28.gz: Stale NFS file handle
cat: access.log.29.gz: Stale NFS file handle
cat: access.log.30.gz: Stale NFS file handle


"Stale NFS file handle"... on the NFS Server... hummm...


== Other facts

fsck.ext3 fixed the filesystem but didn't fix the problem.

mkfs.ext3 didn't fix the problem either.

It only concerns files which have been recently modified: logs, awstats 
hashfiles, website caches, sessions, locks, and such.

It mainly happens to files which are created on the NFS server itself, 
but it's not a hard rule.

Keeping inodes in the server's cache seems to prevent the problem from happening.
( yeah, # while true ; do ionice -c3 find /data -size +0 > /dev/null ; done )


Hummm, it seems to concern files which are quite near to each other, 
let's check that:

Let's build up an inode "database"

# find /data -printf '%i %p\n' > /root/inodesnumbers


Let's check how inodes numbers are distributed:

# cat /root/inodesnumbers | perl -e 'use Data::Dumper; my @pof; while(<>){my ( $inode ) = ( $_ =~ /^(\d+)/ ); my $hop = int($inode/1000000); $pof[$hop]++; }; for (0 .. $#pof) { print $_." = ".($pof[$_]/10000)."%\n" }'
[... lot of quite unused inodes groups]
53 = 3.0371%
54 = 26.679%     <= mailboxes
55 = 2.7026%
[... lot of quite unused inodes groups]
58 = 1.3262%
59 = 27.3211%    <= mailing lists archives
60 = 5.5159%
[... lot of quite unused inodes groups]
171 = 0.0631%
172 = 0.1063%
173 = 27.2895%   <=
174 = 44.0623%   <=
175 = 45.6783%   <= websites files
176 = 45.8247%   <=
177 = 36.9376%   <=
178 = 6.3294%
179 = 0.0442%

Hummm, all the files are using the same inode "groups".
  (groups of a million inodes)
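
(Side note: these million-inode buckets are just the script's granularity, 
not actual ext3 block groups; the real block group of a given inode can be 
derived from the superblock's inodes-per-group value. A rough sketch, using 
one of the inode numbers listed further below:)

# ipg=$(dumpe2fs -h /dev/md10 2>/dev/null | awk -F: '/Inodes per group/ {print $2+0}')
# echo $(( (174506030 - 1) / ipg ))
    (prints the block group holding inode 174506030; dumpe2fs -h only 
     reads the superblock)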

We usually fix broken folders by moving them to a quarantine folder and 
by restoring disappeared files from the backup.

So, let's check corrupted inodes number from the quarantine folder:

root@bazooka:/data/path/to/rep/of/quarantine/folders# find . -mindepth 1 -maxdepth 1 -printf '%i\n' | sort -n
174293418
174506030
174506056
174506073
174506081
174506733
174507694
174507708
174507888
174507985
174508077
174508083
176473056
176473062
176473064

Humm... those are quite near to each other 17450... 17647... and are of 
course in the most used inodes "groups"...


Open question: can NFS clients steal inode numbers from each other?


I am not sure whether my bug report is good, feel free to ask questions ;)

Best regards,
Sylvain


* Re: Fw: 2.6.28.9: EXT3/NFS inodes corruption
  2009-04-22 21:24 Fw: 2.6.28.9: EXT3/NFS inodes corruption Andrew Morton
@ 2009-04-22 22:44 ` Theodore Tso
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Tso @ 2009-04-22 22:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-ext4, linux-nfs, Sylvain Rochet

On Wed, Apr 22, 2009 at 02:24:24PM -0700, Andrew Morton wrote:
> 
> Is it nfsd, or is it htree?

Well, I see evidence in the bug report of corrupted directory data
structures, so I don't think it's an NFS problem.  I would want to
rule out hardware flakiness, though.  This could easily be caused by a
hardware problem.

> The kernel log is not really nice with us, here on the NFS Server:
> 
> Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
> Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.

Evidence of a corrupted directory entry.  We would need to look at the
directory to see whether the directory just had a few bits flipped, or
is pure garbage.  The ext3 htree code should do a better job printing
out diagnostics, and flagging the filesystem as corrupt here.

> Apr  2 22:19:02 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (40491685), 0

More evidence of a corrupted directory.

> == Going deeper into the problem
> 
> Something like that is quite common:
> 
> root@bazooka:/data/...# ls -la
> total xxx
> drwxrwx--- 2 xx    xx        4096 2009-04-20 03:48 .
> drwxr-xr-x 7 root  root      4096 2007-01-21 13:15 ..
> -rw-r--r-- 1 root  root         0 2009-04-20 03:48 access.log
> -rw-r--r-- 1 root  root  70784145 2009-04-20 00:11 access.log.0
> -rw-r--r-- 1 root  root   6347007 2009-04-10 00:07 access.log.10.gz
> -rw-r--r-- 1 root  root   6866097 2009-04-09 00:08 access.log.11.gz
> -rw-r--r-- 1 root  root   6410119 2009-04-08 00:07 access.log.12.gz
> -rw-r--r-- 1 root  root   6488274 2009-04-07 00:08 access.log.13.gz
> ?--------- ?    ?     ?         ?                ? access.log.14.gz
> ?--------- ?    ?     ?         ?                ? access.log.15.gz
> ?--------- ?    ?     ?         ?                ? access.log.16.gz

This is on the client side; what happens when you look at the same
directory from the server side?

> 
> fsck.ext3 fixed the filesystem but didn't fix the problem.
> 

What do you mean by that?  That subsequently, you started seeing
filesystem corruptions again?  Can you send me the output of
fsck.ext3?  The sorts of filesystem corruption problems which are
fixed by e2fsck are important in figuring out what is going on.

What if you run fsck.ext3 (aka e2fsck) twice?  Once after fixing
all of the problems, and then a second time afterwards.  Do the
problems stay fixed?

Suppose you try mounting the filesystem read-only; are things stable
while it is mounted read-only?

> Let's check how inodes numbers are distributed:
> 
> # cat /root/inodesnumbers | perl -e 'use Data::Dumper; my @pof; while(<>){my ( $inode ) = ( $_ =~ /^(\d+)/ ); my $hop = int($inode/1000000); $pof[$hop]++; }; for (0 .. $#pof) { print $_." = ".($pof[$_]/10000)."%\n" }'
> [... lot of quite unused inodes groups]
> 53 = 3.0371%
> 54 = 26.679%     <= mailboxes
> 55 = 2.7026%
> [... lot of quite unused inodes groups]
> 58 = 1.3262%
> 59 = 27.3211%    <= mailing lists archives
> 60 = 5.5159%
> [... lot of quite unused inodes groups]
> 171 = 0.0631%
> 172 = 0.1063%
> 173 = 27.2895%   <=
> 174 = 44.0623%   <=
> 175 = 45.6783%   <= websites files
> 176 = 45.8247%   <=
> 177 = 36.9376%   <=
> 178 = 6.3294%
> 179 = 0.0442%

Yes, that's normal.  BTW, you can get this sort of information much
more easily simply by using the "dumpe2fs" program.
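
(A minimal sketch of that, assuming a reasonably recent e2fsprogs; it only 
reads the group descriptors:)

# dumpe2fs /dev/md10 2>/dev/null | grep 'free inodes'
    (prints one summary line per block group, e.g.
     "23513 free blocks, 8181 free inodes, 2 directories")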

> We usually fix broken folders by moving them to a quarantine folder and 
> by restoring disappeared files from the backup.
> 
> So, let's check corrupted inodes number from the quarantine folder:
> 
> root@bazooka:/data/path/to/rep/of/quarantine/folders# find . -mindepth 1 -maxdepth 1 -printf '%i\n' | sort -n
> 174293418
> 174506030
> 174506056
> 174506073
> 174506081
> 174506733
> 174507694
> 174507708
> 174507888
> 174507985
> 174508077
> 174508083
> 176473056
> 176473062
> 176473064
> 
> Humm... those are quite near to each other 17450... 17647... and are of 
> course in the most used inodes "groups"...

When you say "corrupted inodes", how are they corrupted?  The errors
you showed on the server side looked like directory corruptions.  Were
these inodes directories or data files?


This really smells like a hardware problem to me; my recommendation
would be to run memory tests and also hard drive tests.  I'm going to
guess it's more likely the problem is with your hard drives as opposed
to memory --- that would be consistent with your observation that
trying to keep the inodes in memory seems to help.
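
(For the record, a rough sketch of the kind of tests meant here; sda..sdl 
are the array members from the mdstat output earlier in the thread, adjust 
as needed:)

# for d in /dev/sd[a-l]; do smartctl -t long "$d"; done
    (starts a SMART long self-test on each disk; results show up later
     with "smartctl -l selftest /dev/sdX")
# echo check > /sys/block/md10/md/sync_action
    (md-level read/compare pass, same as the periodic data-check above)
plus a memtest86+ run from boot media for the memory side.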

						- Ted


* Re: Fw: 2.6.28.9: EXT3/NFS inodes corruption
@ 2009-04-22 23:48       ` Sylvain Rochet
  0 siblings, 0 replies; 8+ messages in thread
From: Sylvain Rochet @ 2009-04-22 23:48 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andrew Morton, linux-ext4, linux-nfs


Hi,


On Wed, Apr 22, 2009 at 06:44:55PM -0400, Theodore Tso wrote:
> On Wed, Apr 22, 2009 at 02:24:24PM -0700, Andrew Morton wrote:
> > 
> > Is it nfsd, or is it htree?
> 
> Well, I see evidence in the bug report of corrupted directory data
> structures, so I don't think it's an NFS problem.  I would want to
> rule out hardware flakiness, though.  This could easily be caused by a
> hardware problem.
> 
> > The kernel log is not really nice with us, here on the NFS Server:
> > 
> > Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Unrecognised inode hash code 52
> > Mar 22 06:47:14 bazooka kernel: EXT3-fs warning (device md10): dx_probe: Corrupt dir inode 40420228, running e2fsck is recommended.
> 
> Evidence of a corrupted directory entry.  We would need to look at the
> directory to see whether the directory just had a few bits flipped, or
> is pure garbage.  The ext3 htree code should do a better job printing
> out diagnostics, and flagging the filesystem as corrupt here.
> 
> > Apr  2 22:19:02 bazooka kernel: EXT3-fs warning (device md10): ext3_unlink: Deleting nonexistent file (40491685), 0
> 
> More evidence of a corrupted directory.
> 
> > == Going deeper into the problem
> > 
> > Something like that is quite common:
> > 
> > root@bazooka:/data/...# ls -la
> > total xxx
> > drwxrwx--- 2 xx    xx        4096 2009-04-20 03:48 .
> > drwxr-xr-x 7 root  root      4096 2007-01-21 13:15 ..
> > -rw-r--r-- 1 root  root         0 2009-04-20 03:48 access.log
> > -rw-r--r-- 1 root  root  70784145 2009-04-20 00:11 access.log.0
> > -rw-r--r-- 1 root  root   6347007 2009-04-10 00:07 access.log.10.gz
> > -rw-r--r-- 1 root  root   6866097 2009-04-09 00:08 access.log.11.gz
> > -rw-r--r-- 1 root  root   6410119 2009-04-08 00:07 access.log.12.gz
> > -rw-r--r-- 1 root  root   6488274 2009-04-07 00:08 access.log.13.gz
> > ?--------- ?    ?     ?         ?                ? access.log.14.gz
> > ?--------- ?    ?     ?         ?                ? access.log.15.gz
> > ?--------- ?    ?     ?         ?                ? access.log.16.gz
> 
> This is on the client side; what happens when you look at the same
> directory from the server side?

This is on the server side ;)


> > fsck.ext3 fixed the filesystem but didn't fix the problem.
> 
> What do you mean by that?  That subsequently, you started seeing
> filesystem corruptions again?

Yes, a few days later, sorry for being unclear.


> Can you send me the output of fsck.ext3?  The sorts of filesystem 
> corruption problems which are fixed by e2fsck are important in 
> figuring out what is going on.

Unfortunately I can't, we fsck'ed it quite in a hurry, but 
/data/lost+found/ ended up well filled with orphaned blocks which 
appeared to be parts of the disappeared files.

We first thought it was a problem caused by a not-so-recent power 
outage, and that a simple fsck would fix that. But a further look at the 
cron job mails told us we were wrong ;)


> What if you run fsck.ext3 (aka e2fsck) twice?  Once after fixing
> all of the problems, and then a second time afterwards.  Do the
> problems stay fixed?

We ran fsck two times in a row, and the second check didn't find any 
mistakes. We thought, "so, it's fixed!"... erm. Actually that was one 
month ago; corruption happens from time to time, and several days to a 
week can pass without worry.


> Suppose you try mounting the filesystem read-only; are things stable
> while it is mounted read-only?

Humm, this is not easy to find out; we would have to wait at least one 
week to conclude.


> > Let's check how inodes numbers are distributed:
> > 
> > # cat /root/inodesnumbers | perl -e 'use Data::Dumper; my @pof; while(<>){my ( $inode ) = ( $_ =~ /^(\d+)/ ); my $hop = int($inode/1000000); $pof[$hop]++; }; for (0 .. $#pof) { print $_." = ".($pof[$_]/10000)."%\n" }'
> > [... lot of quite unused inodes groups]
> > 53 = 3.0371%
> > 54 = 26.679%     <= mailboxes
> > 55 = 2.7026%
> > [... lot of quite unused inodes groups]
> > 58 = 1.3262%
> > 59 = 27.3211%    <= mailing lists archives
> > 60 = 5.5159%
> > [... lot of quite unused inodes groups]
> > 171 = 0.0631%
> > 172 = 0.1063%
> > 173 = 27.2895%   <=
> > 174 = 44.0623%   <=
> > 175 = 45.6783%   <= websites files
> > 176 = 45.8247%   <=
> > 177 = 36.9376%   <=
> > 178 = 6.3294%
> > 179 = 0.0442%
> 
> Yes, that's normal.  BTW, you can get this sort of information much
> more easily simply by using the "dumpe2fs" program.

Yep, exactly.	


> > We usually fix broken folders by moving them to a quarantine folder and 
> > by restoring disappeared files from the backup.
> > 
> > So, let's check corrupted inodes number from the quarantine folder:
> > 
> > root@bazooka:/data/path/to/rep/of/quarantine/folders# find . -mindepth 1 -maxdepth 1 -printf '%i\n' | sort -n
> > 174293418
> > 174506030
> > 174506056
> > 174506073
> > 174506081
> > 174506733
> > 174507694
> > 174507708
> > 174507888
> > 174507985
> > 174508077
> > 174508083
> > 176473056
> > 176473062
> > 176473064
> > 
> > Humm... those are quite near to each other 17450... 17647... and are of 
> > course in the most used inodes "groups"...
> 
> When you say "corrupted inodes", how are they corrupted?  The errors
> you showed on the server side looked like directory corruptions.  Were
> these inodes directories or data files?

Well, these are the inode numbers of directories with entries pointing 
to nonexistent inodes; of course we cannot delete these directories 
anymore through a regular recursive deletion (well, not without 
debugfs ;). Considering the total number of inodes, this is quite a low 
corruption rate.


> This really smells like a hardware problem to me; my recommendation
> would be to run memory tests and also hard drive tests.  I'm going to
> guess it's more likely the problem is with your hard drives as opposed
> to memory --- that would be consistent with your observation that
> trying to keep the inodes in memory seems to help.

Yes, this is what we thought too, especially because we have used 
ext3/NFS for a very long time without problems like that. I moved all 
the data to the backup array so we can now do read-write tests on the 
primary one without impacting production much.


So, let's check the raid6 array, well, this is going to take a few days.

# badblocks -w -s /dev/md10


If everything goes well I will check disk by disk.
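
(A sketch of what that disk-by-disk pass could look like; -w is destructive, 
so only once the disks are no longer needed as-is, and the device names are 
just the members listed in the mdstat output above:)

# for d in /dev/sd[a-l]; do badblocks -w -s -o /root/badblocks.${d##*/}.log "$d"; done
    (write-mode test of each member, keeping a per-disk log of bad sectors)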


By the way, if such corruption doesn't happen on the backup storage 
array we can conclude it is a hardware problem around the primary one, 
but we are not going to be able to conclude for a few weeks.


Thanks Theodore, your help is appreciated ;)


Sylvain


* Re: Fw: 2.6.28.9: EXT3/NFS inodes corruption
@ 2009-04-23  0:11           ` Theodore Tso
  0 siblings, 0 replies; 8+ messages in thread
From: Theodore Tso @ 2009-04-23  0:11 UTC (permalink / raw)
  To: Sylvain Rochet; +Cc: Andrew Morton, linux-ext4, linux-nfs

On Thu, Apr 23, 2009 at 01:48:23AM +0200, Sylvain Rochet wrote:
> > 
> > This is on the client side; what happens when you look at the same
> > directory from the server side?
> 
> This is on the server side ;)
> 

On the server side, that means an inode table block also looks
corrupted.  I'm pretty sure that if you used debugfs to examine those
blocks you would have seen that the inodes were completely garbaged.
Depending on the inode size, and assuming a 4k block size, there are
typically 128 or 64 inodes in a 4k block, so if you were to look at
the inodes by inode number, you normally find that adjacent inodes are
corrupted within a 4k block.  Of course, this just tells us what had
gotten damaged, not whether it was damaged by a kernel bug, a memory bug,
a hard drive or controller failure (and there are multiple types of
storage stack failures; complete garbage getting written into the
right place, and the right data getting written into the wrong place).
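
(For illustration, a sketch of that kind of debugfs inspection; it opens the 
device read-only, and 174506030 is just one of the inode numbers from the 
quarantine list quoted earlier:)

# debugfs -R 'stat <174506030>' /dev/md10
    (dumps the on-disk inode: mode, links count, block pointers, ...)
# debugfs -R 'imap <174506030>' /dev/md10
    (shows which inode table block, and offset within it, holds that inode)
# debugfs -R 'ncheck 174506030' /dev/md10
    (lists the path(s) whose directory entries reference that inode)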

> Well, these are the inode numbers of directories with entries pointing 
> to nonexistent inodes; of course we cannot delete these directories 
> anymore through a regular recursive deletion (well, not without 
> debugfs ;). Considering the total number of inodes, this is quite a low 
> corruption rate.

Well, sure, but any amount of corruption is extremely troubling....

> Yes, this is what we thought too, especially because we have used 
> ext3/NFS for a very long time without problems like that. I moved all 
> the data to the backup array so we can now do read-write tests on the 
> primary one without impacting production much.
> 
> So, let's check the raid6 array, well, this is going to take a few days.
> 
> # badblocks -w -s /dev/md10
> 
> If everything goes well I will check disk by disk.
> 
> By the way, if such corruption doesn't happen on the backup storage 
> array we can conclude it is a hardware problem around the primary one, 
> but we are not going to be able to conclude for a few weeks.

Good luck!!

						- Ted


* Re: Fw: 2.6.28.9: EXT3/NFS inodes corruption
@ 2009-04-23 23:14               ` Sylvain Rochet
  0 siblings, 0 replies; 8+ messages in thread
From: Sylvain Rochet @ 2009-04-23 23:14 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andrew Morton, linux-ext4, linux-nfs


Hi,


On Wed, Apr 22, 2009 at 08:11:39PM -0400, Theodore Tso wrote:
> 
> On the server side, that means an inode table block also looks
> corrupted.  I'm pretty sure that if you used debugfs to examine those
> blocks you would have seen that the inodes were completely garbaged.

Yep, I destroyed all the evidence by using badblocks in read-write mode, 
but if we really need more evidence we just have to put production back 
on the primary array and wait a few days.


> Depending on the inode size, and assuming a 4k block size, there are
> typically 128 or 64 inodes in a 4k block,

4k block size
128 bytes/inode

so 32 inodes per 4k block in our case?

Since the new default is 256 bytes/inode and values of less than 128 are 
not allowed, how is it possible to store 64 or 128 inodes in a 4k block?
(Maybe I'm missing something :p)
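
(For what it's worth, the on-disk values can be double-checked with tune2fs, 
read-only; with a 4k block size and 128-byte inodes that is indeed 
4096 / 128 = 32 inodes per inode table block:)

# tune2fs -l /dev/md10 | grep -E 'Block size|Inode size'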


> so if you were to look at the inodes by inode number, you normally 
> find that adjacent inodes are corrupted within a 4k block.  Of course, 
> this just tells us what had gotten damaged, not whether it was damaged by 
> a kernel bug, a memory bug, a hard drive or controller failure (and 
> there are multiple types of storage stack failures; complete garbage 
> getting written into the right place, and the right data getting 
> written into the wrong place).

Yes, it is not going to be easy to find out what is responsible, but 
given which kind of hardware usually fails most easily, let's point at 
one of the hard drives :-)


> Well, sure, but any amount of corruption is extremely troubling....

Yep ;-)


> > By the way, if such corruption doesn't happen on the backup storage 
> > array we can conclude it is a hardware problem around the primary one, 
> > but we are not going to be able to conclude for a few weeks.
> 
> Good luck!!

Thanks, actually this isn't so bad, we enjoy having backup hardware 
(the things we always consider useless until we -really- need them -- 
"Who said 'like backups'? I heard it from the end of the room." ;-)

By the way, the badblocks check is going to take 12 days at the 
current rate. However, I ran some data checks of the raid6 array in the 
past, mainly when the filesystem was corrupted, and every check 
succeeded. Maybe the raid6 driver recomputed the parity stripes from 
already-corrupted data.


Sylvain
