* More ext4 acl/xattr corruption - 4th occurence now @ 2009-05-13 6:26 Kevin Shanahan 2009-05-13 23:56 ` Kevin Shanahan 2009-05-14 4:40 ` Theodore Tso 0 siblings, 2 replies; 35+ messages in thread From: Kevin Shanahan @ 2009-05-13 6:26 UTC (permalink / raw) To: linux-ext4 [-- Attachment #1: Type: text/plain, Size: 6385 bytes --] Hi, This fs really doesn't like whatever I am doing with it. Here is another instance of corruption, again with the first symptom showing up as corrupted acls and xattrs as the backup script crawls over the directory structure: getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK: Input/output error getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.IF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK: Input/output error getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF: Input/output error rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK" failed: Input/output error (5) rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF" failed: Input/output error (5) skipping non-regular file "apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.IF" Again, here is the little breadcrumb left at the end of dmesg: init_special_inode: bogus i_mode (36545) attempt to access beyond end of device dm-0: rw=0, want=256153956122632, limit=2147483648 attempt to access beyond end of device dm-0: rw=0, want=256153956122632, limit=2147483648 And following the same formula as last time(s): hermes:~# debugfs /dev/dm-0 debugfs: stat "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" invalid inode->i_extra_isize (8224) Inode: 2542 Type: bad type Mode: 0043 Flags: 0x5849462f Generation: 538970637 Version: 0x66663030:535f4445 User: 538980401 Group: 538993001 Size: 996566576 File ACL: 538976288 Directory ACL: 0 Links: 8812 Blockcount: 35322822674750 Fragment: Address: 538976288 Number: 0 Size: 0 ctime: 0x41462d54:65636166 -- Tue Sep 14 08:59:24 2004 atime: 0x4e4f4620:63206c61 -- Sat Aug 20 14:59:04 2011 mtime: 0x594c494d:6972413d -- Fri Jun 23 08:18:45 2017 crtime: 0x726f6c6f:3138233d -- Sun Nov 3 13:24:39 2030 dtime: 0x7241203a -- Sun Sep 29 09:35:14 2030 Size of extra inode fields: 8224 BLOCKS: debugfs: imap "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" Inode 2542 is part of block group 0 located at block 447, offset 0x0d00 hermes:~# dd if=/dev/dm-0 of=block-447.dump bs=4k skip=447 count=1 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0121164 s, 338 kB/s Now, this is (possibly) interesting - that block contains a bunch of file data. Looks like a html email (I can tell it's email because of the FIXED_ prefix added to the tags by the mail sanitizer). If I can locate the source of that data, perhaps it will point to where the corruption is coming from? Any tips on scanning for the data? I'll start with simple find and grep and see how far I get. Regards, Kevin. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: block-447.dump --] [-- Type: text/plain; charset=utf-8, Size: 4140 bytes --] ; COLOR: #8100ff; FONT-FAMILY: Arial"><BR>1180. Kamilla Leonczuk, </FIXED_SPAN></FONT></B><FIXED_ST1:PLACE><FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE:\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0²\x0f\b\b\0\0\0ªì©\x1f\x1d\x02\0³- \0\0\0\0\0\x1e\x1d\x02\0 \x1d\x02\0]\0\0\0\x16\0\0\0º\x0f\x02\0\b\0\0\0\0\x15´\0\x06\x10J\b\0\0\0 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">Morgan Hill</FIXED_SPAN></FONT></B></FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"> , </FIXED_SPAN></FONT></B><FIXED_ST1:STATE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">CA</FIXED_SPAN></FONT></B></FIXED_ST1:STATE></FIXED_ST1:PLACE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"><BR>1181 Edmund Perry, </FIXED_SPAN></FONT></B><FIXED_ST1:PLACE><FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">MountainView</FIXED_SPAN></FONT></B></FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"> , </FIXED_SPAN></FONT></B><FIXED_ST1:STATE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">CA</FIXED_SPAN></FONT></B></FIXED_ST1:STATE></FIXED_ST1:PLACE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"><BR>1182 Sheri Simmen, </FIXED_SPAN></FONT></B><FIXED_ST1:PLACE><FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">Morton</FIXED_SPAN></FONT></B></FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"> , </FIXED_SPAN></FONT></B><FIXED_ST1:STATE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">IL</FIXED_SPAN></FONT></B></FIXED_ST1:STATE></FIXED_ST1:PLACE><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial"><BR>1183. Cami Ross - </FIXED_SPAN></FONT></B><FIXED_ST1:PLACE><FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN style="FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #8100ff; FONT-FAMILY: Arial">Peoria</FIXED_SPAN></FONT></B></FIXED_ST1:CITY><B><FONT face=Arial color=#8100ff size=2><FIXED_SPAN ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-13 6:26 More ext4 acl/xattr corruption - 4th occurence now Kevin Shanahan @ 2009-05-13 23:56 ` Kevin Shanahan 2009-05-14 4:40 ` Theodore Tso 1 sibling, 0 replies; 35+ messages in thread From: Kevin Shanahan @ 2009-05-13 23:56 UTC (permalink / raw) To: linux-ext4 On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote: > And following the same formula as last time(s): > > hermes:~# debugfs /dev/dm-0 > debugfs: stat "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" > invalid inode->i_extra_isize (8224) > Inode: 2542 Type: bad type Mode: 0043 Flags: 0x5849462f > Generation: 538970637 Version: 0x66663030:535f4445 > User: 538980401 Group: 538993001 Size: 996566576 > File ACL: 538976288 Directory ACL: 0 > Links: 8812 Blockcount: 35322822674750 > Fragment: Address: 538976288 Number: 0 Size: 0 > ctime: 0x41462d54:65636166 -- Tue Sep 14 08:59:24 2004 > atime: 0x4e4f4620:63206c61 -- Sat Aug 20 14:59:04 2011 > mtime: 0x594c494d:6972413d -- Fri Jun 23 08:18:45 2017 > crtime: 0x726f6c6f:3138233d -- Sun Nov 3 13:24:39 2030 > dtime: 0x7241203a -- Sun Sep 29 09:35:14 2030 > Size of extra inode fields: 8224 > BLOCKS: > > debugfs: imap "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" > Inode 2542 is part of block group 0 > located at block 447, offset 0x0d00 > > hermes:~# dd if=/dev/dm-0 of=block-447.dump bs=4k skip=447 count=1 > 1+0 records in > 1+0 records out > 4096 bytes (4.1 kB) copied, 0.0121164 s, 338 kB/s > > Now, this is (possibly) interesting - that block contains a bunch of > file data. Looks like a html email (I can tell it's email because of > the FIXED_ prefix added to the tags by the mail sanitizer). > > If I can locate the source of that data, perhaps it will point to > where the corruption is coming from? Any tips on scanning for the > data? I'll start with simple find and grep and see how far I get. I didn't find this in any file in the current directory structure. I guess it could be old data that hadn't been zeroed out. However, the only binary data I can see seems to be at offset 0x0155 (341), nowhere near offset 0x0d00 (unless I misunderstand the imap output above. Does that little blob of binary data make any sense as an inode? Cheers, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-13 6:26 More ext4 acl/xattr corruption - 4th occurence now Kevin Shanahan 2009-05-13 23:56 ` Kevin Shanahan @ 2009-05-14 4:40 ` Theodore Tso 2009-05-14 11:07 ` Kevin Shanahan 2009-05-15 4:58 ` Aneesh Kumar K.V 1 sibling, 2 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-14 4:40 UTC (permalink / raw) To: Kevin Shanahan; +Cc: linux-ext4 On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote: > > Now, this is (possibly) interesting - that block contains a bunch of > file data. Looks like a html email (I can tell it's email because of > the FIXED_ prefix added to the tags by the mail sanitizer). That is interesting. Knowing that it is file data means we might be able to track things down this way. Are you able to apply the following patch to your kernel? If so, hopefully we'll be able to catch whatever is causing the problem in the act. - Ted commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13 Author: Theodore Ts'o <tytso@mit.edu> Date: Thu May 14 00:39:48 2009 -0400 ext4: Add a block validity check to ext4_get_blocks_wrap() A few users with very large disks have been reporting low block number filesystem corruptions, potentially zapping the block group descriptors or inodes in the first inode table block. It's not clear what is causing this, but most recently, it appears that whatever is trashing the filesystem metadata appears to be file data. So let's try to set a trap for the corruption in ext4_get_blocks_wrap(), which is where logical blocks in an inode are mapped to physical blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 4e7f363..4fad4fa 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) ext4_discard_preallocations(inode); } +static int check_block_validity(struct inode *inode, sector_t logical, + sector_t phys, int len) +{ + ext4_fsblk_t valid_block, itable_block; + struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; + struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0); + unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group; + + valid_block = le32_to_cpu(es->s_first_data_block) + + EXT4_SB(inode->i_sb)->s_gdb_count; + itable_block = ext4_inode_table(inode->i_sb, gdp); + + if (unlikely((phys <= valid_block) || + ((phys + len - 1) > ext4_blocks_count(es)) || + ((phys >= itable_block) && + (phys <= itable_block + itable_len)) || + ((phys + len - 1 >= itable_block) && + (phys + len - 1 <= itable_block + itable_len)))) { + ext4_error(inode->i_sb, "check_block_validity", + "inode #%lu logical block %llu mapped to %llu " + "(size %d)", inode->i_ino, + (unsigned long long) logical, + (unsigned long long) phys, len); + WARN_ON(1); + return -EIO; + } + return 0; +} + /* * The ext4_get_blocks_wrap() function try to look up the requested blocks, * and returns if the blocks are already mapped. @@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, } up_read((&EXT4_I(inode)->i_data_sem)); + if (retval > 0 && buffer_mapped(bh)) { + int ret = check_block_validity(inode, block, + bh->b_blocknr, retval); + if (ret != 0) + return ret; + } + /* If it is only a block(s) look up */ if (!create) return retval; @@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, } up_write((&EXT4_I(inode)->i_data_sem)); + if (retval > 0 && buffer_mapped(bh)) { + int ret = check_block_validity(inode, block, + bh->b_blocknr, retval); + if (ret != 0) + return ret; + } return retval; } ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 4:40 ` Theodore Tso @ 2009-05-14 11:07 ` Kevin Shanahan 2009-05-14 11:17 ` Manish Katiyar 2009-05-14 13:25 ` Kevin Shanahan 2009-05-15 4:58 ` Aneesh Kumar K.V 1 sibling, 2 replies; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 11:07 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote: > On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote: > > > > Now, this is (possibly) interesting - that block contains a bunch of > > file data. Looks like a html email (I can tell it's email because of > > the FIXED_ prefix added to the tags by the mail sanitizer). > > That is interesting. Knowing that it is file data means we might be > able to track things down this way. > > Are you able to apply the following patch to your kernel? If so, > hopefully we'll be able to catch whatever is causing the problem in > the act. Sure - now running with 2.6.29.3 + your patch. patching file fs/ext4/inode.c Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines). Hunk #2 succeeded at 1113 (offset -81 lines). Hunk #3 succeeded at 1184 (offset -93 lines). I'll report any hits for "check_block_validity" in syslog. Thankyou, much appreciated. Cheers, Kevin. > commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 00:39:48 2009 -0400 > > ext4: Add a block validity check to ext4_get_blocks_wrap() > > A few users with very large disks have been reporting low block number > filesystem corruptions, potentially zapping the block group > descriptors or inodes in the first inode table block. It's not clear > what is causing this, but most recently, it appears that whatever is > trashing the filesystem metadata appears to be file data. So let's > try to set a trap for the corruption in ext4_get_blocks_wrap(), which > is where logical blocks in an inode are mapped to physical blocks. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 4e7f363..4fad4fa 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) > ext4_discard_preallocations(inode); > } > > +static int check_block_validity(struct inode *inode, sector_t logical, > + sector_t phys, int len) > +{ > + ext4_fsblk_t valid_block, itable_block; > + struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; > + struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0); > + unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group; > + > + valid_block = le32_to_cpu(es->s_first_data_block) + > + EXT4_SB(inode->i_sb)->s_gdb_count; > + itable_block = ext4_inode_table(inode->i_sb, gdp); > + > + if (unlikely((phys <= valid_block) || > + ((phys + len - 1) > ext4_blocks_count(es)) || > + ((phys >= itable_block) && > + (phys <= itable_block + itable_len)) || > + ((phys + len - 1 >= itable_block) && > + (phys + len - 1 <= itable_block + itable_len)))) { > + ext4_error(inode->i_sb, "check_block_validity", > + "inode #%lu logical block %llu mapped to %llu " > + "(size %d)", inode->i_ino, > + (unsigned long long) logical, > + (unsigned long long) phys, len); > + WARN_ON(1); > + return -EIO; > + } > + return 0; > +} > + > /* > * The ext4_get_blocks_wrap() function try to look up the requested blocks, > * and returns if the blocks are already mapped. > @@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > } > up_read((&EXT4_I(inode)->i_data_sem)); > > + if (retval > 0 && buffer_mapped(bh)) { > + int ret = check_block_validity(inode, block, > + bh->b_blocknr, retval); > + if (ret != 0) > + return ret; > + } > + > /* If it is only a block(s) look up */ > if (!create) > return retval; > @@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > } > > up_write((&EXT4_I(inode)->i_data_sem)); > + if (retval > 0 && buffer_mapped(bh)) { > + int ret = check_block_validity(inode, block, > + bh->b_blocknr, retval); > + if (ret != 0) > + return ret; > + } > return retval; > } > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 11:07 ` Kevin Shanahan @ 2009-05-14 11:17 ` Manish Katiyar 2009-05-14 12:30 ` Theodore Tso 2009-05-14 13:25 ` Kevin Shanahan 1 sibling, 1 reply; 35+ messages in thread From: Manish Katiyar @ 2009-05-14 11:17 UTC (permalink / raw) To: Kevin Shanahan; +Cc: Theodore Tso, linux-ext4 On Thu, May 14, 2009 at 4:37 PM, Kevin Shanahan <kmshanah@ucwb.org.au> wrote: > On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote: >> On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote: >> > >> > Now, this is (possibly) interesting - that block contains a bunch of >> > file data. Looks like a html email (I can tell it's email because of >> > the FIXED_ prefix added to the tags by the mail sanitizer). >> >> That is interesting. Knowing that it is file data means we might be >> able to track things down this way. >> >> Are you able to apply the following patch to your kernel? Hi Ted, If we have the possibility of applying patches why not apply this too and try http://markmail.org/message/e3uwzxrd42isxczg . After all it originated from this bug only. Thanks - Manish >> If so, >> hopefully we'll be able to catch whatever is causing the problem in >> the act. > > Sure - now running with 2.6.29.3 + your patch. > > patching file fs/ext4/inode.c > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines). > Hunk #2 succeeded at 1113 (offset -81 lines). > Hunk #3 succeeded at 1184 (offset -93 lines). > > I'll report any hits for "check_block_validity" in syslog. > > Thankyou, much appreciated. > > Cheers, > Kevin. > >> commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13 >> Author: Theodore Ts'o <tytso@mit.edu> >> Date: Thu May 14 00:39:48 2009 -0400 >> >> ext4: Add a block validity check to ext4_get_blocks_wrap() >> >> A few users with very large disks have been reporting low block number >> filesystem corruptions, potentially zapping the block group >> descriptors or inodes in the first inode table block. It's not clear >> what is causing this, but most recently, it appears that whatever is >> trashing the filesystem metadata appears to be file data. So let's >> try to set a trap for the corruption in ext4_get_blocks_wrap(), which >> is where logical blocks in an inode are mapped to physical blocks. >> >> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 4e7f363..4fad4fa 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) >> ext4_discard_preallocations(inode); >> } >> >> +static int check_block_validity(struct inode *inode, sector_t logical, >> + sector_t phys, int len) >> +{ >> + ext4_fsblk_t valid_block, itable_block; >> + struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; >> + struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0); >> + unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group; >> + >> + valid_block = le32_to_cpu(es->s_first_data_block) + >> + EXT4_SB(inode->i_sb)->s_gdb_count; >> + itable_block = ext4_inode_table(inode->i_sb, gdp); >> + >> + if (unlikely((phys <= valid_block) || >> + ((phys + len - 1) > ext4_blocks_count(es)) || >> + ((phys >= itable_block) && >> + (phys <= itable_block + itable_len)) || >> + ((phys + len - 1 >= itable_block) && >> + (phys + len - 1 <= itable_block + itable_len)))) { >> + ext4_error(inode->i_sb, "check_block_validity", >> + "inode #%lu logical block %llu mapped to %llu " >> + "(size %d)", inode->i_ino, >> + (unsigned long long) logical, >> + (unsigned long long) phys, len); >> + WARN_ON(1); >> + return -EIO; >> + } >> + return 0; >> +} >> + >> /* >> * The ext4_get_blocks_wrap() function try to look up the requested blocks, >> * and returns if the blocks are already mapped. >> @@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, >> } >> up_read((&EXT4_I(inode)->i_data_sem)); >> >> + if (retval > 0 && buffer_mapped(bh)) { >> + int ret = check_block_validity(inode, block, >> + bh->b_blocknr, retval); >> + if (ret != 0) >> + return ret; >> + } >> + >> /* If it is only a block(s) look up */ >> if (!create) >> return retval; >> @@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, >> } >> >> up_write((&EXT4_I(inode)->i_data_sem)); >> + if (retval > 0 && buffer_mapped(bh)) { >> + int ret = check_block_validity(inode, block, >> + bh->b_blocknr, retval); >> + if (ret != 0) >> + return ret; >> + } >> return retval; >> } >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Thanks - Manish -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 11:17 ` Manish Katiyar @ 2009-05-14 12:30 ` Theodore Tso 0 siblings, 0 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-14 12:30 UTC (permalink / raw) To: Manish Katiyar; +Cc: Kevin Shanahan, linux-ext4 On Thu, May 14, 2009 at 04:47:14PM +0530, Manish Katiyar wrote: > > Hi Ted, > > If we have the possibility of applying patches why not apply this too > and try http://markmail.org/message/e3uwzxrd42isxczg . After all it > originated from this bug only. Because I implemented this functionality already in fs/ext4/inode.c:ext4_iget(). Check out commit 563bdd61 in the kernel mainline. (It's not yet in stable release, as it was merged after the 2.6.29 merge window.) - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 11:07 ` Kevin Shanahan 2009-05-14 11:17 ` Manish Katiyar @ 2009-05-14 13:25 ` Kevin Shanahan 2009-05-14 14:07 ` Theodore Tso 1 sibling, 1 reply; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 13:25 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote: > Sure - now running with 2.6.29.3 + your patch. > > patching file fs/ext4/inode.c > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines). > Hunk #2 succeeded at 1113 (offset -81 lines). > Hunk #3 succeeded at 1184 (offset -93 lines). > > I'll report any hits for "check_block_validity" in syslog. That didn't take long: May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1) May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8. May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30 May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1 May 14 22:49:17 hermes kernel: Call Trace: May 14 22:49:17 hermes kernel: [<ffffffff802fdce3>] ext4_da_writepages+0x211/0x427 May 14 22:49:17 hermes kernel: [<ffffffff8055b44f>] thread_return+0x3d/0xc3 May 14 22:49:17 hermes kernel: [<ffffffff802fef56>] ext4_da_get_block_write+0x0/0x12d May 14 22:49:17 hermes kernel: [<ffffffff802757df>] do_writepages+0x20/0x2d May 14 22:49:17 hermes kernel: [<ffffffff802aa359>] __writeback_single_inode+0x196/0x3a4 May 14 22:49:17 hermes kernel: [<ffffffff802542aa>] del_timer_sync+0xc/0x16 May 14 22:49:17 hermes kernel: [<ffffffff802aa9b9>] generic_sync_sb_inodes+0x2a7/0x439 May 14 22:49:17 hermes kernel: [<ffffffff802aacfa>] writeback_inodes+0x7d/0xcc May 14 22:49:17 hermes kernel: [<ffffffff8027591e>] wb_kupdate+0x9f/0x116 May 14 22:49:17 hermes kernel: [<ffffffff802764bf>] pdflush+0x125/0x1d2 May 14 22:49:17 hermes kernel: [<ffffffff8027587f>] wb_kupdate+0x0/0x116 May 14 22:49:17 hermes kernel: [<ffffffff8027639a>] pdflush+0x0/0x1d2 May 14 22:49:17 hermes kernel: [<ffffffff8025c8cc>] kthread+0x47/0x73 May 14 22:49:17 hermes kernel: [<ffffffff80225f7a>] child_rip+0xa/0x20 May 14 22:49:17 hermes kernel: [<ffffffff8025c885>] kthread+0x0/0x73 May 14 22:49:17 hermes kernel: [<ffffffff80225f70>] child_rip+0x0/0x20 May 14 22:49:18 hermes kernel: Remounting filesystem read-only May 14 22:49:18 hermes kernel: ------------[ cut here ]------------ May 14 22:49:18 hermes kernel: WARNING: at fs/ext4/inode.c:1066 check_block_validity+0xd8/0xff() May 14 22:49:18 hermes kernel: Hardware name: S3210SH May 14 22:49:18 hermes kernel: Pid: 3351, comm: smbd Not tainted 2.6.29.3 #1 May 14 22:49:18 hermes kernel: Call Trace: May 14 22:49:18 hermes kernel: [<ffffffff8024c0b0>] warn_slowpath+0xd3/0x10d May 14 22:49:18 hermes kernel: [<ffffffff8025cc0e>] wake_bit_function+0x0/0x23 May 14 22:49:18 hermes kernel: [<ffffffff802afbec>] sync_dirty_buffer+0x5f/0x97 May 14 22:49:18 hermes kernel: [<ffffffff8030589e>] ext4_commit_super+0xbc/0x103 May 14 22:49:18 hermes kernel: [<ffffffff80307f92>] ext4_handle_error+0x83/0xa9 May 14 22:49:18 hermes kernel: [<ffffffff803081cb>] ext4_error+0x83/0x90 May 14 22:49:18 hermes kernel: [<ffffffff8051c5ab>] ipt_do_table+0x308/0x32d May 14 22:49:18 hermes kernel: [<ffffffff80245f5c>] dequeue_task_fair+0x45/0x167 May 14 22:49:18 hermes kernel: [<ffffffff802fd222>] check_block_validity+0xd8/0xff May 14 22:49:18 hermes kernel: [<ffffffff802febe1>] ext4_get_blocks_wrap+0xb5/0x24f May 14 22:49:18 hermes kernel: [<ffffffff802fee2d>] ext4_da_get_block_prep+0x58/0x181 May 14 22:49:18 hermes kernel: [<ffffffff802b0187>] __block_prepare_write+0x150/0x2c4 May 14 22:49:18 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181 May 14 22:49:18 hermes kernel: [<ffffffff802701e4>] add_to_page_cache_locked+0x73/0x95 May 14 22:49:18 hermes kernel: [<ffffffff802b0447>] block_write_begin+0x7a/0xc7 May 14 22:49:18 hermes kernel: [<ffffffff802fce53>] ext4_da_write_begin+0x166/0x1ed May 14 22:49:18 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181 May 14 22:49:18 hermes kernel: [<ffffffff80270b94>] generic_file_buffered_write+0x12c/0x2e8 May 14 22:49:18 hermes kernel: [<ffffffff80271145>] __generic_file_aio_write_nolock+0x25f/0x293 May 14 22:49:18 hermes kernel: [<ffffffff80271928>] generic_file_aio_write+0x64/0xc4 May 14 22:49:18 hermes kernel: [<ffffffff802f8d33>] ext4_file_write+0x93/0x118 May 14 22:49:18 hermes kernel: [<ffffffff80292f9c>] do_sync_write+0xce/0x113 May 14 22:49:18 hermes kernel: [<ffffffff802bc5a0>] vfs_test_lock+0x2e/0x32 May 14 22:49:18 hermes kernel: [<ffffffff8025cbe0>] autoremove_wake_function+0x0/0x2e May 14 22:49:18 hermes kernel: [<ffffffff802bd116>] fcntl_setlk+0x24b/0x25c May 14 22:49:18 hermes kernel: [<ffffffff8029373f>] vfs_write+0xa5/0xde May 14 22:49:18 hermes kernel: [<ffffffff802937cf>] sys_pwrite64+0x57/0x77 May 14 22:49:18 hermes kernel: [<ffffffff8029d8c9>] sys_fcntl+0x312/0x31e May 14 22:49:18 hermes kernel: [<ffffffff8022501b>] system_call_fastpath+0x16/0x1b May 14 22:49:18 hermes kernel: ---[ end trace df4d7250236f7465 ]--- May 14 22:49:18 hermes kernel: EXT4-fs error (device dm-0) in ext4_da_write_begin: IO failure Any clues there? I don't think I'll be able to run this during the day if it's going to trigger and remount the fs read-only as easily as this. Cheers, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 13:25 ` Kevin Shanahan @ 2009-05-14 14:07 ` Theodore Tso 2009-05-14 14:30 ` Kevin Shanahan 0 siblings, 1 reply; 35+ messages in thread From: Theodore Tso @ 2009-05-14 14:07 UTC (permalink / raw) To: Kevin Shanahan; +Cc: linux-ext4 On Thu, May 14, 2009 at 10:55:06PM +0930, Kevin Shanahan wrote: > On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote: > > Sure - now running with 2.6.29.3 + your patch. > > > > patching file fs/ext4/inode.c > > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines). > > Hunk #2 succeeded at 1113 (offset -81 lines). > > Hunk #3 succeeded at 1184 (offset -93 lines). > > > > I'll report any hits for "check_block_validity" in syslog. > > That didn't take long: > > May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1) > May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8. > May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30 > May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1 > May 14 22:49:17 hermes kernel: Call Trace: The reason why I put in the ext4_error() was because I figured it was better to stop the filesystem from stomping on the inode table (since that way would lie data loss). It would also freeze the filesystem so we could see what was happening. Could you run debugfs on the filesystem, and then send me the results of these commands: debugfs: stat <759> debugfs: ncheck 759 Also, could you try run e2fsck on the filesystem and send me the output? The ext4_error() should have stopped the kernel from doing any damage (like stomping on block 529, which was part of the inode table or block group descriptors). If send me the dumpe2fs output about group 0, we'll know for sure (i.e.): Group 0: (Blocks 0-32767) [ITABLE_ZEROED] Checksum 0x3e19, unused inodes 8169 Primary superblock at 0, Group descriptors at 1-5 Reserved GDT blocks at 6-32 Block bitmap at 33 (+33), Inode bitmap at 49 (+49) Inode table at 65-576 (+65) (Your numbers will be different; this was from a 80GB hard drive). > Any clues there? I don't think I'll be able to run this during the day > if it's going to trigger and remount the fs read-only as easily as > this. Yes, this is a *huge*. Finding this may have been just what we needed to determine what has been going on. Thank you!! What I would suggest doing is to mount the filesystem with the mount option -o nodelalloc. This would be useful for two fronts: (1) it would determine whether or not the problem shows up if we suppress delayed allocation, and (2) if it works, hopefully it will prevent you from further filesystem corruption incidents. (At least as a workaround; turning off delayed allocation will result in a performance decrease, but better that then data loss, I always say!) The fact that this triggered so quickly is actually a bit scary, so hopefully we'll be able to track down exactly what happened fairly quickly from here, and then fix the problem for real. Thanks, - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 14:07 ` Theodore Tso @ 2009-05-14 14:30 ` Kevin Shanahan 2009-05-14 15:44 ` Eric Sandeen 2009-05-14 16:12 ` Theodore Tso 0 siblings, 2 replies; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 14:30 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Thu, May 14, 2009 at 10:07:32AM -0400, Theodore Tso wrote: > On Thu, May 14, 2009 at 10:55:06PM +0930, Kevin Shanahan wrote: > > On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote: > > > Sure - now running with 2.6.29.3 + your patch. > > > > > > patching file fs/ext4/inode.c > > > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines). > > > Hunk #2 succeeded at 1113 (offset -81 lines). > > > Hunk #3 succeeded at 1184 (offset -93 lines). > > > > > > I'll report any hits for "check_block_validity" in syslog. > > > > That didn't take long: > > > > May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1) > > May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8. > > May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30 > > May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1 > > May 14 22:49:17 hermes kernel: Call Trace: > > The reason why I put in the ext4_error() was because I figured it was > better to stop the filesystem from stomping on the inode table (since > that way would lie data loss). It would also freeze the filesystem so > we could see what was happening. Sure, makes sense. I guess I have gotten complacent as I have been getting away with it for several months now. :) > Could you run debugfs on the filesystem, and then send me the results > of these commands: > > debugfs: stat <759> hermes:~# debugfs /dev/dm-0 debugfs 1.41.3 (12-Oct-2008) debugfs: stat <759> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 Generation: 3979120103 Version: 0x00000000:00000001 User: 0 Group: 10140 Size: 14615630848 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 28546168 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 Size of extra inode fields: 28 Extended attributes stored in inode body: = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44) DOSATTRIB = "0x20" (4) BLOCKS: (IND):1081467, (0-24575):999424-1023999, (24576-26623):1028096-1030143, (26624-43007):1032192-1048575, (43008-45055):1087488-1089535, (45056-51199):1550336-1556479, (51200-55295):1560576-1564671, (55296-63487):1540096-1548287, (63488-65535):2985984-2988031, (65536-92159):2996224-3022847, (92160-100351):3024896-3033087, (100352-112639):3035136-3047423, (112640-155647):3049472-3092479, (155648-157695):3100672-3102719, (157696-165887):3092480-3100671, (165888-206847):3104768-3145727, (206848-208895):3186688-3188735, (208896-681983):3196928-3670015, (681984-1042431):3735552-4095999, (1042432-1048575):4155392-4161535, (1048576-1050623):4182016-4184063, (1050624-1052671):4186112-4188159, (1052672-1056767):4177920-4182015, (1056768-1060863):4190208-4194303, (1060864-1124351):4655104-4718591, (112 4352-1126399):5390336-5392383, (1126400-1142783):5423104-5439487, (1142784-1155071):5478400-5490687, (1155072-1159167):5472256-5476351, (1159168-1433599):5492736-5767167, (1433600-1437695):6057984-6062079, (1437696-1634303):6094848-6291455, (1634304-1636351):6432768-6434815, (1636352-1640447):6483968-6488063, (1640448-1646591):6498304-6504447, (1646592-1662975):6510592-6526975, (1662976-1718271):6533120-6588415, (1718272-1943551):6590464-6815743, (1943552-2203647):6914048-7174143, (2203648-2205695):7182336-7184383, (2205696-2207743):7192576-7194623, (2207744-2211839):7188480-7192575, (2211840-2355199):7196672-7340031, (2355200-2357247):7380992-7383039, (2357248-2359295):7387136-7389183, (2359296-2463743):7444480-7548927, (2463744-2531327):7567360-7634943, (2531328-2746367):7649280-786431 9, (2746368-2811903):7897088-7962623, (2811904-2813951):7979008-7981055, (2813952-2869247):7995392-8050687, (2869248-2879487):8052736-8062975, (2879488-2881535):8081408-8083455, (2881536-2897919):8065024-8081407, (2897920-2906111):8085504-8093695, (2906112-2908159):8110080-8112127, (2908160-2910207):8114176-8116223, (2910208-2920447):8120320-8130559, (2920448-3026943):8136704-8243199, (3026944-3028991):8249344-8251391, (3028992-3031039):8253440-8255487, (3031040-3035135):8245248-8249343, (3035136-3037183):8292352-8294399, (3037184-3041279):8298496-8302591, (3041280-3043327):8398848-8400895, (3043328-3045375):8402944-8404991, (3045376-3049471):8407040-8411135, (3049472-3055615):8417280-8423423, (3055616-3094527):8501248-8540159, (3094528-3096575):8710144-8712191, (3096576-3110911):8726528 -8740863, (3110912-3112959):8779776-8781823, (3112960-3115007):8806400-8808447, (3115008-3119103):8790016-8794111, (3119104-3162111):8818688-8861695, (3162112-3164159):8863744-8865791, (3164160-3166207):8923136-8925183, (3166208-3235839):11603968-11673599, (3235840-3276799):12998656-13039615, (3276800-3278847):13070336-13072383, (3278848-3282943):13058048-13062143, (3282944-3287039):13066240-13070335, (3287040-3303423):13041664-13058047, (3303424-3307519):16058368-16062463, (3307520-3311615):16084992-16089087, (3311616-3334143):16119808-16142335, (3334144-3336191):16150528-16152575, (3336192-3340287):16146432-16150527, (3340288-3342335):16154624-16156671, (3342336-3348479):16160768-16166911, (3348480-3350527):16199680-16201727, (3350528-3352575):16203776-16205823, (3352576-3354623):16226 304-16228351, (3354624-3356671):16359424-16361471, (3356672-3360767):16369664-16373759, (3360768-3362815):16377856-16379903, (3362816-3364863):16388096-16 > debugfs: ncheck 759 debugfs: ncheck 759 Inode Pathname 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf By the way, that filename on disk is just "exchange-2000-UCWB-KVM-18.bkf". Not sure if the way it appears to repeat itself at the end there is indicative of a problem. > Also, could you try run e2fsck on the filesystem and send me the > output? The ext4_error() should have stopped the kernel from doing > any damage (like stomping on block 529, which was part of the inode > table or block group descriptors). If send me the dumpe2fs output > about group 0, we'll know for sure (i.e.): > > Group 0: (Blocks 0-32767) [ITABLE_ZEROED] > Checksum 0x3e19, unused inodes 8169 > Primary superblock at 0, Group descriptors at 1-5 > Reserved GDT blocks at 6-32 > Block bitmap at 33 (+33), Inode bitmap at 49 (+49) > Inode table at 65-576 (+65) > > (Your numbers will be different; this was from a 80GB hard drive). hermes:~# e2fsck -pfv /dev/dm-0 /dev/dm-0: recovering journal 331810 inodes used (0.49%) 16080 non-contiguous inodes (4.8%) # of inodes with ind/dind/tind blocks: 0/0/0 Extent depth histogram: 331411/329 46723278 blocks used (17.41%) 0 bad blocks 3 large files 305051 regular files 26686 directories 0 character device files 0 block device files 0 fifos 0 links 64 symbolic links (60 fast symbolic links) 0 sockets -------- 331801 files hermes:~# dumpe2fs /dev/dm-0 ... Group 0: (Blocks 0-32767) [ITABLE_ZEROED] Checksum 0x0eb9, unused inodes 0 Primary superblock at 0, Group descriptors at 1-64 Reserved GDT blocks at 65-256 Block bitmap at 257 (+257), Inode bitmap at 273 (+273) Inode table at 289-800 (+289) 3443 free blocks, 0 free inodes, 571 directories Free blocks: 9067, 9108-9111, 9116-9119, 9276-9279, 9308-9311, 9687, 9720-9727, 9736-9743, 9848-9879, 9888-9895, 9936-9951, 10032-10047, 10072-10079, 10200-10207, 10352-10367, 10584-10591, 10619-10623, 10661-10671, 10740-10751, 11476-11487, 11512-11519, 11637-11647, 11669-11679, 11711, 11772-11775, 11890-11903, 11963-11967, 12187-12191, 12213-12223, 12249-12255, 12272-12280, 12282-12287, 12322-12351, 12381-12383, 12405-12415, 12437-12447, 12478-12479, 12542-12543, 12664-12671, 12727-12735, 12766-12767, 12851-12863, 12890-12895, 12927, 12970, 12980-12991, 13047-13055, 13246-13247, 13290-13295, 13312-13317, 13406-13407, 13433, 13439, 13456-13463, 13490-13503, 13537-13567, 13658-13695, 13770-13823, 13850-13855, 13886-13887, 13908-13919, 13951, 13971-13983, 14010-14015, 14056-14079, 14204-14 207, 14249-14271, 14316-14350, 14358-14361, 14375, 14487-14495, 14526-14527, 14550-14584, 14605, 14619, 14630-14635, 14647-14655, 14660, 14674-14687, 14705-14719, 14744-14774, 14785-14815, 14844-14847, 14889-14926, 14933-14944, 14948, 14965-14975, 14992-15000, 15006-15007, 15035-15039, 15069-15087, 15097-15102, 15114-15127, 15141-15156, 15158-15164, 15176-15184, 15192-15193, 15213, 15223-15230, 15238, 15241-15256, 15264-15281, 15289-15297, 15305-15325, 15333-15341, 15350-15359, 15427-15455, 15482-15487, 15599-15615, 15705-15711, 15732-15743, 15798-15807, 15860-15871, 15923-15935, 15956-15967, 15989-15999, 16040-16063, 16090-16095, 16123-16127, 16171-16191, 16219-16223, 16246-16255, 16309-16319, 16343-16351, 16379-16383, 16565-16575, 16624-16639, 16688-16703, 16721-16735, 16763-16767, 168 37-16863, 16881-16895, 17036-17055, 17075-17087, 17147-17151, 17220-17247, 17267-17279, 17307-17311, 17339-17343, 17365-17375, 17403-17407, 17453-17471, 17531-17535, 17554-17567, 17599, 17647-17663, 17718-17727, 17756-17759, 17812-17823, 17844-17855, 17904-17919, 18103-18111, 18131-18143, 18162-18175, 18277-18303, 18331-18335, 18363-18367, 18393-18399, 18427-18431, 18598-18623, 18676-18687, 18706-18719, 18738-18751, 18805-18815, 18844-18847, 18876-18879, 18908-18911, 18933-18943, 19135, 19198-19199, 19249-19263, 19286-19295, 19323-19327, 19416-19423, 19455, 19527-19551, 19582-19583, 19704-19711, 19756-19775, 19835-19839, 19887-19903, 19925-19935, 19959-19967, 20148-20159, 20206-20223, 20271-20287, 20336-20351, 20422-20447, 20476-20479, 20671, 20727-20735, 20758-20767, 20795-20799, 20818- 20831, 20849-20863, 20891-20895, 20918-20927, 20955-20959, 20980-20991, 21133-21151, 21172-21183, 21221-21247, 21265-21279, 21307-21311, 21345-21375, 21425-21439, 21476-21503, 21531-21553, 21560-21596, 21604, 21611, 21630-21631, 21662-21666, 21672-21673, 21683, 21695-21725, 21732-21755, 21765-21782, 21795-21842, 21848-21875, 21890-21919, 21947-21951, 22005-22030, 22038-22052, 22062-22076, 22091-22092, 22107-22118, 22131-22140, 22154-22183, 22196-22198, 22569-22591, 22619-22623, 22651-22655, 22760-22783, 22803-22815, 22847, 22888-22911, 23002-23007, 23035-23039, 23153-23167, 23189-23199, 23260-23263, 23284-23295, 23336-23359, 23402-23423, 23462-23487, 23512-23519, 23544-23551, 23572-23583, 23644-23647, 23673-23679, 23768-23804, 23806-23807, 23841-23871, 23930-23935, 24011-24031, 24053-240 76, 24103-24127, 24186-24191, 24287, 24347-24351, 24371-24383, 24434-24447, 24475-24479, 24501-24511, 24566-24575, 26431, 26472-26495, 26562-26591, 26619-26623, 26903-26905, 26909-26911, 26932-26943, 26984-27007, 27134-27135, 27182-27199, 27238-27263, 27291-27295, 27323-27327, 27376-27391, 27507-27519, 27641-27647, 27979-27999, 28024-28031, 28103-28127, 28255, 28323-28351, 28373-28383, 28414-28415, 28572-28575, 28598-28607, 28668-28671 Free inodes: > > Any clues there? I don't think I'll be able to run this during the day > > if it's going to trigger and remount the fs read-only as easily as > > this. > > Yes, this is a *huge*. Finding this may have been just what we needed > to determine what has been going on. Thank you!! > > What I would suggest doing is to mount the filesystem with the mount > option -o nodelalloc. This would be useful for two fronts: (1) it > would determine whether or not the problem shows up if we suppress > delayed allocation, and (2) if it works, hopefully it will prevent you > from further filesystem corruption incidents. (At least as a > workaround; turning off delayed allocation will result in a > performance decrease, but better that then data loss, I always say!) Ok, thanks - I've added the mount option. I'll let you know if the error still triggers. > The fact that this triggered so quickly is actually a bit scary, so > hopefully we'll be able to track down exactly what happened fairly > quickly from here, and then fix the problem for real. Yeah, that would be great. Cheers, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 14:30 ` Kevin Shanahan @ 2009-05-14 15:44 ` Eric Sandeen 2009-05-14 21:07 ` Kevin Shanahan 2009-05-14 16:12 ` Theodore Tso 1 sibling, 1 reply; 35+ messages in thread From: Eric Sandeen @ 2009-05-14 15:44 UTC (permalink / raw) To: Kevin Shanahan; +Cc: Theodore Tso, linux-ext4 Kevin Shanahan wrote: > hermes:~# debugfs /dev/dm-0 > debugfs 1.41.3 (12-Oct-2008) > debugfs: stat <759> > > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 > Generation: 3979120103 Version: 0x00000000:00000001 > User: 0 Group: 10140 Size: 14615630848 > File ACL: 0 Directory ACL: 0 > Links: 1 Blockcount: 28546168 > Fragment: Address: 0 Number: 0 Size: 0 > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > Size of extra inode fields: 28 > Extended attributes stored in inode body: > = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44) > DOSATTRIB = "0x20" (4) Apologies if I've asked before: DOSATTRIB is a samba xattr; how does samba come into play in your setup, again? I'm wondering what that other odd xattr value is too... (the one that didn't even print a name) -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 15:44 ` Eric Sandeen @ 2009-05-14 21:07 ` Kevin Shanahan 2009-05-14 21:08 ` Eric Sandeen 0 siblings, 1 reply; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 21:07 UTC (permalink / raw) To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4 On Thu, May 14, 2009 at 10:44:07AM -0500, Eric Sandeen wrote: > Kevin Shanahan wrote: > > > hermes:~# debugfs /dev/dm-0 > > debugfs 1.41.3 (12-Oct-2008) > > debugfs: stat <759> > > > > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 > > Generation: 3979120103 Version: 0x00000000:00000001 > > User: 0 Group: 10140 Size: 14615630848 > > File ACL: 0 Directory ACL: 0 > > Links: 1 Blockcount: 28546168 > > Fragment: Address: 0 Number: 0 Size: 0 > > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > > Size of extra inode fields: 28 > > Extended attributes stored in inode body: > > = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44) > > DOSATTRIB = "0x20" (4) > > Apologies if I've asked before: DOSATTRIB is a samba xattr; how does > samba come into play in your setup, again? This server is a Samba server. This file in particular is being written to via Samba by a Windows 2000 server over the network. > I'm wondering what that other odd xattr value is too... (the one that > didn't even print a name) I would assume that's the file ACL: hermes:/srv/samba/local/dumps/exchange# getfacl --numeric exchange-2000-UCWB-KVM-18.bkf # file: exchange-2000-UCWB-KVM-18.bkf # owner: 0 # group: 10140 user::rw- user:10002:rwx #effective:rw- group::r-x #effective:r-- group:10071:r-x #effective:r-- group:10136:rwx #effective:rw- mask::rw- other::--- Cheers, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:07 ` Kevin Shanahan @ 2009-05-14 21:08 ` Eric Sandeen 0 siblings, 0 replies; 35+ messages in thread From: Eric Sandeen @ 2009-05-14 21:08 UTC (permalink / raw) To: Kevin Shanahan; +Cc: Theodore Tso, linux-ext4 Kevin Shanahan wrote: > On Thu, May 14, 2009 at 10:44:07AM -0500, Eric Sandeen wrote: ... >>> Extended attributes stored in inode body: >>> = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44) >>> DOSATTRIB = "0x20" (4) >> Apologies if I've asked before: DOSATTRIB is a samba xattr; how does >> samba come into play in your setup, again? > > This server is a Samba server. This file in particular is being > written to via Samba by a Windows 2000 server over the network. > >> I'm wondering what that other odd xattr value is too... (the one that >> didn't even print a name) > > I would assume that's the file ACL: Ah, sure. Ok thanks. -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 14:30 ` Kevin Shanahan 2009-05-14 15:44 ` Eric Sandeen @ 2009-05-14 16:12 ` Theodore Tso 2009-05-14 21:02 ` Kevin Shanahan 2009-05-15 3:57 ` Alex Tomas 1 sibling, 2 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-14 16:12 UTC (permalink / raw) To: Kevin Shanahan, Andreas Dilger, Alex Tomas; +Cc: linux-ext4 On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote: > > debugfs: stat <759> > > hermes:~# debugfs /dev/dm-0 > debugfs 1.41.3 (12-Oct-2008) > debugfs: stat <759> > > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 > Generation: 3979120103 Version: 0x00000000:00000001 > User: 0 Group: 10140 Size: 14615630848 > File ACL: 0 Directory ACL: 0 > Links: 1 Blockcount: 28546168 > Fragment: Address: 0 Number: 0 Size: 0 > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > Inode Pathname > 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf Do you know how the system was likely writing into /local/dumps/exhcnag/eexchange-2000-UCWB-KVM-18.bkf? What this a backup via rsync or tar? Was this some application writing into a pre-existing file via NFS, or via local disk access? Given the ctime/atime fields, I'm inclined to guess the latter, but it would be good to know. The stat dump for the inode 759 does *not* show logical block 1741329 getting mapped to physical block 529. So the question is how did that happen? I've started looking, and one thing popped up at me. I need to check in with the Lustre folks who originally donated the code, but I don't see any spinlock or mutexes protecting the inode's extent cache. So if you are on an SMP machine, this could potentially have caused the problem. How many CPU's or cores do you have? What does /proc/cpuinfo report? Also, would it be correct to assume this file is getting served up via Samba. My theory is that we might be running into problems when two threads are simultaneously trying read and write to a single file at the same time. Hmm, what is accessing your files on this system? Are you just doing backups? Is it just a backup server? Or are you serving up files using Samba and there are clients which are accessing those files? So if this the problem the following experiment should be able to confirm whether it's the problem, by seeing if the problem goes away if we short-circuit the inode's extent cache. In fs/ext4/extents.c, try inserting a "return" statement to in ext4_ext_put_in_cache(): static void ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, __u32 len, ext4_fsblk_t start, int type) { struct ext4_ext_cache *cex; return; <---- insert this line BUG_ON(len == 0); cex = &EXT4_I(inode)->i_cached_extent; cex->ec_type = type; cex->ec_block = block; cex->ec_len = len; cex->ec_start = start; } This should short circuit the i_cached_extent cache, and this may be enough to make your problem go away. (If this theory is correct, using mount -o nodelalloc probably won't make a difference, although it might change the timing enough to make the bug harder to see.) If that solves the problem, the right long-term fix will be to drop bin a spinlock to protect i_cached_extent. - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 16:12 ` Theodore Tso @ 2009-05-14 21:02 ` Kevin Shanahan 2009-05-14 21:23 ` Theodore Tso 2009-05-15 3:57 ` Alex Tomas 1 sibling, 1 reply; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 21:02 UTC (permalink / raw) To: Theodore Tso; +Cc: Andreas Dilger, Alex Tomas, linux-ext4 On Thu, May 14, 2009 at 12:12:54PM -0400, Theodore Tso wrote: > On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote: > > > debugfs: stat <759> > > > > hermes:~# debugfs /dev/dm-0 > > debugfs 1.41.3 (12-Oct-2008) > > debugfs: stat <759> > > > > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 > > Generation: 3979120103 Version: 0x00000000:00000001 > > User: 0 Group: 10140 Size: 14615630848 > > File ACL: 0 Directory ACL: 0 > > Links: 1 Blockcount: 28546168 > > Fragment: Address: 0 Number: 0 Size: 0 > > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 > > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > > > Inode Pathname > > 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf > > Do you know how the system was likely writing into > /local/dumps/exhcnag/eexchange-2000-UCWB-KVM-18.bkf? What this a > backup via rsync or tar? Was this some application writing into a > pre-existing file via NFS, or via local disk access? This file was getting written by a windows server via samba. The backup script on the windows side deletes the file before writing a new copy each night. It's a fairly large file, about 14GB. > Given the ctime/atime fields, I'm inclined to guess the latter, but it > would be good to know. > > The stat dump for the inode 759 does *not* show logical block 1741329 > getting mapped to physical block 529. So the question is how did that > happen? > > I've started looking, and one thing popped up at me. I need to check > in with the Lustre folks who originally donated the code, but I don't > see any spinlock or mutexes protecting the inode's extent cache. So > if you are on an SMP machine, this could potentially have caused the > problem. How many CPU's or cores do you have? What does > /proc/cpuinfo report? Also, would it be correct to assume this file > is getting served up via Samba. My theory is that we might be running > into problems when two threads are simultaneously trying read and > write to a single file at the same time. Yes, it's SMP (4 cores): processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU X3350 @ 2.66GHz stepping : 7 cpu MHz : 2666.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority bogomips : 5322.33 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: > Hmm, what is accessing your files on this system? Are you just doing > backups? Is it just a backup server? Or are you serving up files > using Samba and there are clients which are accessing those files? This is a Samba server serving up files to clients and also a backup server. That particular file is unlikely to have seen concurrent access from multiple clients - the folder is only accessible to that windows server to dump its exchange store overnight. > So if this the problem the following experiment should be able to > confirm whether it's the problem, by seeing if the problem goes away > if we short-circuit the inode's extent cache. In fs/ext4/extents.c, > try inserting a "return" statement to in ext4_ext_put_in_cache(): > > static void > ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, > __u32 len, ext4_fsblk_t start, int type) > { > struct ext4_ext_cache *cex; > > return; <---- insert this line > BUG_ON(len == 0); > cex = &EXT4_I(inode)->i_cached_extent; > cex->ec_type = type; > cex->ec_block = block; > cex->ec_len = len; > cex->ec_start = start; > } > > This should short circuit the i_cached_extent cache, and this may be > enough to make your problem go away. (If this theory is correct, > using mount -o nodelalloc probably won't make a difference, although > it might change the timing enough to make the bug harder to see.) > > If that solves the problem, the right long-term fix will be to drop bin > a spinlock to protect i_cached_extent. Okay, nodelalloc was enough get through the backup process again at least once (wrote the whole exchange-2000-UCWB-KVM-18.bkf file out in full) without triggering the check_block_validity error. To make sure I had something reproducable on the current kernel (2.6.29.3 + check_block_validity patch), I re-mounted without nodelalloc again and tried running just that exchange backup - the bug did trigger again. May 15 05:30:55 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 2509617 mapped to 0 (size 1) May 15 05:30:55 hermes kernel: Aborting journal on device dm-0:8. May 15 05:30:55 hermes kernel: ext4_da_writepages: jbd2_start: 775 pages, ino 759; err -30 May 15 05:30:55 hermes kernel: Pid: 5026, comm: pdflush Tainted: G W 2.6.29.3 #1 May 15 05:30:55 hermes kernel: Call Trace: May 15 05:30:55 hermes kernel: [<ffffffff802fdce3>] ext4_da_writepages+0x211/0x427 May 15 05:30:55 hermes kernel: [<ffffffff8055b44f>] thread_return+0x3d/0xc3 May 15 05:30:55 hermes kernel: [<ffffffff802fef56>] ext4_da_get_block_write+0x0/0x12d May 15 05:30:55 hermes kernel: [<ffffffff802757df>] do_writepages+0x20/0x2d May 15 05:30:55 hermes kernel: [<ffffffff802aa359>] __writeback_single_inode+0x196/0x3a4 May 15 05:30:55 hermes kernel: [<ffffffff802542aa>] del_timer_sync+0xc/0x16 May 15 05:30:55 hermes kernel: [<ffffffff802aa9b9>] generic_sync_sb_inodes+0x2a7/0x439 May 15 05:30:55 hermes kernel: [<ffffffff802aacfa>] writeback_inodes+0x7d/0xcc May 15 05:30:55 hermes kernel: [<ffffffff8027591e>] wb_kupdate+0x9f/0x116 May 15 05:30:55 hermes kernel: [<ffffffff802764bf>] pdflush+0x125/0x1d2 May 15 05:30:55 hermes kernel: [<ffffffff8027587f>] wb_kupdate+0x0/0x116 May 15 05:30:55 hermes kernel: [<ffffffff8027639a>] pdflush+0x0/0x1d2 May 15 05:30:55 hermes kernel: [<ffffffff8025c8cc>] kthread+0x47/0x73 May 15 05:30:55 hermes kernel: [<ffffffff80225f7a>] child_rip+0xa/0x20 May 15 05:30:55 hermes kernel: [<ffffffff8025c885>] kthread+0x0/0x73 May 15 05:30:55 hermes kernel: [<ffffffff80225f70>] child_rip+0x0/0x20 May 15 05:30:55 hermes kernel: Remounting filesystem read-only May 15 05:30:56 hermes kernel: ------------[ cut here ]------------ May 15 05:30:56 hermes kernel: WARNING: at fs/ext4/inode.c:1066 check_block_validity+0xd8/0xff() May 15 05:30:56 hermes kernel: Hardware name: S3210SH May 15 05:30:56 hermes kernel: Pid: 6196, comm: smbd Tainted: G W 2.6.29.3 #1 May 15 05:30:56 hermes kernel: Call Trace: May 15 05:30:56 hermes kernel: [<ffffffff8024c0b0>] warn_slowpath+0xd3/0x10d May 15 05:30:56 hermes kernel: [<ffffffff8025cc0e>] wake_bit_function+0x0/0x23 May 15 05:30:56 hermes kernel: [<ffffffff802afbec>] sync_dirty_buffer+0x5f/0x97 May 15 05:30:56 hermes kernel: [<ffffffff8030589e>] ext4_commit_super+0xbc/0x103 May 15 05:30:56 hermes kernel: [<ffffffff80307f92>] ext4_handle_error+0x83/0xa9 May 15 05:30:56 hermes kernel: [<ffffffff803081cb>] ext4_error+0x83/0x90 May 15 05:30:56 hermes kernel: [<ffffffff8040c772>] scsi_done+0x0/0xc May 15 05:30:56 hermes kernel: [<ffffffff80245f5c>] dequeue_task_fair+0x45/0x167 May 15 05:30:56 hermes kernel: [<ffffffff802fd222>] check_block_validity+0xd8/0xff May 15 05:30:56 hermes kernel: [<ffffffff802febe1>] ext4_get_blocks_wrap+0xb5/0x24f May 15 05:30:56 hermes kernel: [<ffffffff802fee2d>] ext4_da_get_block_prep+0x58/0x181 May 15 05:30:56 hermes kernel: [<ffffffff802b0187>] __block_prepare_write+0x150/0x2c4 May 15 05:30:56 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181 May 15 05:30:56 hermes kernel: [<ffffffff802701e4>] add_to_page_cache_locked+0x73/0x95 May 15 05:30:56 hermes kernel: [<ffffffff802b0447>] block_write_begin+0x7a/0xc7 May 15 05:30:56 hermes kernel: [<ffffffff802fce53>] ext4_da_write_begin+0x166/0x1ed May 15 05:30:56 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181 May 15 05:30:56 hermes kernel: [<ffffffff80270b94>] generic_file_buffered_write+0x12c/0x2e8 May 15 05:30:56 hermes kernel: [<ffffffff80271145>] __generic_file_aio_write_nolock+0x25f/0x293 May 15 05:30:56 hermes kernel: [<ffffffff80271928>] generic_file_aio_write+0x64/0xc4 May 15 05:30:56 hermes kernel: [<ffffffff802f8d33>] ext4_file_write+0x93/0x118 May 15 05:30:56 hermes kernel: [<ffffffff80292f9c>] do_sync_write+0xce/0x113 May 15 05:30:56 hermes kernel: [<ffffffff802bc5a0>] vfs_test_lock+0x2e/0x32 May 15 05:30:56 hermes kernel: [<ffffffff8025cbe0>] autoremove_wake_function+0x0/0x2e May 15 05:30:56 hermes kernel: [<ffffffff802bd116>] fcntl_setlk+0x24b/0x25c May 15 05:30:56 hermes kernel: [<ffffffff8029373f>] vfs_write+0xa5/0xde May 15 05:30:56 hermes kernel: [<ffffffff802937cf>] sys_pwrite64+0x57/0x77 May 15 05:30:56 hermes kernel: [<ffffffff8029d8c9>] sys_fcntl+0x312/0x31e May 15 05:30:56 hermes kernel: [<ffffffff8022501b>] system_call_fastpath+0x16/0x1b May 15 05:30:56 hermes kernel: ---[ end trace df4d7250236f7466 ]--- May 15 05:30:56 hermes kernel: EXT4-fs error (device dm-0) in ext4_da_write_begin: IO failure May 15 05:30:56 hermes kernel: __journal_remove_journal_head: freeing b_frozen_data debugfs: stat <759> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 Generation: 2963762165 Version: 0x00000000:00000001 User: 0 Group: 10140 Size: 14814860288 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 28935288 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x4a0c363e:77c84a60 -- Fri May 15 00:48:22 2009 atime: 0x4a0c2ce8:55358578 -- Fri May 15 00:08:32 2009 mtime: 0x4a0c363e:77c84a60 -- Fri May 15 00:48:22 2009 crtime: 0x4a0c2ce8:55358578 -- Fri May 15 00:08:32 2009 Size of extra inode fields: 28 Extended attributes stored in inode body: = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44) DOSATTRIB = "0x20" (4) BLOCKS: (IND):6603263, (0):6604364, (1-511):6602752-6603262, (512-1023):6586368-6586879, (1024-2047):6587392-6588415, (2048-14335):6590464-6602751, (14336-188415):6641664-6815743, (188416-417791):6914048-7143423, (417792-419839):7172096-7174143, (419840-423935):7168000-7172095, (423936-448511):7143424-7167999, (448512-450559):7182336-7184383, (450560-452607):7192576-7194623, (452608-456703):7188480-7192575, (456704-600063):7196672-7340031, (600064-602111):7380992-7383039, (602112-604159):7387136-7389183, (604160-696319):7444480-7536639, (696320-763903):7567360-7634943, (763904-978943):7649280-7864319, (978944-1044479):7897088-7962623, (1044480-1046527):7979008-7981055, (1046528-1079295):7995392-8028159, (1079296-1081343):8048640-8050687, (1081344-1085439):8044544-8048639, (1085440-1101823):8028160 -8044543, (1101824-1112063):8052736-8062975, (1112064-1114111):8081408-8083455, (1114112-1118207):8065024-8069119, (1118208-1122303):8077312-8081407, (1122304-1130495):8069120-8077311, (1130496-1138687):8085504-8093695, (1138688-1140735):8110080-8112127, (1140736-1142783):8114176-8116223, (1142784-1148927):8120320-8126463, (1148928-1150975):8136704-8138751, (1150976-1155071):8126464-8130559, (1155072-1241087):8138752-8224767, (1241088-1243135):8241152-8243199, (1243136-1245183):8249344-8251391, (1245184-1247231):8253440-8255487, (1247232-1251327):8245248-8249343, (1251328-1267711):8224768-8241151, (1267712-1269759):8292352-8294399, (1269760-1273855):8298496-8302591, (1273856-1275903):8398848-8400895, (1275904-1277951):8402944-8404991, (1277952-1282047):8407040-8411135, (1282048-1288191): 8417280-8423423, (1288192-1306623):8501248-8519679, (1306624-1310719):8536064-8540159, (1310720-1327103):8519680-8536063, (1327104-1329151):8710144-8712191, (1329152-1331199):8718336-8720383, (1331200-1345535):8726528-8740863, (1345536-1347583):8749056-8751103, (1347584-1349631):8761344-8763391, (1349632-1351679):8779776-8781823, (1351680-1353727):8806400-8808447, (1353728-1357823):8790016-8794111, (1357824-1386495):8818688-8847359, (1386496-1388543):8859648-8861695, (1388544-1390591):8863744-8865791, (1390592-1394687):8855552-8859647, (1394688-1402879):8847360-8855551, (1402880-1404927):8923136-8925183, (1404928-1474559):11603968-11673599, (1474560-1484799):12998656-13008895, (1484800-1486847):13037568-13039615, (1486848-1490943):13033472-13037567, (1490944-1515519):13008896-13033471, ( 1515520-1517567):13070336-13072383, (1517568-1521663):13058048-13062143, (1521664-1525759):13066240-13070335, (1525760-1542143):13041664-13058047, (1542144-1546239):16058368-16062463, (1546240-1550335):16084992-16089087, (1550336-1552383):16119808-16121855, (1552384-1554431):16150528-16152575, (1554432-1558527):16138240-16142335, (1558528-1562623):16146432-16150527, (1562624-1579007):16121856-16138239, (1579008-1581055):16154624-16156671, (1581056-1587199):16160768-16166911, (1587200-1589247):16199680-16201727, (1589248-1591295):16203776-16205823, (1591296-1593343):16226304-16228351, (1593344-1595391):16359424-16361471, (1595392-1599487):16369664-16373759, (1599488-1601535):16377856-16379903, (1601536-1603583):16388096-16390143, (1603584-1605631):16394240-16396287, (1605632-1607679):1640 0384-16402431, (1607680-1611775):16396288-16400383, (1611776-1615871):16455680-16459775, (1615872-1619967):16467968-16472063, (1619968-1624063):16461824-16465919, (1624064-1626111):16574464-16576511, (1626112-1628159):16578560-16580607, (1628160-1630207):16609280-16611327, (1630208-1638399):16601088-16609279, (1638400-1646591):16580608-16588799, (1646592-1650687):16652288-16656383, (1650688-1652735):16662528-16664575, (1652736-1656831):18202624-18206719, (1656832-1658879):18262016-18264063, (1658880-1660927):18276352-18278399, (1660928-1673215):18264064-18276351, (1673216-1677311):18296832-18300927, (1677312-1689599):18305024-18317311, (1689600-1693695):18706432-18710527, (1693696-1695743):18712576-18714623, (1695744-1699839):18728960-18733055, (1699840-1701887):18735104-18737151, (17018 88-1714175):18714624-18726911, (1714176-1716223):18753536-18755583, (1716224-1718271):18757632-18759679, (1718272-1720319):18767872-18769919, (1720320-1728511):18759680-18767871, (1728512-1789951):18780160-18841599, (1789952-1794047):18857984-18862079, (1794048-1810431):18841600-18857983, (1810432-1818623):18866176-18874367, (1818624-1820671):18884608-18886655, (1820672-2035711):18888704-19103743, (2035712-2037759):19132416-19134463, (2037760-2041855):19128320-19132415, (2041856-2066431):19103744-19128319, (2066432-2328575):19136512-19398655, (2328576-2330623):19408896-19410943, (2330624-2332671):19419136-19421183, (2332672-2340863):19410944-19419135, (2340864-2447359):19423232-19529727, (2447360-2478079):19531776-19562495, (2478080-2482175):19576832-19580927, (2482176-2486271):19593216- 19597311, (2486272-2488319):19601408-19603455, (2488320-2783231):19628032-19922943, (2783232-2838527):19933184-19988479, (2838528-2842623):19998720-20002815, (2842624-2846719):20006912-20011007, (2846720-2854911):19988480-19996671, (2854912-3248127):20054016-20447231, (3248128-3252223):20457472-20461567, (3252224-3254271):20471808-20473855, (3254272-3262463):20463616-20471807, (3262464-3264511):20484096-20486143, (3264512-3266559):20502528-20504575, (3266560-3268607):20523008-20525055, (3268608-3270655):20533248-20535295, (3270656-3616909):20578304-20924557 TOTAL: 3616911 debugfs: ncheck 759 Inode Pathname 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf hermes:~# e2fsck -pfv /dev/dm-0 /dev/dm-0: recovering journal /dev/dm-0: Inode 7 has illegal block(s). /dev/dm-0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options) hermes:~# e2fsck -fv /dev/dm-0 e2fsck 1.41.3 (12-Oct-2008) Pass 1: Checking inodes, blocks, and sizes Inode 7 has illegal block(s). Clear<y>? yes Illegal block #181260 (1109068081) in inode 7. CLEARED. Illegal block #181261 (2437571651) in inode 7. CLEARED. Illegal block #181262 (892263544) in inode 7. CLEARED. Illegal block #181263 (1023380360) in inode 7. CLEARED. Illegal block #181264 (1152724232) in inode 7. CLEARED. Illegal block #181265 (2602060959) in inode 7. CLEARED. Illegal block #181266 (754287188) in inode 7. CLEARED. Illegal block #181267 (2833437713) in inode 7. CLEARED. Illegal block #181268 (1495914360) in inode 7. CLEARED. Illegal block #181269 (1583358807) in inode 7. CLEARED. Illegal block #181271 (1514245352) in inode 7. CLEARED. Too many illegal blocks in inode 7. Clear inode<y>? yes Restarting e2fsck from the beginning... Resize inode not valid. Recreate<y>? yes Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Free blocks count wrong for group #0 (4275, counted=4276). Fix<y>? yes Free blocks count wrong for group #142 (30919, counted=199). Fix<y>? yes Free blocks count wrong for group #143 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #159 (13307, counted=11259). Fix<y>? yes Free blocks count wrong for group #164 (7197, counted=5149). Fix<y>? yes Free blocks count wrong for group #165 (18691, counted=2307). Fix<y>? yes Free blocks count wrong for group #167 (31325, counted=2653). Fix<y>? yes Free blocks count wrong for group #168 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #169 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #170 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #171 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #172 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #173 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #174 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #175 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #186 (14647, counted=2359). Fix<y>? yes Free blocks count wrong for group #187 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #188 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #189 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #190 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #191 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #196 (3955, counted=1907). Fix<y>? yes Free blocks count wrong for group #197 (4901, counted=805). Fix<y>? yes Free blocks count wrong for group #198 (26087, counted=1511). Fix<y>? yes Free blocks count wrong for group #199 (27516, counted=892). Fix<y>? yes Free blocks count wrong for group #200 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #201 (18382, counted=6094). Fix<y>? yes Free blocks count wrong for group #202 (32104, counted=1384). Fix<y>? yes Free blocks count wrong for group #203 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #204 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #205 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #206 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #207 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #211 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #212 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #213 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #214 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #215 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #216 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #217 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #218 (31158, counted=438). Fix<y>? yes Free blocks count wrong for group #219 (23522, counted=3042). Fix<y>? yes Free blocks count wrong for group #220 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #221 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #222 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #223 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #225 (14501, counted=10405). Fix<y>? yes Free blocks count wrong for group #227 (27863, counted=1239). Fix<y>? yes Free blocks count wrong for group #228 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #229 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #230 (2780, counted=732). Fix<y>? yes Free blocks count wrong for group #231 (32768, counted=0). Fix<y>? yes Free blocks count wrong for group #232 (32768, counted=2481). Fix<y>? yes Free blocks count wrong (222364420, counted=220943542). Fix<y>? yes /dev/dm-0: ***** FILE SYSTEM WAS MODIFIED ***** 331810 inodes used (0.49%) 16104 non-contiguous inodes (4.9%) # of inodes with ind/dind/tind blocks: 0/0/0 Extent depth histogram: 331389/351 47491914 blocks used (17.69%) 0 bad blocks 2 large files 305051 regular files 26686 directories 0 character device files 0 block device files 0 fifos 0 links 64 symbolic links (60 fast symbolic links) 0 sockets -------- 331801 files Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + short circuit i_cached_extent patch, mounted the fs without nodelalloc. I was able to run the full exchange backup without triggering the check_block_validity error. Regards, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:02 ` Kevin Shanahan @ 2009-05-14 21:23 ` Theodore Tso 2009-05-14 21:33 ` Kevin Shanahan ` (3 more replies) 0 siblings, 4 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-14 21:23 UTC (permalink / raw) To: Kevin Shanahan; +Cc: Andreas Dilger, Alex Tomas, linux-ext4 On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > short circuit i_cached_extent patch, mounted the fs without > nodelalloc. I was able to run the full exchange backup without > triggering the check_block_validity error. Great! So here's the final fix (it replaces the short circuit i_cached_extent patch) which I plan to push to Linus. It should be much less of a performance hit than simply short-circuiting i_cached_extent... Thanks so much for helping to find track this down!!! If ever someone deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be you.... - Ted commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 Author: Theodore Ts'o <tytso@mit.edu> Date: Thu May 14 17:09:37 2009 -0400 ext4: Fix race in ext4_inode_info.i_cached_extent If one CPU is reading from a file while another CPU is writing to the same file different locations, there is nothing protecting the i_cached_extent structure from being used and updated at the same time. This could potentially cause the wrong location on disk to be read or written to, including potentially causing the corruption of the block group descriptors and/or inode table. Many thanks to Ken Shannah for helping to track down this problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 172656c..e3a55eb 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, { struct ext4_ext_cache *cex; BUG_ON(len == 0); + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); cex = &EXT4_I(inode)->i_cached_extent; cex->ec_type = type; cex->ec_block = block; cex->ec_len = len; cex->ec_start = start; + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); } /* @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, struct ext4_extent *ex) { struct ext4_ext_cache *cex; + int ret = EXT4_EXT_CACHE_NO; + /* + * We borrow i_block_reservation_lock to protect i_cached_extent + */ + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); cex = &EXT4_I(inode)->i_cached_extent; /* has cache valid data? */ if (cex->ec_type == EXT4_EXT_CACHE_NO) - return EXT4_EXT_CACHE_NO; + goto errout; BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP && cex->ec_type != EXT4_EXT_CACHE_EXTENT); @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, ext_debug("%u cached by %u:%u:%llu\n", block, cex->ec_block, cex->ec_len, cex->ec_start); - return cex->ec_type; + ret = cex->ec_type; } ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:23 ` Theodore Tso @ 2009-05-14 21:33 ` Kevin Shanahan 2009-05-15 23:18 ` Kevin Shanahan 2009-05-15 1:21 ` Eric Sandeen ` (2 subsequent siblings) 3 siblings, 1 reply; 35+ messages in thread From: Kevin Shanahan @ 2009-05-14 21:33 UTC (permalink / raw) To: Theodore Tso; +Cc: Andreas Dilger, Alex Tomas, linux-ext4 On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > short circuit i_cached_extent patch, mounted the fs without > > nodelalloc. I was able to run the full exchange backup without > > triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... Hehe, no problem. Will do the final testing shortly (ran out of time this morning, users are back on the system now). Just one little correction to your patch below: > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. > > Many thanks to Ken Shannah for helping to track down this problem. ^^^^^^^^^^^ Cheers, Kevin Shanahan. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:33 ` Kevin Shanahan @ 2009-05-15 23:18 ` Kevin Shanahan 0 siblings, 0 replies; 35+ messages in thread From: Kevin Shanahan @ 2009-05-15 23:18 UTC (permalink / raw) To: Theodore Tso; +Cc: Andreas Dilger, Alex Tomas, linux-ext4 On Fri, May 15, 2009 at 07:03:14AM +0930, Kevin Shanahan wrote: > On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > > short circuit i_cached_extent patch, mounted the fs without > > > nodelalloc. I was able to run the full exchange backup without > > > triggering the check_block_validity error. > > > > Great! > > > > So here's the final fix (it replaces the short circuit i_cached_extent > > patch) which I plan to push to Linus. It should be much less of a > > performance hit than simply short-circuiting i_cached_extent... > > > > Thanks so much for helping to find track this down!!! If ever someone > > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > > you.... > > Hehe, no problem. Will do the final testing shortly (ran out of time > this morning, users are back on the system now). Just one little > correction to your patch below: > > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > > Author: Theodore Ts'o <tytso@mit.edu> > > Date: Thu May 14 17:09:37 2009 -0400 > > > > ext4: Fix race in ext4_inode_info.i_cached_extent > > > > If one CPU is reading from a file while another CPU is writing to the > > same file different locations, there is nothing protecting the > > i_cached_extent structure from being used and updated at the same > > time. This could potentially cause the wrong location on disk to be > > read or written to, including potentially causing the corruption of > > the block group descriptors and/or inode table. > > > > Many thanks to Ken Shannah for helping to track down this problem. > ^^^^^^^^^^^ Tested-by: Kevin Shanahan <kmshanah@ucwb.org.au> Yes, this patch seems to have fixed the issue. I ran my "exchange backup to samba share" test on 2.6.29.3 + check_block_validity patch + the fix race patch with no problems. Cheers, Kevin. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:23 ` Theodore Tso 2009-05-14 21:33 ` Kevin Shanahan @ 2009-05-15 1:21 ` Eric Sandeen 2009-05-15 12:50 ` Theodore Tso 2009-05-15 4:55 ` Aneesh Kumar K.V 2009-05-19 10:00 ` Thierry Vignaud 3 siblings, 1 reply; 35+ messages in thread From: Eric Sandeen @ 2009-05-15 1:21 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: >> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + >> short circuit i_cached_extent patch, mounted the fs without >> nodelalloc. I was able to run the full exchange backup without >> triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted So here's a fio job I thought would try to hit this ... [global] ioengine=libaio iodepth=1 filesize=4g bs=1m norandommap direct=1 loops=500 size=16g [setup] filename=file rw=randwrite loops=1 [thread1] stonewall filename=file rw=randwrite [thread2] filename=file rw=randread it should lay out a 4g file in random 1m direct IOs to fragment it and get a lot of extents, then launch 2 threads, one each doing random reads and random writes of that same file. I can't make this trip it, though ... -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 1:21 ` Eric Sandeen @ 2009-05-15 12:50 ` Theodore Tso 2009-05-15 12:58 ` Eric Sandeen 0 siblings, 1 reply; 35+ messages in thread From: Theodore Tso @ 2009-05-15 12:50 UTC (permalink / raw) To: Eric Sandeen; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: > > it should lay out a 4g file in random 1m direct IOs to fragment it and > get a lot of extents, then launch 2 threads, one each doing random reads > and random writes of that same file. > > I can't make this trip it, though ... If all of the blocks are in the page cache, you won't end up calling ext4_get_blocks(). Try adding a shell script which runs in parallel doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 12:50 ` Theodore Tso @ 2009-05-15 12:58 ` Eric Sandeen 2009-05-15 15:24 ` Eric Sandeen 0 siblings, 1 reply; 35+ messages in thread From: Eric Sandeen @ 2009-05-15 12:58 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 Theodore Tso wrote: > On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >> it should lay out a 4g file in random 1m direct IOs to fragment it and >> get a lot of extents, then launch 2 threads, one each doing random reads >> and random writes of that same file. >> >> I can't make this trip it, though ... > > If all of the blocks are in the page cache, you won't end up calling > ext4_get_blocks(). Try adding a shell script which runs in parallel > doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". > > - Ted I made sure it was a big enough file, and consumed enough memory on the system before the test, that the entire file couldn't fit in memory. I can try doing the dropping in the bg ... but it should have been going to disk already. -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 12:58 ` Eric Sandeen @ 2009-05-15 15:24 ` Eric Sandeen 2009-05-15 16:27 ` Eric Sandeen 0 siblings, 1 reply; 35+ messages in thread From: Eric Sandeen @ 2009-05-15 15:24 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 Eric Sandeen wrote: > Theodore Tso wrote: >> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >>> it should lay out a 4g file in random 1m direct IOs to fragment it and >>> get a lot of extents, then launch 2 threads, one each doing random reads >>> and random writes of that same file. >>> >>> I can't make this trip it, though ... >> If all of the blocks are in the page cache, you won't end up calling >> ext4_get_blocks(). Try adding a shell script which runs in parallel >> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". >> >> - Ted > > I made sure it was a big enough file, and consumed enough memory on the > system before the test, that the entire file couldn't fit in memory. > > I can try doing the dropping in the bg ... but it should have been going > to disk already. > > -Eric in a desperate attempt to show the window, I tried this in ext4_ext_put_in_cache(): cex->ec_block = -1; cex->ec_start = -1; schedule_timeout_uninterruptible(HZ/2); cex->ec_start = start; cex->ec_block = block; and this in ext4_ext_in_cache(): if (cex->ec_block == -1 || cex->ec_start == -1) printk("%s got bad cache\n", __func__); and it's not firing. -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 15:24 ` Eric Sandeen @ 2009-05-15 16:27 ` Eric Sandeen 0 siblings, 0 replies; 35+ messages in thread From: Eric Sandeen @ 2009-05-15 16:27 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 Eric Sandeen wrote: > Eric Sandeen wrote: >> Theodore Tso wrote: >>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >>>> it should lay out a 4g file in random 1m direct IOs to fragment it and >>>> get a lot of extents, then launch 2 threads, one each doing random reads >>>> and random writes of that same file. >>>> >>>> I can't make this trip it, though ... >>> If all of the blocks are in the page cache, you won't end up calling >>> ext4_get_blocks(). Try adding a shell script which runs in parallel >>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". >>> >>> - Ted >> I made sure it was a big enough file, and consumed enough memory on the >> system before the test, that the entire file couldn't fit in memory. >> >> I can try doing the dropping in the bg ... but it should have been going >> to disk already. >> >> -Eric > > in a desperate attempt to show the window, I tried this in > ext4_ext_put_in_cache(): > > cex->ec_block = -1; > cex->ec_start = -1; > schedule_timeout_uninterruptible(HZ/2); > cex->ec_start = start; > cex->ec_block = block; > > and this in ext4_ext_in_cache(): > > if (cex->ec_block == -1 || cex->ec_start == -1) > printk("%s got bad cache\n", __func__); > > and it's not firing. I take it back, needed a different workload. Sorry for being pedantic, but if this race is so blindingly obvious and we're getting so few reports, I wanted to be sure we could hit it. With my artificially wide window now I think I can see it, but I'm still not winding up with any corruption or EIOs.... -Eric ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:23 ` Theodore Tso 2009-05-14 21:33 ` Kevin Shanahan 2009-05-15 1:21 ` Eric Sandeen @ 2009-05-15 4:55 ` Aneesh Kumar K.V 2009-05-15 10:11 ` Theodore Tso 2009-05-15 13:07 ` Theodore Tso 2009-05-19 10:00 ` Thierry Vignaud 3 siblings, 2 replies; 35+ messages in thread From: Aneesh Kumar K.V @ 2009-05-15 4:55 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > short circuit i_cached_extent patch, mounted the fs without > > nodelalloc. I was able to run the full exchange backup without > > triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. It should be multiple readers. We don't allow read/write or multiple writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed to protect read/write and multiple writers. What it allowed was multiple readers(get_block call with create = 0). And readers did cache the extent information which it read from the disk. So the fix is correct, but we need to update the commit message. Reviewed-by:Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > > Many thanks to Ken Shannah for helping to track down this problem. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index 172656c..e3a55eb 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, > { > struct ext4_ext_cache *cex; > BUG_ON(len == 0); > + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); > cex = &EXT4_I(inode)->i_cached_extent; > cex->ec_type = type; > cex->ec_block = block; > cex->ec_len = len; > cex->ec_start = start; > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > } > > /* > @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, > struct ext4_extent *ex) > { > struct ext4_ext_cache *cex; > + int ret = EXT4_EXT_CACHE_NO; > > + /* > + * We borrow i_block_reservation_lock to protect i_cached_extent > + */ > + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); > cex = &EXT4_I(inode)->i_cached_extent; > > /* has cache valid data? */ > if (cex->ec_type == EXT4_EXT_CACHE_NO) > - return EXT4_EXT_CACHE_NO; > + goto errout; > > BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP && > cex->ec_type != EXT4_EXT_CACHE_EXTENT); > @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, > ext_debug("%u cached by %u:%u:%llu\n", > block, > cex->ec_block, cex->ec_len, cex->ec_start); > - return cex->ec_type; > + ret = cex->ec_type; > } > - > - /* not in cache */ > - return EXT4_EXT_CACHE_NO; > +errout: > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > + return ret; > } > > /* > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 4:55 ` Aneesh Kumar K.V @ 2009-05-15 10:11 ` Theodore Tso 2009-05-15 13:07 ` Theodore Tso 1 sibling, 0 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-15 10:11 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote: > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > > Author: Theodore Ts'o <tytso@mit.edu> > > Date: Thu May 14 17:09:37 2009 -0400 > > > > ext4: Fix race in ext4_inode_info.i_cached_extent > > > > If one CPU is reading from a file while another CPU is writing to the > > same file different locations, there is nothing protecting the > > i_cached_extent structure from being used and updated at the same > > time. This could potentially cause the wrong location on disk to be > > read or written to, including potentially causing the corruption of > > the block group descriptors and/or inode table. > > > It should be multiple readers. We don't allow read/write or multiple > writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed > to protect read/write and multiple writers. What it allowed was > multiple readers(get_block call with create = 0). And readers did cache > the extent information which it read from the disk. So the fix is > correct, but we need to update the commit message. Yes, you're right. The i_data_sem would have protected us from the worst of these problems. The commit message isn't totally wrong, though, since even a writer will call ext4_ext_get_blocks() with create == 0. But we should describe the failure mode much more accurately, and say that it is two callers of ext4_ext_get_blocks() with create=0 racing against one another. It's possible, perhaps even likely, that one of those callers came from a call of ext4_get_blocks() with create=1 --- given that the people who were reporting this were doing so when doing backup jobs. However, one could imagine multiple mysql (or DB2) threads writing into random parts of a database, and if two CPU's try to bmap already allocated (and initialized) blocks out of the database file, so that ext4_ext_get_blocks() is always being called with create=0, you could very easily end up with a situation where blocks would be written to the wrong location on disk. <<shiver>> We should make sure that all of the enterprise distributions that are shipping ext4 as a "technology preview" get a copy of this patch as a fixing a high-priority bug, since a database workload should trip this pretty easily. - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 4:55 ` Aneesh Kumar K.V 2009-05-15 10:11 ` Theodore Tso @ 2009-05-15 13:07 ` Theodore Tso 1 sibling, 0 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-15 13:07 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote: > So the fix is correct, but we need to update the commit message. Here's the updated commit message. - Ted ext4: Fix race in ext4_inode_info.i_cached_extent If two CPU's simultaneously call ext4_ext_get_blocks() at the same time, there is nothing protecting the i_cached_extent structure from being used and updated at the same time. This could potentially cause the wrong location on disk to be read or written to, including potentially causing the corruption of the block group descriptors and/or inode table. This bug has been in the ext4 code since almost the very beginning of ext4's development. Fortunately once the data is stored in the page cache cache, ext4_get_blocks() doesn't need to be called, so trying to replicate this problem to the point where we could identify its root cause was *extremely* difficult. Many thanks to Kevin Shanahan for working over several months to be able to reproduce this easily so we could finally nail down the cause of the corruption. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 21:23 ` Theodore Tso ` (2 preceding siblings ...) 2009-05-15 4:55 ` Aneesh Kumar K.V @ 2009-05-19 10:00 ` Thierry Vignaud 2009-05-19 11:36 ` Theodore Tso 3 siblings, 1 reply; 35+ messages in thread From: Thierry Vignaud @ 2009-05-19 10:00 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 Theodore Tso <tytso@mit.edu> writes: > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. > > Many thanks to Ken Shannah for helping to track down this problem. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> I wonder if that would explain the corruption I reported a couple weeks ago. Now I remember I wrongly got 2 parallel cp from the same source directory to the same target directory. Could this be the cause? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-19 10:00 ` Thierry Vignaud @ 2009-05-19 11:36 ` Theodore Tso 2009-05-19 12:01 ` Alex Tomas 0 siblings, 1 reply; 35+ messages in thread From: Theodore Tso @ 2009-05-19 11:36 UTC (permalink / raw) To: Thierry Vignaud; +Cc: Kevin Shanahan, Andreas Dilger, Alex Tomas, linux-ext4 On Tue, May 19, 2009 at 12:00:56PM +0200, Thierry Vignaud wrote: > I wonder if that would explain the corruption I reported a couple weeks > ago. > > Now I remember I wrongly got 2 parallel cp from the same source > directory to the same target directory. > > Could this be the cause? Yes, it's possible. There could be a problem either if the two cp's tried targetting the same file at the same time, or if the directory was getting expanded at the same time by the two different processes. Since we don't cache logical->physical mapping for directories (since because of the journalling requirements directories are stored in the buffer cache, not the page cache), rather more likely to run into problems with directories; more so since directoris will tend to be fragmented. On the other hand, most of the time writes into the directory will tend to be into pre-existing free space; but if you had two parallel cp's copying a large number of files into the same directory, that could certainly happen. - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-19 11:36 ` Theodore Tso @ 2009-05-19 12:01 ` Alex Tomas 2009-05-19 15:04 ` Theodore Tso 0 siblings, 1 reply; 35+ messages in thread From: Alex Tomas @ 2009-05-19 12:01 UTC (permalink / raw) To: Theodore Tso; +Cc: Thierry Vignaud, Kevin Shanahan, Andreas Dilger, linux-ext4 Theodore Tso wrote: > On the other hand, most of the time writes into the directory will > tend to be into pre-existing free space; but if you had two parallel > cp's copying a large number of files into the same directory, that > could certainly happen. isn't access to a directory protected with i_mutex ? thanks, Alex ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-19 12:01 ` Alex Tomas @ 2009-05-19 15:04 ` Theodore Tso 2009-05-19 15:16 ` Alex Tomas 2009-05-19 15:18 ` Thierry Vignaud 0 siblings, 2 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-19 15:04 UTC (permalink / raw) To: Alex Tomas; +Cc: Thierry Vignaud, Kevin Shanahan, Andreas Dilger, linux-ext4 On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote: > Theodore Tso wrote: >> On the other hand, most of the time writes into the directory will >> tend to be into pre-existing free space; but if you had two parallel >> cp's copying a large number of files into the same directory, that >> could certainly happen. > > isn't access to a directory protected with i_mutex ? Hmm, good point. Yes, that should prevent problems with directories. So there should only be a problem when two processes are writing to the same file at the same time. - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-19 15:04 ` Theodore Tso @ 2009-05-19 15:16 ` Alex Tomas 2009-05-19 15:18 ` Thierry Vignaud 1 sibling, 0 replies; 35+ messages in thread From: Alex Tomas @ 2009-05-19 15:16 UTC (permalink / raw) To: Theodore Tso; +Cc: Thierry Vignaud, Kevin Shanahan, Andreas Dilger, linux-ext4 Theodore Tso wrote: > On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote: >> Theodore Tso wrote: >>> On the other hand, most of the time writes into the directory will >>> tend to be into pre-existing free space; but if you had two parallel >>> cp's copying a large number of files into the same directory, that >>> could certainly happen. >> isn't access to a directory protected with i_mutex ? > > Hmm, good point. Yes, that should prevent problems with directories. > So there should only be a problem when two processes are writing to > the same file at the same time. I guess reading can corrupt it as well ? thanks, Alex ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-19 15:04 ` Theodore Tso 2009-05-19 15:16 ` Alex Tomas @ 2009-05-19 15:18 ` Thierry Vignaud 1 sibling, 0 replies; 35+ messages in thread From: Thierry Vignaud @ 2009-05-19 15:18 UTC (permalink / raw) To: Theodore Tso; +Cc: Alex Tomas, Kevin Shanahan, Andreas Dilger, linux-ext4 Theodore Tso <tytso@mit.edu> writes: > >> On the other hand, most of the time writes into the directory will > >> tend to be into pre-existing free space; but if you had two parallel > >> cp's copying a large number of files into the same directory, that > >> could certainly happen. > > > > isn't access to a directory protected with i_mutex ? > > Hmm, good point. Yes, that should prevent problems with directories. > So there should only be a problem when two processes are writing to > the same file at the same time. I remember that happened... ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 16:12 ` Theodore Tso 2009-05-14 21:02 ` Kevin Shanahan @ 2009-05-15 3:57 ` Alex Tomas 1 sibling, 0 replies; 35+ messages in thread From: Alex Tomas @ 2009-05-15 3:57 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, Andreas Dilger, linux-ext4 when cache was introduced single exclusive spinlock protect whole ext3_ext_get_blocks and there was no concurrency at all. so I guess your theory is correct. thanks, Alex Theodore Tso wrote: > On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote: >>> debugfs: stat <759> >> hermes:~# debugfs /dev/dm-0 >> debugfs 1.41.3 (12-Oct-2008) >> debugfs: stat <759> >> >> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000 >> Generation: 3979120103 Version: 0x00000000:00000001 >> User: 0 Group: 10140 Size: 14615630848 >> File ACL: 0 Directory ACL: 0 >> Links: 1 Blockcount: 28546168 >> Fragment: Address: 0 Number: 0 Size: 0 >> ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 >> atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 >> mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009 >> crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009 > >> Inode Pathname >> 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf > > Do you know how the system was likely writing into > /local/dumps/exhcnag/eexchange-2000-UCWB-KVM-18.bkf? What this a > backup via rsync or tar? Was this some application writing into a > pre-existing file via NFS, or via local disk access? > > Given the ctime/atime fields, I'm inclined to guess the latter, but it > would be good to know. > > The stat dump for the inode 759 does *not* show logical block 1741329 > getting mapped to physical block 529. So the question is how did that > happen? > > I've started looking, and one thing popped up at me. I need to check > in with the Lustre folks who originally donated the code, but I don't > see any spinlock or mutexes protecting the inode's extent cache. So > if you are on an SMP machine, this could potentially have caused the > problem. How many CPU's or cores do you have? What does > /proc/cpuinfo report? Also, would it be correct to assume this file > is getting served up via Samba. My theory is that we might be running > into problems when two threads are simultaneously trying read and > write to a single file at the same time. > > Hmm, what is accessing your files on this system? Are you just doing > backups? Is it just a backup server? Or are you serving up files > using Samba and there are clients which are accessing those files? > > So if this the problem the following experiment should be able to > confirm whether it's the problem, by seeing if the problem goes away > if we short-circuit the inode's extent cache. In fs/ext4/extents.c, > try inserting a "return" statement to in ext4_ext_put_in_cache(): > > static void > ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, > __u32 len, ext4_fsblk_t start, int type) > { > struct ext4_ext_cache *cex; > > return; <---- insert this line > BUG_ON(len == 0); > cex = &EXT4_I(inode)->i_cached_extent; > cex->ec_type = type; > cex->ec_block = block; > cex->ec_len = len; > cex->ec_start = start; > } > > This should short circuit the i_cached_extent cache, and this may be > enough to make your problem go away. (If this theory is correct, > using mount -o nodelalloc probably won't make a difference, although > it might change the timing enough to make the bug harder to see.) > > If that solves the problem, the right long-term fix will be to drop bin > a spinlock to protect i_cached_extent. > > - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-14 4:40 ` Theodore Tso 2009-05-14 11:07 ` Kevin Shanahan @ 2009-05-15 4:58 ` Aneesh Kumar K.V 2009-05-15 10:27 ` Theodore Tso 1 sibling, 1 reply; 35+ messages in thread From: Aneesh Kumar K.V @ 2009-05-15 4:58 UTC (permalink / raw) To: Theodore Tso; +Cc: Kevin Shanahan, linux-ext4 On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote: > On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote: > > > > Now, this is (possibly) interesting - that block contains a bunch of > > file data. Looks like a html email (I can tell it's email because of > > the FIXED_ prefix added to the tags by the mail sanitizer). > > That is interesting. Knowing that it is file data means we might be > able to track things down this way. > > Are you able to apply the following patch to your kernel? If so, > hopefully we'll be able to catch whatever is causing the problem in > the act. > > - Ted > > commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 00:39:48 2009 -0400 > > ext4: Add a block validity check to ext4_get_blocks_wrap() > > A few users with very large disks have been reporting low block number > filesystem corruptions, potentially zapping the block group > descriptors or inodes in the first inode table block. It's not clear > what is causing this, but most recently, it appears that whatever is > trashing the filesystem metadata appears to be file data. So let's > try to set a trap for the corruption in ext4_get_blocks_wrap(), which > is where logical blocks in an inode are mapped to physical blocks. We already do block validation in allocator. In ext4_mb_mark_diskspace_used we check whether the block allocator is wrongly getting blocks from system zone. So i guess we don't need this patch. Or may be we need to remove the check in ext4_mb_mark_diskspace_used and add the ckeck in get_block so that it catches the extent cache corruption as found by this bug. -aneesh > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 4e7f363..4fad4fa 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) > ext4_discard_preallocations(inode); > } > > +static int check_block_validity(struct inode *inode, sector_t logical, > + sector_t phys, int len) > +{ > + ext4_fsblk_t valid_block, itable_block; > + struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; > + struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0); > + unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group; > + > + valid_block = le32_to_cpu(es->s_first_data_block) + > + EXT4_SB(inode->i_sb)->s_gdb_count; > + itable_block = ext4_inode_table(inode->i_sb, gdp); > + > + if (unlikely((phys <= valid_block) || > + ((phys + len - 1) > ext4_blocks_count(es)) || > + ((phys >= itable_block) && > + (phys <= itable_block + itable_len)) || > + ((phys + len - 1 >= itable_block) && > + (phys + len - 1 <= itable_block + itable_len)))) { > + ext4_error(inode->i_sb, "check_block_validity", > + "inode #%lu logical block %llu mapped to %llu " > + "(size %d)", inode->i_ino, > + (unsigned long long) logical, > + (unsigned long long) phys, len); > + WARN_ON(1); > + return -EIO; > + } > + return 0; > +} > + > /* > * The ext4_get_blocks_wrap() function try to look up the requested blocks, > * and returns if the blocks are already mapped. > @@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > } > up_read((&EXT4_I(inode)->i_data_sem)); > > + if (retval > 0 && buffer_mapped(bh)) { > + int ret = check_block_validity(inode, block, > + bh->b_blocknr, retval); > + if (ret != 0) > + return ret; > + } > + > /* If it is only a block(s) look up */ > if (!create) > return retval; > @@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > } > > up_write((&EXT4_I(inode)->i_data_sem)); > + if (retval > 0 && buffer_mapped(bh)) { > + int ret = check_block_validity(inode, block, > + bh->b_blocknr, retval); > + if (ret != 0) > + return ret; > + } > return retval; > } > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: More ext4 acl/xattr corruption - 4th occurence now 2009-05-15 4:58 ` Aneesh Kumar K.V @ 2009-05-15 10:27 ` Theodore Tso 2009-05-18 2:14 ` [PATCH] ext4: Add a comprehensive block validity check to ext4_get_blocks() (Was: More ext4 acl/xattr corruption - 4th occurence now) Theodore Tso 0 siblings, 1 reply; 35+ messages in thread From: Theodore Tso @ 2009-05-15 10:27 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Kevin Shanahan, linux-ext4 On Fri, May 15, 2009 at 10:28:27AM +0530, Aneesh Kumar K.V wrote: > > commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13 > > Author: Theodore Ts'o <tytso@mit.edu> > > Date: Thu May 14 00:39:48 2009 -0400 > > > > ext4: Add a block validity check to ext4_get_blocks_wrap() > > > > A few users with very large disks have been reporting low block number > > filesystem corruptions, potentially zapping the block group > > descriptors or inodes in the first inode table block. It's not clear > > what is causing this, but most recently, it appears that whatever is > > trashing the filesystem metadata appears to be file data. So let's > > try to set a trap for the corruption in ext4_get_blocks_wrap(), which > > is where logical blocks in an inode are mapped to physical blocks. > > > We already do block validation in allocator. In > ext4_mb_mark_diskspace_used we check whether the block allocator is > wrongly getting blocks from system zone. So i guess we don't need > this patch. Or may be we need to remove the check in > ext4_mb_mark_diskspace_used and add the ckeck in get_block so that > it catches the extent cache corruption as found by this bug. Yeah, I was debating what to do with this patch long term. It catches problems than the sanity checks in ext4_mb_mark_diskspace_used() in mballoc.c, or ext4_valid_extent() in ext4/extents.c will miss. However, (a) it's expensive to do these sorts of tests, and (b) it only checks for problems in the primary block group descriptors and first block group's inode tables, since this is where we had been reporting problems. The other problem is that ext4_mb_mark_diskspace_used()'s tests assume !flex_bg, and with flex_bg, its checks are largely ineffective. What I think the right approach probably will be is that we need to build an rbtree of the "system zone" blocks at mount time. Especially with flex_bg, in the normal case the block group descriptors and bitmap blocks will form a very nice, contiguous set of extents that should be relatively compactly stored in an rbtree. This will allow us to make the tests done by ext4_valid_extent(), ext4_mb_mark_diskspace_used(), and ext4_get_blocks() to be ***much*** more effective and comprehensive. This still doesn't answer the question of where we should be doing the tests. If we're trying to track down a bug, we probably want to be doing these tests everywhere. For performance reasons, though, my guess is that *most* of the tests should be default disabled except unless a mount option is given. (Why a run-time mount option as opposed a compile-time tests? So that if this bug happens in the field, a Level 3 engineer from RHEL's, SLES's, or IBM's help desk can tell the customer to mount with this option, and track it down without requesting that the user install a custom kernel; having done some field work, I can tell you this will save a huge amount of time/effort.) I would probably keep the tests in ext4_valid_extent() so we can test for on-disk corruption, but we would probably want to turn off the tests at allocation and ext4_get_blocks() for performance reasons --- at least until someone has a chance to do some real benchmarking tests and confirms whether or not the extra CPU utilization is visible on say a TPC-H or TPC-C run. - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ext4: Add a comprehensive block validity check to ext4_get_blocks() (Was: More ext4 acl/xattr corruption - 4th occurence now) 2009-05-15 10:27 ` Theodore Tso @ 2009-05-18 2:14 ` Theodore Tso 0 siblings, 0 replies; 35+ messages in thread From: Theodore Tso @ 2009-05-18 2:14 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Kevin Shanahan, linux-ext4 On Fri, May 15, 2009 at 06:27:19AM -0400, Theodore Tso wrote: > > The other problem is that ext4_mb_mark_diskspace_used()'s tests assume > !flex_bg, and with flex_bg, its checks are largely ineffective. > > What I think the right approach probably will be is that we need to > build an rbtree of the "system zone" blocks at mount time. Especially > with flex_bg, in the normal case the block group descriptors and > bitmap blocks will form a very nice, contiguous set of extents that > should be relatively compactly stored in an rbtree. This will allow > us to make the tests done by ext4_valid_extent(), > ext4_mb_mark_diskspace_used(), and ext4_get_blocks() to be ***much*** > more effective and comprehensive. Here's the patch which I've written up that allows ext4_valid_extent(), ext4_mb_mark_displace_used(), __ext4_check_blockref(), and ext4_get_blocks() to use the same infrastructure. Normally, they will all just check to make sure that block is between s_first_data_blocks and s_blocks_count. But if the mount option block_validity is enabled, the ext4 filesystem will build an rbtree of all of the filesystem metadata blocks, and ext4_data_block_valid() will do a full check make sure the block is valid. I suspect we could afford to make block_validity the default, but it does burn slightly more memory, and will take a little bit of CPU. It would be nice to benchmark the results and see if it really is noticeable. - Ted commit 93c1e4487e365bc637bf1fd38604817ad6cdf13a Author: Theodore Ts'o <tytso@mit.edu> Date: Sun May 17 15:38:01 2009 -0400 ext4: Add a comprehensive block validity check to ext4_get_blocks() To catch filesystem bugs or corruption which could lead to the filesystem getting severly damaged, this patch adds a facility for tracking all of the filesystem metadata blocks by contiguous regions in a red-black tree. This allows quick searching of the tree to locate extents which might overlap with filesystem metadata blocks. This facility is also used by the multi-block allocator to assure that it is not allocating blocks out of the system zone, as well as by the routines used when reading indirect blocks and extents information from disk to make sure their contents are valid. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile index a8ff003..8a34710 100644 --- a/fs/ext4/Makefile +++ b/fs/ext4/Makefile @@ -5,8 +5,8 @@ obj-$(CONFIG_EXT4_FS) += ext4.o ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ - ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \ - ext4_jbd2.o migrate.o mballoc.o + ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \ + ext4_jbd2.o migrate.o mballoc.o block_validity.o ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c new file mode 100644 index 0000000..50784ef --- /dev/null +++ b/fs/ext4/block_validity.c @@ -0,0 +1,244 @@ +/* + * linux/fs/ext4/block_validity.c + * + * Copyright (C) 2009 + * Theodore Ts'o (tytso@mit.edu) + * + * Track which blocks in the filesystem are metadata blocks that + * should never be used as data blocks by files or directories. + */ + +#include <linux/time.h> +#include <linux/fs.h> +#include <linux/namei.h> +#include <linux/quotaops.h> +#include <linux/buffer_head.h> +#include <linux/module.h> +#include <linux/swap.h> +#include <linux/pagemap.h> +#include <linux/version.h> +#include <linux/blkdev.h> +#include <linux/mutex.h> +#include "ext4.h" + +struct ext4_system_zone { + struct rb_node node; + ext4_fsblk_t start_blk; + unsigned int count; +}; + +static struct kmem_cache *ext4_system_zone_cachep; + +int __init init_ext4_system_zone(void) +{ + ext4_system_zone_cachep = KMEM_CACHE(ext4_system_zone, + SLAB_RECLAIM_ACCOUNT); + if (ext4_system_zone_cachep == NULL) + return -ENOMEM; + return 0; +} + +void exit_ext4_system_zone(void) +{ + kmem_cache_destroy(ext4_system_zone_cachep); +} + +static inline int can_merge(struct ext4_system_zone *entry1, + struct ext4_system_zone *entry2) +{ + if ((entry1->start_blk + entry1->count) == entry2->start_blk) + return 1; + return 0; +} + +/* + * Mark a range of blocks as belonging to the "system zone" --- that + * is, filesystem metadata blocks which should never be used by + * inodes. + */ +static int add_system_zone(struct ext4_sb_info *sbi, + ext4_fsblk_t start_blk, + unsigned int count) +{ + struct ext4_system_zone *new_entry = NULL, *entry; + struct rb_node **n = &sbi->system_blks.rb_node, *node; + struct rb_node *parent = NULL, *new_node = NULL; + + while (*n) { + parent = *n; + entry = rb_entry(parent, struct ext4_system_zone, node); + if (start_blk < entry->start_blk) + n = &(*n)->rb_left; + else if (start_blk >= (entry->start_blk + entry->count)) + n = &(*n)->rb_right; + else { + if (start_blk + count > (entry->start_blk + + entry->count)) + entry->count = (start_blk + count - + entry->start_blk); + new_node = *n; + new_entry = rb_entry(new_node, struct ext4_system_zone, + node); + break; + } + } + + if (!new_entry) { + new_entry = kmem_cache_alloc(ext4_system_zone_cachep, + GFP_KERNEL); + if (!new_entry) + return -ENOMEM; + new_entry->start_blk = start_blk; + new_entry->count = count; + new_node = &new_entry->node; + + rb_link_node(new_node, parent, n); + rb_insert_color(new_node, &sbi->system_blks); + } + + /* Can we merge to the left? */ + node = rb_prev(new_node); + if (node) { + entry = rb_entry(node, struct ext4_system_zone, node); + if (can_merge(entry, new_entry)) { + new_entry->start_blk = entry->start_blk; + new_entry->count += entry->count; + rb_erase(node, &sbi->system_blks); + kmem_cache_free(ext4_system_zone_cachep, entry); + } + } + + /* Can we merge to the right? */ + node = rb_next(new_node); + if (node) { + entry = rb_entry(node, struct ext4_system_zone, node); + if (can_merge(new_entry, entry)) { + new_entry->count += entry->count; + rb_erase(node, &sbi->system_blks); + kmem_cache_free(ext4_system_zone_cachep, entry); + } + } + return 0; +} + +static void debug_print_tree(struct ext4_sb_info *sbi) +{ + struct rb_node *node; + struct ext4_system_zone *entry; + int first = 1; + + printk(KERN_INFO "System zones: "); + node = rb_first(&sbi->system_blks); + while (node) { + entry = rb_entry(node, struct ext4_system_zone, node); + printk("%s%llu-%llu", first ? "" : ", ", + entry->start_blk, entry->start_blk + entry->count - 1); + first = 0; + node = rb_next(node); + } + printk("\n"); +} + +int ext4_setup_system_zone(struct super_block *sb) +{ + ext4_group_t ngroups = ext4_get_groups_count(sb); + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_group_desc *gdp; + ext4_group_t i; + int flex_size = ext4_flex_bg_size(sbi); + int ret; + + if (!test_opt(sb, BLOCK_VALIDITY)) { + if (EXT4_SB(sb)->system_blks.rb_node) + ext4_release_system_zone(sb); + return 0; + } + if (EXT4_SB(sb)->system_blks.rb_node) + return 0; + + for (i=0; i < ngroups; i++) { + if (ext4_bg_has_super(sb, i) && + ((i < 5) || ((i % flex_size) == 0))) + add_system_zone(sbi, ext4_group_first_block_no(sb, i), + sbi->s_gdb_count + 1); + gdp = ext4_get_group_desc(sb, i, NULL); + ret = add_system_zone(sbi, ext4_block_bitmap(sb, gdp), 1); + if (ret) + return ret; + ret = add_system_zone(sbi, ext4_inode_bitmap(sb, gdp), 1); + if (ret) + return ret; + ret = add_system_zone(sbi, ext4_inode_table(sb, gdp), + sbi->s_itb_per_group); + if (ret) + return ret; + } + + if (test_opt(sb, DEBUG)) + debug_print_tree(EXT4_SB(sb)); + return 0; +} + +/* Called when the filesystem is unmounted */ +void ext4_release_system_zone(struct super_block *sb) +{ + struct rb_node *n = EXT4_SB(sb)->system_blks.rb_node; + struct rb_node *parent; + struct ext4_system_zone *entry; + + while (n) { + /* Do the node's children first */ + if (n->rb_left) { + n = n->rb_left; + continue; + } + if (n->rb_right) { + n = n->rb_right; + continue; + } + /* + * The node has no children; free it, and then zero + * out parent's link to it. Finally go to the + * beginning of the loop and try to free the parent + * node. + */ + parent = rb_parent(n); + entry = rb_entry(n, struct ext4_system_zone, node); + kmem_cache_free(ext4_system_zone_cachep, entry); + if (!parent) + EXT4_SB(sb)->system_blks.rb_node = NULL; + else if (parent->rb_left == n) + parent->rb_left = NULL; + else if (parent->rb_right == n) + parent->rb_right = NULL; + n = parent; + } + EXT4_SB(sb)->system_blks.rb_node = NULL; +} + +/* + * Returns 1 if the passed-in block region (start_blk, + * start_blk+count) is valid; 0 if some part of the block region + * overlaps with filesystem metadata blocks. + */ +int ext4_data_block_valid(struct ext4_sb_info *sbi, ext4_fsblk_t start_blk, + unsigned int count) +{ + struct ext4_system_zone *entry; + struct rb_node *n = sbi->system_blks.rb_node; + + if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) || + (start_blk + count > ext4_blocks_count(sbi->s_es))) + return 0; + while (n) { + entry = rb_entry(n, struct ext4_system_zone, node); + if (start_blk + count - 1 < entry->start_blk) + n = n->rb_left; + else if (start_blk >= (entry->start_blk + entry->count)) + n = n->rb_right; + else + return 0; + } + return 1; +} + diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index d164f12..4311cc8 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -696,6 +696,7 @@ struct ext4_inode_info { #define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */ #define EXT4_MOUNT_DELALLOC 0x8000000 /* Delalloc support */ #define EXT4_MOUNT_DATA_ERR_ABORT 0x10000000 /* Abort on file data write */ +#define EXT4_MOUNT_BLOCK_VALIDITY 0x20000000 /* Block validity checking */ /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -887,6 +888,7 @@ struct ext4_sb_info { int s_jquota_fmt; /* Format of quota to use */ #endif unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */ + struct rb_root system_blks; #ifdef EXTENTS_STATS /* ext4 extents stats */ @@ -1618,6 +1620,15 @@ extern struct dentry *ext4_get_parent(struct dentry *child); extern const struct inode_operations ext4_symlink_inode_operations; extern const struct inode_operations ext4_fast_symlink_inode_operations; +/* block_validity */ +extern void ext4_release_system_zone(struct super_block *sb); +extern int ext4_setup_system_zone(struct super_block *sb); +extern int __init init_ext4_system_zone(void); +extern void exit_ext4_system_zone(void); +extern int ext4_data_block_valid(struct ext4_sb_info *sbi, + ext4_fsblk_t start_blk, + unsigned int count); + /* extents.c */ extern int ext4_ext_tree_init(handle_t *handle, struct inode *); extern int ext4_ext_writepage_trans_blocks(struct inode *, int); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 27c383c..d04b779 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -326,32 +326,18 @@ ext4_ext_max_entries(struct inode *inode, int depth) static int ext4_valid_extent(struct inode *inode, struct ext4_extent *ext) { - ext4_fsblk_t block = ext_pblock(ext), valid_block; + ext4_fsblk_t block = ext_pblock(ext); int len = ext4_ext_get_actual_len(ext); - struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; - valid_block = le32_to_cpu(es->s_first_data_block) + - EXT4_SB(inode->i_sb)->s_gdb_count; - if (unlikely(block <= valid_block || - ((block + len) > ext4_blocks_count(es)))) - return 0; - else - return 1; + return ext4_data_block_valid(EXT4_SB(inode->i_sb), block, len); } static int ext4_valid_extent_idx(struct inode *inode, struct ext4_extent_idx *ext_idx) { - ext4_fsblk_t block = idx_pblock(ext_idx), valid_block; - struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; + ext4_fsblk_t block = idx_pblock(ext_idx); - valid_block = le32_to_cpu(es->s_first_data_block) + - EXT4_SB(inode->i_sb)->s_gdb_count; - if (unlikely(block <= valid_block || - (block >= ext4_blocks_count(es)))) - return 0; - else - return 1; + return ext4_data_block_valid(EXT4_SB(inode->i_sb), block, 1); } static int ext4_valid_extent_entries(struct inode *inode, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d7b7480..dadd3f9 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -372,20 +372,21 @@ static int ext4_block_to_path(struct inode *inode, } static int __ext4_check_blockref(const char *function, struct inode *inode, - __le32 *p, unsigned int max) { - - unsigned int maxblocks = ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es); + __le32 *p, unsigned int max) +{ __le32 *bref = p; + unsigned int blk; + while (bref < p+max) { - if (unlikely(le32_to_cpu(*bref) >= maxblocks)) { + blk = le32_to_cpu(*bref++); + if (blk && + unlikely(!ext4_data_block_valid(EXT4_SB(inode->i_sb), + blk, 1))) { ext4_error(inode->i_sb, function, - "block reference %u >= max (%u) " - "in inode #%lu, offset=%d", - le32_to_cpu(*bref), maxblocks, - inode->i_ino, (int)(bref-p)); + "invalid block reference %u " + "in inode #%lu", blk, inode->i_ino); return -EIO; } - bref++; } return 0; } @@ -1125,6 +1126,21 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) ext4_discard_preallocations(inode); } +static int check_block_validity(struct inode *inode, sector_t logical, + sector_t phys, int len) +{ + if (!ext4_data_block_valid(EXT4_SB(inode->i_sb), phys, len)) { + ext4_error(inode->i_sb, "check_block_validity", + "inode #%lu logical block %llu mapped to %llu " + "(size %d)", inode->i_ino, + (unsigned long long) logical, + (unsigned long long) phys, len); + WARN_ON(1); + return -EIO; + } + return 0; +} + /* * The ext4_get_blocks() function tries to look up the requested blocks, * and returns if the blocks are already mapped. @@ -1170,6 +1186,13 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block, } up_read((&EXT4_I(inode)->i_data_sem)); + if (retval > 0 && buffer_mapped(bh)) { + int ret = check_block_validity(inode, block, + bh->b_blocknr, retval); + if (ret != 0) + return ret; + } + /* If it is only a block(s) look up */ if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) return retval; @@ -1245,6 +1268,12 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block, ext4_da_update_reserve_space(inode, retval); up_write((&EXT4_I(inode)->i_data_sem)); + if (retval > 0 && buffer_mapped(bh)) { + int ret = check_block_validity(inode, block, + bh->b_blocknr, retval); + if (ret != 0) + return ret; + } return retval; } diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 541bd9a..ed8482e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2961,15 +2961,10 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac, + le32_to_cpu(es->s_first_data_block); len = ac->ac_b_ex.fe_len; - if (in_range(ext4_block_bitmap(sb, gdp), block, len) || - in_range(ext4_inode_bitmap(sb, gdp), block, len) || - in_range(block, ext4_inode_table(sb, gdp), - EXT4_SB(sb)->s_itb_per_group) || - in_range(block + len - 1, ext4_inode_table(sb, gdp), - EXT4_SB(sb)->s_itb_per_group)) { + if (!ext4_data_block_valid(sbi, block, len)) { ext4_error(sb, __func__, - "Allocating block %llu in system zone of %d group\n", - block, ac->ac_b_ex.fe_group); + "Allocating blocks %llu-%llu which overlap " + "fs metadata\n", block, block+len); /* File system mounted not to panic on error * Fix the bitmap and repeat the block allocation * We leak some of the blocks here. diff --git a/fs/ext4/super.c b/fs/ext4/super.c index dc34ed3..600b7ad 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -568,6 +568,7 @@ static void ext4_put_super(struct super_block *sb) struct ext4_super_block *es = sbi->s_es; int i, err; + ext4_release_system_zone(sb); ext4_mb_release(sb); ext4_ext_release(sb); ext4_xattr_put_super(sb); @@ -1055,6 +1056,7 @@ enum { Opt_ignore, Opt_barrier, Opt_nobarrier, Opt_err, Opt_resize, Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_stripe, Opt_delalloc, Opt_nodelalloc, + Opt_block_validity, Opt_noblock_validity, Opt_inode_readahead_blks, Opt_journal_ioprio }; @@ -1114,6 +1116,8 @@ static const match_table_t tokens = { {Opt_resize, "resize"}, {Opt_delalloc, "delalloc"}, {Opt_nodelalloc, "nodelalloc"}, + {Opt_block_validity, "block_validity"}, + {Opt_noblock_validity, "noblock_validity"}, {Opt_inode_readahead_blks, "inode_readahead_blks=%u"}, {Opt_journal_ioprio, "journal_ioprio=%u"}, {Opt_auto_da_alloc, "auto_da_alloc=%u"}, @@ -1508,6 +1512,12 @@ set_qf_format: case Opt_delalloc: set_opt(sbi->s_mount_opt, DELALLOC); break; + case Opt_block_validity: + set_opt(sbi->s_mount_opt, BLOCK_VALIDITY); + break; + case Opt_noblock_validity: + clear_opt(sbi->s_mount_opt, BLOCK_VALIDITY); + break; case Opt_inode_readahead_blks: if (match_int(&args[0], &option)) return 0; @@ -2826,6 +2836,13 @@ no_journal: } else if (test_opt(sb, DELALLOC)) printk(KERN_INFO "EXT4-fs: delayed allocation enabled\n"); + err = ext4_setup_system_zone(sb); + if (err) { + printk(KERN_ERR "EXT4-fs: failed to initialize system " + "zone (%d)\n", err); + goto failed_mount4; + } + ext4_ext_init(sb); err = ext4_mb_init(sb, needs_recovery); if (err) { @@ -2875,6 +2892,7 @@ cantfind_ext4: failed_mount4: printk(KERN_ERR "EXT4-fs (device %s): mount failed\n", sb->s_id); + ext4_release_system_zone(sb); if (sbi->s_journal) { jbd2_journal_destroy(sbi->s_journal); sbi->s_journal = NULL; @@ -3515,6 +3533,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) sb->s_flags &= ~MS_RDONLY; } } + ext4_setup_system_zone(sb); if (sbi->s_journal == NULL) ext4_commit_super(sb, 1); @@ -3927,13 +3946,16 @@ static int __init init_ext4_fs(void) { int err; + err = init_ext4_system_zone(); + if (err) + return err; ext4_kset = kset_create_and_add("ext4", NULL, fs_kobj); if (!ext4_kset) - return -ENOMEM; + goto out4; ext4_proc_root = proc_mkdir("fs/ext4", NULL); err = init_ext4_mballoc(); if (err) - return err; + goto out3; err = init_ext4_xattr(); if (err) @@ -3958,6 +3980,11 @@ out1: exit_ext4_xattr(); out2: exit_ext4_mballoc(); +out3: + remove_proc_entry("fs/ext4", NULL); + kset_unregister(ext4_kset); +out4: + exit_ext4_system_zone(); return err; } @@ -3972,6 +3999,7 @@ static void __exit exit_ext4_fs(void) exit_ext4_mballoc(); remove_proc_entry("fs/ext4", NULL); kset_unregister(ext4_kset); + exit_ext4_system_zone(); } MODULE_AUTHOR("Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others"); ^ permalink raw reply related [flat|nested] 35+ messages in thread
end of thread, other threads:[~2009-05-19 15:18 UTC | newest] Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-05-13 6:26 More ext4 acl/xattr corruption - 4th occurence now Kevin Shanahan 2009-05-13 23:56 ` Kevin Shanahan 2009-05-14 4:40 ` Theodore Tso 2009-05-14 11:07 ` Kevin Shanahan 2009-05-14 11:17 ` Manish Katiyar 2009-05-14 12:30 ` Theodore Tso 2009-05-14 13:25 ` Kevin Shanahan 2009-05-14 14:07 ` Theodore Tso 2009-05-14 14:30 ` Kevin Shanahan 2009-05-14 15:44 ` Eric Sandeen 2009-05-14 21:07 ` Kevin Shanahan 2009-05-14 21:08 ` Eric Sandeen 2009-05-14 16:12 ` Theodore Tso 2009-05-14 21:02 ` Kevin Shanahan 2009-05-14 21:23 ` Theodore Tso 2009-05-14 21:33 ` Kevin Shanahan 2009-05-15 23:18 ` Kevin Shanahan 2009-05-15 1:21 ` Eric Sandeen 2009-05-15 12:50 ` Theodore Tso 2009-05-15 12:58 ` Eric Sandeen 2009-05-15 15:24 ` Eric Sandeen 2009-05-15 16:27 ` Eric Sandeen 2009-05-15 4:55 ` Aneesh Kumar K.V 2009-05-15 10:11 ` Theodore Tso 2009-05-15 13:07 ` Theodore Tso 2009-05-19 10:00 ` Thierry Vignaud 2009-05-19 11:36 ` Theodore Tso 2009-05-19 12:01 ` Alex Tomas 2009-05-19 15:04 ` Theodore Tso 2009-05-19 15:16 ` Alex Tomas 2009-05-19 15:18 ` Thierry Vignaud 2009-05-15 3:57 ` Alex Tomas 2009-05-15 4:58 ` Aneesh Kumar K.V 2009-05-15 10:27 ` Theodore Tso 2009-05-18 2:14 ` [PATCH] ext4: Add a comprehensive block validity check to ext4_get_blocks() (Was: More ext4 acl/xattr corruption - 4th occurence now) Theodore Tso
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.