* [ea-in-inode 0/5] Further fixes @ 2005-01-20 2:01 Andreas Gruenbacher 2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher ` (5 more replies) 0 siblings, 6 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel Hello, here is a set of fixes for ext3 in-inode attributes: patches/ea-xattr-nolock.diff No lock needed when freeing inode The effect of the additional lock taking is very minor, but it's still unnecessary. patches/ea-xattr-update-sb.diff Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs The EXT_ATTR filesystem feature was set for xattr blocks, but not for in-inode attributes. patches/ea-xattr-doc.diff Documentation fix patches/ea-xattr-no-extra_isize.diff Fix i_extra_isize check Filesystem corruption fix. patches/ea-xattr-reserved-inodes.diff Disallow in-inode attributes for reserved inodes Filesystem corruption fix. Regards, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 1/5] No lock needed when freeing inode 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher @ 2005-01-20 2:01 ` Andreas Gruenbacher 2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher ` (4 subsequent siblings) 5 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel [-- Attachment #1: ea-xattr-nolock.diff --] [-- Type: text/plain, Size: 867 bytes --] ext3_xattr_delete_inode is called from ext3_free_inode which always has exclusive access to the inode, so there is no need to take the xattr semaphore. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Index: linux-2.6.11-latest/fs/ext3/xattr.c =================================================================== --- linux-2.6.11-latest.orig/fs/ext3/xattr.c +++ linux-2.6.11-latest/fs/ext3/xattr.c @@ -1066,7 +1066,6 @@ ext3_xattr_delete_inode(handle_t *handle { struct buffer_head *bh = NULL; - down_write(&EXT3_I(inode)->xattr_sem); if (!EXT3_I(inode)->i_file_acl) goto cleanup; bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl); @@ -1088,7 +1087,6 @@ ext3_xattr_delete_inode(handle_t *handle cleanup: brelse(bh); - up_write(&EXT3_I(inode)->xattr_sem); } /* -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 3/5] Documentation fix 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher 2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher @ 2005-01-20 2:01 ` Andreas Gruenbacher 2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher ` (3 subsequent siblings) 5 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel [-- Attachment #1: ea-xattr-doc.diff --] [-- Type: text/plain, Size: 903 bytes --] In-inode xattr entry descriptors are unsorted. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Index: linux-2.6.11-latest/fs/ext3/xattr.c =================================================================== --- linux-2.6.11-latest.orig/fs/ext3/xattr.c +++ linux-2.6.11-latest/fs/ext3/xattr.c @@ -37,9 +37,9 @@ * | value 2 | | * +------------------+ * - * The header is followed by multiple entry descriptors. Descriptors are - * kept sorted. The attribute values are aligned to the end of the block - * in no specific order. + * The header is followed by multiple entry descriptors. In disk blocks, the + * entry descriptors are kept sorted. In inodes, they are unsorted. The + * attribute values are aligned to the end of the block in no specific order. * * Locking strategy * ---------------- -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 4/5] Fix i_extra_isize check 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher 2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher 2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher @ 2005-01-20 2:01 ` Andreas Gruenbacher 2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher ` (2 subsequent siblings) 5 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel [-- Attachment #1: ea-xattr-no-extra_isize.diff --] [-- Type: text/plain, Size: 2134 bytes --] We are checking for (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) to find out if we can set in-inode attributes; the test fails for inodes that have been created before the ea-in-inode patch. Those inodes have (i_extra_isize == 0), so we end up with the attributes overlapping the i_extra_isize field. Checking for (i_extra_isize == 0) instead fixes this case. The EXT3_STATE_XATTR flag is only set if (i_extra_isize > 0) and the inodes has in-inode attributes, so that is enough in the first two tests. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Index: linux-2.6.11-latest/fs/ext3/xattr.c =================================================================== --- linux-2.6.11-latest.orig/fs/ext3/xattr.c +++ linux-2.6.11-latest/fs/ext3/xattr.c @@ -272,8 +272,7 @@ ext3_xattr_ibody_get(struct inode *inode void *end; int error; - if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE || - !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR)) + if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR)) return -ENODATA; error = ext3_get_inode_loc(inode, &iloc); if (error) @@ -399,8 +398,7 @@ ext3_xattr_ibody_list(struct inode *inod void *end; int error; - if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE || - !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR)) + if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR)) return 0; error = ext3_get_inode_loc(inode, &iloc); if (error) @@ -865,7 +863,7 @@ ext3_xattr_ibody_find(struct inode *inod struct ext3_inode *raw_inode; int error; - if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + if (EXT3_I(inode)->i_extra_isize == 0) return 0; raw_inode = ext3_raw_inode(&is->iloc); header = IHDR(inode, raw_inode); @@ -896,7 +894,7 @@ ext3_xattr_ibody_set(handle_t *handle, s struct ext3_xattr_search *s = &is->s; int error; - if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + if (EXT3_I(inode)->i_extra_isize == 0) return -ENOSPC; error = ext3_xattr_set_entry(i, s); if (error) -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher ` (2 preceding siblings ...) 2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher @ 2005-01-20 2:01 ` Andreas Gruenbacher 2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher 2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie 5 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel [-- Attachment #1: ea-xattr-update-sb.diff --] [-- Type: text/plain, Size: 861 bytes --] The xattr feature was only set when creating an xattr block. Also set it when creating in-inode xattrs. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Index: linux-2.6.11-latest/fs/ext3/xattr.c =================================================================== --- linux-2.6.11-latest.orig/fs/ext3/xattr.c +++ linux-2.6.11-latest/fs/ext3/xattr.c @@ -823,7 +823,6 @@ getblk_failed: error = ext3_journal_dirty_metadata(handle, new_bh); if (error) goto cleanup; - ext3_xattr_update_super_block(handle, sb); } } @@ -1001,6 +1000,7 @@ ext3_xattr_set_handle(handle_t *handle, } } if (!error) { + ext3_xattr_update_super_block(handle, inode->i_sb); inode->i_ctime = CURRENT_TIME_SEC; error = ext3_mark_iloc_dirty(handle, inode, &is.iloc); /* -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 5/5] Disallow in-inode attributes for reserved inodes 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher ` (3 preceding siblings ...) 2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher @ 2005-01-20 2:01 ` Andreas Gruenbacher 2005-01-20 12:16 ` Andreas Dilger 2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie 5 siblings, 1 reply; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Theodore Ts'o Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas, linux-kernel [-- Attachment #1: ea-xattr-reserved-inodes.diff --] [-- Type: text/plain, Size: 2111 bytes --] When creating a filesystem with inodes bigger than 128 bytes, mke2fs fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes it creates (the journal, the filesystem root, and lost+found). We would require a zeroed-out i_extra_isize field but we don't get it, so disallow in-inode attributes for those inodes. Add an i_extra_isize sanity check. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Index: linux-2.6.11-latest/fs/ext3/inode.c =================================================================== --- linux-2.6.11-latest.orig/fs/ext3/inode.c +++ linux-2.6.11-latest/fs/ext3/inode.c @@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod ei->i_data[block] = raw_inode->i_block[block]; INIT_LIST_HEAD(&ei->i_orphan); - ei->i_extra_isize = - (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ? - le16_to_cpu(raw_inode->i_extra_isize) : 0; - if (ei->i_extra_isize) { - __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE + - ei->i_extra_isize; - if (le32_to_cpu(*magic)) - ei->i_state |= EXT3_STATE_XATTR; - } + if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 && + EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) { + /* + * When mke2fs creates big inodes it does not zero out + * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE, + * so ignore those first few inodes. + */ + ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); + if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize > + EXT3_INODE_SIZE(inode->i_sb)) + goto bad_inode; + if (ei->i_extra_isize == 0) { + /* The extra space is currently unused. Use it. */ + ei->i_extra_isize = sizeof(struct ext3_inode) - + EXT3_GOOD_OLD_INODE_SIZE; + } else { + __le32 *magic = (void *)raw_inode + + EXT3_GOOD_OLD_INODE_SIZE + + ei->i_extra_isize; + if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC)) + ei->i_state |= EXT3_STATE_XATTR; + } + } else + ei->i_extra_isize = 0; if (S_ISREG(inode->i_mode)) { inode->i_op = &ext3_file_inode_operations; -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes 2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher @ 2005-01-20 12:16 ` Andreas Dilger 2005-01-20 13:29 ` Andreas Gruenbacher 0 siblings, 1 reply; 30+ messages in thread From: Andreas Dilger @ 2005-01-20 12:16 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3922 bytes --] On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote: > When creating a filesystem with inodes bigger than 128 bytes, mke2fs > fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes > it creates (the journal, the filesystem root, and lost+found). We would > require a zeroed-out i_extra_isize field but we don't get it, so > disallow in-inode attributes for those inodes. > > Add an i_extra_isize sanity check. I'm not sure I agree with this patch. It definitely resolves a problem I had with the previous patch (4/5), namely that for inodes that were created in a large-inode filesystem using a kernel w/o the large-inode patch we didn't save EAs into the large-inode space at all. With this patch we will start using that space. It is debatable whether we should mark inodes bad if the i_extra_isize field is bad, or if we should just initialize i_extra_isize in that case. On one hand there is very little else we can use to validate the on-disk inodes, but on the other hand the EA data isn't critical to reading the inode so this would seem to introduce additional failure cases. I'm not sure whether an old e2fsck will also clear out the space in a large inode that it writes or not (I know that the patched one we use validates i_extra_isize). If it doesn't it may be that there would be a largish number of inodes that are marked bad because of this, so having the kernel fix this would be good. If the old e2fsck zeroes the large inodes properly it would be prudent to have an ext3_error() here so that the disk corruption triggers an e2fsck the next time the system starts. For the root and lost+found inodes it looks like we can never store an EA in the extra part of the inode regardless of whether i_extra_isize is good or not. If a bad value is found we could just initialize it and start using that space (though not print an ext3_error() in that case, an ext3_warning() if anything since this is probably the fault of mke2fs). > Index: linux-2.6.11-latest/fs/ext3/inode.c > =================================================================== > --- linux-2.6.11-latest.orig/fs/ext3/inode.c > +++ linux-2.6.11-latest/fs/ext3/inode.c > @@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod > ei->i_data[block] = raw_inode->i_block[block]; > INIT_LIST_HEAD(&ei->i_orphan); > > - ei->i_extra_isize = > - (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ? > - le16_to_cpu(raw_inode->i_extra_isize) : 0; > - if (ei->i_extra_isize) { > - __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE + > - ei->i_extra_isize; > - if (le32_to_cpu(*magic)) > - ei->i_state |= EXT3_STATE_XATTR; > - } > + if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 && > + EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) { > + /* > + * When mke2fs creates big inodes it does not zero out > + * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE, > + * so ignore those first few inodes. > + */ > + ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); > + if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize > > + EXT3_INODE_SIZE(inode->i_sb)) > + goto bad_inode; > + if (ei->i_extra_isize == 0) { > + /* The extra space is currently unused. Use it. */ > + ei->i_extra_isize = sizeof(struct ext3_inode) - > + EXT3_GOOD_OLD_INODE_SIZE; > + } else { > + __le32 *magic = (void *)raw_inode + > + EXT3_GOOD_OLD_INODE_SIZE + > + ei->i_extra_isize; > + if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC)) > + ei->i_state |= EXT3_STATE_XATTR; > + } > + } else > + ei->i_extra_isize = 0; > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext3_file_inode_operations; Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes 2005-01-20 12:16 ` Andreas Dilger @ 2005-01-20 13:29 ` Andreas Gruenbacher 2005-01-20 23:05 ` Andreas Dilger 0 siblings, 1 reply; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-20 13:29 UTC (permalink / raw) To: Andreas Dilger Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel On Thursday 20 January 2005 13:16, Andreas Dilger wrote: > On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote: > > When creating a filesystem with inodes bigger than 128 bytes, mke2fs > > fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes > > it creates (the journal, the filesystem root, and lost+found). We would > > require a zeroed-out i_extra_isize field but we don't get it, so > > disallow in-inode attributes for those inodes. > > > > Add an i_extra_isize sanity check. > > I'm not sure I agree with this patch. It definitely resolves a problem > I had with the previous patch (4/5), namely that for inodes that were > created in a large-inode filesystem using a kernel w/o the large-inode > patch we didn't save EAs into the large-inode space at all. With this > patch we will start using that space. The ea-in-inode patch totally relies on getting all the available inode space cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for any inode we find, then i_extra_isize may contain a random number, and we've just lost, period: There is no way of sanitizing a random i_extra_isize; we cannot know what the right number would be. We have two options for going forward from here: (a) depend on the fact that all inodes have been cleared out for us, or (b) don't rely on that, fix e2fsck before going any further, and have the fixed mke2fs and e2fsck set a filesystem feature flag that tells us that all the extra space in all inodes is "sane". For (a) we can make an (ugly) exception because we know which inodes mke2fs created. As far as I can tell we can also fully rely on the kernel having cleared out the extra space for us. e2fsck may currently fail to move the extra space around when moving inodes, but that did not matter without ea-in-inode, and if we fix it now, that's probably good enough. We don't really care which inode's zeroed extra space we look at without ea-in-inode. > It is debatable whether we should mark inodes bad if the i_extra_isize > field is bad, or if we should just initialize i_extra_isize in that case. IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply changing it to something we think is better is a really bad idea. We might as well find an unusual but still valid i_extra_isize and use it, then a future patch allocates a few more fields in the inode, and suddenly we start to see garbage. > On one hand there is very little else we can use to validate the on-disk > inodes, but on the other hand the EA data isn't critical to reading the > inode so this would seem to introduce additional failure cases. You may have an access acl on the inode. Not being able to read an access acl is a clear sign of trouble. The same applies for everything else in the system.* and security.* namespaces, at least. > I'm not sure whether an old e2fsck will also clear out the space in a > large inode that it writes or not (I know that the patched one we use > validates i_extra_isize). If it doesn't it may be that there would be a > largish number of inodes that are marked bad because of this, so having > the kernel fix this would be good. Repairing the filesystem has no place in the kernel code. If what you assume is the case (and I sure would have assumed you to check before proposing the kernel patch for inclusion), we have to go for the filesystem feature flag approach. > If the old e2fsck zeroes the large > inodes properly it would be prudent to have an ext3_error() here so that > the disk corruption triggers an e2fsck the next time the system starts. That's even better, yes. > For the root and lost+found inodes it looks like we can never store an > EA in the extra part of the inode regardless of whether i_extra_isize is > good or not. If a bad value is found we could just initialize it and > start using that space (though not print an ext3_error() in that case, > an ext3_warning() if anything since this is probably the fault of mke2fs). I disagree. We cannot just use the space when we think the inode is corrupted. Cheers, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes 2005-01-20 13:29 ` Andreas Gruenbacher @ 2005-01-20 23:05 ` Andreas Dilger 2005-01-21 0:36 ` Andreas Gruenbacher 0 siblings, 1 reply; 30+ messages in thread From: Andreas Dilger @ 2005-01-20 23:05 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel [-- Attachment #1: Type: text/plain, Size: 4260 bytes --] On Jan 20, 2005 14:29 +0100, Andreas Gruenbacher wrote: > The ea-in-inode patch totally relies on getting all the available inode space > cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for > any inode we find, then i_extra_isize may contain a random number, and we've > just lost, period: There is no way of sanitizing a random i_extra_isize; we > cannot know what the right number would be. The large-inode support is designed to allow different amounts of "fixed" optional data (i.e. what is stored inside i_extra_isize), so it is valid to set this to 4 (i.e. just enough to hold i_extra_isize itself) and store the EA data after that. Any code which reads "fixed" fields from a large inode (e.g. i_mtime_nsec) needs to validate that i_extra_isize on that inode is large enough for that data to actually be in the fixed area in the large inode. If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize those values since it should know what they are or they shouldn't be in struct ext3_inode. The whole point of i_extra_isize is that it is possible for inodes to have different amounts of the optional fixed fields in each large inode, depending on what the kernel that wrote the inode knew about. So any value for i_extra_isize is valid as long as those fields are initialized. If we arbitrarily set i_extra_isize = 4 instead of leaving the bad value this is no different than waiting for e2fsck to do the same. > > It is debatable whether we should mark inodes bad if the i_extra_isize > > field is bad, or if we should just initialize i_extra_isize in that case. > > IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply > changing it to something we think is better is a really bad idea. > > You may have an access acl on the inode. Not being able to read an access acl > is a clear sign of trouble. The same applies for everything else in the > system.* and security.* namespaces, at least. Well, I said it was debatable and we're having a debate ;-). I don't have a strong opinion either way. If we ext3_error() in this case at least we will check the fs on the next boot (which will just zero i_extra_isize) instead of never doing anything to resolve the situation. > > For the root and lost+found inodes it looks like we can never store an > > EA in the extra part of the inode regardless of whether i_extra_isize is > > good or not. If a bad value is found we could just initialize it and > > start using that space (though not print an ext3_error() in that case, > > an ext3_warning() if anything since this is probably the fault of mke2fs). > > I disagree. We cannot just use the space when we think the inode is corrupted. But as your patch stands it doesn't ever check if i_extra_isize is valid for the root or lost+found inode. It just always sets i_extra_isize = 0 and never uses it. Given that the root inode is fairly high-traffic it makes sense to use the faster EA space if it is available. If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm not so keen to have an ext3_error() there. If the user doesn't have an e2fsck with ea-in-inode support there isn't anything they can do to fix it and they will get a full e2fsck on each boot. Even so, for the effort of setting i_extra_isize = 4 (or larger if we initialize the fixed fields) we can do the equivalent of what e2fsck will do when it finds a bogus value. The good news is that we can still apply your patch as-is and address my concerns later since this is a transient issue. Also, given that there are probably only a handful of filesystems in the world using large inodes (excluding Lustre filesystems which aren't affected by this) I don't think it is a pressing issue yet. I'm going to be away for 2 weeks, so I'll say accept this patch as is and we can look at it again when I get back, and maybe Ted and Stephen will have weighed in on this issue also. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes 2005-01-20 23:05 ` Andreas Dilger @ 2005-01-21 0:36 ` Andreas Gruenbacher 0 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-21 0:36 UTC (permalink / raw) To: Andreas Dilger Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel On Friday 21 January 2005 00:05, Andreas Dilger wrote: > [...] > But as your patch stands it doesn't ever check if i_extra_isize is valid > for the root or lost+found inode. It just always sets i_extra_isize = 0 (that's the in-memory i_extra_isize) > and never uses it. Given that the root inode is fairly high-traffic it > makes sense to use the faster EA space if it is available. It's only a single block we're talking about, not all the overhead you run into with huge amounts of attributes in many xattr disk blocks. It sure would be much cleaner to use the root inode's in-inode space like with all other inodes, but performance wise I don't think it matters. > If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm > not so keen to have an ext3_error() there. If the user doesn't have an > e2fsck with ea-in-inode support there isn't anything they can do to fix > it and they will get a full e2fsck on each boot. Agreed, that would be really bad. We should get e2fsck fixed ASAP. > Even so, for the effort of setting i_extra_isize = 4 (or larger if we > initialize the fixed fields) we can do the equivalent of what e2fsck will > do when it finds a bogus value. We cannot ask the user, and we don't have the kind of global view that e2fsck has. Something different may be messed up, and may have lead to the corruption. It's unlikely, but not impossible. Cheers, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher ` (4 preceding siblings ...) 2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher @ 2005-01-21 22:58 ` Stephen C. Tweedie 2005-01-21 23:46 ` Andreas Gruenbacher 5 siblings, 1 reply; 30+ messages in thread From: Stephen C. Tweedie @ 2005-01-21 22:58 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel, Stephen Tweedie Hi Andreas, On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote: > here is a set of fixes for ext3 in-inode attributes: Obvious first question --- have these diffs survived the same torture-by-tridgell that the previous batch suffered? Cheers, Stephen ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie @ 2005-01-21 23:46 ` Andreas Gruenbacher 2005-01-23 13:22 ` Andrew Tridgell 2005-01-23 22:09 ` Andrew Tridgell 0 siblings, 2 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-21 23:46 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o, Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel On Fri, 2005-01-21 at 23:58, Stephen C. Tweedie wrote: > Hi Andreas, > > On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote: > > > here is a set of fixes for ext3 in-inode attributes: > > Obvious first question --- have these diffs survived the same > torture-by-tridgell that the previous batch suffered? No. The fixes are a lot less intrusive than the full xattr rework though. I obviously ran tests; this included dbench. Tridge, can you beat the code some more? Andrew has the five fixes in 2.6.11-rc1-mm2. -- Andreas. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-21 23:46 ` Andreas Gruenbacher @ 2005-01-23 13:22 ` Andrew Tridgell 2005-01-23 22:09 ` Andrew Tridgell 1 sibling, 0 replies; 30+ messages in thread From: Andrew Tridgell @ 2005-01-23 13:22 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel Andreas, > Tridge, can you beat the code some more? > Andrew has the five fixes in 2.6.11-rc1-mm2. sorry for the delay. I've started to test 2.6.11-rc1-mm2 tonight. No problems so far. Cheers, Tridge ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-21 23:46 ` Andreas Gruenbacher 2005-01-23 13:22 ` Andrew Tridgell @ 2005-01-23 22:09 ` Andrew Tridgell 2005-01-23 22:58 ` Andreas Gruenbacher 1 sibling, 1 reply; 30+ messages in thread From: Andrew Tridgell @ 2005-01-23 22:09 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel Andreas, > Tridge, can you beat the code some more? > > Andrew has the five fixes in 2.6.11-rc1-mm2. It seemed to pass dbench runs OK, but then I started simultaneously running dbench and nbench on two different disks (I have a new test machine with more disks available). I am getting failures like this: Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1. Jan 23 13:19:43 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdd1 Jan 23 13:19:43 dev4-003 kernel: Aborting journal on device sdd1. Jan 23 13:19:43 dev4-003 kernel: EXT3-fs error (device sdd1) in start_transaction: Readonly filesystem The first failure was on the disk I am using for nbench (sdc). The second is during a later run on the disk I am using the dbench (sdd). I rebooted between the runs. It's interesting that its failing at exactly the same offset both times. Is there anything magic about offset 1036? The new test machine is a 4 way PIII, with 4G ram, and 4 36G SCSI disks. The test machine I have been using previously was a 2 way (+hyperthreaded) Xeon. I'll see if I can reproduce the problem with just dbench or just nbench. Cheers, Tridge ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-23 22:09 ` Andrew Tridgell @ 2005-01-23 22:58 ` Andreas Gruenbacher 2005-01-23 23:32 ` Andreas Gruenbacher 0 siblings, 1 reply; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-23 22:58 UTC (permalink / raw) To: tridge Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel Hello, On Sunday 23 January 2005 23:09, Andrew Tridgell wrote: > Andreas, > > > Tridge, can you beat the code some more? > > > > Andrew has the five fixes in 2.6.11-rc1-mm2. > > It seemed to pass dbench runs OK, but then I started simultaneously > running dbench and nbench on two different disks (I have a new test > machine with more disks available). I am getting failures like this: > > Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at > offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on > device sdc1. Are you using data journaling on that filesystem? Does this test pass with the patches backed out? With an external journal? Thanks, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-23 22:58 ` Andreas Gruenbacher @ 2005-01-23 23:32 ` Andreas Gruenbacher 2005-01-24 11:24 ` Andrew Tridgell 0 siblings, 1 reply; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-23 23:32 UTC (permalink / raw) To: tridge Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel On Sunday 23 January 2005 23:58, Andreas Gruenbacher wrote: > > Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at > > offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on > > device sdc1. > > Are you using data journaling on that filesystem? Does this test pass with > the patches backed out? With an external journal? There are 12 direct and 1024 indirect blocks on a filesystem with 4k blocksize, so block 1036 should be the first double-indirect block. It may be that something is messing up the double-indirect link or one of its fields. Interesting. Could you maybe try this patch as well? Index: linux-2.6.11-rc1-mm2/fs/ext3/inode.c =================================================================== --- linux-2.6.11-rc1-mm2.orig/fs/ext3/inode.c +++ linux-2.6.11-rc1-mm2/fs/ext3/inode.c @@ -2653,7 +2653,7 @@ static int ext3_do_update_inode(handle_t } else for (block = 0; block < EXT3_N_BLOCKS; block++) raw_inode->i_block[block] = ei->i_data[block]; - if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) + if (ei->i_extra_isize) raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize); BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata"); Thanks, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX PRODUCTS GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-23 23:32 ` Andreas Gruenbacher @ 2005-01-24 11:24 ` Andrew Tridgell 2005-01-24 11:42 ` Christoph Hellwig ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Andrew Tridgell @ 2005-01-24 11:24 UTC (permalink / raw) To: Andreas Gruenbacher Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel Andreas, I'm starting to think the bug I saw is hardware error. I just got this while trying to reproduce it tonight: Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4 Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 1 lun 0 I'll see if I can get this confirmed tomorrow. Cheers, Tridge ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-24 11:24 ` Andrew Tridgell @ 2005-01-24 11:42 ` Christoph Hellwig 2005-01-24 14:11 ` Andreas Gruenbacher 2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell 2 siblings, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2005-01-24 11:42 UTC (permalink / raw) To: Andrew Tridgell Cc: Andreas Gruenbacher, Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel On Mon, Jan 24, 2005 at 10:24:55PM +1100, Andrew Tridgell wrote: > Andreas, > > I'm starting to think the bug I saw is hardware error. I just got this > while trying to reproduce it tonight: > > Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed > Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4 > Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3 > channel 0 id 1 lun 0 > > I'll see if I can get this confirmed tomorrow. Don't use the qlogicfc driver ever but qla2xxx instead. It's only still in the kernel tree because davem's a lazy bastard :) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes 2005-01-24 11:24 ` Andrew Tridgell 2005-01-24 11:42 ` Christoph Hellwig @ 2005-01-24 14:11 ` Andreas Gruenbacher 2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell 2 siblings, 0 replies; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-24 14:11 UTC (permalink / raw) To: tridge Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel On Mon, 2005-01-24 at 12:24, Andrew Tridgell wrote: > Andreas, > > I'm starting to think the bug I saw is hardware error. [...] The patch in my prevous message (Mon, 24 Jan 2005 00:32:16 +0100) still makes sense, if only for cleanliness: Without it, the i_extra_isize field of reserved inodes is zeroed even though we're not allowing in-inode attributes there. It's probably better to leave the garbage in these inodes untouched; we could in theory disable in-inode attributes for arbitrary inodes with this change. Cheers, -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* memory leak in 2.6.11-rc2 2005-01-24 11:24 ` Andrew Tridgell 2005-01-24 11:42 ` Christoph Hellwig 2005-01-24 14:11 ` Andreas Gruenbacher @ 2005-01-25 3:19 ` Andrew Tridgell 2005-01-25 3:20 ` Randy.Dunlap 2005-01-25 3:45 ` Dave Jones 2 siblings, 2 replies; 30+ messages in thread From: Andrew Tridgell @ 2005-01-25 3:19 UTC (permalink / raw) To: linux-kernel, Andreas Gruenbacher, Andrew Morton I've fixed up the problems I had with raid, and am now testing the recent xattr changes with dbench and nbench. The problem I've hit now is a severe memory leak. I have applied the patch from Linus for the leak in free_pipe_info(), and still I'm leaking memory at the rate of about 100Mbyte/minute. I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with the pipe leak fix. The setup is: - 4 way PIII with 4G ram - qla2200 adapter with ibm fastt200 disk array - running dbench -x and nbench on separate disks, in a loop The oom killer kicks in after about 30 minutes. Naturally the oom killer decided to kill my sshd, which was running vmstat :-) Killing the dbench and nbench processes does not recovery the memory. Here is what it looks like after I kill all processes except a sshd after about 15 minutes of running: total used free shared buffers cached Mem: 3734304 2471608 1262696 0 26820 921440 -/+ buffers/cache: 1523348 2210956 Swap: 4194272 0 4194272 so we've lost nearly 1.5G in that time. here is /proc/meminfo: dev4-003:~# cat /proc/meminfo MemTotal: 3734304 kB MemFree: 1266944 kB Buffers: 27000 kB Cached: 921464 kB SwapCached: 0 kB Active: 292524 kB Inactive: 2108928 kB HighTotal: 2850752 kB HighFree: 472768 kB LowTotal: 883552 kB LowFree: 794176 kB SwapTotal: 4194272 kB SwapFree: 4194272 kB Dirty: 4 kB Writeback: 0 kB Mapped: 6872 kB Slab: 54236 kB CommitLimit: 6061424 kB Committed_AS: 41660 kB PageTables: 344 kB VmallocTotal: 114680 kB VmallocUsed: 2204 kB VmallocChunk: 112472 kB and here is /proc/slabinfo slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> ip_fib_alias 9 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0 ip_fib_hash 9 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0 fib6_nodes 7 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0 ip6_dst_cache 7 36 224 18 1 : tunables 120 60 8 : slabdata 2 2 0 ndisc_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0 rawv6_sock 6 6 672 6 1 : tunables 54 27 8 : slabdata 1 1 0 udpv6_sock 0 0 640 6 1 : tunables 54 27 8 : slabdata 0 0 0 tcpv6_sock 4 6 1184 6 2 : tunables 24 12 8 : slabdata 1 1 0 unix_sock 2 9 416 9 1 : tunables 54 27 8 : slabdata 1 1 0 ip_mrt_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0 tcp_tw_bucket 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 tcp_bind_bucket 2 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0 tcp_open_request 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0 inet_peer_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0 secpath_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 xfrm_dst_cache 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 ip_dst_cache 6 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0 arp_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0 raw_sock 5 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0 udp_sock 3 8 512 8 1 : tunables 54 27 8 : slabdata 1 1 0 tcp_sock 1 7 1056 7 2 : tunables 24 12 8 : slabdata 1 1 0 flow_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0 qla2xxx_srbs 256 310 128 31 1 : tunables 120 60 8 : slabdata 10 10 0 scsi_cmd_cache 22 22 352 11 1 : tunables 54 27 8 : slabdata 2 2 0 cfq_ioc_pool 0 0 24 156 1 : tunables 120 60 8 : slabdata 0 0 0 cfq_pool 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0 crq_pool 0 0 56 70 1 : tunables 120 60 8 : slabdata 0 0 0 deadline_drq 0 0 52 75 1 : tunables 120 60 8 : slabdata 0 0 0 as_arq 61 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0 mqueue_inode_cache 1 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0 devfsd_event 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0 ext2_inode_cache 0 0 456 8 1 : tunables 54 27 8 : slabdata 0 0 0 ext2_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0 journal_handle 2 185 20 185 1 : tunables 120 60 8 : slabdata 1 1 0 journal_head 3316 155844 48 81 1 : tunables 120 60 8 : slabdata 1924 1924 0 revoke_table 14 290 12 290 1 : tunables 120 60 8 : slabdata 1 1 0 revoke_record 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0 ext3_inode_cache 3626 3745 520 7 1 : tunables 54 27 8 : slabdata 535 535 0 ext3_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0 dnotify_cache 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0 eventpoll_pwq 0 0 36 107 1 : tunables 120 60 8 : slabdata 0 0 0 eventpoll_epi 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0 kioctx 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0 kiocb 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 fasync_cache 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0 shmem_inode_cache 8 18 416 9 1 : tunables 54 27 8 : slabdata 2 2 0 posix_timers_cache 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0 uid_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0 sgpool-128 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0 sgpool-64 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0 sgpool-32 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0 sgpool-16 32 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0 sgpool-8 50 62 128 31 1 : tunables 120 60 8 : slabdata 2 2 0 blkdev_ioc 40 270 28 135 1 : tunables 120 60 8 : slabdata 2 2 0 blkdev_queue 37 50 380 10 1 : tunables 54 27 8 : slabdata 5 5 0 blkdev_requests 70 81 148 27 1 : tunables 120 60 8 : slabdata 3 3 0 biovec-(256) 256 256 3072 2 2 : tunables 24 12 8 : slabdata 128 128 0 biovec-128 256 260 1536 5 2 : tunables 24 12 8 : slabdata 52 52 0 biovec-64 256 260 768 5 1 : tunables 54 27 8 : slabdata 52 52 0 biovec-16 260 260 192 20 1 : tunables 120 60 8 : slabdata 13 13 0 biovec-4 256 305 64 61 1 : tunables 120 60 8 : slabdata 5 5 0 biovec-1 322 452 16 226 1 : tunables 120 60 8 : slabdata 2 2 0 bio 303 328 96 41 1 : tunables 120 60 8 : slabdata 8 8 0 file_lock_cache 0 0 92 43 1 : tunables 120 60 8 : slabdata 0 0 0 sock_inode_cache 26 60 384 10 1 : tunables 54 27 8 : slabdata 6 6 0 skbuff_head_cache 137 475 160 25 1 : tunables 120 60 8 : slabdata 19 19 0 sock 4 11 352 11 1 : tunables 54 27 8 : slabdata 1 1 0 key_jar 90 123 96 41 1 : tunables 120 60 8 : slabdata 3 3 0 proc_inode_cache 265 336 336 12 1 : tunables 54 27 8 : slabdata 28 28 0 sigqueue 1 27 148 27 1 : tunables 120 60 8 : slabdata 1 1 0 radix_tree_node 8235 8750 276 14 1 : tunables 54 27 8 : slabdata 625 625 0 bdev_cache 14 18 448 9 1 : tunables 54 27 8 : slabdata 2 2 0 sysfs_dir_cache 2837 2996 36 107 1 : tunables 120 60 8 : slabdata 28 28 0 mnt_cache 21 41 96 41 1 : tunables 120 60 8 : slabdata 1 1 0 inode_cache 946 948 320 12 1 : tunables 54 27 8 : slabdata 79 79 0 dentry_cache 10903 11032 140 28 1 : tunables 120 60 8 : slabdata 394 394 0 filp 300 750 160 25 1 : tunables 120 60 8 : slabdata 30 30 0 names_cache 3 3 4096 1 1 : tunables 24 12 8 : slabdata 3 3 0 idr_layer_cache 80 87 136 29 1 : tunables 120 60 8 : slabdata 3 3 0 buffer_head 593099 649500 52 75 1 : tunables 120 60 8 : slabdata 8660 8660 0 mm_struct 54 54 608 6 1 : tunables 54 27 8 : slabdata 9 9 0 vm_area_struct 543 1260 88 45 1 : tunables 120 60 8 : slabdata 28 28 0 fs_cache 82 122 64 61 1 : tunables 120 60 8 : slabdata 2 2 0 files_cache 50 63 416 9 1 : tunables 54 27 8 : slabdata 7 7 0 signal_cache 68 150 256 15 1 : tunables 120 60 8 : slabdata 10 10 0 sighand_cache 76 78 1312 3 1 : tunables 24 12 8 : slabdata 26 26 0 task_struct 78 81 1312 3 1 : tunables 24 12 8 : slabdata 27 27 0 anon_vma 273 870 12 290 1 : tunables 120 60 8 : slabdata 3 3 0 pgd 24 24 4096 1 1 : tunables 24 12 8 : slabdata 24 24 0 size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0 size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0 size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0 size-65536 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0 size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0 size-32768 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0 size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0 size-16384 5 5 16384 1 4 : tunables 8 4 0 : slabdata 5 5 0 size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0 size-8192 75 75 8192 1 2 : tunables 8 4 0 : slabdata 75 75 0 size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0 size-4096 29 29 4096 1 1 : tunables 24 12 8 : slabdata 29 29 0 size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0 size-2048 248 258 2048 2 1 : tunables 24 12 8 : slabdata 129 129 0 size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0 size-1024 132 132 1024 4 1 : tunables 54 27 8 : slabdata 33 33 0 size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0 size-512 403 432 512 8 1 : tunables 54 27 8 : slabdata 54 54 0 size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 size-256 255 255 256 15 1 : tunables 120 60 8 : slabdata 17 17 0 size-128(DMA) 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 size-128 2403 2542 128 31 1 : tunables 120 60 8 : slabdata 82 82 0 size-64(DMA) 0 0 64 61 1 : tunables 120 60 8 : slabdata 0 0 0 size-64 670 793 64 61 1 : tunables 120 60 8 : slabdata 13 13 0 size-32(DMA) 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0 size-32 1415 3213 32 119 1 : tunables 120 60 8 : slabdata 27 27 0 kmem_cache 175 175 160 25 1 : tunables 120 60 8 : slabdata 7 7 0 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell @ 2005-01-25 3:20 ` Randy.Dunlap 2005-01-25 3:31 ` Andrew Tridgell 2005-01-25 3:45 ` Dave Jones 1 sibling, 1 reply; 30+ messages in thread From: Randy.Dunlap @ 2005-01-25 3:20 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton Andrew Tridgell wrote: > I've fixed up the problems I had with raid, and am now testing the > recent xattr changes with dbench and nbench. > > The problem I've hit now is a severe memory leak. I have applied the > patch from Linus for the leak in free_pipe_info(), and still I'm > leaking memory at the rate of about 100Mbyte/minute. > > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with > the pipe leak fix. The setup is: > > - 4 way PIII with 4G ram > - qla2200 adapter with ibm fastt200 disk array > - running dbench -x and nbench on separate disks, in a loop > > The oom killer kicks in after about 30 minutes. Naturally the oom > killer decided to kill my sshd, which was running vmstat :-) Do you have today's memleak patch applied? (cut-n-paste below). -- ~Randy ---- --- 1.40/fs/pipe.c 2005-01-15 12:01:16 -08:00 +++ edited/fs/pipe.c 2005-01-24 14:35:09 -08:00 @@ -630,13 +630,13 @@ struct pipe_inode_info *info = inode->i_pipe; inode->i_pipe = NULL; - if (info->tmp_page) - __free_page(info->tmp_page); for (i = 0; i < PIPE_BUFFERS; i++) { struct pipe_buffer *buf = info->bufs + i; if (buf->ops) buf->ops->release(info, buf); } + if (info->tmp_page) + __free_page(info->tmp_page); kfree(info); } - ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 3:20 ` Randy.Dunlap @ 2005-01-25 3:31 ` Andrew Tridgell 2005-01-25 4:48 ` Andrew Tridgell 0 siblings, 1 reply; 30+ messages in thread From: Andrew Tridgell @ 2005-01-25 3:31 UTC (permalink / raw) To: Randy.Dunlap; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton Randy, > I have applied the patch from Linus for the leak in > free_pipe_info() ... > Do you have today's memleak patch applied? (cut-n-paste below). yes :-) I'm trying the little leak tracking patch from Alexander Nyberg now. Cheers, Tridge ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 3:31 ` Andrew Tridgell @ 2005-01-25 4:48 ` Andrew Tridgell 2005-01-25 6:06 ` Andrew Morton 0 siblings, 1 reply; 30+ messages in thread From: Andrew Tridgell @ 2005-01-25 4:48 UTC (permalink / raw) To: Randy.Dunlap, linux-kernel, Andreas Gruenbacher, Andrew Morton [-- Attachment #1: message body text --] [-- Type: text/plain, Size: 2849 bytes --] > I'm trying the little leak tracking patch from Alexander Nyberg now. Here are the results (only backtraces with more than 10k counts included). The leak was at 1G of memory at the time I ran this, so its safe to say 10k page allocations ain't enough to explain it :-) I also attach a hacked version of the pgown sort program that sorts the output by count, and isn't O(n^2). It took 10 minutes to run the old version :-) I'm guessing the leak is in the new xattr code given that is what dbench and nbench were beating on. Andreas, can you look at the following and see if you can spot anything? This was on 2.6.11-rc2 with the pipe leak patch from Linus. The machine had leaked 1G of ram in 10 minutes, and was idle (only thing running was sshd). Cheers, Tridge 175485 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc011b6a9] current_fs_time+77 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc0132e70] generic_file_aio_write+100 [0xc017e586] ext3_file_write+38 [0xc014b7f5] do_sync_write+169 [0xc015f6de] fcntl_setlk64+286 [0xc01295a8] autoremove_wake_function+0 141512 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc0132e70] generic_file_aio_write+100 [0xc017e586] ext3_file_write+38 [0xc014b7f5] do_sync_write+169 [0xc015f6de] fcntl_setlk64+286 [0xc01295a8] autoremove_wake_function+0 [0xc014b8d5] vfs_write+157 67641 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc014dc69] __getblk+29 [0xc011b6a9] current_fs_time+77 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc018d368] ext3_xattr_user_get+108 [0xc0132e70] generic_file_aio_write+100 [0xc017e586] ext3_file_write+38 [0xc014b7f5] do_sync_write+169 52758 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc014dc69] __getblk+29 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc018d368] ext3_xattr_user_get+108 [0xc0132e70] generic_file_aio_write+100 [0xc017e586] ext3_file_write+38 [0xc014b7f5] do_sync_write+169 [0xc0120c0b] kill_proc_info+47 19610 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc011b6a9] current_fs_time+77 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc0132c3a] generic_file_aio_write_nolock+54 [0xc0132def] generic_file_write_nolock+151 [0xc011b7cb] __do_softirq+95 [0xc01295a8] autoremove_wake_function+0 [0xc01295a8] autoremove_wake_function+0 16874 times: Page allocated via order 0 [0xc0132258] generic_file_buffered_write+280 [0xc011b6a9] current_fs_time+77 [0xc0132a1e] __generic_file_aio_write_nolock+642 [0xc0132c3a] generic_file_aio_write_nolock+54 [0xc0132def] generic_file_write_nolock+151 [0xc01295a8] autoremove_wake_function+0 [0xc01295a8] autoremove_wake_function+0 [0xc0152d4a] blkdev_file_write+38 [-- Attachment #2: pgown-fast.c --] [-- Type: application/octet-stream, Size: 2427 bytes --] #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <string.h> struct block_list { char *txt; int len; int num; }; static struct block_list *list; static int list_size; static int max_size; struct block_list *block_head; int read_block(char *buf, FILE *fin) { int ret = 0; int hit = 0; char *curr = buf; for (;;) { *curr = getc(fin); if (*curr == EOF) return -1; ret++; if (*curr == '\n' && hit == 1) return ret - 1; else if (*curr == '\n') hit = 1; else hit = 0; curr++; } } static int compare_txt(struct block_list *l1, struct block_list *l2) { return strcmp(l1->txt, l2->txt); } static int compare_num(struct block_list *l1, struct block_list *l2) { return l2->num - l1->num; } static void add_list(char *buf, int len) { if (list_size != 0 && len == list[list_size-1].len && memcmp(buf, list[list_size-1].txt, len) == 0) { list[list_size-1].num++; return; } if (list_size == max_size) { printf("max_size too small??\n"); exit(1); } list[list_size].txt = malloc(len+1); list[list_size].len = len; list[list_size].num = 1; memcpy(list[list_size].txt, buf, len); list[list_size].txt[len] = 0; list_size++; if (list_size % 1000 == 0) { printf("loaded %d\r", list_size); fflush(stdout); } } int main(int argc, char **argv) { FILE *fin, *fout; char buf[1024]; int ret, i, count; struct block_list *list2; struct stat st; fin = fopen(argv[1], "r"); fout = fopen(argv[2], "w"); if (!fin || !fout) { printf("Usage: ./program <input> <output>\n"); perror("open: "); exit(2); } fstat(fileno(fin), &st); max_size = st.st_size / 100; /* hack ... */ list = malloc(max_size * sizeof(*list)); for(;;) { ret = read_block(buf, fin); if (ret < 0) break; buf[ret] = '\0'; add_list(buf, ret); } printf("loaded %d\n", list_size); printf("sorting ....\n"); qsort(list, list_size, sizeof(list[0]), compare_txt); list2 = malloc(sizeof(*list) * list_size); printf("culling\n"); for (i=count=0;i<list_size;i++) { if (count == 0 || strcmp(list2[count-1].txt, list[i].txt) != 0) { list2[count++] = list[i]; } else { list2[count-1].num += list[i].num; } } qsort(list2, count, sizeof(list[0]), compare_num); for (i=0;i<count;i++) { fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt); } return 0; } ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 4:48 ` Andrew Tridgell @ 2005-01-25 6:06 ` Andrew Morton 2005-01-25 11:35 ` Andrew Tridgell 0 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2005-01-25 6:06 UTC (permalink / raw) To: tridge; +Cc: rddunlap, linux-kernel, agruen Andrew Tridgell <tridge@osdl.org> wrote: > > > I'm trying the little leak tracking patch from Alexander Nyberg now. > > Here are the results (only backtraces with more than 10k counts > included). The leak was at 1G of memory at the time I ran this, so its > safe to say 10k page allocations ain't enough to explain it :-) > > I also attach a hacked version of the pgown sort program that sorts > the output by count, and isn't O(n^2). It took 10 minutes to run the > old version :-) > > I'm guessing the leak is in the new xattr code given that is what > dbench and nbench were beating on. Andreas, can you look at the > following and see if you can spot anything? > > This was on 2.6.11-rc2 with the pipe leak patch from Linus. The > machine had leaked 1G of ram in 10 minutes, and was idle (only thing > running was sshd). > > 175485 times: > Page allocated via order 0 > [0xc0132258] generic_file_buffered_write+280 > [0xc011b6a9] current_fs_time+77 > [0xc0132a1e] __generic_file_aio_write_nolock+642 > [0xc0132e70] generic_file_aio_write+100 > [0xc017e586] ext3_file_write+38 > [0xc014b7f5] do_sync_write+169 > [0xc015f6de] fcntl_setlk64+286 > [0xc01295a8] autoremove_wake_function+0 It would be pretty strange for plain old pagecache pages to leak in this manner. A few things come to mind. - The above trace is indistinguishable from the normal situation of having a lot of pagecache floating about. IOW: we don't know if the above pages have really leaked or not. - It's sometimes possible for ext3 pages to remain alive (on the page LRU) after a truncate, but with no other references to them. These pages are trivially reclaimable. So even if you've deleted the files which the benchmark created, there _could_ be pages left over. Although it would be unusual. So what you should do before generating the leak tool output is to put heavy memory pressure on the machine to try to get it to free up as much of that pagecache as possible. bzero(malloc(lots)) will do it - create a real swapstorm, then do swapoff to kill remaining swapcache as well. If we still see a lot of pages with the above trace then something really broke. Where does one get the appropriate dbench version? How are you mkfsing the filesystem? Was mke2fs patched? How is the benchmark being invoked? Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 6:06 ` Andrew Morton @ 2005-01-25 11:35 ` Andrew Tridgell 2005-01-25 12:11 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Andrew Tridgell @ 2005-01-25 11:35 UTC (permalink / raw) To: Andrew Morton; +Cc: rddunlap, linux-kernel, agruen Andrew, > So what you should do before generating the leak tool output is to put > heavy memory pressure on the machine to try to get it to free up as much of > that pagecache as possible. bzero(malloc(lots)) will do it - create a real > swapstorm, then do swapoff to kill remaining swapcache as well. As you saw when you logged into the machine earlier tonight, when you suspend the dbench processes and run a memory filler the memory is reclaimed. I still think its a bug though, as the oom killer is being triggered when it shouldn't be. I have 4G of ram in this machine, and I'm only running a couple of hundred processes that should be using maybe 500M in total, so for the oom killer to kick in might mean that the memory isn't being reclaimed under normal memory pressure. Certainly a ps shows no process using more than a few MB. The oom killer report is below. This is with 2.6.11-rc2, with the pipe leak fix, and the pgown monitoring patch. It was running one nbench of size 50 and one dbench of size 40 at the time. If this isn't a leak, then it would also be good to fix /usr/bin/free so the -/+ buffers line becomes meaningful again. With the machine completely idle (just a sshd running) I see: Mem: 3701184 3483188 217996 0 130092 1889440 -/+ buffers/cache: 1463656 2237528 which looks like a leak. This persists even after the disks are unmounted. After filling/freeing memory I see: Mem: 3701184 28176 3673008 0 520 3764 -/+ buffers/cache: 23892 3677292 so it can recover it, but under normal usage it doesn't before the oom killer kicks in. Cheers, Tridge Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18910 (smbd). Jan 25 00:49:14 dev4-003 kernel: oom-killer: gfp_mask=0xd0 Jan 25 00:49:14 dev4-003 kernel: DMA per-cpu: Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 2, high 6, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 2, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 2, high 6, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 2, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 2, high 6, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 2, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 2, high 6, batch 1 Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 2, batch 1 Jan 25 00:49:14 dev4-003 kernel: Normal per-cpu: Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: HighMem per-cpu: Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16 Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16 Jan 25 00:49:14 dev4-003 kernel: Jan 25 00:49:14 dev4-003 kernel: Free pages: 5884kB (640kB HighMem) Jan 25 00:49:14 dev4-003 kernel: Active:84764 inactive:806173 dirty:338229 writeback:36 unstable:0 free:1471 slab:23814 mapped:31610 pagetables:1616 Jan 25 00:49:14 dev4-003 kernel: DMA free:284kB min:68kB low:84kB high:100kB active:440kB inactive:10048kB present:16384kB pages_scanned:140 all_unreclaimable? no Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0 Jan 25 00:49:14 dev4-003 kernel: Normal free:4960kB min:3756kB low:4692kB high:5632kB active:72072kB inactive:632096kB present:901120kB pages_scanned:58999 all_unreclaimable? no Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0 Jan 25 00:49:14 dev4-003 kernel: HighMem free:640kB min:512kB low:640kB high:768kB active:266948kB inactive:2581960kB present:2850752kB pages_scanned:0 all_unreclaimable? no Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0 Jan 25 00:49:14 dev4-003 kernel: DMA: 53*4kB 3*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 284kB Jan 25 00:49:14 dev4-003 kernel: Normal: 400*4kB 16*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 4960kB Jan 25 00:49:14 dev4-003 kernel: HighMem: 6*4kB 11*8kB 1*16kB 0*32kB 0*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB Jan 25 00:49:14 dev4-003 kernel: Swap cache: add 1584899, delete 1585025, find 294/351, race 0+0 Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18914 (smbd). ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 11:35 ` Andrew Tridgell @ 2005-01-25 12:11 ` Nick Piggin 0 siblings, 0 replies; 30+ messages in thread From: Nick Piggin @ 2005-01-25 12:11 UTC (permalink / raw) To: tridge; +Cc: Andrew Morton, rddunlap, linux-kernel, agruen, Andrea Arcangeli [-- Attachment #1: Type: text/plain, Size: 1624 bytes --] Andrew Tridgell wrote: > Andrew, > > > So what you should do before generating the leak tool output is to put > > heavy memory pressure on the machine to try to get it to free up as much of > > that pagecache as possible. bzero(malloc(lots)) will do it - create a real > > swapstorm, then do swapoff to kill remaining swapcache as well. > > As you saw when you logged into the machine earlier tonight, when you > suspend the dbench processes and run a memory filler the memory is > reclaimed. > > I still think its a bug though, as the oom killer is being triggered > when it shouldn't be. I have 4G of ram in this machine, and I'm only > running a couple of hundred processes that should be using maybe 500M > in total, so for the oom killer to kick in might mean that the memory > isn't being reclaimed under normal memory pressure. Certainly a ps > shows no process using more than a few MB. > > The oom killer report is below. This is with 2.6.11-rc2, with the pipe > leak fix, and the pgown monitoring patch. It was running one nbench of > size 50 and one dbench of size 40 at the time. > There are various OOM killer improvements and fixes that have gone into Andrew's kernel tree which should be included for 2.6.11. I don't think the OOM killer was ever perfect in 2.6, but recent tinkering in mm/ probably aggrivated it. *blush* Here is another small OOM killer improvement. Previously we needed to reclaim SWAP_CLUSTER_MAX pages in a single pass. That should be changed so that we need only reclaim that many pages during the entire try_to_free_pages run, without going OOM. Andrea? Andrew? Look OK? [-- Attachment #2: less-oom.patch --] [-- Type: text/plain, Size: 800 bytes --] --- linux-2.6-npiggin/mm/vmscan.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff -puN mm/vmscan.c~oom-helper mm/vmscan.c --- linux-2.6/mm/vmscan.c~oom-helper 2005-01-25 23:04:28.000000000 +1100 +++ linux-2.6-npiggin/mm/vmscan.c 2005-01-25 23:05:06.000000000 +1100 @@ -914,12 +914,12 @@ int try_to_free_pages(struct zone **zone sc.nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; } - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) { + total_scanned += sc.nr_scanned; + total_reclaimed += sc.nr_reclaimed; + if (total_reclaimed >= SWAP_CLUSTER_MAX) { ret = 1; goto out; } - total_scanned += sc.nr_scanned; - total_reclaimed += sc.nr_reclaimed; /* * Try to write back as many pages as we just scanned. This _ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell 2005-01-25 3:20 ` Randy.Dunlap @ 2005-01-25 3:45 ` Dave Jones 2005-01-25 12:51 ` Andrea Arcangeli 1 sibling, 1 reply; 30+ messages in thread From: Dave Jones @ 2005-01-25 3:45 UTC (permalink / raw) To: Andrew Tridgell; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote: > The problem I've hit now is a severe memory leak. I have applied the > patch from Linus for the leak in free_pipe_info(), and still I'm > leaking memory at the rate of about 100Mbyte/minute. > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with > the pipe leak fix. The setup is: That's a little more extreme than what I'm seeing, so it may be something else, but my firewall box needs rebooting every few days. It leaks around 50MB a day for some reason. Given it's not got a lot of ram, after 4-5 days or so, it's completely exhausted its swap too. It's currently on a 2.6.10-ac kernel, so it's entirely possible that we're not looking at the same issue, though it could be something thats been there for a while if your workload makes it appear quicker than a firewall/ipsec gateway would. Do you see the same leaks with an earlier kernel ? post OOM (when there was about 2K free after named got oom-killed) this is what slabinfo looked like.. dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0 vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0 size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0 size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0 avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0 size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0 Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 3:45 ` Dave Jones @ 2005-01-25 12:51 ` Andrea Arcangeli 2005-01-25 13:31 ` Andreas Gruenbacher 0 siblings, 1 reply; 30+ messages in thread From: Andrea Arcangeli @ 2005-01-25 12:51 UTC (permalink / raw) To: Dave Jones, Andrew Tridgell, linux-kernel, Andreas Gruenbacher, Andrew Morton On Mon, Jan 24, 2005 at 10:45:47PM -0500, Dave Jones wrote: > On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote: > > The problem I've hit now is a severe memory leak. I have applied the > > patch from Linus for the leak in free_pipe_info(), and still I'm > > leaking memory at the rate of about 100Mbyte/minute. > > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with > > the pipe leak fix. The setup is: > > That's a little more extreme than what I'm seeing, so it may be > something else, but my firewall box needs rebooting every > few days. It leaks around 50MB a day for some reason. > Given it's not got a lot of ram, after 4-5 days or so, it's > completely exhausted its swap too. > > It's currently on a 2.6.10-ac kernel, so it's entirely possible that > we're not looking at the same issue, though it could be something > thats been there for a while if your workload makes it appear > quicker than a firewall/ipsec gateway would. > Do you see the same leaks with an earlier kernel ? > > post OOM (when there was about 2K free after named got oom-killed) > this is what slabinfo looked like.. > > dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0 > vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0 > size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0 > size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0 > avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0 > size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0 What is avtab_node? there's no such thing in my kernel. But the above can be ok. Can you show meminfo too after oom kill? Just another datapoint my firewall runs a kernel based on 2.6.11-rc1-bk8 with all the needed oom fixes and I've no problems on it yet. I run it oom and this is what I get after the oom: athlon:/home/andrea # free total used free shared buffers cached Mem: 511136 50852 460284 0 572 15764 -/+ buffers/cache: 34516 476620 Swap: 1052248 0 1052248 athlon:/home/andrea # The above is sane, 34M is very reasonable for what's loaded there (there's the X server running, named too, and various other non standard daemons, one even has a virtual size of >100m so it's not a tiny thing), so I'm quite sure I'm not hitting a memleak, at least not on the firewal. No ipsec on it btw, and it's a pure IDE without anything special, just quite a few nics and USB usermode running all the time. athlon:/home/andrea # uptime 1:34pm up 2 days 12:08, 1 user, load average: 0.98, 1.13, 0.54 athlon:/home/andrea # iptables -L -v |grep -A2 FORWARD Chain FORWARD (policy ACCEPT 65 packets, 9264 bytes) pkts bytes target prot opt in out source destination 3690K 2321M block all -- any any anywhere anywhere athlon:/home/andrea # So if there's a memleak in rc1-bk8, it's probably not in the core of the kernel, but in some driver or things like ipsec. Either that or it broke after 2.6.11-rc1-bk8. The kernel I'm running is quite heavily patched too, but I'm not aware of any memleak fix in the additional patches. Anyway I'll try again in a few days to verify it goes back down again to exactly 34M of anonymous/random and 15M of cache. No apparent problem on my desktop system either, it's running the same kernel with different config. If somebody could fix the kernel CVS I could have a look at the interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 12:51 ` Andrea Arcangeli @ 2005-01-25 13:31 ` Andreas Gruenbacher 2005-01-25 13:55 ` Andrea Arcangeli 0 siblings, 1 reply; 30+ messages in thread From: Andreas Gruenbacher @ 2005-01-25 13:31 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Dave Jones, Andrew Tridgell, linux-kernel, Andrew Morton On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote: > If somebody could fix the kernel CVS I could have a look at the > interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2. What's not okay? -- Andreas Gruenbacher <agruen@suse.de> SUSE Labs, SUSE LINUX GMBH ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2 2005-01-25 13:31 ` Andreas Gruenbacher @ 2005-01-25 13:55 ` Andrea Arcangeli 0 siblings, 0 replies; 30+ messages in thread From: Andrea Arcangeli @ 2005-01-25 13:55 UTC (permalink / raw) To: Andreas Gruenbacher; +Cc: linux-kernel On Tue, Jan 25, 2005 at 02:31:03PM +0100, Andreas Gruenbacher wrote: > On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote: > > If somebody could fix the kernel CVS I could have a look at the > > interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2. > > What's not okay? I already prepared a separated deatiled bugreport. I'm reproducing one last time before sending it, after a "su - nobody" to be sure it's not a local problem in my environment that might have changed, but I'm 99% sure it'll reproduce just fine in a fresh account too. Are you using cvsps at all? ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2005-01-25 13:56 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher 2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher 2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher 2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher 2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher 2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher 2005-01-20 12:16 ` Andreas Dilger 2005-01-20 13:29 ` Andreas Gruenbacher 2005-01-20 23:05 ` Andreas Dilger 2005-01-21 0:36 ` Andreas Gruenbacher 2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie 2005-01-21 23:46 ` Andreas Gruenbacher 2005-01-23 13:22 ` Andrew Tridgell 2005-01-23 22:09 ` Andrew Tridgell 2005-01-23 22:58 ` Andreas Gruenbacher 2005-01-23 23:32 ` Andreas Gruenbacher 2005-01-24 11:24 ` Andrew Tridgell 2005-01-24 11:42 ` Christoph Hellwig 2005-01-24 14:11 ` Andreas Gruenbacher 2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell 2005-01-25 3:20 ` Randy.Dunlap 2005-01-25 3:31 ` Andrew Tridgell 2005-01-25 4:48 ` Andrew Tridgell 2005-01-25 6:06 ` Andrew Morton 2005-01-25 11:35 ` Andrew Tridgell 2005-01-25 12:11 ` Nick Piggin 2005-01-25 3:45 ` Dave Jones 2005-01-25 12:51 ` Andrea Arcangeli 2005-01-25 13:31 ` Andreas Gruenbacher 2005-01-25 13:55 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).