* [ea-in-inode 0/5] Further fixes
@ 2005-01-20 2:01 Andreas Gruenbacher
2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
` (5 more replies)
0 siblings, 6 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
Hello,
here is a set of fixes for ext3 in-inode attributes:
patches/ea-xattr-nolock.diff
No lock needed when freeing inode
The effect of the additional lock taking is very minor, but
it's still unnecessary.
patches/ea-xattr-update-sb.diff
Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs
The EXT_ATTR filesystem feature was set for xattr blocks, but not
for in-inode attributes.
patches/ea-xattr-doc.diff
Documentation fix
patches/ea-xattr-no-extra_isize.diff
Fix i_extra_isize check
Filesystem corruption fix.
patches/ea-xattr-reserved-inodes.diff
Disallow in-inode attributes for reserved inodes
Filesystem corruption fix.
Regards,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 1/5] No lock needed when freeing inode
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
@ 2005-01-20 2:01 ` Andreas Gruenbacher
2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
` (4 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
[-- Attachment #1: ea-xattr-nolock.diff --]
[-- Type: text/plain, Size: 867 bytes --]
ext3_xattr_delete_inode is called from ext3_free_inode which always
has exclusive access to the inode, so there is no need to take the
xattr semaphore.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -1066,7 +1066,6 @@ ext3_xattr_delete_inode(handle_t *handle
{
struct buffer_head *bh = NULL;
- down_write(&EXT3_I(inode)->xattr_sem);
if (!EXT3_I(inode)->i_file_acl)
goto cleanup;
bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
@@ -1088,7 +1087,6 @@ ext3_xattr_delete_inode(handle_t *handle
cleanup:
brelse(bh);
- up_write(&EXT3_I(inode)->xattr_sem);
}
/*
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 4/5] Fix i_extra_isize check
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
@ 2005-01-20 2:01 ` Andreas Gruenbacher
2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
` (2 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
[-- Attachment #1: ea-xattr-no-extra_isize.diff --]
[-- Type: text/plain, Size: 2134 bytes --]
We are checking for (EXT3_SB(inode->i_sb)->s_inode_size <=
EXT3_GOOD_OLD_INODE_SIZE) to find out if we can set in-inode attributes;
the test fails for inodes that have been created before the ea-in-inode
patch. Those inodes have (i_extra_isize == 0), so we end up with the
attributes overlapping the i_extra_isize field. Checking for
(i_extra_isize == 0) instead fixes this case.
The EXT3_STATE_XATTR flag is only set if (i_extra_isize > 0) and the
inodes has in-inode attributes, so that is enough in the first two
tests.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -272,8 +272,7 @@ ext3_xattr_ibody_get(struct inode *inode
void *end;
int error;
- if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE ||
- !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
+ if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
return -ENODATA;
error = ext3_get_inode_loc(inode, &iloc);
if (error)
@@ -399,8 +398,7 @@ ext3_xattr_ibody_list(struct inode *inod
void *end;
int error;
- if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE ||
- !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
+ if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
return 0;
error = ext3_get_inode_loc(inode, &iloc);
if (error)
@@ -865,7 +863,7 @@ ext3_xattr_ibody_find(struct inode *inod
struct ext3_inode *raw_inode;
int error;
- if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
+ if (EXT3_I(inode)->i_extra_isize == 0)
return 0;
raw_inode = ext3_raw_inode(&is->iloc);
header = IHDR(inode, raw_inode);
@@ -896,7 +894,7 @@ ext3_xattr_ibody_set(handle_t *handle, s
struct ext3_xattr_search *s = &is->s;
int error;
- if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
+ if (EXT3_I(inode)->i_extra_isize == 0)
return -ENOSPC;
error = ext3_xattr_set_entry(i, s);
if (error)
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 3/5] Documentation fix
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
@ 2005-01-20 2:01 ` Andreas Gruenbacher
2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
` (3 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
[-- Attachment #1: ea-xattr-doc.diff --]
[-- Type: text/plain, Size: 903 bytes --]
In-inode xattr entry descriptors are unsorted.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -37,9 +37,9 @@
* | value 2 | |
* +------------------+
*
- * The header is followed by multiple entry descriptors. Descriptors are
- * kept sorted. The attribute values are aligned to the end of the block
- * in no specific order.
+ * The header is followed by multiple entry descriptors. In disk blocks, the
+ * entry descriptors are kept sorted. In inodes, they are unsorted. The
+ * attribute values are aligned to the end of the block in no specific order.
*
* Locking strategy
* ----------------
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 5/5] Disallow in-inode attributes for reserved inodes
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
` (3 preceding siblings ...)
2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
@ 2005-01-20 2:01 ` Andreas Gruenbacher
2005-01-20 12:16 ` Andreas Dilger
2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
5 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
[-- Attachment #1: ea-xattr-reserved-inodes.diff --]
[-- Type: text/plain, Size: 2111 bytes --]
When creating a filesystem with inodes bigger than 128 bytes, mke2fs
fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
it creates (the journal, the filesystem root, and lost+found). We would
require a zeroed-out i_extra_isize field but we don't get it, so
disallow in-inode attributes for those inodes.
Add an i_extra_isize sanity check.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Index: linux-2.6.11-latest/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/inode.c
+++ linux-2.6.11-latest/fs/ext3/inode.c
@@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);
- ei->i_extra_isize =
- (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
- le16_to_cpu(raw_inode->i_extra_isize) : 0;
- if (ei->i_extra_isize) {
- __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
- ei->i_extra_isize;
- if (le32_to_cpu(*magic))
- ei->i_state |= EXT3_STATE_XATTR;
- }
+ if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
+ EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
+ /*
+ * When mke2fs creates big inodes it does not zero out
+ * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
+ * so ignore those first few inodes.
+ */
+ ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
+ if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
+ EXT3_INODE_SIZE(inode->i_sb))
+ goto bad_inode;
+ if (ei->i_extra_isize == 0) {
+ /* The extra space is currently unused. Use it. */
+ ei->i_extra_isize = sizeof(struct ext3_inode) -
+ EXT3_GOOD_OLD_INODE_SIZE;
+ } else {
+ __le32 *magic = (void *)raw_inode +
+ EXT3_GOOD_OLD_INODE_SIZE +
+ ei->i_extra_isize;
+ if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
+ ei->i_state |= EXT3_STATE_XATTR;
+ }
+ } else
+ ei->i_extra_isize = 0;
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext3_file_inode_operations;
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
` (2 preceding siblings ...)
2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
@ 2005-01-20 2:01 ` Andreas Gruenbacher
2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 2:01 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Theodore Ts'o
Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
linux-kernel
[-- Attachment #1: ea-xattr-update-sb.diff --]
[-- Type: text/plain, Size: 861 bytes --]
The xattr feature was only set when creating an xattr block. Also set it
when creating in-inode xattrs.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -823,7 +823,6 @@ getblk_failed:
error = ext3_journal_dirty_metadata(handle, new_bh);
if (error)
goto cleanup;
- ext3_xattr_update_super_block(handle, sb);
}
}
@@ -1001,6 +1000,7 @@ ext3_xattr_set_handle(handle_t *handle,
}
}
if (!error) {
+ ext3_xattr_update_super_block(handle, inode->i_sb);
inode->i_ctime = CURRENT_TIME_SEC;
error = ext3_mark_iloc_dirty(handle, inode, &is.iloc);
/*
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
@ 2005-01-20 12:16 ` Andreas Dilger
2005-01-20 13:29 ` Andreas Gruenbacher
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2005-01-20 12:16 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3922 bytes --]
On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote:
> When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> it creates (the journal, the filesystem root, and lost+found). We would
> require a zeroed-out i_extra_isize field but we don't get it, so
> disallow in-inode attributes for those inodes.
>
> Add an i_extra_isize sanity check.
I'm not sure I agree with this patch. It definitely resolves a problem
I had with the previous patch (4/5), namely that for inodes that were
created in a large-inode filesystem using a kernel w/o the large-inode
patch we didn't save EAs into the large-inode space at all. With this
patch we will start using that space.
It is debatable whether we should mark inodes bad if the i_extra_isize
field is bad, or if we should just initialize i_extra_isize in that case.
On one hand there is very little else we can use to validate the on-disk
inodes, but on the other hand the EA data isn't critical to reading the
inode so this would seem to introduce additional failure cases.
I'm not sure whether an old e2fsck will also clear out the space in a
large inode that it writes or not (I know that the patched one we use
validates i_extra_isize). If it doesn't it may be that there would be a
largish number of inodes that are marked bad because of this, so having
the kernel fix this would be good. If the old e2fsck zeroes the large
inodes properly it would be prudent to have an ext3_error() here so that
the disk corruption triggers an e2fsck the next time the system starts.
For the root and lost+found inodes it looks like we can never store an
EA in the extra part of the inode regardless of whether i_extra_isize is
good or not. If a bad value is found we could just initialize it and
start using that space (though not print an ext3_error() in that case,
an ext3_warning() if anything since this is probably the fault of mke2fs).
> Index: linux-2.6.11-latest/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.11-latest.orig/fs/ext3/inode.c
> +++ linux-2.6.11-latest/fs/ext3/inode.c
> @@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
> ei->i_data[block] = raw_inode->i_block[block];
> INIT_LIST_HEAD(&ei->i_orphan);
>
> - ei->i_extra_isize =
> - (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
> - le16_to_cpu(raw_inode->i_extra_isize) : 0;
> - if (ei->i_extra_isize) {
> - __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
> - ei->i_extra_isize;
> - if (le32_to_cpu(*magic))
> - ei->i_state |= EXT3_STATE_XATTR;
> - }
> + if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
> + EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
> + /*
> + * When mke2fs creates big inodes it does not zero out
> + * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
> + * so ignore those first few inodes.
> + */
> + ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> + if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> + EXT3_INODE_SIZE(inode->i_sb))
> + goto bad_inode;
> + if (ei->i_extra_isize == 0) {
> + /* The extra space is currently unused. Use it. */
> + ei->i_extra_isize = sizeof(struct ext3_inode) -
> + EXT3_GOOD_OLD_INODE_SIZE;
> + } else {
> + __le32 *magic = (void *)raw_inode +
> + EXT3_GOOD_OLD_INODE_SIZE +
> + ei->i_extra_isize;
> + if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
> + ei->i_state |= EXT3_STATE_XATTR;
> + }
> + } else
> + ei->i_extra_isize = 0;
>
> if (S_ISREG(inode->i_mode)) {
> inode->i_op = &ext3_file_inode_operations;
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
2005-01-20 12:16 ` Andreas Dilger
@ 2005-01-20 13:29 ` Andreas Gruenbacher
2005-01-20 23:05 ` Andreas Dilger
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 13:29 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel
On Thursday 20 January 2005 13:16, Andreas Dilger wrote:
> On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote:
> > When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> > fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> > it creates (the journal, the filesystem root, and lost+found). We would
> > require a zeroed-out i_extra_isize field but we don't get it, so
> > disallow in-inode attributes for those inodes.
> >
> > Add an i_extra_isize sanity check.
>
> I'm not sure I agree with this patch. It definitely resolves a problem
> I had with the previous patch (4/5), namely that for inodes that were
> created in a large-inode filesystem using a kernel w/o the large-inode
> patch we didn't save EAs into the large-inode space at all. With this
> patch we will start using that space.
The ea-in-inode patch totally relies on getting all the available inode space
cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for
any inode we find, then i_extra_isize may contain a random number, and we've
just lost, period: There is no way of sanitizing a random i_extra_isize; we
cannot know what the right number would be.
We have two options for going forward from here: (a) depend on the fact that
all inodes have been cleared out for us, or (b) don't rely on that, fix
e2fsck before going any further, and have the fixed mke2fs and e2fsck set a
filesystem feature flag that tells us that all the extra space in all inodes
is "sane".
For (a) we can make an (ugly) exception because we know which inodes mke2fs
created. As far as I can tell we can also fully rely on the kernel having
cleared out the extra space for us. e2fsck may currently fail to move the
extra space around when moving inodes, but that did not matter without
ea-in-inode, and if we fix it now, that's probably good enough. We don't
really care which inode's zeroed extra space we look at without ea-in-inode.
> It is debatable whether we should mark inodes bad if the i_extra_isize
> field is bad, or if we should just initialize i_extra_isize in that case.
IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply
changing it to something we think is better is a really bad idea. We might as
well find an unusual but still valid i_extra_isize and use it, then a future
patch allocates a few more fields in the inode, and suddenly we start to see
garbage.
> On one hand there is very little else we can use to validate the on-disk
> inodes, but on the other hand the EA data isn't critical to reading the
> inode so this would seem to introduce additional failure cases.
You may have an access acl on the inode. Not being able to read an access acl
is a clear sign of trouble. The same applies for everything else in the
system.* and security.* namespaces, at least.
> I'm not sure whether an old e2fsck will also clear out the space in a
> large inode that it writes or not (I know that the patched one we use
> validates i_extra_isize). If it doesn't it may be that there would be a
> largish number of inodes that are marked bad because of this, so having
> the kernel fix this would be good.
Repairing the filesystem has no place in the kernel code. If what you assume
is the case (and I sure would have assumed you to check before proposing the
kernel patch for inclusion), we have to go for the filesystem feature flag
approach.
> If the old e2fsck zeroes the large
> inodes properly it would be prudent to have an ext3_error() here so that
> the disk corruption triggers an e2fsck the next time the system starts.
That's even better, yes.
> For the root and lost+found inodes it looks like we can never store an
> EA in the extra part of the inode regardless of whether i_extra_isize is
> good or not. If a bad value is found we could just initialize it and
> start using that space (though not print an ext3_error() in that case,
> an ext3_warning() if anything since this is probably the fault of mke2fs).
I disagree. We cannot just use the space when we think the inode is corrupted.
Cheers,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
2005-01-20 13:29 ` Andreas Gruenbacher
@ 2005-01-20 23:05 ` Andreas Dilger
2005-01-21 0:36 ` Andreas Gruenbacher
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2005-01-20 23:05 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 4260 bytes --]
On Jan 20, 2005 14:29 +0100, Andreas Gruenbacher wrote:
> The ea-in-inode patch totally relies on getting all the available inode space
> cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for
> any inode we find, then i_extra_isize may contain a random number, and we've
> just lost, period: There is no way of sanitizing a random i_extra_isize; we
> cannot know what the right number would be.
The large-inode support is designed to allow different amounts of
"fixed" optional data (i.e. what is stored inside i_extra_isize),
so it is valid to set this to 4 (i.e. just enough to hold i_extra_isize
itself) and store the EA data after that. Any code which reads "fixed"
fields from a large inode (e.g. i_mtime_nsec) needs to validate that
i_extra_isize on that inode is large enough for that data to actually
be in the fixed area in the large inode.
If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional
fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize
those values since it should know what they are or they shouldn't be in
struct ext3_inode.
The whole point of i_extra_isize is that it is possible for inodes
to have different amounts of the optional fixed fields in each large
inode, depending on what the kernel that wrote the inode knew about.
So any value for i_extra_isize is valid as long as those fields
are initialized. If we arbitrarily set i_extra_isize = 4 instead of
leaving the bad value this is no different than waiting for e2fsck to
do the same.
> > It is debatable whether we should mark inodes bad if the i_extra_isize
> > field is bad, or if we should just initialize i_extra_isize in that case.
>
> IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply
> changing it to something we think is better is a really bad idea.
>
> You may have an access acl on the inode. Not being able to read an access acl
> is a clear sign of trouble. The same applies for everything else in the
> system.* and security.* namespaces, at least.
Well, I said it was debatable and we're having a debate ;-). I don't have
a strong opinion either way. If we ext3_error() in this case at least we
will check the fs on the next boot (which will just zero i_extra_isize)
instead of never doing anything to resolve the situation.
> > For the root and lost+found inodes it looks like we can never store an
> > EA in the extra part of the inode regardless of whether i_extra_isize is
> > good or not. If a bad value is found we could just initialize it and
> > start using that space (though not print an ext3_error() in that case,
> > an ext3_warning() if anything since this is probably the fault of mke2fs).
>
> I disagree. We cannot just use the space when we think the inode is corrupted.
But as your patch stands it doesn't ever check if i_extra_isize is valid
for the root or lost+found inode. It just always sets i_extra_isize = 0
and never uses it. Given that the root inode is fairly high-traffic it
makes sense to use the faster EA space if it is available.
If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
not so keen to have an ext3_error() there. If the user doesn't have an
e2fsck with ea-in-inode support there isn't anything they can do to fix
it and they will get a full e2fsck on each boot.
Even so, for the effort of setting i_extra_isize = 4 (or larger if we
initialize the fixed fields) we can do the equivalent of what e2fsck will
do when it finds a bogus value.
The good news is that we can still apply your patch as-is and address my
concerns later since this is a transient issue. Also, given that there
are probably only a handful of filesystems in the world using large inodes
(excluding Lustre filesystems which aren't affected by this) I don't think
it is a pressing issue yet. I'm going to be away for 2 weeks, so I'll say
accept this patch as is and we can look at it again when I get back, and
maybe Ted and Stephen will have weighed in on this issue also.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
2005-01-20 23:05 ` Andreas Dilger
@ 2005-01-21 0:36 ` Andreas Gruenbacher
0 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-21 0:36 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel
On Friday 21 January 2005 00:05, Andreas Dilger wrote:
> [...]
> But as your patch stands it doesn't ever check if i_extra_isize is valid
> for the root or lost+found inode. It just always sets i_extra_isize = 0
(that's the in-memory i_extra_isize)
> and never uses it. Given that the root inode is fairly high-traffic it
> makes sense to use the faster EA space if it is available.
It's only a single block we're talking about, not all the overhead you run
into with huge amounts of attributes in many xattr disk blocks. It sure would
be much cleaner to use the root inode's in-inode space like with all other
inodes, but performance wise I don't think it matters.
> If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
> not so keen to have an ext3_error() there. If the user doesn't have an
> e2fsck with ea-in-inode support there isn't anything they can do to fix
> it and they will get a full e2fsck on each boot.
Agreed, that would be really bad. We should get e2fsck fixed ASAP.
> Even so, for the effort of setting i_extra_isize = 4 (or larger if we
> initialize the fixed fields) we can do the equivalent of what e2fsck will
> do when it finds a bogus value.
We cannot ask the user, and we don't have the kind of global view that e2fsck
has. Something different may be messed up, and may have lead to the
corruption. It's unlikely, but not impossible.
Cheers,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
` (4 preceding siblings ...)
2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
@ 2005-01-21 22:58 ` Stephen C. Tweedie
2005-01-21 23:46 ` Andreas Gruenbacher
5 siblings, 1 reply; 30+ messages in thread
From: Stephen C. Tweedie @ 2005-01-21 22:58 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel,
Stephen Tweedie
Hi Andreas,
On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:
> here is a set of fixes for ext3 in-inode attributes:
Obvious first question --- have these diffs survived the same
torture-by-tridgell that the previous batch suffered?
Cheers,
Stephen
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
@ 2005-01-21 23:46 ` Andreas Gruenbacher
2005-01-23 13:22 ` Andrew Tridgell
2005-01-23 22:09 ` Andrew Tridgell
0 siblings, 2 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-21 23:46 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel
On Fri, 2005-01-21 at 23:58, Stephen C. Tweedie wrote:
> Hi Andreas,
>
> On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:
>
> > here is a set of fixes for ext3 in-inode attributes:
>
> Obvious first question --- have these diffs survived the same
> torture-by-tridgell that the previous batch suffered?
No. The fixes are a lot less intrusive than the full xattr rework
though. I obviously ran tests; this included dbench.
Tridge, can you beat the code some more?
Andrew has the five fixes in 2.6.11-rc1-mm2.
-- Andreas.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-21 23:46 ` Andreas Gruenbacher
@ 2005-01-23 13:22 ` Andrew Tridgell
2005-01-23 22:09 ` Andrew Tridgell
1 sibling, 0 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-23 13:22 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
Andreas,
> Tridge, can you beat the code some more?
> Andrew has the five fixes in 2.6.11-rc1-mm2.
sorry for the delay. I've started to test 2.6.11-rc1-mm2 tonight. No
problems so far.
Cheers, Tridge
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-21 23:46 ` Andreas Gruenbacher
2005-01-23 13:22 ` Andrew Tridgell
@ 2005-01-23 22:09 ` Andrew Tridgell
2005-01-23 22:58 ` Andreas Gruenbacher
1 sibling, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-23 22:09 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
Andreas,
> Tridge, can you beat the code some more?
>
> Andrew has the five fixes in 2.6.11-rc1-mm2.
It seemed to pass dbench runs OK, but then I started simultaneously
running dbench and nbench on two different disks (I have a new test
machine with more disks available). I am getting failures like this:
Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1
Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1.
Jan 23 13:19:43 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdd1
Jan 23 13:19:43 dev4-003 kernel: Aborting journal on device sdd1.
Jan 23 13:19:43 dev4-003 kernel: EXT3-fs error (device sdd1) in start_transaction: Readonly filesystem
The first failure was on the disk I am using for nbench (sdc). The
second is during a later run on the disk I am using the dbench
(sdd). I rebooted between the runs. It's interesting that its failing
at exactly the same offset both times. Is there anything magic about
offset 1036?
The new test machine is a 4 way PIII, with 4G ram, and 4 36G SCSI
disks. The test machine I have been using previously was a 2 way
(+hyperthreaded) Xeon.
I'll see if I can reproduce the problem with just dbench or just
nbench.
Cheers, Tridge
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-23 22:09 ` Andrew Tridgell
@ 2005-01-23 22:58 ` Andreas Gruenbacher
2005-01-23 23:32 ` Andreas Gruenbacher
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23 22:58 UTC (permalink / raw)
To: tridge
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
Hello,
On Sunday 23 January 2005 23:09, Andrew Tridgell wrote:
> Andreas,
>
> > Tridge, can you beat the code some more?
> >
> > Andrew has the five fixes in 2.6.11-rc1-mm2.
>
> It seemed to pass dbench runs OK, but then I started simultaneously
> running dbench and nbench on two different disks (I have a new test
> machine with more disks available). I am getting failures like this:
>
> Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at
> offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on
> device sdc1.
Are you using data journaling on that filesystem? Does this test pass with the
patches backed out? With an external journal?
Thanks,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-23 22:58 ` Andreas Gruenbacher
@ 2005-01-23 23:32 ` Andreas Gruenbacher
2005-01-24 11:24 ` Andrew Tridgell
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23 23:32 UTC (permalink / raw)
To: tridge
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
On Sunday 23 January 2005 23:58, Andreas Gruenbacher wrote:
> > Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at
> > offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on
> > device sdc1.
>
> Are you using data journaling on that filesystem? Does this test pass with
> the patches backed out? With an external journal?
There are 12 direct and 1024 indirect blocks on a filesystem with 4k
blocksize, so block 1036 should be the first double-indirect block. It may be
that something is messing up the double-indirect link or one of its fields.
Interesting. Could you maybe try this patch as well?
Index: linux-2.6.11-rc1-mm2/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-rc1-mm2.orig/fs/ext3/inode.c
+++ linux-2.6.11-rc1-mm2/fs/ext3/inode.c
@@ -2653,7 +2653,7 @@ static int ext3_do_update_inode(handle_t
} else for (block = 0; block < EXT3_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];
- if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE)
+ if (ei->i_extra_isize)
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
Thanks,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-23 23:32 ` Andreas Gruenbacher
@ 2005-01-24 11:24 ` Andrew Tridgell
2005-01-24 11:42 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-24 11:24 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
Andreas,
I'm starting to think the bug I saw is hardware error. I just got this
while trying to reproduce it tonight:
Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3
channel 0 id 1 lun 0
I'll see if I can get this confirmed tomorrow.
Cheers, Tridge
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-24 11:24 ` Andrew Tridgell
@ 2005-01-24 11:42 ` Christoph Hellwig
2005-01-24 14:11 ` Andreas Gruenbacher
2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell
2 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2005-01-24 11:42 UTC (permalink / raw)
To: Andrew Tridgell
Cc: Andreas Gruenbacher, Stephen C. Tweedie, Andrew Morton,
Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas,
linux-kernel
On Mon, Jan 24, 2005 at 10:24:55PM +1100, Andrew Tridgell wrote:
> Andreas,
>
> I'm starting to think the bug I saw is hardware error. I just got this
> while trying to reproduce it tonight:
>
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
> Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3
> channel 0 id 1 lun 0
>
> I'll see if I can get this confirmed tomorrow.
Don't use the qlogicfc driver ever but qla2xxx instead. It's only still
in the kernel tree because davem's a lazy bastard :)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [ea-in-inode 0/5] Further fixes
2005-01-24 11:24 ` Andrew Tridgell
2005-01-24 11:42 ` Christoph Hellwig
@ 2005-01-24 14:11 ` Andreas Gruenbacher
2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell
2 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-24 14:11 UTC (permalink / raw)
To: tridge
Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel
On Mon, 2005-01-24 at 12:24, Andrew Tridgell wrote:
> Andreas,
>
> I'm starting to think the bug I saw is hardware error. [...]
The patch in my prevous message (Mon, 24 Jan 2005 00:32:16 +0100) still
makes sense, if only for cleanliness: Without it, the i_extra_isize
field of reserved inodes is zeroed even though we're not allowing
in-inode attributes there. It's probably better to leave the garbage in
these inodes untouched; we could in theory disable in-inode attributes
for arbitrary inodes with this change.
Cheers,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* memory leak in 2.6.11-rc2
2005-01-24 11:24 ` Andrew Tridgell
2005-01-24 11:42 ` Christoph Hellwig
2005-01-24 14:11 ` Andreas Gruenbacher
@ 2005-01-25 3:19 ` Andrew Tridgell
2005-01-25 3:20 ` Randy.Dunlap
2005-01-25 3:45 ` Dave Jones
2 siblings, 2 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25 3:19 UTC (permalink / raw)
To: linux-kernel, Andreas Gruenbacher, Andrew Morton
I've fixed up the problems I had with raid, and am now testing the
recent xattr changes with dbench and nbench.
The problem I've hit now is a severe memory leak. I have applied the
patch from Linus for the leak in free_pipe_info(), and still I'm
leaking memory at the rate of about 100Mbyte/minute.
I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
the pipe leak fix. The setup is:
- 4 way PIII with 4G ram
- qla2200 adapter with ibm fastt200 disk array
- running dbench -x and nbench on separate disks, in a loop
The oom killer kicks in after about 30 minutes. Naturally the oom
killer decided to kill my sshd, which was running vmstat :-)
Killing the dbench and nbench processes does not recovery the memory.
Here is what it looks like after I kill all processes except a sshd
after about 15 minutes of running:
total used free shared buffers cached
Mem: 3734304 2471608 1262696 0 26820 921440
-/+ buffers/cache: 1523348 2210956
Swap: 4194272 0 4194272
so we've lost nearly 1.5G in that time.
here is /proc/meminfo:
dev4-003:~# cat /proc/meminfo
MemTotal: 3734304 kB
MemFree: 1266944 kB
Buffers: 27000 kB
Cached: 921464 kB
SwapCached: 0 kB
Active: 292524 kB
Inactive: 2108928 kB
HighTotal: 2850752 kB
HighFree: 472768 kB
LowTotal: 883552 kB
LowFree: 794176 kB
SwapTotal: 4194272 kB
SwapFree: 4194272 kB
Dirty: 4 kB
Writeback: 0 kB
Mapped: 6872 kB
Slab: 54236 kB
CommitLimit: 6061424 kB
Committed_AS: 41660 kB
PageTables: 344 kB
VmallocTotal: 114680 kB
VmallocUsed: 2204 kB
VmallocChunk: 112472 kB
and here is /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata
<active_slabs> <num_slabs> <sharedavail>
ip_fib_alias 9 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 9 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0
fib6_nodes 7 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0
ip6_dst_cache 7 36 224 18 1 : tunables 120 60 8 : slabdata 2 2 0
ndisc_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0
rawv6_sock 6 6 672 6 1 : tunables 54 27 8 : slabdata 1 1 0
udpv6_sock 0 0 640 6 1 : tunables 54 27 8 : slabdata 0 0 0
tcpv6_sock 4 6 1184 6 2 : tunables 24 12 8 : slabdata 1 1 0
unix_sock 2 9 416 9 1 : tunables 54 27 8 : slabdata 1 1 0
ip_mrt_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
tcp_tw_bucket 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
tcp_bind_bucket 2 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0
tcp_open_request 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
inet_peer_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
secpath_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
ip_dst_cache 6 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0
arp_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0
raw_sock 5 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
udp_sock 3 8 512 8 1 : tunables 54 27 8 : slabdata 1 1 0
tcp_sock 1 7 1056 7 2 : tunables 24 12 8 : slabdata 1 1 0
flow_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
qla2xxx_srbs 256 310 128 31 1 : tunables 120 60 8 : slabdata 10 10 0
scsi_cmd_cache 22 22 352 11 1 : tunables 54 27 8 : slabdata 2 2 0
cfq_ioc_pool 0 0 24 156 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_pool 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0
crq_pool 0 0 56 70 1 : tunables 120 60 8 : slabdata 0 0 0
deadline_drq 0 0 52 75 1 : tunables 120 60 8 : slabdata 0 0 0
as_arq 61 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
mqueue_inode_cache 1 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
devfsd_event 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0
ext2_inode_cache 0 0 456 8 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 2 185 20 185 1 : tunables 120 60 8 : slabdata 1 1 0
journal_head 3316 155844 48 81 1 : tunables 120 60 8 : slabdata 1924 1924 0
revoke_table 14 290 12 290 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 3626 3745 520 7 1 : tunables 54 27 8 : slabdata 535 535 0
ext3_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0
dnotify_cache 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_pwq 0 0 36 107 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
kioctx 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
kiocb 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
fasync_cache 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 8 18 416 9 1 : tunables 54 27 8 : slabdata 2 2 0
posix_timers_cache 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
sgpool-128 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0
sgpool-64 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0
sgpool-32 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0
sgpool-16 32 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0
sgpool-8 50 62 128 31 1 : tunables 120 60 8 : slabdata 2 2 0
blkdev_ioc 40 270 28 135 1 : tunables 120 60 8 : slabdata 2 2 0
blkdev_queue 37 50 380 10 1 : tunables 54 27 8 : slabdata 5 5 0
blkdev_requests 70 81 148 27 1 : tunables 120 60 8 : slabdata 3 3 0
biovec-(256) 256 256 3072 2 2 : tunables 24 12 8 : slabdata 128 128 0
biovec-128 256 260 1536 5 2 : tunables 24 12 8 : slabdata 52 52 0
biovec-64 256 260 768 5 1 : tunables 54 27 8 : slabdata 52 52 0
biovec-16 260 260 192 20 1 : tunables 120 60 8 : slabdata 13 13 0
biovec-4 256 305 64 61 1 : tunables 120 60 8 : slabdata 5 5 0
biovec-1 322 452 16 226 1 : tunables 120 60 8 : slabdata 2 2 0
bio 303 328 96 41 1 : tunables 120 60 8 : slabdata 8 8 0
file_lock_cache 0 0 92 43 1 : tunables 120 60 8 : slabdata 0 0 0
sock_inode_cache 26 60 384 10 1 : tunables 54 27 8 : slabdata 6 6 0
skbuff_head_cache 137 475 160 25 1 : tunables 120 60 8 : slabdata 19 19 0
sock 4 11 352 11 1 : tunables 54 27 8 : slabdata 1 1 0
key_jar 90 123 96 41 1 : tunables 120 60 8 : slabdata 3 3 0
proc_inode_cache 265 336 336 12 1 : tunables 54 27 8 : slabdata 28 28 0
sigqueue 1 27 148 27 1 : tunables 120 60 8 : slabdata 1 1 0
radix_tree_node 8235 8750 276 14 1 : tunables 54 27 8 : slabdata 625 625 0
bdev_cache 14 18 448 9 1 : tunables 54 27 8 : slabdata 2 2 0
sysfs_dir_cache 2837 2996 36 107 1 : tunables 120 60 8 : slabdata 28 28 0
mnt_cache 21 41 96 41 1 : tunables 120 60 8 : slabdata 1 1 0
inode_cache 946 948 320 12 1 : tunables 54 27 8 : slabdata 79 79 0
dentry_cache 10903 11032 140 28 1 : tunables 120 60 8 : slabdata 394 394 0
filp 300 750 160 25 1 : tunables 120 60 8 : slabdata 30 30 0
names_cache 3 3 4096 1 1 : tunables 24 12 8 : slabdata 3 3 0
idr_layer_cache 80 87 136 29 1 : tunables 120 60 8 : slabdata 3 3 0
buffer_head 593099 649500 52 75 1 : tunables 120 60 8 : slabdata 8660 8660 0
mm_struct 54 54 608 6 1 : tunables 54 27 8 : slabdata 9 9 0
vm_area_struct 543 1260 88 45 1 : tunables 120 60 8 : slabdata 28 28 0
fs_cache 82 122 64 61 1 : tunables 120 60 8 : slabdata 2 2 0
files_cache 50 63 416 9 1 : tunables 54 27 8 : slabdata 7 7 0
signal_cache 68 150 256 15 1 : tunables 120 60 8 : slabdata 10 10 0
sighand_cache 76 78 1312 3 1 : tunables 24 12 8 : slabdata 26 26 0
task_struct 78 81 1312 3 1 : tunables 24 12 8 : slabdata 27 27 0
anon_vma 273 870 12 290 1 : tunables 120 60 8 : slabdata 3 3 0
pgd 24 24 4096 1 1 : tunables 24 12 8 : slabdata 24 24 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-32768 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 5 5 16384 1 4 : tunables 8 4 0 : slabdata 5 5 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 75 75 8192 1 2 : tunables 8 4 0 : slabdata 75 75 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 29 29 4096 1 1 : tunables 24 12 8 : slabdata 29 29 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 248 258 2048 2 1 : tunables 24 12 8 : slabdata 129 129 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 132 132 1024 4 1 : tunables 54 27 8 : slabdata 33 33 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 403 432 512 8 1 : tunables 54 27 8 : slabdata 54 54 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 255 255 256 15 1 : tunables 120 60 8 : slabdata 17 17 0
size-128(DMA) 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 2403 2542 128 31 1 : tunables 120 60 8 : slabdata 82 82 0
size-64(DMA) 0 0 64 61 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 670 793 64 61 1 : tunables 120 60 8 : slabdata 13 13 0
size-32(DMA) 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0
size-32 1415 3213 32 119 1 : tunables 120 60 8 : slabdata 27 27 0
kmem_cache 175 175 160 25 1 : tunables 120 60 8 : slabdata 7 7 0
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell
@ 2005-01-25 3:20 ` Randy.Dunlap
2005-01-25 3:31 ` Andrew Tridgell
2005-01-25 3:45 ` Dave Jones
1 sibling, 1 reply; 30+ messages in thread
From: Randy.Dunlap @ 2005-01-25 3:20 UTC (permalink / raw)
To: tridge; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton
Andrew Tridgell wrote:
> I've fixed up the problems I had with raid, and am now testing the
> recent xattr changes with dbench and nbench.
>
> The problem I've hit now is a severe memory leak. I have applied the
> patch from Linus for the leak in free_pipe_info(), and still I'm
> leaking memory at the rate of about 100Mbyte/minute.
>
> I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> the pipe leak fix. The setup is:
>
> - 4 way PIII with 4G ram
> - qla2200 adapter with ibm fastt200 disk array
> - running dbench -x and nbench on separate disks, in a loop
>
> The oom killer kicks in after about 30 minutes. Naturally the oom
> killer decided to kill my sshd, which was running vmstat :-)
Do you have today's memleak patch applied? (cut-n-paste below).
--
~Randy
----
--- 1.40/fs/pipe.c 2005-01-15 12:01:16 -08:00
+++ edited/fs/pipe.c 2005-01-24 14:35:09 -08:00
@@ -630,13 +630,13 @@
struct pipe_inode_info *info = inode->i_pipe;
inode->i_pipe = NULL;
- if (info->tmp_page)
- __free_page(info->tmp_page);
for (i = 0; i < PIPE_BUFFERS; i++) {
struct pipe_buffer *buf = info->bufs + i;
if (buf->ops)
buf->ops->release(info, buf);
}
+ if (info->tmp_page)
+ __free_page(info->tmp_page);
kfree(info);
}
-
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 3:20 ` Randy.Dunlap
@ 2005-01-25 3:31 ` Andrew Tridgell
2005-01-25 4:48 ` Andrew Tridgell
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25 3:31 UTC (permalink / raw)
To: Randy.Dunlap; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton
Randy,
> I have applied the patch from Linus for the leak in
> free_pipe_info()
...
> Do you have today's memleak patch applied? (cut-n-paste below).
yes :-)
I'm trying the little leak tracking patch from Alexander Nyberg now.
Cheers, Tridge
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell
2005-01-25 3:20 ` Randy.Dunlap
@ 2005-01-25 3:45 ` Dave Jones
2005-01-25 12:51 ` Andrea Arcangeli
1 sibling, 1 reply; 30+ messages in thread
From: Dave Jones @ 2005-01-25 3:45 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton
On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
> The problem I've hit now is a severe memory leak. I have applied the
> patch from Linus for the leak in free_pipe_info(), and still I'm
> leaking memory at the rate of about 100Mbyte/minute.
> I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> the pipe leak fix. The setup is:
That's a little more extreme than what I'm seeing, so it may be
something else, but my firewall box needs rebooting every
few days. It leaks around 50MB a day for some reason.
Given it's not got a lot of ram, after 4-5 days or so, it's
completely exhausted its swap too.
It's currently on a 2.6.10-ac kernel, so it's entirely possible that
we're not looking at the same issue, though it could be something
thats been there for a while if your workload makes it appear
quicker than a firewall/ipsec gateway would.
Do you see the same leaks with an earlier kernel ?
post OOM (when there was about 2K free after named got oom-killed)
this is what slabinfo looked like..
dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0
vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0
size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0
size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0
avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0
size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0
Dave
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 3:31 ` Andrew Tridgell
@ 2005-01-25 4:48 ` Andrew Tridgell
2005-01-25 6:06 ` Andrew Morton
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25 4:48 UTC (permalink / raw)
To: Randy.Dunlap, linux-kernel, Andreas Gruenbacher, Andrew Morton
[-- Attachment #1: message body text --]
[-- Type: text/plain, Size: 2849 bytes --]
> I'm trying the little leak tracking patch from Alexander Nyberg now.
Here are the results (only backtraces with more than 10k counts
included). The leak was at 1G of memory at the time I ran this, so its
safe to say 10k page allocations ain't enough to explain it :-)
I also attach a hacked version of the pgown sort program that sorts
the output by count, and isn't O(n^2). It took 10 minutes to run the
old version :-)
I'm guessing the leak is in the new xattr code given that is what
dbench and nbench were beating on. Andreas, can you look at the
following and see if you can spot anything?
This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
machine had leaked 1G of ram in 10 minutes, and was idle (only thing
running was sshd).
Cheers, Tridge
175485 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0
141512 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0
[0xc014b8d5] vfs_write+157
67641 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
52758 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc0120c0b] kill_proc_info+47
19610 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc011b7cb] __do_softirq+95
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0
16874 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0
[0xc0152d4a] blkdev_file_write+38
[-- Attachment #2: pgown-fast.c --]
[-- Type: application/octet-stream, Size: 2427 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
struct block_list {
char *txt;
int len;
int num;
};
static struct block_list *list;
static int list_size;
static int max_size;
struct block_list *block_head;
int read_block(char *buf, FILE *fin)
{
int ret = 0;
int hit = 0;
char *curr = buf;
for (;;) {
*curr = getc(fin);
if (*curr == EOF) return -1;
ret++;
if (*curr == '\n' && hit == 1)
return ret - 1;
else if (*curr == '\n')
hit = 1;
else
hit = 0;
curr++;
}
}
static int compare_txt(struct block_list *l1, struct block_list *l2)
{
return strcmp(l1->txt, l2->txt);
}
static int compare_num(struct block_list *l1, struct block_list *l2)
{
return l2->num - l1->num;
}
static void add_list(char *buf, int len)
{
if (list_size != 0 &&
len == list[list_size-1].len &&
memcmp(buf, list[list_size-1].txt, len) == 0) {
list[list_size-1].num++;
return;
}
if (list_size == max_size) {
printf("max_size too small??\n");
exit(1);
}
list[list_size].txt = malloc(len+1);
list[list_size].len = len;
list[list_size].num = 1;
memcpy(list[list_size].txt, buf, len);
list[list_size].txt[len] = 0;
list_size++;
if (list_size % 1000 == 0) {
printf("loaded %d\r", list_size);
fflush(stdout);
}
}
int main(int argc, char **argv)
{
FILE *fin, *fout;
char buf[1024];
int ret, i, count;
struct block_list *list2;
struct stat st;
fin = fopen(argv[1], "r");
fout = fopen(argv[2], "w");
if (!fin || !fout) {
printf("Usage: ./program <input> <output>\n");
perror("open: ");
exit(2);
}
fstat(fileno(fin), &st);
max_size = st.st_size / 100; /* hack ... */
list = malloc(max_size * sizeof(*list));
for(;;) {
ret = read_block(buf, fin);
if (ret < 0)
break;
buf[ret] = '\0';
add_list(buf, ret);
}
printf("loaded %d\n", list_size);
printf("sorting ....\n");
qsort(list, list_size, sizeof(list[0]), compare_txt);
list2 = malloc(sizeof(*list) * list_size);
printf("culling\n");
for (i=count=0;i<list_size;i++) {
if (count == 0 ||
strcmp(list2[count-1].txt, list[i].txt) != 0) {
list2[count++] = list[i];
} else {
list2[count-1].num += list[i].num;
}
}
qsort(list2, count, sizeof(list[0]), compare_num);
for (i=0;i<count;i++) {
fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
}
return 0;
}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 4:48 ` Andrew Tridgell
@ 2005-01-25 6:06 ` Andrew Morton
2005-01-25 11:35 ` Andrew Tridgell
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2005-01-25 6:06 UTC (permalink / raw)
To: tridge; +Cc: rddunlap, linux-kernel, agruen
Andrew Tridgell <tridge@osdl.org> wrote:
>
> > I'm trying the little leak tracking patch from Alexander Nyberg now.
>
> Here are the results (only backtraces with more than 10k counts
> included). The leak was at 1G of memory at the time I ran this, so its
> safe to say 10k page allocations ain't enough to explain it :-)
>
> I also attach a hacked version of the pgown sort program that sorts
> the output by count, and isn't O(n^2). It took 10 minutes to run the
> old version :-)
>
> I'm guessing the leak is in the new xattr code given that is what
> dbench and nbench were beating on. Andreas, can you look at the
> following and see if you can spot anything?
>
> This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
> machine had leaked 1G of ram in 10 minutes, and was idle (only thing
> running was sshd).
>
> 175485 times:
> Page allocated via order 0
> [0xc0132258] generic_file_buffered_write+280
> [0xc011b6a9] current_fs_time+77
> [0xc0132a1e] __generic_file_aio_write_nolock+642
> [0xc0132e70] generic_file_aio_write+100
> [0xc017e586] ext3_file_write+38
> [0xc014b7f5] do_sync_write+169
> [0xc015f6de] fcntl_setlk64+286
> [0xc01295a8] autoremove_wake_function+0
It would be pretty strange for plain old pagecache pages to leak in this
manner. A few things come to mind.
- The above trace is indistinguishable from the normal situation of
having a lot of pagecache floating about. IOW: we don't know if the
above pages have really leaked or not.
- It's sometimes possible for ext3 pages to remain alive (on the page
LRU) after a truncate, but with no other references to them. These pages
are trivially reclaimable. So even if you've deleted the files which the
benchmark created, there _could_ be pages left over. Although it would
be unusual.
So what you should do before generating the leak tool output is to put
heavy memory pressure on the machine to try to get it to free up as much of
that pagecache as possible. bzero(malloc(lots)) will do it - create a real
swapstorm, then do swapoff to kill remaining swapcache as well.
If we still see a lot of pages with the above trace then something really
broke. Where does one get the appropriate dbench version? How are you
mkfsing the filesystem? Was mke2fs patched? How is the benchmark being
invoked?
Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 6:06 ` Andrew Morton
@ 2005-01-25 11:35 ` Andrew Tridgell
2005-01-25 12:11 ` Nick Piggin
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25 11:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: rddunlap, linux-kernel, agruen
Andrew,
> So what you should do before generating the leak tool output is to put
> heavy memory pressure on the machine to try to get it to free up as much of
> that pagecache as possible. bzero(malloc(lots)) will do it - create a real
> swapstorm, then do swapoff to kill remaining swapcache as well.
As you saw when you logged into the machine earlier tonight, when you
suspend the dbench processes and run a memory filler the memory is
reclaimed.
I still think its a bug though, as the oom killer is being triggered
when it shouldn't be. I have 4G of ram in this machine, and I'm only
running a couple of hundred processes that should be using maybe 500M
in total, so for the oom killer to kick in might mean that the memory
isn't being reclaimed under normal memory pressure. Certainly a ps
shows no process using more than a few MB.
The oom killer report is below. This is with 2.6.11-rc2, with the pipe
leak fix, and the pgown monitoring patch. It was running one nbench of
size 50 and one dbench of size 40 at the time.
If this isn't a leak, then it would also be good to fix /usr/bin/free
so the -/+ buffers line becomes meaningful again. With the machine
completely idle (just a sshd running) I see:
Mem: 3701184 3483188 217996 0 130092 1889440
-/+ buffers/cache: 1463656 2237528
which looks like a leak. This persists even after the disks are
unmounted. After filling/freeing memory I see:
Mem: 3701184 28176 3673008 0 520 3764
-/+ buffers/cache: 23892 3677292
so it can recover it, but under normal usage it doesn't before the oom
killer kicks in.
Cheers, Tridge
Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18910 (smbd).
Jan 25 00:49:14 dev4-003 kernel: oom-killer: gfp_mask=0xd0
Jan 25 00:49:14 dev4-003 kernel: DMA per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: Normal per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: HighMem per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel:
Jan 25 00:49:14 dev4-003 kernel: Free pages: 5884kB (640kB HighMem)
Jan 25 00:49:14 dev4-003 kernel: Active:84764 inactive:806173 dirty:338229 writeback:36 unstable:0 free:1471 slab:23814 mapped:31610 pagetables:1616
Jan 25 00:49:14 dev4-003 kernel: DMA free:284kB min:68kB low:84kB high:100kB active:440kB inactive:10048kB present:16384kB pages_scanned:140 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: Normal free:4960kB min:3756kB low:4692kB high:5632kB active:72072kB inactive:632096kB present:901120kB pages_scanned:58999 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: HighMem free:640kB min:512kB low:640kB high:768kB active:266948kB inactive:2581960kB present:2850752kB pages_scanned:0 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: DMA: 53*4kB 3*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 284kB
Jan 25 00:49:14 dev4-003 kernel: Normal: 400*4kB 16*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 4960kB
Jan 25 00:49:14 dev4-003 kernel: HighMem: 6*4kB 11*8kB 1*16kB 0*32kB 0*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB
Jan 25 00:49:14 dev4-003 kernel: Swap cache: add 1584899, delete 1585025, find 294/351, race 0+0
Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18914 (smbd).
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 11:35 ` Andrew Tridgell
@ 2005-01-25 12:11 ` Nick Piggin
0 siblings, 0 replies; 30+ messages in thread
From: Nick Piggin @ 2005-01-25 12:11 UTC (permalink / raw)
To: tridge; +Cc: Andrew Morton, rddunlap, linux-kernel, agruen, Andrea Arcangeli
[-- Attachment #1: Type: text/plain, Size: 1624 bytes --]
Andrew Tridgell wrote:
> Andrew,
>
> > So what you should do before generating the leak tool output is to put
> > heavy memory pressure on the machine to try to get it to free up as much of
> > that pagecache as possible. bzero(malloc(lots)) will do it - create a real
> > swapstorm, then do swapoff to kill remaining swapcache as well.
>
> As you saw when you logged into the machine earlier tonight, when you
> suspend the dbench processes and run a memory filler the memory is
> reclaimed.
>
> I still think its a bug though, as the oom killer is being triggered
> when it shouldn't be. I have 4G of ram in this machine, and I'm only
> running a couple of hundred processes that should be using maybe 500M
> in total, so for the oom killer to kick in might mean that the memory
> isn't being reclaimed under normal memory pressure. Certainly a ps
> shows no process using more than a few MB.
>
> The oom killer report is below. This is with 2.6.11-rc2, with the pipe
> leak fix, and the pgown monitoring patch. It was running one nbench of
> size 50 and one dbench of size 40 at the time.
>
There are various OOM killer improvements and fixes that have gone
into Andrew's kernel tree which should be included for 2.6.11.
I don't think the OOM killer was ever perfect in 2.6, but recent
tinkering in mm/ probably aggrivated it. *blush*
Here is another small OOM killer improvement. Previously we needed
to reclaim SWAP_CLUSTER_MAX pages in a single pass. That should be
changed so that we need only reclaim that many pages during the
entire try_to_free_pages run, without going OOM.
Andrea? Andrew? Look OK?
[-- Attachment #2: less-oom.patch --]
[-- Type: text/plain, Size: 800 bytes --]
---
linux-2.6-npiggin/mm/vmscan.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff -puN mm/vmscan.c~oom-helper mm/vmscan.c
--- linux-2.6/mm/vmscan.c~oom-helper 2005-01-25 23:04:28.000000000 +1100
+++ linux-2.6-npiggin/mm/vmscan.c 2005-01-25 23:05:06.000000000 +1100
@@ -914,12 +914,12 @@ int try_to_free_pages(struct zone **zone
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}
- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
+ total_scanned += sc.nr_scanned;
+ total_reclaimed += sc.nr_reclaimed;
+ if (total_reclaimed >= SWAP_CLUSTER_MAX) {
ret = 1;
goto out;
}
- total_scanned += sc.nr_scanned;
- total_reclaimed += sc.nr_reclaimed;
/*
* Try to write back as many pages as we just scanned. This
_
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 3:45 ` Dave Jones
@ 2005-01-25 12:51 ` Andrea Arcangeli
2005-01-25 13:31 ` Andreas Gruenbacher
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2005-01-25 12:51 UTC (permalink / raw)
To: Dave Jones, Andrew Tridgell, linux-kernel, Andreas Gruenbacher,
Andrew Morton
On Mon, Jan 24, 2005 at 10:45:47PM -0500, Dave Jones wrote:
> On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
> > The problem I've hit now is a severe memory leak. I have applied the
> > patch from Linus for the leak in free_pipe_info(), and still I'm
> > leaking memory at the rate of about 100Mbyte/minute.
> > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> > the pipe leak fix. The setup is:
>
> That's a little more extreme than what I'm seeing, so it may be
> something else, but my firewall box needs rebooting every
> few days. It leaks around 50MB a day for some reason.
> Given it's not got a lot of ram, after 4-5 days or so, it's
> completely exhausted its swap too.
>
> It's currently on a 2.6.10-ac kernel, so it's entirely possible that
> we're not looking at the same issue, though it could be something
> thats been there for a while if your workload makes it appear
> quicker than a firewall/ipsec gateway would.
> Do you see the same leaks with an earlier kernel ?
>
> post OOM (when there was about 2K free after named got oom-killed)
> this is what slabinfo looked like..
>
> dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0
> vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0
> size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0
> size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0
> avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0
> size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0
What is avtab_node? there's no such thing in my kernel. But the above
can be ok. Can you show meminfo too after oom kill?
Just another datapoint my firewall runs a kernel based on 2.6.11-rc1-bk8 with
all the needed oom fixes and I've no problems on it yet. I run it oom
and this is what I get after the oom:
athlon:/home/andrea # free
total used free shared buffers cached
Mem: 511136 50852 460284 0 572 15764
-/+ buffers/cache: 34516 476620
Swap: 1052248 0 1052248
athlon:/home/andrea #
The above is sane, 34M is very reasonable for what's loaded there
(there's the X server running, named too, and various other non standard
daemons, one even has a virtual size of >100m so it's not a tiny thing),
so I'm quite sure I'm not hitting a memleak, at least not on the
firewal. No ipsec on it btw, and it's a pure IDE without anything
special, just quite a few nics and USB usermode running all the time.
athlon:/home/andrea # uptime
1:34pm up 2 days 12:08, 1 user, load average: 0.98, 1.13, 0.54
athlon:/home/andrea # iptables -L -v |grep -A2 FORWARD
Chain FORWARD (policy ACCEPT 65 packets, 9264 bytes)
pkts bytes target prot opt in out source destination
3690K 2321M block all -- any any anywhere anywhere
athlon:/home/andrea #
So if there's a memleak in rc1-bk8, it's probably not in the core of the
kernel, but in some driver or things like ipsec. Either that or it broke
after 2.6.11-rc1-bk8. The kernel I'm running is quite heavily patched
too, but I'm not aware of any memleak fix in the additional patches.
Anyway I'll try again in a few days to verify it goes back down again to
exactly 34M of anonymous/random and 15M of cache.
No apparent problem on my desktop system either, it's running the same
kernel with different config.
If somebody could fix the kernel CVS I could have a look at the
interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 12:51 ` Andrea Arcangeli
@ 2005-01-25 13:31 ` Andreas Gruenbacher
2005-01-25 13:55 ` Andrea Arcangeli
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 13:31 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Dave Jones, Andrew Tridgell, linux-kernel, Andrew Morton
On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> If somebody could fix the kernel CVS I could have a look at the
> interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.
What's not okay?
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: memory leak in 2.6.11-rc2
2005-01-25 13:31 ` Andreas Gruenbacher
@ 2005-01-25 13:55 ` Andrea Arcangeli
0 siblings, 0 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2005-01-25 13:55 UTC (permalink / raw)
To: Andreas Gruenbacher; +Cc: linux-kernel
On Tue, Jan 25, 2005 at 02:31:03PM +0100, Andreas Gruenbacher wrote:
> On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> > If somebody could fix the kernel CVS I could have a look at the
> > interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.
>
> What's not okay?
I already prepared a separated deatiled bugreport. I'm reproducing one
last time before sending it, after a "su - nobody" to be sure it's not a
local problem in my environment that might have changed, but I'm 99%
sure it'll reproduce just fine in a fresh account too.
Are you using cvsps at all?
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2005-01-25 13:56 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-20 2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
2005-01-20 2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
2005-01-20 2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
2005-01-20 2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
2005-01-20 2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
2005-01-20 2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
2005-01-20 12:16 ` Andreas Dilger
2005-01-20 13:29 ` Andreas Gruenbacher
2005-01-20 23:05 ` Andreas Dilger
2005-01-21 0:36 ` Andreas Gruenbacher
2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
2005-01-21 23:46 ` Andreas Gruenbacher
2005-01-23 13:22 ` Andrew Tridgell
2005-01-23 22:09 ` Andrew Tridgell
2005-01-23 22:58 ` Andreas Gruenbacher
2005-01-23 23:32 ` Andreas Gruenbacher
2005-01-24 11:24 ` Andrew Tridgell
2005-01-24 11:42 ` Christoph Hellwig
2005-01-24 14:11 ` Andreas Gruenbacher
2005-01-25 3:19 ` memory leak in 2.6.11-rc2 Andrew Tridgell
2005-01-25 3:20 ` Randy.Dunlap
2005-01-25 3:31 ` Andrew Tridgell
2005-01-25 4:48 ` Andrew Tridgell
2005-01-25 6:06 ` Andrew Morton
2005-01-25 11:35 ` Andrew Tridgell
2005-01-25 12:11 ` Nick Piggin
2005-01-25 3:45 ` Dave Jones
2005-01-25 12:51 ` Andrea Arcangeli
2005-01-25 13:31 ` Andreas Gruenbacher
2005-01-25 13:55 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).