linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [ea-in-inode 0/5] Further fixes
@ 2005-01-20  2:01 Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

Hello,

here is a set of fixes for ext3 in-inode attributes:

patches/ea-xattr-nolock.diff
  No lock needed when freeing inode

  The effect of the additional lock taking is very minor, but
  it's still unnecessary.

patches/ea-xattr-update-sb.diff
  Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs

  The EXT_ATTR filesystem feature was set for xattr blocks, but not
  for in-inode attributes.

patches/ea-xattr-doc.diff
  Documentation fix

patches/ea-xattr-no-extra_isize.diff
  Fix i_extra_isize check

  Filesystem corruption fix.

patches/ea-xattr-reserved-inodes.diff
  Disallow in-inode attributes for reserved inodes

  Filesystem corruption fix.


Regards,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 1/5] No lock needed when freeing inode
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
@ 2005-01-20  2:01 ` Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

[-- Attachment #1: ea-xattr-nolock.diff --]
[-- Type: text/plain, Size: 867 bytes --]

ext3_xattr_delete_inode is called from ext3_free_inode which always
has exclusive access to the inode, so there is no need to take the
xattr semaphore.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -1066,7 +1066,6 @@ ext3_xattr_delete_inode(handle_t *handle
 {
 	struct buffer_head *bh = NULL;
 
-	down_write(&EXT3_I(inode)->xattr_sem);
 	if (!EXT3_I(inode)->i_file_acl)
 		goto cleanup;
 	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
@@ -1088,7 +1087,6 @@ ext3_xattr_delete_inode(handle_t *handle
 
 cleanup:
 	brelse(bh);
-	up_write(&EXT3_I(inode)->xattr_sem);
 }
 
 /*

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 4/5] Fix i_extra_isize check
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
@ 2005-01-20  2:01 ` Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

[-- Attachment #1: ea-xattr-no-extra_isize.diff --]
[-- Type: text/plain, Size: 2134 bytes --]

We are checking for (EXT3_SB(inode->i_sb)->s_inode_size <=
EXT3_GOOD_OLD_INODE_SIZE) to find out if we can set in-inode attributes;
the test fails for inodes that have been created before the ea-in-inode
patch. Those inodes have (i_extra_isize == 0), so we end up with the
attributes overlapping the i_extra_isize field. Checking for
(i_extra_isize == 0) instead fixes this case.

The EXT3_STATE_XATTR flag is only set if (i_extra_isize > 0) and the
inodes has in-inode attributes, so that is enough in the first two
tests.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -272,8 +272,7 @@ ext3_xattr_ibody_get(struct inode *inode
 	void *end;
 	int error;
 
-	if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE ||
-	    !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
+	if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
 		return -ENODATA;
 	error = ext3_get_inode_loc(inode, &iloc);
 	if (error)
@@ -399,8 +398,7 @@ ext3_xattr_ibody_list(struct inode *inod
 	void *end;
 	int error;
 
-	if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE ||
-	    !(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
+	if (!(EXT3_I(inode)->i_state & EXT3_STATE_XATTR))
 		return 0;
 	error = ext3_get_inode_loc(inode, &iloc);
 	if (error)
@@ -865,7 +863,7 @@ ext3_xattr_ibody_find(struct inode *inod
 	struct ext3_inode *raw_inode;
 	int error;
 
-	if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
+	if (EXT3_I(inode)->i_extra_isize == 0)
 		return 0;
 	raw_inode = ext3_raw_inode(&is->iloc);
 	header = IHDR(inode, raw_inode);
@@ -896,7 +894,7 @@ ext3_xattr_ibody_set(handle_t *handle, s
 	struct ext3_xattr_search *s = &is->s;
 	int error;
 
-	if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
+	if (EXT3_I(inode)->i_extra_isize == 0)
 		return -ENOSPC;
 	error = ext3_xattr_set_entry(i, s);
 	if (error)

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 3/5] Documentation fix
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
@ 2005-01-20  2:01 ` Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

[-- Attachment #1: ea-xattr-doc.diff --]
[-- Type: text/plain, Size: 903 bytes --]

In-inode xattr entry descriptors are unsorted.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -37,9 +37,9 @@
  *   | value 2          | |
  *   +------------------+
  *
- * The header is followed by multiple entry descriptors. Descriptors are
- * kept sorted. The attribute values are aligned to the end of the block
- * in no specific order.
+ * The header is followed by multiple entry descriptors. In disk blocks, the
+ * entry descriptors are kept sorted. In inodes, they are unsorted. The
+ * attribute values are aligned to the end of the block in no specific order.
  *
  * Locking strategy
  * ----------------

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 5/5] Disallow in-inode attributes for reserved inodes
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2005-01-20  2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
@ 2005-01-20  2:01 ` Andreas Gruenbacher
  2005-01-20 12:16   ` Andreas Dilger
  2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
  5 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

[-- Attachment #1: ea-xattr-reserved-inodes.diff --]
[-- Type: text/plain, Size: 2111 bytes --]

When creating a filesystem with inodes bigger than 128 bytes, mke2fs
fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
it creates (the journal, the filesystem root, and lost+found). We would
require a zeroed-out i_extra_isize field but we don't get it, so
disallow in-inode attributes for those inodes.

Add an i_extra_isize sanity check.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-latest/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/inode.c
+++ linux-2.6.11-latest/fs/ext3/inode.c
@@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
 		ei->i_data[block] = raw_inode->i_block[block];
 	INIT_LIST_HEAD(&ei->i_orphan);
 
-	ei->i_extra_isize =
-		(EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
-		le16_to_cpu(raw_inode->i_extra_isize) : 0;
-	if (ei->i_extra_isize) {
-		__le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
-				ei->i_extra_isize;
-		if (le32_to_cpu(*magic))
-			 ei->i_state |= EXT3_STATE_XATTR;
-	}
+	if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
+	    EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
+		/*
+		 * When mke2fs creates big inodes it does not zero out
+		 * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
+		 * so ignore those first few inodes.
+		 */
+		ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
+		if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
+		    EXT3_INODE_SIZE(inode->i_sb))
+			goto bad_inode;
+		if (ei->i_extra_isize == 0) {
+			/* The extra space is currently unused. Use it. */
+			ei->i_extra_isize = sizeof(struct ext3_inode) -
+					    EXT3_GOOD_OLD_INODE_SIZE;
+		} else {
+			__le32 *magic = (void *)raw_inode +
+					EXT3_GOOD_OLD_INODE_SIZE +
+					ei->i_extra_isize;
+			if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
+				 ei->i_state |= EXT3_STATE_XATTR;
+		}
+	} else
+		ei->i_extra_isize = 0;
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext3_file_inode_operations;

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2005-01-20  2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
@ 2005-01-20  2:01 ` Andreas Gruenbacher
  2005-01-20  2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
  2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
  5 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20  2:01 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Ts'o
  Cc: Andrew Tridgell, Stephen C. Tweedie, Andreas Dilger, Alex Tomas,
	linux-kernel

[-- Attachment #1: ea-xattr-update-sb.diff --]
[-- Type: text/plain, Size: 861 bytes --]

The xattr feature was only set when creating an xattr block. Also set it
when creating in-inode xattrs.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-latest/fs/ext3/xattr.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/xattr.c
+++ linux-2.6.11-latest/fs/ext3/xattr.c
@@ -823,7 +823,6 @@ getblk_failed:
 			error = ext3_journal_dirty_metadata(handle, new_bh);
 			if (error)
 				goto cleanup;
-			ext3_xattr_update_super_block(handle, sb);
 		}
 	}
 
@@ -1001,6 +1000,7 @@ ext3_xattr_set_handle(handle_t *handle, 
 		}
 	}
 	if (!error) {
+		ext3_xattr_update_super_block(handle, inode->i_sb);
 		inode->i_ctime = CURRENT_TIME_SEC;
 		error = ext3_mark_iloc_dirty(handle, inode, &is.iloc);
 		/*

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
  2005-01-20  2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
@ 2005-01-20 12:16   ` Andreas Dilger
  2005-01-20 13:29     ` Andreas Gruenbacher
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2005-01-20 12:16 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3922 bytes --]

On Jan 20, 2005  03:01 +0100, Andreas Gruenbacher wrote:
> When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> it creates (the journal, the filesystem root, and lost+found). We would
> require a zeroed-out i_extra_isize field but we don't get it, so
> disallow in-inode attributes for those inodes.
> 
> Add an i_extra_isize sanity check.

I'm not sure I agree with this patch.  It definitely resolves a problem
I had with the previous patch (4/5), namely that for inodes that were
created in a large-inode filesystem using a kernel w/o the large-inode
patch we didn't save EAs into the large-inode space at all.  With this
patch we will start using that space.

It is debatable whether we should mark inodes bad if the i_extra_isize
field is bad, or if we should just initialize i_extra_isize in that case.
On one hand there is very little else we can use to validate the on-disk
inodes, but on the other hand the EA data isn't critical to reading the
inode so this would seem to introduce additional failure cases.

I'm not sure whether an old e2fsck will also clear out the space in a
large inode that it writes or not (I know that the patched one we use
validates i_extra_isize).  If it doesn't it may be that there would be a
largish number of inodes that are marked bad because of this, so having
the kernel fix this would be good.  If the old e2fsck zeroes the large
inodes properly it would be prudent to have an ext3_error() here so that
the disk corruption triggers an e2fsck the next time the system starts.

For the root and lost+found inodes it looks like we can never store an
EA in the extra part of the inode regardless of whether i_extra_isize is
good or not.  If a bad value is found we could just initialize it and
start using that space (though not print an ext3_error() in that case,
an ext3_warning() if anything since this is probably the fault of mke2fs).

> Index: linux-2.6.11-latest/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.11-latest.orig/fs/ext3/inode.c
> +++ linux-2.6.11-latest/fs/ext3/inode.c
> @@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
>  		ei->i_data[block] = raw_inode->i_block[block];
>  	INIT_LIST_HEAD(&ei->i_orphan);
>  
> -	ei->i_extra_isize =
> -		(EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
> -		le16_to_cpu(raw_inode->i_extra_isize) : 0;
> -	if (ei->i_extra_isize) {
> -		__le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
> -				ei->i_extra_isize;
> -		if (le32_to_cpu(*magic))
> -			 ei->i_state |= EXT3_STATE_XATTR;
> -	}
> +	if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
> +	    EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
> +		/*
> +		 * When mke2fs creates big inodes it does not zero out
> +		 * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
> +		 * so ignore those first few inodes.
> +		 */
> +		ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> +		if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> +		    EXT3_INODE_SIZE(inode->i_sb))
> +			goto bad_inode;
> +		if (ei->i_extra_isize == 0) {
> +			/* The extra space is currently unused. Use it. */
> +			ei->i_extra_isize = sizeof(struct ext3_inode) -
> +					    EXT3_GOOD_OLD_INODE_SIZE;
> +		} else {
> +			__le32 *magic = (void *)raw_inode +
> +					EXT3_GOOD_OLD_INODE_SIZE +
> +					ei->i_extra_isize;
> +			if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
> +				 ei->i_state |= EXT3_STATE_XATTR;
> +		}
> +	} else
> +		ei->i_extra_isize = 0;
>  
>  	if (S_ISREG(inode->i_mode)) {
>  		inode->i_op = &ext3_file_inode_operations;

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/


[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
  2005-01-20 12:16   ` Andreas Dilger
@ 2005-01-20 13:29     ` Andreas Gruenbacher
  2005-01-20 23:05       ` Andreas Dilger
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-20 13:29 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel

On Thursday 20 January 2005 13:16, Andreas Dilger wrote:
> On Jan 20, 2005  03:01 +0100, Andreas Gruenbacher wrote:
> > When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> > fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> > it creates (the journal, the filesystem root, and lost+found). We would
> > require a zeroed-out i_extra_isize field but we don't get it, so
> > disallow in-inode attributes for those inodes.
> >
> > Add an i_extra_isize sanity check.
>
> I'm not sure I agree with this patch.  It definitely resolves a problem
> I had with the previous patch (4/5), namely that for inodes that were
> created in a large-inode filesystem using a kernel w/o the large-inode
> patch we didn't save EAs into the large-inode space at all.  With this
> patch we will start using that space.

The ea-in-inode patch totally relies on getting all the available inode space 
cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for 
any inode we find, then i_extra_isize may contain a random number, and we've 
just lost, period: There is no way of sanitizing a random i_extra_isize; we 
cannot know what the right number would be.

We have two options for going forward from here: (a) depend on the fact that 
all inodes have been cleared out for us, or (b) don't rely on that, fix 
e2fsck before going any further, and have the fixed mke2fs and e2fsck set a 
filesystem feature flag that tells us that all the extra space in all inodes 
is "sane".

For (a) we can make an (ugly) exception because we know which inodes mke2fs 
created. As far as I can tell we can also fully rely on the kernel having 
cleared out the extra space for us. e2fsck may currently fail to move the 
extra space around when moving inodes, but that did not matter without 
ea-in-inode, and if we fix it now, that's probably good enough. We don't 
really care which inode's zeroed extra space we look at without ea-in-inode.

> It is debatable whether we should mark inodes bad if the i_extra_isize
> field is bad, or if we should just initialize i_extra_isize in that case.

IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply 
changing it to something we think is better is a really bad idea. We might as 
well find an unusual but still valid i_extra_isize and use it, then a future 
patch allocates a few more fields in the inode, and suddenly we start to see 
garbage.

> On one hand there is very little else we can use to validate the on-disk
> inodes, but on the other hand the EA data isn't critical to reading the
> inode so this would seem to introduce additional failure cases.

You may have an access acl on the inode. Not being able to read an access acl 
is a clear sign of trouble. The same applies for everything else in the 
system.* and security.* namespaces, at least.

> I'm not sure whether an old e2fsck will also clear out the space in a
> large inode that it writes or not (I know that the patched one we use
> validates i_extra_isize).  If it doesn't it may be that there would be a
> largish number of inodes that are marked bad because of this, so having
> the kernel fix this would be good.

Repairing the filesystem has no place in the kernel code. If what you assume 
is the case (and I sure would have assumed you to check before proposing the 
kernel patch for inclusion), we have to go for the filesystem feature flag 
approach.

> If the old e2fsck zeroes the large 
> inodes properly it would be prudent to have an ext3_error() here so that
> the disk corruption triggers an e2fsck the next time the system starts.

That's even better, yes.

> For the root and lost+found inodes it looks like we can never store an
> EA in the extra part of the inode regardless of whether i_extra_isize is
> good or not.  If a bad value is found we could just initialize it and
> start using that space (though not print an ext3_error() in that case,
> an ext3_warning() if anything since this is probably the fault of mke2fs).

I disagree. We cannot just use the space when we think the inode is corrupted.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
  2005-01-20 13:29     ` Andreas Gruenbacher
@ 2005-01-20 23:05       ` Andreas Dilger
  2005-01-21  0:36         ` Andreas Gruenbacher
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2005-01-20 23:05 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4260 bytes --]

On Jan 20, 2005  14:29 +0100, Andreas Gruenbacher wrote:
> The ea-in-inode patch totally relies on getting all the available inode space 
> cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for 
> any inode we find, then i_extra_isize may contain a random number, and we've 
> just lost, period: There is no way of sanitizing a random i_extra_isize; we 
> cannot know what the right number would be.

The large-inode support is designed to allow different amounts of
"fixed" optional data (i.e. what is stored inside i_extra_isize),
so it is valid to set this to 4 (i.e. just enough to hold i_extra_isize
itself) and store the EA data after that.  Any code which reads "fixed"
fields from a large inode (e.g. i_mtime_nsec) needs to validate that
i_extra_isize on that inode is large enough for that data to actually
be in the fixed area in the large inode.

If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional
fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize
those values since it should know what they are or they shouldn't be in
struct ext3_inode.

The whole point of i_extra_isize is that it is possible for inodes
to have different amounts of the optional fixed fields in each large
inode, depending on what the kernel that wrote the inode knew about.
So any value for i_extra_isize is valid as long as those fields
are initialized.  If we arbitrarily set i_extra_isize = 4 instead of
leaving the bad value this is no different than waiting for e2fsck to
do the same.

> > It is debatable whether we should mark inodes bad if the i_extra_isize
> > field is bad, or if we should just initialize i_extra_isize in that case.
> 
> IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply 
> changing it to something we think is better is a really bad idea.
> 
> You may have an access acl on the inode. Not being able to read an access acl 
> is a clear sign of trouble. The same applies for everything else in the 
> system.* and security.* namespaces, at least.

Well, I said it was debatable and we're having a debate ;-).  I don't have
a strong opinion either way.  If we ext3_error() in this case at least we
will check the fs on the next boot (which will just zero i_extra_isize)
instead of never doing anything to resolve the situation.

> > For the root and lost+found inodes it looks like we can never store an
> > EA in the extra part of the inode regardless of whether i_extra_isize is
> > good or not.  If a bad value is found we could just initialize it and
> > start using that space (though not print an ext3_error() in that case,
> > an ext3_warning() if anything since this is probably the fault of mke2fs).
> 
> I disagree. We cannot just use the space when we think the inode is corrupted.

But as your patch stands it doesn't ever check if i_extra_isize is valid
for the root or lost+found inode.  It just always sets i_extra_isize = 0
and never uses it.  Given that the root inode is fairly high-traffic it
makes sense to use the faster EA space if it is available.

If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
not so keen to have an ext3_error() there.  If the user doesn't have an
e2fsck with ea-in-inode support there isn't anything they can do to fix
it and they will get a full e2fsck on each boot.

Even so, for the effort of setting i_extra_isize = 4 (or larger if we
initialize the fixed fields) we can do the equivalent of what e2fsck will
do when it finds a bogus value.


The good news is that we can still apply your patch as-is and address my
concerns later since this is a transient issue.  Also, given that there
are probably only a handful of filesystems in the world using large inodes
(excluding Lustre filesystems which aren't affected by this) I don't think
it is a pressing issue yet.  I'm going to be away for 2 weeks, so I'll say
accept this patch as is and we can look at it again when I get back, and
maybe Ted and Stephen will have weighed in on this issue also.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/


[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
  2005-01-20 23:05       ` Andreas Dilger
@ 2005-01-21  0:36         ` Andreas Gruenbacher
  0 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-21  0:36 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Stephen C. Tweedie, Alex Tomas, linux-kernel

On Friday 21 January 2005 00:05, Andreas Dilger wrote:
> [...]
> But as your patch stands it doesn't ever check if i_extra_isize is valid
> for the root or lost+found inode.  It just always sets i_extra_isize = 0
	(that's the in-memory i_extra_isize)
> and never uses it.  Given that the root inode is fairly high-traffic it
> makes sense to use the faster EA space if it is available.

It's only a single block we're talking about, not all the overhead you run 
into with huge amounts of attributes in many xattr disk blocks. It sure would 
be much cleaner to use the root inode's in-inode space like with all other 
inodes, but performance wise I don't think it matters.

> If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
> not so keen to have an ext3_error() there.  If the user doesn't have an
> e2fsck with ea-in-inode support there isn't anything they can do to fix
> it and they will get a full e2fsck on each boot.

Agreed, that would be really bad. We should get e2fsck fixed ASAP.

> Even so, for the effort of setting i_extra_isize = 4 (or larger if we
> initialize the fixed fields) we can do the equivalent of what e2fsck will
> do when it finds a bogus value.

We cannot ask the user, and we don't have the kind of global view that e2fsck 
has. Something different may be messed up, and may have lead to the 
corruption. It's unlikely, but not impossible.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2005-01-20  2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
@ 2005-01-21 22:58 ` Stephen C. Tweedie
  2005-01-21 23:46   ` Andreas Gruenbacher
  5 siblings, 1 reply; 30+ messages in thread
From: Stephen C. Tweedie @ 2005-01-21 22:58 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel,
	Stephen Tweedie

Hi Andreas,

On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:

> here is a set of fixes for ext3 in-inode attributes:

Obvious first question --- have these diffs survived the same
torture-by-tridgell that the previous batch suffered?

Cheers,
 Stephen


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
@ 2005-01-21 23:46   ` Andreas Gruenbacher
  2005-01-23 13:22     ` Andrew Tridgell
  2005-01-23 22:09     ` Andrew Tridgell
  0 siblings, 2 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-21 23:46 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Linus Torvalds, Theodore Ts'o,
	Andrew Tridgell, Andreas Dilger, Alex Tomas, linux-kernel

On Fri, 2005-01-21 at 23:58, Stephen C. Tweedie wrote:
> Hi Andreas,
> 
> On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:
> 
> > here is a set of fixes for ext3 in-inode attributes:
> 
> Obvious first question --- have these diffs survived the same
> torture-by-tridgell that the previous batch suffered?

No. The fixes are a lot less intrusive than the full xattr rework
though. I obviously ran tests; this included dbench.

Tridge, can you beat the code some more?

Andrew has the five fixes in 2.6.11-rc1-mm2.

-- Andreas.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-21 23:46   ` Andreas Gruenbacher
@ 2005-01-23 13:22     ` Andrew Tridgell
  2005-01-23 22:09     ` Andrew Tridgell
  1 sibling, 0 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-23 13:22 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

Andreas,

 > Tridge, can you beat the code some more?
 > Andrew has the five fixes in 2.6.11-rc1-mm2.

sorry for the delay. I've started to test 2.6.11-rc1-mm2 tonight. No
problems so far.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-21 23:46   ` Andreas Gruenbacher
  2005-01-23 13:22     ` Andrew Tridgell
@ 2005-01-23 22:09     ` Andrew Tridgell
  2005-01-23 22:58       ` Andreas Gruenbacher
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-23 22:09 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

Andreas,

 > Tridge, can you beat the code some more?
 > 
 > Andrew has the five fixes in 2.6.11-rc1-mm2.

It seemed to pass dbench runs OK, but then I started simultaneously
running dbench and nbench on two different disks (I have a new test
machine with more disks available). I am getting failures like this:

Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1
Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1.


Jan 23 13:19:43 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdd1
Jan 23 13:19:43 dev4-003 kernel: Aborting journal on device sdd1.
Jan 23 13:19:43 dev4-003 kernel: EXT3-fs error (device sdd1) in start_transaction: Readonly filesystem


The first failure was on the disk I am using for nbench (sdc). The
second is during a later run on the disk I am using the dbench
(sdd). I rebooted between the runs. It's interesting that its failing
at exactly the same offset both times. Is there anything magic about
offset 1036?

The new test machine is a 4 way PIII, with 4G ram, and 4 36G SCSI
disks. The test machine I have been using previously was a 2 way
(+hyperthreaded) Xeon.

I'll see if I can reproduce the problem with just dbench or just
nbench.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-23 22:09     ` Andrew Tridgell
@ 2005-01-23 22:58       ` Andreas Gruenbacher
  2005-01-23 23:32         ` Andreas Gruenbacher
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23 22:58 UTC (permalink / raw)
  To: tridge
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

Hello,

On Sunday 23 January 2005 23:09, Andrew Tridgell wrote:
> Andreas,
>
>  > Tridge, can you beat the code some more?
>  >
>  > Andrew has the five fixes in 2.6.11-rc1-mm2.
>
> It seemed to pass dbench runs OK, but then I started simultaneously
> running dbench and nbench on two different disks (I have a new test
> machine with more disks available). I am getting failures like this:
>
> Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at
> offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on
> device sdc1.

Are you using data journaling on that filesystem? Does this test pass with the 
patches backed out? With an external journal?

Thanks,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-23 22:58       ` Andreas Gruenbacher
@ 2005-01-23 23:32         ` Andreas Gruenbacher
  2005-01-24 11:24           ` Andrew Tridgell
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23 23:32 UTC (permalink / raw)
  To: tridge
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

On Sunday 23 January 2005 23:58, Andreas Gruenbacher wrote:
> > Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at
> > offset 1036 on sdc1 Jan 23 06:54:38 dev4-003 kernel: Aborting journal on
> > device sdc1.
>
> Are you using data journaling on that filesystem? Does this test pass with
> the patches backed out? With an external journal?

There are 12 direct and 1024 indirect blocks on a filesystem with 4k 
blocksize, so block 1036 should be the first double-indirect block. It may be 
that something is messing up the double-indirect link or one of its fields. 
Interesting. Could you maybe try this patch as well?

Index: linux-2.6.11-rc1-mm2/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-rc1-mm2.orig/fs/ext3/inode.c
+++ linux-2.6.11-rc1-mm2/fs/ext3/inode.c
@@ -2653,7 +2653,7 @@ static int ext3_do_update_inode(handle_t
 	} else for (block = 0; block < EXT3_N_BLOCKS; block++)
 		raw_inode->i_block[block] = ei->i_data[block];
 
-	if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE)
+	if (ei->i_extra_isize)
 		raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
 
 	BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");

Thanks,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-23 23:32         ` Andreas Gruenbacher
@ 2005-01-24 11:24           ` Andrew Tridgell
  2005-01-24 11:42             ` Christoph Hellwig
                               ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-24 11:24 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

Andreas,

I'm starting to think the bug I saw is hardware error. I just got this
while trying to reproduce it tonight:

Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3 
channel 0 id 1 lun 0

I'll see if I can get this confirmed tomorrow.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-24 11:24           ` Andrew Tridgell
@ 2005-01-24 11:42             ` Christoph Hellwig
  2005-01-24 14:11             ` Andreas Gruenbacher
  2005-01-25  3:19             ` memory leak in 2.6.11-rc2 Andrew Tridgell
  2 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2005-01-24 11:42 UTC (permalink / raw)
  To: Andrew Tridgell
  Cc: Andreas Gruenbacher, Stephen C. Tweedie, Andrew Morton,
	Linus Torvalds, Theodore Ts'o, Andreas Dilger, Alex Tomas,
	linux-kernel

On Mon, Jan 24, 2005 at 10:24:55PM +1100, Andrew Tridgell wrote:
> Andreas,
> 
> I'm starting to think the bug I saw is hardware error. I just got this
> while trying to reproduce it tonight:
> 
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
> Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3 
> channel 0 id 1 lun 0
> 
> I'll see if I can get this confirmed tomorrow.

Don't use the qlogicfc driver ever but qla2xxx instead.  It's only still
in the kernel tree because davem's a lazy bastard :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [ea-in-inode 0/5] Further fixes
  2005-01-24 11:24           ` Andrew Tridgell
  2005-01-24 11:42             ` Christoph Hellwig
@ 2005-01-24 14:11             ` Andreas Gruenbacher
  2005-01-25  3:19             ` memory leak in 2.6.11-rc2 Andrew Tridgell
  2 siblings, 0 replies; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-24 14:11 UTC (permalink / raw)
  To: tridge
  Cc: Stephen C. Tweedie, Andrew Morton, Linus Torvalds,
	Theodore Ts'o, Andreas Dilger, Alex Tomas, linux-kernel

On Mon, 2005-01-24 at 12:24, Andrew Tridgell wrote:
> Andreas,
> 
> I'm starting to think the bug I saw is hardware error. [...]

The patch in my prevous message (Mon, 24 Jan 2005 00:32:16 +0100) still
makes sense, if only for cleanliness: Without it, the i_extra_isize
field of reserved inodes is zeroed even though we're not allowing
in-inode attributes there. It's probably better to leave the garbage in
these inodes untouched; we could in theory disable in-inode attributes
for arbitrary inodes with this change.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* memory leak in 2.6.11-rc2
  2005-01-24 11:24           ` Andrew Tridgell
  2005-01-24 11:42             ` Christoph Hellwig
  2005-01-24 14:11             ` Andreas Gruenbacher
@ 2005-01-25  3:19             ` Andrew Tridgell
  2005-01-25  3:20               ` Randy.Dunlap
  2005-01-25  3:45               ` Dave Jones
  2 siblings, 2 replies; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25  3:19 UTC (permalink / raw)
  To: linux-kernel, Andreas Gruenbacher, Andrew Morton

I've fixed up the problems I had with raid, and am now testing the
recent xattr changes with dbench and nbench.

The problem I've hit now is a severe memory leak. I have applied the
patch from Linus for the leak in free_pipe_info(), and still I'm
leaking memory at the rate of about 100Mbyte/minute.

I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
the pipe leak fix. The setup is:

 - 4 way PIII with 4G ram
 - qla2200 adapter with ibm fastt200 disk array 
 - running dbench -x and nbench on separate disks, in a loop

The oom killer kicks in after about 30 minutes. Naturally the oom
killer decided to kill my sshd, which was running vmstat :-) 

Killing the dbench and nbench processes does not recovery the memory.
Here is what it looks like after I kill all processes except a sshd
after about 15 minutes of running:

             total       used       free     shared    buffers     cached
Mem:       3734304    2471608    1262696          0      26820     921440
-/+ buffers/cache:    1523348    2210956
Swap:      4194272          0    4194272

so we've lost nearly 1.5G in that time.

here is /proc/meminfo:

dev4-003:~# cat /proc/meminfo 
MemTotal:      3734304 kB
MemFree:       1266944 kB
Buffers:         27000 kB
Cached:         921464 kB
SwapCached:          0 kB
Active:         292524 kB
Inactive:      2108928 kB
HighTotal:     2850752 kB
HighFree:       472768 kB
LowTotal:       883552 kB
LowFree:        794176 kB
SwapTotal:     4194272 kB
SwapFree:      4194272 kB
Dirty:               4 kB
Writeback:           0 kB
Mapped:           6872 kB
Slab:            54236 kB
CommitLimit:   6061424 kB
Committed_AS:    41660 kB
PageTables:        344 kB
VmallocTotal:   114680 kB
VmallocUsed:      2204 kB
VmallocChunk:   112472 kB

and here is /proc/slabinfo 

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata
 <active_slabs> <num_slabs> <sharedavail>
ip_fib_alias           9    226     16  226    1 : tunables  120   60    8 : slabdata      1      1      0
ip_fib_hash            9    119     32  119    1 : tunables  120   60    8 : slabdata      1      1      0
fib6_nodes             7    119     32  119    1 : tunables  120   60    8 : slabdata      1      1      0
ip6_dst_cache          7     36    224   18    1 : tunables  120   60    8 : slabdata      2      2      0
ndisc_cache            1     25    160   25    1 : tunables  120   60    8 : slabdata      1      1      0
rawv6_sock             6      6    672    6    1 : tunables   54   27    8 : slabdata      1      1      0
udpv6_sock             0      0    640    6    1 : tunables   54   27    8 : slabdata      0      0      0
tcpv6_sock             4      6   1184    6    2 : tunables   24   12    8 : slabdata      1      1      0
unix_sock              2      9    416    9    1 : tunables   54   27    8 : slabdata      1      1      0
ip_mrt_cache           0      0     96   41    1 : tunables  120   60    8 : slabdata      0      0      0
tcp_tw_bucket          0      0    128   31    1 : tunables  120   60    8 : slabdata      0      0      0
tcp_bind_bucket        2    226     16  226    1 : tunables  120   60    8 : slabdata      1      1      0
tcp_open_request       0      0     96   41    1 : tunables  120   60    8 : slabdata      0      0      0
inet_peer_cache        1     61     64   61    1 : tunables  120   60    8 : slabdata      1      1      0
secpath_cache          0      0    128   31    1 : tunables  120   60    8 : slabdata      0      0      0
xfrm_dst_cache         0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
ip_dst_cache           6     15    256   15    1 : tunables  120   60    8 : slabdata      1      1      0
arp_cache              1     25    160   25    1 : tunables  120   60    8 : slabdata      1      1      0
raw_sock               5      7    512    7    1 : tunables   54   27    8 : slabdata      1      1      0
udp_sock               3      8    512    8    1 : tunables   54   27    8 : slabdata      1      1      0
tcp_sock               1      7   1056    7    2 : tunables   24   12    8 : slabdata      1      1      0
flow_cache             0      0     96   41    1 : tunables  120   60    8 : slabdata      0      0      0
qla2xxx_srbs         256    310    128   31    1 : tunables  120   60    8 : slabdata     10     10      0
scsi_cmd_cache        22     22    352   11    1 : tunables   54   27    8 : slabdata      2      2      0
cfq_ioc_pool           0      0     24  156    1 : tunables  120   60    8 : slabdata      0      0      0
cfq_pool               0      0    104   38    1 : tunables  120   60    8 : slabdata      0      0      0
crq_pool               0      0     56   70    1 : tunables  120   60    8 : slabdata      0      0      0
deadline_drq           0      0     52   75    1 : tunables  120   60    8 : slabdata      0      0      0
as_arq                61     61     64   61    1 : tunables  120   60    8 : slabdata      1      1      0
mqueue_inode_cache      1      7    512    7    1 : tunables   54   27    8 : slabdata      1      1      0
devfsd_event           0      0     20  185    1 : tunables  120   60    8 : slabdata      0      0      0
ext2_inode_cache       0      0    456    8    1 : tunables   54   27    8 : slabdata      0      0      0
ext2_xattr             0      0     48   81    1 : tunables  120   60    8 : slabdata      0      0      0
journal_handle         2    185     20  185    1 : tunables  120   60    8 : slabdata      1      1      0
journal_head        3316 155844     48   81    1 : tunables  120   60    8 : slabdata   1924   1924      0
revoke_table          14    290     12  290    1 : tunables  120   60    8 : slabdata      1      1      0
revoke_record          0      0     16  226    1 : tunables  120   60    8 : slabdata      0      0      0
ext3_inode_cache    3626   3745    520    7    1 : tunables   54   27    8 : slabdata    535    535      0
ext3_xattr             0      0     48   81    1 : tunables  120   60    8 : slabdata      0      0      0
dnotify_cache          0      0     20  185    1 : tunables  120   60    8 : slabdata      0      0      0
eventpoll_pwq          0      0     36  107    1 : tunables  120   60    8 : slabdata      0      0      0
eventpoll_epi          0      0     96   41    1 : tunables  120   60    8 : slabdata      0      0      0
kioctx                 0      0    192   20    1 : tunables  120   60    8 : slabdata      0      0      0
kiocb                  0      0    128   31    1 : tunables  120   60    8 : slabdata      0      0      0
fasync_cache           0      0     16  226    1 : tunables  120   60    8 : slabdata      0      0      0
shmem_inode_cache      8     18    416    9    1 : tunables   54   27    8 : slabdata      2      2      0
posix_timers_cache      0      0    104   38    1 : tunables  120   60    8 : slabdata      0      0      0
uid_cache              1     61     64   61    1 : tunables  120   60    8 : slabdata      1      1      0
sgpool-128            32     32   2048    2    1 : tunables   24   12    8 : slabdata     16     16      0
sgpool-64             32     32   1024    4    1 : tunables   54   27    8 : slabdata      8      8      0
sgpool-32             32     32    512    8    1 : tunables   54   27    8 : slabdata      4      4      0
sgpool-16             32     45    256   15    1 : tunables  120   60    8 : slabdata      3      3      0
sgpool-8              50     62    128   31    1 : tunables  120   60    8 : slabdata      2      2      0
blkdev_ioc            40    270     28  135    1 : tunables  120   60    8 : slabdata      2      2      0
blkdev_queue          37     50    380   10    1 : tunables   54   27    8 : slabdata      5      5      0
blkdev_requests       70     81    148   27    1 : tunables  120   60    8 : slabdata      3      3      0
biovec-(256)         256    256   3072    2    2 : tunables   24   12    8 : slabdata    128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    8 : slabdata     52     52      0
biovec-64            256    260    768    5    1 : tunables   54   27    8 : slabdata     52     52      0
biovec-16            260    260    192   20    1 : tunables  120   60    8 : slabdata     13     13      0
biovec-4             256    305     64   61    1 : tunables  120   60    8 : slabdata      5      5      0
biovec-1             322    452     16  226    1 : tunables  120   60    8 : slabdata      2      2      0
bio                  303    328     96   41    1 : tunables  120   60    8 : slabdata      8      8      0
file_lock_cache        0      0     92   43    1 : tunables  120   60    8 : slabdata      0      0      0
sock_inode_cache      26     60    384   10    1 : tunables   54   27    8 : slabdata      6      6      0
skbuff_head_cache    137    475    160   25    1 : tunables  120   60    8 : slabdata     19     19      0
sock                   4     11    352   11    1 : tunables   54   27    8 : slabdata      1      1      0
key_jar               90    123     96   41    1 : tunables  120   60    8 : slabdata      3      3      0
proc_inode_cache     265    336    336   12    1 : tunables   54   27    8 : slabdata     28     28      0
sigqueue               1     27    148   27    1 : tunables  120   60    8 : slabdata      1      1      0
radix_tree_node     8235   8750    276   14    1 : tunables   54   27    8 : slabdata    625    625      0
bdev_cache            14     18    448    9    1 : tunables   54   27    8 : slabdata      2      2      0
sysfs_dir_cache     2837   2996     36  107    1 : tunables  120   60    8 : slabdata     28     28      0
mnt_cache             21     41     96   41    1 : tunables  120   60    8 : slabdata      1      1      0
inode_cache          946    948    320   12    1 : tunables   54   27    8 : slabdata     79     79      0
dentry_cache       10903  11032    140   28    1 : tunables  120   60    8 : slabdata    394    394      0
filp                 300    750    160   25    1 : tunables  120   60    8 : slabdata     30     30      0
names_cache            3      3   4096    1    1 : tunables   24   12    8 : slabdata      3      3      0
idr_layer_cache       80     87    136   29    1 : tunables  120   60    8 : slabdata      3      3      0
buffer_head       593099 649500     52   75    1 : tunables  120   60    8 : slabdata   8660   8660      0
mm_struct             54     54    608    6    1 : tunables   54   27    8 : slabdata      9      9      0
vm_area_struct       543   1260     88   45    1 : tunables  120   60    8 : slabdata     28     28      0
fs_cache              82    122     64   61    1 : tunables  120   60    8 : slabdata      2      2      0
files_cache           50     63    416    9    1 : tunables   54   27    8 : slabdata      7      7      0
signal_cache          68    150    256   15    1 : tunables  120   60    8 : slabdata     10     10      0
sighand_cache         76     78   1312    3    1 : tunables   24   12    8 : slabdata     26     26      0
task_struct           78     81   1312    3    1 : tunables   24   12    8 : slabdata     27     27      0
anon_vma             273    870     12  290    1 : tunables  120   60    8 : slabdata      3      3      0
pgd                   24     24   4096    1    1 : tunables   24   12    8 : slabdata     24     24      0
size-131072(DMA)       0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-131072            0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-65536(DMA)        0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-65536             0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-32768(DMA)        0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-32768             0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-16384(DMA)        0      0  16384    1    4 : tunables    8    4    0 : slabdata      0      0      0
size-16384             5      5  16384    1    4 : tunables    8    4    0 : slabdata      5      5      0
size-8192(DMA)         0      0   8192    1    2 : tunables    8    4    0 : slabdata      0      0      0
size-8192             75     75   8192    1    2 : tunables    8    4    0 : slabdata     75     75      0
size-4096(DMA)         0      0   4096    1    1 : tunables   24   12    8 : slabdata      0      0      0
size-4096             29     29   4096    1    1 : tunables   24   12    8 : slabdata     29     29      0
size-2048(DMA)         0      0   2048    2    1 : tunables   24   12    8 : slabdata      0      0      0
size-2048            248    258   2048    2    1 : tunables   24   12    8 : slabdata    129    129      0
size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    8 : slabdata      0      0      0
size-1024            132    132   1024    4    1 : tunables   54   27    8 : slabdata     33     33      0
size-512(DMA)          0      0    512    8    1 : tunables   54   27    8 : slabdata      0      0      0
size-512             403    432    512    8    1 : tunables   54   27    8 : slabdata     54     54      0
size-256(DMA)          0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
size-256             255    255    256   15    1 : tunables  120   60    8 : slabdata     17     17      0
size-128(DMA)          0      0    128   31    1 : tunables  120   60    8 : slabdata      0      0      0
size-128            2403   2542    128   31    1 : tunables  120   60    8 : slabdata     82     82      0
size-64(DMA)           0      0     64   61    1 : tunables  120   60    8 : slabdata      0      0      0
size-64              670    793     64   61    1 : tunables  120   60    8 : slabdata     13     13      0
size-32(DMA)           0      0     32  119    1 : tunables  120   60    8 : slabdata      0      0      0
size-32             1415   3213     32  119    1 : tunables  120   60    8 : slabdata     27     27      0
kmem_cache           175    175    160   25    1 : tunables  120   60    8 : slabdata      7      7      0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  3:19             ` memory leak in 2.6.11-rc2 Andrew Tridgell
@ 2005-01-25  3:20               ` Randy.Dunlap
  2005-01-25  3:31                 ` Andrew Tridgell
  2005-01-25  3:45               ` Dave Jones
  1 sibling, 1 reply; 30+ messages in thread
From: Randy.Dunlap @ 2005-01-25  3:20 UTC (permalink / raw)
  To: tridge; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton

Andrew Tridgell wrote:
> I've fixed up the problems I had with raid, and am now testing the
> recent xattr changes with dbench and nbench.
> 
> The problem I've hit now is a severe memory leak. I have applied the
> patch from Linus for the leak in free_pipe_info(), and still I'm
> leaking memory at the rate of about 100Mbyte/minute.
> 
> I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> the pipe leak fix. The setup is:
> 
>  - 4 way PIII with 4G ram
>  - qla2200 adapter with ibm fastt200 disk array 
>  - running dbench -x and nbench on separate disks, in a loop
> 
> The oom killer kicks in after about 30 minutes. Naturally the oom
> killer decided to kill my sshd, which was running vmstat :-) 

Do you have today's memleak patch applied?  (cut-n-paste below).

-- 
~Randy


----
--- 1.40/fs/pipe.c	2005-01-15 12:01:16 -08:00
+++ edited/fs/pipe.c	2005-01-24 14:35:09 -08:00
@@ -630,13 +630,13 @@
  	struct pipe_inode_info *info = inode->i_pipe;

  	inode->i_pipe = NULL;
-	if (info->tmp_page)
-		__free_page(info->tmp_page);
  	for (i = 0; i < PIPE_BUFFERS; i++) {
  		struct pipe_buffer *buf = info->bufs + i;
  		if (buf->ops)
  			buf->ops->release(info, buf);
  	}
+	if (info->tmp_page)
+		__free_page(info->tmp_page);
  	kfree(info);
  }

-

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  3:20               ` Randy.Dunlap
@ 2005-01-25  3:31                 ` Andrew Tridgell
  2005-01-25  4:48                   ` Andrew Tridgell
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25  3:31 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton

Randy,

 > I have applied the patch from Linus for the leak in
 > free_pipe_info()
...
 > Do you have today's memleak patch applied?  (cut-n-paste below).

yes :-)

I'm trying the little leak tracking patch from Alexander Nyberg now.

Cheers, Tridge

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  3:19             ` memory leak in 2.6.11-rc2 Andrew Tridgell
  2005-01-25  3:20               ` Randy.Dunlap
@ 2005-01-25  3:45               ` Dave Jones
  2005-01-25 12:51                 ` Andrea Arcangeli
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Jones @ 2005-01-25  3:45 UTC (permalink / raw)
  To: Andrew Tridgell; +Cc: linux-kernel, Andreas Gruenbacher, Andrew Morton

On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
 > The problem I've hit now is a severe memory leak. I have applied the
 > patch from Linus for the leak in free_pipe_info(), and still I'm
 > leaking memory at the rate of about 100Mbyte/minute.
 > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
 > the pipe leak fix. The setup is:

That's a little more extreme than what I'm seeing, so it may be
something else, but my firewall box needs rebooting every
few days. It leaks around 50MB a day for some reason.
Given it's not got a lot of ram, after 4-5 days or so, it's
completely exhausted its swap too.

It's currently on a 2.6.10-ac kernel, so it's entirely possible that
we're not looking at the same issue, though it could be something
thats been there for a while if your workload makes it appear
quicker than a firewall/ipsec gateway would.
Do you see the same leaks with an earlier kernel ?

post OOM (when there was about 2K free after named got oom-killed)
this is what slabinfo looked like..

dentry_cache        1502   3775    160   25    1 : tunables  120   60    0 : slabdata    151    151      0
vm_area_struct      1599   2021     84   47    1 : tunables  120   60    0 : slabdata     43     43      0
size-128            3431   6262    128   31    1 : tunables  120   60    0 : slabdata    202    202      0
size-64             4352   4575     64   61    1 : tunables  120   60    0 : slabdata     75     75      0
avtab_node          7073   7140     32  119    1 : tunables  120   60    0 : slabdata     60     60      0
size-32             7256   7616     32  119    1 : tunables  120   60    0 : slabdata     64     64      0

		Dave


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  3:31                 ` Andrew Tridgell
@ 2005-01-25  4:48                   ` Andrew Tridgell
  2005-01-25  6:06                     ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25  4:48 UTC (permalink / raw)
  To: Randy.Dunlap, linux-kernel, Andreas Gruenbacher, Andrew Morton

[-- Attachment #1: message body text --]
[-- Type: text/plain, Size: 2849 bytes --]

 > I'm trying the little leak tracking patch from Alexander Nyberg now.

Here are the results (only backtraces with more than 10k counts
included). The leak was at 1G of memory at the time I ran this, so its
safe to say 10k page allocations ain't enough to explain it :-)

I also attach a hacked version of the pgown sort program that sorts
the output by count, and isn't O(n^2). It took 10 minutes to run the
old version :-)

I'm guessing the leak is in the new xattr code given that is what
dbench and nbench were beating on. Andreas, can you look at the
following and see if you can spot anything?

This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
machine had leaked 1G of ram in 10 minutes, and was idle (only thing
running was sshd).

Cheers, Tridge


175485 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0

141512 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0
[0xc014b8d5] vfs_write+157

67641 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169

52758 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc0120c0b] kill_proc_info+47

19610 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc011b7cb] __do_softirq+95
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0

16874 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0
[0xc0152d4a] blkdev_file_write+38


[-- Attachment #2: pgown-fast.c --]
[-- Type: application/octet-stream, Size: 2427 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

struct block_list {
	char *txt;
	int len;
	int num;
};


static struct block_list *list;
static int list_size;
static int max_size;

struct block_list *block_head;

int read_block(char *buf, FILE *fin)
{
	int ret = 0;
	int hit = 0;
	char *curr = buf;
	
	for (;;) {
		*curr = getc(fin);
		if (*curr == EOF) return -1;
		
		ret++;
		if (*curr == '\n' && hit == 1)
			return ret - 1;
		else if (*curr == '\n')
			hit = 1;
		else
			hit = 0;
		curr++;
	}
}

static int compare_txt(struct block_list *l1, struct block_list *l2)
{
	return strcmp(l1->txt, l2->txt);
}

static int compare_num(struct block_list *l1, struct block_list *l2)
{
	return l2->num - l1->num;
}

static void add_list(char *buf, int len)
{
	if (list_size != 0 &&
	    len == list[list_size-1].len &&
	    memcmp(buf, list[list_size-1].txt, len) == 0) {
		list[list_size-1].num++;
		return;
	}
	if (list_size == max_size) {
		printf("max_size too small??\n");
		exit(1);
	}
	list[list_size].txt = malloc(len+1);
	list[list_size].len = len;
	list[list_size].num = 1;
	memcpy(list[list_size].txt, buf, len);
	list[list_size].txt[len] = 0;
	list_size++;
	if (list_size % 1000 == 0) {
		printf("loaded %d\r", list_size);
		fflush(stdout);
	}
}

int main(int argc, char **argv)
{
	FILE *fin, *fout;
	char buf[1024];
	int ret, i, count;
	struct block_list *list2;
	struct stat st;
	
	fin = fopen(argv[1], "r");
	fout = fopen(argv[2], "w");
	if (!fin || !fout) {
		printf("Usage: ./program <input> <output>\n");
		perror("open: ");
		exit(2);
	}

	fstat(fileno(fin), &st);
	max_size = st.st_size / 100; /* hack ... */

	list = malloc(max_size * sizeof(*list));
	
	for(;;) {
		ret = read_block(buf, fin);
		if (ret < 0)
			break;
		
		buf[ret] = '\0';
		add_list(buf, ret);
	}

	printf("loaded %d\n", list_size);

	printf("sorting ....\n");

	qsort(list, list_size, sizeof(list[0]), compare_txt);

	list2 = malloc(sizeof(*list) * list_size);

	printf("culling\n");

	for (i=count=0;i<list_size;i++) {
		if (count == 0 || 
		    strcmp(list2[count-1].txt, list[i].txt) != 0) {
			list2[count++] = list[i];
		} else {
			list2[count-1].num += list[i].num;
		}
	}
	
	qsort(list2, count, sizeof(list[0]), compare_num);
	
	for (i=0;i<count;i++) {
		fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
	}
	return 0;
}





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  4:48                   ` Andrew Tridgell
@ 2005-01-25  6:06                     ` Andrew Morton
  2005-01-25 11:35                       ` Andrew Tridgell
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2005-01-25  6:06 UTC (permalink / raw)
  To: tridge; +Cc: rddunlap, linux-kernel, agruen

Andrew Tridgell <tridge@osdl.org> wrote:
>
>  > I'm trying the little leak tracking patch from Alexander Nyberg now.
> 
>  Here are the results (only backtraces with more than 10k counts
>  included). The leak was at 1G of memory at the time I ran this, so its
>  safe to say 10k page allocations ain't enough to explain it :-)
> 
>  I also attach a hacked version of the pgown sort program that sorts
>  the output by count, and isn't O(n^2). It took 10 minutes to run the
>  old version :-)
> 
>  I'm guessing the leak is in the new xattr code given that is what
>  dbench and nbench were beating on. Andreas, can you look at the
>  following and see if you can spot anything?
> 
>  This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
>  machine had leaked 1G of ram in 10 minutes, and was idle (only thing
>  running was sshd).
> 
>  175485 times:
>  Page allocated via order 0
>  [0xc0132258] generic_file_buffered_write+280
>  [0xc011b6a9] current_fs_time+77
>  [0xc0132a1e] __generic_file_aio_write_nolock+642
>  [0xc0132e70] generic_file_aio_write+100
>  [0xc017e586] ext3_file_write+38
>  [0xc014b7f5] do_sync_write+169
>  [0xc015f6de] fcntl_setlk64+286
>  [0xc01295a8] autoremove_wake_function+0

It would be pretty strange for plain old pagecache pages to leak in this
manner.  A few things come to mind.

- The above trace is indistinguishable from the normal situation of
  having a lot of pagecache floating about.  IOW: we don't know if the
  above pages have really leaked or not.

- It's sometimes possible for ext3 pages to remain alive (on the page
  LRU) after a truncate, but with no other references to them.  These pages
  are trivially reclaimable.  So even if you've deleted the files which the
  benchmark created, there _could_ be pages left over.  Although it would
  be unusual.

So what you should do before generating the leak tool output is to put
heavy memory pressure on the machine to try to get it to free up as much of
that pagecache as possible.  bzero(malloc(lots)) will do it - create a real
swapstorm, then do swapoff to kill remaining swapcache as well.

If we still see a lot of pages with the above trace then something really
broke.  Where does one get the appropriate dbench version?  How are you
mkfsing the filesystem?  Was mke2fs patched?  How is the benchmark being
invoked?

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  6:06                     ` Andrew Morton
@ 2005-01-25 11:35                       ` Andrew Tridgell
  2005-01-25 12:11                         ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Tridgell @ 2005-01-25 11:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rddunlap, linux-kernel, agruen

Andrew,

 > So what you should do before generating the leak tool output is to put
 > heavy memory pressure on the machine to try to get it to free up as much of
 > that pagecache as possible.  bzero(malloc(lots)) will do it - create a real
 > swapstorm, then do swapoff to kill remaining swapcache as well.

As you saw when you logged into the machine earlier tonight, when you
suspend the dbench processes and run a memory filler the memory is
reclaimed.

I still think its a bug though, as the oom killer is being triggered
when it shouldn't be. I have 4G of ram in this machine, and I'm only
running a couple of hundred processes that should be using maybe 500M
in total, so for the oom killer to kick in might mean that the memory
isn't being reclaimed under normal memory pressure. Certainly a ps
shows no process using more than a few MB.

The oom killer report is below. This is with 2.6.11-rc2, with the pipe
leak fix, and the pgown monitoring patch. It was running one nbench of
size 50 and one dbench of size 40 at the time.

If this isn't a leak, then it would also be good to fix /usr/bin/free
so the -/+ buffers line becomes meaningful again. With the machine
completely idle (just a sshd running) I see:

Mem:       3701184    3483188     217996          0     130092    1889440
-/+ buffers/cache:    1463656    2237528

which looks like a leak. This persists even after the disks are
unmounted. After filling/freeing memory I see:

Mem:       3701184      28176    3673008          0        520       3764
-/+ buffers/cache:      23892    3677292

so it can recover it, but under normal usage it doesn't before the oom
killer kicks in.

Cheers, Tridge


Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18910 (smbd).
Jan 25 00:49:14 dev4-003 kernel: oom-killer: gfp_mask=0xd0
Jan 25 00:49:14 dev4-003 kernel: DMA per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: Normal per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: HighMem per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel:
Jan 25 00:49:14 dev4-003 kernel: Free pages:        5884kB (640kB HighMem)
Jan 25 00:49:14 dev4-003 kernel: Active:84764 inactive:806173 dirty:338229 writeback:36 unstable:0 free:1471 slab:23814 mapped:31610 pagetables:1616
Jan 25 00:49:14 dev4-003 kernel: DMA free:284kB min:68kB low:84kB high:100kB active:440kB inactive:10048kB present:16384kB pages_scanned:140 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: Normal free:4960kB min:3756kB low:4692kB high:5632kB active:72072kB inactive:632096kB present:901120kB pages_scanned:58999 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: HighMem free:640kB min:512kB low:640kB high:768kB active:266948kB inactive:2581960kB present:2850752kB pages_scanned:0 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: DMA: 53*4kB 3*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 284kB
Jan 25 00:49:14 dev4-003 kernel: Normal: 400*4kB 16*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 4960kB
Jan 25 00:49:14 dev4-003 kernel: HighMem: 6*4kB 11*8kB 1*16kB 0*32kB 0*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB
Jan 25 00:49:14 dev4-003 kernel: Swap cache: add 1584899, delete 1585025, find 294/351, race 0+0
Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18914 (smbd).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25 11:35                       ` Andrew Tridgell
@ 2005-01-25 12:11                         ` Nick Piggin
  0 siblings, 0 replies; 30+ messages in thread
From: Nick Piggin @ 2005-01-25 12:11 UTC (permalink / raw)
  To: tridge; +Cc: Andrew Morton, rddunlap, linux-kernel, agruen, Andrea Arcangeli

[-- Attachment #1: Type: text/plain, Size: 1624 bytes --]

Andrew Tridgell wrote:
> Andrew,
> 
>  > So what you should do before generating the leak tool output is to put
>  > heavy memory pressure on the machine to try to get it to free up as much of
>  > that pagecache as possible.  bzero(malloc(lots)) will do it - create a real
>  > swapstorm, then do swapoff to kill remaining swapcache as well.
> 
> As you saw when you logged into the machine earlier tonight, when you
> suspend the dbench processes and run a memory filler the memory is
> reclaimed.
> 
> I still think its a bug though, as the oom killer is being triggered
> when it shouldn't be. I have 4G of ram in this machine, and I'm only
> running a couple of hundred processes that should be using maybe 500M
> in total, so for the oom killer to kick in might mean that the memory
> isn't being reclaimed under normal memory pressure. Certainly a ps
> shows no process using more than a few MB.
> 
> The oom killer report is below. This is with 2.6.11-rc2, with the pipe
> leak fix, and the pgown monitoring patch. It was running one nbench of
> size 50 and one dbench of size 40 at the time.
> 

There are various OOM killer improvements and fixes that have gone
into Andrew's kernel tree which should be included for 2.6.11.

I don't think the OOM killer was ever perfect in 2.6, but recent
tinkering in mm/ probably aggrivated it. *blush*

Here is another small OOM killer improvement. Previously we needed
to reclaim SWAP_CLUSTER_MAX pages in a single pass. That should be
changed so that we need only reclaim that many pages during the
entire try_to_free_pages run, without going OOM.

Andrea? Andrew? Look OK?

[-- Attachment #2: less-oom.patch --]
[-- Type: text/plain, Size: 800 bytes --]




---

 linux-2.6-npiggin/mm/vmscan.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~oom-helper mm/vmscan.c
--- linux-2.6/mm/vmscan.c~oom-helper	2005-01-25 23:04:28.000000000 +1100
+++ linux-2.6-npiggin/mm/vmscan.c	2005-01-25 23:05:06.000000000 +1100
@@ -914,12 +914,12 @@ int try_to_free_pages(struct zone **zone
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
 		}
-		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
+		total_scanned += sc.nr_scanned;
+		total_reclaimed += sc.nr_reclaimed;
+		if (total_reclaimed >= SWAP_CLUSTER_MAX) {
 			ret = 1;
 			goto out;
 		}
-		total_scanned += sc.nr_scanned;
-		total_reclaimed += sc.nr_reclaimed;
 
 		/*
 		 * Try to write back as many pages as we just scanned.  This

_

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25  3:45               ` Dave Jones
@ 2005-01-25 12:51                 ` Andrea Arcangeli
  2005-01-25 13:31                   ` Andreas Gruenbacher
  0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2005-01-25 12:51 UTC (permalink / raw)
  To: Dave Jones, Andrew Tridgell, linux-kernel, Andreas Gruenbacher,
	Andrew Morton

On Mon, Jan 24, 2005 at 10:45:47PM -0500, Dave Jones wrote:
> On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
>  > The problem I've hit now is a severe memory leak. I have applied the
>  > patch from Linus for the leak in free_pipe_info(), and still I'm
>  > leaking memory at the rate of about 100Mbyte/minute.
>  > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
>  > the pipe leak fix. The setup is:
> 
> That's a little more extreme than what I'm seeing, so it may be
> something else, but my firewall box needs rebooting every
> few days. It leaks around 50MB a day for some reason.
> Given it's not got a lot of ram, after 4-5 days or so, it's
> completely exhausted its swap too.
> 
> It's currently on a 2.6.10-ac kernel, so it's entirely possible that
> we're not looking at the same issue, though it could be something
> thats been there for a while if your workload makes it appear
> quicker than a firewall/ipsec gateway would.
> Do you see the same leaks with an earlier kernel ?
> 
> post OOM (when there was about 2K free after named got oom-killed)
> this is what slabinfo looked like..
> 
> dentry_cache        1502   3775    160   25    1 : tunables  120   60    0 : slabdata    151    151      0
> vm_area_struct      1599   2021     84   47    1 : tunables  120   60    0 : slabdata     43     43      0
> size-128            3431   6262    128   31    1 : tunables  120   60    0 : slabdata    202    202      0
> size-64             4352   4575     64   61    1 : tunables  120   60    0 : slabdata     75     75      0
> avtab_node          7073   7140     32  119    1 : tunables  120   60    0 : slabdata     60     60      0
> size-32             7256   7616     32  119    1 : tunables  120   60    0 : slabdata     64     64      0

What is avtab_node? there's no such thing in my kernel. But the above
can be ok. Can you show meminfo too after oom kill?

Just another datapoint my firewall runs a kernel based on 2.6.11-rc1-bk8 with
all the needed oom fixes and I've no problems on it yet. I run it oom
and this is what I get after the oom:

athlon:/home/andrea # free
             total       used       free     shared    buffers     cached
Mem:        511136      50852     460284          0        572      15764
-/+ buffers/cache:      34516     476620
Swap:      1052248          0    1052248
athlon:/home/andrea # 

The above is sane, 34M is very reasonable for what's loaded there
(there's the X server running, named too, and various other non standard
daemons, one even has a virtual size of >100m so it's not a tiny thing),
so I'm quite sure I'm not hitting a memleak, at least not on the
firewal. No ipsec on it btw, and it's a pure IDE without anything
special, just quite a few nics and USB usermode running all the time.

athlon:/home/andrea # uptime
  1:34pm  up 2 days 12:08,  1 user,  load average: 0.98, 1.13, 0.54
athlon:/home/andrea # iptables -L -v |grep -A2 FORWARD
Chain FORWARD (policy ACCEPT 65 packets, 9264 bytes)
 pkts bytes target     prot opt in     out     source               destination         
3690K 2321M block      all  --  any    any     anywhere             anywhere            
athlon:/home/andrea # 

So if there's a memleak in rc1-bk8, it's probably not in the core of the
kernel, but in some driver or things like ipsec. Either that or it broke
after 2.6.11-rc1-bk8. The kernel I'm running is quite heavily patched
too, but I'm not aware of any memleak fix in the additional patches.

Anyway I'll try again in a few days to verify it goes back down again to
exactly 34M of anonymous/random and 15M of cache.

No apparent problem on my desktop system either, it's running the same
kernel with different config.

If somebody could fix the kernel CVS I could have a look at the
interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25 12:51                 ` Andrea Arcangeli
@ 2005-01-25 13:31                   ` Andreas Gruenbacher
  2005-01-25 13:55                     ` Andrea Arcangeli
  0 siblings, 1 reply; 30+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 13:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Dave Jones, Andrew Tridgell, linux-kernel, Andrew Morton

On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> If somebody could fix the kernel CVS I could have a look at the
> interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.

What's not okay?

-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: memory leak in 2.6.11-rc2
  2005-01-25 13:31                   ` Andreas Gruenbacher
@ 2005-01-25 13:55                     ` Andrea Arcangeli
  0 siblings, 0 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2005-01-25 13:55 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: linux-kernel

On Tue, Jan 25, 2005 at 02:31:03PM +0100, Andreas Gruenbacher wrote:
> On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> > If somebody could fix the kernel CVS I could have a look at the
> > interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.
> 
> What's not okay?

I already prepared a separated deatiled bugreport. I'm reproducing one
last time before sending it, after a "su - nobody" to be sure it's not a
local problem in my environment that might have changed, but I'm 99%
sure it'll reproduce just fine in a fresh account too.

Are you using cvsps at all?

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2005-01-25 13:56 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-20  2:01 [ea-in-inode 0/5] Further fixes Andreas Gruenbacher
2005-01-20  2:01 ` [patch 1/5] No lock needed when freeing inode Andreas Gruenbacher
2005-01-20  2:01 ` [patch 3/5] Documentation fix Andreas Gruenbacher
2005-01-20  2:01 ` [patch 4/5] Fix i_extra_isize check Andreas Gruenbacher
2005-01-20  2:01 ` [patch 2/5] Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs Andreas Gruenbacher
2005-01-20  2:01 ` [patch 5/5] Disallow in-inode attributes for reserved inodes Andreas Gruenbacher
2005-01-20 12:16   ` Andreas Dilger
2005-01-20 13:29     ` Andreas Gruenbacher
2005-01-20 23:05       ` Andreas Dilger
2005-01-21  0:36         ` Andreas Gruenbacher
2005-01-21 22:58 ` [ea-in-inode 0/5] Further fixes Stephen C. Tweedie
2005-01-21 23:46   ` Andreas Gruenbacher
2005-01-23 13:22     ` Andrew Tridgell
2005-01-23 22:09     ` Andrew Tridgell
2005-01-23 22:58       ` Andreas Gruenbacher
2005-01-23 23:32         ` Andreas Gruenbacher
2005-01-24 11:24           ` Andrew Tridgell
2005-01-24 11:42             ` Christoph Hellwig
2005-01-24 14:11             ` Andreas Gruenbacher
2005-01-25  3:19             ` memory leak in 2.6.11-rc2 Andrew Tridgell
2005-01-25  3:20               ` Randy.Dunlap
2005-01-25  3:31                 ` Andrew Tridgell
2005-01-25  4:48                   ` Andrew Tridgell
2005-01-25  6:06                     ` Andrew Morton
2005-01-25 11:35                       ` Andrew Tridgell
2005-01-25 12:11                         ` Nick Piggin
2005-01-25  3:45               ` Dave Jones
2005-01-25 12:51                 ` Andrea Arcangeli
2005-01-25 13:31                   ` Andreas Gruenbacher
2005-01-25 13:55                     ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).