All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/2] SLUB: Fix merged slab cache names
@ 2010-09-14 18:48 Pekka Enberg
  2010-09-14 18:48 ` [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo Pekka Enberg
  2010-09-14 18:59 ` [PATCH v2 1/2] SLUB: Fix merged slab cache names Christoph Lameter
  0 siblings, 2 replies; 18+ messages in thread
From: Pekka Enberg @ 2010-09-14 18:48 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, Pekka Enberg, Christoph Lameter, David Rientjes

As explained by Linus "I'm Proud to be an American" Torvalds:

  Looking at the merging code, I actually think it's totally
  buggy. If you have something like this:

   - load module A: create slab cache A

   - load module B: create slab cache B that can merge with A

   - unload module A

   - "cat /proc/slabinfo": BOOM. Oops.

  exactly because the name is not handled correctly, and you'll have
  module B holding open a slab cache that has a name pointer that points
  to module A that no longer exists.

This patch fixes the problem by using kstrdup() to allocate dynamic memory for
->name of "struct kmem_cache" as suggested by Christoph Lameter.

Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
---
 mm/slub.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 13fffe1..a31c033 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -210,6 +210,7 @@ static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
 static inline void sysfs_slab_remove(struct kmem_cache *s)
 {
+	kfree(s->name);
 	kfree(s);
 }
 
@@ -3117,6 +3118,19 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
+	kmalloc_caches[0].name = kstrdup(kmalloc_caches[0].name, GFP_NOWAIT);
+	BUG_ON(!kmalloc_caches[0].name);
+
+	if (KMALLOC_MIN_SIZE <= 32) {
+		kmalloc_caches[1].name = kstrdup(kmalloc_caches[1].name, GFP_NOWAIT);
+		BUG_ON(!kmalloc_caches[1].name);
+	}
+
+	if (KMALLOC_MIN_SIZE <= 64) {
+		kmalloc_caches[2].name = kstrdup(kmalloc_caches[2].name, GFP_NOWAIT);
+		BUG_ON(!kmalloc_caches[2].name);
+	}
+
 	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
 		char *s = kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
 
@@ -3211,6 +3225,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 		size_t align, unsigned long flags, void (*ctor)(void *))
 {
 	struct kmem_cache *s;
+	char *dup_name;
 
 	if (WARN_ON(!name))
 		return NULL;
@@ -3234,19 +3249,25 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 		return s;
 	}
 
+	dup_name = kstrdup(name, GFP_KERNEL);
+	if (!dup_name)
+		goto err;
+
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
-		if (kmem_cache_open(s, GFP_KERNEL, name,
+		if (kmem_cache_open(s, GFP_KERNEL, dup_name,
 				size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
 			if (sysfs_slab_add(s)) {
 				list_del(&s->list);
+				kfree(dup_name);
 				kfree(s);
 				goto err;
 			}
 			up_write(&slub_lock);
 			return s;
 		}
+		kfree(dup_name);
 		kfree(s);
 	}
 	up_write(&slub_lock);
@@ -4377,6 +4398,7 @@ static void kmem_cache_release(struct kobject *kobj)
 {
 	struct kmem_cache *s = to_slab(kobj);
 
+	kfree(s->name);
 	kfree(s);
 }
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 18:48 [PATCH v2 1/2] SLUB: Fix merged slab cache names Pekka Enberg
@ 2010-09-14 18:48 ` Pekka Enberg
  2010-09-14 20:00   ` David Rientjes
  2010-09-14 18:59 ` [PATCH v2 1/2] SLUB: Fix merged slab cache names Christoph Lameter
  1 sibling, 1 reply; 18+ messages in thread
From: Pekka Enberg @ 2010-09-14 18:48 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, Pekka Enberg, Christoph Lameter, David Rientjes

SLUB uses the name of the first slab cache for all merged slab caches. To make
the output of /proc/slabinfo more obvious, append the name of each merged slab
cache to s->name.

An example output of /proc/slabinfo with this patch looks like this:

  kmalloc-8192         544    544   8192    4    8 : tunables    0
  kmalloc-4096+names_cache+biovec-256+sgpool-128+ecryptfs_headers
  kmalloc-2048+biovec-128+sgpool-64    400    416   2048   16    8
  kmalloc-1024+biovec-64+sgpool-32    436    496   1024   16    4 :
  kmalloc-512+task_xstate+skbuff_fclone_cache+sgpool-16   1060   10
  kmalloc-256+mnt_cache+skbuff_head_cache+biovec-16+sgpool-8+arp_   [ snip ]
  kmalloc-128+pid+bip-1+eventpoll_epi+request_sock_TCP+ip_mrt_cac
  kmalloc-64+fs_cache+biovec-4+blkdev_ioc+inet_peer_cache+tcp_bin
  kmalloc-32+ip_fib_alias+dnotify_struct+inotify_event_private_da
  kmalloc-16+biovec-1+ecryptfs_file_cache+dm_rq_clone_bio_info+dm
  kmalloc-8           5119   5120      8  512    1 : tunables    0
  kmalloc-192+cred_jar+key_jar+filp+bip-4+bio-0+request_sock_TCPv
  kmalloc-96           924   1008     96   42    1 : tunables    0
  kmem_cache_node      128    128     64   64    1 : tunables    0

Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
---
 mm/slub.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a31c033..77e4438 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3161,6 +3161,40 @@ void __init kmem_cache_init_late(void)
 }
 
 /*
+ * This limit is part of /proc/slabinfo ABI. It has never been enforced by the
+ * kernel but userspace programs such as "slabtop" expect it.
+ */
+#define MAX_SLAB_NAME_LEN	63
+
+static int kmem_merge_names(struct kmem_cache *s, const char *name)
+{
+	size_t new_size;
+	char *new_name;
+
+	if (strlen(s->name) >= MAX_SLAB_NAME_LEN)
+		return 0;
+
+	/* Don't append name to merged name more than once */
+	if (strstr(s->name, name))
+		return 0;
+
+	new_size = strlen(s->name) + strlen(name) + 2;
+	if (new_size > (MAX_SLAB_NAME_LEN + 1))
+		new_size = MAX_SLAB_NAME_LEN + 1;	/* NULL terminate */
+
+	new_name = kmalloc(new_size, GFP_KERNEL);
+	if (!new_name)
+		return -ENOMEM;
+
+	snprintf(new_name, new_size, "%s+%s", s->name, name);
+
+	kfree(s->name);
+	s->name = new_name;
+
+	return 0;
+}
+
+/*
  * Find a mergeable slab cache
  */
 static int slab_unmergeable(struct kmem_cache *s)
@@ -3233,6 +3267,9 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
+		if (kmem_merge_names(s, name))
+			goto err;
+
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] SLUB: Fix merged slab cache names
  2010-09-14 18:48 [PATCH v2 1/2] SLUB: Fix merged slab cache names Pekka Enberg
  2010-09-14 18:48 ` [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo Pekka Enberg
@ 2010-09-14 18:59 ` Christoph Lameter
  2010-09-14 19:32   ` Pekka Enberg
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2010-09-14 18:59 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: torvalds, linux-kernel, David Rientjes

On Tue, 14 Sep 2010, Pekka Enberg wrote:

> @@ -3211,6 +3225,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
>  		size_t align, unsigned long flags, void (*ctor)(void *))
>  {
>  	struct kmem_cache *s;
> +	char *dup_name;

Dont like the name. Its a short lived variables. So "n"?

Otherwise

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] SLUB: Fix merged slab cache names
  2010-09-14 18:59 ` [PATCH v2 1/2] SLUB: Fix merged slab cache names Christoph Lameter
@ 2010-09-14 19:32   ` Pekka Enberg
  0 siblings, 0 replies; 18+ messages in thread
From: Pekka Enberg @ 2010-09-14 19:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: torvalds, linux-kernel, David Rientjes

  On 14.9.2010 21.59, Christoph Lameter wrote:
> On Tue, 14 Sep 2010, Pekka Enberg wrote:
>
>> @@ -3211,6 +3225,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
>>   		size_t align, unsigned long flags, void (*ctor)(void *))
>>   {
>>   	struct kmem_cache *s;
>> +	char *dup_name;
> Dont like the name. Its a short lived variables. So "n"?
>
> Otherwise
>
> Acked-by: Christoph Lameter<cl@linux.com>
>
Fixed.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 18:48 ` [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo Pekka Enberg
@ 2010-09-14 20:00   ` David Rientjes
  2010-09-14 20:05     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-09-14 20:00 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Linus Torvalds, linux-kernel, Christoph Lameter

On Tue, 14 Sep 2010, Pekka Enberg wrote:

> SLUB uses the name of the first slab cache for all merged slab caches. To make
> the output of /proc/slabinfo more obvious, append the name of each merged slab
> cache to s->name.
> 
> An example output of /proc/slabinfo with this patch looks like this:
> 
>   kmalloc-8192         544    544   8192    4    8 : tunables    0
>   kmalloc-4096+names_cache+biovec-256+sgpool-128+ecryptfs_headers
>   kmalloc-2048+biovec-128+sgpool-64    400    416   2048   16    8
>   kmalloc-1024+biovec-64+sgpool-32    436    496   1024   16    4 :
>   kmalloc-512+task_xstate+skbuff_fclone_cache+sgpool-16   1060   10
>   kmalloc-256+mnt_cache+skbuff_head_cache+biovec-16+sgpool-8+arp_   [ snip ]
>   kmalloc-128+pid+bip-1+eventpoll_epi+request_sock_TCP+ip_mrt_cac
>   kmalloc-64+fs_cache+biovec-4+blkdev_ioc+inet_peer_cache+tcp_bin
>   kmalloc-32+ip_fib_alias+dnotify_struct+inotify_event_private_da
>   kmalloc-16+biovec-1+ecryptfs_file_cache+dm_rq_clone_bio_info+dm
>   kmalloc-8           5119   5120      8  512    1 : tunables    0
>   kmalloc-192+cred_jar+key_jar+filp+bip-4+bio-0+request_sock_TCPv
>   kmalloc-96           924   1008     96   42    1 : tunables    0
>   kmem_cache_node      128    128     64   64    1 : tunables    0
> 

I really don't like this.

I can understand how it's confusing that only the first slab cache name is 
being emitted, and I think that can be changed, but this shows way too 
much information that is already available when CONFIG_SLUB_DEBUG is used 
via the sysfs interface.

CONFIG_SLUB_DEBUG is the default configuration for all users and is 
required for CONFIG_SLABINFO when using slub.  It softlinks merged caches 
together so that it's very simple to determine their relationship.

There's also nothing preventing a name from already including a '+' 
itself.

I think it would be better to use the unique id of each slab cache when 
emitting this information, perhaps suffixed with a count of the number of 
caches merged such as ":t-0000008(3)".

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 20:00   ` David Rientjes
@ 2010-09-14 20:05     ` Linus Torvalds
  2010-09-14 20:11       ` Pekka Enberg
  2010-09-14 20:56       ` David Rientjes
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2010-09-14 20:05 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-kernel, Christoph Lameter

On Tue, Sep 14, 2010 at 1:00 PM, David Rientjes <rientjes@google.com> wrote:
>
> I can understand how it's confusing that only the first slab cache name is
> being emitted, and I think that can be changed, but this shows way too
> much information that is already available when CONFIG_SLUB_DEBUG is used
> via the sysfs interface.

Umm. Not in any readable form, it isn't.

The cause for this is that I made a bug-report about the wrong slab
info, with me claiming 400+ thousand entries (taking up 10M of memory)
for a slub cache that turned out to be entirely innocent.

That's what /proc/slabinfo said, and quite frankly, /proc/slabinfo was
simply _lying_. It gave very misleading output.

In my not-so-humble opinion, either the merging needs to go away
entirely, or the misleading output needs to be fixed. The whole (and
_only_) reason for /proc/slabinfo is to show where memory is being
used, so if that file is misleading, then it's worse than useless.
Pointing to some other /sys file as having more information doesn't
change that in the least.

                        Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 20:05     ` Linus Torvalds
@ 2010-09-14 20:11       ` Pekka Enberg
  2010-09-14 20:56         ` Linus Torvalds
  2010-09-14 20:56       ` David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: Pekka Enberg @ 2010-09-14 20:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Rientjes, linux-kernel, Christoph Lameter

  On 14.9.2010 23.05, Linus Torvalds wrote:
> On Tue, Sep 14, 2010 at 1:00 PM, David Rientjes<rientjes@google.com>  wrote:
>> I can understand how it's confusing that only the first slab cache name is
>> being emitted, and I think that can be changed, but this shows way too
>> much information that is already available when CONFIG_SLUB_DEBUG is used
>> via the sysfs interface.
> Umm. Not in any readable form, it isn't.
>
> The cause for this is that I made a bug-report about the wrong slab
> info, with me claiming 400+ thousand entries (taking up 10M of memory)
> for a slub cache that turned out to be entirely innocent.
>
> That's what /proc/slabinfo said, and quite frankly, /proc/slabinfo was
> simply _lying_. It gave very misleading output.
>
> In my not-so-humble opinion, either the merging needs to go away
> entirely, or the misleading output needs to be fixed. The whole (and
> _only_) reason for /proc/slabinfo is to show where memory is being
> used, so if that file is misleading, then it's worse than useless.
> Pointing to some other /sys file as having more information doesn't
> change that in the least.
Are you happy with the patch? The output is indeed tad bit unreadable. 
Maybe we should just limit the number of printed out names to two or three?

             Pekka

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 20:05     ` Linus Torvalds
  2010-09-14 20:11       ` Pekka Enberg
@ 2010-09-14 20:56       ` David Rientjes
  2010-09-14 21:00         ` Pekka Enberg
  1 sibling, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-09-14 20:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pekka Enberg, linux-kernel, Christoph Lameter

On Tue, 14 Sep 2010, Linus Torvalds wrote:

> > I can understand how it's confusing that only the first slab cache name is
> > being emitted, and I think that can be changed, but this shows way too
> > much information that is already available when CONFIG_SLUB_DEBUG is used
> > via the sysfs interface.
> 
> Umm. Not in any readable form, it isn't.
> 

The debug information is available via /proc/slabinfo, the relationships 
amongst merged caches is available via /sys/kernel/slab.  The former 
depends on the latter via Kconfig, so slub requires using both interfaces 
because of its implementation of merged caches to understand each cache's 
slab usage.

I agree the only way to show this information only via /proc/slabinfo so 
that you can ignore the sysfs interface is by doing exactly what is done 
here.  The problem is that it's an incomplete solution since it's 
truncated at 63 chars, it delimits cache names by '+' which is allowed in 
names already, and is obfuscated because it doesn't imply caches are 
merged (in fact, it implies additional rather than subsumption.

> The cause for this is that I made a bug-report about the wrong slab
> info, with me claiming 400+ thousand entries (taking up 10M of memory)
> for a slub cache that turned out to be entirely innocent.
> 
> That's what /proc/slabinfo said, and quite frankly, /proc/slabinfo was
> simply _lying_. It gave very misleading output.
> 

I understand, and that's addressed with my suggestion of using the unique 
cache identifier for each line in /proc/slabinfo which would map 1:1 to 
the sysfs dentries for those merged caches.  /proc/slabinfo would then 
represent the true slab state by listing actual caches and their usage.

Aggregating 63 chars of cache names on each line of /proc/slabinfo 
wouldn't have helped you when it reports 400+ thousand entries because you 
still can't identify the guilty cache.  For that issue, this would simply 
give you a list of suspects, and sysfs is a better inteface for that since 
it's complete and not truncated.

> In my not-so-humble opinion, either the merging needs to go away
> entirely, or the misleading output needs to be fixed.

Cache merging may have been advertised as a bigger performance improvement 
than it actually is, and I don't do it in my own slab allocator for other 
reasons, but it does lead to more effective memory use by reducing slab 
fragmentation.  On one of my benchmarking servers, over 60% of caches are 
merged and /sys/kernel/slab/.../partial reports roughly the same percent 
of fewer total partial slabs over the system in comparison to 
slub_nomerge.

Indeed, to debug egregious slab usage on your system you would opt-out of 
the more efficient memory use by using slub_nomerge and then checking 
/proc/slabinfo anyway.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 20:11       ` Pekka Enberg
@ 2010-09-14 20:56         ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2010-09-14 20:56 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: David Rientjes, linux-kernel, Christoph Lameter

On Tue, Sep 14, 2010 at 1:11 PM, Pekka Enberg <penberg@kernel.org> wrote:
>
> Are you happy with the patch? The output is indeed tad bit unreadable. Maybe
> we should just limit the number of printed out names to two or three?

I think the patch looked fine. And limiting the printout to perhaps
just two names (+ "..." if you have more) would work for me. At least
at that point the output of /proc/slabinfo is no longer misleading and
_incorrect_, it's just "incomplete" (and at that point the whole "you
can get the full information from /sys" is a valid argument)

                                   Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 20:56       ` David Rientjes
@ 2010-09-14 21:00         ` Pekka Enberg
  2010-09-15  0:02           ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Pekka Enberg @ 2010-09-14 21:00 UTC (permalink / raw)
  To: David Rientjes; +Cc: Linus Torvalds, linux-kernel, Christoph Lameter

  On 14.9.2010 23.56, David Rientjes wrote:
>> In my not-so-humble opinion, either the merging needs to go away
>> entirely, or the misleading output needs to be fixed.
> Cache merging may have been advertised as a bigger performance improvement
> than it actually is, and I don't do it in my own slab allocator for other
> reasons, but it does lead to more effective memory use by reducing slab
> fragmentation.  On one of my benchmarking servers, over 60% of caches are
> merged and /sys/kernel/slab/.../partial reports roughly the same percent
> of fewer total partial slabs over the system in comparison to
> slub_nomerge.
Last time I checked (and it's been a while), it did reduce _internal 
fragmentation_ for the naive "memory used after boot" scenario. I don't 
think I ever advertised it as a performance improvement. Dunno if 
somebody else did.

             Pekka

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-14 21:00         ` Pekka Enberg
@ 2010-09-15  0:02           ` David Rientjes
  2010-09-15 11:16             ` Theodore Tso
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-09-15  0:02 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Linus Torvalds, linux-kernel, Christoph Lameter

On Wed, 15 Sep 2010, Pekka Enberg wrote:

> > > In my not-so-humble opinion, either the merging needs to go away
> > > entirely, or the misleading output needs to be fixed.
> > Cache merging may have been advertised as a bigger performance improvement
> > than it actually is, and I don't do it in my own slab allocator for other
> > reasons, but it does lead to more effective memory use by reducing slab
> > fragmentation.  On one of my benchmarking servers, over 60% of caches are
> > merged and /sys/kernel/slab/.../partial reports roughly the same percent
> > of fewer total partial slabs over the system in comparison to
> > slub_nomerge.
> Last time I checked (and it's been a while), it did reduce _internal
> fragmentation_ for the naive "memory used after boot" scenario. I don't think
> I ever advertised it as a performance improvement. Dunno if somebody else did.
> 

I believe that cache merging was advertised as being both a small 
cacheline optimization and a method for reducing slab fragmentation by 
filling up partial slabs.  Regardless, I measured the latter to be much 
more significant and found only a very slight improvement for netperf 
TCP_RR on machines with large cpu counts in comparison to slub_nomerge.

I don't believe that we need to remove cache merging and allocate more 
memory by default to be able to identify a particular cache using an 
egregious amount of slab when troubleshooting a problem.  That amount of 
memory (probably 4-8MB) is significant without kmem_cache_shrink() since, 
using my example where 60% of caches were merged into others, the partial 
list won't shrink below min_partial slabs for those ~120 caches.

I prefer my suggestion of using the unique ids of each cache in 
/proc/slabinfo since it won't unfairly attribute slab to caches that 
didn't allocate it.  Sure, /proc/slabinfo can't tell you which cache is at 
fault without booting with slub_nomerge, enabling tracing, using 
SLAB_TRACE, or kmemleak, but I think that's better than sacrificing the 
slab fragmentation by default.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-15  0:02           ` David Rientjes
@ 2010-09-15 11:16             ` Theodore Tso
  2010-09-15 20:33               ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Theodore Tso @ 2010-09-15 11:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Linus Torvalds, linux-kernel, Christoph Lameter


On Sep 14, 2010, at 8:02 PM, David Rientjes wrote:

> 
> I don't believe that we need to remove cache merging and allocate more 
> memory by default to be able to identify a particular cache using an 
> egregious amount of slab when troubleshooting a problem.  

Why not keep a separate accounting for each cache, even if we use the same set of pages for a set of slab cache?   It is *really* *useful* to be able to get that kind of debugging information without needing to reboot the system into some kind of magic debugging mode.  More than once I've had to debug a customer system where there was some kind of memory allocation problem, where rebooting wouldn't have been an option, or where rebooting would have destroyed the evidence.

It would seem to me that if there was a "super-slab" pointer in the slab cache structure, then if it turns out the slab system wants to merge a new slab cache with an existing one, you could instead allocate a "super slab" structure, copy information from the initial slab cache into the "super slab" and then set a pointer to the "super slab".   The only information that would be kept in the individual slab cache structures that have a non-zero "super slab" pointer would be the number of objects allocator for that particular object time.

This allows for us to track object class usage without needing to waste space caused by partially filled slab pages.   I think this would be a real win....

-- Ted


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-15 11:16             ` Theodore Tso
@ 2010-09-15 20:33               ` David Rientjes
  2010-09-15 22:25                 ` Ted Ts'o
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-09-15 20:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pekka Enberg, Linus Torvalds, linux-kernel, Christoph Lameter

On Wed, 15 Sep 2010, Theodore Tso wrote:

> > I don't believe that we need to remove cache merging and allocate more 
> > memory by default to be able to identify a particular cache using an 
> > egregious amount of slab when troubleshooting a problem.  
> 
> Why not keep a separate accounting for each cache, even if we use the 
> same set of pages for a set of slab cache?   It is *really* *useful* to 
> be able to get that kind of debugging information without needing to 
> reboot the system into some kind of magic debugging mode.  More than 
> once I've had to debug a customer system where there was some kind of 
> memory allocation problem, where rebooting wouldn't have been an option, 
> or where rebooting would have destroyed the evidence.
> 
> It would seem to me that if there was a "super-slab" pointer in the 
> slab cache structure, then if it turns out the slab system wants to 
> merge a new slab cache with an existing one, you could instead allocate 
> a "super slab" structure, copy information from the initial slab cache 
> into the "super slab" and then set a pointer to the "super slab".   The 
> only information that would be kept in the individual slab cache 
> structures that have a non-zero "super slab" pointer would be the number 
> of objects allocator for that particular object time.
> 

I'd love to have per-cache statistics that we could export without the 
cost of the extra memory from fragmented partial slabs.  You'd have to do 
this for every cache even if it's a "superslab", though, to avoid a branch 
in the fastpath to find the cpu slab.  I'm not sure if Pekka and Christoph 
will be happy with the allocation of kmem_cache structures for mergable 
caches and the increment of the statistic in the fastpath.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-15 20:33               ` David Rientjes
@ 2010-09-15 22:25                 ` Ted Ts'o
  2010-09-15 22:53                   ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Ted Ts'o @ 2010-09-15 22:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Linus Torvalds, linux-kernel, Christoph Lameter

On Wed, Sep 15, 2010 at 01:33:07PM -0700, David Rientjes wrote:
> I'd love to have per-cache statistics that we could export without the 
> cost of the extra memory from fragmented partial slabs.  You'd have to do 
> this for every cache even if it's a "superslab", though, to avoid a branch 
> in the fastpath to find the cpu slab.  I'm not sure if Pekka and Christoph 
> will be happy with the allocation of kmem_cache structures for mergable 
> caches and the increment of the statistic in the fastpath.

I agree, it would be cleaner if we could separate out the data
structures which are used for accounting for the number of objects
allocated and reclaimed for each object type, and then have a separate
data structure which is used for dealing with the pages used by those
slabs that have been merged together.

All I can say is I hope the merging code is intelligent.  We recently
had a problem where we were wasting huge amounts of memory because we
were allocating large numbers of a the ext4_group_info structure,
which was 132 bytes, and for which kmalloc() used a size-256 slab ---
and the wasted memory was enough to cause OOM's in a critical
(unfortunately statically sized) container when the disks got large
enough and numerous enough.  The fix was to use a separate cache just
for these 132-byte objects, and not to use kmalloc().

I would be really annoyed if we switched to a slab allocator which did
merging, and then found that the said slab allocator helpfully merged
the 132-byte slab cache and the size-256 slab into a single slab
cache, on the grounds that it thought it would save memory...  (I
guess I'm just really really nervous about merging happening behind my
back, and I really like having the per-object type allocation
statistics.)

							- Ted

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-15 22:25                 ` Ted Ts'o
@ 2010-09-15 22:53                   ` David Rientjes
  2010-09-16 17:39                     ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-09-15 22:53 UTC (permalink / raw)
  To: Ted Ts'o, Pekka Enberg, Linus Torvalds, linux-kernel,
	Christoph Lameter

On Wed, 15 Sep 2010, Ted Ts'o wrote:

> All I can say is I hope the merging code is intelligent.  We recently
> had a problem where we were wasting huge amounts of memory because we
> were allocating large numbers of a the ext4_group_info structure,
> which was 132 bytes, and for which kmalloc() used a size-256 slab ---
> and the wasted memory was enough to cause OOM's in a critical
> (unfortunately statically sized) container when the disks got large
> enough and numerous enough.  The fix was to use a separate cache just
> for these 132-byte objects, and not to use kmalloc().
> 

That's not cache merging and it wasn't with slub.  kmalloc() allocates 
from caches that are initialized at boot with the smallest power-of-two 
size that allows the object with alignment to fit (and we have special 
96-byte and 192-byte kmalloc caches because they tend to be popular).  So 
with slub, a kmalloc(132, ...) would allocate from kmalloc-192 instead.

Cache merging merges caches created with kmem_cache_create() with already 
existing caches, perhaps even those kmalloc caches, that have the same 
basic properties.  There's some pretty strict requirements if a cache may 
be merged or not: it's alignment must be compatible, and the size must not 
waste more than 8 bytes on 64-bit.  Debugging flags and things like 
SLAB_DESTORY_BY_RCU won't be merged, either.

> I would be really annoyed if we switched to a slab allocator which did
> merging, and then found that the said slab allocator helpfully merged
> the 132-byte slab cache and the size-256 slab into a single slab
> cache, on the grounds that it thought it would save memory...  (I
> guess I'm just really really nervous about merging happening behind my
> back, and I really like having the per-object type allocation
> statistics.)
> 

Slub would allocate kmalloc(132, ...) from kmalloc-192, and it wouldn't 
merge your new cache created for ext4_group_info with any other cache 
unless it shared the same flags and had a size of 132-140 bytes with a 
compatible alignment.  On my system, it looks likely that such a cache 
would get merged with the numa_policy cache.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-15 22:53                   ` David Rientjes
@ 2010-09-16 17:39                     ` Christoph Lameter
  2010-09-16 17:49                       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2010-09-16 17:39 UTC (permalink / raw)
  To: David Rientjes; +Cc: Ted Ts'o, Pekka Enberg, Linus Torvalds, linux-kernel

The main point of merging slabs is to reduce memory footprint and
avoid the overhead of managing more slab caches. Performance is a side
effect that may come about due to a reduced cache footprint if multiple
caches that are mapped to the same merged cache are used in the same
context.

Merging also makes kmem_cache_create() much cheaper (takes a refcount
instead of allocating new memory and managing it) which means that we
could reduce our memory footprint by replacing kmalloc() in various
places. If we do that then we can also associate names with more kernel
objects that are currently using kmalloc resulting in an enhanced
debuggability since any debug option will separate the caches again so
that a problem can be pin-pointed correctly.

kmalloc mostly rounds up to the power of 2 which causes some amount memory
being wasted. Merging allows finer grained sizes that are used by multiple
kernel components. kfree() also requires a lookup of the cache on free
whereas kmem_cache_free() can avoid such a lookup.

Here is an example of slab cache merging on my system. One basically ends
up with a kind of more refined kmalloc array.

christoph@:~/n/linux-2.6$ slabinfo -a

:at-0000016  <- revoke_table jbd2_revoke_table
:at-0000024  <- journal_handle jbd2_journal_handle
:at-0000032  <- revoke_record jbd2_revoke_record
:at-0000088  <- ext2_xattr ext4_xattr ext3_xattr extent_map
:at-0000112  <- jbd2_journal_head journal_head
:at-0000144  <- ext4_alloc_context extent_buffers btrfs_path_cache
:t-0000016   <- dm_rq_clone_bio_info dm_mpath_io kmalloc-16 biovec-1 ecryptfs_file_cache
:t-0000024   <- xfs_dabuf fstrm_item scsi_data_buffer dm_target_io fsnotify_event_holder numa_policy dm_snap_tracked_chunk fasync_cache xfs_bmap_free_item
:t-0000032   <- ecryptfs_dentry_info_cache ecryptfs_key_sig_cache xfs_mru_cache_elem sd_ext_cdb dm_snap_exception ip_fib_alias dnotify_struct inotify_event_private_data kmalloc-32 Acpi-Namespace nv_pte_t
:t-0000040   <- kvm_rmap_desc dm_io
:t-0000048   <- Acpi-Parse shared_policy_node nsproxy
:t-0000056   <- ksm_mm_slot uhci_urb_priv kvm_pte_chain
:t-0000064   <- xfs_ifork kmalloc-64 ecryptfs_global_auth_tok_cache ntfs_attr_ctx_cache fib6_nodes tcp_bind_bucket biovec-4 inet_peer_cache secpath_cache ksm_rmap_item fs_cache
:t-0000072   <- Acpi-ParseExt nfsd4_files Acpi-Operand ip_fib_hash eventpoll_pwq
:t-0000080   <- sysfs_dir_cache Acpi-State blkdev_ioc
:t-0000096   <- flow_cache xfs_ioend kmalloc-96
:t-0000112   <- inotify_inode_mark_entry dm_snap_pending_exception blkdev_integrity dnotify_mark_entry
:t-0000128   <- ecryptfs_open_req_cache ip_mrt_cache scsi_sense_cache eventpoll_epi ntfs_index_ctx_cache pid bip-1 kmalloc-128 uid_cache nfsd4_stateids pid_2 request_sock_TCP ecryptfs_key_tfm_cache
:t-0000168   <- cfq_queue cfq_io_context
:t-0000176   <- posix_timers_cache vm_area_struct kvm_mmu_page_header
:t-0000192   <- filp request_sock_TCPv6 key_jar bip-4 cred_jar ecryptfs_sb_cache kmalloc-192 xfs_buf_item xfs_ili bio-0
:t-0000256   <- skbuff_head_cache kmalloc-256 sgpool-8 scsi_cmd_cache arp_cache kiocb biovec-16 ndisc_cache mnt_cache
:t-0000320   <- rpc_tasks ip6_dst_cache
:t-0000368   <- xfs_efd_item kcopyd_job
:t-0000384   <- xfs_buf bip-16 kioctx xfrm_dst_cache ip_dst_cache
:t-0000512   <- skbuff_fclone_cache task_xstate ntfs_name_cache kmalloc-512 sgpool-16
:t-0000768   <- RAW UNIX
:t-0000960   <- RAWv6 signal_cache
:t-0001024   <- sgpool-32 biovec-64 kmalloc-1024
:t-0002048   <- sgpool-64 kmalloc-2048 net_namespace biovec-128 rpc_buffers
:t-0004096   <- ecryptfs_headers_1 ecryptfs_headers_2 ecryptfs_xattr_cache sgpool-128 biovec-256 kmalloc-4096 names_cache


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-16 17:39                     ` Christoph Lameter
@ 2010-09-16 17:49                       ` Linus Torvalds
  2010-09-16 22:08                         ` Tony Luck
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2010-09-16 17:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Ted Ts'o, Pekka Enberg, linux-kernel

On Thu, Sep 16, 2010 at 10:39 AM, Christoph Lameter <cl@linux.com> wrote:
>
> The main point of merging slabs is to reduce memory footprint and
> avoid the overhead of managing more slab caches. Performance is a side
> effect that may come about due to a reduced cache footprint if multiple
> caches that are mapped to the same merged cache are used in the same
> context.

The thing is, if we were to be able to separate the _accounting_ from
the allocation, we'd get the best of both worlds.

Maybe slub could have separate 'struct kmem_cache' structures for each
slab, and not merge at that level, but then have a merged _allocator_.
We already have a level of indirection there due to the whole per-cpu
front-end, so maybe it wouldn't even cause any extra pointer chasing.

That would allow slabtop etc to show the number of actual allocations
at the granularity that we want (ie you'd see who is the real user),
but still then use the same slab pages and avoid the slab
fragmentation.

How painful would it be to try to re-organize things like this?
Impossible? Very hard? Or just "need to think about it a bit"?

                                    Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo
  2010-09-16 17:49                       ` Linus Torvalds
@ 2010-09-16 22:08                         ` Tony Luck
  0 siblings, 0 replies; 18+ messages in thread
From: Tony Luck @ 2010-09-16 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, David Rientjes, Ted Ts'o, Pekka Enberg,
	linux-kernel

On Thu, Sep 16, 2010 at 10:49 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> That would allow slabtop etc to show the number of actual allocations
> at the granularity that we want (ie you'd see who is the real user),
> but still then use the same slab pages and avoid the slab
> fragmentation.

It might be hard to interpret the data. Suppose "foo" and "bar"
are slab caches that have separate accounting, but a merged
allocator. Now we do 20 allocations from foo, and free 19 of them.
Then we do 19 allocations from bar.  The accounts would show
that foo has 1 object in use and 19 free. While bar would show
19 objects in use.

But the 19 "free" foo objects aren't free at all, they've been
handed out to bar.

-Tony

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-09-16 22:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-14 18:48 [PATCH v2 1/2] SLUB: Fix merged slab cache names Pekka Enberg
2010-09-14 18:48 ` [PATCH v2 2/2] SLUB: Mark merged slab caches in /proc/slabinfo Pekka Enberg
2010-09-14 20:00   ` David Rientjes
2010-09-14 20:05     ` Linus Torvalds
2010-09-14 20:11       ` Pekka Enberg
2010-09-14 20:56         ` Linus Torvalds
2010-09-14 20:56       ` David Rientjes
2010-09-14 21:00         ` Pekka Enberg
2010-09-15  0:02           ` David Rientjes
2010-09-15 11:16             ` Theodore Tso
2010-09-15 20:33               ` David Rientjes
2010-09-15 22:25                 ` Ted Ts'o
2010-09-15 22:53                   ` David Rientjes
2010-09-16 17:39                     ` Christoph Lameter
2010-09-16 17:49                       ` Linus Torvalds
2010-09-16 22:08                         ` Tony Luck
2010-09-14 18:59 ` [PATCH v2 1/2] SLUB: Fix merged slab cache names Christoph Lameter
2010-09-14 19:32   ` Pekka Enberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.