linux-kernel.vger.kernel.org archive mirror
* CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
       [not found]             ` <20080611184613.GM15380@fieldses.org>
@ 2008-06-11 19:52               ` J. Bruce Fields
  2008-06-11 20:09                 ` Jeff Layton
From: J. Bruce Fields @ 2008-06-11 19:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-nfs, Weathers, Norman R.

I'm probably missing something fundamental--why doesn't
/proc/slab_allocators show any results for size-x where x >= 4096?

Someone's seeing a performance problem with the linux nfs server.  One
of the symptoms is the "size-4096" slab cache seems to be out of
control.  I assumed that meant that memory allocated by kmalloc() might
be leaking, so figured it might be interesting to turn on
CONFIG_DEBUG_SLAB_LEAK.  As far as I can tell what that does is list
kmalloc() callers in /proc/slab_allocators.  But that doesn't seem to be
showing any results for size-4096.  Can anyone provide a clue?
Thanks!

--b.

On Wed, Jun 11, 2008 at 02:46:13PM -0400, bfields wrote:
> On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
> >  
> > 
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> > > Sent: Tuesday, June 10, 2008 12:16 PM
> > > To: Weathers, Norman R.
> > > Cc: linux-nfs@vger.kernel.org
> > > Subject: Re: Problems with large number of clients and reads
> > > 
> > > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > > Unfortunately, I cannot stop the clients (middle of long running
> > > > jobs).  I might be able to test this soon.  If I have the number of
> > > > threads high, yes I can reduce the number of threads and it 
> > > appears to
> > > > lower some of the memory, but even with as little as three threads,
> > > > the memory usage climbs very high, just not as high as if there are
> > > > say 8 threads.  When the memory usage climbs high, it can cause the
> > > > box to not respond over the network (ssh, rsh), and even be very
> > > > sluggish when I am connected over our serial console to the 
> > > server(s).
> > > > This same scenario has been happening with kernels that I have tried
> > > > from 2.6.22.x on to the 2.6.25 series.  The 2.6.25 series is
> > > > interesting in that I can push the same load from a box with the
> > > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > > conditions.
> > > 
> > > OK, I think what we want to do is turn on 
> > > CONFIG_DEBUG_SLAB_LEAK.  I've
> > > never used it before, but it looks like it will report which functions
> > > are allocating from each slab cache, which may be exactly what we need
> > > to know.  So:
> > > 
> > > 	1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > > 	memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > > 	debugging") turned on.  They're both under the "kernel hacking"
> > > 	section of the kernel config.  (If you have a file
> > > 	/proc/slab_allocators, then you already have these turned on and
> > > 	you can skip this step.)
> > > 
> > > 	2. Do whatever you need to do to reproduce the problem.
> > > 
> > > 	3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> > > 
> > > Then we can take a look at that and see if it sheds any light.
> > 
> > 
> > I have taken several snapshots of the /proc/slab_allocators and
> > /proc/slabinfo as requested, but since there is a lot of info in them,
> > and I didn't think anyone wanted to go cross-eyed reading the data in an
> > email, I have them up on a website:
> > 
> > http://shashi-weathers.net/linux/cluster/NFS/
> 
> Excellent.
> 
> > 
> > The order of data collection is:
> > 
> > slab_allocators_bad1.txt and corresponding slabinfo
> > slab_allocators_after_bad1.txt and corresponding slabinfo
> > slab_allocators_16_threads.txt and corresponding slabinfo
> > slab_allocators_16_threads_1.txt and corresponding slabinfo
> > slab_allocators_32_threads.txt and corresponding slabinfo
> > slab_allocators_really_bad.txt and corresponding slabinfo.
> > 
> > 
> > You will have to forgive my ignorance at this point, but I was looking
> > through the slabinfo and slab_allocators, and noticed that size-4096
> > does not show up in slab_allocators... I hope that is by design.  You
> > can see it growing into the gigabytes in the slabinfo files....
> 
> Argh. OK, I don't understand well enough how this works.  Time to ask
> someone, I guess....
> 
> --b.
> 
> > 
> > 
> > 
> > > 
> > > I think that debugging will hurt the server performance, so you won't
> > > want to keep it turned on all the time.
> > > 
> > > > 
> > > > Also, this is all with the SLAB cache option.  SLUB crashes every
> > > > time I use it under heavy load.
> > > 
> > > Have you reported the SLUB bugs to lkml?
> > 
> > No, I haven't yet.  I didn't know for sure if I was doing something
> > wrong, or if SLUB was the problem there.  Since the failures, I had gone
> > back to using SLAB anyway, so ....  I probably should...
> > 
> > > 
> > > --b.
> > > 
> > 
> > 
> > Norman Weathers


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-11 19:52               ` CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? J. Bruce Fields
@ 2008-06-11 20:09                 ` Jeff Layton
  2008-06-11 20:57                   ` J. Bruce Fields
From: Jeff Layton @ 2008-06-11 20:09 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-kernel, linux-nfs, Weathers, Norman R.

On Wed, 11 Jun 2008 15:52:22 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> I'm probably missing something fundamental--why doesn't
> /proc/slab_allocators show any results for size-x where x >= 4096?
> 
> Someone's seeing a performance problem with the linux nfs server.  One
> of the symptoms is the "size-4096" slab cache seems to be out of
> control.  I assumed that meant that memory allocated by kmalloc() might
> be leaking, so figured it might be interesting to turn on
> CONFIG_DEBUG_SLAB_LEAK.  As far as I can tell what that does is list
> kmalloc() callers in /proc/slab_allocators.  But that doesn't seem to be
> showing any results for size-4096.  Can anyone provide a clue?
> Thanks!
> 
> --b.
> 


Hmm...I've never used this, but in kmem_cache_create():

        /*
         * Enable redzoning and last user accounting, except for caches with
         * large objects, if the increased size would increase the object size
         * above the next power of two: caches with object sizes just above a
         * power of two have a significant amount of internal fragmentation.
         */
        if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
                                                2 * sizeof(unsigned long long)))
                flags |= SLAB_RED_ZONE | SLAB_STORE_USER;


...looks like it specifically excludes some caches.


> On Wed, Jun 11, 2008 at 02:46:13PM -0400, bfields wrote:
> > On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> > > > Sent: Tuesday, June 10, 2008 12:16 PM
> > > > To: Weathers, Norman R.
> > > > Cc: linux-nfs@vger.kernel.org
> > > > Subject: Re: Problems with large number of clients and reads
> > > > 
> > > > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > > > Unfortunately, I cannot stop the clients (middle of long running
> > > > > jobs).  I might be able to test this soon.  If I have the number of
> > > > > threads high, yes I can reduce the number of threads and it 
> > > > appears to
> > > > > lower some of the memory, but even with as little as three threads,
> > > > > the memory usage climbs very high, just not as high as if there are
> > > > > say 8 threads.  When the memory usage climbs high, it can cause the
> > > > > box to not respond over the network (ssh, rsh), and even be very
> > > > > sluggish when I am connected over our serial console to the 
> > > > server(s).
> > > > > This same scenario has been happening with kernels that I have tried
> > > > > from 2.6.22.x on to the 2.6.25 series.  The 2.6.25 series is
> > > > > interesting in that I can push the same load from a box with the
> > > > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > > > conditions.
> > > > 
> > > > OK, I think what we want to do is turn on 
> > > > CONFIG_DEBUG_SLAB_LEAK.  I've
> > > > never used it before, but it looks like it will report which functions
> > > > are allocating from each slab cache, which may be exactly what we need
> > > > to know.  So:
> > > > 
> > > > 	1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > > > 	memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > > > 	debugging") turned on.  They're both under the "kernel hacking"
> > > > 	section of the kernel config.  (If you have a file
> > > > 	/proc/slab_allocators, then you already have these turned on and
> > > > 	you can skip this step.)
> > > > 
> > > > 	2. Do whatever you need to do to reproduce the problem.
> > > > 
> > > > 	3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> > > > 
> > > > Then we can take a look at that and see if it sheds any light.
> > > 
> > > 
> > > I have taken several snapshots of the /proc/slab_allocators and
> > > /proc/slabinfo as requested, but since there is a lot of info in them,
> > > and I didn't think anyone wanted to go cross-eyed reading the data in an
> > > email, I have them up on a website:
> > > 
> > > http://shashi-weathers.net/linux/cluster/NFS/
> > 
> > Excellent.
> > 
> > > 
> > > The order of data collection is:
> > > 
> > > slab_allocators_bad1.txt and corresponding slabinfo
> > > slab_allocators_after_bad1.txt and corresponding slabinfo
> > > slab_allocators_16_threads.txt and corresponding slabinfo
> > > slab_allocators_16_threads_1.txt and corresponding slabinfo
> > > slab_allocators_32_threads.txt and corresponding slabinfo
> > > slab_allocators_really_bad.txt and corresponding slabinfo.
> > > 
> > > 
> > > You will have to forgive my ignorance at this point, but I was looking
> > > through the slabinfo and slab_allocators, and noticed that size-4096
> > > does not show up in slab_allocators... I hope that is by design.  You
> > > can see it growing into the gigabytes in the slabinfo files....
> > 
> > Argh. OK, I don't understand well enough how this works.  Time to ask
> > someone, I guess....
> > 
> > --b.
> > 
> > > 
> > > 
> > > 
> > > > 
> > > > I think that debugging will hurt the server performance, so you won't
> > > > want to keep it turned on all the time.
> > > > 
> > > > > 
> > > > > Also, this is all with the SLAB cache option.  SLUB crashes every
> > > > > time I use it under heavy load.
> > > > 
> > > > Have you reported the SLUB bugs to lkml?
> > > 
> > > No, I haven't yet.  I didn't know for sure if I was doing something
> > > wrong, or if SLUB was the problem there.  Since the failures, I had gone
> > > back to using SLAB anyway, so ....  I probably should...
> > > 
> > > > 
> > > > --b.
> > > > 
> > > 
> > > 
> > > Norman Weathers
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-11 20:09                 ` Jeff Layton
@ 2008-06-11 20:57                   ` J. Bruce Fields
  2008-06-11 22:46                     ` Weathers, Norman R.
From: J. Bruce Fields @ 2008-06-11 20:57 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-kernel, linux-nfs, Weathers, Norman R.

On Wed, Jun 11, 2008 at 04:09:47PM -0400, Jeff Layton wrote:
> On Wed, 11 Jun 2008 15:52:22 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > I'm probably missing something fundamental--why doesn't
> > /proc/slab_allocators show any results for size-x where x >= 4096?
> > 
> > Someone's seeing a performance problem with the linux nfs server.  One
> > of the symptoms is the "size-4096" slab cache seems to be out of
> > control.  I assumed that meant that memory allocated by kmalloc() might
> > be leaking, so figured it might be interesting to turn on
> > CONFIG_DEBUG_SLAB_LEAK.  As far as I can tell what that does is list
> > kmalloc() callers in /proc/slab_allocators.  But that doesn't seem to be
> > showing any results for size-4096.  Can anyone provide a clue?
> > Thanks!
> > 
> > --b.
> > 
> 
> 
> Hmm...I've never used this, but in kmem_cache_create():
> 
>         /*
>          * Enable redzoning and last user accounting, except for caches with
>          * large objects, if the increased size would increase the object size
>          * above the next power of two: caches with object sizes just above a
>          * power of two have a significant amount of internal fragmentation.
>          */
>         if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
>                                                 2 * sizeof(unsigned long long)))
>                 flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> 
> 
> ...looks like it specifically excludes some caches.

Ah, I missed that!  I'm a little confused as to how those flags affect
the collection of the leak-debugging data, but I can verify that the
patch below does result in size-4096 showing up in /proc/slab_allocators;
hopefully there's no negative effect beyond the performance penalty.

Norman, do you think you could try applying this and then trying again?

--b.


diff --git a/mm/slab.c b/mm/slab.c
index 06236e4..b379e31 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 	 * above the next power of two: caches with object sizes just above a
 	 * power of two have a significant amount of internal fragmentation.
 	 */
-	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
+	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
 						2 * sizeof(unsigned long long)))
 		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
 	if (!(flags & SLAB_DESTROY_BY_RCU))


* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-11 20:57                   ` J. Bruce Fields
@ 2008-06-11 22:46                     ` Weathers, Norman R.
  2008-06-11 22:54                       ` J. Bruce Fields
From: Weathers, Norman R. @ 2008-06-11 22:46 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton; +Cc: linux-kernel, linux-nfs

 

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> Sent: Wednesday, June 11, 2008 3:58 PM
> To: Jeff Layton
> Cc: linux-kernel@vger.kernel.org; linux-nfs@vger.kernel.org; 
> Weathers, Norman R.
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> 
> On Wed, Jun 11, 2008 at 04:09:47PM -0400, Jeff Layton wrote:
> > On Wed, 11 Jun 2008 15:52:22 -0400
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > I'm probably missing something fundamental--why doesn't
> > > /proc/slab_allocators show any results for size-x where x >= 4096?
> > > 
> > > Someone's seeing a performance problem with the linux nfs 
> server.  One
> > > of the symptoms is the "size-4096" slab cache seems to be out of
> > > control.  I assumed that meant that memory allocated by 
> kmalloc() might
> > > be leaking, so figured it might be interesting to turn on
> > > CONFIG_DEBUG_SLAB_LEAK.  As far as I can tell what that 
> does is list
> > > kmalloc() callers in /proc/slab_allocators.  But that 
> doesn't seem to be
> > > showing any results for size-4096.  Can anyone provide a clue?
> > > Thanks!
> > > 
> > > --b.
> > > 
> > 
> > 
> > Hmm...I've never used this, but in kmem_cache_create():
> > 
> >         /*
> >          * Enable redzoning and last user accounting, except for caches with
> >          * large objects, if the increased size would increase the object size
> >          * above the next power of two: caches with object sizes just above a
> >          * power of two have a significant amount of internal fragmentation.
> >          */
> >         if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> >                                                 2 * sizeof(unsigned long long)))
> >                 flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > 
> > 
> > ...looks like it specifically excludes some caches.
> 
> Ah, I missed that!  I'm a little confused as to how those flags affect
> the collection of the leak-debugging data, but I can verify that the
> patch below does result in size-4096 showing up in /proc/slab_allocators;
> hopefully there's no negative effect beyond the performance penalty.
> 
> Norman, do you think you could try applying this and then 
> trying again?
> 
> --b.


I will try and get it patched and retested, but it may be a day or two
before I can get back the information due to production jobs now
running.  Once they finish up, I will get back with the info.

Thanks everyone for looking at this, by the way!

> 
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index 06236e4..b379e31 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>  	 * above the next power of two: caches with object sizes just above a
>  	 * power of two have a significant amount of internal fragmentation.
>  	 */
> -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
>  						2 * sizeof(unsigned long long)))
>  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
>  	if (!(flags & SLAB_DESTROY_BY_RCU))
> 


Norman Weathers


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-11 22:46                     ` Weathers, Norman R.
@ 2008-06-11 22:54                       ` J. Bruce Fields
  2008-06-12 19:54                         ` Weathers, Norman R.
From: J. Bruce Fields @ 2008-06-11 22:54 UTC (permalink / raw)
  To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs

On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> I will try and get it patched and retested, but it may be a day or two
> before I can get back the information due to production jobs now
> running.  Once they finish up, I will get back with the info.

Understood.

> Thanks everyone for looking at this, by the way!

And thanks for your persistence.

--b.

> 
> > 
> > 
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 06236e4..b379e31 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> >  	 * above the next power of two: caches with object sizes just above a
> >  	 * power of two have a significant amount of internal fragmentation.
> >  	 */
> > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> >  						2 * sizeof(unsigned long long)))
> >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > 
> 
> 
> Norman Weathers


* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-11 22:54                       ` J. Bruce Fields
@ 2008-06-12 19:54                         ` Weathers, Norman R.
  2008-06-13 20:15                           ` J. Bruce Fields
From: Weathers, Norman R. @ 2008-06-12 19:54 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs

 

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org 
> [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Wednesday, June 11, 2008 5:55 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> linux-nfs@vger.kernel.org
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> 
> On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > I will try and get it patched and retested, but it may be a 
> day or two
> > before I can get back the information due to production jobs now
> > running.  Once they finish up, I will get back with the info.
> 
> Understood.
> 


I was able to get my big user to cooperate and let me in to collect
the information you needed.  The full output from the
/proc/slab_allocators file is at
http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 .  The 16
thread case is very interesting.  Also, there is a small txt file in the
directory that has some rpc errors, but I imagine the way that I am
running the box (oversubscribed threads) has more to do with the rpc
errors than anything else.  For those of you wanting the gist of the
story, the size-4096 slab has the following very large allocation:

size-4096: 2 sys_init_module+0x140b/0x1980
size-4096: 1 __vmalloc_area_node+0x188/0x1b0
size-4096: 1 seq_read+0x1d9/0x2e0
size-4096: 1 slabstats_open+0x2b/0x80
size-4096: 5 vc_allocate+0x167/0x190
size-4096: 3 input_allocate_device+0x12/0x80
size-4096: 1 hid_add_field+0x122/0x290
size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
size-4096: 1846825 __alloc_skb+0x7d/0x170
size-4096: 3 alloc_netdev+0x33/0xa0
size-4096: 10 neigh_sysctl_register+0x52/0x2b0
size-4096: 5 devinet_sysctl_register+0x28/0x110
size-4096: 1 pidmap_init+0x15/0x60
size-4096: 1 netlink_proto_init+0x44/0x190
size-4096: 1 ip_rt_init+0xfd/0x2f0
size-4096: 1 cipso_v4_init+0x13/0x70
size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
size-4096: 1 joydev_connect+0x53/0x390 [joydev]
size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]

The big one seems to be the __alloc_skb. (This is with 16 threads, and
it says that we are using up somewhere between 12 and 14 GB of memory,
about 2 to 3 gig of that is disk cache).  If I were to put anymore
threads out there, the server would become almost unresponsive (it was
bad enough as it was).   

At the same time, I also noticed this:

skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170

Don't know for sure if that is meaningful or not....



> > Thanks everyone for looking at this, by the way!
> 
> And thanks for your persistence.
> 
> --b.
> 


Anytime.  This is the part of the job that is fun (except for my
users...).  Anyone can watch a system run; it's dealing with the unknown
that makes it interesting.


Norman Weathers


> > 
> > > 
> > > 
> > > diff --git a/mm/slab.c b/mm/slab.c
> > > index 06236e4..b379e31 100644
> > > --- a/mm/slab.c
> > > +++ b/mm/slab.c
> > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > >  	 * above the next power of two: caches with object sizes just above a
> > >  	 * power of two have a significant amount of internal fragmentation.
> > >  	 */
> > > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > >  						2 * sizeof(unsigned long long)))
> > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > 
> > 
> > 
> > Norman Weathers


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-12 19:54                         ` Weathers, Norman R.
@ 2008-06-13 20:15                           ` J. Bruce Fields
  2008-06-13 21:53                             ` Weathers, Norman R.
From: J. Bruce Fields @ 2008-06-13 20:15 UTC (permalink / raw)
  To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
>  
> 
> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org 
> > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> > Sent: Wednesday, June 11, 2008 5:55 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> > linux-nfs@vger.kernel.org
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > 
> > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > I will try and get it patched and retested, but it may be a 
> > day or two
> > > before I can get back the information due to production jobs now
> > > running.  Once they finish up, I will get back with the info.
> > 
> > Understood.
> > 
> 
> 
> I was able to get my big user to cooperate and let me in to collect
> the information you needed.  The full output from the
> /proc/slab_allocators file is at
> http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 .  The 16
> thread case is very interesting.  Also, there is a small txt file in the
> directory that has some rpc errors, but I imagine the way that I am
> running the box (oversubscribed threads) has more to do with the rpc
> errors than anything else.  For those of you wanting the gist of the
> story, the size-4096 slab has the following very large allocation:
> 
> size-4096: 2 sys_init_module+0x140b/0x1980
> size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> size-4096: 1 seq_read+0x1d9/0x2e0
> size-4096: 1 slabstats_open+0x2b/0x80
> size-4096: 5 vc_allocate+0x167/0x190
> size-4096: 3 input_allocate_device+0x12/0x80
> size-4096: 1 hid_add_field+0x122/0x290
> size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> size-4096: 1846825 __alloc_skb+0x7d/0x170
> size-4096: 3 alloc_netdev+0x33/0xa0
> size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> size-4096: 5 devinet_sysctl_register+0x28/0x110
> size-4096: 1 pidmap_init+0x15/0x60
> size-4096: 1 netlink_proto_init+0x44/0x190
> size-4096: 1 ip_rt_init+0xfd/0x2f0
> size-4096: 1 cipso_v4_init+0x13/0x70
> size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> 
> The big one seems to be the __alloc_skb. (This is with 16 threads, and
> it says that we are using up somewhere between 12 and 14 GB of memory,
> about 2 to 3 gig of that is disk cache).  If I were to put anymore
> threads out there, the server would become almost unresponsive (it was
> bad enough as it was).   
> 
> At the same time, I also noticed this:
> 
> skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> 
> Don't know for sure if that is meaningful or not....

OK, so, starting at net/core/skbuff.c, this means that this memory was
allocated by __alloc_skb() calls with something nonzero in the third
("fclone") argument.  The only such caller is alloc_skb_fclone().
Callers of alloc_skb_fclone() include:

	sk_stream_alloc_skb:
		do_tcp_sendpages
		tcp_sendmsg
		tcp_fragment
		tso_fragment
		tcp_mtu_probe
	tcp_send_fin
	tcp_connect
	buf_acquire:
		lots of callers in tipc code (whatever that is).

So unless you're using tipc, or you have something in userspace going
haywire (perhaps netstat would help rule that out?), then I suppose
there's something wrong with knfsd's tcp code.  Which makes sense, I
guess.

I'd think this sort of allocation would be limited by the number of
sockets times the size of the send and receive buffers.
svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
sockets to (nrthreads+3)*20.  (You aren't hitting the "too many open
connections" printk there, are you?)  The total buffer size should be
bounded by something like 4 megs.

--b.

> 
> 
> 
> > > Thanks everyone for looking at this, by the way!
> > 
> > And thanks for your persistence.
> > 
> > --b.
> > 
> 
> 
> Anytime.  This is the part of the job that is fun (except for my
> users...).  Anyone can watch a system run; it's dealing with the unknown
> that makes it interesting.

OK!  Because I'm a bit stuck, so this will take some more work....

--b.

> 
> 
> Norman Weathers
> 
> 
> > > 
> > > > 
> > > > 
> > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > index 06236e4..b379e31 100644
> > > > --- a/mm/slab.c
> > > > +++ b/mm/slab.c
> > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > >  	 * above the next power of two: caches with object sizes just above a
> > > >  	 * power of two have a significant amount of internal fragmentation.
> > > >  	 */
> > > > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > >  						2 * sizeof(unsigned long long)))
> > > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > > 
> > > 
> > > 
> > > Norman Weathers


* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-13 20:15                           ` J. Bruce Fields
@ 2008-06-13 21:53                             ` Weathers, Norman R.
  2008-06-13 22:04                               ` J. Bruce Fields
From: Weathers, Norman R. @ 2008-06-13 21:53 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

 

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org 
> [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Friday, June 13, 2008 3:16 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> 
> On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
> >  
> > 
> > > -----Original Message-----
> > > From: linux-nfs-owner@vger.kernel.org 
> > > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. 
> Bruce Fields
> > > Sent: Wednesday, June 11, 2008 5:55 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> > > linux-nfs@vger.kernel.org
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > > 
> > > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, 
> Norman R. wrote:
> > > > I will try and get it patched and retested, but it may be a 
> > > day or two
> > > > before I can get back the information due to production jobs now
> > > > running.  Once they finish up, I will get back with the info.
> > > 
> > > Understood.
> > > 
> > 
> > 
> > I was able to get my big user to cooperate and let me in to 
> be able to
> > get the information that you were needing.  The full output from the
> > /proc/slab_allocator file is at
> > http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 .  The 16
> > thread case is very interesting.  Also, there is a small 
> txt file in the
> > directory that has some rpc errors, but I imagine the way that I am
> > running the box (oversubscribed threads) has more to do with the rpc
> > errors than anything else.  For those of you wanting the gist of the
> > story, the size-4096 slab has the following very large allocation:
> > 
> > size-4096: 2 sys_init_module+0x140b/0x1980
> > size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> > size-4096: 1 seq_read+0x1d9/0x2e0
> > size-4096: 1 slabstats_open+0x2b/0x80
> > size-4096: 5 vc_allocate+0x167/0x190
> > size-4096: 3 input_allocate_device+0x12/0x80
> > size-4096: 1 hid_add_field+0x122/0x290
> > size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> > size-4096: 1846825 __alloc_skb+0x7d/0x170
> > size-4096: 3 alloc_netdev+0x33/0xa0
> > size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> > size-4096: 5 devinet_sysctl_register+0x28/0x110
> > size-4096: 1 pidmap_init+0x15/0x60
> > size-4096: 1 netlink_proto_init+0x44/0x190
> > size-4096: 1 ip_rt_init+0xfd/0x2f0
> > size-4096: 1 cipso_v4_init+0x13/0x70
> > size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> > size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> > size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> > size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> > size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> > size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> > size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> > size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> > size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> > size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> > 
> > The big one seems to be the __alloc_skb. (This is with 16 
> threads, and
> > it says that we are using up somewhere between 12 and 14 GB 
> of memory,
> > about 2 to 3 gig of that is disk cache).  If I were to put anymore
> > threads out there, the server would become almost 
> unresponsive (it was
> > bad enough as it was).   
> > 
> > At the same time, I also noticed this:
> > 
> > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > 
> > Don't know for sure if that is meaningful or not....
> 
> OK, so, starting at net/core/skbuff.c, this means that this memory was
> allocated by __alloc_skb() calls with something nonzero in the third
> ("fclone") argument.  The only such caller is alloc_skb_fclone().
> Callers of alloc_skb_fclone() include:
> 
> 	sk_stream_alloc_skb:
> 		do_tcp_sendpages
> 		tcp_sendmsg
> 		tcp_fragment
> 		tso_fragment

Interesting you should mention TSO...  We recently went through and
turned on TSO on all of our systems, trying it out to see if it helped
with performance...  This could have something to do with that.  I can
try disabling TSO on all of the servers and see if that helps with the
memory.  Actually, I think I will, and I will monitor the situation.  I
think it might help some, but I still think there may be something else
going on in a deep corner...
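For anyone following along, TSO can be checked and toggled per interface with ethtool (the interface name eth0 below is a placeholder):

```shell
# Show the current offload settings for the interface:
ethtool -k eth0 | grep tcp-segmentation-offload

# Disable TSO on the server, as discussed above:
ethtool -K eth0 tso off
```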

> 		tcp_mtu_probe
> 	tcp_send_fin
> 	tcp_connect
> 	buf_acquire:
> 		lots of callers in tipc code (whatever that is).
> 
> So unless you're using tipc, or you have something in userspace going
> haywire (perhaps netstat would help rule that out?), then I suppose
> there's something wrong with knfsd's tcp code.  Which makes sense, I
> guess.
> 

Not sure what TIPC is either....

> I'd think this sort of allocation would be limited by the number of
> sockets times the size of the send and receive buffers.
> svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> sockets to (nrthreads+3)*20.  (You aren't hitting the "too many open
> connections" printk there, are you?)  The total buffer size should be
> bounded by something like 4 megs.
> 
> --b.
> 

Yes, we are getting a continuous stream of the "too many open
connections" messages scrolling across our logs.


> > 
> > 
> > 
> > > > Thanks everyone for looking at this, by the way!
> > > 
> > > And thanks for your persistence.
> > > 
> > > --b.
> > > 
> > 
> > 
> > Anytime.  This is the part of the job that is fun (except for my
> > users...).  Anyone can watch a system run, it's dealing 
> with the unknown
> > that makes it interesting.
> 
> OK!  Because I'm a bit stuck, so this will take some more work....
> 
> --b.
> 

No problem.  I feel good if I exercised some deep corner of the code
and found something that needed to be flushed out; that's what the
experience is all about, isn't it?


> > 
> > 
> > Norman Weathers
> > 
> > 
> > > > 
> > > > > 
> > > > > 
> > > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > > index 06236e4..b379e31 100644
> > > > > --- a/mm/slab.c
> > > > > +++ b/mm/slab.c
> > > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, 
> > > > > size_t size, size_t align,
> > > > >  	 * above the next power of two: caches with object 
> > > > > sizes just above a
> > > > >  	 * power of two have a significant amount of internal 
> > > > > fragmentation.
> > > > >  	 */
> > > > > -	if (size < 4096 || fls(size - 1) == fls(size-1 
> + REDZONE_ALIGN +
> > > > > +	if (size < 8192 || fls(size - 1) == fls(size-1 
> + REDZONE_ALIGN +
> > > > >  						2 * 
> > > > > sizeof(unsigned long long)))
> > > > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > > > 
> > > > 
> > > > 
> > > > Norman Weathers
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe 
> > > linux-nfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-13 21:53                             ` Weathers, Norman R.
@ 2008-06-13 22:04                               ` J. Bruce Fields
  2008-06-13 22:53                                 ` Weathers, Norman R.
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-06-13 22:04 UTC (permalink / raw)
  To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
>  
> 
> > > The big one seems to be the __alloc_skb. (This is with 16 
> > threads, and
> > > it says that we are using up somewhere between 12 and 14 GB 
> > of memory,
> > > about 2 to 3 gig of that is disk cache).  If I were to put anymore
> > > threads out there, the server would become almost 
> > unresponsive (it was
> > > bad enough as it was).   
> > > 
> > > At the same time, I also noticed this:
> > > 
> > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > 
> > > Don't know for sure if that is meaningful or not....
> > 
> > OK, so, starting at net/core/skbuff.c, this means that this memory was
> > allocated by __alloc_skb() calls with something nonzero in the third
> > ("fclone") argument.  The only such caller is alloc_skb_fclone().
> > Callers of alloc_skb_fclone() include:
> > 
> > 	sk_stream_alloc_skb:
> > 		do_tcp_sendpages
> > 		tcp_sendmsg
> > 		tcp_fragment
> > 		tso_fragment
> 
> Interesting you should mention the tso...  We recently went through and
> turned on TSO on all of our systems, trying it out to see if it helped
> with performance...  This could be something to do with that.  I can try
> disabling the tso on all of the servers and see if that helps with the
> memory.  Actually, I think I will, and I will monitor the situation.  I
> think it might help some, but I still think there may be something else
> going on in a deep corner...

I'll plead total ignorance about TSO, and it sounds like a long
shot--but sure, it'd be worth trying, thanks.

> 
> > 		tcp_mtu_probe
> > 	tcp_send_fin
> > 	tcp_connect
> > 	buf_acquire:
> > 		lots of callers in tipc code (whatever that is).
> > 
> > So unless you're using tipc, or you have something in userspace going
> > haywire (perhaps netstat would help rule that out?), then I suppose
> > there's something wrong with knfsd's tcp code.  Which makes sense, I
> > guess.
> > 
> 
> Not for sure what tipc is either....
> 
> > I'd think this sort of allocation would be limited by the number of
> > sockets times the size of the send and receive buffers.
> > svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> > sockets to (nrthreads+3)*20.  (You aren't hitting the "too many open
> > connections" printk there, are you?)  The total buffer size should be
> > bounded by something like 4 megs.
> > 
> > --b.
> > 
> 
> Yes, we are getting a continuous stream of the too many open connections
> scrolling across our logs.  

That's interesting!  So we should probably look more closely at the
svc_check_conn_limits() behavior.  I wonder whether some pathological
behavior is triggered in the case where you're constantly over the limit
it's trying to enforce.

(Remind me how many active clients you have?)

> No problems.  I feel good if I exercised some deep corner of the code
> and found something that needed flushed out, that's what the experience
> is all about, isn't it?

Yep!

--b.


* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-13 22:04                               ` J. Bruce Fields
@ 2008-06-13 22:53                                 ` Weathers, Norman R.
  2008-06-16 17:43                                   ` J. Bruce Fields
  0 siblings, 1 reply; 13+ messages in thread
From: Weathers, Norman R. @ 2008-06-13 22:53 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

 

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> Sent: Friday, June 13, 2008 5:04 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> 
> On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
> >  
> > 
> > > > The big one seems to be the __alloc_skb. (This is with 16 
> > > threads, and
> > > > it says that we are using up somewhere between 12 and 14 GB 
> > > of memory,
> > > > about 2 to 3 gig of that is disk cache).  If I were to 
> put anymore
> > > > threads out there, the server would become almost 
> > > unresponsive (it was
> > > > bad enough as it was).   
> > > > 
> > > > At the same time, I also noticed this:
> > > > 
> > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > > 
> > > > Don't know for sure if that is meaningful or not....
> > > 
> > > OK, so, starting at net/core/skbuff.c, this means that 
> this memory was
> > > allocated by __alloc_skb() calls with something nonzero 
> in the third
> > > ("fclone") argument.  The only such caller is alloc_skb_fclone().
> > > Callers of alloc_skb_fclone() include:
> > > 
> > > 	sk_stream_alloc_skb:
> > > 		do_tcp_sendpages
> > > 		tcp_sendmsg
> > > 		tcp_fragment
> > > 		tso_fragment
> > 
> > Interesting you should mention the tso...  We recently went 
> through and
> > turned on TSO on all of our systems, trying it out to see 
> if it helped
> > with performance...  This could be something to do with 
> that.  I can try
> > disabling the tso on all of the servers and see if that 
> helps with the
> > memory.  Actually, I think I will, and I will monitor the 
> situation.  I
> > think it might help some, but I still think there may be 
> something else
> > going on in a deep corner...
> 
> I'll plead total ignorance about TSO, and it sounds like a long
> shot--but sure, it'd be worth trying, thanks.
> 

Tried it; not sure if I like the results yet or not...  It didn't seem
to make a huge difference, but here is something that will really make
you want to drink: the 2.6.25.4 kernel does not go into the size-4096
hell.  The largest users of slab there are size-1024 and still the
skbuff_fclone_cache.  On a box with 16 threads, it will cache about 5
GB of disk data, and still use about 6 GB of slab to put the information
out there (without TSO on), but at least it is not causing the disk
cache to be evicted, and it appears to be a little more responsive.  If
I up it to 32 or more threads, however, it gets very sluggish, but then
again, I am hitting it with a lot of nodes.
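A quick way to watch the caches being compared here (cache names as they appear earlier in the thread; slabtop ships with procps):

```shell
# Snapshot the slab caches discussed in this thread:
grep -E '^(size-1024 |size-4096 |skbuff_fclone_cache)' /proc/slabinfo

# Or watch interactively, sorted by cache size:
slabtop --sort=c
```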

> > 
> > > 		tcp_mtu_probe
> > > 	tcp_send_fin
> > > 	tcp_connect
> > > 	buf_acquire:
> > > 		lots of callers in tipc code (whatever that is).
> > > 
> > > So unless you're using tipc, or you have something in 
> userspace going
> > > haywire (perhaps netstat would help rule that out?), then 
> I suppose
> > > there's something wrong with knfsd's tcp code.  Which 
> makes sense, I
> > > guess.
> > > 
> > 
> > Not for sure what tipc is either....
> > 
> > > I'd think this sort of allocation would be limited by the 
> number of
> > > sockets times the size of the send and receive buffers.
> > > svc_xprt.c:svc_check_conn_limits() claims to be limiting 
> the number of
> > > sockets to (nrthreads+3)*20.  (You aren't hitting the 
> "too many open
> > > connections" printk there, are you?)  The total buffer 
> size should be
> > > bounded by something like 4 megs.
> > > 
> > > --b.
> > > 
> > 
> > Yes, we are getting a continuous stream of the too many 
> open connections
> > scrolling across our logs.  
> 
> That's interesting!  So we should probably look more closely at the
> svc_check_conn_limits() behavior.  I wonder whether some pathological
> behavior is triggered in the case where you're constantly 
> over the limit
> it's trying to enforce.
> 
> (Remind me how many active clients you have?)
> 


We are currently hitting it with somewhere around 600 to 800 nodes, but
it can go up to over 1000 nodes.  We are artificially starving the
server with a limited number of threads (2 to 3) right now on the older
2.6.22.14 kernel because of that memory issue (which may or may not be
TSO related)...

I really want to move forward to the newer kernel, but we had an issue
where clients all of a sudden wouldn't connect, yet other clients
could, to the exact same server NFS export.  I had booted the server
into the 2.6.25.4 kernel at the time, and the other admin set us back to
the 2.6.22.14 to see if that was it.  The clients started working again,
and he left it there (he also took out my options in the exports file,
no_subtree_check and insecure).  I know that we are running over the
number of privileged ports, and we probably need the insecure option,
but I am having a hard time wrapping myself around all of the problems
at once....
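For reference, the two export options mentioned go on the export line in /etc/exports (the path and client pattern below are hypothetical):

```shell
# /etc/exports -- example line with the options discussed above:
#   /export/scratch  *(rw,sync,insecure,no_subtree_check)

# After editing, re-export without restarting nfsd:
exportfs -ra
```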



> > No problems.  I feel good if I exercised some deep corner 
> of the code
> > and found something that needed flushed out, that's what 
> the experience
> > is all about, isn't it?
> 
> Yep!
> 
> --b.
> 


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-13 22:53                                 ` Weathers, Norman R.
@ 2008-06-16 17:43                                   ` J. Bruce Fields
  2008-06-19 15:53                                     ` Weathers, Norman R.
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-06-16 17:43 UTC (permalink / raw)
  To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

On Fri, Jun 13, 2008 at 05:53:20PM -0500, Weathers, Norman R. wrote:
>  
> 
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> > Sent: Friday, June 13, 2008 5:04 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> > linux-nfs@vger.kernel.org; Neil Brown
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > 
> > On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
> > >  
> > > 
> > > > > The big one seems to be the __alloc_skb. (This is with 16 
> > > > threads, and
> > > > > it says that we are using up somewhere between 12 and 14 GB 
> > > > of memory,
> > > > > about 2 to 3 gig of that is disk cache).  If I were to 
> > put anymore
> > > > > threads out there, the server would become almost 
> > > > unresponsive (it was
> > > > > bad enough as it was).   
> > > > > 
> > > > > At the same time, I also noticed this:
> > > > > 
> > > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > > > 
> > > > > Don't know for sure if that is meaningful or not....
> > > > 
> > > > OK, so, starting at net/core/skbuff.c, this means that 
> > this memory was
> > > > allocated by __alloc_skb() calls with something nonzero 
> > in the third
> > > > ("fclone") argument.  The only such caller is alloc_skb_fclone().
> > > > Callers of alloc_skb_fclone() include:
> > > > 
> > > > 	sk_stream_alloc_skb:
> > > > 		do_tcp_sendpages
> > > > 		tcp_sendmsg
> > > > 		tcp_fragment
> > > > 		tso_fragment
> > > 
> > > Interesting you should mention the tso...  We recently went 
> > through and
> > > turned on TSO on all of our systems, trying it out to see 
> > if it helped
> > > with performance...  This could be something to do with 
> > that.  I can try
> > > disabling the tso on all of the servers and see if that 
> > helps with the
> > > memory.  Actually, I think I will, and I will monitor the 
> > situation.  I
> > > think it might help some, but I still think there may be 
> > something else
> > > going on in a deep corner...
> > 
> > I'll plead total ignorance about TSO, and it sounds like a long
> > shot--but sure, it'd be worth trying, thanks.
> > 
> 
> Tried it, not for sure if I like the results yet or not...  Didn't seem
> to make a huge difference, but here is something that will really make
> you want to drink, the 2.6.25.4 kernel does not go into the size-4096
> hell.

Remind me what the most recent *bad* kernel was of those you tested?
(2.6.25?)

Nothing jumped out at me in a quick skim through the commits from 2.6.25
to 2.6.25.4.

> The largest users of slab there are the size-1024 and still the
> skbuff_fclone_cache.  On a box with 16 threads, it will cache up about 5
> GB of disk data, and still use about 6 GB of slab to put the information
> out there (without TSO on), but at least it is not causing the disk
> cache to be evicted, and it appears to be a little more responsive.  If
> I up it to 32 or more threads, however, it gets very sluggish, but then
> again, I am hitting it with a lot of nodes.
> 
> > > 
> > > > 		tcp_mtu_probe
> > > > 	tcp_send_fin
> > > > 	tcp_connect
> > > > 	buf_acquire:
> > > > 		lots of callers in tipc code (whatever that is).
> > > > 
> > > > So unless you're using tipc, or you have something in 
> > userspace going
> > > > haywire (perhaps netstat would help rule that out?), then 
> > I suppose
> > > > there's something wrong with knfsd's tcp code.  Which 
> > makes sense, I
> > > > guess.
> > > > 
> > > 
> > > Not for sure what tipc is either....
> > > 
> > > > I'd think this sort of allocation would be limited by the 
> > number of
> > > > sockets times the size of the send and receive buffers.
> > > > svc_xprt.c:svc_check_conn_limits() claims to be limiting 
> > the number of
> > > > sockets to (nrthreads+3)*20.  (You aren't hitting the 
> > "too many open
> > > > connections" printk there, are you?)  The total buffer 
> > size should be
> > > > bounded by something like 4 megs.
> > > > 
> > > > --b.
> > > > 
> > > 
> > > Yes, we are getting a continuous stream of the too many 
> > open connections
> > > scrolling across our logs.  
> > 
> > That's interesting!  So we should probably look more closely at the
> > svc_check_conn_limits() behavior.  I wonder whether some pathological
> > behavior is triggered in the case where you're constantly 
> > over the limit
> > it's trying to enforce.
> > 
> > (Remind me how many active clients you have?)
> > 
> 
> 
> We currently are hitting with somewhere around 600 to 800 nodes, but it
> can go up to over 1000 nodes.  We are artificially starving with a
> limited number of threads (2 to 3) right now on the older 2.6.22.14
> kernel because of that memory issue (which may or may not be tso
> related)...

So with that many clients all making requests to the server at once,
we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the number
of threads was set to less than 30-50.  That doesn't seem to be the
point where you're seeing a change in behavior, though.

> I really want to move forward to the newer kernel, but we had an issue
> where clients all of the sudden wouldn't connect, yet other clients
> could, to the exact same server NFS export.  I had booted the server
> into the 2.6.25.4 kernel at the time, and the other admin set us back to
> the 2.6.22.14 to see if that was it.  The clients started working again,
> and he left it there (he also took out my options in the exports file,
> no_subtree_check and insecure).  I know that we are running over the
> > number of privileged ports, and we probably need the insecure, but I am
> having a hard time wrapping my self around all of the problems at
> once....

The secure ports limitation should be a problem for a client that does a
lot of nfs mounts, not for a server with a lot of clients.

--b.


* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-16 17:43                                   ` J. Bruce Fields
@ 2008-06-19 15:53                                     ` Weathers, Norman R.
  2008-06-19 18:46                                       ` J. Bruce Fields
  0 siblings, 1 reply; 13+ messages in thread
From: Weathers, Norman R. @ 2008-06-19 15:53 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

 

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> Sent: Monday, June 16, 2008 12:44 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> 
> On Fri, Jun 13, 2008 at 05:53:20PM -0500, Weathers, Norman R. wrote:
> >  
> > 
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org] 
> > > Sent: Friday, June 13, 2008 5:04 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; linux-kernel@vger.kernel.org; 
> > > linux-nfs@vger.kernel.org; Neil Brown
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > > 
> > > On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, 
> Norman R. wrote:
> > > >  
> > > > 
> > > > > > The big one seems to be the __alloc_skb. (This is with 16 
> > > > > threads, and
> > > > > > it says that we are using up somewhere between 12 and 14 GB 
> > > > > of memory,
> > > > > > about 2 to 3 gig of that is disk cache).  If I were to 
> > > put anymore
> > > > > > threads out there, the server would become almost 
> > > > > unresponsive (it was
> > > > > > bad enough as it was).   
> > > > > > 
> > > > > > At the same time, I also noticed this:
> > > > > > 
> > > > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > > > > 
> > > > > > Don't know for sure if that is meaningful or not....
> > > > > 
> > > > > OK, so, starting at net/core/skbuff.c, this means that 
> > > this memory was
> > > > > allocated by __alloc_skb() calls with something nonzero 
> > > in the third
> > > > > ("fclone") argument.  The only such caller is 
> alloc_skb_fclone().
> > > > > Callers of alloc_skb_fclone() include:
> > > > > 
> > > > > 	sk_stream_alloc_skb:
> > > > > 		do_tcp_sendpages
> > > > > 		tcp_sendmsg
> > > > > 		tcp_fragment
> > > > > 		tso_fragment
> > > > 
> > > > Interesting you should mention the tso...  We recently went 
> > > through and
> > > > turned on TSO on all of our systems, trying it out to see 
> > > if it helped
> > > > with performance...  This could be something to do with 
> > > that.  I can try
> > > > disabling the tso on all of the servers and see if that 
> > > helps with the
> > > > memory.  Actually, I think I will, and I will monitor the 
> > > situation.  I
> > > > think it might help some, but I still think there may be 
> > > something else
> > > > going on in a deep corner...
> > > 
> > > I'll plead total ignorance about TSO, and it sounds like a long
> > > shot--but sure, it'd be worth trying, thanks.
> > > 
> > 
> > Tried it, not for sure if I like the results yet or not...  
> Didn't seem
> > to make a huge difference, but here is something that will 
> really make
> > you want to drink, the 2.6.25.4 kernel does not go into the 
> size-4096
> > hell.
> 
> Remind me what the most recent *bad* kernel was of those you tested?
> (2.6.25?)
> 

The kernel that we were really seeing the problem with was 2.6.25.4, but
I think we may have figured out the 4096 problem, and it was probably a
mistake on my part, but it is important for the NFS users to see it so
they don't make the same mistake.  I had found some performance tuning
guides, and in trying some of the suggestions, found that the setting
changes did seem to help on some things, but of course I never got to
run a check under full load (800+ clients).  A suggestion was to change
the tcp_reordering tunable under /proc/sys/net/ipv4 from the default 3
to 127.  We think that this was actually causing the issue.  I was able
to trace back through all of the changes, and I changed this setting
back to the default 3, and it immediately fixed the size-4096 hell.  It
appears that the reordering just eats into the memory, especially in
high demand situations, and I guess that should make perfect sense if we
are actually buffering up packets for reorder, and we are slamming the
box with thousands of requests per minute.
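For anyone who made the same tuning change, reverting it looks like this (as described above, the kernel default is 3):

```shell
# Check the current value:
cat /proc/sys/net/ipv4/tcp_reordering

# Revert the tuning-guide change back to the default:
sysctl -w net.ipv4.tcp_reordering=3
```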

We still have other performance issues now, but it appears to be more of
a bottleneck, the nodes do not appear to be backing off when the servers
are becoming congested.


> Nothing jumped out at me in a quick skim through the commits 
> from 2.6.25
> to 2.6.25.4.
> 
> > The largest users of slab there are the size-1024 and still the
> > skbuff_fclone_cache.  On a box with 16 threads, it will 
> cache up about 5
> > GB of disk data, and still use about 6 GB of slab to put 
> the information
> > out there (without TSO on), but at least it is not causing the disk
> > cache to be evicted, and it appears to be a little more 
> responsive.  If
> > I up it to 32 or more threads, however, it gets very 
> sluggish, but then
> > again, I am hitting it with a lot of nodes.
> > 
> > > > 
> > > > > 		tcp_mtu_probe
> > > > > 	tcp_send_fin
> > > > > 	tcp_connect
> > > > > 	buf_acquire:
> > > > > 		lots of callers in tipc code (whatever that is).
> > > > > 
> > > > > So unless you're using tipc, or you have something in 
> > > userspace going
> > > > > haywire (perhaps netstat would help rule that out?), then 
> > > I suppose
> > > > > there's something wrong with knfsd's tcp code.  Which 
> > > makes sense, I
> > > > > guess.
> > > > > 
> > > > 
> > > > Not for sure what tipc is either....
> > > > 
> > > > > I'd think this sort of allocation would be limited by the 
> > > number of
> > > > > sockets times the size of the send and receive buffers.
> > > > > svc_xprt.c:svc_check_conn_limits() claims to be limiting 
> > > the number of
> > > > > sockets to (nrthreads+3)*20.  (You aren't hitting the 
> > > "too many open
> > > > > connections" printk there, are you?)  The total buffer 
> > > size should be
> > > > > bounded by something like 4 megs.
> > > > > 
> > > > > --b.
> > > > > 
> > > > 
> > > > Yes, we are getting a continuous stream of the too many 
> > > open connections
> > > > scrolling across our logs.  
> > > 
> > > That's interesting!  So we should probably look more 
> closely at the
> > > svc_check_conn_limits() behavior.  I wonder whether some 
> pathological
> > > behavior is triggered in the case where you're constantly 
> > > over the limit
> > > it's trying to enforce.
> > > 
> > > (Remind me how many active clients you have?)
> > > 
> > 
> > 
> > We currently are hitting with somewhere around 600 to 800 
> nodes, but it
> > can go up to over 1000 nodes.  We are artificially starving with a
> > limited number of threads (2 to 3) right now on the older 2.6.22.14
> > kernel because of that memory issue (which may or may not be tso
> > related)...
> 
> So with that many clients all making requests to the server at once,
> we'd start hitting that (serv->sv_nrthreads+3)*20 limit when 
> the number
> of threads was set to less than 30-50.  That doesn't seem to be the
> point where you're seeing a change in behavior, though.
> 

We were estimating between 40 and 50 threads as the cutoff for being
able to service all of the (current) requests at once.  I haven't ramped
back up to that level yet.  I wasn't comfortable yet with letting it all
hang back out just in case we get into that hellish mode again; it can
be a pain to try and get into those systems once they are overloaded
(even over serial, sometimes it can just time out the login).  We had to
actually bring a second option online to help alleviate some of the
congestion because the servers couldn't handle the workload.


> > I really want to move forward to the newer kernel, but we 
> had an issue
> > where clients all of the sudden wouldn't connect, yet other clients
> > could, to the exact same server NFS export.  I had booted the server
> > into the 2.6.25.4 kernel at the time, and the other admin 
> set us back to
> > the 2.6.22.14 to see if that was it.  The clients started 
> working again,
> > and he left it there (he also took out my options in the 
> exports file,
> > no_subtree_check and insecure).  I know that we are running over the
> > number of privileged ports, and we probably need the 
> insecure, but I am
> > having a hard time wrapping my self around all of the problems at
> > once....
> 
> The secure ports limitation should be a problem for a client 
> that does a
> lot of nfs mounts, not for a server with a lot of clients.
> 


Ah, OK.  That makes sense.

> --b.
> 


* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
  2008-06-19 15:53                                     ` Weathers, Norman R.
@ 2008-06-19 18:46                                       ` J. Bruce Fields
  0 siblings, 0 replies; 13+ messages in thread
From: J. Bruce Fields @ 2008-06-19 18:46 UTC (permalink / raw)
  To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown

On Thu, Jun 19, 2008 at 10:53:28AM -0500, Weathers, Norman R. wrote:
> The kernel that we were really seeing the problem with was 2.6.25.4, but
> I think we may have figured out the 4096 problem, and it was probably a
> mistake on my part, but it is important for the NFS users to see it so
> they don't make the same mistake.  I had found some performance tuning
> guides, and in trying some of the suggestions, found that the setting
> changes did seem to help on some things, but of course I never got to
> run a check under full load (800+ clients).  One suggestion was to
> change the tcp_reordering tunable under /proc/sys/net/ipv4 from the
> default of 3 to 127.  We think that this was actually causing the
> issue.  I was able to trace back through all of the changes, and when I
> changed this setting back to the default of 3, it immediately fixed the
> size-4096 hell.  It appears that the reordering just eats into memory,
> especially in high-demand situations, and I guess that should make
> perfect sense if we are actually buffering up packets for reordering
> while we are slamming the box with thousands of requests per minute.

OK, sounds plausible, though I won't pretend to understand exactly how
that reordering code is using memory.
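
For anyone retracing this, the tunable lives under /proc/sys/net/ipv4;
a minimal sketch of inspecting it and restoring the default (the value
of 3 is the kernel default cited in this thread; the write requires
root):

```shell
# Inspect the current TCP reordering tolerance (kernel default is 3).
cat /proc/sys/net/ipv4/tcp_reordering

# Restore the default after experiments like the 127 setting above
# (must be run as root).
echo 3 > /proc/sys/net/ipv4/tcp_reordering

# Equivalent persistent form, as a line in /etc/sysctl.conf:
#   net.ipv4.tcp_reordering = 3
```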

> We still have other performance issues now, but it appears to be more
> of a bottleneck: the nodes do not appear to be backing off when the
> servers become congested.
...
> > So with that many clients all making requests to the server at once,
> > we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the
> > number of threads was set to less than 30-50.  That doesn't seem to
> > be the point where you're seeing a change in behavior, though.
> > 
> 
> We were estimating between 40 and 50 threads was the cutoff for being
> able to service all of the (current) requests at once.  I haven't
> ramped back up to that level yet.  I wasn't comfortable with letting it
> all hang back out just in case we get into that hellish mode again; it
> can be a pain to try to get into those systems once they are overloaded
> (even over serial, sometimes it can just time out the login).  We had
> to actually bring online a second option to help alleviate some of the
> congestion because the servers couldn't handle the workload.
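
As a quick sanity check on those numbers: assuming the
(serv->sv_nrthreads+3)*20 deferral limit quoted above (the formula is
taken from this thread, not verified against current kernel source),
the smallest thread count whose limit covers ~800 simultaneous clients
works out to 37, which is consistent with the 40-50 estimate leaving
some headroom:

```python
# Back-of-the-envelope check of the (nrthreads + 3) * 20 limit quoted
# earlier in this thread.
def deferral_limit(nrthreads: int) -> int:
    return (nrthreads + 3) * 20

clients = 800
# Smallest thread count whose limit covers all clients at once.
threshold = min(n for n in range(1, 100) if deferral_limit(n) >= clients)
print(threshold)  # -> 37
```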

Thanks for the update, and let us know if you figure out anything more.

--b.

end of thread, other threads:[~2008-06-19 18:46 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
     [not found] <1212519001.24900.14.camel@hololw58>
     [not found] ` <20080606160922.GG30863@fieldses.org>
     [not found]   ` <0122F800A3B64C449565A9E8C2977010155587@hoexmb9.conoco.net>
     [not found]     ` <20080609185355.GF28584@fieldses.org>
     [not found]       ` <0122F800A3B64C449565A9E8C297701002D75D9F@hoexmb9.conoco.net>
     [not found]         ` <20080610171602.GG20184@fieldses.org>
     [not found]           ` <0122F800A3B64C449565A9E8C297701002D75DA3@hoexmb9.conoco.net>
     [not found]             ` <20080611184613.GM15380@fieldses.org>
2008-06-11 19:52               ` CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? J. Bruce Fields
2008-06-11 20:09                 ` Jeff Layton
2008-06-11 20:57                   ` J. Bruce Fields
2008-06-11 22:46                     ` Weathers, Norman R.
2008-06-11 22:54                       ` J. Bruce Fields
2008-06-12 19:54                         ` Weathers, Norman R.
2008-06-13 20:15                           ` J. Bruce Fields
2008-06-13 21:53                             ` Weathers, Norman R.
2008-06-13 22:04                               ` J. Bruce Fields
2008-06-13 22:53                                 ` Weathers, Norman R.
2008-06-16 17:43                                   ` J. Bruce Fields
2008-06-19 15:53                                     ` Weathers, Norman R.
2008-06-19 18:46                                       ` J. Bruce Fields
