* [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
@ 2011-07-28 22:47 ` Pekka Enberg
  0 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2011-07-28 22:47 UTC (permalink / raw)
  To: torvalds; +Cc: cl, akpm, rientjes, hughd, linux-kernel, linux-mm

Hi Linus,

This pull request has patches to make SLUB slowpaths lockless like we 
already did for the fastpaths. They have been sitting in linux-next for a 
while now and should be fine. David Rientjes reports improved performance:

   I ran slub/lockless through some stress testing and it seems to be quite
   stable on my testing cluster.  There is about a 2.3% performance
   improvement with the lockless slowpath on the netperf benchmark with
   various thread counts on my 16-core 64GB Opterons, so I'd recommend it to
   be merged into 3.1.

One possible gotcha, though, is that page struct gets bigger on x86_64. Hugh
Dickins writes:

   By the way, if you're thinking of lining up a pull request to Linus
   for 3.1, please make it very clear in that request that these changes
   enlarge the x86_64 struct page from 56 to 64 bytes, for slub alone.

   I remain very uneasy about that (love the cache alignment but...),
   the commit comment is rather vague about it, and I'm not sure that
   anyone else has noticed yet (akpm?).

   Given that Linus wouldn't let Kosaki add 4 bytes to the 32-bit
   vm_area_struct in 3.0, telling him about this upfront does not
   improve your chances that he will pull ;) but does protect you
   from his wrath when he'd later find it sneaked in.

We haven't come up with a solution to keep struct page size the same but 
I think it's a reasonable trade-off.

                         Pekka

The following changes since commit 95b6886526bb510b8370b625a49bc0ab3b8ff10f:
   Linus Torvalds (1):
         Merge branch 'for-linus' of git://git.kernel.org/.../jmorris/security-testing-2.6

are available in the git repository at:

   ssh://master.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6.git slub/lockless

Christoph Lameter (20):
       slub: Push irq disable into allocate_slab()
       slub: Do not use frozen page flag but a bit in the page counters
       slub: Move page->frozen handling near where the page->freelist handling occurs
       mm: Rearrange struct page
       slub: Add cmpxchg_double_slab()
       slub: explicit list_lock taking
       slub: Pass kmem_cache struct to lock and freeze slab
       slub: Rework allocator fastpaths
       slub: Invert locking and avoid slab lock
       slub: Disable interrupts in free_debug processing
       slub: Avoid disabling interrupts in free slowpath
       slub: Get rid of the another_slab label
       slub: Add statistics for the case that the current slab does not match the node
       slub: fast release on full slab
       slub: Not necessary to check for empty slab on load_freelist
       slub: slabinfo update for cmpxchg handling
       SLUB: Fix build breakage in linux/mm_types.h
       Avoid duplicate _count variables in page_struct
       slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock
       slub: When allocating a new slab also prep the first object

Pekka Enberg (2):
       Merge remote branch 'tip/x86/atomic' into slub/lockless
       Revert "SLUB: Fix build breakage in linux/mm_types.h"

  arch/x86/Kconfig.cpu              |    3 +
  arch/x86/include/asm/cmpxchg_32.h |   48 +++
  arch/x86/include/asm/cmpxchg_64.h |   45 +++
  arch/x86/include/asm/cpufeature.h |    2 +
  include/linux/mm_types.h          |   89 +++--
  include/linux/page-flags.h        |    5 -
  include/linux/slub_def.h          |    3 +
  mm/slub.c                         |  764 +++++++++++++++++++++++++------------
  tools/slub/slabinfo.c             |   59 ++-
  9 files changed, 714 insertions(+), 304 deletions(-)

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-28 22:47 ` Pekka Enberg
@ 2011-07-29 15:04   ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-07-29 15:04 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: torvalds, akpm, rientjes, hughd, linux-kernel, linux-mm

On Fri, 29 Jul 2011, Pekka Enberg wrote:

> We haven't come up with a solution to keep struct page size the same but I
> think it's a reasonable trade-off.

The change requires the page struct to be aligned to a double-word
boundary. There is actually no variable added to the page struct; it's
just the alignment requirement that causes padding to be added after
each page struct.
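
For illustration, a minimal userspace sketch of that padding effect (the
fields below are a mock 56-byte stand-in, not the real struct page layout):

#include <stdio.h>

/* Mock 56-byte struct standing in for the x86_64 struct page; fields are
 * illustrative only. */
struct mock_page {
        unsigned long flags;
        void *freelist;
        unsigned long counters;
        void *a, *b, *c, *d;            /* 7 x 8 bytes = 56 bytes */
};

/* Same fields, but with the double-word alignment cmpxchg16b wants. */
struct mock_page_aligned {
        unsigned long flags;
        void *freelist;
        unsigned long counters;
        void *a, *b, *c, *d;
} __attribute__((aligned(16)));

int main(void)
{
        /* sizeof is always a multiple of the alignment, so 56 rounds up
         * to 64 and an array of these gains 8 bytes per element. */
        printf("unaligned: %zu bytes\n", sizeof(struct mock_page));
        printf("aligned:   %zu bytes\n", sizeof(struct mock_page_aligned));
        return 0;
}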


* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-29 15:04   ` Christoph Lameter
@ 2011-07-29 23:18     ` Andi Kleen
  -1 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2011-07-29 23:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, torvalds, akpm, rientjes, hughd, linux-kernel, linux-mm

Christoph Lameter <cl@linux.com> writes:

> On Fri, 29 Jul 2011, Pekka Enberg wrote:
>
>> We haven't come up with a solution to keep struct page size the same but I
>> think it's a reasonable trade-off.
>
> The change requires the page struct to be aligned to a double word
> boundary. 

Why is that?

> There is actually no variable added to the page struct. Its just
> the alignment requirement that causes padding to be added after each page
> struct.

These days, with everyone using cgroups (and likely memory cgroups too),
you could probably put the cgroup page pointer back there. It's
currently external.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-29 23:18     ` Andi Kleen
@ 2011-07-30  6:33       ` Eric Dumazet
  -1 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-07-30  6:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Pekka Enberg, torvalds, akpm, rientjes, hughd,
	linux-kernel, linux-mm

On Friday, 29 July 2011 at 16:18 -0700, Andi Kleen wrote:
> Christoph Lameter <cl@linux.com> writes:
> 
> > On Fri, 29 Jul 2011, Pekka Enberg wrote:
> >
> >> We haven't come up with a solution to keep struct page size the same but I
> >> think it's a reasonable trade-off.
> >
> > The change requires the page struct to be aligned to a double word
> > boundary. 
> 
> Why is that?
> 

Because cmpxchg16b is believed to require 16-byte alignment.

http://siyobik.info/main/reference/instruction/CMPXCHG8B%2FCMPXCHG16B

64-Bit Mode Exceptions
...

#GP(0) 	If the memory address is in a non-canonical form. If memory
operand for CMPXCHG16B is not aligned on a 16-byte boundary. If
CPUID.01H:ECX.CMPXCHG16B[bit 13] = 0.
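
For illustration, a minimal userspace wrapper around the instruction (a
hypothetical sketch, not the kernel's cmpxchg_double_slab(); it assumes
x86_64 and a GCC with flag-output asm operands, and the target must be
16-byte aligned or the CPU raises #GP):

#include <stdbool.h>

/* Hypothetical freelist/counters pair; aligned(16) is what satisfies
 * the cmpxchg16b operand requirement. */
struct pair {
        void *lo;               /* low 8 bytes, goes through RAX/RBX */
        unsigned long hi;       /* high 8 bytes, goes through RDX/RCX */
} __attribute__((aligned(16)));

static inline bool cmpxchg16b(struct pair *p,
                              void *old_lo, unsigned long old_hi,
                              void *new_lo, unsigned long new_hi)
{
        bool ok;

        /* Compares RDX:RAX with the 16-byte memory operand and, if they
         * match, stores RCX:RBX; ZF reports success. */
        asm volatile("lock cmpxchg16b %1"
                     : "=@ccz" (ok), "+m" (*p), "+a" (old_lo), "+d" (old_hi)
                     : "b" (new_lo), "c" (new_hi)
                     : "memory");
        return ok;
}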




* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-28 22:47 ` Pekka Enberg
@ 2011-07-30 18:27   ` Linus Torvalds
  -1 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2011-07-30 18:27 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: cl, akpm, rientjes, hughd, linux-kernel, linux-mm

On Thu, Jul 28, 2011 at 12:47 PM, Pekka Enberg <penberg@kernel.org> wrote:
>
> This pull request has patches to make SLUB slowpaths lockless like we
> already did for the fastpaths. They have been sitting in linux-next for a
> while now and should be fine. David Rientjes reports improved performance:

So I'm not excited about the growth of the data structure, but I'll
pull this. The performance numbers seem to be solid, and dang it, it
is wonderful to finally hear about netperf performance *improvements*
due to slab changes, rather than things getting slower.

And 'struct page' is largely random-access, so the fact that the
growth makes it basically one cacheline in size sounds like a good
thing.

Do we allocate the page map array sufficiently aligned that we
actually don't ever have the case of straddling a cacheline? I didn't
check.

                                   Linus

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-30 18:27   ` Linus Torvalds
@ 2011-07-30 18:32     ` Linus Torvalds
  -1 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2011-07-30 18:32 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: cl, akpm, rientjes, hughd, linux-kernel, linux-mm

On Sat, Jul 30, 2011 at 8:27 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Do we allocate the page map array sufficiently aligned that we
> actually don't ever have the case of straddling a cacheline? I didn't
> check.

Oh, and another thing worth checking: did somebody actually check the
timings for:

 - *just* the alignment change?

   IOW, maybe some of the netperf improvement isn't from the lockless
path, but exactly from 'struct page' always being in a single
cacheline?

 - check performance with cmpxchg16b *without* the alignment.

   Sometimes especially intel is so good at unaligned accesses that
you wouldn't see an issue. Now, locked ops are usually special (and
crossing cachelines with a locked op is dubious at best), so there may
actually be correctness issues involved too, but it would be
interesting to hear if anybody actually just tried it.

Hmm?

            Linus

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-30 18:32     ` Linus Torvalds
@ 2011-07-31 17:39       ` Andi Kleen
  -1 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2011-07-31 17:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, cl, akpm, rientjes, hughd, linux-kernel, linux-mm,
	kamezawa.hiroyu, kosaki.motohiro, yinghan

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jul 30, 2011 at 8:27 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Do we allocate the page map array sufficiently aligned that we
>> actually don't ever have the case of straddling a cacheline? I didn't
>> check.
>
> Oh, and another thing worth checking: did somebody actually check the
> timings for:

I would like to see a follow-on patch that moves the mem_cgroup
pointer back into struct page. Copying some mem_cgroup people.

>
>  - *just* the alignment change?
>
>    IOW, maybe some of the netperf improvement isn't from the lockless
> path, but exactly from 'struct page' always being in a single
> cacheline?
>
>  - check performance with cmpxchg16b *without* the alignment.
>
>    Sometimes especially intel is so good at unaligned accesses that
> you wouldn't see an issue. Now, locked ops are usually special (and

As Eric pointed out CMPXCHG16B requires alignment, it #GPs otherwise.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-30 18:32     ` Linus Torvalds
@ 2011-07-31 18:11       ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-07-31 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Christoph Lameter, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Sat, 30 Jul 2011, Linus Torvalds wrote:

> Oh, and another thing worth checking: did somebody actually check the
> timings for:
> 
>  - *just* the alignment change?
> 
>    IOW, maybe some of the netperf improvement isn't from the lockless
> path, but exactly from 'struct page' always being in a single
> cacheline?
> 

Without the lockless slowpath and only the struct page alignment, the 
performance improved only 0.18% compared to vanilla 3.0-rc5, which 
slub/lockless is based on.  I've only benchmarked this in Pekka's slab 
tree; I haven't looked at your tree since it was merged.

>  - check performance with cmpxchg16b *without* the alignment.
> 
>    Sometimes especially intel is so good at unaligned accesses that
> you wouldn't see an issue. Now, locked ops are usually special (and
> crossing cachelines with a locked op is dubious at best), so there may
> actually be correctness issues involved too, but it would be
> interesting to hear if anybody actually just tried it.
> 

If the alignment is removed and struct page is restructured back to what 
it was, plus the additions required for the lockless slowpath with 
cmpxchg16b, then it becomes not so happy on my testing cluster:

[    0.000000] general protection fault: 0000 [#1] SMP 
[    0.000000] CPU 0 
[    0.000000] Modules linked in:
[    0.000000] 
[    0.000000] Pid: 0, comm: swapper Not tainted 3.0.0-slub_noalign #1
[    0.000000] RIP: 0010:[<ffffffff81198f84>]  [<ffffffff81198f84>] get_partial_node+0xa4/0x1a0
[    0.000000] RSP: 0000:ffffffff81801d78  EFLAGS: 00010002
[    0.000000] RAX: ffff88047f801040 RBX: 0000000000000000 RCX: 0000000180400040
[    0.000000] RDX: 0000000100400001 RSI: ffff88047f801040 RDI: ffffea000fbe4048
[    0.000000] RBP: ffffffff81801de8 R08: ffff88047fc132c0 R09: ffffffff8119e69c
[    0.000000] R10: 0000000000001800 R11: 0000000000001000 R12: ffffea000fbe4038
[    0.000000] R13: ffff88047f800100 R14: ffff88047f801000 R15: ffff88047f801010
[    0.000000] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.000000] CR2: 0000000000000000 CR3: 0000000001803000 CR4: 00000000000006b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
[    0.000000] Stack:
[    0.000000]  ffff88107ffd8e00 0000000000000000 ffff88047ffda1d8 0000000180400040
[    0.000000]  ffff00066c0a0100 ffffffff8108a1f9 ffffffff81801dc8 ffffffff815a543c
[    0.000000]  0000000000000000 ffffffff8180b020 0000000000000000 ffff88047f800100
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8108a1f9>] ? __might_sleep+0x9/0xf0
[    0.000000]  [<ffffffff815a543c>] ? mutex_lock+0x2c/0x60
[    0.000000]  [<ffffffff8119df35>] kmem_cache_alloc_node+0x135/0x560
[    0.000000]  [<ffffffff8119e69c>] ? kmem_cache_open+0x33c/0x580
[    0.000000]  [<ffffffff81199cf6>] ? calculate_sizes+0x16/0x3f0
[    0.000000]  [<ffffffff8119e69c>] kmem_cache_open+0x33c/0x580
[    0.000000]  [<ffffffff81194df6>] ? alloc_pages_current+0x96/0x130
[    0.000000]  [<ffffffff818c6c25>] kmem_cache_init+0xbb/0x462
[    0.000000]  [<ffffffff818a3ce7>] start_kernel+0x1be/0x39e
[    0.000000]  [<ffffffff818a3520>] x86_64_start_kernel+0x203/0x20a
[    0.000000] Code: 10 48 89 55 a8 41 0f b7 44 24 1a 80 4d ab 80 66 25 ff 7f 66 89 45 a8 48 8b 4d a8 9c 58 41 f6 45 0b 40 0f 84 7f 00 00 00 48 89 f0 <f0> 48 0f c7 0f 0f 94 c0 84 c0 0f 84 86 00 00 00 49 8b 54 24 20 
[    0.000000] RIP  [<ffffffff81198f84>] get_partial_node+0xa4/0x1a0
[    0.000000]  RSP <ffffffff81801d78>
[    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.000000] Kernel panic - not syncing: Fatal exception
[    0.000000] Pid: 0, comm: swapper Not tainted 3.0.0-slub_noalign #1
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff815a33fb>] panic+0x91/0x198
[    0.000000]  [<ffffffff81052b87>] die+0x247/0x260
[    0.000000]  [<ffffffff815a7952>] do_general_protection+0x162/0x170
[    0.000000]  [<ffffffff815a716f>] general_protection+0x1f/0x30
[    0.000000]  [<ffffffff8119e69c>] ? kmem_cache_open+0x33c/0x580
[    0.000000]  [<ffffffff81198f84>] ? get_partial_node+0xa4/0x1a0
[    0.000000]  [<ffffffff8108a1f9>] ? __might_sleep+0x9/0xf0
[    0.000000]  [<ffffffff815a543c>] ? mutex_lock+0x2c/0x60
[    0.000000]  [<ffffffff8119df35>] kmem_cache_alloc_node+0x135/0x560
[    0.000000]  [<ffffffff8119e69c>] ? kmem_cache_open+0x33c/0x580
[    0.000000]  [<ffffffff81199cf6>] ? calculate_sizes+0x16/0x3f0
[    0.000000]  [<ffffffff8119e69c>] kmem_cache_open+0x33c/0x580
[    0.000000]  [<ffffffff81194df6>] ? alloc_pages_current+0x96/0x130
[    0.000000]  [<ffffffff818c6c25>] kmem_cache_init+0xbb/0x462
[    0.000000]  [<ffffffff818a3ce7>] start_kernel+0x1be/0x39e
[    0.000000]  [<ffffffff818a3520>] x86_64_start_kernel+0x203/0x20a

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-29 15:04   ` Christoph Lameter
@ 2011-07-31 18:50     ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-07-31 18:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Fri, 29 Jul 2011, Christoph Lameter wrote:

> > We haven't come up with a solution to keep struct page size the same but I
> > think it's a reasonable trade-off.
> 

We won't be coming up with a solution to that since the alignment is a 
requirement for cmpxchg16b, unfortunately.

> The change requires the page struct to be aligned to a double word
> boundary. There is actually no variable added to the page struct. Its just
> the alignment requirement that causes padding to be added after each page
> struct.
> 

Well, the counters variable is added, although it doesn't increase the 
size of the unaligned struct page because of how it is restructured.  The 
end result of the alignment for CONFIG_CMPXCHG_LOCAL is that struct page 
will increase from 56 bytes to 64 bytes on my config.  That's a cost of 
128MB on each of my client and server 64GB machines used for the netperf 
benchmark, for the ~2.3% speedup.
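
For reference, the arithmetic behind that figure (a quick sketch assuming
4KB pages and the 8 bytes of per-page padding discussed above):

#include <stdio.h>

int main(void)
{
        unsigned long long ram   = 64ULL << 30; /* 64GB machine */
        unsigned long long pages = ram / 4096;  /* one struct page per 4KB page */
        unsigned long long extra = pages * 8;   /* 8 bytes of padding each (56 -> 64) */

        printf("%llu pages, %llu MB extra, %.2f%% of RAM\n",
               pages, extra >> 20, 100.0 * extra / ram);
        /* prints: 16777216 pages, 128 MB extra, 0.20% of RAM */
        return 0;
}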

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 18:50     ` David Rientjes
@ 2011-07-31 20:24       ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-07-31 20:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Sun, 31 Jul 2011, David Rientjes wrote:

> Well, the counters variable is added although it doesn't increase the size 
> of the unaligned struct page because of how it is restructured.  The end 
> result of the alignment for CONFIG_CMPXCHG_LOCAL is that struct page will 
> increase from 56 bytes to 64 bytes on my config.  That's a cost of 128MB 
> on each of my client and server 64GB machines for the netperf benchmark 
> for the ~2.3% speedup.
> 

And although slub is definitely heading in the right direction regarding 
the netperf benchmark, it's still a non-starter for anybody using large 
NUMA machines for networking performance.  On my 16-core, 4 node, 64GB 
client/server machines running netperf TCP_RR with various thread counts 
for 60 seconds each on 3.0:

	threads		SLUB		SLAB		diff
	 16		76345		74973		- 1.8%
	 32		116380		116272		- 0.1%
	 48		150509		153703		+ 2.1%
	 64		187984		189750		+ 0.9%
	 80		216853		224471		+ 3.5%
	 96		236640		249184		+ 5.3%
	112		256540		275464		+ 7.4%
	128		273027		296014		+ 8.4%
	144		281441		314791		+11.8%
	160		287225		326941		+13.8%

I'm much more inclined to use slab because its performance is so much 
better for heavy networking loads, and to have an extra 128MB on each of 
these machines.

Now, if I think about this from a Google perspective, we have scheduled 
jobs on shared machines with memory containment allocated in 128MB chunks 
for several years.  So if these numbers are representative of the 
networking performance I can get on our production machines, I'm not only 
far better off selecting slab for its performance, but I can also schedule 
one small job on every machine in our fleet!

Ignoring the netperf results, if you take just the alignment change on 
struct page as a result of cmpxchg16b, I've lost 0.2% of memory from every 
machine in our fleet by selecting slub.  So if we're bound by memory, I've 
just effectively removed 0.2% of machines from our fleet.  That happens to 
be a large number, and it comes at a substantial cost every year.

So although I recommended the lockless changes at the memory cost of 
struct page alignment to improve performance by ~2.3%, it's done with the 
premise that I'm not actually going to be using it, so it's more of a 
recommendation for desktops and small systems where others have shown slub 
is better on benchmarks like kernbench, sysbench, aim9, and hackbench.  

 [ I'd love it if we had sufficient predicates in the x86 Kconfigs to 
   determine which allocator would be appropriate, because it's obvious 
   that slab is light years ahead of the default slub for us. ]

And although I've developed a mutable slab allocator, SLAM, that makes all 
of this irrelevant since it's a drop-in replacement for slab and slub, I 
can't legitimately propose it for inclusion because it lacks the debugging 
capabilities that slub excels in and there's an understanding that Linus 
won't merge another stand-alone allocator until one is removed.  

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 20:24       ` David Rientjes
@ 2011-07-31 20:45         ` Pekka Enberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2011-07-31 20:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Sun, 2011-07-31 at 13:24 -0700, David Rientjes wrote:
> And although slub is definitely heading in the right direction regarding 
> the netperf benchmark, it's still a non-starter for anybody using large 
> NUMA machines for networking performance.  On my 16-core, 4 node, 64GB 
> client/server machines running netperf TCP_RR with various thread counts 
> for 60 seconds each on 3.0:
> 
> 	threads		SLUB		SLAB		diff
> 	 16		76345		74973		- 1.8%
> 	 32		116380		116272		- 0.1%
> 	 48		150509		153703		+ 2.1%
> 	 64		187984		189750		+ 0.9%
> 	 80		216853		224471		+ 3.5%
> 	 96		236640		249184		+ 5.3%
> 	112		256540		275464		+ 7.4%
> 	128		273027		296014		+ 8.4%
> 	144		281441		314791		+11.8%
> 	160		287225		326941		+13.8%

That looks like a pretty nasty scaling issue. David, would it be
possible to see 'perf report' for the 160 case? [ Maybe even 'perf
annotate' for the interesting SLUB functions. ]

On Sun, 2011-07-31 at 13:24 -0700, David Rientjes wrote:
> And although I've developed a mutable slab allocator, SLAM, that makes all 
> of this irrelevant since it's a drop-in replacement for slab and slub, I 
> can't legitimately propose it for inclusion because it lacks the debugging 
> capabilities that slub excels in and there's an understanding that Linus 
> won't merge another stand-alone allocator until one is removed.

Nick tried that with SLQB and it didn't work out. I actually even tried
to maintain it out-of-tree for a while but eventually gave up. So no,
I'm not interested in merging a new allocator either. I would be,
however, interested to see the source code.

			Pekka


* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 20:45         ` Pekka Enberg
@ 2011-07-31 21:55           ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-07-31 21:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Sun, 31 Jul 2011, Pekka Enberg wrote:

> > And although slub is definitely heading in the right direction regarding 
> > the netperf benchmark, it's still a non-starter for anybody using large 
> > NUMA machines for networking performance.  On my 16-core, 4 node, 64GB 
> > client/server machines running netperf TCP_RR with various thread counts 
> > for 60 seconds each on 3.0:
> > 
> > 	threads		SLUB		SLAB		diff
> > 	 16		76345		74973		- 1.8%
> > 	 32		116380		116272		- 0.1%
> > 	 48		150509		153703		+ 2.1%
> > 	 64		187984		189750		+ 0.9%
> > 	 80		216853		224471		+ 3.5%
> > 	 96		236640		249184		+ 5.3%
> > 	112		256540		275464		+ 7.4%
> > 	128		273027		296014		+ 8.4%
> > 	144		281441		314791		+11.8%
> > 	160		287225		326941		+13.8%
> 
> That looks like a pretty nasty scaling issue. David, would it be
> possible to see 'perf report' for the 160 case? [ Maybe even 'perf
> annotate' for the interesting SLUB functions. ]
> 

More interesting than the perf report (which just shows kfree, 
kmem_cache_free, and kmem_cache_alloc dominating) are the statistics 
exported by slub itself; they show the "slab thrashing" issue that I have 
described several times over the past few years.  It's difficult to 
address because it's a result of slub's design.  From the client side of 
160 netperf TCP_RR threads for 60 seconds:

	cache		alloc_fastpath		alloc_slowpath
	kmalloc-256	10937512 (62.8%)	6490753
	kmalloc-1024	17121172 (98.3%)	303547
	kmalloc-4096	5526281			11910454 (68.3%)

	cache		free_fastpath		free_slowpath
	kmalloc-256	15469			17412798 (99.9%)
	kmalloc-1024	11604742 (66.6%)	5819973
	kmalloc-4096	14848			17421902 (99.9%)

With those stats, there's no way that slub will ever be able to compete 
with slab, because it's not optimized for the slowpath.  There are ways to 
mitigate that, like my slab thrashing patchset from a couple of years ago, 
which you tracked for a while and which improved performance 3-4% at the 
cost of an extra increment in the fastpath; but everything else requires 
more memory.  You could preallocate the slabs on the partial list, 
increase the per-node min_partial, increase the order of the slabs 
themselves so you hit the free fastpath much more often, etc., but these 
all come at a considerable cost in memory.

I'm very confident that slub could beat slab on any system if you throw 
enough memory at it because its fastpaths are extremely efficient, but 
there's no business case for that.
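
The percentages above are the fastpath share of total allocations and
frees, read from slub's per-cache statistics files.  A minimal sketch of
how to read them back (assumes CONFIG_SLUB_STATS=y; the kmalloc-256 cache
name is hard-coded for illustration):

#include <stdio.h>

int main(void)
{
        static const char *files[] = { "alloc_fastpath", "alloc_slowpath",
                                       "free_fastpath",  "free_slowpath" };
        unsigned long long v[4] = { 0 };
        char path[128];
        int i;

        for (i = 0; i < 4; i++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/kernel/slab/kmalloc-256/%s", files[i]);
                f = fopen(path, "r");
                /* the first number in the file is the total; a per-cpu
                 * breakdown may follow and is ignored here */
                if (!f || fscanf(f, "%llu", &v[i]) != 1)
                        return 1;
                fclose(f);
        }

        printf("alloc fastpath: %.1f%%  free fastpath: %.1f%%\n",
               v[0] + v[1] ? 100.0 * v[0] / (v[0] + v[1]) : 0.0,
               v[2] + v[3] ? 100.0 * v[2] / (v[2] + v[3]) : 0.0);
        return 0;
}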

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 17:39       ` Andi Kleen
@ 2011-08-01  0:22         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-01  0:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Pekka Enberg, cl, akpm, rientjes, hughd,
	linux-kernel, linux-mm, kosaki.motohiro, yinghan

On Sun, 31 Jul 2011 10:39:58 -0700
Andi Kleen <andi@firstfloor.org> wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Sat, Jul 30, 2011 at 8:27 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Do we allocate the page map array sufficiently aligned that we
> >> actually don't ever have the case of straddling a cacheline? I didn't
> >> check.
> >
> > Oh, and another thing worth checking: did somebody actually check the
> > timings for:
> 
> I would like to see a followon patch that moves the mem_cgroup
> pointer back into struct page. Copying some mem_cgroup people.
> 

A very big change is planned for the future: it will take the memory
usage of page_cgroup from 32 bytes down to 8 bytes.

A small change, moving page_cgroup->mem_cgroup into struct page, may make
sense. But... IIUC, there is another user of such a field, the blkio
cgroup. (They planned to add page_cgroup->blkio_cgroup.)

So, my idea is to add a

	page->owner

field and encode it in some way. For example, if we can encode it as

	|owner_flags | blkio_id | | memcg_id|

this will work. (I'm not sure how the performance will be.)
And we can reduce the size of page_cgroup from 32 to 24 (or 16) bytes.
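
A sketch of one way such an encoding could look (the field widths and
helper names here are hypothetical, purely to illustrate the packing
being proposed):

/* Hypothetical page->owner encoding: flags | blkio id | memcg id packed
 * into one unsigned long.  Widths are illustrative, not from any patch. */
#define OWNER_MEMCG_BITS        16
#define OWNER_BLKIO_BITS        16
#define OWNER_MEMCG_MASK        ((1UL << OWNER_MEMCG_BITS) - 1)
#define OWNER_BLKIO_MASK        ((1UL << OWNER_BLKIO_BITS) - 1)

static inline unsigned long owner_encode(unsigned long flags,
                                         unsigned int blkio_id,
                                         unsigned int memcg_id)
{
        return (flags << (OWNER_BLKIO_BITS + OWNER_MEMCG_BITS)) |
               ((unsigned long)(blkio_id & OWNER_BLKIO_MASK) << OWNER_MEMCG_BITS) |
               (memcg_id & OWNER_MEMCG_MASK);
}

static inline unsigned int owner_memcg_id(unsigned long owner)
{
        return owner & OWNER_MEMCG_MASK;
}

static inline unsigned int owner_blkio_id(unsigned long owner)
{
        return (owner >> OWNER_MEMCG_BITS) & OWNER_BLKIO_MASK;
}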

In this usage, page->owner will only be required when CGROUP is used, so
a small machine will not need to increase the size of struct page.

If you increase the size of 'struct page', memcg will try to make use of
the field.

But we currently have some big patches pending (dirty_ratio etc...), so
moving the pointer may take longer than expected.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 21:55           ` David Rientjes
@ 2011-08-01  5:08             ` Pekka Enberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2011-08-01  5:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Sun, 2011-07-31 at 14:55 -0700, David Rientjes wrote:
> On Sun, 31 Jul 2011, Pekka Enberg wrote:
> 
> > > And although slub is definitely heading in the right direction regarding 
> > > the netperf benchmark, it's still a non-starter for anybody using large 
> > > NUMA machines for networking performance.  On my 16-core, 4 node, 64GB 
> > > client/server machines running netperf TCP_RR with various thread counts 
> > > for 60 seconds each on 3.0:
> > > 
> > > 	threads		SLUB		SLAB		diff
> > > 	 16		76345		74973		- 1.8%
> > > 	 32		116380		116272		- 0.1%
> > > 	 48		150509		153703		+ 2.1%
> > > 	 64		187984		189750		+ 0.9%
> > > 	 80		216853		224471		+ 3.5%
> > > 	 96		236640		249184		+ 5.3%
> > > 	112		256540		275464		+ 7.4%
> > > 	128		273027		296014		+ 8.4%
> > > 	144		281441		314791		+11.8%
> > > 	160		287225		326941		+13.8%
> > 
> > That looks like a pretty nasty scaling issue. David, would it be
> > possible to see 'perf report' for the 160 case? [ Maybe even 'perf
> > annotate' for the interesting SLUB functions. ] 
> 
> More interesting than the perf report (which just shows kfree, 
> kmem_cache_free, kmem_cache_alloc dominating) is the statistics that are 
> exported by slub itself, it shows the "slab thrashing" issue that I 
> described several times over the past few years.  It's difficult to 
> address because it's a result of slub's design.  From the client side of 
> 160 netperf TCP_RR threads for 60 seconds:
> 
> 	cache		alloc_fastpath		alloc_slowpath
> 	kmalloc-256	10937512 (62.8%)	6490753
> 	kmalloc-1024	17121172 (98.3%)	303547
> 	kmalloc-4096	5526281			11910454 (68.3%)
> 
> 	cache		free_fastpath		free_slowpath
> 	kmalloc-256	15469			17412798 (99.9%)
> 	kmalloc-1024	11604742 (66.6%)	5819973
> 	kmalloc-4096	14848			17421902 (99.9%)
> 
> With those stats, there's no way that slub will even be able to compete 
> with slab because it's not optimized for the slowpath.

Is the slowpath being hit more often with 160 vs 16 threads? As I said,
the problem you mentioned looks like a *scaling issue* to me which is
actually somewhat surprising. I knew that the slowpaths were slow but I
haven't seen this sort of data before.

I snipped the 'SLUB can never compete with SLAB' part because I'm
frankly more interested in raw data I can analyse myself. I'm hoping to
the per-CPU partial list patch queued for v3.2 soon and I'd be
interested to know how much I can expect that to help.

			Pekka


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-01  5:08             ` Pekka Enberg
@ 2011-08-01 10:02               ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-01 10:02 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Mon, 1 Aug 2011, Pekka Enberg wrote:

> > More interesting than the perf report (which just shows kfree, 
> > kmem_cache_free, kmem_cache_alloc dominating) is the statistics that are 
> > exported by slub itself, it shows the "slab thrashing" issue that I 
> > described several times over the past few years.  It's difficult to 
> > address because it's a result of slub's design.  From the client side of 
> > 160 netperf TCP_RR threads for 60 seconds:
> > 
> > 	cache		alloc_fastpath		alloc_slowpath
> > 	kmalloc-256	10937512 (62.8%)	6490753
> > 	kmalloc-1024	17121172 (98.3%)	303547
> > 	kmalloc-4096	5526281			11910454 (68.3%)
> > 
> > 	cache		free_fastpath		free_slowpath
> > 	kmalloc-256	15469			17412798 (99.9%)
> > 	kmalloc-1024	11604742 (66.6%)	5819973
> > 	kmalloc-4096	14848			17421902 (99.9%)
> > 
> > With those stats, there's no way that slub will even be able to compete 
> > with slab because it's not optimized for the slowpath.
> 
> Is the slowpath being hit more often with 160 vs 16 threads?

Here's the same testing environment with CONFIG_SLUB_STATS for 16 threads 
instead of 160:

	cache		alloc_fastpath		alloc_slowpath
	kmalloc-256	4263275 (91.1%)		417445
	kmalloc-1024	4636360	(99.1%)		42091
	kmalloc-4096	2570312	(54.4%)		2155946

	cache		free_fastpath		free_slowpath
	kmalloc-256	210115			4470604 (95.5%)
	kmalloc-1024	3579699	(76.5%)		1098764
	kmalloc-4096	67616			4658678 (98.6%)

Keep in mind that this is a default slub configuration, so kmalloc-256 has 
order-1 slabs and both kmalloc-1k and kmalloc-4k have order-3 slabs.  If 
those were decreased, the free slowpath would become even worse, and if 
those were increased, the alloc slowpath would become even worse.
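
As a rough illustration of what those orders mean (a Python sketch added
here for illustration, not part of the original mail; it assumes 4KB pages
and no per-object debug overhead):

	PAGE = 4096
	print((PAGE << 1) // 256)    # kmalloc-256 at order-1:  32 objects per slab
	print((PAGE << 3) // 1024)   # kmalloc-1024 at order-3: 32 objects per slab
	print((PAGE << 3) // 4096)   # kmalloc-4096 at order-3:  8 objects per slab

The fewer objects a slab holds, the more often an alloc or free has to
cross a slab boundary and fall into the slowpath.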

I could probably get better numbers for 160 threads here if I let the free 
slowpath fall off the charts for kmalloc-256 and kmalloc-4k (which 
wouldn't be that bad, it's already hit 99.9% of the time) and made the alloc 
slowpath much easier by allocating order-0 slabs.  It depends on how often 
we free to a partial slab, but it's a pointless exercise since users won't 
tune their slab allocator settings for specific caches or each workload.

With regard to kmalloc-256 and kmalloc-4k in the 16-thread experiment, 
the lion's share of the allocation and free fastpath usage comes from the 
cpu taking the networking irq, whereas for kmalloc-1k, the lion's share of 
the free slowpath usage comes from that cpu.

> As I said,
> the problem you mentioned looks like a *scaling issue* to me which is
> actually somewhat surprising. I knew that the slowpaths were slow but I
> haven't seen this sort of data before.
> 

Well, shoot, I wrote a patchset for it and presented similar data two 
years ago: https://lkml.org/lkml/2009/3/30/14 (back then, kmalloc-2k was 
part of the culprit and now it's kmalloc-4k).  Although I agree that we 
don't want to rely on the heuristics that I created in that patchset for 
things like partial list ordering and it's probably not great to have an 
increment on a kmem_cache_cpu variable in the allocation fastpath, I still 
strongly advocate for some logic that only picks off a partial slab from 
while holding the per-node list_lock when it has a certain threshold of 
free objects, otherwise we keep pulling a partial slab that may have one 
object free and performance suffers.  That logic is part of the patchset 
that I proposed back then and it helped performance, but that still comes 
at the cost of increased memory because we'd be allocating new slabs (and 
potentially order-3 as seen above) instead of utilizing sufficient partial 
slabs when the number of object allocations is low.

I'm thinking this is part of the reason that Nick really advocated for 
optimizing for frees on remote cpus in slqb as a fundamental principle of 
the allocator's design.

> I snipped the 'SLUB can never compete with SLAB' part because I'm
> frankly more interested in raw data I can analyse myself. I'm hoping to
> get the per-CPU partial list patch queued for v3.2 soon and I'd be
> interested to know how much I can expect that to help.
> 

See my comment above about having no doubt that you can improve the 
performance of slub by throwing more memory in its direction; that is part 
of what the per-cpu partial list patchset does.

Christoph posted it as an RFC and listed a few significant disadvantages 
to that approach, but I'm still happy to review it and see what can come 
of it.

From what I remember, though, each per-cpu partial list had a min_partial 
of half of what it currently is per-node.  The testing environment I've 
been using here, as stated, is two 16-core, 4-node systems acting as 
netperf client and server.  kmalloc-256 currently has a min_partial of 
8, and both kmalloc-1k and kmalloc-4k have a min_partial of 10 in the 
current design of per-node partial lists, so that means we keep at minimum 
(absent kmem_cache_shrink() or reclaim) 8*4 kmalloc-256, 10*4 kmalloc-1k, 
and 10*4 kmalloc-4k empty slabs on the partial lists for later use on each 
of these systems.  With the per-cpu partial lists the way I remember it, 
that would become 4*16 kmalloc-256, 5*16 kmalloc-1k, and 5*16 kmalloc-4k 
empty slabs on the partial lists.  So now we've doubled the amount of 
memory we've reserved for the partial lists, so yeah, I'd expect better 
performance as a result of using (4*16 - 8*4) more order-1 slabs and 2 * 
(5*16 - 10*4) more order-3 slabs, about 700 pages for just those two 
caches systemwide.
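
As a back-of-the-envelope check of that page count (a Python sketch added
for illustration, not part of the original mail; it assumes 4KB pages, so
an order-1 slab is 2 pages and an order-3 slab is 8 pages):

	order1_slabs = 4 * 16 - 8 * 4          # extra kmalloc-256 slabs
	order3_slabs = 2 * (5 * 16 - 10 * 4)   # extra kmalloc-1k + kmalloc-4k slabs
	extra_pages = order1_slabs * 2 + order3_slabs * 8
	print(extra_pages)                     # -> 704, i.e. roughly 700 pages
	print(extra_pages * 4096 / 2**20)      # -> about 2.75 MB systemwide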

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-07-31 21:55           ` David Rientjes
@ 2011-08-01 12:06             ` Pekka Enberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2011-08-01 12:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Mon, Aug 1, 2011 at 12:55 AM, David Rientjes <rientjes@google.com> wrote:
> I'm very confident that slub could beat slab on any system if you throw
> enough memory at it because its fastpaths are extremely efficient, but
> there's no business case for that.

Btw, I haven't measured this recently but in my testing, SLAB has
pretty much always used more memory than SLUB. So 'throwing more
memory at the problem' is definitely a reasonable approach for SLUB.

                        Pekka

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-01 10:02               ` David Rientjes
@ 2011-08-01 12:45                 ` Pekka Enberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2011-08-01 12:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

Hi David,

On Mon, Aug 1, 2011 at 1:02 PM, David Rientjes <rientjes@google.com> wrote:
> Here's the same testing environment with CONFIG_SLUB_STATS for 16 threads
> instead of 160:

[snip]

Looking at the data (in slightly reorganized form):

  alloc
  =====

    16 threads:

      cache           alloc_fastpath          alloc_slowpath
      kmalloc-256     4263275 (91.1%)         417445   (8.9%)
      kmalloc-1024    4636360 (99.1%)         42091    (0.9%)
      kmalloc-4096    2570312 (54.4%)         2155946  (45.6%)

    160 threads:

      cache           alloc_fastpath          alloc_slowpath
      kmalloc-256     10937512 (62.8%)        6490753  (37.2%)
      kmalloc-1024    17121172 (98.3%)        303547   (1.7%)
      kmalloc-4096    5526281  (31.7%)        11910454 (68.3%)

  free
  ====

    16 threads:

      cache           free_fastpath           free_slowpath
      kmalloc-256     210115   (4.5%)         4470604  (95.5%)
      kmalloc-1024    3579699  (76.5%)        1098764  (23.5%)
      kmalloc-4096    67616    (1.4%)         4658678  (98.6%)

    160 threads:
      cache           free_fastpath           free_slowpath
      kmalloc-256     15469    (0.1%)         17412798 (99.9%)
      kmalloc-1024    11604742 (66.6%)        5819973  (33.4%)
      kmalloc-4096    14848    (0.1%)         17421902 (99.9%)
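
(The percentages above are simply each counter divided by the per-cache
fastpath + slowpath total; a Python sketch added for illustration, not
part of the original mail, e.g. for kmalloc-4096 at 160 threads:)

	alloc_fast, alloc_slow = 5526281, 11910454
	free_fast, free_slow = 14848, 17421902
	print(100.0 * alloc_fast / (alloc_fast + alloc_slow))  # -> ~31.7 (%)
	print(100.0 * free_slow / (free_fast + free_slow))     # -> ~99.9 (%)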

it's pretty sad to see how SLUB alloc fastpath utilization drops so
dramatically. Free fastpath utilization isn't all that great with 160
threads either but it seems to me that most of the performance
regression compared to SLAB still comes from the alloc paths.

I guess the problem here is that __slab_free() happens on a remote CPU
which puts the object onto the 'struct page' freelist, which effectively means
we're unable to recycle free'd objects. As the number of concurrent
threads increase, we simply drain out the fastpath freelists more
quickly. Did I understand the problem correctly?

If that's really happening, I'm still a bit puzzled why we're hitting the
slowpath so much. I'd assume that __slab_alloc() would simply reload the
'struct page' freelist once the per-cpu freelist is empty.  Why is that
not happening? I see __slab_alloc() does deactivate_slab() upon
node_match() failure. What kind of ALLOC_NODE_MISMATCH stats are you
seeing?

                        Pekka

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-01 12:06             ` Pekka Enberg
@ 2011-08-01 15:55               ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-08-01 15:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm


The future plans that I have for performance improvements are:

1. The percpu partial lists.

The min_partial settings are halved by this approach so that there won't be
any excessive memory usage. Pages on per cpu partial lists are frozen and
this means that the __slab_free path can avoid taking node locks for a
page that is cached by another processor. This causes another significant
performance gain in hackbench of up to 20%. The remaining work here is to
fine-tune the approach and clean up the patchset.

2. per cpu full lists.

These will not be specific to a particular slab cache but shared among
all of them. This will reduce the need to keep empty slab pages on the
per node partial lists and therefore also reduce memory consumption.

The per cpu full lists will be essentially a caching layer for the
page allocator and will make slab acquisition and release as fast
as the slub fastpath for alloc and free (it uses the same
this_cpu_cmpxchg_double based approach). I basically gave up on
fixing up the page allocator fastpath after trying various approaches
over the last weeks. Maybe the caching layer can be made available
for other kernel subsystems that need fast page access too.

The scaling issues that are left over are then those caused by

1. The per node lock taken for the partial lists per node.
   This can be controlled by enlarging the per cpu partial lists.

2. The necessity to go to the page allocator.
   This will be tunable by configuring the caching layer.

3. Bouncing cachelines for __remote_free if multiple processors
   enter __slab_free for the same page.





^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-01 12:45                 ` Pekka Enberg
@ 2011-08-02  2:43                   ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-02  2:43 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Mon, 1 Aug 2011, Pekka Enberg wrote:

> Looking at the data (in slightly reorganized form):
> 
>   alloc
>   =====
> 
>     16 threads:
> 
>       cache           alloc_fastpath          alloc_slowpath
>       kmalloc-256     4263275 (91.1%)         417445   (8.9%)
>       kmalloc-1024    4636360 (99.1%)         42091    (0.9%)
>       kmalloc-4096    2570312 (54.4%)         2155946  (45.6%)
> 
>     160 threads:
> 
>       cache           alloc_fastpath          alloc_slowpath
>       kmalloc-256     10937512 (62.8%)        6490753  (37.2%)
>       kmalloc-1024    17121172 (98.3%)        303547   (1.7%)
>       kmalloc-4096    5526281  (31.7%)        11910454 (68.3%)
> 
>   free
>   ====
> 
>     16 threads:
> 
>       cache           free_fastpath           free_slowpath
>       kmalloc-256     210115   (4.5%)         4470604  (95.5%)
>       kmalloc-1024    3579699  (76.5%)        1098764  (23.5%)
>       kmalloc-4096    67616    (1.4%)         4658678  (98.6%)
> 
>     160 threads:
>       cache           free_fastpath           free_slowpath
>       kmalloc-256     15469    (0.1%)         17412798 (99.9%)
>       kmalloc-1024    11604742 (66.6%)        5819973  (33.4%)
>       kmalloc-4096    14848    (0.1%)         17421902 (99.9%)
> 
> it's pretty sad to see how SLUB alloc fastpath utilization drops so
> dramatically. Free fastpath utilization isn't all that great with 160
> threads either but it seems to me that most of the performance
> regression compared to SLAB still comes from the alloc paths.
> 

It's the opposite: the cumulative effect of the free slowpath is more 
costly in terms of latency than the alloc slowpath because it occurs at a 
greater frequency; the pattern that I described as "slab thrashing" before 
causes a single free to a full slab, manipulation to get it back on the 
partial list, then the alloc slowpath grabs it for a single allocation, 
and requires another partial slab on the next alloc.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-01 12:06             ` Pekka Enberg
@ 2011-08-02  4:05               ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-02  4:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, hughd,
	linux-kernel, linux-mm

On Mon, 1 Aug 2011, Pekka Enberg wrote:

> Btw, I haven't measured this recently but in my testing, SLAB has
> pretty much always used more memory than SLUB. So 'throwing more
> memory at the problem' is definitely a reasonable approach for SLUB.
> 

Yes, slub _did_ use more memory than slab until the alignment of 
struct page.  That cost an additional 128MB on each of these 64GB 
machines, while the total slab usage on the client machine systemwide is 
~75MB while running netperf TCP_RR with 160 threads.
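
For reference, the 128MB figure falls out of simple arithmetic (a Python
sketch added for illustration, not part of the original mail; it assumes
4KB pages and 8 extra bytes of alignment padding per struct page):

	pages = 64 * 2**30 // 4096         # struct pages on a 64GB machine
	print(pages * 8 // 2**20)          # -> 128 (MB of extra struct page space)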

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-02  4:05               ` David Rientjes
@ 2011-08-02 14:15                 ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-08-02 14:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Mon, 1 Aug 2011, David Rientjes wrote:

> On Mon, 1 Aug 2011, Pekka Enberg wrote:
>
> > Btw, I haven't measured this recently but in my testing, SLAB has
> > pretty much always used more memory than SLUB. So 'throwing more
> > memory at the problem' is definitely a reasonable approach for SLUB.
> >
>
> Yes, slub _did_ use more memory than slab until the alignment of
> struct page.  That cost an additional 128MB on each of these 64GB
> machines, while the total slab usage on the client machine systemwide is
> ~75MB while running netperf TCP_RR with 160 threads.

I guess that calculation did not include metadata structures (alien caches
and the NR_CPU arrays in kmem_cache) etc? These are particularly costly on SLAB.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-02 14:15                 ` Christoph Lameter
@ 2011-08-02 16:24                   ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-02 16:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Tue, 2 Aug 2011, Christoph Lameter wrote:

> > Yes, slub _did_ use more memory than slab until the alignment of
> > struct page.  That cost an additional 128MB on each of these 64GB
> > machines, while the total slab usage on the client machine systemwide is
> > ~75MB while running netperf TCP_RR with 160 threads.
> 
> I guess that calculation did not include metadata structures (alien caches
> and the NR_CPU arrays in kmem_cache) etc? These are particularly costly on SLAB.
> 

It certainly is costly on slab, but that 75MB number is from a casual 
observation of grep Slab /proc/meminfo while running the benchmark.  For 
slub, that turns into ~55MB.  The true slub usage, though, includes the 
struct page alignment for cmpxchg16b which added 128MB of padding into its 
memory usage even though it appears to be unattributed to slub.  A casual 
grep MemFree /proc/meminfo reveals the lost 100MB for the slower 
allocator, in this case.  And the per-cpu partial lists will add even 
more slab usage for slub, so this is where my "throwing more memory 
at slub to get better performance" came from.  I understand that this is a 
large NUMA machine, though, and the cost of slub may be substantially 
lower on smaller machines.

If you look through the various arch defconfigs, you'll see that we 
actually do a pretty good job of enabling CONFIG_SLAB for large systems.  
I wish we had a clear dividing line in the x86 kconfig that would at least 
guide users toward one allocator over the other, though; otherwise they 
receive little help.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-02 16:24                   ` David Rientjes
@ 2011-08-02 16:36                     ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-08-02 16:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Tue, 2 Aug 2011, David Rientjes wrote:

> allocator, in this case.  And the per-cpu partial list will add even
> additional slab usage for slub, so this is where my "throwing more memory
> at slub to get better performance" came from.  I understand that this is a
> large NUMA machine, though, and the cost of slub may be substantially
> lower on smaller machines.

The per cpu partial lists only add the need for more memory if other
processors have to allocate new pages because they do not have enough
partial slab pages to satisfy their needs. That can be tuned by a cap on
objects.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-02 16:36                     ` Christoph Lameter
@ 2011-08-02 20:02                       ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-02 20:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Tue, 2 Aug 2011, Christoph Lameter wrote:

> The per cpu partial lists only add the need for more memory if other
> processors have to allocate new pages because they do not have enough
> partial slab pages to satisfy their needs. That can be tuned by a cap on
> objects.
> 

The netperf benchmark isn't representative of a heavy slab-consuming 
workload; I routinely run jobs on these machines that use 20 times the 
amount of slab.  From what I saw in the earlier posting of the per-cpu 
partial list patch, the min_partial value is set to half of what it was 
previously for a per-node partial list.  Since these are 16-core, 4-node 
systems, that would mean that after a kmem_cache_shrink() on a cache that 
leaves empty slabs on the partial lists, we've doubled the memory for 
slub's partial lists systemwide.
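
A small worked example of that doubling (a Python sketch added for
illustration, not part of the original mail; it assumes the halved per-cpu
min_partial described earlier):

	nodes, cpus, min_partial = 4, 16, 10   # e.g. kmalloc-1k or kmalloc-4k
	print(nodes * min_partial)             # -> 40 slabs reserved per-node today
	print(cpus * (min_partial // 2))       # -> 80 slabs reserved with per-cpu lists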

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-02 20:02                       ` David Rientjes
@ 2011-08-03 14:09                         ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-08-03 14:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Tue, 2 Aug 2011, David Rientjes wrote:

> On Tue, 2 Aug 2011, Christoph Lameter wrote:
>
> > The per cpu partial lists only add the need for more memory if other
> > processors have to allocate new pages because they do not have enough
> > partial slab pages to satisfy their needs. That can be tuned by a cap on
> > objects.
> >
>
> The netperf benchmark isn't representative of a heavy slab consuming
> workload, I routinely run jobs on these machines that use 20 times the
> amount of slab.  From what I saw in the earlier posting of the per-cpu
> partial list patch, the min_partial value is set to half of what it was
> previously as a per-node partial list.  Since these are 16-core, 4 node
> systems, that would mean that after a kmem_cache_shrink() on a cache that
> leaves empty slab on the partial lists that we've doubled the memory for
> slub's partial lists systemwide.

Cutting down the potential number of empty slabs that we might possibly
keep around because we have no partial slabs per node increases memory
usage?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1
  2011-08-03 14:09                         ` Christoph Lameter
@ 2011-08-08 20:04                           ` David Rientjes
  -1 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2011-08-08 20:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, hughd, linux-kernel,
	linux-mm

On Wed, 3 Aug 2011, Christoph Lameter wrote:

> > The netperf benchmark isn't representative of a heavy slab consuming
> > workload, I routinely run jobs on these machines that use 20 times the
> > amount of slab.  From what I saw in the earlier posting of the per-cpu
> > partial list patch, the min_partial value is set to half of what it was
> > previously as a per-node partial list.  Since these are 16-core, 4 node
> > systems, that would mean that after a kmem_cache_shrink() on a cache that
> > leaves empty slab on the partial lists that we've doubled the memory for
> > slub's partial lists systemwide.
> 
> Cutting down the potential number of empty slabs that we might possible
> keep around because we have no partial slabs per node increases memory
> usage?
> 

You halved the min_partial value, but there are now 16 partial lists on 
these machines because they are per-cpu, instead of the 4 partial lists 
there were when they were per-node.

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-08-08 20:04 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-28 22:47 [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1 Pekka Enberg
2011-07-29 15:04 ` Christoph Lameter
2011-07-29 23:18   ` Andi Kleen
2011-07-30  6:33     ` Eric Dumazet
2011-07-31 18:50   ` David Rientjes
2011-07-31 20:24     ` David Rientjes
2011-07-31 20:45       ` Pekka Enberg
2011-07-31 21:55         ` David Rientjes
2011-08-01  5:08           ` Pekka Enberg
2011-08-01 10:02             ` David Rientjes
2011-08-01 12:45               ` Pekka Enberg
2011-08-02  2:43                 ` David Rientjes
2011-08-01 12:06           ` Pekka Enberg
2011-08-01 15:55             ` Christoph Lameter
2011-08-02  4:05             ` David Rientjes
2011-08-02 14:15               ` Christoph Lameter
2011-08-02 16:24                 ` David Rientjes
2011-08-02 16:36                   ` Christoph Lameter
2011-08-02 20:02                     ` David Rientjes
2011-08-03 14:09                       ` Christoph Lameter
2011-08-08 20:04                         ` David Rientjes
2011-07-30 18:27 ` Linus Torvalds
2011-07-30 18:32   ` Linus Torvalds
2011-07-31 17:39     ` Andi Kleen
2011-08-01  0:22       ` KAMEZAWA Hiroyuki
2011-07-31 18:11     ` David Rientjes
