linux-kernel.vger.kernel.org archive mirror
* [PATCH] alloc_percpu() fails to allocate percpu data
@ 2008-02-21 18:00 Eric Dumazet
  2008-02-21 22:26 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Eric Dumazet @ 2008-02-21 18:00 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: linux kernel, netdev, Christoph Lameter, Zhang, Yanmin

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

Some oprofile results obtained while using tbench on a 2x2 cpu machine
were very surprising.

For example, the loopback_xmit() function was using a high number of cpu
cycles to perform the statistic updates, which are supposed to be really
cheap since they use percpu data:

        pcpu_lstats = netdev_priv(dev);
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
        lb_stats->packets++;  /* HERE : serious contention */
        lb_stats->bytes += skb->len;


struct pcpu_lstats is a small structure containing two longs. It appears
that on my 32-bit platform, alloc_percpu(8) allocates a single cache line,
instead of giving each cpu a separate cache line.
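
For reference, this is roughly how the current mm/allocpercpu.c scheme ends
up sharing a line (simplified sketch; the exact declarations in the tree
differ a bit, so treat this as an approximation):

	/*
	 * alloc_percpu() kmallocs one small object per possible cpu and
	 * keeps the pointers in a ptrs[] array.  With SLUB's 8-byte
	 * minimum allocation, kmalloc_node(8, ...) for cpu 0 and cpu 1
	 * can return addresses inside the same hardware cache line, so
	 * both cpus' counters false-share that line.
	 */
	struct percpu_data {
		void *ptrs[NR_CPUS];	/* one kmalloc'ed object per cpu */
	};

	struct pcpu_lstats {		/* two longs : 8 bytes on 32-bit */
		unsigned long packets;
		unsigned long bytes;
	};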

Using the following patch gave me an impressive boost in various benchmarks
(6% in tbench). (All percpu_counters hit this bug too.)

The long-term fix (ie >= 2.6.26) would be to let each CPU allocate its own
block of memory, so that we don't need to round up sizes to L1_CACHE_BYTES,
or to merge the SGI stuff, of course...

Note: SLUB vs SLAB is important here to *show* the improvement, since
they don't have the same minimum allocation size (8 bytes vs 32 bytes).
This could very well explain regressions some people reported when they
switched to SLUB.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 mm/allocpercpu.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletion(-)



[-- Attachment #2: percpu_populate.patch --]
[-- Type: text/plain, Size: 1229 bytes --]

diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
index 7e58322..b0012e2 100644
--- a/mm/allocpercpu.c
+++ b/mm/allocpercpu.c
@@ -6,6 +6,10 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
 /**
  * percpu_depopulate - depopulate per-cpu data for given cpu
  * @__pdata: per-cpu data to depopulate
@@ -52,6 +56,11 @@ void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
 	struct percpu_data *pdata = __percpu_disguise(__pdata);
 	int node = cpu_to_node(cpu);
 
+	/*
+	 * We should make sure each CPU gets private memory.
+	 */
+	size = roundup(size, cache_line_size());
+
 	BUG_ON(pdata->ptrs[cpu]);
 	if (node_online(node))
 		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
@@ -98,7 +107,11 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask);
  */
 void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
 {
-	void *pdata = kzalloc(nr_cpu_ids * sizeof(void *), gfp);
+	/*
+	 * We allocate whole cache lines to avoid false sharing
+	 */
+	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
+	void *pdata = kzalloc(sz, gfp);
 	void *__pdata = __percpu_disguise(pdata);
 
 	if (unlikely(!pdata))


* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-21 18:00 [PATCH] alloc_percpu() fails to allocate percpu data Eric Dumazet
@ 2008-02-21 22:26 ` Peter Zijlstra
  2008-02-23  9:23   ` Nick Piggin
  2008-02-23  8:04 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2008-02-21 22:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Andrew Morton, linux kernel, netdev,
	Christoph Lameter, Zhang, Yanmin


On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> Some oprofile results obtained while using tbench on a 2x2 cpu machine 
> were very surprising.
> 
> For example, loopback_xmit() function was using high number of cpu 
> cycles to perform the statistic updates, supposed to be real cheap
> since they use percpu data
> 
>         pcpu_lstats = netdev_priv(dev);
>         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
>         lb_stats->packets++;  /* HERE : serious contention */
>         lb_stats->bytes += skb->len;
> 
> 
> struct pcpu_lstats is a small structure containing two longs. It
> appears that on my 32bits platform, alloc_percpu(8) allocates a single
> cache line,  instead of giving to each cpu a separate cache line.
> 
> Using the following patch gave me impressive boost in various
> benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> 
> Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> own block of memory, so that we dont need to roudup sizes to
> L1_CACHE_BYTES, or merging the SGI stuff of course...
> 
> Note : SLUB vs SLAB is important here to *show* the improvement, since
> they dont have the same minimum allocation sizes (8 bytes vs 32
> bytes). This could very well explain regressions some guys reported
> when they switched to SLUB.

I've complained about this false sharing as well, so until we get the
new and improved percpu allocators,

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
>  mm/allocpercpu.c |   15 ++++++++++++++-
>  1 files changed, 14 insertions(+), 1 deletion(-)
> 
> 
> plain text document attachment (percpu_populate.patch)
> diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
> index 7e58322..b0012e2 100644
> --- a/mm/allocpercpu.c
> +++ b/mm/allocpercpu.c
> @@ -6,6 +6,10 @@
>  #include <linux/mm.h>
>  #include <linux/module.h>
>  
> +#ifndef cache_line_size
> +#define cache_line_size()	L1_CACHE_BYTES
> +#endif
> +
>  /**
>   * percpu_depopulate - depopulate per-cpu data for given cpu
>   * @__pdata: per-cpu data to depopulate
> @@ -52,6 +56,11 @@ void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
>  	struct percpu_data *pdata = __percpu_disguise(__pdata);
>  	int node = cpu_to_node(cpu);
>  
> +	/*
> +	 * We should make sure each CPU gets private memory.
> +	 */
> +	size = roundup(size, cache_line_size());
> +
>  	BUG_ON(pdata->ptrs[cpu]);
>  	if (node_online(node))
>  		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
> @@ -98,7 +107,11 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask);
>   */
>  void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
>  {
> -	void *pdata = kzalloc(nr_cpu_ids * sizeof(void *), gfp);
> +	/*
> +	 * We allocate whole cache lines to avoid false sharing
> +	 */
> +	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
> +	void *pdata = kzalloc(sz, gfp);
>  	void *__pdata = __percpu_disguise(pdata);
>  
>  	if (unlikely(!pdata))



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-21 18:00 [PATCH] alloc_percpu() fails to allocate percpu data Eric Dumazet
  2008-02-21 22:26 ` Peter Zijlstra
@ 2008-02-23  8:04 ` Andrew Morton
  2008-02-27 19:59 ` Christoph Lameter
  2008-03-11 18:15 ` Mike Snitzer
  3 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2008-02-23  8:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, linux kernel, netdev, Christoph Lameter, Zhang, Yanmin

On Thu, 21 Feb 2008 19:00:03 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:

> +#ifndef cache_line_size
> +#define cache_line_size()	L1_CACHE_BYTES
> +#endif

argh, you made me look.

Really cache_line_size() should be implemented in include/linux/cache.h. 
Then we tromp the stupid private implementations in slob.c and slub.c.

Then we wonder why x86 uses a custom cache_line_size(), but still uses
L1_CACHE_BYTES for its L1_CACHE_ALIGN().

Once we've answered that, we look at your

+	/*
+	 * We should make sure each CPU gets private memory.
+	 */
+	size = roundup(size, cache_line_size());

and wonder whether it should have used L1_CACHE_ALIGN().

I think I'd better stop looking.
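
For reference, the two roundings being contrasted boil down to roughly this
(macro bodies paraphrased from memory, so check include/linux/cache.h before
trusting them):

	/* Compile-time alignment: rounds up to the build-time constant */
	#define L1_CACHE_ALIGN(x)	ALIGN(x, L1_CACHE_BYTES)
	size = L1_CACHE_ALIGN(size);

	/*
	 * What the patch does instead: rounds up to the boot-time
	 * detected line size, which on x86 can be smaller than
	 * L1_CACHE_BYTES.
	 */
	size = roundup(size, cache_line_size());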


* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-21 22:26 ` Peter Zijlstra
@ 2008-02-23  9:23   ` Nick Piggin
  2008-02-27 19:44     ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2008-02-23  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, David S. Miller, Andrew Morton, linux kernel,
	netdev, Christoph Lameter, Zhang, Yanmin

On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> > Some oprofile results obtained while using tbench on a 2x2 cpu machine
> > were very surprising.
> >
> > For example, loopback_xmit() function was using high number of cpu
> > cycles to perform the statistic updates, supposed to be real cheap
> > since they use percpu data
> >
> >         pcpu_lstats = netdev_priv(dev);
> >         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> >         lb_stats->packets++;  /* HERE : serious contention */
> >         lb_stats->bytes += skb->len;
> >
> >
> > struct pcpu_lstats is a small structure containing two longs. It
> > appears that on my 32bits platform, alloc_percpu(8) allocates a single
> > cache line,  instead of giving to each cpu a separate cache line.
> >
> > Using the following patch gave me impressive boost in various
> > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> >
> > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> > own block of memory, so that we dont need to roudup sizes to
> > L1_CACHE_BYTES, or merging the SGI stuff of course...
> >
> > Note : SLUB vs SLAB is important here to *show* the improvement, since
> > they dont have the same minimum allocation sizes (8 bytes vs 32
> > bytes). This could very well explain regressions some guys reported
> > when they switched to SLUB.
>
> I've complained about this false sharing as well, so until we get the
> new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in them:

        if ((flags & SLAB_HWCACHE_ALIGN) &&
                        size > cache_line_size() / 2)
                return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand why they think they know better than the caller.
Things like this are just going to lead to very hard-to-track
performance problems, and possibly correctness problems in rare cases.

There could be another flag for "maybe align".
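
Purely to illustrate the distinction (SLAB_HWCACHE_ALIGN_MAYBE is made up
here, it does not exist in the tree), the calculation could look like:

	/* Strict: always honour the caller's request. */
	if (flags & SLAB_HWCACHE_ALIGN)
		return max_t(unsigned long, align, cache_line_size());

	/* Best effort: keep today's heuristic behind a separate flag. */
	if ((flags & SLAB_HWCACHE_ALIGN_MAYBE) &&
			size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());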



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-23  9:23   ` Nick Piggin
@ 2008-02-27 19:44     ` Christoph Lameter
  2008-03-03  3:14       ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2008-02-27 19:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Eric Dumazet, David S. Miller, Andrew Morton,
	linux kernel, netdev, Zhang, Yanmin

On Sat, 23 Feb 2008, Nick Piggin wrote:

> What I don't understand is why the slab allocators have something like
> this in it:
> 
>         if ((flags & SLAB_HWCACHE_ALIGN) &&
>                         size > cache_line_size() / 2)
>                 return max_t(unsigned long, align, cache_line_size());
> 
> If you ask for HWCACHE_ALIGN, then you should get it. I don't
> understand, why do they think they knows better than the caller?

Tradition.... It irks me as well.

> Things like this are just going to lead to very difficult to track
> performance problems. Possibly correctness problems in rare cases.
> 
> There could be another flag for "maybe align".

SLAB_HWCACHE_ALIGN *is* effectively a maybe-align flag, given the above
code.

If we all agree then we could change this to have must-have semantics? It
has the potential of enlarging objects for small caches.

SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
requirements of the architecture that the kernel is built on. We may be in
for some surprises if we change this.



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-21 18:00 [PATCH] alloc_percpu() fails to allocate percpu data Eric Dumazet
  2008-02-21 22:26 ` Peter Zijlstra
  2008-02-23  8:04 ` Andrew Morton
@ 2008-02-27 19:59 ` Christoph Lameter
  2008-02-27 20:24   ` Andrew Morton
  2008-03-11 18:15 ` Mike Snitzer
  3 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2008-02-27 19:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Andrew Morton, linux kernel, netdev, Zhang,
	Yanmin, travis

Has any decision been made on what to do about this one? Mike or I can
repost the per-cpu allocator against -mm. The fix by Eric could be used
in the interim for 2.6.24?



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-27 19:59 ` Christoph Lameter
@ 2008-02-27 20:24   ` Andrew Morton
  2008-02-27 21:56     ` Christoph Lameter
  2008-03-01 13:53     ` Eric Dumazet
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2008-02-27 20:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: dada1, davem, linux-kernel, netdev, yanmin_zhang, travis

On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Any decision made on what to do about this one? Mike or I can 
> repost the per cpu allocator against mm? The fix by Eric could be used 
> in the interim for 2.6.24?
> 

I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
as I test stuff).

It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
mess sorted out.  If it's a mess - I _think_ it is?




* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-27 20:24   ` Andrew Morton
@ 2008-02-27 21:56     ` Christoph Lameter
  2008-03-01 13:53     ` Eric Dumazet
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2008-02-27 21:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dada1, davem, linux-kernel, netdev, yanmin_zhang, travis

On Wed, 27 Feb 2008, Andrew Morton wrote:

> On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Any decision made on what to do about this one? Mike or I can 
> > repost the per cpu allocator against mm? The fix by Eric could be used 
> > in the interim for 2.6.24?
> > 
> 
> I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
> as I test stuff).

Urgh. You too?
 
> It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
> mess sorted out.  If it's a mess - I _think_ it is?

Well, I tried it when slub first went in and it did not go well. The issue
is that x86 detects the cache line size at bootup, hence cache_line_size().
Most of the other arches have compile-time cache line sizes, hence
L1_CACHE_BYTES. So L1_CACHE_BYTES is the maximum value that
cache_line_size() can take.

What I was attempting to do was to make x86 use one compile-time cache
line size, L1_CACHE_BYTES. That raised objections because space was wasted.
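
Roughly, the two definitions in play look like this (file and field names
recalled from memory, so treat this sketch as approximate rather than
authoritative):

	/* Compile-time worst case, from the x86 cache.h of that era: */
	#define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)
	#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)

	/*
	 * Run-time value: x86 fills in x86_cache_alignment while
	 * identifying the boot CPU, so cache_line_size() may report,
	 * say, 64 even if the kernel was built with a 128-byte
	 * L1_CACHE_BYTES.
	 */
	#define cache_line_size()	(boot_cpu_data.x86_cache_alignment)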


* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-27 20:24   ` Andrew Morton
  2008-02-27 21:56     ` Christoph Lameter
@ 2008-03-01 13:53     ` Eric Dumazet
  1 sibling, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2008-03-01 13:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, davem, linux-kernel, netdev, yanmin_zhang, travis

Andrew Morton wrote:
> On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
>> Any decision made on what to do about this one? Mike or I can 
>> repost the per cpu allocator against mm? The fix by Eric could be used 
>> in the interim for 2.6.24?
>>
> 
> I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
> as I test stuff).
> 
> It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
> mess sorted out.  If it's a mess - I _think_ it is?

Just coming back from holidays, sorry for the delay.

I can provide a patch so that L1_CACHE_BYTES is no longer a compile-time
constant if you want, but I am not sure it is worth the trouble? (And this
is certainly not 2.6.{24|25} stuff :) )

Current situation:

L1_CACHE_BYTES is known at compile time, and can be quite large (128 bytes),
while cache_line_size() gives the real cache line size selected at boot time
given the hardware capabilities.

If L1_CACHE_BYTES is no longer a constant, the compiler will also use plain
divides to compute L1_CACHE_ALIGN().

Maybe uses of L1_CACHE_ALIGN() in fast paths would 'force' us to declare not
only a cache_line_size() but also a cache_line_size_{mask|shift}(), so that
x86 could use:

#define L1_CACHE_ALIGN(x) ((((x) + cache_line_mask()) >> cache_line_shift()) << cache_line_shift())

#define L1_CACHE_BYTES (cache_line_size())

But I am not sure we want to play these games (we must also make sure nothing
in the tree wants a constant L1_CACHE_BYTES, and replace those uses by
SMP_CACHE_BYTES).




* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-27 19:44     ` Christoph Lameter
@ 2008-03-03  3:14       ` Nick Piggin
  2008-03-03  7:48         ` Eric Dumazet
  2008-03-03 19:30         ` Christoph Lameter
  0 siblings, 2 replies; 17+ messages in thread
From: Nick Piggin @ 2008-03-03  3:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Eric Dumazet, David S. Miller, Andrew Morton,
	linux kernel, netdev, Zhang, Yanmin

On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
> On Sat, 23 Feb 2008, Nick Piggin wrote:
> > What I don't understand is why the slab allocators have something like
> > this in it:
> >
> >         if ((flags & SLAB_HWCACHE_ALIGN) &&
> >                         size > cache_line_size() / 2)
> >                 return max_t(unsigned long, align, cache_line_size());
> >
> > If you ask for HWCACHE_ALIGN, then you should get it. I don't
> > understand, why do they think they knows better than the caller?
>
> Tradition.... Its irks me as well.
>
> > Things like this are just going to lead to very difficult to track
> > performance problems. Possibly correctness problems in rare cases.
> >
> > There could be another flag for "maybe align".
>
> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
> code.
>
> If we all agree then we could change this to have must have semantics? It
> has the potential of enlarging objects for small caches.
>
> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
> requirements of the architecture that the kernel is build on. We may be in
> for some surprises if we change this.

I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
If some structures get too bloated for no good reason, then the problem
is not with the slab allocator but with the caller asking for
HWCACHE_ALIGN.




* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-03  3:14       ` Nick Piggin
@ 2008-03-03  7:48         ` Eric Dumazet
  2008-03-03  9:41           ` Nick Piggin
  2008-03-03 19:30         ` Christoph Lameter
  1 sibling, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2008-03-03  7:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Peter Zijlstra, David S. Miller,
	Andrew Morton, linux kernel, netdev, Zhang, Yanmin

Nick Piggin wrote:
> On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
>> On Sat, 23 Feb 2008, Nick Piggin wrote:
>>> What I don't understand is why the slab allocators have something like
>>> this in it:
>>>
>>>         if ((flags & SLAB_HWCACHE_ALIGN) &&
>>>                         size > cache_line_size() / 2)
>>>                 return max_t(unsigned long, align, cache_line_size());
>>>
>>> If you ask for HWCACHE_ALIGN, then you should get it. I don't
>>> understand, why do they think they knows better than the caller?
>> Tradition.... Its irks me as well.
>>
>>> Things like this are just going to lead to very difficult to track
>>> performance problems. Possibly correctness problems in rare cases.
>>>
>>> There could be another flag for "maybe align".
>> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
>> code.
>>
>> If we all agree then we could change this to have must have semantics? It
>> has the potential of enlarging objects for small caches.
>>
>> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
>> requirements of the architecture that the kernel is build on. We may be in
>> for some surprises if we change this.
> 
> I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> If some structures get too bloated for no good reason, then the problem
> is not with the slab allocator but with the caller asking for
> HWCACHE_ALIGN.
> 

HWCACHE_ALIGN is commonly used, even for large structures, because the
processor cache line size on x86 is not known at compile time (it can range
from 32 bytes to 128 bytes).

The problem that the above code is trying to address concerns small objects.

At the time the code using HWCACHE_ALIGN was written, the cache line size
was 32 bytes. Now that we have CPUs with 128-byte cache lines, we would
waste space if SLAB_HWCACHE_ALIGN were honored for small objects.

Some occurrences of SLAB_HWCACHE_ALIGN are certainly not useful; we should
zap them. The last one I removed was the one for "struct flow_cache_entry"
(commit dd5a1843d566911dbb077c4022c4936697495af6: [IPSEC] flow: reorder
"struct flow_cache_entry" and remove SLAB_HWCACHE_ALIGN).



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-03  7:48         ` Eric Dumazet
@ 2008-03-03  9:41           ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2008-03-03  9:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Peter Zijlstra, David S. Miller,
	Andrew Morton, linux kernel, netdev, Zhang, Yanmin

On Monday 03 March 2008 18:48, Eric Dumazet wrote:
> Nick Piggin a écrit :
> > On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
> >> On Sat, 23 Feb 2008, Nick Piggin wrote:
> >>> What I don't understand is why the slab allocators have something like
> >>> this in it:
> >>>
> >>>         if ((flags & SLAB_HWCACHE_ALIGN) &&
> >>>                         size > cache_line_size() / 2)
> >>>                 return max_t(unsigned long, align, cache_line_size());
> >>>
> >>> If you ask for HWCACHE_ALIGN, then you should get it. I don't
> >>> understand, why do they think they knows better than the caller?
> >>
> >> Tradition.... Its irks me as well.
> >>
> >>> Things like this are just going to lead to very difficult to track
> >>> performance problems. Possibly correctness problems in rare cases.
> >>>
> >>> There could be another flag for "maybe align".
> >>
> >> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
> >> code.
> >>
> >> If we all agree then we could change this to have must have semantics?
> >> It has the potential of enlarging objects for small caches.
> >>
> >> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
> >> requirements of the architecture that the kernel is build on. We may be
> >> in for some surprises if we change this.
> >
> > I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> > If some structures get too bloated for no good reason, then the problem
> > is not with the slab allocator but with the caller asking for
> > HWCACHE_ALIGN.
>
> HWCACHE_ALIGN is commonly used, even for large structures, because the
> processor cache line on x86 is not known at compile time (can go from 32
> bytes to 128 bytes).

Sure.


> The problem that above code is trying to address is about small objects.
>
> Because at the time code using HWCACHE_ALIGN was written, cache line size
> was 32 bytes. Now we have CPU with 128 bytes cache lines, we would waste
> space if SLAB_HWCACHE_ALIGN was honored for small objects.

I understand that, but I don't think it is a good reason.
SLAB_HWCACHE_ALIGN should only be specified if it is really needed.
If it is not really needed, it should not be specified. And if it
is, then the allocator should not disregard it.

But let's see. There is a valid case where we want to align to a
power of 2 >= objsize and <= hw cache size. That is if we carefully
pack objects so that we know where cacheline boundaries are and only
take the minimum number of cache misses to access them, but are not
concerned about false sharing. That appears to be what HWCACHE_ALIGN
is for, but SLUB does not really get that right either, because it
drops that alignment restriction completely if the object is <= the
cache line size. It should use the same calculation that SLAB uses.
I would have preferred it to be called something else...

For the case where we want to avoid false sharing, we need a new
SLAB_SMP_ALIGN, which always pads out to cacheline size, but only
for num_possible_cpus() > 1.
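
As a sketch only (SLAB_SMP_ALIGN is not an existing flag), the check could
be something like:

	/* Pad out to a full cache line, but only when there is another
	 * possible cpu to false-share with.
	 */
	if ((flags & SLAB_SMP_ALIGN) && num_possible_cpus() > 1)
		align = max_t(unsigned long, align, cache_line_size());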

That still leaves the problem of how to align kmalloc(). SLAB gives
it HWCACHE_ALIGN by default. Why not do the same for SLUB (which
could be changed if CONFIG_SMALL is set)? That would give a more
consistent allocation pattern, at least (eg. you wouldn't get your
structures suddenly straddling cachelines if you reduce it from 100
bytes to 96 bytes...).

And for kmalloc that requires SMP_ALIGN, I guess it is impossible.
Maybe the percpu allocator could just have its own kmem_cache of
size cache_line_size() and use that for all allocations <= that size.
Then just let the scalemp guys worry about using that wasted padding
for same-CPU allocations ;)
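
A rough sketch of that idea (the cache name and helpers below are made up,
and error handling is elided):

	/* Hypothetical: one kmem_cache whose objects are exactly one cache
	 * line, so every small per-cpu allocation gets a private line by
	 * construction.
	 */
	static struct kmem_cache *pcpu_line_cache;

	void __init pcpu_line_cache_init(void)
	{
		pcpu_line_cache = kmem_cache_create("percpu_line",
				cache_line_size(), cache_line_size(),
				0, NULL);
	}

	static void *pcpu_small_alloc(size_t size, gfp_t gfp, int node)
	{
		if (size <= cache_line_size())
			return kmem_cache_alloc_node(pcpu_line_cache,
						     gfp | __GFP_ZERO, node);
		return kmalloc_node(size, gfp | __GFP_ZERO, node);
	}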

And I guess if there is some other situation where alignment is
required, it could be specified explicitly.

> Some occurences of SLAB_HWCACHE_ALIGN are certainly not usefull, we should
> zap them. Last one I removed was the one for "struct flow_cache_entry" 
> (commit dd5a1843d566911dbb077c4022c4936697495af6 : [IPSEC] flow: reorder
> "struct flow_cache_entry" and remove SLAB_HWCACHE_ALIGN)

Sure.

But in general it isn't always easy to tell what should be aligned
and what should not. If you have a set of smallish objects where you
are likely to do basically random lookups to them and they are not
likely to be in cache, then SLAB_HWCACHE_ALIGN can be a good idea so
that you hit as few cachelines as possible when doing the lookup.



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-03  3:14       ` Nick Piggin
  2008-03-03  7:48         ` Eric Dumazet
@ 2008-03-03 19:30         ` Christoph Lameter
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2008-03-03 19:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Eric Dumazet, David S. Miller, Andrew Morton,
	linux kernel, netdev, Zhang, Yanmin

On Mon, 3 Mar 2008, Nick Piggin wrote:

> I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> If some structures get too bloated for no good reason, then the problem
> is not with the slab allocator but with the caller asking for
> HWCACHE_ALIGN.

Right. There should only be a few select users of this strange flag.
Otherwise, if a certain alignment is wanted, specify it in the alignment
parameter passed to kmem_cache_create().
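
For example, something along these lines (the cache and struct names are
placeholders; the 5-argument kmem_cache_create() of current kernels is
assumed):

	static struct kmem_cache *my_cache;

	void __init my_cache_init(void)
	{
		/* Ask for the alignment explicitly instead of passing
		 * SLAB_HWCACHE_ALIGN, so the allocator cannot silently
		 * relax it for small objects.
		 */
		my_cache = kmem_cache_create("my_cache",
					sizeof(struct my_object),
					L1_CACHE_BYTES, 0, NULL);
	}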



* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-02-21 18:00 [PATCH] alloc_percpu() fails to allocate percpu data Eric Dumazet
                   ` (2 preceding siblings ...)
  2008-02-27 19:59 ` Christoph Lameter
@ 2008-03-11 18:15 ` Mike Snitzer
  2008-03-11 18:41   ` Eric Dumazet
  3 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2008-03-11 18:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Andrew Morton, linux kernel, netdev,
	Christoph Lameter, Zhang, Yanmin, Peter Zijlstra, stable

On 2/21/08, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Some oprofile results obtained while using tbench on a 2x2 cpu machine
>  were very surprising.
>
>  For example, loopback_xmit() function was using high number of cpu
>  cycles to perform
>  the statistic updates, supposed to be real cheap since they use percpu data
>
>         pcpu_lstats = netdev_priv(dev);
>         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
>         lb_stats->packets++;  /* HERE : serious contention */
>         lb_stats->bytes += skb->len;
>
>
>  struct pcpu_lstats is a small structure containing two longs. It appears
>  that on my 32bits platform,
>  alloc_percpu(8) allocates a single cache line,  instead of giving to
>  each cpu a separate
>  cache line.
>
>  Using the following patch gave me impressive boost in various benchmarks
>  ( 6 % in tbench)
>  (all percpu_counters hit this bug too)
>
>  Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
>  block of memory, so that we
>  dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuff of
>  course...
>
>  Note : SLUB vs SLAB is important here to *show* the improvement, since
>  they dont have the same minimum
>  allocation sizes (8 bytes vs 32 bytes).
>  This could very well explain regressions some guys reported when they
>  switched to SLUB.


I see that this fix was committed to mainline as commit
be852795e1c8d3829ddf3cb1ce806113611fa555

The commit didn't "Cc: <stable@kernel.org>", and it doesn't appear to
be queued for 2.6.24.x.  Should it be?

If I understand you correctly, SLAB doesn't create this particular
cache thrashing on 32-bit systems?  Is SLAB OK on other architectures
too?  Can you (or others) comment on the importance of this fix
relative to x86_64 (64-byte cache line) and SLAB?

I'm particularly interested in this given the use of percpu_counters
with the per-bdi write throttling.

Mike


* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-11 18:15 ` Mike Snitzer
@ 2008-03-11 18:41   ` Eric Dumazet
  2008-03-11 19:39     ` Mike Snitzer
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2008-03-11 18:41 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: David S. Miller, Andrew Morton, linux kernel, netdev,
	Christoph Lameter, Zhang, Yanmin, Peter Zijlstra, stable

Mike Snitzer wrote:
> On 2/21/08, Eric Dumazet <dada1@cosmosbay.com> wrote:
>   
>> Some oprofile results obtained while using tbench on a 2x2 cpu machine
>>  were very surprising.
>>
>>  For example, loopback_xmit() function was using high number of cpu
>>  cycles to perform
>>  the statistic updates, supposed to be real cheap since they use percpu data
>>
>>         pcpu_lstats = netdev_priv(dev);
>>         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
>>         lb_stats->packets++;  /* HERE : serious contention */
>>         lb_stats->bytes += skb->len;
>>
>>
>>  struct pcpu_lstats is a small structure containing two longs. It appears
>>  that on my 32bits platform,
>>  alloc_percpu(8) allocates a single cache line,  instead of giving to
>>  each cpu a separate
>>  cache line.
>>
>>  Using the following patch gave me impressive boost in various benchmarks
>>  ( 6 % in tbench)
>>  (all percpu_counters hit this bug too)
>>
>>  Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
>>  block of memory, so that we
>>  dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuffof
>>  course...
>>
>>  Note : SLUB vs SLAB is important here to *show* the improvement, since
>>  they dont have the same minimum
>>  allocation sizes (8 bytes vs 32 bytes).
>>  This could very well explain regressions some guys reported when they
>>  switched to SLUB.
>>     
>
>
> I see that this fix was committed to mainline as commit
> be852795e1c8d3829ddf3cb1ce806113611fa555
>
> The commit didn't "Cc: <stable@kernel.org>", and it doesn't appear to
> be queued for 2.6.24.x.  Should it be?
>
>   
Yes, it should be queued for 2.6.24.x.

> If I understand you correctly, SLAB doesn't create this particular
> cache thrashing on 32bit systems?  Is SLAB ok on other architectures
> too?  Can you (or others) comment on the importance of this fix
> relative to x86_64 (64byte cacheline) and SLAB?
>
>   

The fix is important both for 32-bit and 64-bit kernels, SLAB or SLUB.

SLAB does have this problem, but it is less prevalent than with SLUB, because
these allocators don't have the same minimal allocation size (32 vs 8 bytes).

So with SLUB, it is possible that 8 CPUs share the same 64-byte cache line
to store their percpu counters, while only 2 CPUs can share this same cache
line with the SLAB allocator.

> I'm particularly interested in this given the use of percpu_counters
> with the per bdi write throttling.
>
> Mike
> --
>   





* Re: [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-11 18:41   ` Eric Dumazet
@ 2008-03-11 19:39     ` Mike Snitzer
  2008-03-12  0:18       ` [stable] " Chris Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2008-03-11 19:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Andrew Morton, linux kernel, netdev,
	Christoph Lameter, Zhang, Yanmin, Peter Zijlstra, stable

On 3/11/08, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Mike Snitzer a écrit :
>
> > On 2/21/08, Eric Dumazet <dada1@cosmosbay.com> wrote:
>  >
>  >> Some oprofile results obtained while using tbench on a 2x2 cpu machine
>  >>  were very surprising.
>  >>
>  >>  For example, loopback_xmit() function was using high number of cpu
>  >>  cycles to perform
>  >>  the statistic updates, supposed to be real cheap since they use percpu data
>  >>
>  >>         pcpu_lstats = netdev_priv(dev);
>  >>         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
>  >>         lb_stats->packets++;  /* HERE : serious contention */
>  >>         lb_stats->bytes += skb->len;
>  >>
>  >>
>  >>  struct pcpu_lstats is a small structure containing two longs. It appears
>  >>  that on my 32bits platform,
>  >>  alloc_percpu(8) allocates a single cache line,  instead of giving to
>  >>  each cpu a separate
>  >>  cache line.
>  >>
>  >>  Using the following patch gave me impressive boost in various benchmarks
>  >>  ( 6 % in tbench)
>  >>  (all percpu_counters hit this bug too)
>  >>
>  >>  Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
>  >>  block of memory, so that we
>  >>  dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuffof
>  >>  course...
>  >>
>  >>  Note : SLUB vs SLAB is important here to *show* the improvement, since
>  >>  they dont have the same minimum
>  >>  allocation sizes (8 bytes vs 32 bytes).
>  >>  This could very well explain regressions some guys reported when they
>  >>  switched to SLUB.
>  >>
>  >
>  >
>  > I see that this fix was committed to mainline as commit
>  > be852795e1c8d3829ddf3cb1ce806113611fa555
>  >
>  > The commit didn't "Cc: <stable@kernel.org>", and it doesn't appear to
>  > be queued for 2.6.24.x.  Should it be?
>  >
>  >
>
> Yes, it should be queued fo 2.6.24.x

That means both of the following commits need to be cherry-picked into 2.6.24.x:
b3242151906372f30f57feaa43b4cac96a23edb1
be852795e1c8d3829ddf3cb1ce806113611fa555

>  > If I understand you correctly, SLAB doesn't create this particular
>  > cache thrashing on 32bit systems?  Is SLAB ok on other architectures
>  > too?  Can you (or others) comment on the importance of this fix
>  > relative to x86_64 (64byte cacheline) and SLAB?
>  >
>  >
>
>
> Fix is important both for 32 and 64 bits kernels, SLAB or SLUB.
>
>  SLAB does have this problem, but less prevalent than SLUB, because these
>  allocators dont have the same minimal size allocation (32 vs 8)
>
>  So with SLUB, it is possible that 8 CPUS share the same 64 bytes
>  cacheline to store their percpu counters, while only 2 cpus can share
>  this same cache line with SLAB allocator.

Thanks for the clarification.


* Re: [stable] [PATCH] alloc_percpu() fails to allocate percpu data
  2008-03-11 19:39     ` Mike Snitzer
@ 2008-03-12  0:18       ` Chris Wright
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Wright @ 2008-03-12  0:18 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Eric Dumazet, Zhang, Yanmin, Peter Zijlstra, netdev,
	linux kernel, stable, Andrew Morton, David S. Miller,
	Christoph Lameter

* Mike Snitzer (snitzer@gmail.com) wrote:
> On 3/11/08, Eric Dumazet <dada1@cosmosbay.com> wrote:
> > Mike Snitzer a écrit :
> >  > I see that this fix was committed to mainline as commit
> >  > be852795e1c8d3829ddf3cb1ce806113611fa555
> >  >
> >  > The commit didn't "Cc: <stable@kernel.org>", and it doesn't appear to
> >  > be queued for 2.6.24.x.  Should it be?
> >
> > Yes, it should be queued fo 2.6.24.x
> 
> That means both of the following commits need to be cherry-picked into 2.6.24.x:
> b3242151906372f30f57feaa43b4cac96a23edb1
> be852795e1c8d3829ddf3cb1ce806113611fa555

Just send each with appropriate commit hash id and relevant Cc's to
stable@kernel.org so we can pick them up.

thanks,
-chris

