linux-kernel.vger.kernel.org archive mirror
* [patch] tlb flush_data: replace per_cpu with an array
@ 2009-01-12 21:35 Frederik Deweerdt
  2009-01-12 21:46 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Frederik Deweerdt @ 2009-01-12 21:35 UTC (permalink / raw)
  To: mingo; +Cc: andi, tglx, hpa, linux-kernel

Hi,

On x86_64, TLB flush data is stored in per_cpu variables. This is
unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
are accessed.
This patch aims at making the code less confusing (there's nothing
really "per_cpu" about it) by using a plain array. It would also save some
memory on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).

Regards,
Frederik

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>

diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index f8be6f1..c177a1f 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -33,7 +33,7 @@
  *	To avoid global state use 8 different call vectors.
  *	Each CPU uses a specific vector to trigger flushes on other
  *	CPUs. Depending on the received vector the target CPUs look into
- *	the right per cpu variable for the flush data.
+ *	the right array slot for the flush data.
  *
  *	With more than 8 CPUs they are hashed to the 8 available
  *	vectors. The limited global vector space forces us to this right now.
@@ -54,7 +54,7 @@ union smp_flush_state {
 /* State is put into the per CPU data section, but padded
    to a full cache line because other CPUs can access it and we don't
    want false sharing in the per cpu data segment. */
-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
 
 /*
  * We cannot call mmdrop() because we are in interrupt context,
@@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
 	 * Use that to determine where the sender put the data.
 	 */
 	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
-	f = &per_cpu(flush_state, sender);
+	f = &flush_state[sender];
 
 	if (!cpu_isset(cpu, f->flush_cpumask))
 		goto out;
@@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
 
 	/* Caller has disabled preemption */
 	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
-	f = &per_cpu(flush_state, sender);
+	f = &flush_state[sender];
 
 	/*
 	 * Could avoid this lock when
@@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
 {
 	int i;
 
-	for_each_possible_cpu(i)
-		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
+	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
+		spin_lock_init(&flush_state[i].tlbstate_lock);
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:35 [patch] tlb flush_data: replace per_cpu with an array Frederik Deweerdt
@ 2009-01-12 21:46 ` Ingo Molnar
  2009-01-12 21:57 ` Andi Kleen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-12 21:46 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: andi, tglx, hpa, linux-kernel, Jeremy Fitzhardinge


* Frederik Deweerdt <frederik.deweerdt@xprog.eu> wrote:

> Hi,
> 
> On x86_64 flush tlb data is stored in per_cpu variables. This is 
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries 
> are accessed.
>
> This patch aims at making the code less confusing (there's nothing 
> really "per_cpu") by using a plain array. It also would save some memory 
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).

indeed, well spotted!

This construct was an abuse of the per_cpu facilities.

Note that beyond the obvious memory savings, your patch should make the
code a bit faster and a bit smaller as well in the SMP TLB flush codepath:
the smp_flush_state data structure is 64 (or 128) bytes, so static array
arithmetic will be faster than the per_cpu indirection.

I have applied your patch to tip/x86/mm, thanks Frederik!

	Ingo


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:35 [patch] tlb flush_data: replace per_cpu with an array Frederik Deweerdt
  2009-01-12 21:46 ` Ingo Molnar
@ 2009-01-12 21:57 ` Andi Kleen
  2009-01-12 22:10   ` Frederik Deweerdt
  2009-01-12 22:40   ` Ingo Molnar
  2009-01-12 22:34 ` Ravikiran G Thirumalai
  2009-01-12 22:54 ` Mike Travis
  3 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2009-01-12 21:57 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: mingo, andi, tglx, hpa, linux-kernel

On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> Hi,
> 
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).

Nope, it doesn't save memory on most systems, because per cpu data is only
allocated based on the CPUs that are actually there. And if you have more
than 8 cores you can likely afford a few bytes per CPU.

You would need to cache-line-pad each entry then, otherwise you risk
false sharing. That would make the array 1K on a 128-byte cache line
system. This means on small systems this would actually waste much more
memory.

per cpu avoids that problem completely.

-Andi



* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:57 ` Andi Kleen
@ 2009-01-12 22:10   ` Frederik Deweerdt
  2009-01-12 22:32     ` Andi Kleen
  2009-01-12 22:40   ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Frederik Deweerdt @ 2009-01-12 22:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, tglx, hpa, linux-kernel

On Mon, Jan 12, 2009 at 10:57:02PM +0100, Andi Kleen wrote:
> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > Hi,
> > 
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> > This patch aims at making the code less confusing (there's nothing
> > really "per_cpu") by using a plain array. It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> 
> Nope it doesn't save memory on most systems because per cpu is only allocated
> based on the CPUs that are actually there. And if you have more than 8
> cores you can likely afford a few bytes per CPU.
I did not understand that, thanks for clarifying
> 
> You would need to cache line pad each entry then, otherwise you risk
> false sharing. That would make the array 1K on 128 bytes cache line 
> system.  This means on small systems this would actually waste
> much more memory.
> 
> per cpu avoids that problem completely.
It is also slower (or so percpu.h says), and confusing I'd say.

Regards,
Frederik


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:10   ` Frederik Deweerdt
@ 2009-01-12 22:32     ` Andi Kleen
  2009-01-12 22:50       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2009-01-12 22:32 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: Andi Kleen, mingo, tglx, hpa, linux-kernel

On Mon, Jan 12, 2009 at 11:10:23PM +0100, Frederik Deweerdt wrote:
> On Mon, Jan 12, 2009 at 10:57:02PM +0100, Andi Kleen wrote:
> > On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > > Hi,
> > > 
> > > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > > are accessed.
> > > This patch aims at making the code less confusing (there's nothing
> > > really "per_cpu") by using a plain array. It also would save some memory
> > > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> > 
> > Nope it doesn't save memory on most systems because per cpu is only allocated
> > based on the CPUs that are actually there. And if you have more than 8
> > cores you can likely afford a few bytes per CPU.
> I did not understand that, thanks for clarifying
> > 
> > You would need to cache line pad each entry then, otherwise you risk
> > false sharing. That would make the array 1K on 128 bytes cache line 
> > system.  This means on small systems this would actually waste
> > much more memory.
> > 
> > per cpu avoids that problem completely.
> It is also slower (or so percpu.h says), and confusing I'd say.

Well it's something like 3 instructions versus one. You would
have a hard time benchmarking it unless you run it in a very tight 
loop. It will be lost in the noise compared to all the other costs
of the IPI.

Also, the reason I don't like this patch is that on the typical small
single/dual core system running a 128-byte cache line distro kernel you
always pay the 1K cost now, while with per cpu it only needed one/two
entries.

Admittedly it could have been better commented.

Not that it matters now; unfortunately it's already applied. Sometimes I
wonder why I still bother to do patch review...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:35 [patch] tlb flush_data: replace per_cpu with an array Frederik Deweerdt
  2009-01-12 21:46 ` Ingo Molnar
  2009-01-12 21:57 ` Andi Kleen
@ 2009-01-12 22:34 ` Ravikiran G Thirumalai
  2009-01-12 23:00   ` Ingo Molnar
  2009-01-12 22:54 ` Mike Travis
  3 siblings, 1 reply; 31+ messages in thread
From: Ravikiran G Thirumalai @ 2009-01-12 22:34 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: mingo, andi, tglx, hpa, linux-kernel

Hi Frederik,

On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
>
>Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
>
>diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
>index f8be6f1..c177a1f 100644
>--- a/arch/x86/kernel/tlb_64.c
>+++ b/arch/x86/kernel/tlb_64.c
>@@ -33,7 +33,7 @@
>  *	To avoid global state use 8 different call vectors.
>  *	Each CPU uses a specific vector to trigger flushes on other
>  *	CPUs. Depending on the received vector the target CPUs look into
>- *	the right per cpu variable for the flush data.
>+ *	the right array slot for the flush data.
>  *
>  *	With more than 8 CPUs they are hashed to the 8 available
>  *	vectors. The limited global vector space forces us to this right now.
>@@ -54,7 +54,7 @@ union smp_flush_state {
> /* State is put into the per CPU data section, but padded
>    to a full cache line because other CPUs can access it and we don't
>    want false sharing in the per cpu data segment. */
>-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
>+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];

Since flush_state has a spinlock, an array of flush_states would end up
having multiple spinlocks on the same L1 cacheline.  However, I see that
the union smp_flush_state is already padded to SMP_CACHE_BYTES.

With per-cpu areas,  locks belonging to different array elements did
not end up on the same internode cacheline due to per-node allocation of
per-cpu areas, but with a simple array, two locks could end up on the same
internode cacheline.

Hence, can you please change the padding in smp_flush_state to
INTERNODE_CACHE_BYTES (you have to derive this off INTERNODE_CACHE_SHIFT)
and align it using ____cacheline_internodealigned_in_smp?


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:57 ` Andi Kleen
  2009-01-12 22:10   ` Frederik Deweerdt
@ 2009-01-12 22:40   ` Ingo Molnar
  2009-01-12 22:59     ` H. Peter Anvin
  2009-01-13  2:43     ` Andi Kleen
  1 sibling, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-12 22:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frederik Deweerdt, tglx, hpa, linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > Hi,
> > 
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> >
> > This patch aims at making the code less confusing (there's nothing 
> > really "per_cpu") by using a plain array. It also would save some 
> > memory on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by 
> > default).
> 
> Nope it doesn't save memory on most systems because per cpu is only 
> allocated based on the CPUs that are actually there. And if you have 
> more than 8 cores you can likely afford a few bytes per CPU.

No distro kernel will build with fewer than 8 CPUs anyway, so this point is
moot.

> You would need to cache line pad each entry then, otherwise you risk 
> false sharing. [...]

They are already cache line padded.

> [...] That would make the array 1K on 128 bytes cache line system.

512 bytes.

> [...]  This means on small systems this would actually waste much more 
> memory.

Really small systems will be UP and won't do cross-CPU TLB flushes, so if
they are a worry the flush code can be UP-optimized. (Nobody bothered so
far.)

	Ingo


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:32     ` Andi Kleen
@ 2009-01-12 22:50       ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-12 22:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frederik Deweerdt, tglx, hpa, linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> > It is also slower (or so percpu.h says), and confusing I'd say.
> 
> Well it's something like 3 instructions versus one. [...]

That's enough - micro-optimizations are done like that, instruction by 
instruction.

> [...] You would have a hard time benchmarking it unless you run it in a 
> very tight loop. It will be lost in the noise compared to all the other 
> costs of the IPI.
> 
> Also why i don't like this patch is that on the typical small 
> single/dual core system running a 128 byte cache line distro kernel you 
> always pay the 1K cost now, while with per cpu it only needed one/two 
> entries.

4 or 8 cores are the norm these days - by the time this change hits real
Linux computers en masse, 8 cores will be quite common.

> Admittedly it could have been better commented.
> 
> Not that it matters now unfortunately it's already applied. Sometimes 
> wonder why I still bother to do patch review...

Whether patches are already applied or not has no relevance - patches can
still be undone or reverted of course, should your review feedback prove
correct.

	Ingo


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 21:35 [patch] tlb flush_data: replace per_cpu with an array Frederik Deweerdt
                   ` (2 preceding siblings ...)
  2009-01-12 22:34 ` Ravikiran G Thirumalai
@ 2009-01-12 22:54 ` Mike Travis
  2009-01-12 23:51   ` Frederik Deweerdt
  3 siblings, 1 reply; 31+ messages in thread
From: Mike Travis @ 2009-01-12 22:54 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: mingo, andi, tglx, hpa, linux-kernel

Frederik Deweerdt wrote:
> Hi,
> 
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> 
> Regards,
> Frederik
> 
> Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>

Here is the net change in memory usage with this patch on an allyesconfig
with NR_CPUS=4096.

====== Data (-l 500)

    1 - initial
    2 - change-flush-tlb

  .1.    .2.  .delta.
    0   5120   +5120      .  flush_state(.bss)

====== Sections (-l 500)

        .1.    .2.  .delta.
   12685496 12693688       +8192 +0.06%  .bss
    1910176  1909408        -768 -0.04%  .data.percpu

> 
> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> index f8be6f1..c177a1f 100644
> --- a/arch/x86/kernel/tlb_64.c
> +++ b/arch/x86/kernel/tlb_64.c
> @@ -33,7 +33,7 @@
>   *	To avoid global state use 8 different call vectors.
>   *	Each CPU uses a specific vector to trigger flushes on other
>   *	CPUs. Depending on the received vector the target CPUs look into
> - *	the right per cpu variable for the flush data.
> + *	the right array slot for the flush data.
>   *
>   *	With more than 8 CPUs they are hashed to the 8 available
>   *	vectors. The limited global vector space forces us to this right now.
> @@ -54,7 +54,7 @@ union smp_flush_state {
>  /* State is put into the per CPU data section, but padded
>     to a full cache line because other CPUs can access it and we don't
>     want false sharing in the per cpu data segment. */
> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>  
>  /*
>   * We cannot call mmdrop() because we are in interrupt context,
> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
>  	 * Use that to determine where the sender put the data.
>  	 */
>  	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> -	f = &per_cpu(flush_state, sender);
> +	f = &flush_state[sender];
>  
>  	if (!cpu_isset(cpu, f->flush_cpumask))
>  		goto out;
> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>  
>  	/* Caller has disabled preemption */
>  	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> -	f = &per_cpu(flush_state, sender);
> +	f = &flush_state[sender];
>  
>  	/*
>  	 * Could avoid this lock when
> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
>  {
>  	int i;
>  
> -	for_each_possible_cpu(i)
> -		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> +	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> +		spin_lock_init(&flush_state[i].tlbstate_lock);
>  
>  	return 0;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:40   ` Ingo Molnar
@ 2009-01-12 22:59     ` H. Peter Anvin
  2009-01-13  2:43     ` Andi Kleen
  1 sibling, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2009-01-12 22:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, Frederik Deweerdt, tglx, linux-kernel

Ingo Molnar wrote:

>> Nope it doesn't save memory on most systems because per cpu is only 
>> allocated based on the CPUs that are actually there. And if you have 
>> more than 8 cores you can likely afford a few bytes per CPU.
> 
> No distro kernel will build with less than 8 CPUs anyway so this point is 
> moot.

Except for embedded, which will build for exactly what's in the box.

	-hpa


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:34 ` Ravikiran G Thirumalai
@ 2009-01-12 23:00   ` Ingo Molnar
  2009-01-12 23:09     ` Ingo Molnar
                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-12 23:00 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Frederik Deweerdt, andi, tglx, hpa, linux-kernel


* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:

> Hi Frederik,
> 
> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> >
> >Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
> >
> >diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> >index f8be6f1..c177a1f 100644
> >--- a/arch/x86/kernel/tlb_64.c
> >+++ b/arch/x86/kernel/tlb_64.c
> >@@ -33,7 +33,7 @@
> >  *	To avoid global state use 8 different call vectors.
> >  *	Each CPU uses a specific vector to trigger flushes on other
> >  *	CPUs. Depending on the received vector the target CPUs look into
> >- *	the right per cpu variable for the flush data.
> >+ *	the right array slot for the flush data.
> >  *
> >  *	With more than 8 CPUs they are hashed to the 8 available
> >  *	vectors. The limited global vector space forces us to this right now.
> >@@ -54,7 +54,7 @@ union smp_flush_state {
> > /* State is put into the per CPU data section, but padded
> >    to a full cache line because other CPUs can access it and we don't
> >    want false sharing in the per cpu data segment. */
> >-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> >+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
> 
> Since flush_state has a spinlock, an array of flush_states would end up 
> having multiple spinlocks on the same L1 cacheline.  However, I see 
> that the union smp_flush_state is already padded to SMP_CACHE_BYTES.

Correct.

> With per-cpu areas, locks belonging to different array elements did not 
> end up on the same internode cacheline due to per-node allocation of 
> per-cpu areas, but with a simple array, two locks could end up on the 
> same internode cacheline.
> 
> Hence, can you please change the padding in smp_flush_state to 
> INTERNODE_CACHE_BYTES (you have to derive this off 
> INTERNODE_CACHE_SHIFT) and align it using 
> ____cacheline_internodealigned_in_smp?

Yes, that matters to vsmp, which has ... 4K node cache-lines.

Note that there's already CONFIG_X86_INTERNODE_CACHE_BYTES which is set 
properly on vsmp.

So ... something like the commit below?

	Ingo

--------------->
From 23d9dc8bffc759c131b09a48b5215cc2b37a5ac3 Mon Sep 17 00:00:00 2001
From: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Date: Mon, 12 Jan 2009 22:35:42 +0100
Subject: [PATCH] x86, tlb flush_data: replace per_cpu with an array

Impact: micro-optimization, memory reduction

On x86_64 flush tlb data is stored in per_cpu variables. This is
unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
are accessed.

This patch aims at making the code less confusing (there's nothing
really "per_cpu") by using a plain array. It also would save some memory
on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).

[ Ravikiran G Thirumalai also pointed out that the correct alignment
  is ____cacheline_internodealigned_in_smp, so that there's no
  bouncing on vsmp. ]

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/tlb_64.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index f8be6f1..c5a6c6f 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -33,7 +33,7 @@
  *	To avoid global state use 8 different call vectors.
  *	Each CPU uses a specific vector to trigger flushes on other
  *	CPUs. Depending on the received vector the target CPUs look into
- *	the right per cpu variable for the flush data.
+ *	the right array slot for the flush data.
  *
  *	With more than 8 CPUs they are hashed to the 8 available
  *	vectors. The limited global vector space forces us to this right now.
@@ -48,13 +48,13 @@ union smp_flush_state {
 		unsigned long flush_va;
 		spinlock_t tlbstate_lock;
 	};
-	char pad[SMP_CACHE_BYTES];
-} ____cacheline_aligned;
+	char pad[X86_INTERNODE_CACHE_BYTES];
+} ____cacheline_internodealigned_in_smp;
 
 /* State is put into the per CPU data section, but padded
    to a full cache line because other CPUs can access it and we don't
    want false sharing in the per cpu data segment. */
-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
 
 /*
  * We cannot call mmdrop() because we are in interrupt context,
@@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
 	 * Use that to determine where the sender put the data.
 	 */
 	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
-	f = &per_cpu(flush_state, sender);
+	f = &flush_state[sender];
 
 	if (!cpu_isset(cpu, f->flush_cpumask))
 		goto out;
@@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
 
 	/* Caller has disabled preemption */
 	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
-	f = &per_cpu(flush_state, sender);
+	f = &flush_state[sender];
 
 	/*
 	 * Could avoid this lock when
@@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
 {
 	int i;
 
-	for_each_possible_cpu(i)
-		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
+	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
+		spin_lock_init(&flush_state[i].tlbstate_lock);
 
 	return 0;
 }


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 23:00   ` Ingo Molnar
@ 2009-01-12 23:09     ` Ingo Molnar
  2009-01-13  2:14     ` Ravikiran G Thirumalai
  2009-01-13 12:00     ` Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-12 23:09 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Frederik Deweerdt, andi, tglx, hpa, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> +	char pad[X86_INTERNODE_CACHE_BYTES];

s/CONFIG_X86_INTERNODE_CACHE_BYTES

	Ingo


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:54 ` Mike Travis
@ 2009-01-12 23:51   ` Frederik Deweerdt
  2009-01-12 23:57     ` Mike Travis
  0 siblings, 1 reply; 31+ messages in thread
From: Frederik Deweerdt @ 2009-01-12 23:51 UTC (permalink / raw)
  To: Mike Travis; +Cc: mingo, andi, tglx, hpa, linux-kernel

On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
> Frederik Deweerdt wrote:
> > Hi,
> > 
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> > This patch aims at making the code less confusing (there's nothing
> > really "per_cpu") by using a plain array. It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> > 
> > Regards,
> > Frederik
> > 
> > Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
> 
> Here is the net change in memory usage with this patch on a allyesconfig
> with NR_CPUS=4096.
Yes, this point wrt. memory was based on my flawed understanding of how
per_cpu actually allocates the data. There is, however, 1) a confusing use
of per_cpu removed, and 2) faster access to the flush data.

> 
> ====== Data (-l 500)
> 
>     1 - initial
>     2 - change-flush-tlb
> 
>   .1.    .2.  .delta.
>     0   5120   +5120      .  flush_state(.bss)
> 
> ====== Sections (-l 500)
> 
>         .1.    .2.  .delta.
>    12685496 12693688       +8192 +0.06%  .bss
>     1910176  1909408        -768 -0.04%  .data.percpu
> 
I get:
Initial
 size ./arch/x86/kernel/tlb_64.o
   text	   data	    bss	    dec	    hex	filename
   1667	    136	      8	   1811	    713	./arch/x86/kernel/tlb_64.o

After
 size ./arch/x86/kernel/tlb_64.o 
   text	   data	    bss	    dec	    hex	filename
   1598	      8	   1088	   2694	    a86	./arch/x86/kernel/tlb_64.o

    -69    -128   +1080    +883

But I'm not sure those numbers are that relevant, as the percpu part
will be allocated at runtime.

Regards,
Frederik

> > 
> > diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> > index f8be6f1..c177a1f 100644
> > --- a/arch/x86/kernel/tlb_64.c
> > +++ b/arch/x86/kernel/tlb_64.c
> > @@ -33,7 +33,7 @@
> >   *	To avoid global state use 8 different call vectors.
> >   *	Each CPU uses a specific vector to trigger flushes on other
> >   *	CPUs. Depending on the received vector the target CPUs look into
> > - *	the right per cpu variable for the flush data.
> > + *	the right array slot for the flush data.
> >   *
> >   *	With more than 8 CPUs they are hashed to the 8 available
> >   *	vectors. The limited global vector space forces us to this right now.
> > @@ -54,7 +54,7 @@ union smp_flush_state {
> >  /* State is put into the per CPU data section, but padded
> >     to a full cache line because other CPUs can access it and we don't
> >     want false sharing in the per cpu data segment. */
> > -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> > +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
> >  
> >  /*
> >   * We cannot call mmdrop() because we are in interrupt context,
> > @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
> >  	 * Use that to determine where the sender put the data.
> >  	 */
> >  	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> > -	f = &per_cpu(flush_state, sender);
> > +	f = &flush_state[sender];
> >  
> >  	if (!cpu_isset(cpu, f->flush_cpumask))
> >  		goto out;
> > @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
> >  
> >  	/* Caller has disabled preemption */
> >  	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> > -	f = &per_cpu(flush_state, sender);
> > +	f = &flush_state[sender];
> >  
> >  	/*
> >  	 * Could avoid this lock when
> > @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
> >  {
> >  	int i;
> >  
> > -	for_each_possible_cpu(i)
> > -		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> > +	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> > +		spin_lock_init(&flush_state[i].tlbstate_lock);
> >  
> >  	return 0;
> >  }
> 


* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 23:51   ` Frederik Deweerdt
@ 2009-01-12 23:57     ` Mike Travis
  2009-01-13  0:01       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Travis @ 2009-01-12 23:57 UTC (permalink / raw)
  To: Frederik Deweerdt; +Cc: mingo, andi, tglx, hpa, linux-kernel

Frederik Deweerdt wrote:
> On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
>> Frederik Deweerdt wrote:
>>> Hi,
>>>
>>> On x86_64 flush tlb data is stored in per_cpu variables. This is
>>> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
>>> are accessed.
>>> This patch aims at making the code less confusing (there's nothing
>>> really "per_cpu") by using a plain array. It also would save some memory
>>> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>>>
>>> Regards,
>>> Frederik
>>>
>>> Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
>> Here is the net change in memory usage with this patch on a allyesconfig
>> with NR_CPUS=4096.

> Yes, this point wrt. memory was based on my flawed understanding of how
> per_cpu actually allocates the data. There is however 1) a confusing use
> of per_cpu removed, 2) faster access to the flush data.

Is this true?  On a widely separated NUMA system, requiring all CPUs to
access memory on node 0 for every TLB flush would seem expensive.  That's
another benefit of per_cpu data: it's local to the node's CPUs.

(And has it been determined yet that a cacheline has to be tossed around as well?)

Thanks,
Mike

> 
>> ====== Data (-l 500)
>>
>>     1 - initial
>>     2 - change-flush-tlb
>>
>>   .1.    .2.  .delta.
>>     0   5120   +5120      .  flush_state(.bss)
>>
>> ====== Sections (-l 500)
>>
>>         .1.    .2.  .delta.
>>    12685496 12693688       +8192 +0.06%  .bss
>>     1910176  1909408        -768 -0.04%  .data.percpu
>>
> I get :
> Initial
>  size ./arch/x86/kernel/tlb_64.o
>    text	   data	    bss	    dec	    hex	filename
>    1667	    136	      8	   1811	    713	./arch/x86/kernel/tlb_64.o
> 
> After
>  size ./arch/x86/kernel/tlb_64.o 
>    text	   data	    bss	    dec	    hex	filename
>    1598	      8	   1088	   2694	    a86	./arch/x86/kernel/tlb_64.o
> 
>     -69    -128   +1080    +883
> 
> But I'm not sure those numbers are that relevant, as the percpu part
> will be allocated at runtime.
> 
> Regards,
> Frederik
> 
>>> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
>>> index f8be6f1..c177a1f 100644
>>> --- a/arch/x86/kernel/tlb_64.c
>>> +++ b/arch/x86/kernel/tlb_64.c
>>> @@ -33,7 +33,7 @@
>>>   *	To avoid global state use 8 different call vectors.
>>>   *	Each CPU uses a specific vector to trigger flushes on other
>>>   *	CPUs. Depending on the received vector the target CPUs look into
>>> - *	the right per cpu variable for the flush data.
>>> + *	the right array slot for the flush data.
>>>   *
>>>   *	With more than 8 CPUs they are hashed to the 8 available
>>>   *	vectors. The limited global vector space forces us to this right now.
>>> @@ -54,7 +54,7 @@ union smp_flush_state {
>>>  /* State is put into the per CPU data section, but padded
>>>     to a full cache line because other CPUs can access it and we don't
>>>     want false sharing in the per cpu data segment. */
>>> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
>>> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>>>  
>>>  /*
>>>   * We cannot call mmdrop() because we are in interrupt context,
>>> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
>>>  	 * Use that to determine where the sender put the data.
>>>  	 */
>>>  	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
>>> -	f = &per_cpu(flush_state, sender);
>>> +	f = &flush_state[sender];
>>>  
>>>  	if (!cpu_isset(cpu, f->flush_cpumask))
>>>  		goto out;
>>> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>>>  
>>>  	/* Caller has disabled preemption */
>>>  	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
>>> -	f = &per_cpu(flush_state, sender);
>>> +	f = &flush_state[sender];
>>>  
>>>  	/*
>>>  	 * Could avoid this lock when
>>> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
>>>  {
>>>  	int i;
>>>  
>>> -	for_each_possible_cpu(i)
>>> -		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
>>> +	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
>>> +		spin_lock_init(&flush_state[i].tlbstate_lock);
>>>  
>>>  	return 0;
>>>  }


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 23:57     ` Mike Travis
@ 2009-01-13  0:01       ` Ingo Molnar
  2009-01-13  3:36         ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-01-13  0:01 UTC (permalink / raw)
  To: Mike Travis; +Cc: Frederik Deweerdt, andi, tglx, hpa, linux-kernel


* Mike Travis <travis@sgi.com> wrote:

> Frederik Deweerdt wrote:
> > On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
> >> Frederik Deweerdt wrote:
> >>> Hi,
> >>>
> >>> On x86_64 flush tlb data is stored in per_cpu variables. This is
> >>> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> >>> are accessed.
> >>> This patch aims at making the code less confusing (there's nothing
> >>> really "per_cpu") by using a plain array. It also would save some memory
> >>> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> >>>
> >>> Regards,
> >>> Frederik
> >>>
> >>> Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
> >> Here is the net change in memory usage with this patch on a allyesconfig
> >> with NR_CPUS=4096.
> 
> > Yes, this point wrt. memory was based on my flawed understanding of how
> > per_cpu actually allocates the data. There is however 1) a confusing use
> > of per_cpu removed, 2) faster access to the flush data.
> 
> Is this true?  On a widely separated NUMA system, requiring all CPU's to 
> access memory on NODE 0 for every tlb flush would seem expensive.  
> That's another benefit of per_cpu data, it's local to the node's cpus.

That's irrelevant here: we only have 8 IPI slots for the TLB flush, so 
we'll use the first 8 per_cpu areas (note how all access is cpu-nr modulo 
8 in essence). That is going to be node 0 anyway in most NUMA setups.

> (And was it determined yet, that a cacheline has to be tossed around as 
> well?)

No, that assertion of Andi is wrong.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 23:00   ` Ingo Molnar
  2009-01-12 23:09     ` Ingo Molnar
@ 2009-01-13  2:14     ` Ravikiran G Thirumalai
  2009-01-13 12:00     ` Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Ravikiran G Thirumalai @ 2009-01-13  2:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Frederik Deweerdt, andi, tglx, hpa, linux-kernel

On Tue, Jan 13, 2009 at 12:00:52AM +0100, Ingo Molnar wrote:
>
>
>> With per-cpu areas, locks belonging to different array elements did not 
>> end up on the same internode cacheline due to per-node allocation of 
>> per-cpu areas, but with a simple array, two locks could end up on the 
>> same internode cacheline.
>> 
>> Hence, can you please change the padding in smp_flush_state to 
>> INTERNODE_CACHE_BYTES (you have to derive this off 
>> INTERNODE_CACHE_SHIFT) and align it using 
>> ____cacheline_internodealigned_in_smp?
>
>Yes, that matters to vsmp, which has ... 4K node cache-lines.
>
>Note that there's already CONFIG_X86_INTERNODE_CACHE_BYTES which is set 
>properly on vsmp.
>
>So ... something like the commit below?
>

Yep.  Looks good!

Acked-by: Ravikiran Thirumalai <kiran@scalex86.org>

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 22:40   ` Ingo Molnar
  2009-01-12 22:59     ` H. Peter Anvin
@ 2009-01-13  2:43     ` Andi Kleen
  2009-01-13 12:28       ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2009-01-13  2:43 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, Frederik Deweerdt, tglx, hpa, linux-kernel

> No distro kernel will build with less than 8 CPUs anyway so this point is 
> moot.

It has nothing to do with what the distro kernel builds with. As I stated
clearly in my review, the per cpu data is sized based on the possible map, 
which is discovered from the BIOS at runtime.  So if your system has only two
cores, you will only have two copies of per cpu data. 

With this patch on the other hand you will always have 8 copies
of this data; aka 1K no matter how many CPUs you have.

So the description that it saves memory is flat out 
wrong on any system with fewer than 8 threads (which
is by far the vast majority of systems currently
and in the foreseeable future).

> > You would need to cache line pad each entry then, otherwise you risk 
> > false sharing. [...]
> 
> They are already cache line padded.

Yes that's the problem here. 

> 
> > [...] That would make the array 1K on 128 bytes cache line system.
> 
> 512 bytes.

8 * 128 = 1024 

OK, the real waste is a little less because you need at least one copy,
but it is still considerable. (896 bytes on UP, 768 bytes on 2C)


> 
> > [...]  This means on small systems this would actually waste much more 
> > memory.
> 
Really small systems will be UP and won't do cross-CPU TLB flushes, so if 
they are a worry the flush code can be UP-optimized. (Nobody has bothered so 
far.)

The SMP flush code shouldn't be called at all on UP because the "other cpu"
mask is always empty.

Just talking about the memory: sure, it's only 1K (or 896 bytes), but if you 
add up a lot of little 896-byte wastes you eventually get a lot of waste all over.

Anyways if you wanted to do this without using per cpu data you
could use alloc_percpu(), but that would be much more complicated
code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13  0:01       ` Ingo Molnar
@ 2009-01-13  3:36         ` Andi Kleen
  2009-01-13 12:14           ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2009-01-13  3:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Travis, Frederik Deweerdt, andi, tglx, hpa, linux-kernel

> > (And was it determined yet, that a cacheline has to be tossed around as 
> > well?)
> 
> No, that assertion of Andi is wrong.

For the IPI you always have to toss one cache line around to pass
the information.

Anyways the padding is just to avoid false sharing.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-12 23:00   ` Ingo Molnar
  2009-01-12 23:09     ` Ingo Molnar
  2009-01-13  2:14     ` Ravikiran G Thirumalai
@ 2009-01-13 12:00     ` Peter Zijlstra
  2009-01-13 12:33       ` Ingo Molnar
  2009-01-14  9:08       ` Andi Kleen
  2 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2009-01-13 12:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ravikiran G Thirumalai, Frederik Deweerdt, andi, tglx, hpa, linux-kernel

On Tue, 2009-01-13 at 00:00 +0100, Ingo Molnar wrote:

> From 23d9dc8bffc759c131b09a48b5215cc2b37a5ac3 Mon Sep 17 00:00:00 2001
> From: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
> Date: Mon, 12 Jan 2009 22:35:42 +0100
> Subject: [PATCH] x86, tlb flush_data: replace per_cpu with an array
> 
> Impact: micro-optimization, memory reduction
> 
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
> 
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> 
> [ Ravikiran G Thirumalai also pointed out that the correct alignment
>   is ____cacheline_internodealigned_in_smp, so that there's no
>   bouncing on vsmp. ]
> 
> Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/kernel/tlb_64.c |   16 ++++++++--------
>  1 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> index f8be6f1..c5a6c6f 100644
> --- a/arch/x86/kernel/tlb_64.c
> +++ b/arch/x86/kernel/tlb_64.c
> @@ -33,7 +33,7 @@
>   *	To avoid global state use 8 different call vectors.
>   *	Each CPU uses a specific vector to trigger flushes on other
>   *	CPUs. Depending on the received vector the target CPUs look into
> - *	the right per cpu variable for the flush data.
> + *	the right array slot for the flush data.
>   *
>   *	With more than 8 CPUs they are hashed to the 8 available
>   *	vectors. The limited global vector space forces us to this right now.
> @@ -48,13 +48,13 @@ union smp_flush_state {
>  		unsigned long flush_va;
>  		spinlock_t tlbstate_lock;
>  	};
> -	char pad[SMP_CACHE_BYTES];
> -} ____cacheline_aligned;
> +	char pad[X86_INTERNODE_CACHE_BYTES];
> +} ____cacheline_internodealigned_in_smp;

That will make the below array 8*4096 bytes for VSMP, which pushes the
limit for memory savings up to 256 cpus.

I'm really dubious that this patch is worth it.

>  /* State is put into the per CPU data section, but padded
>     to a full cache line because other CPUs can access it and we don't
>     want false sharing in the per cpu data segment. */
> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>  
>  /*
>   * We cannot call mmdrop() because we are in interrupt context,
> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
>  	 * Use that to determine where the sender put the data.
>  	 */
>  	sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> -	f = &per_cpu(flush_state, sender);
> +	f = &flush_state[sender];
>  
>  	if (!cpu_isset(cpu, f->flush_cpumask))
>  		goto out;
> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>  
>  	/* Caller has disabled preemption */
>  	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> -	f = &per_cpu(flush_state, sender);
> +	f = &flush_state[sender];
>  
>  	/*
>  	 * Could avoid this lock when
> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
>  {
>  	int i;
>  
> -	for_each_possible_cpu(i)
> -		spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> +	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> +		spin_lock_init(&flush_state[i].tlbstate_lock);
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13  3:36         ` Andi Kleen
@ 2009-01-13 12:14           ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-13 12:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mike Travis, Frederik Deweerdt, tglx, hpa, linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> > > (And was it determined yet, that a cacheline has to be tossed around 
> > > as well?)
> > 
> > No, that assertion of Andi is wrong.
> 
> For the IPI you always have to toss one cache line around to pass the 
> information.
> 
> Anyways the padding is just to avoid false sharing.

Which is a red herring - the padding was there before the patch, and is 
there after the patch as well.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13  2:43     ` Andi Kleen
@ 2009-01-13 12:28       ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-01-13 12:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frederik Deweerdt, tglx, hpa, linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> > No distro kernel will build with less than 8 CPUs anyway so this point 
> > is moot.
> 
> It has nothing to do with what the distro kernel builds with. As I 
> stated clearly in my review the per cpu data is sized based on the 
> possible map, which is discovered from the BIOS at runtime.  So if your 
> system has two cores only you will only have two copies of per cpu data.

You ignored my observation that by the time this hits distro kernels the 
usual SMP hw size will be 8 or more cores.

> With this patch on the other hand you will always have 8 copies of this 
> data; aka 1K no matter how many CPUs you have.

Firstly, it's 512 bytes (see below); secondly, with the percpu approach we 
waste far more total memory as time goes on and the average core count 
increases.

> So the description that it saves memory is flat out wrong on any system 
> with less than 8 threads (which is by far the biggest majority of 
> systems currently and in the forseeable future)
> 
> > > You would need to cache line pad each entry then, otherwise you risk 
> > > false sharing. [...]
> > 
> > They are already cache line padded.
> 
> Yes that's the problem here. 

I fail to see a problem. It has to be padded and aligned no matter where 
it is - and it is padded and aligned both before and after the patch. I 
don't know why you keep insisting that there's a problem where there is 
none.

> > > [...] That would make the array 1K on 128 bytes cache line system.
> > 
> > 512 bytes.
> 
> 8 * 128 = 1024

The default cacheline size in the x86 tree (for generic cpus) is 64 bytes, 
not 128 bytes - so the default size of the array is 512 bytes, not 1024 
bytes. (This is a change we made yesterday, so you can not have known 
about it unless you follow the x86 tree closely.)

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 12:00     ` Peter Zijlstra
@ 2009-01-13 12:33       ` Ingo Molnar
  2009-01-13 18:13         ` Ravikiran G Thirumalai
  2009-01-14  9:08       ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-01-13 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ravikiran G Thirumalai, Frederik Deweerdt, andi, tglx, hpa, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> > -	char pad[SMP_CACHE_BYTES];
> > -} ____cacheline_aligned;
> > +	char pad[X86_INTERNODE_CACHE_BYTES];
> > +} ____cacheline_internodealigned_in_smp;
> 
> That will make the below array 8*4096 bytes for VSMP, which pushes the 
> limit for memory savings up to 256 cpus.

VSMP is a clustering solution (default-disabled) that pushes 
L1_CACHE_BYTES to 4096 bytes (4K). That is an extremely large alignment 
that pushes up the BSS size ten-fold (!), so no generic Linux distribution 
enables it. 32K compared to 9MB bloat caused by 4K cachelines is a drop in 
the ocean.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 12:33       ` Ingo Molnar
@ 2009-01-13 18:13         ` Ravikiran G Thirumalai
  2009-01-13 18:34           ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Ravikiran G Thirumalai @ 2009-01-13 18:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederik Deweerdt, andi, tglx, hpa, linux-kernel

On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
>
>
>VSMP is a clustering solution (default-disabled)

Small correction :)
Well, it is a virtualization solution creating a single VM across multiple
systems.  Since it is shared memory, we would rather refer to it as SMP or
Virtual SMP than as a clustering solution.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 18:13         ` Ravikiran G Thirumalai
@ 2009-01-13 18:34           ` Ingo Molnar
  2009-01-13 18:42             ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-01-13 18:34 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Peter Zijlstra, Frederik Deweerdt, andi, tglx, hpa, linux-kernel


* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:

> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
> >
> >
> >VSMP is a clustering solution (default-disabled)
> 
> Small correction :)
> Well, it is a virtualization solution creating single VM across multiple
> systems.  Since it is shared memory, we would rather refer to it as SMP or
> Virtual SMP rather than a clustering solution.

You do have hardware help too to do efficient cross-node memory traffic, 
don't you?

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 18:34           ` Ingo Molnar
@ 2009-01-13 18:42             ` Ravikiran G Thirumalai
  2009-01-14  7:31               ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Ravikiran G Thirumalai @ 2009-01-13 18:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederik Deweerdt, andi, tglx, hpa, linux-kernel, shai

On Tue, Jan 13, 2009 at 07:34:20PM +0100, Ingo Molnar wrote:
>
>* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
>> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
>> >
>> >
>> >VSMP is a clustering solution (default-disabled)
>> 
>> Small correction :)
>> Well, it is a virtualization solution creating single VM across multiple
>> systems.  Since it is shared memory, we would rather refer to it as SMP or
>> Virtual SMP rather than a clustering solution.
>
>You do have hardware help too to do efficient cross-node memory traffic, 
>dont you?
>

Of course.  A fast backplane interconnect.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 18:42             ` Ravikiran G Thirumalai
@ 2009-01-14  7:31               ` Ingo Molnar
  2009-01-15  7:15                 ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-01-14  7:31 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Peter Zijlstra, Frederik Deweerdt, andi, tglx, hpa, linux-kernel, shai


* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:

> On Tue, Jan 13, 2009 at 07:34:20PM +0100, Ingo Molnar wrote:
> >
> >* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> >> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
> >> >
> >> >
> >> >VSMP is a clustering solution (default-disabled)
> >> 
> >> Small correction :)
> >> Well, it is a virtualization solution creating single VM across multiple
> >> systems.  Since it is shared memory, we would rather refer to it as SMP or
> >> Virtual SMP rather than a clustering solution.
> >
> > You do have hardware help too to do efficient cross-node memory 
> > traffic, dont you?
> 
> Of course.  A fast backplane interconnect.

... so it's PCs clustered together, and called vSMP, right? ;-)

a neat hack in any case.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-13 12:00     ` Peter Zijlstra
  2009-01-13 12:33       ` Ingo Molnar
@ 2009-01-14  9:08       ` Andi Kleen
  2009-01-14 14:41         ` Frederik Deweerdt
  1 sibling, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2009-01-14  9:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Ravikiran G Thirumalai, Frederik Deweerdt, andi,
	tglx, hpa, linux-kernel


> > It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).

As pointed out several times this claim in the changelog is incorrect.
Please fix it.

A correct description would be:

- This patch wastes one configured cache line of memory for each CPU
on systems with fewer than 8 cores (e.g. 896 bytes on a 1-core system
with CONFIG_GENERIC_CPU / a 128-byte cache line).
It saves some memory on systems with more than 8 cores (one cache
line size for each core above 8).

-Andi


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-14  9:08       ` Andi Kleen
@ 2009-01-14 14:41         ` Frederik Deweerdt
  2009-01-14 15:20           ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Frederik Deweerdt @ 2009-01-14 14:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Ravikiran G Thirumalai, tglx, hpa,
	linux-kernel

On Wed, Jan 14, 2009 at 10:08:11AM +0100, Andi Kleen wrote:
> 
> > > It also would save some memory
> > > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> 
> As pointed out several times this claim in the changelog is incorrect.
> Please fix it.
> 
> A correct description would be:
> 
> - This patch wastes memory of one configured cache line for each CPU 
> on systems with less than 8 cores (e.g. 896 bytes on a 1 core system
> with CONFIG_GENERIC_CPU/ cache line size 128)
> It saves some memory on systems with more than 8 cores (one cache
> line size for each core above 8)
To be fair, it also depends on whether HOTPLUG_CPU is on or not (if it
is, we allocate NR_CPUS percpu sections). In that case, which is Ubuntu's
and Fedora's, we do save memory.
This would make for a fairly long changelog; how about a link to this
thread, and removing the NR_CPUS bit?

Regards,
Frederik
> 
> -Andi
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-14 15:20           ` Andi Kleen
@ 2009-01-14 15:10             ` Frederik Deweerdt
  0 siblings, 0 replies; 31+ messages in thread
From: Frederik Deweerdt @ 2009-01-14 15:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Ravikiran G Thirumalai, tglx, hpa,
	linux-kernel

On Wed, Jan 14, 2009 at 04:20:01PM +0100, Andi Kleen wrote:
> > To be fair, it also depends on whenever HOTPLUG_CPU is on or not (if it
> > is we allocate NR_CPUS percpu sections).
> 
> not true to my knowledge. Even a !HOTPLUG_CPU kernel doesn't
> allocate more than what is in the possible map. That would be dumb 
> if it was true, but it isn't fortunately.
My point is that possible map == NR_CPUS if HOTPLUG_CPU is set.
> 
> I don't think it does.
>  In that case, which is Ubuntu's
> > and Fedora's, 
> 
> You're saying that Ubuntu and Fedora do not support suspend to ram?
> (suspend to ram requires CPU hotplug). At least my Fedora box
> does S3 just fine so that cannot be correct.
> 
> we do save memory.
> > This would make for a fairly long changelog, how about a link to this
> > thread, and removing the NR_CPUS bit?
> 
> My original paragraph is correct and it's not long. At least your
> claim is just wrong.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-14 14:41         ` Frederik Deweerdt
@ 2009-01-14 15:20           ` Andi Kleen
  2009-01-14 15:10             ` Frederik Deweerdt
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2009-01-14 15:20 UTC (permalink / raw)
  To: Frederik Deweerdt
  Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Ravikiran G Thirumalai,
	tglx, hpa, linux-kernel

> To be fair, it also depends on whether HOTPLUG_CPU is on or not (if it
> is we allocate NR_CPUS percpu sections).

Not true to my knowledge. Even a !HOTPLUG_CPU kernel doesn't
allocate more than what is in the possible map. That would be dumb 
if it were true, but fortunately it isn't.

I don't think it does.
> In that case, which is Ubuntu's
> and Fedora's, 

You're saying that Ubuntu and Fedora do not support suspend to ram?
(suspend to ram requires CPU hotplug). At least my Fedora box
does S3 just fine so that cannot be correct.

> we do save memory.
> This would make for a fairly long changelog, how about a link to this
> thread, and removing the NR_CPUS bit?

My original paragraph is correct and it's not long. At least your
claim is just wrong.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch] tlb flush_data: replace per_cpu with an array
  2009-01-14  7:31               ` Ingo Molnar
@ 2009-01-15  7:15                 ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 31+ messages in thread
From: Ravikiran G Thirumalai @ 2009-01-15  7:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederik Deweerdt, andi, tglx, hpa, linux-kernel, shai

On Wed, Jan 14, 2009 at 08:31:46AM +0100, Ingo Molnar wrote:
>
>* Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
>> > You do have hardware help too to do efficient cross-node memory 
>> > traffic, dont you?
>> 
>> Of course.  A fast backplane interconnect.
>
>... so it's PCs clustered together, and called vSMP, right? ;-)
>

Well, it depends on how you define a cluster.  You can even call an SMP
motherboard a cluster of cpus :).  In that case even the Sequent NUMA systems
were clusters, and so are some of the large core count shared memory systems
made of multiple similar boards/bricks/books (based on vendor terminology)
with a custom interconnect for cache coherency.

A vSMP system is multiple similar cpu motherboards, usually dual socket,
each board having multiple cpus, and a commodity fast interconnect for cache
coherency.  There is a VMM which, among other tasks, also takes care of the
coherency, and each 'board' or 'node' can access physical memory from
another board.  There is one instance of the OS running on the entire system,
unlike the classical cluster where there are multiple instances of the OS
and the apps under the OS have to work together as a cluster.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-01-15  7:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-12 21:35 [patch] tlb flush_data: replace per_cpu with an array Frederik Deweerdt
2009-01-12 21:46 ` Ingo Molnar
2009-01-12 21:57 ` Andi Kleen
2009-01-12 22:10   ` Frederik Deweerdt
2009-01-12 22:32     ` Andi Kleen
2009-01-12 22:50       ` Ingo Molnar
2009-01-12 22:40   ` Ingo Molnar
2009-01-12 22:59     ` H. Peter Anvin
2009-01-13  2:43     ` Andi Kleen
2009-01-13 12:28       ` Ingo Molnar
2009-01-12 22:34 ` Ravikiran G Thirumalai
2009-01-12 23:00   ` Ingo Molnar
2009-01-12 23:09     ` Ingo Molnar
2009-01-13  2:14     ` Ravikiran G Thirumalai
2009-01-13 12:00     ` Peter Zijlstra
2009-01-13 12:33       ` Ingo Molnar
2009-01-13 18:13         ` Ravikiran G Thirumalai
2009-01-13 18:34           ` Ingo Molnar
2009-01-13 18:42             ` Ravikiran G Thirumalai
2009-01-14  7:31               ` Ingo Molnar
2009-01-15  7:15                 ` Ravikiran G Thirumalai
2009-01-14  9:08       ` Andi Kleen
2009-01-14 14:41         ` Frederik Deweerdt
2009-01-14 15:20           ` Andi Kleen
2009-01-14 15:10             ` Frederik Deweerdt
2009-01-12 22:54 ` Mike Travis
2009-01-12 23:51   ` Frederik Deweerdt
2009-01-12 23:57     ` Mike Travis
2009-01-13  0:01       ` Ingo Molnar
2009-01-13  3:36         ` Andi Kleen
2009-01-13 12:14           ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).