* [PATCH] x86: Align TLB invalidation info
@ 2018-01-31 20:11 Nadav Amit
  2018-01-31 20:24 ` Andy Lutomirski
  2018-01-31 21:01 ` Dave Hansen
  0 siblings, 2 replies; 13+ messages in thread
From: Nadav Amit @ 2018-01-31 20:11 UTC (permalink / raw)
  To: x86
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	Peter Zijlstra, Nadav Amit, Nadav Amit, Andy Lutomirski,
	Dave Hansen

The TLB invalidation info is allocated on the stack, which might leave
it unaligned. Since this information may be transferred to different
cores for a TLB shootdown, the misalignment can cause an additional
cache line to bounce between the cores.

GCC provides a way to deal with this via __builtin_alloca_with_align().
Use it to avoid the extra cache-line bouncing.

Signed-off-by: Nadav Amit <namit@vmware.com>

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/mm/tlb.c              | 21 +++++++++++----------
 include/linux/compiler-gcc.h   |  5 +++++
 include/linux/compiler_types.h |  4 ++++
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5bfe61a5e8e3..bab7bb5d982f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -574,37 +574,38 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
+	struct flush_tlb_info *info;
 	int cpu;
 
-	struct flush_tlb_info info = {
-		.mm = mm,
-	};
+	info = __alloca_with_align(sizeof(*info),
+				   SMP_CACHE_BYTES * BITS_PER_BYTE);
+	info->mm = mm;
 
 	cpu = get_cpu();
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	info.new_tlb_gen = inc_mm_tlb_gen(mm);
+	info->new_tlb_gen = inc_mm_tlb_gen(mm);
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
 	    !(vmflag & VM_HUGETLB) &&
 	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
-		info.start = start;
-		info.end = end;
+		info->start = start;
+		info->end = end;
 	} else {
-		info.start = 0UL;
-		info.end = TLB_FLUSH_ALL;
+		info->start = 0UL;
+		info->end = TLB_FLUSH_ALL;
 	}
 
 	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		VM_WARN_ON(irqs_disabled());
 		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
 		local_irq_enable();
 	}
 
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
+		flush_tlb_others(mm_cpumask(mm), info);
 
 	put_cpu();
 }
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 631354acfa72..aea9a2e69417 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -314,6 +314,11 @@
 #define __designated_init __attribute__((designated_init))
 #endif
 
+#if GCC_VERSION >= 60100
+#define __alloca_with_align(size, alignment)				\
+	__builtin_alloca_with_align(size, alignment)
+#endif
+
 #endif	/* gcc version >= 40000 specific checks */
 
 #if !defined(__noclone)
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 6b79a9bba9a7..c71297d95c74 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -271,4 +271,8 @@ struct ftrace_likely_data {
 # define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
 #endif
 
+#ifndef __alloca_with_align
+#define __alloca_with_align(size, alignment) __builtin_alloca(size)
+#endif
+
 #endif /* __LINUX_COMPILER_TYPES_H */
-- 
2.14.1


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:11 [PATCH] x86: Align TLB invalidation info Nadav Amit
@ 2018-01-31 20:24 ` Andy Lutomirski
  2018-01-31 20:48   ` Nadav Amit
  2018-01-31 21:01 ` Dave Hansen
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2018-01-31 20:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	Peter Zijlstra, Nadav Amit, Andy Lutomirski, Dave Hansen



> On Jan 31, 2018, at 12:11 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> The TLB invalidation info is allocated on the stack, which might cause
> it to be unaligned. Since this information may be transferred to
> different cores for TLB shootdown, this might result in an additional
> cache-line bouncing between the cores.
> 
> GCC provides a way to deal with it by using
> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
> 

Eww.  How about __aligned?


> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> arch/x86/mm/tlb.c              | 21 +++++++++++----------
> include/linux/compiler-gcc.h   |  5 +++++
> include/linux/compiler_types.h |  4 ++++
> 3 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5bfe61a5e8e3..bab7bb5d982f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -574,37 +574,38 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>                unsigned long end, unsigned long vmflag)
> {
> +    struct flush_tlb_info *info;
>    int cpu;
> 
> -    struct flush_tlb_info info = {
> -        .mm = mm,
> -    };
> +    info = __alloca_with_align(sizeof(*info),
> +                   SMP_CACHE_BYTES * BITS_PER_BYTE);
> +    info->mm = mm;
> 
>    cpu = get_cpu();
> 
>    /* This is also a barrier that synchronizes with switch_mm(). */
> -    info.new_tlb_gen = inc_mm_tlb_gen(mm);
> +    info->new_tlb_gen = inc_mm_tlb_gen(mm);
> 
>    /* Should we flush just the requested range? */
>    if ((end != TLB_FLUSH_ALL) &&
>        !(vmflag & VM_HUGETLB) &&
>        ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
> -        info.start = start;
> -        info.end = end;
> +        info->start = start;
> +        info->end = end;
>    } else {
> -        info.start = 0UL;
> -        info.end = TLB_FLUSH_ALL;
> +        info->start = 0UL;
> +        info->end = TLB_FLUSH_ALL;
>    }
> 
>    if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>        VM_WARN_ON(irqs_disabled());
>        local_irq_disable();
> -        flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> +        flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
>        local_irq_enable();
>    }
> 
>    if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> -        flush_tlb_others(mm_cpumask(mm), &info);
> +        flush_tlb_others(mm_cpumask(mm), info);
> 
>    put_cpu();
> }
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index 631354acfa72..aea9a2e69417 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -314,6 +314,11 @@
> #define __designated_init __attribute__((designated_init))
> #endif
> 
> +#if GCC_VERSION >= 60100
> +#define __alloca_with_align(size, alignment)                \
> +    __builtin_alloca_with_align(size, alignment)
> +#endif
> +
> #endif    /* gcc version >= 40000 specific checks */
> 
> #if !defined(__noclone)
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index 6b79a9bba9a7..c71297d95c74 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> @@ -271,4 +271,8 @@ struct ftrace_likely_data {
> # define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> #endif
> 
> +#ifndef __alloca_with_align
> +#define __alloca_with_align(size, alignment) __builtin_alloca(size)
> +#endif
> +
> #endif /* __LINUX_COMPILER_TYPES_H */
> -- 
> 2.14.1
> 


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:24 ` Andy Lutomirski
@ 2018-01-31 20:48   ` Nadav Amit
  2018-01-31 20:49     ` Dave Hansen
  2018-01-31 21:00     ` Andy Lutomirski
  0 siblings, 2 replies; 13+ messages in thread
From: Nadav Amit @ 2018-01-31 20:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen

Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> The TLB invalidation info is allocated on the stack, which might cause
>> it to be unaligned. Since this information may be transferred to
>> different cores for TLB shootdown, this might result in an additional
>> cache-line bouncing between the cores.
>> 
>> GCC provides a way to deal with it by using
>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
> 
> Eww.  How about __aligned?

Err.. Stupid me. For some reason I remembered trying it and it not having
the desired effect, which led me to assume it does not work for variables
on the stack. Anyhow, it does work. I’ll submit v2.


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:48   ` Nadav Amit
@ 2018-01-31 20:49     ` Dave Hansen
  2018-01-31 20:52       ` Nadav Amit
  2018-01-31 21:00     ` Andy Lutomirski
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-01-31 20:49 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	Peter Zijlstra, Andy Lutomirski

On 01/31/2018 12:48 PM, Nadav Amit wrote:
>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <namit@vmware.com> wrote:
>>>
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>> Eww.  How about __aligned?
> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
> the desired effect, which caused me to assume it does not work for variables
> on the stack. Anyhow, it does the work. I’ll submit v2.

Also, just curious, but did you find this situation by inspection or did
it show up in a profile somewhere?


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:49     ` Dave Hansen
@ 2018-01-31 20:52       ` Nadav Amit
  0 siblings, 0 replies; 13+ messages in thread
From: Nadav Amit @ 2018-01-31 20:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, Peter Zijlstra, Andy Lutomirski

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 01/31/2018 12:48 PM, Nadav Amit wrote:
>>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> The TLB invalidation info is allocated on the stack, which might cause
>>>> it to be unaligned. Since this information may be transferred to
>>>> different cores for TLB shootdown, this might result in an additional
>>>> cache-line bouncing between the cores.
>>>> 
>>>> GCC provides a way to deal with it by using
>>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>> Eww.  How about __aligned?
>> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
>> the desired effect, which caused me to assume it does not work for variables
>> on the stack. Anyhow, it does the work. I’ll submit v2.
> 
> Also, just curious, but did you find this situation by inspection or did
> it show up in a profile somewhere?

Actually, no. I considered adding data to info for a different reason
(which I eventually abandoned); I assumed you would all kill me for
increasing the size, so this was a preemptive step.


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:48   ` Nadav Amit
  2018-01-31 20:49     ` Dave Hansen
@ 2018-01-31 21:00     ` Andy Lutomirski
  1 sibling, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2018-01-31 21:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen

On Wed, Jan 31, 2018 at 12:48 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>>
>>
>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <namit@vmware.com> wrote:
>>>
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>
>> Eww.  How about __aligned?
>
> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
> the desired effect, which caused me to assume it does not work for variables
> on the stack. Anyhow, it does the work. I’ll submit v2.
>

You're probably remembering that __aligned(16) malfunctions on older
GCC versions.  But __aligned(64), which is what you want, has always
been okay AFAIK.
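
A minimal sketch of that approach, keeping the on-stack variable and only
forcing its alignment (assuming __aligned() expands to the usual
__attribute__((aligned(x))) wrapper and SMP_CACHE_BYTES is the 64-byte
x86 cache-line size):

	/* on-stack, but starting on a cache-line boundary */
	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
		.mm = mm,
	};

This keeps the small structure within a single cache line, so remote CPUs
pull in only one line for it.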


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 20:11 [PATCH] x86: Align TLB invalidation info Nadav Amit
  2018-01-31 20:24 ` Andy Lutomirski
@ 2018-01-31 21:01 ` Dave Hansen
  2018-01-31 21:09   ` Nadav Amit
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-01-31 21:01 UTC (permalink / raw)
  To: Nadav Amit, x86
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	Peter Zijlstra, Nadav Amit, Andy Lutomirski

On 01/31/2018 12:11 PM, Nadav Amit wrote:
> The TLB invalidation info is allocated on the stack, which might cause
> it to be unaligned. Since this information may be transferred to
> different cores for TLB shootdown, this might result in an additional
> cache-line bouncing between the cores.
> 
> GCC provides a way to deal with it by using
> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.

It doesn't really *bounce*, though, does it?  I don't see any writes on
the remote side.  The remote use seems entirely read-only.

You also don't have to exhaustively test this, but I'd love to see at
least a sanity check with a microbenchmark (or something) that, yes,
this does help *something*.  Maybe it makes the remote
flush_tlb_func_common() run faster because it's pulling in fewer lines,
or maybe you can even detect fewer misses in there.


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 21:01 ` Dave Hansen
@ 2018-01-31 21:09   ` Nadav Amit
  2018-01-31 21:15     ` Andy Lutomirski
       [not found]     ` <f7270e0d-d1d1-599d-d2c9-ddc2c263e090@linux.intel.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Nadav Amit @ 2018-01-31 21:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, Peter Zijlstra, Andy Lutomirski

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>> The TLB invalidation info is allocated on the stack, which might cause
>> it to be unaligned. Since this information may be transferred to
>> different cores for TLB shootdown, this might result in an additional
>> cache-line bouncing between the cores.
>> 
>> GCC provides a way to deal with it by using
>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
> 
> It doesn't really *bounce*, though, does it?  I don't see any writes on
> the remote side.  The remote use seems entirely read-only.
> 
> You also don't have to exhaustively test this, but I'd love to see at
> least a sanity check with a microbenchmark (or something) that, yes,
> this does help *something*.  Maybe it makes the remote
> flush_tlb_func_common() run faster because it's pulling in fewer lines,
> or maybe you can even detect fewer misses in there.

I agree that with the whole Meltdown/Spectre entry cost it might not even be
measurable, at least on small (< 2 sockets) machines. But I do not think it
is worth profiling. Basically, AFAIK, all the data structures that the kernel
uses for inter-processor communication are aligned, and this one is an
exception.
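
For comparison, the usual kernel idiom for such cross-CPU data is an explicit
cache-line annotation; a hypothetical example (not code from this thread):

	/* statically allocated data that remote CPUs read */
	static struct flush_data {
		u64 gen;
		/* ... other fields read remotely ... */
	} shootdown_data ____cacheline_aligned_in_smp;

The on-stack flush_tlb_info gets no such guarantee, hence this patch.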


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 21:09   ` Nadav Amit
@ 2018-01-31 21:15     ` Andy Lutomirski
  2018-01-31 21:17       ` Nadav Amit
       [not found]     ` <f7270e0d-d1d1-599d-d2c9-ddc2c263e090@linux.intel.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2018-01-31 21:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, the arch/x86 maintainers, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, LKML, Peter Zijlstra,
	Andy Lutomirski

On Wed, Jan 31, 2018 at 1:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>
>> It doesn't really *bounce*, though, does it?  I don't see any writes on
>> the remote side.  The remote use seems entirely read-only.
>>
>> You also don't have to exhaustively test this, but I'd love to see at
>> least a sanity check with a microbenchmark (or something) that, yes,
>> this does help *something*.  Maybe it makes the remote
>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>> or maybe you can even detect fewer misses in there.
>
> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
> measurable, at least on small ( < 2 sockets) machines. But I do not think it
> worth profiling. Basically, AFAIK, all the data structures that are used for
> inter-processor communication by the kernel are aligned, and this is an
> exception.
>

This is only going to be measurable at all on NUMA, I suspect.


* Re: [PATCH] x86: Align TLB invalidation info
  2018-01-31 21:15     ` Andy Lutomirski
@ 2018-01-31 21:17       ` Nadav Amit
  0 siblings, 0 replies; 13+ messages in thread
From: Nadav Amit @ 2018-01-31 21:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, the arch/x86 maintainers, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, LKML, Peter Zijlstra

Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Jan 31, 2018 at 1:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> 
>>> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>>>> The TLB invalidation info is allocated on the stack, which might cause
>>>> it to be unaligned. Since this information may be transferred to
>>>> different cores for TLB shootdown, this might result in an additional
>>>> cache-line bouncing between the cores.
>>>> 
>>>> GCC provides a way to deal with it by using
>>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>> 
>>> It doesn't really *bounce*, though, does it?  I don't see any writes on
>>> the remote side.  The remote use seems entirely read-only.
>>> 
>>> You also don't have to exhaustively test this, but I'd love to see at
>>> least a sanity check with a microbenchmark (or something) that, yes,
>>> this does help *something*.  Maybe it makes the remote
>>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>>> or maybe you can even detect fewer misses in there.
>> 
>> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
>> measurable, at least on small ( < 2 sockets) machines. But I do not think it
>> worth profiling. Basically, AFAIK, all the data structures that are used for
>> inter-processor communication by the kernel are aligned, and this is an
>> exception.
> 
> This is only going to be measurable at all on NUMA, I suspect.

Yes, I meant <= 2 ... 


* Re: [PATCH] x86: Align TLB invalidation info
       [not found]     ` <f7270e0d-d1d1-599d-d2c9-ddc2c263e090@linux.intel.com>
@ 2018-02-01  5:38       ` Nadav Amit
  2018-02-01  9:38         ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Nadav Amit @ 2018-02-01  5:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: the arch/x86 maintainers, Andy Lutomirski, H. Peter Anvin,
	Peter Zijlstra, LKML, Thomas Gleixner, Ingo Molnar

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 01/31/2018 01:09 PM, Nadav Amit wrote:
>>> You also don't have to exhaustively test this, but I'd love to see at
>>> least a sanity check with a microbenchmark (or something) that, yes,
>>> this does help *something*.  Maybe it makes the remote
>>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>>> or maybe you can even detect fewer misses in there.
>> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
>> measurable, at least on small ( < 2 sockets) machines. But I do not think it
>> worth profiling. Basically, AFAIK, all the data structures that are used for
>> inter-processor communication by the kernel are aligned, and this is an
>> exception.
> 
> I'm certainly not nak'ing this.  I think your patch is likely a good
> idea.  But, could you please take ten or twenty minutes to go see if
> practice matches your assumptions?  I'd really appreciate it.  If you
> can't measure it, then no biggie.

[CC’ing the mailing list]

Per your request, I measured it (which perhaps I should have done before). I
caused a misalignment intentionally by adding some padding to flush_tlb_info
and compared it with an aligned version.

I used ftrace to measure the execution time of flush_tlb_func_remote() on a
2-socket Haswell machine, using a microbenchmark I wrote for some research
project.

It turns out that your skepticism may be correct: in both cases the
function execution time is roughly 400ns (a 2% improvement in the aligned
case, which is probably noise).

So it is up to you whether you want to discard the patch.

Regards,
Nadav


* Re: [PATCH] x86: Align TLB invalidation info
  2018-02-01  5:38       ` Nadav Amit
@ 2018-02-01  9:38         ` Peter Zijlstra
  2018-02-01 18:45           ` Nadav Amit
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-02-01  9:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, the arch/x86 maintainers, Andy Lutomirski,
	H. Peter Anvin, LKML, Thomas Gleixner, Ingo Molnar

On Wed, Jan 31, 2018 at 09:38:46PM -0800, Nadav Amit wrote:

> I used ftrace to measure the execution time of flush_tlb_func_remote() on a
> 2-socket Haswell machine, using a microbenchmark I wrote for some research
> project.

However cool ftrace is, it is _really_ bad for such uses. The cost of
using ftrace is many, many times higher than any change you could effect
with this.

A microbench and/or perf is what you should use for this.


* Re: [PATCH] x86: Align TLB invalidation info
  2018-02-01  9:38         ` Peter Zijlstra
@ 2018-02-01 18:45           ` Nadav Amit
  0 siblings, 0 replies; 13+ messages in thread
From: Nadav Amit @ 2018-02-01 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, the arch/x86 maintainers, Andy Lutomirski,
	H. Peter Anvin, LKML, Thomas Gleixner, Ingo Molnar


Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jan 31, 2018 at 09:38:46PM -0800, Nadav Amit wrote:
> 
>> I used ftrace to measure the execution time of flush_tlb_func_remote() on a
>> 2-socket Haswell machine, using a microbenchmark I wrote for some research
>> project.
> 
> However cool ftrace is, it is _really_ bad for such uses. The cost of
> using ftrace is many many time higher than any change you could affect
> by this.
> 
> A microbench and/or perf is what you should use for this.

Don’t expect to see a remote NUMA access impact, whose cost is a few tens
of nanoseconds in microbenchmarks. (And indeed I did not.) Each iteration
of #PF - MADV_DONTNEED takes several microseconds, and the impact is lost
in the noise.
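
A sketch of that kind of loop (illustrative only: the CPU numbers, iteration
count and single-page working set are assumptions, not the actual benchmark).
One thread keeps the mm live on a remote CPU while the main thread repeatedly
faults a page in and drops it with MADV_DONTNEED, so every iteration triggers
a remote shootdown:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static volatile char *buf;
static volatile int stop;

static void pin(int cpu)		/* pin the calling thread */
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

static void *remote(void *arg)
{
	pin(1);				/* keep the mm loaded on another CPU */
	while (!stop)
		(void)buf[0];
	return NULL;
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int i, iters = 100000;
	struct timespec t0, t1;
	pthread_t thr;

	buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);	/* error checks omitted */
	pthread_create(&thr, NULL, remote, NULL);
	pin(0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		buf[0] = 1;					/* #PF: populate */
		madvise((void *)buf, page, MADV_DONTNEED);	/* remote shootdown */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	stop = 1;
	pthread_join(thr, NULL);
	printf("%.0f ns/iter\n", ((t1.tv_sec - t0.tv_sec) * 1e9 +
				  t1.tv_nsec - t0.tv_nsec) / iters);
	return 0;
}

Built with something like "cc -O2 -pthread", comparing an aligned against a
deliberately misaligned flush_tlb_info then comes down to comparing these
per-iteration times (or perf cache-miss counters on the remote CPU).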

You are right that ftrace introduces overhead, but the variance is
relatively low. If I stretch the struct to three cache lines, I see a 20ns
overhead. Anyhow, I think this line of code got more than its fair share of
attention.


