All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
@ 2015-04-03 11:13 Denys Vlasenko
  2015-04-03 14:03 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2015-04-03 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Andy Lutomirski, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

Interrupt entry points are handled with the following code:
Each 32-byte code block contains seven entry points

		...
		[push][jump 22] // 4 bytes
		[push][jump 18] // 4 bytes
		[push][jump 14] // 4 bytes
		[push][jump 10] // 4 bytes
		[push][jump  6] // 4 bytes
		[push][jump  2] // 4 bytes
		[push][jump common_interrupt][padding] // 8 bytes

		[push][jump]
		[push][jump]
		[push][jump]
		[push][jump]
		[push][jump]
		[push][jump]
		[push][jump common_interrupt][padding]

		[padding_2]
	common_interrupt:

The first six jumps are short (2-byte insns) jumps to the last
jump in the block. The last jump usually has to be a longer, 5-byte form.

There are about 30 such blocks.

This change uses the fact that last few of these blocks are close to
"common_interrupt" label and the last jump there can also be a short one.

This allows several last 32-byte code blocks to contain eight, not seven entry points,
and to have all these entry points jump directly to common_interrupt,
eliminating one branch. They look like this:

		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes
		[push][jump common_interrupt] // 4 bytes

This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table
and before common_interrupt with 32-byte alignment.

The first alignment was completely unnecessary:
the dispatch table will not benefit from any alignment bigger than 32 bytes
(any entry point needs at most only 32 bytes of table to be read by the CPU).

The alignment before common_interrupt label affects the size of padding
(the "[padding_2]" above) and excessive padding reduces the number of blocks
which can be optimized. In the .config I tested with, this alignment was 64 bytes,
this means two fewer blocks could be optimized. I believe 32-byte alignment
should do just fine.

The condition which controls the choice of a direct jump versus short one,

    /* Can we reach common_interrupt with a short jump? */
    .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)

was tested to be a passimistic one: the generated code will not erroneously
generate a longer than necessary jump even with maximum possible padding
and with -1 subtracted from the right hand side above.
This was verified by checking disassembly.

However, with current value of FIRST_SYSTEM_VECTOR it so happens
that the padding before "common_interrupt" is zero bytes.

 Disassembly before                           After
 ==========================================   ==================================
...                                           ...
840 6a9b           push $0xffffffffffffff9b   82c 6a9d  push $0xffffffffffffff9d
842 eb16           jmp  85a                   82e eb30  jmp  860
844 6a9a           push $0xffffffffffffff9a   830 6a9c  push $0xffffffffffffff9c
846 eb12           jmp  85a                   832 eb2c  jmp  860
848 6a99           push $0xffffffffffffff99   834 6a9b  push $0xffffffffffffff9b
84a eb0e           jmp  85a                   836 eb28  jmp  860
84c 6a98           push $0xffffffffffffff98   838 6a9a  push $0xffffffffffffff9a
84e eb0a           jmp  85a                   83a eb24  jmp  860
850 6a97           push $0xffffffffffffff97   83c 6a99  push $0xffffffffffffff99
852 eb06           jmp  85a                   83e eb20  jmp  860
854 6a96           push $0xffffffffffffff96   840 6a98  push $0xffffffffffffff98
856 eb02           jmp  85a                   842 eb1c  jmp  860
858 6a95           push $0xffffffffffffff95   844 6a97  push $0xffffffffffffff97
85a eb24           jmp  880                   846 eb18  jmp  860
85c 0f1f4000       nopl 0x0(%rax)             848 6a96  push $0xffffffffffffff96
860 6a94           push $0xffffffffffffff94   84a eb14  jmp  860
862 eb0c           jmp  870                   84c 6a95  push $0xffffffffffffff95
864 6a93           push $0xffffffffffffff93   84e eb10  jmp  860
866 eb08           jmp  870                   850 6a94  push $0xffffffffffffff94
868 6a92           push $0xffffffffffffff92   852 eb0c  jmp  860
86a eb04           jmp  870                   854 6a93  push $0xffffffffffffff93
86c 6a91           push $0xffffffffffffff91   856 eb08  jmp  860
86e eb00           jmp  870                   858 6a92  push $0xffffffffffffff92
870 eb0e           jmp  880                   85a eb04  jmp  860
872 66666666662e0f data32 data32 data32 data3285c 6a91  push $0xffffffffffffff91
879 1f840000000000 nopw %cs:0x0(%rax,%rax,1)  85e eb00  jmp  860
880 <common_interrupt>:                       860 <common_interrupt>:

 Sizes:
   text	   data	    bss	    dec	    hex	filename
  12578	      0	      0	  12578	   3122	entry_64.o.before
  12546	      0	      0	  12546	   3102	entry_64.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index da137f9..2003417 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -537,33 +537,70 @@ END(ret_from_fork)
  * Build the entry stubs and pointer table with some assembler magic.
  * We pack 7 stubs into a single 32-byte chunk, which will fit in a
  * single cache line on all modern x86 implementations.
+ * The last few cachelines pack 8 stubs each.
  */
+ALIGN_common_interrupt=32
 	.section .init.rodata,"a"
 ENTRY(interrupt)
 	.section .entry.text
-	.p2align 5
-	.p2align CONFIG_X86_L1_CACHE_SHIFT
+	.align 32
 ENTRY(irq_entries_start)
 	INTR_FRAME
 vector=FIRST_EXTERNAL_VECTOR
-.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
-	.balign 32
+.rept 256 /* this number does not need to be exact, just big enough */
+
+  /*
+   * Block of six "push + short_jump" pairs, 4 bytes each,
+   * and a 2-byte seventh "push", without its jump yet.
+   */
+need_near_jump=0
+push_count=0
   .rept	7
     .if vector < FIRST_SYSTEM_VECTOR
-      .if vector <> FIRST_EXTERNAL_VECTOR
+1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
+	.previous
+	.quad 1b
+	.section .entry.text
+vector=vector+1
+push_count=push_count+1
+      .if push_count < 7
+	/* Can we reach common_interrupt with a short jump? */
+	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
+	  jmp common_interrupt  /* yes */
+	.else
+	  jmp 2f
+need_near_jump=1
+	.endif
 	CFI_ADJUST_CFA_OFFSET -8
       .endif
+    .endif
+  .endr
+
+  /* If we are close to the end, we can pack in yet another pair */
+  .if need_near_jump == 0
+    .if push_count == 7
+	/* The "short jump" for the seventh "push" */
+	jmp common_interrupt
+	CFI_ADJUST_CFA_OFFSET -8
+    .endif
+    .if vector < FIRST_SYSTEM_VECTOR
+	/* "push + short_jump" pair #8 */
 1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
-      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
-	jmp 2f
-      .endif
-      .previous
+	.previous
 	.quad 1b
-      .section .entry.text
+	.section .entry.text
 vector=vector+1
+push_count=push_count+1
+	jmp common_interrupt
+	CFI_ADJUST_CFA_OFFSET -8
     .endif
-  .endr
+  .else
+	/* Use remaining space for "near jump" for the seventh "push" */
 2:	jmp common_interrupt
+	CFI_ADJUST_CFA_OFFSET -8
+	.align 2
+  .endif
+
 .endr
 	CFI_ENDPROC
 END(irq_entries_start)
@@ -632,7 +669,7 @@ END(interrupt)
 	 * The interrupt stubs push (~vector+0x80) onto the stack and
 	 * then jump to common_interrupt.
 	 */
-	.p2align CONFIG_X86_L1_CACHE_SHIFT
+	.align ALIGN_common_interrupt
 common_interrupt:
 	XCPT_FRAME
 	ASM_CLAC
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 11:13 [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter Denys Vlasenko
@ 2015-04-03 14:03 ` Ingo Molnar
  2015-04-03 16:12   ` Denys Vlasenko
  2015-04-03 16:54   ` Denys Vlasenko
  0 siblings, 2 replies; 10+ messages in thread
From: Ingo Molnar @ 2015-04-03 14:03 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> Interrupt entry points are handled with the following code:
> Each 32-byte code block contains seven entry points
> 
> 		...
> 		[push][jump 22] // 4 bytes
> 		[push][jump 18] // 4 bytes
> 		[push][jump 14] // 4 bytes
> 		[push][jump 10] // 4 bytes
> 		[push][jump  6] // 4 bytes
> 		[push][jump  2] // 4 bytes
> 		[push][jump common_interrupt][padding] // 8 bytes
> 
> 		[push][jump]
> 		[push][jump]
> 		[push][jump]
> 		[push][jump]
> 		[push][jump]
> 		[push][jump]
> 		[push][jump common_interrupt][padding]
> 
> 		[padding_2]
> 	common_interrupt:
> 
> The first six jumps are short (2-byte insns) jumps to the last
> jump in the block. The last jump usually has to be a longer, 5-byte form.
> 
> There are about 30 such blocks.
> 
> This change uses the fact that last few of these blocks are close to
> "common_interrupt" label and the last jump there can also be a short one.
> 
> This allows several last 32-byte code blocks to contain eight, not seven entry points,
> and to have all these entry points jump directly to common_interrupt,
> eliminating one branch. They look like this:
> 
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 		[push][jump common_interrupt] // 4 bytes
> 
> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table
> and before common_interrupt with 32-byte alignment.
> 
> The first alignment was completely unnecessary:
> the dispatch table will not benefit from any alignment bigger than 32 bytes
> (any entry point needs at most only 32 bytes of table to be read by the CPU).
> 
> The alignment before common_interrupt label affects the size of padding
> (the "[padding_2]" above) and excessive padding reduces the number of blocks
> which can be optimized. In the .config I tested with, this alignment was 64 bytes,
> this means two fewer blocks could be optimized. I believe 32-byte alignment
> should do just fine.
> 
> The condition which controls the choice of a direct jump versus short one,
> 
>     /* Can we reach common_interrupt with a short jump? */
>     .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
> 
> was tested to be a passimistic one: the generated code will not erroneously
> generate a longer than necessary jump even with maximum possible padding
> and with -1 subtracted from the right hand side above.
> This was verified by checking disassembly.
> 
> However, with current value of FIRST_SYSTEM_VECTOR it so happens
> that the padding before "common_interrupt" is zero bytes.
> 
>  Disassembly before                           After
>  ==========================================   ==================================
> ...                                           ...
> 840 6a9b           push $0xffffffffffffff9b   82c 6a9d  push $0xffffffffffffff9d
> 842 eb16           jmp  85a                   82e eb30  jmp  860
> 844 6a9a           push $0xffffffffffffff9a   830 6a9c  push $0xffffffffffffff9c
> 846 eb12           jmp  85a                   832 eb2c  jmp  860
> 848 6a99           push $0xffffffffffffff99   834 6a9b  push $0xffffffffffffff9b
> 84a eb0e           jmp  85a                   836 eb28  jmp  860
> 84c 6a98           push $0xffffffffffffff98   838 6a9a  push $0xffffffffffffff9a
> 84e eb0a           jmp  85a                   83a eb24  jmp  860
> 850 6a97           push $0xffffffffffffff97   83c 6a99  push $0xffffffffffffff99
> 852 eb06           jmp  85a                   83e eb20  jmp  860
> 854 6a96           push $0xffffffffffffff96   840 6a98  push $0xffffffffffffff98
> 856 eb02           jmp  85a                   842 eb1c  jmp  860
> 858 6a95           push $0xffffffffffffff95   844 6a97  push $0xffffffffffffff97
> 85a eb24           jmp  880                   846 eb18  jmp  860
> 85c 0f1f4000       nopl 0x0(%rax)             848 6a96  push $0xffffffffffffff96
> 860 6a94           push $0xffffffffffffff94   84a eb14  jmp  860
> 862 eb0c           jmp  870                   84c 6a95  push $0xffffffffffffff95
> 864 6a93           push $0xffffffffffffff93   84e eb10  jmp  860
> 866 eb08           jmp  870                   850 6a94  push $0xffffffffffffff94
> 868 6a92           push $0xffffffffffffff92   852 eb0c  jmp  860
> 86a eb04           jmp  870                   854 6a93  push $0xffffffffffffff93
> 86c 6a91           push $0xffffffffffffff91   856 eb08  jmp  860
> 86e eb00           jmp  870                   858 6a92  push $0xffffffffffffff92
> 870 eb0e           jmp  880                   85a eb04  jmp  860
> 872 66666666662e0f data32 data32 data32 data3285c 6a91  push $0xffffffffffffff91
> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1)  85e eb00  jmp  860
> 880 <common_interrupt>:                       860 <common_interrupt>:
> 
>  Sizes:
>    text	   data	    bss	    dec	    hex	filename
>   12578	      0	      0	  12578	   3122	entry_64.o.before
>   12546	      0	      0	  12546	   3102	entry_64.o
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index da137f9..2003417 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -537,33 +537,70 @@ END(ret_from_fork)
>   * Build the entry stubs and pointer table with some assembler magic.
>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>   * single cache line on all modern x86 implementations.
> + * The last few cachelines pack 8 stubs each.
>   */
> +ALIGN_common_interrupt=32
>  	.section .init.rodata,"a"
>  ENTRY(interrupt)
>  	.section .entry.text
> -	.p2align 5
> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
> +	.align 32
>  ENTRY(irq_entries_start)
>  	INTR_FRAME
>  vector=FIRST_EXTERNAL_VECTOR
> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
> -	.balign 32
> +.rept 256 /* this number does not need to be exact, just big enough */
> +
> +  /*
> +   * Block of six "push + short_jump" pairs, 4 bytes each,
> +   * and a 2-byte seventh "push", without its jump yet.
> +   */
> +need_near_jump=0
> +push_count=0
>    .rept	7
>      .if vector < FIRST_SYSTEM_VECTOR
> -      .if vector <> FIRST_EXTERNAL_VECTOR
> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> +	.previous
> +	.quad 1b
> +	.section .entry.text
> +vector=vector+1
> +push_count=push_count+1
> +      .if push_count < 7
> +	/* Can we reach common_interrupt with a short jump? */
> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
> +	  jmp common_interrupt  /* yes */
> +	.else
> +	  jmp 2f
> +need_near_jump=1
> +	.endif
>  	CFI_ADJUST_CFA_OFFSET -8
>        .endif
> +    .endif
> +  .endr
> +
> +  /* If we are close to the end, we can pack in yet another pair */
> +  .if need_near_jump == 0
> +    .if push_count == 7
> +	/* The "short jump" for the seventh "push" */
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +    .endif
> +    .if vector < FIRST_SYSTEM_VECTOR
> +	/* "push + short_jump" pair #8 */
>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> -      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
> -	jmp 2f
> -      .endif
> -      .previous
> +	.previous
>  	.quad 1b
> -      .section .entry.text
> +	.section .entry.text
>  vector=vector+1
> +push_count=push_count+1
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
>      .endif
> -  .endr
> +  .else
> +	/* Use remaining space for "near jump" for the seventh "push" */
>  2:	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +	.align 2
> +  .endif
> +
>  .endr
>  	CFI_ENDPROC
>  END(irq_entries_start)

So I think this might be easier to read if we went low tech and used 
an open-coded 240+ lines assembly file for this, with a few 
obvious-purpose helper macros? It's not like it will change all that 
often so it only has to be generated once.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 14:03 ` Ingo Molnar
@ 2015-04-03 16:12   ` Denys Vlasenko
  2015-04-03 16:54   ` Denys Vlasenko
  1 sibling, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2015-04-03 16:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

On 04/03/2015 04:03 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
>> Interrupt entry points are handled with the following code:
>> Each 32-byte code block contains seven entry points
>>
>> 		...
>> 		[push][jump 22] // 4 bytes
>> 		[push][jump 18] // 4 bytes
>> 		[push][jump 14] // 4 bytes
>> 		[push][jump 10] // 4 bytes
>> 		[push][jump  6] // 4 bytes
>> 		[push][jump  2] // 4 bytes
>> 		[push][jump common_interrupt][padding] // 8 bytes
>>
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump common_interrupt][padding]
>>
>> 		[padding_2]
>> 	common_interrupt:
>>
>> The first six jumps are short (2-byte insns) jumps to the last
>> jump in the block. The last jump usually has to be a longer, 5-byte form.
>>
>> There are about 30 such blocks.
>>
>> This change uses the fact that last few of these blocks are close to
>> "common_interrupt" label and the last jump there can also be a short one.
>>
>> This allows several last 32-byte code blocks to contain eight, not seven entry points,
>> and to have all these entry points jump directly to common_interrupt,
>> eliminating one branch. They look like this:
>>
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>>
>> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table
>> and before common_interrupt with 32-byte alignment.
>>
>> The first alignment was completely unnecessary:
>> the dispatch table will not benefit from any alignment bigger than 32 bytes
>> (any entry point needs at most only 32 bytes of table to be read by the CPU).
>>
>> The alignment before common_interrupt label affects the size of padding
>> (the "[padding_2]" above) and excessive padding reduces the number of blocks
>> which can be optimized. In the .config I tested with, this alignment was 64 bytes,
>> this means two fewer blocks could be optimized. I believe 32-byte alignment
>> should do just fine.
>>
>> The condition which controls the choice of a direct jump versus short one,
>>
>>     /* Can we reach common_interrupt with a short jump? */
>>     .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>>
>> was tested to be a passimistic one: the generated code will not erroneously
>> generate a longer than necessary jump even with maximum possible padding
>> and with -1 subtracted from the right hand side above.
>> This was verified by checking disassembly.
>>
>> However, with current value of FIRST_SYSTEM_VECTOR it so happens
>> that the padding before "common_interrupt" is zero bytes.
>>
>>  Disassembly before                           After
>>  ==========================================   ==================================
>> ...                                           ...
>> 840 6a9b           push $0xffffffffffffff9b   82c 6a9d  push $0xffffffffffffff9d
>> 842 eb16           jmp  85a                   82e eb30  jmp  860
>> 844 6a9a           push $0xffffffffffffff9a   830 6a9c  push $0xffffffffffffff9c
>> 846 eb12           jmp  85a                   832 eb2c  jmp  860
>> 848 6a99           push $0xffffffffffffff99   834 6a9b  push $0xffffffffffffff9b
>> 84a eb0e           jmp  85a                   836 eb28  jmp  860
>> 84c 6a98           push $0xffffffffffffff98   838 6a9a  push $0xffffffffffffff9a
>> 84e eb0a           jmp  85a                   83a eb24  jmp  860
>> 850 6a97           push $0xffffffffffffff97   83c 6a99  push $0xffffffffffffff99
>> 852 eb06           jmp  85a                   83e eb20  jmp  860
>> 854 6a96           push $0xffffffffffffff96   840 6a98  push $0xffffffffffffff98
>> 856 eb02           jmp  85a                   842 eb1c  jmp  860
>> 858 6a95           push $0xffffffffffffff95   844 6a97  push $0xffffffffffffff97
>> 85a eb24           jmp  880                   846 eb18  jmp  860
>> 85c 0f1f4000       nopl 0x0(%rax)             848 6a96  push $0xffffffffffffff96
>> 860 6a94           push $0xffffffffffffff94   84a eb14  jmp  860
>> 862 eb0c           jmp  870                   84c 6a95  push $0xffffffffffffff95
>> 864 6a93           push $0xffffffffffffff93   84e eb10  jmp  860
>> 866 eb08           jmp  870                   850 6a94  push $0xffffffffffffff94
>> 868 6a92           push $0xffffffffffffff92   852 eb0c  jmp  860
>> 86a eb04           jmp  870                   854 6a93  push $0xffffffffffffff93
>> 86c 6a91           push $0xffffffffffffff91   856 eb08  jmp  860
>> 86e eb00           jmp  870                   858 6a92  push $0xffffffffffffff92
>> 870 eb0e           jmp  880                   85a eb04  jmp  860
>> 872 66666666662e0f data32 data32 data32 data3285c 6a91  push $0xffffffffffffff91
>> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1)  85e eb00  jmp  860
>> 880 <common_interrupt>:                       860 <common_interrupt>:
>>
>>  Sizes:
>>    text	   data	    bss	    dec	    hex	filename
>>   12578	      0	      0	  12578	   3122	entry_64.o.before
>>   12546	      0	      0	  12546	   3102	entry_64.o
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Ingo Molnar <mingo@kernel.org>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> CC: Oleg Nesterov <oleg@redhat.com>
>> CC: Frederic Weisbecker <fweisbec@gmail.com>
>> CC: Alexei Starovoitov <ast@plumgrid.com>
>> CC: Will Drewry <wad@chromium.org>
>> CC: Kees Cook <keescook@chromium.org>
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index da137f9..2003417 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -537,33 +537,70 @@ END(ret_from_fork)
>>   * Build the entry stubs and pointer table with some assembler magic.
>>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>>   * single cache line on all modern x86 implementations.
>> + * The last few cachelines pack 8 stubs each.
>>   */
>> +ALIGN_common_interrupt=32
>>  	.section .init.rodata,"a"
>>  ENTRY(interrupt)
>>  	.section .entry.text
>> -	.p2align 5
>> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
>> +	.align 32
>>  ENTRY(irq_entries_start)
>>  	INTR_FRAME
>>  vector=FIRST_EXTERNAL_VECTOR
>> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
>> -	.balign 32
>> +.rept 256 /* this number does not need to be exact, just big enough */
>> +
>> +  /*
>> +   * Block of six "push + short_jump" pairs, 4 bytes each,
>> +   * and a 2-byte seventh "push", without its jump yet.
>> +   */
>> +need_near_jump=0
>> +push_count=0
>>    .rept	7
>>      .if vector < FIRST_SYSTEM_VECTOR
>> -      .if vector <> FIRST_EXTERNAL_VECTOR
>> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> +	.previous
>> +	.quad 1b
>> +	.section .entry.text
>> +vector=vector+1
>> +push_count=push_count+1
>> +      .if push_count < 7
>> +	/* Can we reach common_interrupt with a short jump? */
>> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>> +	  jmp common_interrupt  /* yes */
>> +	.else
>> +	  jmp 2f
>> +need_near_jump=1
>> +	.endif
>>  	CFI_ADJUST_CFA_OFFSET -8
>>        .endif
>> +    .endif
>> +  .endr
>> +
>> +  /* If we are close to the end, we can pack in yet another pair */
>> +  .if need_near_jump == 0
>> +    .if push_count == 7
>> +	/* The "short jump" for the seventh "push" */
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +    .endif
>> +    .if vector < FIRST_SYSTEM_VECTOR
>> +	/* "push + short_jump" pair #8 */
>>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> -      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
>> -	jmp 2f
>> -      .endif
>> -      .previous
>> +	.previous
>>  	.quad 1b
>> -      .section .entry.text
>> +	.section .entry.text
>>  vector=vector+1
>> +push_count=push_count+1
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>>      .endif
>> -  .endr
>> +  .else
>> +	/* Use remaining space for "near jump" for the seventh "push" */
>>  2:	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +	.align 2
>> +  .endif
>> +
>>  .endr
>>  	CFI_ENDPROC
>>  END(irq_entries_start)
> 
> So I think this might be easier to read if we went low tech and used 
> an open-coded 240+ lines assembly file for this, with a few 
> obvious-purpose helper macros?

I tried a few times but the results were monstrous.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 14:03 ` Ingo Molnar
  2015-04-03 16:12   ` Denys Vlasenko
@ 2015-04-03 16:54   ` Denys Vlasenko
  2015-04-03 18:08     ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2015-04-03 16:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

On 04/03/2015 04:03 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
>> Interrupt entry points are handled with the following code:
>> Each 32-byte code block contains seven entry points
>>
>> 		...
>> 		[push][jump 22] // 4 bytes
>> 		[push][jump 18] // 4 bytes
>> 		[push][jump 14] // 4 bytes
>> 		[push][jump 10] // 4 bytes
>> 		[push][jump  6] // 4 bytes
>> 		[push][jump  2] // 4 bytes
>> 		[push][jump common_interrupt][padding] // 8 bytes
>>
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump common_interrupt][padding]
>>
>> 		[padding_2]
>> 	common_interrupt:
>>
>> The first six jumps are short (2-byte insns) jumps to the last
>> jump in the block. The last jump usually has to be a longer, 5-byte form.
>>
>> There are about 30 such blocks.
>>
>> This change uses the fact that last few of these blocks are close to
>> "common_interrupt" label and the last jump there can also be a short one.
>>
>> This allows several last 32-byte code blocks to contain eight, not seven entry points,
>> and to have all these entry points jump directly to common_interrupt,
>> eliminating one branch. They look like this:
>>
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>>
>> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table
>> and before common_interrupt with 32-byte alignment.
>>
>> The first alignment was completely unnecessary:
>> the dispatch table will not benefit from any alignment bigger than 32 bytes
>> (any entry point needs at most only 32 bytes of table to be read by the CPU).
>>
>> The alignment before common_interrupt label affects the size of padding
>> (the "[padding_2]" above) and excessive padding reduces the number of blocks
>> which can be optimized. In the .config I tested with, this alignment was 64 bytes,
>> this means two fewer blocks could be optimized. I believe 32-byte alignment
>> should do just fine.
>>
>> The condition which controls the choice of a direct jump versus short one,
>>
>>     /* Can we reach common_interrupt with a short jump? */
>>     .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>>
>> was tested to be a passimistic one: the generated code will not erroneously
>> generate a longer than necessary jump even with maximum possible padding
>> and with -1 subtracted from the right hand side above.
>> This was verified by checking disassembly.
>>
>> However, with current value of FIRST_SYSTEM_VECTOR it so happens
>> that the padding before "common_interrupt" is zero bytes.
>>
>>  Disassembly before                           After
>>  ==========================================   ==================================
>> ...                                           ...
>> 840 6a9b           push $0xffffffffffffff9b   82c 6a9d  push $0xffffffffffffff9d
>> 842 eb16           jmp  85a                   82e eb30  jmp  860
>> 844 6a9a           push $0xffffffffffffff9a   830 6a9c  push $0xffffffffffffff9c
>> 846 eb12           jmp  85a                   832 eb2c  jmp  860
>> 848 6a99           push $0xffffffffffffff99   834 6a9b  push $0xffffffffffffff9b
>> 84a eb0e           jmp  85a                   836 eb28  jmp  860
>> 84c 6a98           push $0xffffffffffffff98   838 6a9a  push $0xffffffffffffff9a
>> 84e eb0a           jmp  85a                   83a eb24  jmp  860
>> 850 6a97           push $0xffffffffffffff97   83c 6a99  push $0xffffffffffffff99
>> 852 eb06           jmp  85a                   83e eb20  jmp  860
>> 854 6a96           push $0xffffffffffffff96   840 6a98  push $0xffffffffffffff98
>> 856 eb02           jmp  85a                   842 eb1c  jmp  860
>> 858 6a95           push $0xffffffffffffff95   844 6a97  push $0xffffffffffffff97
>> 85a eb24           jmp  880                   846 eb18  jmp  860
>> 85c 0f1f4000       nopl 0x0(%rax)             848 6a96  push $0xffffffffffffff96
>> 860 6a94           push $0xffffffffffffff94   84a eb14  jmp  860
>> 862 eb0c           jmp  870                   84c 6a95  push $0xffffffffffffff95
>> 864 6a93           push $0xffffffffffffff93   84e eb10  jmp  860
>> 866 eb08           jmp  870                   850 6a94  push $0xffffffffffffff94
>> 868 6a92           push $0xffffffffffffff92   852 eb0c  jmp  860
>> 86a eb04           jmp  870                   854 6a93  push $0xffffffffffffff93
>> 86c 6a91           push $0xffffffffffffff91   856 eb08  jmp  860
>> 86e eb00           jmp  870                   858 6a92  push $0xffffffffffffff92
>> 870 eb0e           jmp  880                   85a eb04  jmp  860
>> 872 66666666662e0f data32 data32 data32 data3285c 6a91  push $0xffffffffffffff91
>> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1)  85e eb00  jmp  860
>> 880 <common_interrupt>:                       860 <common_interrupt>:
>>
>>  Sizes:
>>    text	   data	    bss	    dec	    hex	filename
>>   12578	      0	      0	  12578	   3122	entry_64.o.before
>>   12546	      0	      0	  12546	   3102	entry_64.o
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Ingo Molnar <mingo@kernel.org>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> CC: Oleg Nesterov <oleg@redhat.com>
>> CC: Frederic Weisbecker <fweisbec@gmail.com>
>> CC: Alexei Starovoitov <ast@plumgrid.com>
>> CC: Will Drewry <wad@chromium.org>
>> CC: Kees Cook <keescook@chromium.org>
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index da137f9..2003417 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -537,33 +537,70 @@ END(ret_from_fork)
>>   * Build the entry stubs and pointer table with some assembler magic.
>>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>>   * single cache line on all modern x86 implementations.
>> + * The last few cachelines pack 8 stubs each.
>>   */
>> +ALIGN_common_interrupt=32
>>  	.section .init.rodata,"a"
>>  ENTRY(interrupt)
>>  	.section .entry.text
>> -	.p2align 5
>> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
>> +	.align 32
>>  ENTRY(irq_entries_start)
>>  	INTR_FRAME
>>  vector=FIRST_EXTERNAL_VECTOR
>> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
>> -	.balign 32
>> +.rept 256 /* this number does not need to be exact, just big enough */
>> +
>> +  /*
>> +   * Block of six "push + short_jump" pairs, 4 bytes each,
>> +   * and a 2-byte seventh "push", without its jump yet.
>> +   */
>> +need_near_jump=0
>> +push_count=0
>>    .rept	7
>>      .if vector < FIRST_SYSTEM_VECTOR
>> -      .if vector <> FIRST_EXTERNAL_VECTOR
>> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> +	.previous
>> +	.quad 1b
>> +	.section .entry.text
>> +vector=vector+1
>> +push_count=push_count+1
>> +      .if push_count < 7
>> +	/* Can we reach common_interrupt with a short jump? */
>> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>> +	  jmp common_interrupt  /* yes */
>> +	.else
>> +	  jmp 2f
>> +need_near_jump=1
>> +	.endif
>>  	CFI_ADJUST_CFA_OFFSET -8
>>        .endif
>> +    .endif
>> +  .endr
>> +
>> +  /* If we are close to the end, we can pack in yet another pair */
>> +  .if need_near_jump == 0
>> +    .if push_count == 7
>> +	/* The "short jump" for the seventh "push" */
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +    .endif
>> +    .if vector < FIRST_SYSTEM_VECTOR
>> +	/* "push + short_jump" pair #8 */
>>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> -      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
>> -	jmp 2f
>> -      .endif
>> -      .previous
>> +	.previous
>>  	.quad 1b
>> -      .section .entry.text
>> +	.section .entry.text
>>  vector=vector+1
>> +push_count=push_count+1
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>>      .endif
>> -  .endr
>> +  .else
>> +	/* Use remaining space for "near jump" for the seventh "push" */
>>  2:	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +	.align 2
>> +  .endif
>> +
>>  .endr
>>  	CFI_ENDPROC
>>  END(irq_entries_start)
> 
> So I think this might be easier to read if we went low tech and used 
> an open-coded 240+ lines assembly file for this, with a few 
> obvious-purpose helper macros? It's not like it will change all that 
> often so it only has to be generated once.


How about this version?
It's still isn't a star of readability,
but the structure of the 32-byte code block is more visible now...


/*
 * Build the entry stubs and pointer table with some assembler magic.
 * We pack 7 stubs into a single 32-byte chunk, which will fit in a
 * single cache line on all modern x86 implementations.
 * The last few cachelines pack 8 stubs each.
 */
ALIGN_common_interrupt=32
	.section .init.rodata,"a"
ENTRY(interrupt)
	.section .entry.text
	.align 32
ENTRY(irq_entries_start)
	INTR_FRAME

	.macro push_vector
    .if vector < FIRST_SYSTEM_VECTOR
1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
	.previous
	.quad 1b
	.section .entry.text
need_jump=1
vector=vector+1
    .endif
	.endm

	.macro short_jump label
    .if need_jump == 1
	/* Can we reach common_interrupt with a short jump? */
    .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
	jmp	common_interrupt  /* yes */
    .else
	jmp	\label
need_near_jump=1
    .endif
	CFI_ADJUST_CFA_OFFSET -8
need_jump=0
    .endif
	.endm

	.macro near_jump label
	jmp	\label
	CFI_ADJUST_CFA_OFFSET -8
	.endm

vector=FIRST_EXTERNAL_VECTOR
.rept 256 /* this number does not need to be exact, just big enough */
need_near_jump=0
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
    .if need_near_jump == 1
	push_vector; 2:	near_jump common_interrupt; .align 2
    .else
	push_vector; short_jump common_interrupt
	push_vector; short_jump common_interrupt
    .endif
.endr
        CFI_ENDPROC
END(irq_entries_start)

.previous
END(interrupt)
.previous


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 16:54   ` Denys Vlasenko
@ 2015-04-03 18:08     ` Linus Torvalds
  2015-04-03 18:35       ` H. Peter Anvin
  2015-04-03 18:36       ` Denys Vlasenko
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2015-04-03 18:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Fri, Apr 3, 2015 at 9:54 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> How about this version?
> It's still isn't a star of readability,
> but the structure of the 32-byte code block is more visible now...

Do we really even want to be this clever in the first place?

The thing is, when we take an interrupt:

 (a) the L1 I$ is always cold

 (b) the instruction decoder has never had time to run ahead

 (c) there are usually not that many different interrupts anyway, even
under load (ie you'd have maybe disk and networking)

 (d) we intentionally spread out the different interrupt vector numbers

 (e) the 32-byte block thing is questionable, most older
micro-architectures fetch in 16-byte blocks iirc.

So what this tells me is that:

 - (a+b) the jump-to-jump is likely fairly expensive, because even
though they are in the same cacheline, the front end hasn't gotten
ahead of anything, so there's no hiding any front end pipeline
hickups.

 - (c+d) there is likely very little advantage to trying to "pack"
things in cachelines

 - (d+e) the 7-instructions-in-one-32-byte-block doesn't really sound
all that big of a win, and it does cause a 16-byte split for some
interrupt.

In other words, I'd suggest that we just use simple unconditional
5-byte branch instead. Add the two-byte "push" instruction, you have 7
bytes per interrupt. Align that 7 bytes up to 8, and none of them ever
cross a 16-byte boundary.

Simple, clean, and slightly bigger in memory footprint, but probably
not noticeably more so in cache footprint, simply because there
usually aren't that many active interrupts anyway.

The people who do millions of networking interrupts per second and
have network cards that steer things to many different interrupts
already try to make sure that the steering goes to different CPU's -
otherwise there wouldn't be any *point* to steering things. So that
particular case of "lots of active interrupts" doesn't have a bigger
cache footprint *either*, since any particular CPU L1 I$ will still
only handle a few interrupts.

So you get "only" 4 interrupt cases per 32 bytes rather than 7. But is
that odd double jump and all this complexity really worth it?

So I really suggest just doing something stupid and straightforward
(and completely untested) like this:

    .macro push_vector
        pushq_cfi $(~vector+0x80)
        jmp common_interrupt
        .align 8
    .endm

    vector=FIRST_EXTERNAL_VECTOR
    .align 64
    ENTRY(irq_entries_start)
    .rept 256 /* this number does not need to be exact, just big enough */
         make_vector
    .endr

and just be done with it.

(Of course, you have to change the code that knows about the "7
entries in 32 bytes" patterns too, but that's just going to be much
simpler now).

                     Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 18:08     ` Linus Torvalds
@ 2015-04-03 18:35       ` H. Peter Anvin
  2015-04-03 18:37         ` Linus Torvalds
  2015-04-03 18:36       ` Denys Vlasenko
  1 sibling, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2015-04-03 18:35 UTC (permalink / raw)
  To: Linus Torvalds, Denys Vlasenko
  Cc: Ingo Molnar, Steven Rostedt, Borislav Petkov, Andy Lutomirski,
	Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
	Will Drewry, Kees Cook, the arch/x86 maintainers,
	Linux Kernel Mailing List

On 04/03/2015 11:08 AM, Linus Torvalds wrote:
> On Fri, Apr 3, 2015 at 9:54 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> How about this version?
>> It's still isn't a star of readability,
>> but the structure of the 32-byte code block is more visible now...
> 
> Do we really even want to be this clever in the first place?
> 
> The thing is, when we take an interrupt:
> 
>  (a) the L1 I$ is always cold
> 
>  (b) the instruction decoder has never had time to run ahead
> 
>  (c) there are usually not that many different interrupts anyway, even
> under load (ie you'd have maybe disk and networking)
> 
>  (d) we intentionally spread out the different interrupt vector numbers
> 
>  (e) the 32-byte block thing is questionable, most older
> micro-architectures fetch in 16-byte blocks iirc.
> 

For the record, I actually measured the impact of the jump-to-jump when
I wrote it.  It has a small, *but measurable*, positive impact.

	-hpa


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 18:08     ` Linus Torvalds
  2015-04-03 18:35       ` H. Peter Anvin
@ 2015-04-03 18:36       ` Denys Vlasenko
  1 sibling, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2015-04-03 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 04/03/2015 08:08 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2015 at 9:54 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> How about this version?
>> It's still isn't a star of readability,
>> but the structure of the 32-byte code block is more visible now...
> 
> Do we really even want to be this clever in the first place?
> 
> The thing is, when we take an interrupt:
> 
>  (a) the L1 I$ is always cold
> 
>  (b) the instruction decoder has never had time to run ahead
> 
>  (c) there are usually not that many different interrupts anyway, even
> under load (ie you'd have maybe disk and networking)
> 
>  (d) we intentionally spread out the different interrupt vector numbers
> 
>  (e) the 32-byte block thing is questionable, most older
> micro-architectures fetch in 16-byte blocks iirc.
> 
> So what this tells me is that:
> 
>  - (a+b) the jump-to-jump is likely fairly expensive, because even
> though they are in the same cacheline, the front end hasn't gotten
> ahead of anything, so there's no hiding any front end pipeline
> hickups.
> 
>  - (c+d) there is likely very little advantage to trying to "pack"
> things in cachelines

Good points.

>  - (d+e) the 7-instructions-in-one-32-byte-block doesn't really sound
> all that big of a win, and it does cause a 16-byte split for some
> interrupt.

No, this doesn't happen. With current code, none of instructions
cross 16-byte split. Even 8-byte boundary is never crossed.

> In other words, I'd suggest that we just use simple unconditional
> 5-byte branch instead. Add the two-byte "push" instruction, you have 7
> bytes per interrupt. Align that 7 bytes up to 8, and none of them ever
> cross a 16-byte boundary.
> 
> Simple, clean, and slightly bigger in memory footprint, but probably
> not noticeably more so in cache footprint, simply because there
> usually aren't that many active interrupts anyway.
> 
> The people who do millions of networking interrupts per second and
> have network cards that steer things to many different interrupts
> already try to make sure that the steering goes to different CPU's -
> otherwise there wouldn't be any *point* to steering things. So that
> particular case of "lots of active interrupts" doesn't have a bigger
> cache footprint *either*, since any particular CPU L1 I$ will still
> only handle a few interrupts.
> 
> So you get "only" 4 interrupt cases per 32 bytes rather than 7. But is
> that odd double jump and all this complexity really worth it?
> 
> So I really suggest just doing something stupid and straightforward
> (and completely untested) like this:
> 
>     .macro push_vector
>         pushq_cfi $(~vector+0x80)
>         jmp common_interrupt
>         .align 8
>     .endm
> 
>     vector=FIRST_EXTERNAL_VECTOR
>     .align 64
>     ENTRY(irq_entries_start)
>     .rept 256 /* this number does not need to be exact, just big enough */
>          make_vector
>     .endr
> 
> and just be done with it.
> 
> (Of course, you have to change the code that knows about the "7
> entries in 32 bytes" patterns too, but that's just going to be much
> simpler now).

I'll send a patch in ~30 minutes.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 18:35       ` H. Peter Anvin
@ 2015-04-03 18:37         ` Linus Torvalds
  2015-04-03 19:06           ` H. Peter Anvin
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2015-04-03 18:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Denys Vlasenko, Ingo Molnar, Steven Rostedt, Borislav Petkov,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Fri, Apr 3, 2015 at 11:35 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> For the record, I actually measured the impact of the jump-to-jump when
> I wrote it.  It has a small, *but measurable*, positive impact.

What did you compare against, and how did you measure that? I don't
see how it could *possibly* be faster than just a simple aligned "push
+ jmp".

                      Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 18:37         ` Linus Torvalds
@ 2015-04-03 19:06           ` H. Peter Anvin
  2015-04-04  6:42             ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2015-04-03 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Ingo Molnar, Steven Rostedt, Borislav Petkov,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 04/03/2015 11:37 AM, Linus Torvalds wrote:
> On Fri, Apr 3, 2015 at 11:35 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> For the record, I actually measured the impact of the jump-to-jump when
>> I wrote it.  It has a small, *but measurable*, positive impact.
> 
> What did you compare against, and how did you measure that? I don't
> see how it could *possibly* be faster than just a simple aligned "push
> + jmp".
> 

I wish I remembered the exact details; it took a fair bit of gathering
numbers as the spread was quite a bit wider than the delta, but in the
end there were two distribution peaks clearly offset.

I seem to remember it involving a loop running RDTSC continuously and
another RDTSC in the interrupt path.

	-hpa



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
  2015-04-03 19:06           ` H. Peter Anvin
@ 2015-04-04  6:42             ` Ingo Molnar
  0 siblings, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2015-04-04  6:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Denys Vlasenko, Steven Rostedt, Borislav Petkov,
	Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 04/03/2015 11:37 AM, Linus Torvalds wrote:
> > On Fri, Apr 3, 2015 at 11:35 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> For the record, I actually measured the impact of the jump-to-jump when
> >> I wrote it.  It has a small, *but measurable*, positive impact.
> > 
> > What did you compare against, and how did you measure that? I don't
> > see how it could *possibly* be faster than just a simple aligned "push
> > + jmp".
> > 
> 
> I wish I remembered the exact details; it took a fair bit of gathering
> numbers as the spread was quite a bit wider than the delta, but in the
> end there were two distribution peaks clearly offset.
> 
> I seem to remember it involving a loop running RDTSC continuously and
> another RDTSC in the interrupt path.

So the thing is, while I don't know how you've loaded the machine, if 
user-space is doing nothing but looping in RDTSC, the kernel I$ might 
become cache hot easily not just in L2 but in L1 cache as well.

But 'when the machine is not doing anything' is not what we optimize 
for, we (try to) optimize for the case when there's a lot of work 
going on and the async context (irq, fault, etc.) I$ is likely 
cache-cold.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-04-04  6:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-03 11:13 [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter Denys Vlasenko
2015-04-03 14:03 ` Ingo Molnar
2015-04-03 16:12   ` Denys Vlasenko
2015-04-03 16:54   ` Denys Vlasenko
2015-04-03 18:08     ` Linus Torvalds
2015-04-03 18:35       ` H. Peter Anvin
2015-04-03 18:37         ` Linus Torvalds
2015-04-03 19:06           ` H. Peter Anvin
2015-04-04  6:42             ` Ingo Molnar
2015-04-03 18:36       ` Denys Vlasenko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.