* [PATCH] x86: Deinline cpuid_eax and friends
From: Denys Vlasenko @ 2015-05-06 17:07 UTC
  To: Ingo Molnar
  Cc: Denys Vlasenko, Steven Rostedt, Borislav Petkov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, Alexei Starovoitov,
	Will Drewry, Kees Cook, x86, linux-kernel

cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
On x86 allyesconfig build they have 48 callsites.
Deinlining all four of them shrinks kernel by about 1k:

   text      data      bss       dec     hex filename
82434909 22255384 20627456 125317749 7783275 vmlinux.before
82433898 22255384 20627456 125316738 7782e82 vmlinux

Speed impact: CPUID instruction takes from 50 to 350+ cycles,
call overhead is negligible in comparison.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/processor.h | 39 ++++--------------------------------
 arch/x86/kernel/cpu/common.c     | 43 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ec1c935..67e1974 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -616,41 +616,10 @@ static inline void cpuid_count(unsigned int op, int count,
 /*
  * CPUID functions returning a single datum
  */
-static inline unsigned int cpuid_eax(unsigned int op)
-{
-	unsigned int eax, ebx, ecx, edx;
-
-	cpuid(op, &eax, &ebx, &ecx, &edx);
-
-	return eax;
-}
-
-static inline unsigned int cpuid_ebx(unsigned int op)
-{
-	unsigned int eax, ebx, ecx, edx;
-
-	cpuid(op, &eax, &ebx, &ecx, &edx);
-
-	return ebx;
-}
-
-static inline unsigned int cpuid_ecx(unsigned int op)
-{
-	unsigned int eax, ebx, ecx, edx;
-
-	cpuid(op, &eax, &ebx, &ecx, &edx);
-
-	return ecx;
-}
-
-static inline unsigned int cpuid_edx(unsigned int op)
-{
-	unsigned int eax, ebx, ecx, edx;
-
-	cpuid(op, &eax, &ebx, &ecx, &edx);
-
-	return edx;
-}
+unsigned int cpuid_eax(unsigned int op);
+unsigned int cpuid_ebx(unsigned int op);
+unsigned int cpuid_ecx(unsigned int op);
+unsigned int cpuid_edx(unsigned int op);
 
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
 static inline void rep_nop(void)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2346c95..1d2e270 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -307,6 +307,49 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 }
 
 /*
+ * CPUID functions returning a single datum
+ */
+unsigned int cpuid_eax(unsigned int op)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	cpuid(op, &eax, &ebx, &ecx, &edx);
+
+	return eax;
+}
+EXPORT_SYMBOL(cpuid_eax);
+
+unsigned int cpuid_ebx(unsigned int op)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	cpuid(op, &eax, &ebx, &ecx, &edx);
+
+	return ebx;
+}
+EXPORT_SYMBOL(cpuid_ebx);
+
+unsigned int cpuid_ecx(unsigned int op)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	cpuid(op, &eax, &ebx, &ecx, &edx);
+
+	return ecx;
+}
+EXPORT_SYMBOL(cpuid_ecx);
+
+unsigned int cpuid_edx(unsigned int op)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	cpuid(op, &eax, &ebx, &ecx, &edx);
+
+	return edx;
+}
+EXPORT_SYMBOL(cpuid_edx);
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
-- 
1.8.1.4


* Re: [PATCH] x86: Deinline cpuid_eax and friends
From: H. Peter Anvin @ 2015-05-06 18:59 UTC
  To: Denys Vlasenko, Ingo Molnar
  Cc: Steven Rostedt, Borislav Petkov, Andy Lutomirski,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

On 05/06/2015 10:07 AM, Denys Vlasenko wrote:
> cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
> On x86 allyesconfig build they have 48 callsites.
> Deinlining all four of them shrinks kernel by about 1k:
> 
>    text      data      bss       dec     hex filename
> 82434909 22255384 20627456 125317749 7783275 vmlinux.before
> 82433898 22255384 20627456 125316738 7782e82 vmlinux
> 
> Speed impact: CPUID instruction takes from 50 to 350+ cycles,
> call overhead is negligible in comparison.

How on Earth does it make 44 bytes?  Is this due to paravirt_fail?

	-hpa


* Re: [PATCH] x86: Deinline cpuid_eax and friends
From: Denys Vlasenko @ 2015-05-06 19:09 UTC
  To: H. Peter Anvin, Ingo Molnar
  Cc: Steven Rostedt, Borislav Petkov, Andy Lutomirski,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

On 05/06/2015 08:59 PM, H. Peter Anvin wrote:
> On 05/06/2015 10:07 AM, Denys Vlasenko wrote:
>> cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
>> On x86 allyesconfig build they have 48 callsites.
>> Deinlining all four of them shrinks kernel by about 1k:
>>
>>    text      data      bss       dec     hex filename
>> 82434909 22255384 20627456 125317749 7783275 vmlinux.before
>> 82433898 22255384 20627456 125316738 7782e82 vmlinux
>>
>> Speed impact: CPUID instruction takes from 50 to 350+ cycles,
>> call overhead is negligible in comparison.
> 
> How on Earth does it make 44 bytes?  Is this due to paravirt_fail?

No, just this construct

        unsigned int eax, ebx, ecx, edx;
        cpuid(op, &eax, &ebx, &ecx, &edx);

is not really that cheap to set up. You need to allocate
variables on stack and take address of each:

ffffffff81063668 <cpuid_eax>:
ffffffff81063668:       55                      push   %rbp
ffffffff81063669:       48 89 e5                mov    %rsp,%rbp
ffffffff8106366c:       48 83 ec 10             sub    $0x10,%rsp
ffffffff81063670:       48 8d 4d fc             lea    -0x4(%rbp),%rcx
ffffffff81063674:       89 7d f0                mov    %edi,-0x10(%rbp)
ffffffff81063677:       48 8d 55 f8             lea    -0x8(%rbp),%rdx
ffffffff8106367b:       48 8d 75 f4             lea    -0xc(%rbp),%rsi
ffffffff8106367f:       48 8d 7d f0             lea    -0x10(%rbp),%rdi
ffffffff81063683:       c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
ffffffff8106368a:       e8 3c ff ff ff          callq  ffffffff810635cb <__cpuid>
ffffffff8106368f:       8b 45 f0                mov    -0x10(%rbp),%eax
ffffffff81063692:       c9                      leaveq
ffffffff81063693:       c3                      retq

-- 
vda

* Re: [PATCH] x86: Deinline cpuid_eax and friends
From: H. Peter Anvin @ 2015-05-06 20:41 UTC
  To: Denys Vlasenko, Ingo Molnar
  Cc: Steven Rostedt, Borislav Petkov, Andy Lutomirski,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

On 05/06/2015 12:09 PM, Denys Vlasenko wrote:
>>
>> How on Earth does it make 44 bytes?  Is this due to paravirt_fail?
> 
> No, just this construct
> 
>         unsigned int eax, ebx, ecx, edx;
>         cpuid(op, &eax, &ebx, &ecx, &edx);
> 
> is not really that cheap to set up. You need to allocate
> variables on stack and take address of each:
> 
> ffffffff81063668 <cpuid_eax>:
> ffffffff81063668:       55                      push   %rbp
> ffffffff81063669:       48 89 e5                mov    %rsp,%rbp
> ffffffff8106366c:       48 83 ec 10             sub    $0x10,%rsp
> ffffffff81063670:       48 8d 4d fc             lea    -0x4(%rbp),%rcx
> ffffffff81063674:       89 7d f0                mov    %edi,-0x10(%rbp)
> ffffffff81063677:       48 8d 55 f8             lea    -0x8(%rbp),%rdx
> ffffffff8106367b:       48 8d 75 f4             lea    -0xc(%rbp),%rsi
> ffffffff8106367f:       48 8d 7d f0             lea    -0x10(%rbp),%rdi
> ffffffff81063683:       c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
> ffffffff8106368a:       e8 3c ff ff ff          callq  ffffffff810635cb <__cpuid>
> ffffffff8106368f:       8b 45 f0                mov    -0x10(%rbp),%eax
> ffffffff81063692:       c9                      leaveq
> ffffffff81063693:       c3                      retq
> 

That almost certainly is due to paravirt_fail, because otherwise cpuid
would be inline, and gcc actually knows how to optimize around the cpuid
instruction to the point of eliminating the temporaries.

That being said, it would have been better to use a structure.

	-hpa


* Re: [PATCH] x86: Deinline cpuid_eax and friends
From: Denys Vlasenko @ 2015-05-07  8:57 UTC
  To: H. Peter Anvin, Ingo Molnar
  Cc: Steven Rostedt, Borislav Petkov, Andy Lutomirski,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

On 05/06/2015 10:41 PM, H. Peter Anvin wrote:
> On 05/06/2015 12:09 PM, Denys Vlasenko wrote:
>>>
>>> How on Earth does it make 44 bytes?  Is this due to paravirt_fail?
>>
>> No, just this construct
>>
>>         unsigned int eax, ebx, ecx, edx;
>>         cpuid(op, &eax, &ebx, &ecx, &edx);
>>
>> is not really that cheap to set up. You need to allocate
>> variables on stack and take address of each:
>>
>> ffffffff81063668 <cpuid_eax>:
>> ffffffff81063668:       55                      push   %rbp
>> ffffffff81063669:       48 89 e5                mov    %rsp,%rbp
>> ffffffff8106366c:       48 83 ec 10             sub    $0x10,%rsp
>> ffffffff81063670:       48 8d 4d fc             lea    -0x4(%rbp),%rcx
>> ffffffff81063674:       89 7d f0                mov    %edi,-0x10(%rbp)
>> ffffffff81063677:       48 8d 55 f8             lea    -0x8(%rbp),%rdx
>> ffffffff8106367b:       48 8d 75 f4             lea    -0xc(%rbp),%rsi
>> ffffffff8106367f:       48 8d 7d f0             lea    -0x10(%rbp),%rdi
>> ffffffff81063683:       c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
>> ffffffff8106368a:       e8 3c ff ff ff          callq  ffffffff810635cb <__cpuid>
>> ffffffff8106368f:       8b 45 f0                mov    -0x10(%rbp),%eax
>> ffffffff81063692:       c9                      leaveq
>> ffffffff81063693:       c3                      retq
>>
> 
> That almost certainly is due to paravirt_fail, because otherwise cpuid
> would be inline, and gcc actually knows how to optimize around the cpuid
> instruction to the point of eliminating the temporaries.

Yes, with HYPERVISOR_GUEST off cpuid_eax() is smaller:

ffffffff81055a66 <cpuid_eax>:
ffffffff81055a66:       55                      push   %rbp
ffffffff81055a67:       89 f8                   mov    %edi,%eax
ffffffff81055a69:       31 c9                   xor    %ecx,%ecx
ffffffff81055a6b:       48 89 e5                mov    %rsp,%rbp
ffffffff81055a6e:       53                      push   %rbx
ffffffff81055a6f:       0f a2                   cpuid
ffffffff81055a71:       5b                      pop    %rbx
ffffffff81055a72:       5d                      pop    %rbp
ffffffff81055a73:       c3                      retq

However, it is still not small enough for deinlining to make vmlinux grow:

    text     data      bss       dec     hex filename
81746530 13978160 20066304 115790994 6e6d492 vmlinux.before
81746509 13978160 20066304 115790973 6e6d47d vmlinux

To recap, with this patch:
Code is smaller both with and without HYPERVISOR_GUEST.
Slowdown per cpuid_REG() call is at worst 4%.


