All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
@ 2015-07-05 20:08 Andy Lutomirski
  2015-07-06 13:44 ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2015-07-05 20:08 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Andy Lutomirski, Linus Torvalds, Jan Kara, Ingo Molnar, Borislav Petkov

As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
That means that cc-option will not detect
-mpreferred-stack-boundary=3 support, because we test for it before
setting -mno-sse.

Fix it by reordering the Makefile bits.

Compile-tested only.  This should help avoid code generation issues
such as the one that was worked around in b96fecbfa8c8 ("x86/fpu:
Fix boot crash in the early FPU code").

I'm a bit concerned that we could still have problems on older GCC
versions given that our asm code does not respect GCC's idea of the
ABI-required stack alignment.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Makefile | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 118e6debc483..344dd2110b2a 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -39,6 +39,12 @@ ifdef CONFIG_X86_NEED_RELOCS
         LDFLAGS_vmlinux := --emit-relocs
 endif
 
+# prevent gcc from generating any FP code by mistake
+# This must be before we try -mpreferred-stack-boundary -- see
+# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
+KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
+KBUILD_CFLAGS += $(call cc-option,-mno-avx,)
+
 ifeq ($(CONFIG_X86_32),y)
         BITS := 32
         UTS_MACHINE := i386
@@ -167,9 +173,6 @@ KBUILD_CFLAGS += -pipe
 KBUILD_CFLAGS += -Wno-sign-compare
 #
 KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
-# prevent gcc from generating any FP code by mistake
-KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
-KBUILD_CFLAGS += $(call cc-option,-mno-avx,)
 
 KBUILD_CFLAGS += $(mflags-y)
 KBUILD_AFLAGS += $(mflags-y)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-05 20:08 [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support Andy Lutomirski
@ 2015-07-06 13:44 ` Ingo Molnar
  2015-07-06 16:59   ` Andy Lutomirski
  2015-07-06 17:10   ` Linus Torvalds
  0 siblings, 2 replies; 8+ messages in thread
From: Ingo Molnar @ 2015-07-06 13:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Linus Torvalds, Jan Kara, Borislav Petkov,
	Denys Vlasenko


* Andy Lutomirski <luto@kernel.org> wrote:

> As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
> allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
> That means that cc-option will not detect
> -mpreferred-stack-boundary=3 support, because we test for it before
> setting -mno-sse.
> 
> Fix it by reordering the Makefile bits.
> 
> Compile-tested only.  This should help avoid code generation issues
> such as the one that was worked around in b96fecbfa8c8 ("x86/fpu:
> Fix boot crash in the early FPU code").
> 
> I'm a bit concerned that we could still have problems on older GCC
> versions given that our asm code does not respect GCC's idea of the
> ABI-required stack alignment.
> 
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/Makefile | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 118e6debc483..344dd2110b2a 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -39,6 +39,12 @@ ifdef CONFIG_X86_NEED_RELOCS
>          LDFLAGS_vmlinux := --emit-relocs
>  endif
>  
> +# prevent gcc from generating any FP code by mistake
> +# This must be before we try -mpreferred-stack-boundary -- see
> +# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
> +KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
> +KBUILD_CFLAGS += $(call cc-option,-mno-avx,)
> +

So the 'stack boundary' is the RSP that GCC generates before it calls another 
function from within an existing function, right?

So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not 
do -mpreferred-stack-boundary=2?

My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of 
64-bit variables, only if they cross 64-byte cache lines, which should be rare 
with a chance of 1:16. This small penalty (of at most +1 cycle in some 
circumstances IIRC) should be more than counterbalanced by the compression of the 
stack by 5% on average.

... using stack-boundary=1 or stack-boundary=0 would probably be 
counterproductive, as these more exotic misalignments get treated progressively 
worse by x86 CPUs.

... but I have not measured any of this and even the 5% is just a possibly 
overoptimistic guess.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 13:44 ` Ingo Molnar
@ 2015-07-06 16:59   ` Andy Lutomirski
  2015-07-06 17:40     ` Ingo Molnar
  2015-07-06 17:10   ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2015-07-06 16:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Linus Torvalds, Jan Kara,
	Borislav Petkov, Denys Vlasenko

On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
>> allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
>> That means that cc-option will not detect
>> -mpreferred-stack-boundary=3 support, because we test for it before
>> setting -mno-sse.
>>
>> Fix it by reordering the Makefile bits.

...

>
> So the 'stack boundary' is the RSP that GCC generates before it calls another
> function from within an existing function, right?
>

I think so.  Certainly the "incoming stack boundary" (which is exactly
the same as the preferred stack boundary unless explicitly changed) is
the RSP alignment that GCC expects on entry.

> So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not
> do -mpreferred-stack-boundary=2?
>

Easy answer: we can't:

$ gcc -c -mno-sse -mpreferred-stack-boundary=2 empty.c
empty.c:1:0: error: -mpreferred-stack-boundary=2 is not between 3 and 12

> My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
> 64-bit variables, only if they cross 64-byte cache lines, which should be rare
> with a chance of 1:16. This small penalty (of at most +1 cycle in some
> circumstances IIRC) should be more than counterbalanced by the compression of the
> stack by 5% on average.
>

I'll counter with: what's the benefit?  There are no operations that
will naturally change RSP by anything that isn't a multiple of 8
(there's no pushl in 64-bit mode, or at least not on AMD chips -- the
Intel manual is a bit vague on this point), so we'll end up with RSP
being a multiple of 8 regardless.  Even if we somehow shaved 4 bytes
off in asm, that still wouldn't buy us anything, as a dangling 4 bytes
at the bottom of the stack isn't useful for anything.

--Andy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 13:44 ` Ingo Molnar
  2015-07-06 16:59   ` Andy Lutomirski
@ 2015-07-06 17:10   ` Linus Torvalds
  2015-07-06 17:32     ` Ingo Molnar
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2015-07-06 17:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Jan Kara, Borislav Petkov,
	Denys Vlasenko

On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not
> do -mpreferred-stack-boundary=2?

It wouldn't make sense anyway - it would only make code worse (if it
worked) and not any better.

The reason the "=3" value is good is because 8-byte alignment is the
"natural" alignment - it's what you get with a normal call sequence,
simply because the return address is 8 bytes in size.

That means that with "=3" you don't get extra code to align the stack
for the simple functions that don't need a frame.

Anything smaller than 3 wouldn't help even if it worked, because none
of the normal stack operations (pushing/popping registers to
save/restore them) would be any smaller anyway.

But bigger values than 3 result in the compiler having to generate
extra stack adjustments just to align the stack after a call that very
naturally mis-aligned it. And it doesn't help anyway, since in the
kernel we don't put stuff on the stack that needs bigger alignment
(of, the fxsave buffer is a counter-example, but it's a very odd one
that we _shouldn't_ have put on the stack).

                Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 17:10   ` Linus Torvalds
@ 2015-07-06 17:32     ` Ingo Molnar
  0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2015-07-06 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Jan Kara, Borislav Petkov,
	Denys Vlasenko


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So looking at this I question the choice of -mpreferred-stack-boundary=3. Why 
> > not do -mpreferred-stack-boundary=2?
> 
> It wouldn't make sense anyway - it would only make code worse (if it worked) and 
> not any better.
> 
> The reason the "=3" value is good is because 8-byte alignment is the "natural" 
> alignment - it's what you get with a normal call sequence, simply because the 
> return address is 8 bytes in size.
> 
> That means that with "=3" you don't get extra code to align the stack for the 
> simple functions that don't need a frame.
> 
> Anything smaller than 3 wouldn't help even if it worked, because none of the 
> normal stack operations (pushing/popping registers to save/restore them) would 
> be any smaller anyway.
> 
> But bigger values than 3 result in the compiler having to generate extra stack 
> adjustments just to align the stack after a call that very naturally mis-aligned 
> it. And it doesn't help anyway, since in the kernel we don't put stuff on the 
> stack that needs bigger alignment (of, the fxsave buffer is a counter-example, 
> but it's a very odd one that we _shouldn't_ have put on the stack).

Ok, so it's all moot, but my (quite possibly flawed) thinking was that for deeper 
call chains, using 4 byte RSP alignment (as opposed to 8 bytes) would allow, in 
about 50% of the cases, the stack frame to be narrower by 4 bytes. (depending on 
whether the 'natural' stack boundary is properly aligned to 8 bytes or not.)

For a 10 deep call chain that's a 20 bytes more compact stack on average 
(10*4*0.5), resulting in a tiny bit denser D$.

My assumptions were:

 - no extra code is generated by GCC. (If it causes any extra code to be generated
   then it's an obvious loss.)

 - mis-aligning an 8 byte variable by 4 bytes is being handled quite well by most
   x86 uarchs, without penalty in most cases.

But ... it's all moot and even in the best case if both my assumptions are fully 
met (which is not a given), the advantages are pretty marginal, so consider the 
idea dead by multiple mortal wounds.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 16:59   ` Andy Lutomirski
@ 2015-07-06 17:40     ` Ingo Molnar
  2015-07-06 17:59       ` Andy Lutomirski
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2015-07-06 17:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Linus Torvalds, Jan Kara,
	Borislav Petkov, Denys Vlasenko


* Andy Lutomirski <luto@amacapital.net> wrote:

> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of 
> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare 
> > with a chance of 1:16. This small penalty (of at most +1 cycle in some 
> > circumstances IIRC) should be more than counterbalanced by the compression of 
> > the stack by 5% on average.
> 
> I'll counter with: what's the benefit?  There are no operations that will 
> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in 
> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on 
> this point), so we'll end up with RSP being a multiple of 8 regardless.  Even if 
> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a 
> dangling 4 bytes at the bottom of the stack isn't useful for anything.

Yeah, so it might be utilized in frame-pointer less builds (which we might be able 
to utilize in the future if sane Dwarf code comes around), which does not use 
push/pop to manage the stack but often has patterns like:

ffffffff8102aa90 <SyS_getpriority>:
ffffffff8102aa90:       48 83 ec 18             sub    $0x18,%rsp

and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte 
granular as well.

But yeah ... it's pretty marginal.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 17:40     ` Ingo Molnar
@ 2015-07-06 17:59       ` Andy Lutomirski
  2015-07-07  4:01         ` Raymond Jennings
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2015-07-06 17:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Linus Torvalds, Jan Kara,
	Borislav Petkov, Denys Vlasenko

On Mon, Jul 6, 2015 at 10:40 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
>> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare
>> > with a chance of 1:16. This small penalty (of at most +1 cycle in some
>> > circumstances IIRC) should be more than counterbalanced by the compression of
>> > the stack by 5% on average.
>>
>> I'll counter with: what's the benefit?  There are no operations that will
>> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in
>> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on
>> this point), so we'll end up with RSP being a multiple of 8 regardless.  Even if
>> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a
>> dangling 4 bytes at the bottom of the stack isn't useful for anything.
>
> Yeah, so it might be utilized in frame-pointer less builds (which we might be able
> to utilize in the future if sane Dwarf code comes around), which does not use
> push/pop to manage the stack but often has patterns like:
>
> ffffffff8102aa90 <SyS_getpriority>:
> ffffffff8102aa90:       48 83 ec 18             sub    $0x18,%rsp
>
> and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte
> granular as well.
>
> But yeah ... it's pretty marginal.

To get even that, we'd need an additional ABI-changing GCC flag to
change GCC's idea of the alignment of long from 8 to 4.  (I just
checked: g++ thinks that alignof(long) == 8.  I was too lazy to look
up how to ask the equivalent question in C.)

--Andy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support
  2015-07-06 17:59       ` Andy Lutomirski
@ 2015-07-07  4:01         ` Raymond Jennings
  0 siblings, 0 replies; 8+ messages in thread
From: Raymond Jennings @ 2015-07-07  4:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, linux-kernel, X86 ML,
	Linus Torvalds, Jan Kara, Borislav Petkov, Denys Vlasenko

On Mon, 2015-07-06 at 10:59 -0700, Andy Lutomirski wrote:
> On Mon, Jul 6, 2015 at 10:40 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
> >> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare
> >> > with a chance of 1:16. This small penalty (of at most +1 cycle in some
> >> > circumstances IIRC) should be more than counterbalanced by the compression of
> >> > the stack by 5% on average.
> >>
> >> I'll counter with: what's the benefit?  There are no operations that will
> >> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in
> >> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on
> >> this point), so we'll end up with RSP being a multiple of 8 regardless.  Even if
> >> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a
> >> dangling 4 bytes at the bottom of the stack isn't useful for anything.
> >
> > Yeah, so it might be utilized in frame-pointer less builds (which we might be able
> > to utilize in the future if sane Dwarf code comes around), which does not use
> > push/pop to manage the stack but often has patterns like:
> >
> > ffffffff8102aa90 <SyS_getpriority>:
> > ffffffff8102aa90:       48 83 ec 18             sub    $0x18,%rsp
> >
> > and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte
> > granular as well.
> >
> > But yeah ... it's pretty marginal.
> 
> To get even that, we'd need an additional ABI-changing GCC flag to
> change GCC's idea of the alignment of long from 8 to 4.  (I just
> checked: g++ thinks that alignof(long) == 8.  I was too lazy to look
> up how to ask the equivalent question in C.)

I just want to point out that long itself is 8 bytes on 64-bit x86, but
only 4 bytes on 32-bit x86.

Perhaps we should keep in mind sizeof(long) and not just alignof(long)?

My opinion btw, is that if long is 8 bytes wide, it should also be 8
bytes aligned.

> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-07-07  4:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-05 20:08 [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support Andy Lutomirski
2015-07-06 13:44 ` Ingo Molnar
2015-07-06 16:59   ` Andy Lutomirski
2015-07-06 17:40     ` Ingo Molnar
2015-07-06 17:59       ` Andy Lutomirski
2015-07-07  4:01         ` Raymond Jennings
2015-07-06 17:10   ` Linus Torvalds
2015-07-06 17:32     ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.