* For your amusement: slightly faster syscalls
@ 2015-06-13  0:09 Andy Lutomirski
       [not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-06-13  0:09 UTC (permalink / raw)
  To: linux-kernel, X86 ML, Linus Torvalds, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov

The SYSCALL prologue starts with SWAPGS immediately followed by a
gs-prefixed instruction.  I think this causes a pipeline stall.
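
For reference, the prologue in question looks roughly like this (a sketch;
the exact labels in the entry code differ, but the shape is a SWAPGS
followed immediately by %gs-prefixed memory operands):

swapgs                          # swap in the kernel GS base
movq %rsp, %gs:rsp_scratch      # %gs-prefixed store issued right behind swapgs
movq %gs:cpu_sp0, %rsp          # another %gs-prefixed access, to load the kernel stack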

If we instead do:

mov %rsp, rsp_scratch(%rip)     # stash user RSP via a RIP-relative slot, no %gs prefix
mov sp0(%rip), %rsp             # load the kernel stack pointer, still no %gs
swapgs                          # now swap GS; no gs-prefixed access follows right away
...
pushq rsp_scratch(%rip)         # push the saved user RSP onto the kernel stack

then we avoid the stall and save about three cycles.

Horrible horrible code to do this lives here:

https://git.kernel.org/cgit/linux/kernel/git/luto/devel.git/log/?h=x86/faster_syscalls

Caveat emptor: it also disables SMP (the RIP-relative rsp_scratch slot is a
single global rather than a per-cpu variable, so this only works on one CPU).

For three cycles, I don't think this is worth trying to clean up.

--Andy


* Re: For your amusement: slightly faster syscalls
       [not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
@ 2015-06-15 21:42   ` H. Peter Anvin
  2015-06-15 21:51     ` Andy Lutomirski
  0 siblings, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2015-06-15 21:42 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel

On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> 
> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
>>
>> Caveat emptor: it also disables SMP.
> 
> OK, I don't think it's interesting in that form.
> 
> For small cpu counts, I guess we could have per-cpu syscall entry points
> (unless the syscall entry msr is shared across hyperthreading? Some
> msr's are per thread, others per core, AFAIK), and it could actually
> work that way.
> 
> But I'm not sure the three cycles is worth the worry and the complexity.
> 

We discussed the per-cpu syscall entry point, and the issue at hand is
that it is very hard to do that without, with fairly high probability,
touching another cache line and quite possibly another page (and hence a
TLB entry.)
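
Concretely, a per-cpu trampoline would look something like the sketch below
(all names hypothetical): each CPU gets its own copy with its own scratch
slot, but the jump into the shared body then touches another cache line and
possibly another page:

syscall_entry_cpu_N:                            # one copy per CPU
	movq	%rsp, user_rsp_slot_N(%rip)     # per-copy scratch slot, no %gs prefix
	movq	kernel_sp0_N(%rip), %rsp        # per-copy kernel stack pointer
	swapgs
	jmp	syscall_entry_common            # shared tail: the extra cache line / TLB entry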

	-hpa




* Re: For your amusement: slightly faster syscalls
  2015-06-15 21:42   ` H. Peter Anvin
@ 2015-06-15 21:51     ` Andy Lutomirski
  2015-06-18  8:01       ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-06-15 21:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel

On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/15/2015 02:30 PM, Linus Torvalds wrote:
>>
>> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
>>>
>>> Caveat emptor: it also disables SMP.
>>
>> OK, I don't think it's interesting in that form.
>>
>> For small cpu counts, I guess we could have per-cpu syscall entry points
>> (unless the syscall entry msr is shared across hyperthreading? Some
>> msr's are per thread, others per core, AFAIK), and it could actually
>> work that way.
>>
>> But I'm not sure the three cycles is worth the worry and the complexity.
>>
>
> We discussed the per-cpu syscall entry point, and the issue at hand is
> that it is very hard to do that without, with fairly high probability,
> touching another cache line and quite possibly another page (and hence a
> TLB entry.)

I think this isn't actually true.  If we were going to do a per-cpu
syscall entry point, then we might as well duplicate all of the entry
code per cpu instead of just a short trampoline.  That would avoid
extra TLB misses and (L1) cache misses, I think.
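
(For illustration, installing such per-cpu entry points would mean each CPU
pointing its SYSCALL target MSR at its own copy at boot, roughly like this -
the label is hypothetical:)

	movl	$0xc0000082, %ecx       # IA32_LSTAR: 64-bit SYSCALL entry point MSR
	leaq	syscall_entry_this_cpu(%rip), %rax
	movq	%rax, %rdx
	shrq	$32, %rdx               # wrmsr takes the value split across %edx:%eax
	wrmsr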

I still think this is far too complicated for three cycles.  I was
hoping for more.

--Andy


* Re: For your amusement: slightly faster syscalls
  2015-06-15 21:51     ` Andy Lutomirski
@ 2015-06-18  8:01       ` Ingo Molnar
  2015-06-18  8:48         ` Ingo Molnar
  2015-06-18  8:50         ` H. Peter Anvin
  0 siblings, 2 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18  8:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
	X86 ML, linux-kernel


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> >>
> >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> >>>
> >>> Caveat emptor: it also disables SMP.
> >>
> >> OK, I don't think it's interesting in that form.
> >>
> >> For small cpu counts, I guess we could have per-cpu syscall entry points 
> >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are 
> >> per thread, others per core, AFAIK), and it could actually work that way.
> >>
> >> But I'm not sure the three cycles is worth the worry and the complexity.
> >
> > We discussed the per-cpu syscall entry point, and the issue at hand is that it
> > is very hard to do that without, with fairly high probability, touching another
> > cache line and quite possibly another page (and hence a TLB entry.)

( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from 
  the surrounding discussion what this patch does, as my lkml folder is still 
  doing a long refresh ... )

> 
> I think this isn't actually true.  If we were going to do a per-cpu syscall 
> entry point, then we might as well duplicate all of the entry code per cpu 
> instead of just a short trampoline.  That would avoid extra TLB misses and (L1) 
> cache misses, I think.
> 
> I still think this is far too complicated for three cycles.  I was hoping for 
> more.

The other problem with duplicating the entry code is that per-CPU copies
split its cache footprint across the higher level caches (such as the L2
but also the L3 cache).
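
( Back-of-the-envelope, with made-up numbers: 64 CPUs times a few hundred
  bytes of entry code per copy is tens of kilobytes of duplicated hot text
  competing for L2/L3, versus a single shared copy today. )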

The interesting number to check would be cache-cold entry performance, not
the cache-hot case: the NUMA latency advantage of having per-node copies of
the entry code might be worth it.

... and that's why UP is the least interesting case ;-)

Thanks,

	Ingo


* Re: For your amusement: slightly faster syscalls
  2015-06-18  8:01       ` Ingo Molnar
@ 2015-06-18  8:48         ` Ingo Molnar
  2015-06-18  8:50         ` H. Peter Anvin
  1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18  8:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
	X86 ML, linux-kernel


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> > >>
> > >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> > >>>
> > >>> Caveat emptor: it also disables SMP.
> > >>
> > >> OK, I don't think it's interesting in that form.
> > >>
> > >> For small cpu counts, I guess we could have per-cpu syscall entry points 
> > >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are 
> > >> per thread, others per core, AFAIK), and it could actually work that way.
> > >>
> > >> But I'm not sure the three cycles is worth the worry and the complexity.
> > >
> > > We discussed the per-cpu syscall entry point, and the issue at hand is that it
> > > is very hard to do that without, with fairly high probability, touching another
> > > cache line and quite possibly another page (and hence a TLB entry.)
> 
> ( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from 
>   the surrounding discussion what this patch does, as my lkml folder is still 
>   doing a long refresh ... )

Hm, it's nowhere to be found. Could someone please forward me the original email?

Thanks,

	Ingo


* Re: For your amusement: slightly faster syscalls
  2015-06-18  8:01       ` Ingo Molnar
  2015-06-18  8:48         ` Ingo Molnar
@ 2015-06-18  8:50         ` H. Peter Anvin
  2015-06-18  9:06           ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2015-06-18  8:50 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel

Well... with UP we don't even need GS in the kernel...

On June 18, 2015 1:01:06 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> > >>
> > >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> > >>>
> > >>> Caveat emptor: it also disables SMP.
> > >>
> > >> OK, I don't think it's interesting in that form.
> > >>
> > >> For small cpu counts, I guess we could have per-cpu syscall entry points
> > >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are
> > >> per thread, others per core, AFAIK), and it could actually work that way.
> > >>
> > >> But I'm not sure the three cycles is worth the worry and the complexity.
> > >
> > > We discussed the per-cpu syscall entry point, and the issue at hand is that it
> > > is very hard to do that without, with fairly high probability, touching another
> > > cache line and quite possibly another page (and hence a TLB entry.)
>
> ( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from
>   the surrounding discussion what this patch does, as my lkml folder is still
>   doing a long refresh ... )
>
> >
> > I think this isn't actually true.  If we were going to do a per-cpu syscall
> > entry point, then we might as well duplicate all of the entry code per cpu
> > instead of just a short trampoline.  That would avoid extra TLB misses and (L1)
> > cache misses, I think.
> >
> > I still think this is far too complicated for three cycles.  I was hoping for
> > more.
>
> The other problem with duplicating the entry code is that per-CPU copies
> split its cache footprint across the higher level caches (such as the L2
> but also the L3 cache).
>
> The interesting number to check would be cache-cold entry performance, not
> the cache-hot case: the NUMA latency advantage of having per-node copies of
> the entry code might be worth it.
>
> ... and that's why UP is the least interesting case ;-)
>
> Thanks,
>
> 	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: For your amusement: slightly faster syscalls
  2015-06-18  8:50         ` H. Peter Anvin
@ 2015-06-18  9:06           ` Ingo Molnar
  0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18  9:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
	X86 ML, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> Well... with UP we don't even need GS in the kernel...

Yeah, but it was just a simple demo, to see how much of a speedup the GS access
reordering gives - and in that sense it was good enough to show that the speedup
is 3 cycles.

(Got the original email forwarded meanwhile.)

Thanks,

	Ingo



Thread overview: 7+ messages
2015-06-13  0:09 For your amusement: slightly faster syscalls Andy Lutomirski
     [not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
2015-06-15 21:42   ` H. Peter Anvin
2015-06-15 21:51     ` Andy Lutomirski
2015-06-18  8:01       ` Ingo Molnar
2015-06-18  8:48         ` Ingo Molnar
2015-06-18  8:50         ` H. Peter Anvin
2015-06-18  9:06           ` Ingo Molnar
