* For your amusement: slightly faster syscalls
@ 2015-06-13 0:09 Andy Lutomirski
[not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-06-13 0:09 UTC (permalink / raw)
To: linux-kernel, X86 ML, Linus Torvalds, H. Peter Anvin,
Denys Vlasenko, Borislav Petkov
The SYSCALL prologue starts with SWAPGS immediately followed by a
gs-prefixed instruction. I think this causes a pipeline stall.
If we instead do:
mov %rsp, rsp_scratch(%rip)
mov sp0(%rip), %rsp
swapgs
...
pushq rsp_scratch(%rip)
then we avoid the stall and save about three cycles.
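A rough sketch of the two orderings, for illustration only (the symbol and
macro names below follow the snippet above and only approximate the real
entry_64.S code of that era; they are not the exact kernel source):

```asm
/* Current prologue: the first movq carries a %gs segment override
 * immediately after SWAPGS, which is the suspected cause of the
 * pipeline stall. */
swapgs
movq	%rsp, PER_CPU_VAR(rsp_scratch)		/* %gs-prefixed access */
movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

/* Reordered prologue: the stack switch goes through plain %rip-relative
 * symbols (only correct with a single copy of the variables, hence
 * "disables SMP"), and SWAPGS is deferred so that no %gs-prefixed
 * instruction follows it directly. */
movq	%rsp, rsp_scratch(%rip)
movq	sp0(%rip), %rsp
swapgs
/* ... rest of the prologue ... */
pushq	rsp_scratch(%rip)	/* saved user RSP onto the new stack */
```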
Horrible horrible code to do this lives here:
https://git.kernel.org/cgit/linux/kernel/git/luto/devel.git/log/?h=x86/faster_syscalls
Caveat emptor: it also disables SMP.
For three cycles, I don't think this is worth trying to clean up.
--Andy
* Re: For your amusement: slightly faster syscalls
[not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
@ 2015-06-15 21:42 ` H. Peter Anvin
2015-06-15 21:51 ` Andy Lutomirski
0 siblings, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2015-06-15 21:42 UTC (permalink / raw)
To: Linus Torvalds, Andy Lutomirski
Cc: Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel
On 06/15/2015 02:30 PM, Linus Torvalds wrote:
>
> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
>>
>> Caveat emptor: it also disables SMP.
>
> OK, I don't think it's interesting in that form.
>
> For small cpu counts, I guess we could have per-cpu syscall entry points
> (unless the syscall entry msr is shared across hyperthreading? Some
> msr's are per thread, others per core, AFAIK), and it could actually
> work that way.
>
> But I'm not sure the three cycles is worth the worry and the complexity.
>
We discussed the per-cpu syscall entry point, and the issue at hand is
that it is very hard to do that without, with fairly high probability,
touching another cache line and quite possibly another page (and hence
a TLB entry.)
-hpa
* Re: For your amusement: slightly faster syscalls
2015-06-15 21:42 ` H. Peter Anvin
@ 2015-06-15 21:51 ` Andy Lutomirski
2015-06-18 8:01 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-06-15 21:51 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel
On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/15/2015 02:30 PM, Linus Torvalds wrote:
>>
>> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
>>>
>>> Caveat emptor: it also disables SMP.
>>
>> OK, I don't think it's interesting in that form.
>>
>> For small cpu counts, I guess we could have per-cpu syscall entry points
>> (unless the syscall entry msr is shared across hyperthreading? Some
>> msr's are per thread, others per core, AFAIK), and it could actually
>> work that way.
>>
>> But I'm not sure the three cycles is worth the worry and the complexity.
>>
>
> We discussed the per-cpu syscall entry point, and the issue at hand is
> that it is very hard to do that without, with fairly high probability,
> touching another cache line and quite possibly another page (and hence
> a TLB entry.)
I think this isn't actually true. If we were going to do a per-cpu
syscall entry point, then we might as well duplicate all of the entry
code per cpu instead of just a short trampoline. That would avoid
extra TLB misses and (L1) cache misses, I think.
I still think this is far too complicated for three cycles. I was
hoping for more.
--Andy
* Re: For your amusement: slightly faster syscalls
2015-06-15 21:51 ` Andy Lutomirski
@ 2015-06-18 8:01 ` Ingo Molnar
2015-06-18 8:48 ` Ingo Molnar
2015-06-18 8:50 ` H. Peter Anvin
0 siblings, 2 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18 8:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: H. Peter Anvin, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
X86 ML, linux-kernel
* Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> >>
> >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> >>>
> >>> Caveat emptor: it also disables SMP.
> >>
> >> OK, I don't think it's interesting in that form.
> >>
> >> For small cpu counts, I guess we could have per-cpu syscall entry points
> >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are
> >> per thread, others per core, AFAIK), and it could actually work that way.
> >>
> >> But I'm not sure the three cycles is worth the worry and the complexity.
> >
> > We discussed the per-cpu syscall entry point, and the issue at hand is that it
> > is very hard to do that without, with fairly high probability, touching another
> > cache line and quite possibly another page (and hence a TLB entry.)
( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from
the surrounding discussion what this patch does, as my lkml folder is still
doing a long refresh ... )
>
> I think this isn't actually true. If we were going to do a per-cpu syscall
> entry point, then we might as well duplicate all of the entry code per cpu
> instead of just a short trampoline. That would avoid extra TLB misses and (L1)
> cache misses, I think.
>
> I still think this is far too complicated for three cycles. I was hoping for
> more.
The other problem with duplicating entry code is that with per CPU entry code we
split its cache footprint in higher level caches (such as the L2 but also L3
cache).
The interesting number would be to check cache cold entry performance, not cache
hot one: the NUMA latency advantage of having per node copies of the entry code
might be worth it.
... and that's why UP is the least interesting case ;-)
Thanks,
Ingo
* Re: For your amusement: slightly faster syscalls
2015-06-18 8:01 ` Ingo Molnar
@ 2015-06-18 8:48 ` Ingo Molnar
2015-06-18 8:50 ` H. Peter Anvin
1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18 8:48 UTC (permalink / raw)
To: Andy Lutomirski
Cc: H. Peter Anvin, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
X86 ML, linux-kernel
* Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
> > >>
> > >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> > >>>
> > >>> Caveat emptor: it also disables SMP.
> > >>
> > >> OK, I don't think it's interesting in that form.
> > >>
> > >> For small cpu counts, I guess we could have per-cpu syscall entry points
> > >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are
> > >> per thread, others per core, AFAIK), and it could actually work that way.
> > >>
> > >> But I'm not sure the three cycles is worth the worry and the complexity.
> > >
> > > We discussed the per-cpu syscall entry point, and the issue at hand is that it
> > > is very hard to do that without, with fairly high probability, touching another
> > > cache line and quite possibly another page (and hence a TLB entry.)
>
> ( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from
> the surrounding discussion what this patch does, as my lkml folder is still
> doing a long refresh ... )
Hm, it's nowhere to be found. Could someone please forward me the original email?
Thanks,
Ingo
* Re: For your amusement: slightly faster syscalls
2015-06-18 8:01 ` Ingo Molnar
2015-06-18 8:48 ` Ingo Molnar
@ 2015-06-18 8:50 ` H. Peter Anvin
2015-06-18 9:06 ` Ingo Molnar
1 sibling, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2015-06-18 8:50 UTC (permalink / raw)
To: Ingo Molnar, Andy Lutomirski
Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, X86 ML, linux-kernel
Well... with UP we don't even need GS in the kernel...
On June 18, 2015 1:01:06 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Mon, Jun 15, 2015 at 2:42 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> > On 06/15/2015 02:30 PM, Linus Torvalds wrote:
>> >>
>> >> On Jun 12, 2015 2:09 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
>> >>>
>> >>> Caveat emptor: it also disables SMP.
>> >>
>> >> OK, I don't think it's interesting in that form.
>> >>
>> >> For small cpu counts, I guess we could have per-cpu syscall entry points
>> >> (unless the syscall entry msr is shared across hyperthreading? Some msr's are
>> >> per thread, others per core, AFAIK), and it could actually work that way.
>> >>
>> >> But I'm not sure the three cycles is worth the worry and the complexity.
>> >
>> > We discussed the per-cpu syscall entry point, and the issue at hand is that it
>> > is very hard to do that without, with fairly high probability, touching another
>> > cache line and quite possibly another page (and hence a TLB entry.)
>
>( So apparently I wasn't Cc:ed, or gmail ate the mail - so I can only guess from
>the surrounding discussion what this patch does, as my lkml folder is still
>doing a long refresh ... )
>
>>
>> I think this isn't actually true. If we were going to do a per-cpu syscall
>> entry point, then we might as well duplicate all of the entry code per cpu
>> instead of just a short trampoline. That would avoid extra TLB misses and (L1)
>> cache misses, I think.
>>
>> I still think this is far too complicated for three cycles. I was hoping for
>> more.
>
>The other problem with duplicating entry code is that with per CPU entry code we
>split its cache footprint in higher level caches (such as the L2 but also L3
>cache).
>
>The interesting number would be to check cache cold entry performance, not cache
>hot one: the NUMA latency advantage of having per node copies of the entry code
>might be worth it.
>
>... and that's why UP is the least interesting case ;-)
>
>Thanks,
>
> Ingo
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
* Re: For your amusement: slightly faster syscalls
2015-06-18 8:50 ` H. Peter Anvin
@ 2015-06-18 9:06 ` Ingo Molnar
0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2015-06-18 9:06 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Andy Lutomirski, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
X86 ML, linux-kernel
* H. Peter Anvin <hpa@zytor.com> wrote:
> Well... with UP we don't even need GS in the kernel...
Yeah, but it was just a simple demo, to see how much of a speedup the GS access
reordering gives, so in that sense it was good enough to show that the speedup
is 3 cycles.
(Got the original email forwarded meanwhile.)
Thanks,
Ingo
end of thread, other threads:[~2015-06-18 9:06 UTC | newest]
Thread overview: 7+ messages
-- links below jump to the message on this page --
2015-06-13 0:09 For your amusement: slightly faster syscalls Andy Lutomirski
[not found] ` <CA+55aFzMCeDc5rw8bj3Zyimbi7C1RcU15TeiMA6jOMfnd+3B=Q@mail.gmail.com>
2015-06-15 21:42 ` H. Peter Anvin
2015-06-15 21:51 ` Andy Lutomirski
2015-06-18 8:01 ` Ingo Molnar
2015-06-18 8:48 ` Ingo Molnar
2015-06-18 8:50 ` H. Peter Anvin
2015-06-18 9:06 ` Ingo Molnar