* [RFC] syscall calling convention, stts/clts, and xstate latency
@ 2011-07-24 21:07 Andrew Lutomirski
  2011-07-24 21:15 ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-24 21:07 UTC (permalink / raw)
  To: linux-kernel, x86

I was trying to understand the FPU/xstate saving code, and I ran some
benchmarks with surprising results.  These are all on Sandy Bridge
i7-2600.  Please take all numbers with a grain of salt -- they're in
tight-ish loops and don't really take into account real-world cache
effects.

A clts/stts pair takes about 80 ns.  Accessing extended state from
userspace with TS set takes 239 ns.  A kernel_fpu_begin /
kernel_fpu_end pair with no userspace xstate access takes 80 ns
(presumably 79 of those 80 are the clts/stts).  (Note: The numbers in
this paragraph were measured using a hacked-up kernel and KVM.)

With nonzero ymm state, xsave + clflush (on the first cacheline of
xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns, xsaveopt
(with unchanged state) = 16 ns, and xrstor = 40 ns.

With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 ns
and xsaveopt saves another 5 ns.

Zeroing the state completely with vzeroall adds 2 ns.  Not sure what's going on.
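
For reference, the userspace side of these measurements boils down to a
timing loop along the following lines -- a rough, untested sketch rather
than the actual test program; the buffer size and the component mask are
illustrative:

#include <stdint.h>
#include <stdio.h>

/* Request every enabled xstate component (edx:eax mask = ~0). */
#define XSTATE_MASK_LO  0xffffffffu
#define XSTATE_MASK_HI  0xffffffffu

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        /* 4096 bytes is comfortably larger than the Sandy Bridge xstate
           area; xsave/xrstor require 64-byte alignment. */
        static unsigned char area[4096] __attribute__((aligned(64)));
        const int iters = 1000000;
        uint64_t start, end;
        int i;

        start = rdtsc();
        for (i = 0; i < iters; i++) {
                asm volatile("xsave %0"
                             : "+m" (area)
                             : "a" (XSTATE_MASK_LO), "d" (XSTATE_MASK_HI));
                asm volatile("xrstor %0"
                             : : "m" (area),
                                 "a" (XSTATE_MASK_LO), "d" (XSTATE_MASK_HI));
        }
        end = rdtsc();

        printf("%.1f cycles per xsave+xrstor pair\n",
               (double)(end - start) / iters);
        return 0;
}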

All of this makes me think that, at least on Sandy Bridge, lazy xstate
saving is a bad optimization -- if the cache is being nice,
save/restore is faster than twiddling the TS bit.  And the cost of the
trap when TS is set blows everything else away.


Which brings me to another question: what do you think about declaring
some of the extended state to be clobbered by syscall?  Ideally, we'd
treat syscall like a regular function and clobber everything except
the floating point control word and mxcsr.  More conservatively, we'd
leave xmm and x87 state but clobber ymm.  This would let us keep the
cost of the state save and restore down when kernel_fpu_begin is used
in a syscall path and when a context switch happens as a result of a
syscall.

glibc does *not* mark the xmm registers as clobbered when it issues
syscalls, but I suspect that everything everywhere that issues
syscalls does it from a function, and functions are implicitly assumed
to clobber extended state.  (And if anything out there assumes that
ymm state is preserved, I'd be amazed.)
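
(To illustrate the caller's side of that implicit contract, here is a
hypothetical raw wrapper -- not glibc's code.  The clobber list names only
what the syscall instruction itself destroys, so the compiler is free to
keep xmm/ymm values live across the kernel entry:)

static inline long raw_syscall1(long nr, long arg1)
{
        long ret;

        asm volatile("syscall"
                     : "=a" (ret)
                     : "0" (nr), "D" (arg1)
                     : "rcx", "r11", "memory");
        return ret;
}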


--Andy

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 21:07 [RFC] syscall calling convention, stts/clts, and xstate latency Andrew Lutomirski
@ 2011-07-24 21:15 ` Ingo Molnar
  2011-07-24 22:34   ` Andrew Lutomirski
  2011-07-25  7:42   ` Avi Kivity
  0 siblings, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-24 21:15 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity


* Andrew Lutomirski <luto@mit.edu> wrote:

> I was trying to understand the FPU/xstate saving code, and I ran 
> some benchmarks with surprising results.  These are all on Sandy 
> Bridge i7-2600.  Please take all numbers with a grain of salt -- 
> they're in tight-ish loops and don't really take into account 
> real-world cache effects.
> 
> A clts/stts pair takes about 80 ns.  Accessing extended state from 
> userspace with TS set takes 239 ns.  A kernel_fpu_begin / 
> kernel_fpu_end pair with no userspace xstate access takes 80 ns 
> (presumably 79 of those 80 are the clts/stts).  (Note: The numbers 
> in this paragraph were measured using a hacked-up kernel and KVM.)
> 
> With nonzero ymm state, xsave + clflush (on the first cacheline of 
> xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns, 
> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
> 
> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 
> ns and xsaveopt saves another 5 ns.
> 
> Zeroing the state completely with vzeroall adds 2 ns.  Not sure 
> what's going on.
> 
> All of this makes me think that, at least on Sandy Bridge, lazy 
> xstate saving is a bad optimization -- if the cache is being nice, 
> save/restore is faster than twiddling the TS bit.  And the cost of 
> the trap when TS is set blows everything else away.

Interesting. Mind cooking up a delazying patch and measure it on 
native as well? KVM generally makes exceptions more expensive, so the 
effect of lazy exceptions might be less on native.

> 
> Which brings me to another question: what do you think about 
> declaring some of the extended state to be clobbered by syscall?  
> Ideally, we'd treat syscall like a regular function and clobber 
> everything except the floating point control word and mxcsr.  More 
> conservatively, we'd leave xmm and x87 state but clobber ymm.  This 
> would let us keep the cost of the state save and restore down when 
> kernel_fpu_begin is used in a syscall path and when a context 
> switch happens as a result of a syscall.
> 
> glibc does *not* mark the xmm registers as clobbered when it issues 
> syscalls, but I suspect that everything everywhere that issues 
> syscalls does it from a function, and functions are implicitly 
> assumed to clobber extended state.  (And if anything out there 
> assumes that ymm state is preserved, I'd be amazed.)

To build the kernel with sse optimizations? Would certainly be 
interesting to try.

Thanks,

	Ingo

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 21:15 ` Ingo Molnar
@ 2011-07-24 22:34   ` Andrew Lutomirski
  2011-07-25  3:21     ` Andrew Lutomirski
  2011-07-25  6:38     ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
  2011-07-25  7:42   ` Avi Kivity
  1 sibling, 2 replies; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-24 22:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Lutomirski <luto@mit.edu> wrote:
>
>> I was trying to understand the FPU/xstate saving code, and I ran
>> some benchmarks with surprising results.  These are all on Sandy
>> Bridge i7-2600.  Please take all numbers with a grain of salt --
>> they're in tight-ish loops and don't really take into account
>> real-world cache effects.
>>
>> A clts/stts pair takes about 80 ns.  Accessing extended state from
>> userspace with TS set takes 239 ns.  A kernel_fpu_begin /
>> kernel_fpu_end pair with no userspace xstate access takes 80 ns
>> (presumably 79 of those 80 are the clts/stts).  (Note: The numbers
>> in this paragraph were measured using a hacked-up kernel and KVM.)
>>
>> With nonzero ymm state, xsave + clflush (on the first cacheline of
>> xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns,
>> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
>>
>> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
>> ns and xsaveopt saves another 5 ns.
>>
>> Zeroing the state completely with vzeroall adds 2 ns.  Not sure
>> what's going on.
>>
>> All of this makes me think that, at least on Sandy Bridge, lazy
>> xstate saving is a bad optimization -- if the cache is being nice,
>> save/restore is faster than twiddling the TS bit.  And the cost of
>> the trap when TS is set blows everything else away.
>
> Interesting. Mind cooking up a delazying patch and measure it on
> native as well? KVM generally makes exceptions more expensive, so the
> effect of lazy exceptions might be less on native.

Using the same patch on native, I get:

kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
stts/clts: 73 ns (clearly there's a bit of error here)
userspace xstate with TS set: 229 ns

So virtualization adds only a little bit of overhead.

This isn't really a delazying patch -- it's two arch_prctls, one of
them is kernel_fpu_begin();kernel_fpu_end().  The other is the same
thing in a loop.
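
(Concretely, the hack amounts to something like the helper below, wired
into do_arch_prctl under made-up subcommands -- one subcommand calls it
with iters = 1, the other with a large count, and the timing happens
around the prctl in userspace:)

/* Hypothetical benchmarking hook -- not the actual patch. */
static long bench_kernel_fpu(unsigned long iters)
{
        unsigned long i;

        for (i = 0; i < iters; i++) {
                kernel_fpu_begin();     /* save live user xstate, or clts() */
                kernel_fpu_end();       /* stts() */
        }
        return 0;
}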

The other numbers were already native since I measured them entirely
in userspace.  They look the same after rebooting.

>
>>
>> Which brings me to another question: what do you think about
>> declaring some of the extended state to be clobbered by syscall?
>> Ideally, we'd treat syscall like a regular function and clobber
>> everything except the floating point control word and mxcsr.  More
>> conservatively, we'd leave xmm and x87 state but clobber ymm.  This
>> would let us keep the cost of the state save and restore down when
>> kernel_fpu_begin is used in a syscall path and when a context
>> switch happens as a result of a syscall.
>>
>> glibc does *not* mark the xmm registers as clobbered when it issues
>> syscalls, but I suspect that everything everywhere that issues
>> syscalls does it from a function, and functions are implicitly
>> assumed to clobber extended state.  (And if anything out there
>> assumes that ymm state is preserved, I'd be amazed.)
>
> To build the kernel with sse optimizations? Would certainly be
> interesting to try.

I had in mind something a little less ambitious: making
kernel_fpu_begin very fast, especially when used more than once.
Currently it's slow enough to have spawned arch/x86/crypto/fpu.c,
which is a hideous piece of infrastructure that exists solely to
reduce the number of kernel_fpu_begin/end pairs when using AES-NI.
Clobbering registers in syscall would reduce the cost even more, but
it might require having a way to detect whether the most recent kernel
entry was via syscall or some other means.
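
(For reference, the usage pattern at issue looks roughly like this -- not
the actual aesni-intel glue code, just its shape:)

static void ecb_encrypt_blocks(struct crypto_aes_ctx *ctx, u8 *dst,
                               const u8 *src, unsigned int nblocks)
{
        /* One begin/end pair per batch of blocks.  Without batching,
           every small encrypt call would pay the ~70-80 ns for the
           surrounding clts/stts, which is exactly what
           arch/x86/crypto/fpu.c exists to avoid. */
        kernel_fpu_begin();
        while (nblocks--) {
                aesni_enc(ctx, dst, src);       /* AES-NI, clobbers xmm */
                src += AES_BLOCK_SIZE;
                dst += AES_BLOCK_SIZE;
        }
        kernel_fpu_end();
}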

Making the whole kernel safe for xstate use would be technically
possible, but it would add about three cycles to syscalls (for
vzeroall -- non-AVX machines would take a larger hit) and apparently
about 57 ns to non-syscall traps.  That seems worse than the lazier
approach.

--Andy

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 22:34   ` Andrew Lutomirski
@ 2011-07-25  3:21     ` Andrew Lutomirski
  2011-07-25  6:42       ` Ingo Molnar
  2011-07-25 10:05       ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski
  2011-07-25  6:38     ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
  1 sibling, 2 replies; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-25  3:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

On Sun, Jul 24, 2011 at 6:34 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> I had in mind something a little less ambitious: making
> kernel_fpu_begin very fast, especially when used more than once.
> Currently it's slow enough to have spawned arch/x86/crypto/fpu.c,
> which is a hideous piece of infrastructure that exists solely to
> reduce the number of kernel_fpu_begin/end pairs when using AES-NI.
> Clobbering registers in syscall would reduce the cost even more, but
> it might require having a way to detect whether the most recent kernel
> entry was via syscall or some other means.

I think it will be very hard to inadvertently cause a regression,
because the current code looks pretty bad.

1. Once a task uses xstate for five timeslices, the kernel decides
that it will continue using it.  The only thing that clears that
condition is __unlazy_fpu called with TS_USEDFPU set.  The only way I
can see for that to happen is if kernel_fpu_begin is called twice in a
row between context switches, and that has little to do with the task's
xstate usage.

2. __switch_to, when switching to a task with fpu_counter > 5, will do
stts(); clts().

The combination means that when switching between two xstate-using
tasks (or even tasks that were once xstate-using), we pay the full
price of a state save/restore *and* stts/clts.
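
(The relevant pieces, condensed from arch/x86/include/asm/i387.h and the
__switch_to implementations, for anyone following along:)

static inline void __unlazy_fpu(struct task_struct *tsk)
{
        if (task_thread_info(tsk)->status & TS_USEDFPU) {
                __save_init_fpu(tsk);
                stts();                         /* sets TS... */
        } else
                tsk->fpu_counter = 0;
}

/* ...while __switch_to() does, in essence: */
        preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

        __unlazy_fpu(prev_p);
        /* ... */
        if (preload_fpu)
                clts();                         /* ...only to clear TS again */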

--Andy

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 22:34   ` Andrew Lutomirski
  2011-07-25  3:21     ` Andrew Lutomirski
@ 2011-07-25  6:38     ` Ingo Molnar
  2011-07-25  9:44       ` Andrew Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25  6:38 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity


* Andrew Lutomirski <luto@mit.edu> wrote:

> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Lutomirski <luto@mit.edu> wrote:
> >
> >> I was trying to understand the FPU/xstate saving code, and I ran
> >> some benchmarks with surprising results.  These are all on Sandy
> >> Bridge i7-2600.  Please take all numbers with a grain of salt --
> >> they're in tight-ish loops and don't really take into account
> >> real-world cache effects.
> >>
> >> A clts/stts pair takes about 80 ns.  Accessing extended state from
> >> userspace with TS set takes 239 ns.  A kernel_fpu_begin /
> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
> >> (presumably 79 of those 80 are the clts/stts).  (Note: The numbers
> >> in this paragraph were measured using a hacked-up kernel and KVM.)
> >>
> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
> >> xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns,
> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
> >>
> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
> >> ns and xsaveopt saves another 5 ns.
> >>
> >> Zeroing the state completely with vzeroall adds 2 ns.  Not sure
> >> what's going on.
> >>
> >> All of this makes me think that, at least on Sandy Bridge, lazy
> >> xstate saving is a bad optimization -- if the cache is being nice,
> >> save/restore is faster than twiddling the TS bit.  And the cost of
> >> the trap when TS is set blows everything else away.
> >
> > Interesting. Mind cooking up a delazying patch and measure it on
> > native as well? KVM generally makes exceptions more expensive, so the
> > effect of lazy exceptions might be less on native.
> 
> Using the same patch on native, I get:
> 
> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
> stts/clts: 73 ns (clearly there's a bit of error here)
> userspace xstate with TS set: 229 ns
> 
> So virtualization adds only a little bit of overhead.

KVM rocks.

> This isn't really a delazying patch -- it's two arch_prctls, one of 
> them is kernel_fpu_begin();kernel_fpu_end().  The other is the same 
> thing in a loop.
> 
> The other numbers were already native since I measured them 
> entirely in userspace.  They look the same after rebooting.

I should have mentioned it earlier, but there's a certain amount of 
delazying patches in the tip:x86/xsave branch:

 $ gll linus..x86/xsave
 300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
 f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
 66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
 1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
 4182a4d68bac: x86, xsave: add support for non-lazy xstates
 324cbb83e215: x86, xsave: more cleanups
 2efd67935eb7: x86, xsave: remove unused code
 0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
 7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
 26bce4e4c56f: x86, xsave: cleanup fpu/xsave support

it's not in tip:master because the LWP bits need (much) more work to 
be palatable - but we could spin them off and complete them as per 
your suggestions if they are an independent speedup on modern CPUs.

> >> Which brings me to another question: what do you think about
> >> declaring some of the extended state to be clobbered by syscall?
> >> Ideally, we'd treat syscall like a regular function and clobber
> >> everything except the floating point control word and mxcsr.  More
> >> conservatively, we'd leave xmm and x87 state but clobber ymm.  This
> >> would let us keep the cost of the state save and restore down when
> >> kernel_fpu_begin is used in a syscall path and when a context
> >> switch happens as a result of a syscall.
> >>
> >> glibc does *not* mark the xmm registers as clobbered when it issues
> >> syscalls, but I suspect that everything everywhere that issues
> >> syscalls does it from a function, and functions are implicitly
> >> assumed to clobber extended state.  (And if anything out there
> >> assumes that ymm state is preserved, I'd be amazed.)
> >
> > To build the kernel with sse optimizations? Would certainly be
> > interesting to try.
> 
> I had in mind something a little less ambitious: making 
> kernel_fpu_begin very fast, especially when used more than once. 
> Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, 
> which is a hideous piece of infrastructure that exists solely to 
> reduce the number of kernel_fpu_begin/end pairs when using AES-NI. 
> Clobbering registers in syscall would reduce the cost even more, 
> but it might require having a way to detect whether the most recent 
> kernel entry was via syscall or some other means.
> 
> Making the whole kernel safe for xstate use would be technically 
> possible, but it would add about three cycles to syscalls (for 
> vzeroall -- non-AVX machines would take a larger hit) and 
> apparently about 57 ns to non-syscall traps.  That seems worse than 
> the lazier approach.

3 cycles per syscall is acceptable, if the average optimization 
savings per syscall are better than 3 cycles - which is not 
impossible at all: using more registers generally moves the pressure 
away from GP registers and allows the compiler to be smarter.

(older CPUs with higher switching costs wouldn't want to run such 
kernels, obviously.)

So it's very much worth trying, if only to get some hard numbers.

That would also turn the somewhat awkward way we use vector 
operations in the crypto code into something more natural. In theory 
you could write a crypto algorithm in C and the compiler would use 
vector instructions and get a pretty good end result. (one can always 
hope, right?)

But more importantly, doing that would push vector operations *way* 
beyond the somewhat niche area of crypto/RAID optimizations. 
User-space already saves/restores the vector registers, so much of the 
register switching cost has already been paid - the kernel just has to 
take advantage of that.

Thanks,

	Ingo

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25  3:21     ` Andrew Lutomirski
@ 2011-07-25  6:42       ` Ingo Molnar
  2011-07-25 10:05       ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25  6:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity


* Andrew Lutomirski <luto@mit.edu> wrote:

> On Sun, Jul 24, 2011 at 6:34 PM, Andrew Lutomirski <luto@mit.edu> wrote:
> >
> > I had in mind something a little less ambitious: making 
> > kernel_fpu_begin very fast, especially when used more than once. 
> > Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, 
> > which is a hideous piece of infrastructure that exists solely to 
> > reduce the number of kernel_fpu_begin/end pairs when using 
> > AES-NI. Clobbering registers in syscall would reduce the cost 
> > even more, but it might require having a way to detect whether 
> > the most recent kernel entry was via syscall or some other means.
> 
> I think it will be very hard to inadvertently cause a regression, 
> because the current code looks pretty bad.

[ heh, one of the rare cases where bad code works in our favor ;-) ]

> 1. Once a task uses xstate for five timeslices, the kernel decides 
> that it will continue using it.  The only thing that clears that 
> condition is __unlazy_fpu called with TS_USEDFPU set.  The only way 
> I can see for that to happen is if kernel_fpu_begin is called twice 
> in a row between context switches, and that has little to do with the 
> task's xstate usage.
> 
> 2. __switch_to, when switching to a task with fpu_counter > 5, will 
> do stts(); clts().
> 
> The combination means that when switching between two xstate-using 
> tasks (or even tasks that were once xstate-using), we pay the full 
> price of a state save/restore *and* stts/clts.

I'm all for simplifying this for modern x86 CPUs.

The lazy FPU switching logic was kind of neat on UP but started 
showing its limitations with SMP already - and that was 10 years ago.

So if the numbers prove you right then go for it. It's an added bonus 
that this could enable the kernel to be built using vector 
instructions - you may or may not want to shoot for the glory of 
achieving that feat first ;-)

Thanks,

	Ingo

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 21:15 ` Ingo Molnar
  2011-07-24 22:34   ` Andrew Lutomirski
@ 2011-07-25  7:42   ` Avi Kivity
  2011-07-25  7:54     ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2011-07-25  7:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Lutomirski, linux-kernel, x86, Linus Torvalds, Arjan van de Ven

On 07/25/2011 12:15 AM, Ingo Molnar wrote:
> >  All of this makes me think that, at least on Sandy Bridge, lazy
> >  xstate saving is a bad optimization -- if the cache is being nice,
> >  save/restore is faster than twiddling the TS bit.  And the cost of
> >  the trap when TS is set blows everything else away.
>
> Interesting. Mind cooking up a delazying patch and measure it on
> native as well? KVM generally makes exceptions more expensive, so the
> effect of lazy exceptions might be less on native.

While this is true in general, kvm will trap #NM only after a host 
context switch or an exit to host userspace.  These are supposedly rare 
so you won't see them a lot, especially in a benchmark scenario with 
just one guest.

("host context switch" includes switching to the idle thread when the 
guest executes HLT, something I tried to optimize in the past but it 
proved too difficult for the gain)

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25  7:42   ` Avi Kivity
@ 2011-07-25  7:54     ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25  7:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrew Lutomirski, linux-kernel, x86, Linus Torvalds, Arjan van de Ven


* Avi Kivity <avi@redhat.com> wrote:

> On 07/25/2011 12:15 AM, Ingo Molnar wrote:
> >>  All of this makes me think that, at least on Sandy Bridge, lazy
> >>  xstate saving is a bad optimization -- if the cache is being nice,
> >>  save/restore is faster than twiddling the TS bit.  And the cost of
> >>  the trap when TS is set blows everything else away.
> >
> > Interesting. Mind cooking up a delazying patch and measure it on 
> > native as well? KVM generally makes exceptions more expensive, so 
> > the effect of lazy exceptions might be less on native.
> 
> While this is true in general, kvm will trap #NM only after a host 
> context switch or an exit to host userspace.  These are supposedly 
> rare so you won't see them a lot, especially in a benchmark 
> scenario with just one guest.
> 
> ("host context switch" includes switching to the idle thread when 
> the guest executes HLT, something I tried to optimize in the past 
> but it proved too difficult for the gain)

Yeah - but this was a fair thing to test before Andy embarks on 
something more ambitious on the native side.

Thanks,

	Ingo

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25  6:38     ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
@ 2011-07-25  9:44       ` Andrew Lutomirski
  2011-07-25  9:51         ` Ingo Molnar
  2011-07-25 11:04         ` Hans Rosenfeld
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-25  9:44 UTC (permalink / raw)
  To: Ingo Molnar, Hans Rosenfeld
  Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Lutomirski <luto@mit.edu> wrote:
>
>> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Andrew Lutomirski <luto@mit.edu> wrote:
>> >
>> >> I was trying to understand the FPU/xstate saving code, and I ran
>> >> some benchmarks with surprising results.  These are all on Sandy
>> >> Bridge i7-2600.  Please take all numbers with a grain of salt --
>> >> they're in tight-ish loops and don't really take into account
>> >> real-world cache effects.
>> >>
>> >> A clts/stts pair takes about 80 ns.  Accessing extended state from
>> >> userspace with TS set takes 239 ns.  A kernel_fpu_begin /
>> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
>> >> (presumably 79 of those 80 are the clts/stts).  (Note: The numbers
>> >> in this paragraph were measured using a hacked-up kernel and KVM.)
>> >>
>> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
>> >> xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns,
>> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
>> >>
>> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
>> >> ns and xsaveopt saves another 5 ns.
>> >>
>> >> Zeroing the state completely with vzeroall adds 2 ns.  Not sure
>> >> what's going on.
>> >>
>> >> All of this makes me think that, at least on Sandy Bridge, lazy
>> >> xstate saving is a bad optimization -- if the cache is being nice,
>> >> save/restore is faster than twiddling the TS bit.  And the cost of
>> >> the trap when TS is set blows everything else away.
>> >
>> > Interesting. Mind cooking up a delazying patch and measure it on
>> > native as well? KVM generally makes exceptions more expensive, so the
>> > effect of lazy exceptions might be less on native.
>>
>> Using the same patch on native, I get:
>>
>> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
>> stts/clts: 73 ns (clearly there's a bit of error here)
>> userspace xstate with TS set: 229 ns
>>
>> So virtualization adds only a little bit of overhead.
>
> KVM rocks.
>
>> This isn't really a delazying patch -- it's two arch_prctls, one of
>> them is kernel_fpu_begin();kernel_fpu_end().  The other is the same
>> thing in a loop.
>>
>> The other numbers were already native since I measured them
>> entirely in userspace.  They look the same after rebooting.
>
> I should have mentioned it earlier, but there's a certain amount of
> delazying patches in the tip:x86/xsave branch:
>
>  $ gll linus..x86/xsave
>  300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
>  f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
>  66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
>  1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
>  4182a4d68bac: x86, xsave: add support for non-lazy xstates
>  324cbb83e215: x86, xsave: more cleanups
>  2efd67935eb7: x86, xsave: remove unused code
>  0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
>  7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
>  26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
>
> it's not in tip:master because the LWP bits need (much) more work to
> be palatable - but we could spin them off and complete them as per
> your suggestions if they are an independent speedup on modern CPUs.

Hans, what's the status of these?  I want to do some other cleanups
(now or in a couple of weeks) that will probably conflict with your
xsave work.

--Andy

* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25  9:44       ` Andrew Lutomirski
@ 2011-07-25  9:51         ` Ingo Molnar
  2011-07-25 11:04         ` Hans Rosenfeld
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25  9:51 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Hans Rosenfeld, linux-kernel, x86, Linus Torvalds,
	Arjan van de Ven, Avi Kivity


* Andrew Lutomirski <luto@mit.edu> wrote:

> On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Lutomirski <luto@mit.edu> wrote:
> >
> >> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> >
> >> > * Andrew Lutomirski <luto@mit.edu> wrote:
> >> >
> >> >> I was trying to understand the FPU/xstate saving code, and I ran
> >> >> some benchmarks with surprising results.  These are all on Sandy
> >> >> Bridge i7-2600.  Please take all numbers with a grain of salt --
> >> >> they're in tight-ish loops and don't really take into account
> >> >> real-world cache effects.
> >> >>
> >> >> A clts/stts pair takes about 80 ns.  Accessing extended state from
> >> >> userspace with TS set takes 239 ns.  A kernel_fpu_begin /
> >> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
> >> >> (presumably 79 of those 80 are the clts/stts).  (Note: The numbers
> >> >> in this paragraph were measured using a hacked-up kernel and KVM.)
> >> >>
> >> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
> >> >> xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24ns,
> >> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
> >> >>
> >> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
> >> >> ns and xsaveopt saves another 5 ns.
> >> >>
> >> >> Zeroing the state completely with vzeroall adds 2 ns.  Not sure
> >> >> what's going on.
> >> >>
> >> >> All of this makes me think that, at least on Sandy Bridge, lazy
> >> >> xstate saving is a bad optimization -- if the cache is being nice,
> >> >> save/restore is faster than twiddling the TS bit.  And the cost of
> >> >> the trap when TS is set blows everything else away.
> >> >
> >> > Interesting. Mind cooking up a delazying patch and measure it on
> >> > native as well? KVM generally makes exceptions more expensive, so the
> >> > effect of lazy exceptions might be less on native.
> >>
> >> Using the same patch on native, I get:
> >>
> >> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
> >> stts/clts: 73 ns (clearly there's a bit of error here)
> >> userspace xstate with TS set: 229 ns
> >>
> >> So virtualization adds only a little bit of overhead.
> >
> > KVM rocks.
> >
> >> This isn't really a delazying patch -- it's two arch_prctls, one of
> >> them is kernel_fpu_begin();kernel_fpu_end().  The other is the same
> >> thing in a loop.
> >>
> >> The other numbers were already native since I measured them
> >> entirely in userspace.  They look the same after rebooting.
> >
> > I should have mentioned it earlier, but there's a certain amount of
> > delazying patches in the tip:x86/xsave branch:
> >
> >  $ gll linus..x86/xsave
> >  300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
> >  f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
> >  66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
> >  1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
> >  4182a4d68bac: x86, xsave: add support for non-lazy xstates
> >  324cbb83e215: x86, xsave: more cleanups
> >  2efd67935eb7: x86, xsave: remove unused code
> >  0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
> >  7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
> >  26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
> >
> > it's not in tip:master because the LWP bits need (much) more work to
> > be palatable - but we could spin them off and complete them as per
> > your suggestions if they are an independent speedup on modern CPUs.
> 
> Hans, what's the status of these?  I want to do some other cleanups
> (now or in a couple of weeks) that will probably conflict with your
> xsave work.

if you extract this bit:

     1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)

then we can keep all the other patches.

this could be done by:

  git reset --hard 4182a4d68bac   # careful, this zaps your current dirty state
  git cherry-pick 66beba27e8b5
  git cherry-pick 300c6120b465

Thanks,

	Ingo

* [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to
  2011-07-25  3:21     ` Andrew Lutomirski
  2011-07-25  6:42       ` Ingo Molnar
@ 2011-07-25 10:05       ` Andy Lutomirski
  2011-07-25 11:12         ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2011-07-25 10:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Arjan van de Ven, Avi Kivity

An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and
when other things are going on it's apparently even worse.  This
saves 10% on context switches between threads that both use extended
state.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>, 
Cc: Avi Kivity <avi@redhat.com>
---

This is not as well tested as it should be (especially on 32-bit, where
I haven't actually tried compiling it), but I think this might be 3.1
material so I want to get it out for review before it's even more
unjustifiably late :)

Argument for inclusion in 3.1 (after a bit more testing):
 - It's dead simple.
 - It's a 10% speedup on context switching under the right conditions [1]
 - It's unlikely to slow any workload down, since it doesn't add any work
   anywhere.

Argument against:
 - It's late.

[1] https://gitorious.org/linux-test-utils/linux-clock-tests/blobs/master/context_switch_latency.c

 arch/x86/include/asm/i387.h  |   10 ++++++++++
 arch/x86/kernel/process_32.c |   10 ++++------
 arch/x86/kernel/process_64.c |    7 +++----
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..9d2d08b 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -295,6 +295,16 @@ static inline void __unlazy_fpu(struct task_struct *tsk)
 		tsk->fpu_counter = 0;
 }
 
+static inline void __unlazy_fpu_clts(struct task_struct *tsk)
+{
+	if (task_thread_info(tsk)->status & TS_USEDFPU) {
+		__save_init_fpu(tsk);
+	} else {
+		tsk->fpu_counter = 0;
+		clts();
+	}
+}
+
 static inline void __clear_fpu(struct task_struct *tsk)
 {
 	if (task_thread_info(tsk)->status & TS_USEDFPU) {
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a3d0dc5..c707741 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -304,7 +304,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
 
-	__unlazy_fpu(prev_p);
+	if (preload_fpu)
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/* we're going to use this soon, after a few expensive things */
 	if (preload_fpu)
@@ -348,11 +351,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		     task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
 		__switch_to_xtra(prev_p, next_p, tss);
 
-	/* If we're going to preload the fpu context, make sure clts
-	   is run while we're batching the cpu state updates. */
-	if (preload_fpu)
-		clts();
-
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
 	 * This must be done before restoring TLS segments so
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b1f3f53..272bddd 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -419,11 +419,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_TLS(next, cpu);
 
 	/* Must be after DS reload */
-	__unlazy_fpu(prev_p);
-
-	/* Make sure cpu is ready for new context */
 	if (preload_fpu)
-		clts();
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
-- 
1.7.6


* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25  9:44       ` Andrew Lutomirski
  2011-07-25  9:51         ` Ingo Molnar
@ 2011-07-25 11:04         ` Hans Rosenfeld
  1 sibling, 0 replies; 15+ messages in thread
From: Hans Rosenfeld @ 2011-07-25 11:04 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Ingo Molnar, linux-kernel, x86, Linus Torvalds, Arjan van de Ven,
	Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

On Mon, Jul 25, 2011 at 05:44:32AM -0400, Andrew Lutomirski wrote:
> On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > I should have mentioned it earlier, but there's a certain amount of
> > delazying patches in the tip:x86/xsave branch:
> >
> >  $ gll linus..x86/xsave
> >  300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
> >  f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
> >  66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
> >  1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
> >  4182a4d68bac: x86, xsave: add support for non-lazy xstates
> >  324cbb83e215: x86, xsave: more cleanups
> >  2efd67935eb7: x86, xsave: remove unused code
> >  0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
> >  7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
> >  26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
> >
> > it's not in tip:master because the LWP bits need (much) more work to
> > be palatable - but we could spin them off and complete them as per
> > your suggestions if they are an independent speedup on modern CPUs.
> 
> Hans, what's the status of these?  I want to do some other cleanups
> (now or in a couple of weeks) that will probably conflict with your
> xsave work.

I know of one bug in there that occasionally causes panics at boot, see
the attached patch for a fix.


Hans


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

[-- Attachment #2: 0001-x86-xsave-clear-pre-allocated-xsave-area.patch --]
[-- Type: text/plain, Size: 1056 bytes --]

From 599d3ee9a9e743377739480a8a893582f1409a8d Mon Sep 17 00:00:00 2001
From: Hans Rosenfeld <hans.rosenfeld@amd.com>
Date: Wed, 6 Jul 2011 16:31:19 +0200
Subject: [PATCH 1/1] x86, xsave: clear pre-allocated xsave area

Bogus data in the xsave area can cause xrstor to panic, so make sure
that the pre-allocated xsave area is all nice and clean before being
used.

Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
---
 arch/x86/kernel/process.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c5ae256..03c5ded 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,8 +28,15 @@ EXPORT_SYMBOL_GPL(task_xstate_cachep);
 
 int arch_prealloc_fpu(struct task_struct *tsk)
 {
-	if (!fpu_allocated(&tsk->thread.fpu))
-		return fpu_alloc(&tsk->thread.fpu);
+	if (!fpu_allocated(&tsk->thread.fpu)) {
+		int err = fpu_alloc(&tsk->thread.fpu);
+
+		if (err)
+			return err;
+
+		fpu_clear(&tsk->thread.fpu);
+	}
+
 	return 0;
 }
 
-- 
1.5.6.5


* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to
  2011-07-25 10:05       ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski
@ 2011-07-25 11:12         ` Ingo Molnar
  2011-07-25 13:04           ` Andrew Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25 11:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity


* Andy Lutomirski <luto@MIT.EDU> wrote:

> An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and
> when other things are going on it's apparently even worse.  This
> saves 10% on context switches between threads that both use extended
> state.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@infradead.org>, 
> Cc: Avi Kivity <avi@redhat.com>
> ---
> 
> This is not as well tested as it should be (especially on 32-bit, where
> I haven't actually tried compiling it), but I think this might be 3.1
> material so I want to get it out for review before it's even more
> unjustifiably late :)
> 
> Argument for inclusion in 3.1 (after a bit more testing):
>  - It's dead simple.
>  - It's a 10% speedup on context switching under the right conditions [1]
>  - It's unlikely to slow any workload down, since it doesn't add any work
>    anywhere.
> 
> Argument against:
>  - It's late.

I think it's late.

Would be much better to stick it into the x86/xsave tree i pointed to 
and treat and debug it as a coherent unit. FPU bugs need a lot of 
time to surface so we definitely do not want to fast-track it. In 
fact if we want it in v3.2 we should start assembling the tree right 
now.

Also, if you are tempted by the prospect of possibly enabling vector 
instructions for the x86 kernel, we could try that too, and get 
multiple speedups for the price of having to debug the tree only once 
;-)

Thanks,

	Ingo

* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to
  2011-07-25 11:12         ` Ingo Molnar
@ 2011-07-25 13:04           ` Andrew Lutomirski
  2011-07-25 14:13             ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-25 13:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity

On Mon, Jul 25, 2011 at 7:12 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andy Lutomirski <luto@MIT.EDU> wrote:
>
>> An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and
>> when other things are going on it's apparently even worse.  This
>> saves 10% on context switches between threads that both use extended
>> state.
>>
>> Signed-off-by: Andy Lutomirski <luto@mit.edu>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Arjan van de Ven <arjan@infradead.org>,
>> Cc: Avi Kivity <avi@redhat.com>
>> ---
>>
>> This is not as well tested as it should be (especially on 32-bit, where
>> I haven't actually tried compiling it), but I think this might be 3.1
>> material so I want to get it out for review before it's even more
>> unjustifiably late :)
>>
>> Argument for inclusion in 3.1 (after a bit more testing):
>>  - It's dead simple.
>>  - It's a 10% speedup on context switching under the right conditions [1]
>>  - It's unlikely to slow any workload down, since it doesn't add any work
>>    anywhere.
>>
>> Argument against:
>>  - It's late.
>
> I think it's late.
>
> Would be much better to stick it into the x86/xsave tree i pointed to
> and treat and debug it as a coherent unit. FPU bugs need a lot of
> time to surface so we definitely do not want to fast-track it. In
> fact if we want it in v3.2 we should start assembling the tree right
> now.

Fair enough.  I make no guarantee that I'll have anything ready in
less than a few weeks.  I'm defending my thesis in a week, and kernel
hacking is entirely a distraction. :)  (The only thing my thesis has
to do with operating systems is that I mention recvmmsg.)

>
> Also, if you are tempted by the prospect of possibly enabling vector
> instructions for the x86 kernel, we could try that too, and get
> multiple speedups for the price of having to debug the tree only once
> ;-)

I'll play with it.  I have some other cleanup / speedup ideas, too,
and I'll see where they go.  Given that the kernel doesn't really use
floating-point math, I'm not sure that gcc will do much unless we turn
on -ftree-vectorize, and that's a little scary.

--Andy

* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to
  2011-07-25 13:04           ` Andrew Lutomirski
@ 2011-07-25 14:13             ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25 14:13 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity


* Andrew Lutomirski <luto@mit.edu> wrote:

> > Also, if you are tempted by the prospect of possibly enabling 
> > vector instructions for the x86 kernel, we could try that too, 
> > and get multiple speedups for the price of having to debug the 
> > tree only once ;-)
> 
> I'll play with it.  I have some other cleanup / speedup ideas, too, 
> and I'll see where they go.  Given that the kernel doesn't really 
> use floating-point math, I'm not sure that gcc will do much unless 
> we turn on -ftree-vectorize, and that's a little scary.

It's indeed scary - but as long as it boots it would allow some 
baseline figures to be estimated - is there any win, and if yes, how 
much. It might be a complete dud in the end.

Thanks,

	Ingo
