* Lazy FPU restoration / moving kernel_fpu_end() to context switch
@ 2018-06-15 13:11 Jason A. Donenfeld
  2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:31 ` Andy Lutomirski
  0 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 13:11 UTC (permalink / raw)
  To: LKML, X86 ML, Andy Lutomirski

Hi Andy & folks,

Lots of crypto routines look like this:

    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();

If you call such a routine twice, you get:

    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();
    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();

In a loop this looks like:

    for (thing) {
        kernel_fpu_begin();
        encrypt(thing);
        kernel_fpu_end();
    }

This is obviously very bad, because begin() and end() are slow, so
WireGuard does the obvious:

    kernel_fpu_begin();
    for (thing)
        encrypt(thing);
    kernel_fpu_end();

This is fine and well, and the crypto API I'm working on will enable
this to be done in a clear way, but I do wonder if maybe this is not
something that should be happening at the level of the caller, but
rather in the fpu functions themselves. Namely, what are your thoughts
on modifying kernel_fpu_end() so that it doesn't actually restore the
state, but just reenables preemption and marks that on the next
context switch, the state should be restored? Then, essentially,
kernel_fpu_begin() and end() become free after the first usage of
kernel_fpu_begin().

Is this something feasible? I know that performance-wise, I'm really
gaining a lot from hoisting those functions out of the loops, and API
wise, it'd be slightly simpler to implement if I didn't have to all
for that hoisting.

Regards,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 13:11 Lazy FPU restoration / moving kernel_fpu_end() to context switch Jason A. Donenfeld
@ 2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:33   ` Brian Gerst
  2018-06-15 19:34   ` Peter Zijlstra
  2018-06-15 18:31 ` Andy Lutomirski
  1 sibling, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2018-06-15 16:25 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML, X86 ML, Andy Lutomirski, Peter Zijlstra

On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> In a loop this looks like:
>
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
>     encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable

It might be fine crypto performance wise, but it's a total nightmare
latency wise because kernel_fpu_begin() disables preemption. We've seen
latencies in the larger millisecond range due to processing large data
sets with kernel FPU.

If you want to go there then we really need a better approach which
allows kernel FPU usage in preemptible context and in case of preemption
a way to stash the preempted FPU context and restore it when the task
gets scheduled in again. Just using the existing FPU stuff and moving
the loops inside the begin/end section and keeping preemption disabled
for arbitrary time spans is not going to fly.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 16:25 ` Thomas Gleixner
@ 2018-06-15 18:33   ` Brian Gerst
  2018-06-15 19:34   ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Brian Gerst @ 2018-06-15 18:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason A. Donenfeld, LKML, X86 ML, Andy Lutomirski, Peter Zijlstra

On Fri, Jun 15, 2018 at 12:25 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
>> In a loop this looks like:
>>
>> for (thing) {
>>     kernel_fpu_begin();
>>     encrypt(thing);
>>     kernel_fpu_end();
>> }
>>
>> This is obviously very bad, because begin() and end() are slow, so
>> WireGuard does the obvious:
>>
>> kernel_fpu_begin();
>> for (thing)
>>     encrypt(thing);
>> kernel_fpu_end();
>>
>> This is fine and well, and the crypto API I'm working on will enable
>
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data
> sets with kernel FPU.
>
> If you want to go there then we really need a better approach which
> allows kernel FPU usage in preemptible context and in case of preemption
> a way to stash the preempted FPU context and restore it when the task
> gets scheduled in again. Just using the existing FPU stuff and moving
> the loops inside the begin/end section and keeping preemption disabled
> for arbitrary time spans is not going to fly.

One optimization that can be done is to delay restoring the user FPU
state until we exit to userspace. That way the FPU is saved and restored
only once no matter how many times kernel_fpu_begin()/kernel_fpu_end()
are called.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:33   ` Brian Gerst
@ 2018-06-15 19:34   ` Peter Zijlstra
  2018-06-15 20:30     ` Jason A. Donenfeld
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-06-15 19:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Jason A. Donenfeld, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 06:25:39PM +0200, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> > In a loop this looks like:
> >
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
> >
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> >
> > kernel_fpu_begin();
> > for (thing)
> >     encrypt(thing);
> > kernel_fpu_end();
> >
> > This is fine and well, and the crypto API I'm working on will enable
>
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data
> sets with kernel FPU.
>
> If you want to go there then we really need a better approach which
> allows kernel FPU usage in preemptible context and in case of preemption
> a way to stash the preempted FPU context and restore it when the task
> gets scheduled in again. Just using the existing FPU stuff and moving
> the loops inside the begin/end section and keeping preemption disabled
> for arbitrary time spans is not going to fly.

Didn't we recently do a bunch of crypto patches to help with this?

I think they had the pattern:

	kernel_fpu_begin();
	for (units-of-work) {
		do_unit_of_work();
		if (need_resched()) {
			kernel_fpu_end();
			cond_resched();
			kernel_fpu_begin();
		}
	}
	kernel_fpu_end();

I'd have to go dig out the actual series, but I think they had a bunch
of helpers to deal with that.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 19:34   ` Peter Zijlstra
@ 2018-06-15 20:30     ` Jason A. Donenfeld
  2018-06-18  9:44       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> Didn't we recently do a bunch of crypto patches to help with this?
>
> I think they had the pattern:
>
>	kernel_fpu_begin();
>	for (units-of-work) {
>		do_unit_of_work();
>		if (need_resched()) {
>			kernel_fpu_end();
>			cond_resched();
>			kernel_fpu_begin();
>		}
>	}
>	kernel_fpu_end();

Right, so that's the thing -- this is an optimization easily available
to individual crypto primitives. But I'm interested in applying this
kind of optimization to an entire queue of, say, tiny packets, where
each packet is processed individually. Or, to a cryptographic
construction, where several different primitives are used, such that
it'd be meaningful not to have to get the performance hit of
end()begin() in between each and every one of them.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:30     ` Jason A. Donenfeld
@ 2018-06-18  9:44       ` Peter Zijlstra
  2018-06-18 15:25         ` Jason A. Donenfeld
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-06-18 9:44 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > Didn't we recently do a bunch of crypto patches to help with this?
> >
> > I think they had the pattern:
> >
> >	kernel_fpu_begin();
> >	for (units-of-work) {
> >		do_unit_of_work();
> >		if (need_resched()) {
> >			kernel_fpu_end();
> >			cond_resched();
> >			kernel_fpu_begin();
> >		}
> >	}
> >	kernel_fpu_end();
>
> Right, so that's the thing -- this is an optimization easily available
> to individual crypto primitives. But I'm interested in applying this
> kind of optimization to an entire queue of, say, tiny packets, where
> each packet is processed individually. Or, to a cryptographic
> construction, where several different primitives are used, such that
> it'd be meaningful not to have to get the performance hit of
> end()begin() in between each and everyone of them.

I'm confused.. how does the above not apply to your situation?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-18  9:44       ` Peter Zijlstra
@ 2018-06-18 15:25         ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-18 15:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Mon, Jun 18, 2018 at 11:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > Didn't we recently do a bunch of crypto patches to help with this?
> > >
> > > I think they had the pattern:
> > >
> > >	kernel_fpu_begin();
> > >	for (units-of-work) {
> > >		do_unit_of_work();
> > >		if (need_resched()) {
> > >			kernel_fpu_end();
> > >			cond_resched();
> > >			kernel_fpu_begin();
> > >		}
> > >	}
> > >	kernel_fpu_end();
> >
> > Right, so that's the thing -- this is an optimization easily available
> > to individual crypto primitives. But I'm interested in applying this
> > kind of optimization to an entire queue of, say, tiny packets, where
> > each packet is processed individually. Or, to a cryptographic
> > construction, where several different primitives are used, such that
> > it'd be meaningful not to have to get the performance hit of
> > end()begin() in between each and everyone of them.
>
> I'm confused.. how does the above not apply to your situation?

In the example you've given, the optimization is applied at the level
of the, say, encryption function. Suppose you send a scattergather off
to an encryption function, which then walks the sglist and encrypts
each of the parts using some particular key. For each of the parts, it
benefits from the above optimization. But what I'm referring to is
encrypting multiple different things, with different keys. In the case
I'd like to optimize, I have a worker thread that's processing a large
queue of separate sglists and encrypting them separately under
different keys. In this case, having kernel_fpu_begin/end inside the
encryption function itself is a problem, since that means toggling the
FPU in between every queue item.

The solution, for now, is to just hoist the kernel_fpu_begin/end out
of the encryption function, and put them instead at the beginning and
end of my worker thread that handles all the items of the queue. This
is fine and dandy, but far from ideal, as putting that kind of logic
inside the encryption function itself makes more sense. For example,
the encryption function can decide whether or not it even wants to use
the FPU before calling kernel_fpu_begin. Ostensibly this logic too
could be hoisted outside, but at what point do you draw the line and
decide these optimizations are leading the API in the wrong direction?

Hence, the idea here in this thread is to make it cost-free to place
kernel_fpu_begin/end as close as possible to the actual use of the
FPU. The solution, it seems, is to have the actual kernel_fpu_end work
occur on context switch.

^ permalink raw reply	[flat|nested] 23+ messages in thread
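For illustration, the hoisted worker-loop shape Jason describes above
might look roughly like the sketch below. The queue, packet, and helper
names are made up; this is not actual WireGuard code, just the pattern
of paying for kernel_fpu_begin()/kernel_fpu_end() once per drained
queue rather than once per item -- which is also exactly the pattern
that keeps preemption disabled for the whole queue, as tglx points out
earlier in the thread.

    #include <linux/workqueue.h>
    #include <asm/fpu/api.h>

    /* Illustrative only: hypothetical types and helpers. */
    static void crypt_worker(struct work_struct *work)
    {
            struct crypt_queue *queue =
                    container_of(work, struct crypt_queue, work);
            struct crypt_packet *pkt;

            kernel_fpu_begin();             /* paid once for the whole queue */
            while ((pkt = crypt_queue_pop(queue)) != NULL)
                    encrypt_packet(pkt, pkt->key);  /* keys may differ per packet */
            kernel_fpu_end();               /* single restore at the end */
    }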
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 13:11 Lazy FPU restoration / moving kernel_fpu_end() to context switch Jason A. Donenfeld
  2018-06-15 16:25 ` Thomas Gleixner
@ 2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
                      ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:31 UTC (permalink / raw)
  To: Jason A. Donenfeld, Rik van Riel, Dave Hansen; +Cc: LKML, X86 ML

On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Andy & folks,
>
> Lots of crypto routines look like this:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> If you call such a routine twice, you get:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> In a loop this looks like:
>
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
>     encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable
> this to be done in a clear way, but I do wonder if maybe this is not
> something that should be happening at the level of the caller, but
> rather in the fpu functions themselves. Namely, what are your thoughts
> on modifying kernel_fpu_end() so that it doesn't actually restore the
> state, but just reenables preemption and marks that on the next
> context switch, the state should be restored? Then, essentially,
> kernel_fpu_begin() and end() become free after the first usage of
> kernel_fpu_begin().
>
> Is this something feasible? I know that performance-wise, I'm really
> gaining a lot from hoisting those functions out of the loops, and API
> wise, it'd be slightly simpler to implement if I didn't have to all
> for that hoisting.

Hi Jason-

Funny you asked. This has been discussed a few times, although not
quite in the form you imagined. The idea that we've tossed around is
to restore FPU state on return to user mode. Roughly, we'd introduce
a new thread flag TIF_FPU_UNLOADED (name TBD).
prepare_exit_to_usermode() would notice this flag, copy the fpstate to
fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
one has quite thought it through, but I think it should be outside the
loop.) We'd update all the FPU accessors to understand the flag.

We'd probably have a helper like:

    void unload_user_fpstate(void)

that would do nothing if TIF_FPU_UNLOADED is set. If TIF_FPU_UNLOADED
is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
We'd call unload_user_fpstate() during context switch in the prev task
context and in kernel_fpu_begin().

The one major complication I know of is that PKRU is different from
all other FPU state in that it needs to be loaded when *kernel* code
does any user memory access. So we'd need to make sure that context
switches load the new task's PKRU eagerly. Using WRPKRU is easy, but,
unless we do something very clever, actually finding PKRU in the
in-memory fpstate image may be slightly nontrivial.

I suppose we could eagerly load the new FPU state on context switches,
but I'm not sure I see any point. We'd still have to load it when we
switch back to user mode, so it would be slower but not necessarily
any simpler.

I'd love to review patches to make this change, but I don't have the
bandwidth to write them right now.
^ permalink raw reply [flat|nested] 23+ messages in thread
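To make the shape of this proposal concrete, a minimal sketch of the
lazy-unload scheme Andy outlines might look something like the
following. This is not a real patch: TIF_FPU_UNLOADED and the
exit-to-usermode hook are hypothetical names from the proposal itself,
while copy_fpregs_to_fpstate() and copy_kernel_to_fpregs() are the
existing helpers of that era; the actual wiring into the entry code and
the FPU accessors is omitted.

    /* Sketch only -- not a real patch. */
    void unload_user_fpstate(void)
    {
            if (test_thread_flag(TIF_FPU_UNLOADED))
                    return;                         /* already saved to memory */
            copy_fpregs_to_fpstate(&current->thread.fpu);
            set_thread_flag(TIF_FPU_UNLOADED);
    }

    /*
     * kernel_fpu_begin() would call unload_user_fpstate() instead of
     * saving unconditionally; kernel_fpu_end() would only re-enable
     * preemption, leaving the registers trashed until...
     */

    /* ...the return to user space (hypothetical hook name): */
    void fpu_reload_on_exit_to_usermode(void)
    {
            if (!test_and_clear_thread_flag(TIF_FPU_UNLOADED))
                    return;
            copy_kernel_to_fpregs(&current->thread.fpu.state);
    }

With that in place, repeated kernel_fpu_begin()/end() pairs between two
user-mode transitions cost one XSAVE and one XRSTOR in total, which is
the property Jason is after.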
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
@ 2018-06-15 18:40   ` Rik van Riel
  2018-06-15 18:41   ` Dave Hansen
                      ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-06-15 18:40 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Dave Hansen; +Cc: LKML, X86 ML

[-- Attachment #1: Type: text/plain, Size: 3746 bytes --]

On Fri, 2018-06-15 at 11:31 -0700, Andy Lutomirski wrote:
> On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hi Andy & folks,
> >
> > Lots of crypto routines look like this:
> >
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> >
> > If you call such a routine twice, you get:
> >
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> >
> > In a loop this looks like:
> >
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
> >
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> >
> > kernel_fpu_begin();
> > for (thing)
> >     encrypt(thing);
> > kernel_fpu_end();
> >
> > This is fine and well, and the crypto API I'm working on will enable
> > this to be done in a clear way, but I do wonder if maybe this is not
> > something that should be happening at the level of the caller, but
> > rather in the fpu functions themselves. Namely, what are your thoughts
> > on modifying kernel_fpu_end() so that it doesn't actually restore the
> > state, but just reenables preemption and marks that on the next
> > context switch, the state should be restored? Then, essentially,
> > kernel_fpu_begin() and end() become free after the first usage of
> > kernel_fpu_begin().
> >
> > Is this something feasible? I know that performance-wise, I'm really
> > gaining a lot from hoisting those functions out of the loops, and API
> > wise, it'd be slightly simpler to implement if I didn't have to all
> > for that hoisting.
>
> Hi Jason-
>
> Funny you asked. This has been discussed a few times, although not
> quite in the form you imagined. The idea that we've tossed around is
> to restore FPU state on return to user mode. Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside the
> loop.) We'd update all the FPU accessors to understand the flag.
> We'd probably have a helper like:
>
> void unload_user_fpstate(void)
>
> that would do nothing if TIF_FPU_UNLOADED is set. If TIF_FPU_UNLOADED
> is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
> We'd call unload_user_fpstate() during context switch in the prev task
> context and in kernel_fpu_begin().
>
> The one major complication I know of is that PKRU is different from
> all other FPU state in that it needs to be loaded when *kernel* code
> does any user memory access. So we'd need to make sure that context
> switches load the new task's PKRU eagerly. Using WRPKRU is easy, but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

IIRC the in-kernel FPU state always has every FPU field at a
constant location.

> I suppose we could eagerly load the new FPU state on context switches,
> but I'm not sure I see any point. We'd still have to load it when we
> switch back to user mode, so it would be slower but not necessarily
> any simpler.
>
> I'd love to review patches to make this change, but I don't have the
> bandwidth to write them right now.

I can take a look at this, I am already looking at some context
switch code right now, anyway.

I also should still have the TIF_FPU_UNLOADED patches (or whatever
the flag was called) code around somewhere.

--
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
@ 2018-06-15 18:41   ` Dave Hansen
  2018-06-15 18:49     ` Andy Lutomirski
  2018-06-15 18:48   ` Dave Hansen
  2018-06-15 20:33   ` Jason A. Donenfeld
  3 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 18:41 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Rik van Riel; +Cc: LKML, X86 ML

On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> Using WRPKRU is easy, but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

Why?

It's at a constant offset during any one boot, for sure. XSAVEC/XSAVES
can move it around, but only if you change the Requested Feature BitMap
(RFBM) that we pass to XSAVES or change the control register that
enables XSAVE states (XCR0).

We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
I remember.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:41   ` Dave Hansen
@ 2018-06-15 18:49     ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Lutomirski, Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 11:43 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > Using WRPKRU is easy, but,
> > unless we do something very clever, actually finding PKRU in the
> > in-memory fpstate image may be slightly nontrivial.
>
> Why?
>
> It's at a constant offset during any one boot, for sure. XSAVEC/XSAVES
> can move it around, but only if you change the Requested Feature BitMap
> (RFBM) that we pass to XSAVES or change the control register that
> enables XSAVE states (XCR0).
>
> We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
> I remember.

I thought that XSAVES didn't allocate space for parts of the state that
are in the init state. I guess I was wrong :)

So never mind, context switch should just need to WRPKRU the field at
the predetermined offset in fpstate for the new task. And, if some
appropriate debug option is set, it should warn if TIF_FPU_UNLOADED is
clear for the new task.

I suspect that the whole patch should only be a couple hundred lines of
code. And I think there are VM workloads where it would be a *huge* win.

--Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread
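A rough sketch of what "WRPKRU the field at the predetermined offset"
could look like at context-switch time is below. The helper name
switch_pkru() is invented; get_xsave_addr(), struct pkru_state and
__write_pkru() are the existing helpers from that era, and the fallback
value used when the PKRU component is in its init state is a policy
detail left open here.

    /* Sketch: eagerly switch PKRU even though the rest of the FPU
     * state stays lazy until return to user mode. */
    static inline void switch_pkru(struct fpu *next_fpu)
    {
            struct pkru_state *pk;

            if (!boot_cpu_has(X86_FEATURE_OSPKE))
                    return;

            pk = get_xsave_addr(&next_fpu->state.xsave, XFEATURE_MASK_PKRU);
            /* NULL means the PKRU component is in its init state;
             * what to write then (0 or a default) is a policy choice. */
            __write_pkru(pk ? pk->pkru : 0);
    }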
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
  2018-06-15 18:41   ` Dave Hansen
@ 2018-06-15 18:48   ` Dave Hansen
  2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:33   ` Jason A. Donenfeld
  3 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 18:48 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Rik van Riel; +Cc: LKML, X86 ML

On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }

Don't forget that the processor has optimizations for this, too. The
"modified optimization" will notice that between:

	kernel_fpu_end();   -> XRSTOR
and
	kernel_fpu_start(); -> XSAVE(S|OPT)

the processor has not modified the states. It'll skip doing any writes
of the state. Doing what Andy is describing is still way better than
letting the processor do it, but you should just know up front that
this may not be as much of a win as you would expect.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:48   ` Dave Hansen
@ 2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:27       ` Jason A. Donenfeld
  2018-06-19 11:43       ` David Laight
  0 siblings, 2 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Lutomirski, Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
>
> Don't forget that the processor has optimizations for this, too. The
> "modified optimization" will notice that between:
>
>	kernel_fpu_end();   -> XRSTOR
> and
>	kernel_fpu_start(); -> XSAVE(S|OPT)
>
> the processor has not modified the states. It'll skip doing any writes
> of the state. Doing what Andy is describing is still way better than
> letting the processor do it, but you should just know up front that
> this may not be as much of a win as you would expect.

Even with the modified optimization, kernel_fpu_end() still needs to
reload the state that was trashed by the kernel FPU use. If the kernel
is using something like AVX512 state, then kernel_fpu_end() will
transfer an enormous amount of data no matter how clever the CPU is.
And I think I once measured XSAVEOPT taking a hundred cycles or so even
when RFBM==0, so it's not exactly super fast.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:53     ` Andy Lutomirski
@ 2018-06-15 20:27       ` Jason A. Donenfeld
  2018-06-15 20:48         ` Jason A. Donenfeld
  2018-06-19 11:43       ` David Laight
  1 sibling, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:27 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: dave.hansen, riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 8:53 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use. If the kernel
> is using something like AVX512 state, then kernel_fpu_end() will
> transfer an enormous amount of data no matter how clever the CPU is.
> And I think I once measured XSAVEOPT taking a hundred cycles or so even
> when RFBM==0, so it's not exactly super fast.

Indeed the speed up is really significant, especially for the AVX512
case. Here are some numbers from my laptop and a server taken a few
seconds ago:

AVX2 - Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Inside: 684617437
Outside: 547710093
Percent speedup: 24

AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Inside: 634415672
Outside: 286698960
Percent speedup: 121

This is from this test -- https://xn--4db.cc/F7RF2fhv/c . There are
probably various issues with that test case, and it's possible there
are other effects going on (the avx512 case looks particularly insane)
to make the difference _that_ drastic, but I think there's no doubt
that the optimization here is a meaningful one.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:27       ` Jason A. Donenfeld
@ 2018-06-15 20:48         ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:48 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: dave.hansen, riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 10:27 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
> Inside: 634415672
> Outside: 286698960
> Percent speedup: 121

More realistic results, with turbo turned off:

Inside: 616851298
Outside: 339606790
Percent speedup: 81

Still a pretty outstanding speedup.

^ permalink raw reply	[flat|nested] 23+ messages in thread
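The linked test itself is not reproduced here, but the "Inside" versus
"Outside" comparison it reports presumably measures something with the
shape below: cycle counts for a fixed number of iterations with the
begin/end pair inside the loop versus hoisted around it. The
do_simd_work() routine and iteration count are stand-ins.

    /* Rough shape of an inside-vs-outside measurement; not the linked test. */
    static u64 bench(bool begin_end_inside, unsigned int iters)
    {
            u64 start = get_cycles();
            unsigned int i;

            if (begin_end_inside) {
                    for (i = 0; i < iters; ++i) {
                            kernel_fpu_begin();
                            do_simd_work();         /* hypothetical AVX2/AVX512 routine */
                            kernel_fpu_end();
                    }
            } else {
                    kernel_fpu_begin();
                    for (i = 0; i < iters; ++i)
                            do_simd_work();
                    kernel_fpu_end();
            }
            return get_cycles() - start;
    }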
* RE: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:27       ` Jason A. Donenfeld
@ 2018-06-19 11:43       ` David Laight
  2018-06-19 13:08         ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: David Laight @ 2018-06-19 11:43 UTC (permalink / raw)
  To: 'Andy Lutomirski', Dave Hansen
  Cc: Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

From: Andy Lutomirski
> Sent: 15 June 2018 19:54
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> >
> > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > for (thing) {
> > >     kernel_fpu_begin();
> > >     encrypt(thing);
> > >     kernel_fpu_end();
> > > }
> >
> > Don't forget that the processor has optimizations for this, too. The
> > "modified optimization" will notice that between:
> >
> >	kernel_fpu_end();   -> XRSTOR
> > and
> >	kernel_fpu_start(); -> XSAVE(S|OPT)
> >
> > the processor has not modified the states. It'll skip doing any writes
> > of the state. Doing what Andy is describing is still way better than
> > letting the processor do it, but you should just know up front that
> > this may not be as much of a win as you would expect.
>
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use. If the kernel
> is using something like AVX512 state, then kernel_fpu_end() will
> transfer an enormous amount of data no matter how clever the CPU is.
> And I think I once measured XSAVEOPT taking a hundred cycles or so even
> when RFBM==0, so it's not exactly super fast.

If the kernel was entered by a system call do you need to save the
AVX512 state at all?

IIRC the registers are all defined as 'caller saved' so there is no
expectation that they will be saved across the syscall wrapper function
call. All you need to do is ensure that 'kernel' values aren't passed
back to userspace. There is a single instruction to zero all the AVX512
registers.

	David

^ permalink raw reply	[flat|nested] 23+ messages in thread
* RE: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-19 11:43       ` David Laight
@ 2018-06-19 13:08         ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2018-06-19 13:08 UTC (permalink / raw)
  To: David Laight
  Cc: 'Andy Lutomirski', Dave Hansen, Jason A. Donenfeld,
	Rik van Riel, LKML, X86 ML

On Tue, 19 Jun 2018, David Laight wrote:
> From: Andy Lutomirski
> > Sent: 15 June 2018 19:54
> > On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> > >
> > > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > > for (thing) {
> > > >     kernel_fpu_begin();
> > > >     encrypt(thing);
> > > >     kernel_fpu_end();
> > > > }
> > >
> > > Don't forget that the processor has optimizations for this, too. The
> > > "modified optimization" will notice that between:
> > >
> > >	kernel_fpu_end();   -> XRSTOR
> > > and
> > >	kernel_fpu_start(); -> XSAVE(S|OPT)
> > >
> > > the processor has not modified the states. It'll skip doing any writes
> > > of the state. Doing what Andy is describing is still way better than
> > > letting the processor do it, but you should just know up front that
> > > this may not be as much of a win as you would expect.
> >
> > Even with the modified optimization, kernel_fpu_end() still needs to
> > reload the state that was trashed by the kernel FPU use. If the kernel
> > is using something like AVX512 state, then kernel_fpu_end() will
> > transfer an enormous amount of data no matter how clever the CPU is.
> > And I think I once measured XSAVEOPT taking a hundred cycles or so even
> > when RFBM==0, so it's not exactly super fast.
>
> If the kernel was entered by a system call do you need to save the
> AVX512 state at all?
>
> IIRC the registers are all defined as 'caller saved' so there is no
> expectation that they will be saved across the syscall wrapper function
> call. All you need to do is ensure that 'kernel' values aren't passed
> back to userspace. There is a single instruction to zero all the AVX512
> registers.

Then we need different treatment for exception entries and consecutive
preemption. Lots of corner cases to cover ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
                      ` (2 preceding siblings ...)
  2018-06-15 18:48   ` Dave Hansen
@ 2018-06-15 20:33   ` Jason A. Donenfeld
  2018-06-15 20:42     ` Dave Hansen
  2018-07-11 16:28     ` Sebastian Andrzej Siewior
  3 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:33 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: riel, dave.hansen, LKML, X86 ML

On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> quite in the form you imagined. The idea that we've tossed around is
> to restore FPU state on return to user mode. Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside the
> loop.) We'd update all the FPU accessors to understand the flag.

Yes! This is exactly what I was thinking. Then those calls to begin()
and end() could be placed as close to the actual FPU usage as possible.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:33   ` Jason A. Donenfeld
@ 2018-06-15 20:42     ` Dave Hansen
  2018-06-15 20:52       ` Andy Lutomirski
  2018-06-15 20:56       ` Rik van Riel
  1 sibling, 2 replies; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 20:42 UTC (permalink / raw)
  To: Jason A. Donenfeld, Andrew Lutomirski; +Cc: riel, LKML, X86 ML

On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
>> quite in the form you imagined. The idea that we've tossed around is
>> to restore FPU state on return to user mode. Roughly, we'd introduce
>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
>> one has quite thought it through, but I think it should be outside the
>> loop.) We'd update all the FPU accessors to understand the flag.
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

Andy, what was the specific concern about PKRU? That we might do:

	kernel_fpu_begin();          <- Saves the first time
	something()
	kernel_fpu_end();            <- Does not XRSTOR

	copy_from_user();            <- Sees old PKRU, does the wrong thing

	prepare_exit_to_usermode();  <- Does the XRSTOR
	// only now does PKRU have the right value
	SYSRET/IRET

?

Does that *matter* unless something() modified PKRU? We could just make
the rule that nobody is supposed to mess with it and that it's not
covered by kernel_fpu_begin/end() semantics. We could even theoretically
enforce that in a debug environment if we watch its value.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:42     ` Dave Hansen
@ 2018-06-15 20:52       ` Andy Lutomirski
  2018-06-15 20:56       ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 20:52 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jason A. Donenfeld, Andrew Lutomirski, riel, LKML, X86 ML

> On Jun 15, 2018, at 1:42 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
>>> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> quite in the form you imagined. The idea that we've tossed around is
>>> to restore FPU state on return to user mode. Roughly, we'd introduce
>>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>>> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
>>> one has quite thought it through, but I think it should be outside the
>>> loop.) We'd update all the FPU accessors to understand the flag.
>> Yes! This is exactly what I was thinking. Then those calls to begin()
>> and end() could be placed as close to the actual FPU usage as
>> possible.
>
> Andy, what was the specific concern about PKRU? That we might do:
>
>	kernel_fpu_begin();          <- Saves the first time
>	something()
>	kernel_fpu_end();            <- Does not XRSTOR
>
>	copy_from_user();            <- Sees old PKRU, does the wrong thing
>
>	prepare_exit_to_usermode();  <- Does the XRSTOR
>	// only now does PKRU have the right value
>	SYSRET/IRET
>
> ?
>
> Does that *matter* unless something() modified PKRU? We could just make
> the rule that nobody is supposed to mess with it and that it's not
> covered by kernel_fpu_begin/end() semantics. We could even theoretically
> enforce that in a debug environment if we watch its value.

Indeed, this case seems highly unlikely. I was imagining we have two
tasks. Task A enters the kernel and sleeps. Task B runs and gets its
PKRU loaded. Then it enters the kernel and we switch back to A. Now we
have A's FPU state in memory and B's PKRU in the register. If A does
copy_to_user, we lose.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:42     ` Dave Hansen
  2018-06-15 20:52       ` Andy Lutomirski
@ 2018-06-15 20:56       ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-06-15 20:56 UTC (permalink / raw)
  To: Dave Hansen, Jason A. Donenfeld, Andrew Lutomirski; +Cc: LKML, X86 ML

[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]

On Fri, 2018-06-15 at 13:42 -0700, Dave Hansen wrote:
> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > quite in the form you imagined. The idea that we've tossed around is
> > > to restore FPU state on return to user mode. Roughly, we'd introduce
> > > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> > > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> > > one has quite thought it through, but I think it should be outside the
> > > loop.) We'd update all the FPU accessors to understand the flag.
> >
> > Yes! This is exactly what I was thinking. Then those calls to begin()
> > and end() could be placed as close to the actual FPU usage as
> > possible.
>
> Andy, what was the specific concern about PKRU? That we might do:
>
>	kernel_fpu_begin();          <- Saves the first time
>	something()
>	kernel_fpu_end();            <- Does not XRSTOR
>
>	copy_from_user();            <- Sees old PKRU, does the wrong thing
>
>	prepare_exit_to_usermode();  <- Does the XRSTOR
>	// only now does PKRU have the right value
>	SYSRET/IRET
>
> ?
>
> Does that *matter* unless something() modified PKRU? We could just make
> the rule that nobody is supposed to mess with it and that it's not
> covered by kernel_fpu_begin/end() semantics. We could even theoretically
> enforce that in a debug environment if we watch its value.

KVM needs to change out guest and host PKRU values when switching
between guest and host mode, but since f775b13eedee ("x86,kvm: move
qemu/guest FPU switching out to vcpu_run") that no longer happens under
kernel_fpu_begin/end so we don't need to care about that :)

--
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:33   ` Jason A. Donenfeld
  2018-06-15 20:42     ` Dave Hansen
@ 2018-07-11 16:28     ` Sebastian Andrzej Siewior
  2018-07-11 20:10       ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-11 16:28 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Andrew Lutomirski, riel, dave.hansen, LKML, X86 ML

On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > quite in the form you imagined. The idea that we've tossed around is
> > to restore FPU state on return to user mode. Roughly, we'd introduce
> > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> > one has quite thought it through, but I think it should be outside the
> > loop.) We'd update all the FPU accessors to understand the flag.
>
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

I was thinking about this myself. Did anyone try to hack something in
the meantime? I might want to look into this, too :)

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-07-11 16:28     ` Sebastian Andrzej Siewior
@ 2018-07-11 20:10       ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-07-11 20:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Jason A. Donenfeld
  Cc: Andrew Lutomirski, dave.hansen, LKML, X86 ML

[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]

On Wed, 2018-07-11 at 18:28 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > quite in the form you imagined. The idea that we've tossed around is
> > > to restore FPU state on return to user mode. Roughly, we'd introduce
> > > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> > > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> > > one has quite thought it through, but I think it should be outside the
> > > loop.) We'd update all the FPU accessors to understand the flag.
> >
> > Yes! This is exactly what I was thinking. Then those calls to begin()
> > and end() could be placed as close to the actual FPU usage as
> > possible.
>
> I was thinking about this myself. Did anyone try to hack something in
> the meantime? I might want to look into this, too :)

I have implemented this before, back before the big rewrite of the
syscall entry and exit code. It seemed to work fine.

--
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2018-07-11 20:10 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-15 13:11 Lazy FPU restoration / moving kernel_fpu_end() to context switch Jason A. Donenfeld
2018-06-15 16:25 ` Thomas Gleixner
2018-06-15 18:33   ` Brian Gerst
2018-06-15 19:34   ` Peter Zijlstra
2018-06-15 20:30     ` Jason A. Donenfeld
2018-06-18  9:44       ` Peter Zijlstra
2018-06-18 15:25         ` Jason A. Donenfeld
2018-06-15 18:31 ` Andy Lutomirski
2018-06-15 18:40   ` Rik van Riel
2018-06-15 18:41   ` Dave Hansen
2018-06-15 18:49     ` Andy Lutomirski
2018-06-15 18:48   ` Dave Hansen
2018-06-15 18:53     ` Andy Lutomirski
2018-06-15 20:27       ` Jason A. Donenfeld
2018-06-15 20:48         ` Jason A. Donenfeld
2018-06-19 11:43       ` David Laight
2018-06-19 13:08         ` Thomas Gleixner
2018-06-15 20:33   ` Jason A. Donenfeld
2018-06-15 20:42     ` Dave Hansen
2018-06-15 20:52       ` Andy Lutomirski
2018-06-15 20:56       ` Rik van Riel
2018-07-11 16:28     ` Sebastian Andrzej Siewior
2018-07-11 20:10       ` Rik van Riel