linux-sgx.vger.kernel.org archive mirror
* RFC: userspace exception fixups
@ 2018-11-01 17:53 Andy Lutomirski
  2018-11-01 17:53 ` Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-01 17:53 UTC (permalink / raw)
  To: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	Rich Felker, nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir,
	linux-sgx, Andy Shevchenko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov

Hi all-

The people working on SGX enablement are grappling with a somewhat
annoying issue: the x86 EENTER instruction is used from user code and
can, as part of its normal-ish operation, raise an exception.  It is
also highly likely to be used from a library, and signal handling in
libraries is unpleasant at best.

There's been some discussion of adding a vDSO entry point to wrap
EENTER and do something sensible with the exceptions, but I'm
wondering if a more general mechanism would be helpful.

The basic idea would be to allow libc, or maybe even any library, to
register a handler that gets a chance to act on an exception caused by
a user instruction before a signal is delivered.  As a straw-man
example for how this could work, there could be a new syscall:

long register_exception_handler(void (*handler)(int, siginfo_t *, void *));

If a handler is registered, then, if a synchronous exception happens
(page fault, etc), the kernel would set up an exception frame as usual
but, rather than checking for signal handlers, it would just call the
registered handler.  That handler is expected to either handle the
exception entirely on its own or to call one of two new syscalls to
ask for normal signal delivery or to ask to retry the faulting
instruction.

Alternatively, we could do something a lot more like the kernel's
internal fixups where there's a table in user memory that maps
potentially faulting instructions to landing pads that handle
exceptions.
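
A minimal sketch of what a lookup in such a user-memory table could look
like, assuming a table sorted by instruction address and searched by binary
search. The struct layout and the names user_fixup and find_landing_pad are
hypothetical; nothing in this thread defines an ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one potentially faulting instruction and the
 * landing pad that should run if it faults. */
struct user_fixup {
    uintptr_t insn;     /* address of the instruction that may fault */
    uintptr_t landing;  /* address to resume at when it does */
};

/*
 * Binary search over a table sorted by ascending insn address.
 * Returns the landing pad for fault_ip, or 0 when the table has no entry,
 * in which case ordinary signal delivery would proceed.
 */
uintptr_t find_landing_pad(const struct user_fixup *tab, size_t n,
                           uintptr_t fault_ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].insn == fault_ip)
            return tab[mid].landing;
        if (tab[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

An exact-match table like this covers (a) through (d), where the faulting
instruction is known in advance; it has nothing to say about (e) and (f),
which key on the target address instead.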

Do you think this would be useful?  Here are some use cases that I
think are valid:

(a) Enter an SGX enclave and handle errors.  There would be two
instructions that would need special handling: EENTER and ERESUME.

(b) Do some math and catch division by zero.  I think it would be a
bad idea to have user code call a function and say that it wants to
handle *any* division by zero, but having certain specified division
instructions have special handling seems entirely reasonable.

(c) Ditto for floating point errors.

(d) Try an instruction and see if it gets #UD.

(e) Run a bunch of code and handle page faults to a given address
range by faulting something in.  This is not like the others, in that
a handler wants to handle a range of target addresses, not
instructions.  And userfaultfd is plausibly a better solution anyway.

(f) Run NaCl-like sandboxed code where the code can cause page faults
to certain mapped-but-intentionally-not-present ranges and those need
to be handled.

On Windows, you can use SEH to do crazy things like running
known-buggy code and eating the page faults.  I don't think we want to
go there.

All of this makes me think that the right solution is to have a way to
register fault handlers for instructions to cover (a) - (d) and to
treat (e) and (f) as something else entirely if there's enough demand.

--Andy


* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
  2018-11-01 17:53 ` Andy Lutomirski
@ 2018-11-01 18:09 ` Florian Weimer
  2018-11-01 18:09   ` Florian Weimer
                     ` (2 more replies)
  2018-11-01 18:27 ` Rich Felker
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 163+ messages in thread
From: Florian Weimer @ 2018-11-01 18:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Jarkko Sakkinen, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Adhemerval Zanella, carlos

* Andy Lutomirski:

> The basic idea would be to allow libc, or maybe even any library, to
> register a handler that gets a chance to act on an exception caused by
> a user instruction before a signal is delivered.  As a straw-man
> example for how this could work, there could be a new syscall:
>
> long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
>
> If a handler is registered, then, if a synchronous exception happens
> (page fault, etc), the kernel would set up an exception frame as usual
> but, rather than checking for signal handlers, it would just call the
> registered handler.  That handler is expected to either handle the
> exception entirely on its own or to call one of two new syscalls to
> ask for normal signal delivery or to ask to retry the faulting
> instruction.

Would the exception handler be a per-thread resource?

If it is: Would the setup and teardown overhead be prohibitive for many
use cases (at least those that do not expect a fault)?

Something peripherally related to this interface: Wrappers for signal
handlers (and not just CPU exceptions).  Ideally, we want to maintain a
flag that indicates whether we are in a signal handler, and save and
restore errno around the installed handler.

> Alternatively, we could do something a lot more like the kernel's
> internal fixups where there's a table in user memory that maps
> potentially faulting instructions to landing pads that handle
> exceptions.

GCC already supports that on most Linux targets.  You can unwind from
synchronously invoked signal handlers if you compile with
-fnon-call-exceptions.

However, it's tough to set up a temporary signal handler to trigger such
unwinding because those aren't per-thread.

> On Windows, you can use SEH to do crazy things like running
> known-buggy code and eating the page faults.  I don't think we want to
> go there.

The original SEH was also a rich target for exploiting vulnerabilities.
That's something we really should avoid as well.

I wonder if it would be possible to tack this function onto rseq.

Thanks,
Florian


* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
  2018-11-01 17:53 ` Andy Lutomirski
  2018-11-01 18:09 ` Florian Weimer
@ 2018-11-01 18:27 ` Rich Felker
  2018-11-01 18:27   ` Rich Felker
  2018-11-01 18:33 ` Jann Horn
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 163+ messages in thread
From: Rich Felker @ 2018-11-01 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> Hi all-
> 
> The people working on SGX enablement are grappling with a somewhat
> annoying issue: the x86 EENTER instruction is used from user code and
> can, as part of its normal-ish operation, raise an exception.  It is
> also highly likely to be used from a library, and signal handling in
> libraries is unpleasant at best.
> 
> There's been some discussion of adding a vDSO entry point to wrap
> EENTER and do something sensible with the exceptions, but I'm
> wondering if a more general mechanism would be helpful.
> 
> The basic idea would be to allow libc, or maybe even any library, to
> register a handler that gets a chance to act on an exception caused by
> a user instruction before a signal is delivered.  As a straw-man
> example for how this could work, there could be a new syscall:
> 
> long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
> 
> If a handler is registered, then, if a synchronous exception happens
> (page fault, etc), the kernel would set up an exception frame as usual
> but, rather than checking for signal handlers, it would just call the
> registered handler.  That handler is expected to either handle the
> exception entirely on its own or to call one of two new syscalls to
> ask for normal signal delivery or to ask to retry the faulting
> instruction.
> 
> Alternatively, we could do something a lot more like the kernel's
> internal fixups where there's a table in user memory that maps
> potentially faulting instructions to landing pads that handle
> exceptions.

This strikes me as just an "extra layer of signal handlers"; rather
than replacing global state with something that's library-safe, it's
just making new global state that has to have a singleton owner
managing it. If these handlers were thread-local (vs process-global
signal disposition) then you could just register/unregister them on
entry/exit to the library code that needs them, but that has
nontrivial execution time cost.

Moreover, thread-local signal handlers can already be done really
nicely if you don't care about having a global handler (which doesn't
really make sense for synchronous signals). I have fairly canonical
draft code I wrote to demonstrate this a while back which I can share
if there's interest.
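
For illustration only, and not the draft code mentioned above: one way
thread-local handling can be layered on the existing API is a single
process-wide sigaction() handler that forwards to a per-thread function
pointer. The names tl_handler, tl_dispatch, tl_install and tl_demo are
hypothetical. The sketch assumes that reading a _Thread_local variable from
a signal handler is safe, which POSIX does not guarantee, and it uses
raise(SIGFPE) as a portable stand-in for a real synchronous fault.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

typedef void (*tl_handler_t)(int, siginfo_t *, void *);

/* Per-thread handler slot and hit counter (hypothetical names). */
static _Thread_local tl_handler_t volatile tl_handler;
static _Thread_local volatile sig_atomic_t tl_hits;

/* The one process-wide handler: forwards to the current thread's slot. */
static void tl_dispatch(int sig, siginfo_t *si, void *uc)
{
    tl_handler_t h = tl_handler;

    if (h)
        h(sig, si, uc);
    /* With no thread-local handler this sketch simply returns; a fuller
     * version would restore SIG_DFL and re-raise instead. */
}

/* Install the dispatcher for sig; returns 0 on success, -1 on error. */
int tl_install(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = tl_dispatch;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Example per-thread handler: just counts invocations. */
static void count_hit(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)si; (void)uc;
    tl_hits++;
}

/* Raises SIGFPE once with a thread handler set and once without;
 * returns how many times the thread handler ran, or -1 on error. */
int tl_demo(void)
{
    tl_hits = 0;
    if (tl_install(SIGFPE) != 0)
        return -1;
    tl_handler = count_hit;
    raise(SIGFPE);
    tl_handler = NULL;
    raise(SIGFPE);
    return (int)tl_hits;
}
```

The dispatcher still owns the process-wide disposition for the signal, so
this only works if nothing else in the process wants that role.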

One possible advantage of your approach is that it could distinguish
actual synchronous signals from ones sent by kill/sigqueue/etc. This
matters in contexts where the application wants the signal blocked or
ignored. For example if you temporarily set a handler for SIGILL or
SIGSEGV, then unblock it and try to do something that might generate
the signal, you risk consuming an unrelated pending signal sent by
kill/sigqueue/etc. As far as I know there is no way to do this
"transparently". It came up as an issue for why libc init code cannot
do this kind of probing for instruction availability at startup (or
any time).

> Do you think this would be useful?  Here are some use cases that I
> think are valid:
> 
> (a) Enter an SGX enclave and handle errors.  There would be two
> instructions that would need special handling: EENTER and ERESUME.

I'm not familiar with SGX but the vdso approach sounds like a better
abstraction.

> (b) Do some math and catch division by zero.  I think it would be a
> bad idea to have user code call a function and say that it wants to
> handle *any* division by zero, but having certain specified division
> instructions have special handling seems entirely reasonable.

I don't think this is useful. If you really need a division that needs
to survive invalid operands, a simple check before the div
(100%-predictable branch in non-erroneous usage) is dirt cheap.
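
A minimal sketch of such a pre-check for signed integer division. The
helper name checked_div is hypothetical. Besides a zero divisor it also
rejects INT_MIN / -1, which overflows and raises the same #DE exception on
x86.

```c
#include <limits.h>

/*
 * Stores num / den in *quot and returns 0, or returns -1 without dividing
 * when the operands would trap: a zero divisor, or INT_MIN / -1.
 */
int checked_div(int num, int den, int *quot)
{
    if (den == 0 || (num == INT_MIN && den == -1))
        return -1;
    *quot = num / den;
    return 0;
}
```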

> (c) Ditto for floating point errors.

Signaling floating point exceptions (rather than sticky flags) are
something of an oddity that's never enabled by default, not supported
on all platforms, and largely (IMO) useless. Generating code that can
support them can be moderately costly too.

> (d) Try an instruction and see if it gets #UD.

In general this seems fairly useful.

> (e) Run a bunch of code and handle page faults to a given address
> range by faulting something in.  This is not like the others, in that
> a handler wants to handle a range of target addresses, not
> instructions.  And userfaultfd is plausibly a better solution anyway.

Agree re: userfaultfd.

> (f) Run NaCl-like sandboxed code where the code can cause page faults
> to certain mapped-but-intentionally-not-present ranges and those need
> to be handled.
> 
> On Windows, you can use SEH to do crazy things like running
> known-buggy code and eating the page faults.  I don't think we want to
> go there.

Agree, this is a huge rabbit hole of filth. Don't go there.

Rich


* Re: RFC: userspace exception fixups
  2018-11-01 18:09 ` Florian Weimer
  2018-11-01 18:09   ` Florian Weimer
@ 2018-11-01 18:30   ` Rich Felker
  2018-11-01 18:30     ` Rich Felker
  2018-11-01 19:00   ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Rich Felker @ 2018-11-01 18:30 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Jethro Beekman, Jarkko Sakkinen, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Adhemerval Zanella, carlos

On Thu, Nov 01, 2018 at 07:09:17PM +0100, Florian Weimer wrote:
> * Andy Lutomirski:
> 
> > The basic idea would be to allow libc, or maybe even any library, to
> > register a handler that gets a chance to act on an exception caused by
> > a user instruction before a signal is delivered.  As a straw-man
> > example for how this could work, there could be a new syscall:
> >
> > long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
> >
> > If a handler is registered, then, if a synchronous exception happens
> > (page fault, etc), the kernel would set up an exception frame as usual
> > but, rather than checking for signal handlers, it would just call the
> > registered handler.  That handler is expected to either handle the
> > exception entirely on its own or to call one of two new syscalls to
> > ask for normal signal delivery or to ask to retry the faulting
> > instruction.
> 
> Would the exception handler be a per-thread resource?
> 
> If it is: Would the setup and teardown overhead be prohibitive for many
> > use cases (at least those that do not expect a fault)?
> 
> Something peripherally related to this interface: Wrappers for signal
> handlers (and not just CPU exceptions).  Ideally, we want to maintain a
> flag that indicates whether we are in a signal handler, and save and
> restore errno around the installed handler.

I think the right way to make it per-thread AND low-cost would be to
register not the handler, but the (per-thread) address of a
function-pointer object pointing to the handler. Then switching the
handler just requires a single volatile store to thread-local memory,
no syscall.
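
A rough sketch of that pattern, under loud assumptions: no such
registration interface exists, so fake_kernel_fault() below merely stands
in for the kernel reading the registered slot at fault time, and the
handler type is simplified to take a single int. All names are
hypothetical.

```c
#include <stddef.h>

/* Simplified, hypothetical handler type (the straw-man in this thread
 * takes int, siginfo_t *, void *). */
typedef void (*exc_fn)(int code);

/* Per-thread slot; its address is what would be registered once. */
static _Thread_local exc_fn volatile exc_slot;
static _Thread_local int last_code;

/* Switch this thread's handler: one volatile store, no syscall.
 * Returns the previous handler so the caller can restore it. */
exc_fn exc_swap(exc_fn h)
{
    exc_fn old = exc_slot;

    exc_slot = h;
    return old;
}

/* Stand-in for the kernel side: consult the slot when a fault occurs.
 * Returns 1 if a handler ran, 0 if normal delivery would follow. */
int fake_kernel_fault(int code)
{
    exc_fn h = exc_slot;

    if (!h)
        return 0;
    h(code);
    return 1;
}

/* Example handler: remember what was reported. */
void record_code(int code)
{
    last_code = code;
}

/* Library entry point: install on entry, restore on exit. */
int lib_call(int code)
{
    exc_fn saved = exc_swap(record_code);
    int seen = fake_kernel_fault(code);

    exc_swap(saved);
    return seen ? last_code : -1;
}
```

Outside lib_call() the slot holds whatever the caller had installed, so
nested users compose as long as each restores the value it found.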

Rich


* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
                   ` (2 preceding siblings ...)
  2018-11-01 18:27 ` Rich Felker
@ 2018-11-01 18:33 ` Jann Horn
  2018-11-01 18:33   ` Jann Horn
  2018-11-01 18:52   ` Rich Felker
  2018-11-01 19:06 ` Linus Torvalds
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 163+ messages in thread
From: Jann Horn @ 2018-11-01 18:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, sean.j.christopherson, jethro, jarkko.sakkinen,
	Florian Weimer, Linux API, Linus Torvalds,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, dalias, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Thu, Nov 1, 2018 at 6:53 PM Andy Lutomirski <luto@kernel.org> wrote:
> The people working on SGX enablement are grappling with a somewhat
> annoying issue: the x86 EENTER instruction is used from user code and
> can, as part of its normal-ish operation, raise an exception.  It is
> also highly likely to be used from a library, and signal handling in
> libraries is unpleasant at best.
>
> There's been some discussion of adding a vDSO entry point to wrap
> EENTER and do something sensible with the exceptions,

This sounds reasonable to me.

> but I'm
> wondering if a more general mechanism would be helpful.
>
> The basic idea would be to allow libc, or maybe even any library, to
> register a handler that gets a chance to act on an exception caused by
> a user instruction before a signal is delivered.  As a straw-man
> example for how this could work, there could be a new syscall:
>
> long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
>
> If a handler is registered, then, if a synchronous exception happens
> (page fault, etc), the kernel would set up an exception frame as usual
> but, rather than checking for signal handlers, it would just call the
> registered handler.

> That handler is expected to either handle the
> exception entirely on its own or to call one of two new syscalls to
> ask for normal signal delivery

If you do it this way, these exception handlers would have to chain,
with an API convention that you're obligated to always ask for
resumption of signal delivery if you don't recognize the address,
right? Kind of like a notifier chain. (Except that, unless this is
implemented in the vDSO, each notifier invocation would cross the
kernel-user boundary twice.)
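
A sketch of that chaining convention, assuming each handler claims a fault
only when the faulting address lies in a range it owns and passes
otherwise. The enum, the function names and the address ranges are all
made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum exc_verdict { EXC_PASS = 0, EXC_HANDLED = 1 };

typedef enum exc_verdict (*exc_notifier_t)(uintptr_t fault_ip);

/*
 * Walk the chain in order. Returns EXC_HANDLED as soon as one handler
 * claims the fault; EXC_PASS means nobody recognized the address and
 * normal signal delivery should resume.
 */
enum exc_verdict run_chain(const exc_notifier_t *chain, size_t n,
                           uintptr_t fault_ip)
{
    for (size_t i = 0; i < n; i++)
        if (chain[i](fault_ip) == EXC_HANDLED)
            return EXC_HANDLED;
    return EXC_PASS;
}

/* Two example libraries, each owning a made-up code range. */
enum exc_verdict lib_a_notifier(uintptr_t ip)
{
    return (ip >= 0x1000 && ip < 0x2000) ? EXC_HANDLED : EXC_PASS;
}

enum exc_verdict lib_b_notifier(uintptr_t ip)
{
    return (ip >= 0x5000 && ip < 0x6000) ? EXC_HANDLED : EXC_PASS;
}
```

A handler that forgets to pass on unknown addresses would swallow faults
that belong to someone else, which is the obligation described above.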

> or to ask to retry the faulting instruction.

Why would that have to be a syscall? For signal handlers registered
with SA_NODEFER, you can basically leave the signal handler with a
longjmp, right?
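
A sketch of that, assuming a single-threaded caller. The names
try_catch_signal, raise_sigill and do_nothing are hypothetical, and
raise(SIGILL) is only a portable stand-in for executing an instruction
that may get #UD.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static jmp_buf probe_env;

/* Leave the handler by jumping back into try_catch_signal(). */
static void probe_handler(int sig)
{
    (void)sig;
    longjmp(probe_env, 1);
}

/*
 * Run fn() with a temporary handler for sig.
 * Returns 0 if fn() completed, 1 if it raised sig, -1 on setup failure.
 */
int try_catch_signal(int sig, void (*fn)(void))
{
    struct sigaction sa, old;
    volatile int faulted = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sa.sa_flags = SA_NODEFER;   /* sig is not blocked while handling it */
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, &old) != 0)
        return -1;

    if (setjmp(probe_env) == 0)
        fn();
    else
        faulted = 1;

    sigaction(sig, &old, NULL); /* put the previous disposition back */
    return faulted;
}

/* Stand-in for a probe instruction that faults. */
void raise_sigill(void)
{
    raise(SIGILL);
}

/* Stand-in for a probe instruction that succeeds. */
void do_nothing(void)
{
}
```

Because SA_NODEFER keeps the signal unblocked inside the handler, jumping
out leaves the signal mask as it was and the probe can be repeated. The
handler is still process-wide while installed.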

> Alternatively, we could do something a lot more like the kernel's
> internal fixups where there's a table in user memory that maps
> potentially faulting instructions to landing pads that handle
> exceptions.

I like this direction more, although I'm not sure whether the table
the kernel sees should be at instruction-level granularity. Perhaps
you could associate an exception handler with a VMA? Any instruction
that faults in the VMA triggers the fault handler?

> Do you think this would be useful?  Here are some use cases that I
> think are valid:
>
> (a) Enter an SGX enclave and handle errors.  There would be two
> instructions that would need special handling: EENTER and ERESUME.
>
> (b) Do some math and catch division by zero.  I think it would be a
> bad idea to have user code call a function and say that it wants to
> handle *any* division by zero, but having certain specified division
> instructions have special handling seems entirely reasonable.
>
> (c) Ditto for floating point errors.
>
> (d) Try an instruction and see if it gets #UD.
>
> (e) Run a bunch of code and handle page faults to a given address
> range by faulting something in.  This is not like the others, in that
> a handler wants to handle a range of target addresses, not
> instructions.  And userfaultfd is plausibly a better solution anyway.
>
> (f) Run NaCl-like sandboxed code where the code can cause page faults
> to certain mapped-but-intentionally-not-present ranges and those need
> to be handled.
>
> On Windows, you can use SEH to do crazy things like running
> known-buggy code and eating the page faults.  I don't think we want to
> go there.
>
> All of this makes me think that the right solution is to have a way to
> register fault handlers for instructions to cover (a) - (d) and to
> treat (e) and (f) as something else entirely if there's enough demand.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 18:33 ` Jann Horn
  2018-11-01 18:33   ` Jann Horn
@ 2018-11-01 18:52   ` Rich Felker
  2018-11-01 18:52     ` Rich Felker
  2018-11-01 19:10     ` Linus Torvalds
  1 sibling, 2 replies; 163+ messages in thread
From: Rich Felker @ 2018-11-01 18:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Dave Hansen, sean.j.christopherson, jethro,
	jarkko.sakkinen, Florian Weimer, Linux API, Linus Torvalds,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Thu, Nov 01, 2018 at 07:33:33PM +0100, Jann Horn wrote:
> > but I'm
> > wondering if a more general mechanism would be helpful.
> >
> > The basic idea would be to allow libc, or maybe even any library, to
> > register a handler that gets a chance to act on an exception caused by
> > a user instruction before a signal is delivered.  As a straw-man
> > example for how this could work, there could be a new syscall:
> >
> > long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
> >
> > If a handler is registered, then, if a synchronous exception happens
> > (page fault, etc), the kernel would set up an exception frame as usual
> > but, rather than checking for signal handlers, it would just call the
> > registered handler.
> 
> > That handler is expected to either handle the
> > exception entirely on its own or to call one of two new syscalls to
> > ask for normal signal delivery
> 
> If you do it this way, these exception handlers would have to chain,

There's no need to chain if the handler is specific to the context
where the fault happens. You just replace the handler with the one
relevant to the code you're about to run before you run it.

> > or to ask to retry the faulting instruction.
> 
> Why would that have to be a syscall? For signal handlers registered
> with SA_NODEFER, you can basically leave the signal handler with a
> longjmp, right?

longjmp needs a jmp_buf; it can't return to the faulting instruction.
Normally (though this is not defined) signal handlers return to the
faulting instruction if they return, but if returning from the
exception handler meant passing through to the signal disposition,
a different mechanism would be needed to signal that you want to retry
the faulting instruction.

Rich

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 18:09 ` Florian Weimer
  2018-11-01 18:09   ` Florian Weimer
  2018-11-01 18:30   ` Rich Felker
@ 2018-11-01 19:00   ` Jarkko Sakkinen
  2018-11-01 19:00     ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-01 19:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Jethro Beekman, Jarkko Sakkinen, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	Rich Felker, nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir,
	linux-sgx, Andy Shevchenko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Adhemerval Zanella, carlos

On Thu, 1 Nov 2018, Florian Weimer wrote:
> * Andy Lutomirski:
>
>> The basic idea would be to allow libc, or maybe even any library, to
>> register a handler that gets a chance to act on an exception caused by
>> a user instruction before a signal is delivered.  As a straw-man
>> example for how this could work, there could be a new syscall:
>>
>> long register_exception_handler(void (*handler)(int, siginfo_t *, void *));
>>
>> If a handler is registered, then, if a synchronous exception happens
>> (page fault, etc), the kernel would set up an exception frame as usual
>> but, rather than checking for signal handlers, it would just call the
>> registered handler.  That handler is expected to either handle the
>> exception entirely on its own or to call one of two new syscalls to
>> ask for normal signal delivery or to ask to retry the faulting
>> instruction.
>
> Would the exception handler be a per-thread resource?

For SGX purposes it would *need* to be a per-thread resource so that the
run-time (not just Intel's but any user space support code for SGX) is
able to act on the thread that caused this exception inside the enclave.

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
                   ` (3 preceding siblings ...)
  2018-11-01 18:33 ` Jann Horn
@ 2018-11-01 19:06 ` Linus Torvalds
  2018-11-01 19:06   ` Linus Torvalds
  2018-11-02 22:07 ` Jarkko Sakkinen
  2018-11-18  7:15 ` Jarkko Sakkinen
  6 siblings, 1 reply; 163+ messages in thread
From: Linus Torvalds @ 2018-11-01 19:06 UTC (permalink / raw)
  To: luto
  Cc: dave.hansen, sean.j.christopherson, jethro, jarkko.sakkinen,
	fweimer, linux-api, Jann Horn, x86, linux-arch,
	Linux Kernel Mailing List, Peter Zijlstra, dalias, nhorman,
	npmccallum, serge.ayoun, shay.katz-zamir, linux-sgx,
	andriy.shevchenko, tglx, Ingo Molnar, bp

On Thu, Nov 1, 2018 at 10:53 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> There's been some discussion of adding a vDSO entry point to wrap
> EENTER and do something sensible with the exceptions,

I think that's likely the right thing to do, and would be similar to sysenter.

> The basic idea would be to allow libc, or maybe even any library, to
> register a handler that gets a chance to act on an exception caused by
> a user instruction before a signal is delivered.  As a straw-man
> example for how this could work, there could be a new syscall:
>
> long register_exception_handler(void (*handler)(int, siginfo_t *, void *));

I'm not a huge fan of signals, but the above is an abomination.

It has all the problems of signals _and_ then some.

And it in absolutely no way fixes the problem with libraries. In fact,
it arguably makes it much much worse, since now there's only one
single library that can register it.

Yes yes, maybe a library would then expose _another_ interface to
other libraries and act as some kind of dispatch point, but on the
whole the above is just crazy and fundamentally broken.

If you want to register an exception, you need to make it clear

 (a) which _thread_ the exception registration is valid for

 (b) which _range_ the exception registration is valid for

 (c) which _fault_ the exception registration is valid for (page
fault, div-by-zero, whatever)

 (d) which save area (aka stack) and exception handler point.

Note that (b) might be more than just an exception IP range. It might
well be interesting to register the exception by page fault address
(in addition to code range).

If you do something that does all of (a)-(d), and you allow some
limited number of exception registrations, then maybe. Because at that
point, you have something that is actually actively more powerful than
signal handling is.
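
As a rough sketch of what a registration covering (a)-(d) might carry,
assuming a made-up record layout: every name below (struct fixup_reg,
enum fixup_fault, fixup_reg_matches, fixup_lookup) is hypothetical, and
the thread does not define any such ABI.  The matching logic is what a
kernel-side lookup would plausibly have to check before diverting a
fault.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical fault classes; no encoding is defined in the thread. */
enum fixup_fault {
    FIXUP_PAGE_FAULT = 1 << 0,
    FIXUP_DIV_ZERO   = 1 << 1,
    FIXUP_UD         = 1 << 2,
};

struct fixup_reg {
    pid_t     tid;         /* (a) thread the registration is valid for  */
    uintptr_t ip_start;    /* (b) code range [ip_start, ip_end)         */
    uintptr_t ip_end;
    uintptr_t addr_start;  /* (b) optional fault-address range          */
    uintptr_t addr_end;    /*     [addr_start, addr_end); 0/0 = any     */
    unsigned  fault_mask;  /* (c) OR of enum fixup_fault values         */
    uintptr_t save_area;   /* (d) where register state would be saved   */
    uintptr_t handler;     /* (d) where execution would resume          */
};

/* True if this registration covers the described fault. */
bool fixup_reg_matches(const struct fixup_reg *r, pid_t tid,
                       unsigned fault, uintptr_t ip, uintptr_t fault_addr)
{
    if (r->tid != tid)
        return false;
    if (!(r->fault_mask & fault))
        return false;
    if (ip < r->ip_start || ip >= r->ip_end)
        return false;
    if (r->addr_start == 0 && r->addr_end == 0)
        return true;                    /* no address filter */
    return fault_addr >= r->addr_start && fault_addr < r->addr_end;
}

/* First matching registration out of a small fixed set, or NULL, in
 * which case normal signal delivery would proceed. */
const struct fixup_reg *fixup_lookup(const struct fixup_reg *regs, size_t n,
                                     pid_t tid, unsigned fault,
                                     uintptr_t ip, uintptr_t fault_addr)
{
    for (size_t i = 0; i < n; i++)
        if (fixup_reg_matches(&regs[i], tid, fault, ip, fault_addr))
            return &regs[i];
    return NULL;
}
```

Because each record names its own thread, range and fault class, two
libraries can hold registrations at the same time without one
overwriting the other.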

But your suggested "just register a broken form of signal handling for
a special case" is just wrong. Don't do it.

              Linus

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 18:52   ` Rich Felker
  2018-11-01 18:52     ` Rich Felker
@ 2018-11-01 19:10     ` Linus Torvalds
  2018-11-01 19:10       ` Linus Torvalds
  2018-11-01 19:31       ` Rich Felker
  1 sibling, 2 replies; 163+ messages in thread
From: Linus Torvalds @ 2018-11-01 19:10 UTC (permalink / raw)
  To: dalias
  Cc: Jann Horn, luto, dave.hansen, sean.j.christopherson, jethro,
	jarkko.sakkinen, fweimer, linux-api, x86, linux-arch,
	Linux Kernel Mailing List, Peter Zijlstra, nhorman, npmccallum,
	serge.ayoun, shay.katz-zamir, linux-sgx, andriy.shevchenko, tglx,
	Ingo Molnar, bp, carlos, adhemerval.zanella

On Thu, Nov 1, 2018 at 11:52 AM Rich Felker <dalias@libc.org> wrote:
>
> There's no need to chain if the handler is specific to the context
> where the fault happens. You just replace the handler with the one
> relevant to the code you're about to run before you run it.

That's much too expensive to do as a system call.

Maybe an rseq-like "register an area where exception information will
be found" and then you can just swap in a pointer there (and nest with
previous pointers).

But even that doesn't work. Maybe some library wants to capture page
faults because they write-protected some area and want to log writes
and then emulate them (or just enable them after logging - statistical
logging is a thing).

And then another library (or just nested code) wants to handle the
eenter fault, so it overwrites the page fault handler. What do you do
if you now get a page fault before you even do the eenter?

The whole "one global error handler" model is broken. It's broken even
if the "global" one is just per-thread. Don't do it.

Even signals didn't make *that* bad a mistake, and signals are horrible.

                        Linus

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 19:10     ` Linus Torvalds
  2018-11-01 19:10       ` Linus Torvalds
@ 2018-11-01 19:31       ` Rich Felker
  2018-11-01 19:31         ` Rich Felker
  2018-11-01 21:24         ` Linus Torvalds
  1 sibling, 2 replies; 163+ messages in thread
From: Rich Felker @ 2018-11-01 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, luto, dave.hansen, sean.j.christopherson, jethro,
	jarkko.sakkinen, fweimer, linux-api, x86, linux-arch,
	Linux Kernel Mailing List, Peter Zijlstra, nhorman, npmccallum,
	serge.ayoun, shay.katz-zamir, linux-sgx, andriy.shevchenko, tglx,
	Ingo Molnar, bp, carlos, adhemerval.zanella

On Thu, Nov 01, 2018 at 12:10:35PM -0700, Linus Torvalds wrote:
> On Thu, Nov 1, 2018 at 11:52 AM Rich Felker <dalias@libc.org> wrote:
> >
> > There's no need to chain if the handler is specific to the context
> > where the fault happens. You just replace the handler with the one
> > relevant to the code you're about to run before you run it.
> 
> That's much too expensive to do as a system call.

See my other emails in this thread. You would register the *address*
(in TLS) of a function pointer object pointing to the handler, rather
than the function address of the handler. Then switching handler is
just a single store in userspace, no syscalls involved.
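
A userspace-only sketch of that idea, under loud assumptions: there is
no such registration syscall, so register_fault_slot() merely records
the slot address, and simulate_fault() stands in for what the kernel
would do when a fault arrives.  All names are hypothetical.

```c
#include <stddef.h>

typedef void (*fault_handler_fn)(int fault, void *ctx);

/* The per-thread slot whose *address* would be registered once. */
static _Thread_local fault_handler_fn fault_slot;

/* Stand-in for kernel state: the slot address the thread registered. */
static _Thread_local fault_handler_fn *registered_slot;

/* Hypothetical one-time, per-thread registration. */
void register_fault_slot(void)
{
    registered_slot = &fault_slot;
}

/* Switching handlers is a single store; the previous value is returned
 * so that callers can nest and restore. */
fault_handler_fn fault_slot_swap(fault_handler_fn h)
{
    fault_handler_fn prev = fault_slot;

    fault_slot = h;
    return prev;
}

/* What the kernel side would do on a synchronous fault: read the slot
 * and call whatever is there.  Returns 0 if nothing is installed, in
 * which case normal signal delivery would happen instead. */
int simulate_fault(int fault, void *ctx)
{
    if (registered_slot == NULL || *registered_slot == NULL)
        return 0;
    (*registered_slot)(fault, ctx);
    return 1;
}

/* Example handler: counts faults through the int that ctx points to. */
void count_fault(int fault, void *ctx)
{
    (void)fault;
    ++*(int *)ctx;
}
```

A caller would do prev = fault_slot_swap(my_handler), run the code that
may fault, then fault_slot_swap(prev).  Nothing in the slot says which
instructions or fault types the handler is meant for, so a fault taken
while another component's handler is installed goes to the wrong place.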

Rich

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 19:31       ` Rich Felker
  2018-11-01 19:31         ` Rich Felker
@ 2018-11-01 21:24         ` Linus Torvalds
  2018-11-01 21:24           ` Linus Torvalds
  2018-11-01 23:22           ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Linus Torvalds @ 2018-11-01 21:24 UTC (permalink / raw)
  To: dalias
  Cc: Jann Horn, luto, dave.hansen, sean.j.christopherson, jethro,
	jarkko.sakkinen, fweimer, linux-api, x86, linux-arch,
	Linux Kernel Mailing List, Peter Zijlstra, nhorman, npmccallum,
	serge.ayoun, shay.katz-zamir, linux-sgx, andriy.shevchenko, tglx,
	Ingo Molnar, bp, carlos, adhemerval.zanella

On Thu, Nov 1, 2018 at 12:31 PM Rich Felker <dalias@libc.org> wrote:
>
> See my other emails in this thread. You would register the *address*
> (in TLS) of a function pointer object pointing to the handler, rather
> than the function address of the handler. Then switching handler is
> just a single store in userspace, no syscalls involved.

Yes.

And for just EENTER, maybe that's the right model.

If we want to generalize it to other thread-synchronous faults, it
needs way more information and a list of handlers, but if we limit the
thing to _only_ EENTER getting an SGX fault, then a single "this is
the fault handler" address is probably the right thing to do.

                     Linus

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 21:24         ` Linus Torvalds
  2018-11-01 21:24           ` Linus Torvalds
@ 2018-11-01 23:22           ` Andy Lutomirski
  2018-11-01 23:22             ` Andy Lutomirski
                               ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-01 23:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rich Felker, Jann Horn, Andrew Lutomirski, Dave Hansen,
	Christopherson, Sean J, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Thu, Nov 1, 2018 at 2:24 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 1, 2018 at 12:31 PM Rich Felker <dalias@libc.org> wrote:
> >
> > See my other emails in this thread. You would register the *address*
> > (in TLS) of a function pointer object pointing to the handler, rather
> > than the function address of the handler. Then switching handler is
> > just a single store in userspace, no syscalls involved.
>
> Yes.
>
> And for just EENTER, maybe that's the right model.
>
> If we want to generalize it to other thread-synchronous faults, it
> needs way more information and a list of handlers, but if we limit the
> thing to _only_ EENTER getting an SGX fault, then a single "this is
> the fault handler" address is probably the right thing to do.

It sounds like you're saying that the kernel should know, *before*
running any user fixup code, whether the fault in question is one that
wants a fixup.  Sounds reasonable.

I think it would be nice, but not absolutely necessary, if user code
didn't need to poke some value into TLS each time it ran a function
that had a fixup.  With the poke-into-TLS approach, it looks a lot
like rseq, and rseq doesn't nest very nicely.  I think we really want
this mechanism to Just Work.  So we could maybe have a syscall that
associates a list of fixups with a given range of text addresses.  We
might want the kernel to automatically zap the fixups when the text in
question is unmapped.
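
A rough sketch of such a table, loosely modeled on the kernel's internal
exception tables: a sorted array that maps a faulting instruction
address to a landing pad, tied to one range of text.  The structure
layout and the names fixup_entry, fixup_table, fixup_find and
fixup_zap_on_unmap are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* One potentially faulting instruction and its landing pad. */
struct fixup_entry {
    uintptr_t insn;
    uintptr_t landing;
};

/* Fixups associated with the text range [text_start, text_end). */
struct fixup_table {
    uintptr_t text_start;
    uintptr_t text_end;
    const struct fixup_entry *entries;  /* sorted by insn, ascending */
    size_t count;
    int live;                           /* cleared once text is unmapped */
};

/* Binary search for the faulting IP.  Returns the landing pad, or 0 if
 * there is no fixup and the fault should become a normal signal. */
uintptr_t fixup_find(const struct fixup_table *t, uintptr_t ip)
{
    size_t lo = 0, hi = t->count;

    if (!t->live || ip < t->text_start || ip >= t->text_end)
        return 0;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (t->entries[mid].insn == ip)
            return t->entries[mid].landing;
        if (t->entries[mid].insn < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Drop the fixups when [start, end) overlaps the covered text, as an
 * munmap() of the library's code would. */
void fixup_zap_on_unmap(struct fixup_table *t, uintptr_t start, uintptr_t end)
{
    if (start < t->text_end && end > t->text_start)
        t->live = 0;
}
```

With a table like this the decision is made from the faulting IP alone,
so user code does not have to store anything before each call and
nesting comes for free.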

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 23:22           ` Andy Lutomirski
  2018-11-01 23:22             ` Andy Lutomirski
@ 2018-11-02 16:30             ` Sean Christopherson
  2018-11-02 16:30               ` Sean Christopherson
                                 ` (2 more replies)
  2018-11-02 22:37             ` Jarkko Sakkinen
  2 siblings, 3 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Rich Felker, Jann Horn, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Thu, Nov 01, 2018 at 04:22:55PM -0700, Andy Lutomirski wrote:
> On Thu, Nov 1, 2018 at 2:24 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Nov 1, 2018 at 12:31 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > See my other emails in this thread. You would register the *address*
> > > (in TLS) of a function pointer object pointing to the handler, rather
> > > than the function address of the handler. Then switching handler is
> > > just a single store in userspace, no syscalls involved.
> >
> > Yes.
> >
> > And for just EENTER, maybe that's the right model.
> >
> > If we want to generalize it to other thread-synchronous faults, it
> > needs way more information and a list of handlers, but if we limit the
> > thing to _only_ EENTER getting an SGX fault, then a single "this is
> > the fault handler" address is probably the right thing to do.
> 
> It sounds like you're saying that the kernel should know, *before*
> running any user fixup code, whether the fault in question is one that
> wants a fixup.  Sounds reasonable.
> 
> I think it would be nice, but not absolutely necessary, if user code
> didn't need to poke some value into TLS each time it ran a function
> that had a fixup.  With the poke-into-TLS approach, it looks a lot
> like rseq, and rseq doesn't nest very nicely.  I think we really want
> this mechanism to Just Work.  So we could maybe have a syscall that
> associates a list of fixups with a given range of text addresses.  We
> might want the kernel to automatically zap the fixups when the text in
> question is unmapped.

If this is EENTER specific then nesting isn't an issue.  But I don't
see a simple way to restrict the mechanism to EENTER.

What if, rather than having userspace register an address for fixup, the
kernel instead unconditionally does fixup on the ENCLU opcode?  For
example, skip the instruction and put fault info into some combination
of RDX/RSI/RDI (they're cleared on asynchronous enclave exits).

The decode logic is straightforward since ENCLU doesn't have operands,
we'd just have to eat any ignored prefixes.  The intended convention
for EENTER is to have an ENCLU at the AEX target (to automatically do
ERESUME after INTR, etc...), so this would work regardless of whether
the fault happened on EENTER or in the enclave.  EENTER/ERESUME are
the only ENCLU functions that are allowed outside of an enclave so
there's no danger of accidentally crushing something else.
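
As a rough illustration of the decode step, the check could look
something like the C sketch below.  ENCLU is encoded as 0F 01 D7; the
set of prefixes treated as "ignored" here (segment overrides,
address-size override, REX) is an assumption for illustration and would
need to be checked against the SDM, and both function names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: prefixes that do not change the meaning of ENCLU.
 * The real set must be confirmed against the SDM. */
static bool is_skippable_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2e: case 0x36: case 0x3e: /* ES/CS/SS/DS override */
    case 0x64: case 0x65:                       /* FS/GS override */
    case 0x67:                                  /* address-size override */
        return true;
    default:
        return (b & 0xf0) == 0x40;              /* REX (64-bit mode) */
    }
}

/* Return true if the bytes at insn, after skipping the prefixes above,
 * are the three-byte ENCLU opcode.  This sketch does not enforce
 * prefix ordering rules or the 15-byte instruction length limit. */
static bool insn_is_enclu(const uint8_t *insn, size_t len)
{
    size_t i = 0;

    while (i < len && is_skippable_prefix(insn[i]))
        i++;

    return len - i >= 3 &&
           insn[i] == 0x0f && insn[i + 1] == 0x01 && insn[i + 2] == 0xd7;
}
```

A fault handler would copy the bytes at the faulting RIP from user
memory and only apply the fixup when a check like this succeeds.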

This way we wouldn't need a VDSO blob and we'd enforce the kernel's
ABI, e.g. a library that tried to use signal handling would go off the
rails when the kernel mucked with the registers.  We could even have
the SGX EPC fault handler return VM_FAULT_SIGBUS if the faulting
instruction isn't ENCLU, e.g. to further enforce that the AEX target
needs to be ENCLU.


Userspace would look something like this:

    mov tcs, %rbx               /* Thread Control Structure address */
    leaq async_exit(%rip), %rcx /* AEX target for EENTER/ERESUME */
    mov $SGX_EENTER, %rax       /* EENTER leaf */

async_exit:
    ENCLU

fault_handler:
    <handle fault>

enclave_exit:                   /* EEXIT target */
    <handle enclave request>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:30             ` Sean Christopherson
  2018-11-02 16:30               ` Sean Christopherson
@ 2018-11-02 16:37               ` Jethro Beekman
  2018-11-02 16:37                 ` Jethro Beekman
  2018-11-02 16:52                 ` Sean Christopherson
  2018-11-02 16:56               ` Dave Hansen
  2 siblings, 2 replies; 163+ messages in thread
From: Jethro Beekman @ 2018-11-02 16:37 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Linus Torvalds, Rich Felker, Jann Horn, Dave Hansen,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

[-- Attachment #1: Type: text/plain, Size: 298 bytes --]

On 2018-11-02 09:30, Sean Christopherson wrote:
> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> 
> ... to further enforce that the AEX target needs to be ENCLU.

Some SGX runtimes may want to use a different AEX target.

--
Jethro Beekman | Fortanix


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3990 bytes --]

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:37               ` Jethro Beekman
  2018-11-02 16:37                 ` Jethro Beekman
@ 2018-11-02 16:52                 ` Sean Christopherson
  2018-11-02 16:52                   ` Sean Christopherson
                                     ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 16:52 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML,
	linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun,
	Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> On 2018-11-02 09:30, Sean Christopherson wrote:
> >... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> >
> >... to further enforce that the AEX target needs to be ENCLU.
> 
> Some SGX runtimes may want to use a different AEX target.

To what end?  Userspace gets no indication as to why the AEX occurred.
And if exceptions are getting transferred to userspace the trampoline
would effectively be handling only INTR, NMI, #MC and EPC #PF.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:52                 ` Sean Christopherson
  2018-11-02 16:52                   ` Sean Christopherson
@ 2018-11-02 16:56                   ` Jethro Beekman
  2018-11-02 16:56                     ` Jethro Beekman
                                       ` (2 more replies)
  2018-11-02 22:42                   ` Jarkko Sakkinen
  2 siblings, 3 replies; 163+ messages in thread
From: Jethro Beekman @ 2018-11-02 16:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML,
	linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun,
	Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

[-- Attachment #1: Type: text/plain, Size: 916 bytes --]

On 2018-11-02 09:52, Sean Christopherson wrote:
> On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
>> On 2018-11-02 09:30, Sean Christopherson wrote:
>>> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
>>>
>>> ... to further enforce that the AEX target needs to be ENCLU.
>>
>> Some SGX runtimes may want to use a different AEX target.
> 
> To what end?  Userspace gets no indication as to why the AEX occurred.
> And if exceptions are getting transferred to userspace the trampoline
> would effectively be handling only INTR, NMI, #MC and EPC #PF.
> 

Various reasons...

Userspace may have established an exception handling convention with the 
enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of 
ERESUME.
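
A sketch of that choice in C, assuming the runtime has some way of
learning that the exit was an exception the enclave wants to handle and
that it tracks the current SSA index itself (both are assumptions;
leaf_after_aex() and its parameters are hypothetical).  The leaf numbers
are the architectural ENCLU leaves, and EENTER is only usable while
CSSA < NSSA:

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural ENCLU leaf numbers, loaded into EAX before ENCLU. */
enum enclu_leaf {
    ENCLU_EENTER  = 2,
    ENCLU_ERESUME = 3,
};

/* Pick the leaf to use after an asynchronous exit.
 * enclave_handles_exception: the runtime's convention says the enclave
 *   should run its own handler for this exit (hypothetical input).
 * cssa/nssa: current and total SSA frame counts as tracked by the
 *   runtime; EENTER needs a free frame, i.e. cssa < nssa. */
static enum enclu_leaf leaf_after_aex(bool enclave_handles_exception,
                                      uint32_t cssa, uint32_t nssa)
{
    if (enclave_handles_exception && cssa < nssa)
        return ENCLU_EENTER;   /* re-enter so the enclave can fix up */
    return ENCLU_ERESUME;      /* plain resume of the interrupted code */
}
```

With NSSA == 1 this always yields ERESUME after an AEX, which is the
case a bare ENCLU at the AEX target covers.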

Userspace may want fine-grained control over enclave scheduling (e.g. 
SGX-Step).

--
Jethro Beekman | Fortanix


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3990 bytes --]

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:30             ` Sean Christopherson
  2018-11-02 16:30               ` Sean Christopherson
  2018-11-02 16:37               ` Jethro Beekman
@ 2018-11-02 16:56               ` Dave Hansen
  2018-11-02 16:56                 ` Dave Hansen
  2018-11-02 17:06                 ` Sean Christopherson
  2 siblings, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-02 16:56 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Linus Torvalds, Rich Felker, Jann Horn, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/2/18 9:30 AM, Sean Christopherson wrote:
> What if rather than having userspace register an address for fixup, the
> kernel instead unconditionally does fixup on the ENCLU opcode?

The problem is knowing what to do for the fixup.  If we have a simple
action to take that's universal, like backing up %RIP, or setting some
other register state, it's not bad.

Think of our prefetch fixups in the page fault code.  We do some
instruction decoding to look for them, and then largely return from the
fault and let the CPU retry.  We know *exactly* what to do for these.

But, if we need to call arbitrary code, or switch stacks, we need an
explicit ABI around it *anyway*, because the action to take isn't clear.

For an enclave exit that's because of a hardware interrupt or page
fault, life is good.  We really *could* just set %RIP to let ERESUME run
again, kinda like we do for (some) syscall situations.  But the
situations for which we can't just call ERESUME, like the out-calls, make
this more challenging.  I think we'd need some explicit new interfaces
for those.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:56                   ` Jethro Beekman
  2018-11-02 16:56                     ` Jethro Beekman
@ 2018-11-02 17:01                     ` Andy Lutomirski
  2018-11-02 17:01                       ` Andy Lutomirski
  2018-11-02 17:05                       ` Jethro Beekman
  2018-11-02 17:12                     ` Sean Christopherson
  2 siblings, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-02 17:01 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Christopherson, Sean J, Andrew Lutomirski, Linus Torvalds,
	Rich Felker, Jann Horn, Dave Hansen, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Fri, Nov 2, 2018 at 9:56 AM Jethro Beekman <jethro@fortanix.com> wrote:
>
> On 2018-11-02 09:52, Sean Christopherson wrote:
> > On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> >> On 2018-11-02 09:30, Sean Christopherson wrote:
> >>> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> >>>
> >>> ... to further enforce that the AEX target needs to be ENCLU.
> >>
> >> Some SGX runtimes may want to use a different AEX target.
> >
> > To what end?  Userspace gets no indication as to why the AEX occurred.
> > And if exceptions are getting transferred to userspace the trampoline
> > would effectively be handling only INTR, NMI, #MC and EPC #PF.
> >
>
> Various reasons...
>
> Userspace may have established an exception handling convention with the
> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
> ERESUME.
>

Ugh,

I sincerely hope that a future ISA extension lets the kernel return
directly back to enclave mode so that AEX events become entirely
invisible to user code.  It would be nice if user developers didn't
start depending on the rather odd AEX semantics right now.  But I
don't think the kernel can sanely do much about it.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:01                     ` Andy Lutomirski
  2018-11-02 17:01                       ` Andy Lutomirski
@ 2018-11-02 17:05                       ` Jethro Beekman
  2018-11-02 17:05                         ` Jethro Beekman
  2018-11-02 17:16                         ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Jethro Beekman @ 2018-11-02 17:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopherson, Sean J, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML,
	linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun,
	Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

On 2018-11-02 10:01, Andy Lutomirski wrote:
> On Fri, Nov 2, 2018 at 9:56 AM Jethro Beekman <jethro@fortanix.com> wrote:
>>
>> On 2018-11-02 09:52, Sean Christopherson wrote:
>>> On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
>>>> On 2018-11-02 09:30, Sean Christopherson wrote:
>>>>> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
>>>>>
>>>>> ... to further enforce that the AEX target needs to be ENCLU.
>>>>
>>>> Some SGX runtimes may want to use a different AEX target.
>>>
>>> To what end?  Userspace gets no indication as to why the AEX occurred.
>>> And if exceptions are getting transferred to userspace the trampoline
>>> would effectively be handling only INTR, NMI, #MC and EPC #PF.
>>>
>>
>> Various reasons...
>>
>> Userspace may have established an exception handling convention with the
>> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
>> ERESUME.
>>
> 
> Ugh,
> 
> I sincerely hope that a future ISA extension lets the kernel return
> directly back to enclave mode so that AEX events become entirely
> invisible to user code.

Can you explain how this would work for things like #BR/#DE/#UD that 
need to be fixed up by code running in the enclave before it can be resumed?

-- 
Jethro Beekman | Fortanix


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3990 bytes --]

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:56               ` Dave Hansen
  2018-11-02 16:56                 ` Dave Hansen
@ 2018-11-02 17:06                 ` Sean Christopherson
  2018-11-02 17:06                   ` Sean Christopherson
  2018-11-02 17:13                   ` Dave Hansen
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 17:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> > What if rather than having userspace register an address for fixup, the
> > kernel instead unconditionally does fixup on the ENCLU opcode?
> 
> The problem is knowing what to do for the fixup.  If we have a simple
> action to take that's universal, like backing up %RIP, or setting some
> other register state, it's not bad.

Isn't the EENTER/ERESUME behavior universal?  Or am I missing something?
 
> Think of our prefetch fixups in the page fault code.  We do some
> instruction decoding to look for them, and then largely return from the
> fault and let the CPU retry.  We know *exactly* what to do for these.
> 
> But, if we need to call arbitrary code, or switch stacks, we need an
> explicit ABI around it *anyway*, because the action to take isn't clear.
> 
> For an enclave exit that's because of a hardware interrupt or page
> fault, life is good.  We really *could* just set %RIP to let ERESUME run
> again, kinda like we do for (some) syscall situations.  But the
> situations for which we can't just call ERESUME, like the out-calls make
> this more challenging.  I think we'd need some explicit new interfaces
> for those.

I don't see how out-calls are a problem.  Once EEXIT completes we're
no longer in the enclave and EPCM faults are no longer a concern, i.e.
we don't need to do fixup.  Every other enclave exit is either an
exception or an interrupt.  And the only way to get back into the
enclave is via ENCLU (EENTER or ERESUME).
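
To make the "fixup on the ENCLU opcode" idea from the quoted 9:30 message
concrete, here is a minimal sketch of the opcode check alone.  ENCLU is
encoded as the bytes 0F 01 D7; the helper name insn_is_enclu() and the
assumption that the instruction bytes have already been copied into a
local buffer are illustrative only and do not come from any posted patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ENCLU is encoded as the three bytes 0F 01 D7. */
static const uint8_t enclu_opcode[3] = { 0x0f, 0x01, 0xd7 };

/*
 * Hypothetical helper: return true if the `len` bytes at `insn` begin
 * with an ENCLU instruction.  A real kernel fixup would first have to
 * fetch these bytes safely from the faulting user %RIP; that step is
 * assumed to have happened already.
 */
static bool insn_is_enclu(const uint8_t *insn, size_t len)
{
	if (len < sizeof(enclu_opcode))
		return false;

	for (size_t i = 0; i < sizeof(enclu_opcode); i++) {
		if (insn[i] != enclu_opcode[i])
			return false;
	}
	return true;
}
```

This is similar in spirit to the prefetch check mentioned above: decode a
few bytes at the faulting %RIP and only then decide whether a fixup
applies.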

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:56                   ` Jethro Beekman
  2018-11-02 16:56                     ` Jethro Beekman
  2018-11-02 17:01                     ` Andy Lutomirski
@ 2018-11-02 17:12                     ` Sean Christopherson
  2018-11-02 17:12                       ` Sean Christopherson
  2 siblings, 1 reply; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 17:12 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML,
	linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun,
	Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 04:56:36PM +0000, Jethro Beekman wrote:
> On 2018-11-02 09:52, Sean Christopherson wrote:
> >On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> >>On 2018-11-02 09:30, Sean Christopherson wrote:
> >>>... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> >>>
> >>>... to further enforce that the AEX target needs to be ENCLU.
> >>
> >>Some SGX runtimes may want to use a different AEX target.
> >
> >To what end?  Userspace gets no indication as to why the AEX occurred.
> >And if exceptions are getting transferred to userspace the trampoline
> >would effectively be handling only INTR, NMI, #MC and EPC #PF.
> >
> 
> Various reasons...
> 
> Userspace may have established an exception handling convention with the
> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
> ERESUME.

The ERESUME trampoline would only be invoked for exceptions that aren't
transferred to userspace.  On #BR, #UD, etc..., the kernel would fixup
%RIP to effectively point at @fault_handler.  Userspace can then do
whatever it wants to handle the fault, e.g. do EENTER if the fault needs
to be serviced by the enclave.
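
The split described above can be sketched as a small routing function.
This is only a model of the proposal, not kernel code: the names
route_aex(), ROUTE_ERESUME_TRAMPOLINE and ROUTE_FAULT_HANDLER are
hypothetical, and "pf_resolved" stands in for "the kernel serviced the
page fault itself (e.g. an EPC #PF)".  The vector numbers are the
architectural x86 ones.

```c
#include <stdbool.h>

/* Architectural x86 exception vector numbers used in this sketch. */
enum {
	VEC_DE  = 0,	/* divide error */
	VEC_NMI = 2,	/* non-maskable interrupt */
	VEC_BR  = 5,	/* bound range exceeded */
	VEC_UD  = 6,	/* invalid opcode */
	VEC_PF  = 14,	/* page fault */
	VEC_MC  = 18,	/* machine check */
};

/* Hypothetical outcomes: where userspace resumes after the event. */
enum aex_route {
	ROUTE_ERESUME_TRAMPOLINE,	/* kernel handled it; ENCLU at the AEX target runs */
	ROUTE_FAULT_HANDLER,		/* kernel points %RIP at the fault handler */
};

static enum aex_route route_aex(unsigned int vector, bool pf_resolved)
{
	/* External interrupts (vectors >= 32), NMI and #MC never reach userspace. */
	if (vector >= 32 || vector == VEC_NMI || vector == VEC_MC)
		return ROUTE_ERESUME_TRAMPOLINE;

	/* A page fault the kernel serviced itself is also invisible. */
	if (vector == VEC_PF && pf_resolved)
		return ROUTE_ERESUME_TRAMPOLINE;

	/* #BR, #UD, #DE, unresolved #PF, ...: userspace decides what to do. */
	return ROUTE_FAULT_HANDLER;
}
```

Under this model route_aex(VEC_BR, false) lands in the fault handler,
where userspace is free to EENTER, while a timer interrupt or a resolved
EPC #PF goes straight back through the trampoline.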

> Userspace may want fine-grained control over enclave scheduling (e.g.
> SGX-Step)

Uh, isn't SGX-Step an attack on SGX?  Preventing userspace from playing
games with enclave scheduling seems like a good thing.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:06                 ` Sean Christopherson
  2018-11-02 17:06                   ` Sean Christopherson
@ 2018-11-02 17:13                   ` Dave Hansen
  2018-11-02 17:13                     ` Dave Hansen
  2018-11-02 17:33                     ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-02 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/2/18 10:06 AM, Sean Christopherson wrote:
> On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
>> On 11/2/18 9:30 AM, Sean Christopherson wrote:
>>> What if rather than having userspace register an address for fixup, the
>>> kernel instead unconditionally does fixup on the ENCLU opcode?
>>
>> The problem is knowing what to do for the fixup.  If we have a simple
>> action to take that's universal, like backing up %RIP, or setting some
>> other register state, it's not bad.
> 
> Isn't the EENTER/ERESUME behavior universal?  Or am I missing something?

Could someone write down all the ways we get in and out of the enclave?

I think we always get in from userspace calling EENTER or ERESUME.  We
can't ever enter directly from the kernel, like via an IRET from what I
understand.

We get *out* from exceptions, hardware interrupts, or enclave-explicit
EEXITs.  Did I miss any?  Remind me where the hardware lands the control
flow in each of those exit cases.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:05                       ` Jethro Beekman
  2018-11-02 17:05                         ` Jethro Beekman
@ 2018-11-02 17:16                         ` Andy Lutomirski
  2018-11-02 17:16                           ` Andy Lutomirski
  2018-11-02 17:32                           ` Rich Felker
  1 sibling, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-02 17:16 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andrew Lutomirski, Christopherson, Sean J, Linus Torvalds,
	Rich Felker, Jann Horn, Dave Hansen, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Fri, Nov 2, 2018 at 10:05 AM Jethro Beekman <jethro@fortanix.com> wrote:
>
> On 2018-11-02 10:01, Andy Lutomirski wrote:
> > On Fri, Nov 2, 2018 at 9:56 AM Jethro Beekman <jethro@fortanix.com> wrote:
> >>
> >> On 2018-11-02 09:52, Sean Christopherson wrote:
> >>> On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> >>>> On 2018-11-02 09:30, Sean Christopherson wrote:
> >>>>> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> >>>>>
> >>>>> ... to further enforce that the AEX target needs to be ENCLU.
> >>>>
> >>>> Some SGX runtimes may want to use a different AEX target.
> >>>
> >>> To what end?  Userspace gets no indication as to why the AEX occurred.
> >>> And if exceptions are getting transferred to userspace the trampoline
> >>> would effectively be handling only INTR, NMI, #MC and EPC #PF.
> >>>
> >>
> >> Various reasons...
> >>
> >> Userspace may have established an exception handling convention with the
> >> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
> >> ERESUME.
> >>
> >
> > Ugh,
> >
> > I sincerely hope that a future ISA extension lets the kernel return
> > directly back to enclave mode so that AEX events become entirely
> > invisible to user code.
>
> Can you explain how this would work for things like #BR/#DE/#UD that
> need to be fixed up by code running in the enclave before it can be resumed?
>

Sure.  A better enclave entry function would complete in one of two ways:

1. The enclave exited normally.  Some register output would indicate this.

2. The enclave exited due to an exception or interrupt.  The kernel
would be entered directly and notified of what happened.  The kernel
would fix it up if needed (#PF), handle an interrupt (for an enclave
exit due to an interrupt) and reenter the enclave.  If the error
is not kernel-fixable, it would return back to userspace with some
explanation of what happened.  Kind of like normal user code.

Alternatively, the CPU could directly distinguish between exceptions
that need the enclave's attention (#BR) and those that don't.

The fact that user code is involved in resuming an enclave when a
hardware interrupt occurs is silly IMO.
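
The two completion paths above can be modelled as a small decision
function.  Everything here is hypothetical, since no such ISA extension
exists: the types enclave_event and entry_result and the function
handle_enclave_event() are made-up names used only to show the control
flow being described.

```c
#include <stdbool.h>

/* Why the (hypothetical) enclave entry stopped running. */
enum enclave_exit_kind {
	EXIT_NORMAL,		/* case 1: the enclave exited on its own */
	EXIT_INTERRUPT,		/* case 2: hardware interrupt */
	EXIT_EXCEPTION,		/* case 2: exception such as #PF or #BR */
};

struct enclave_event {
	enum enclave_exit_kind kind;
	bool kernel_fixable;	/* only meaningful for EXIT_EXCEPTION */
};

/* What happens next under the scheme sketched in the message above. */
enum entry_result {
	ENTRY_DONE,		/* return to the caller; a register says "normal exit" */
	ENTRY_REENTER,		/* kernel handled it and re-enters the enclave */
	ENTRY_REPORT_TO_USER,	/* return to userspace with an explanation */
};

static enum entry_result handle_enclave_event(const struct enclave_event *ev)
{
	switch (ev->kind) {
	case EXIT_NORMAL:
		return ENTRY_DONE;
	case EXIT_INTERRUPT:
		/* The kernel services the interrupt; user code never sees it. */
		return ENTRY_REENTER;
	case EXIT_EXCEPTION:
		return ev->kernel_fixable ? ENTRY_REENTER : ENTRY_REPORT_TO_USER;
	}
	return ENTRY_REPORT_TO_USER;
}
```

The point of the model is that user code appears in only two of the
outcomes: a normal exit and an error the kernel cannot fix.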

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:16                         ` Andy Lutomirski
  2018-11-02 17:16                           ` Andy Lutomirski
@ 2018-11-02 17:32                           ` Rich Felker
  2018-11-02 17:32                             ` Rich Felker
  1 sibling, 1 reply; 163+ messages in thread
From: Rich Felker @ 2018-11-02 17:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jethro Beekman, Christopherson, Sean J, Linus Torvalds,
	Jann Horn, Dave Hansen, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 10:16:02AM -0700, Andy Lutomirski wrote:
> On Fri, Nov 2, 2018 at 10:05 AM Jethro Beekman <jethro@fortanix.com> wrote:
> >
> > On 2018-11-02 10:01, Andy Lutomirski wrote:
> > > On Fri, Nov 2, 2018 at 9:56 AM Jethro Beekman <jethro@fortanix.com> wrote:
> > >>
> > >> On 2018-11-02 09:52, Sean Christopherson wrote:
> > >>> On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> > >>>> On 2018-11-02 09:30, Sean Christopherson wrote:
> > >>>>> ... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> > >>>>>
> > >>>>> ... to further enforce that the AEX target needs to be ENCLU.
> > >>>>
> > >>>> Some SGX runtimes may want to use a different AEX target.
> > >>>
> > >>> To what end?  Userspace gets no indication as to why the AEX occurred.
> > >>> And if exceptions are getting transferred to userspace the trampoline
> > >>> would effectively be handling only INTR, NMI, #MC and EPC #PF.
> > >>>
> > >>
> > >> Various reasons...
> > >>
> > >> Userspace may have established an exception handling convention with the
> > >> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
> > >> ERESUME.
> > >>
> > >
> > > Ugh,
> > >
> > > I sincerely hope that a future ISA extension lets the kernel return
> > > directly back to enclave mode so that AEX events become entirely
> > > invisible to user code.
> >
> > Can you explain how this would work for things like #BR/#DE/#UD that
> > need to be fixed up by code running in the enclave before it can be resumed?
> >
> 
> Sure.  A better enclave entry function would complete in one of two ways:
> 
> 1. The enclave exited normally.  Some register output would indicate this.
> 
> 2. The enclave exited due to an exception or interrupt.  The kernel
> would be entered directly and notified of what happened.  The kernel
> would fix it up if needed (#PF), handle an interrupt (for an enclave
> exit due to an interrupt) and reenter the enclave.  If the error
> is not kernel-fixable, it would return back to userspace with some
> explanation of what happened.  Kind of like normal user code.
> 
> Alternatively, the CPU could directly distinguish between exceptions
> that need the enclave's attention (#BR) and those that don't.
> 
> The fact that user code is involved in resuming an enclave when a
> hardware interrupt occurs is silly IMO.

Agreed absolutely. If this is necessary, it seems like there should be
an agreed-upon protocol such that the kernel can make it happen via
returning to code in the vdso that performs the actual resume, so that
the application never sees it.

Rich

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:13                   ` Dave Hansen
  2018-11-02 17:13                     ` Dave Hansen
@ 2018-11-02 17:33                     ` Sean Christopherson
  2018-11-02 17:33                       ` Sean Christopherson
  2018-11-02 17:48                       ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 17:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Linus Torvalds, Rich Felker, Jann Horn,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 10:13:23AM -0700, Dave Hansen wrote:
> On 11/2/18 10:06 AM, Sean Christopherson wrote:
> > On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> >> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> >>> What if rather than having userspace register an address for fixup, the
> >>> kernel instead unconditionally does fixup on the ENCLU opcode?
> >>
> >> The problem is knowing what to do for the fixup.  If we have a simple
> >> action to take that's universal, like backing up %RIP, or setting some
> >> other register state, it's not bad.
> > 
> > > Isn't the EENTER/ERESUME behavior universal?  Or am I missing something?
> 
> Could someone write down all the ways we get in and out of the enclave?
> 
> I think we always get in from userspace calling EENTER or ERESUME.  We
> can't ever enter directly from the kernel, like via an IRET from what I
> understand.

Correct, the only way to get into the enclave is EENTER or ERESUME.
My understanding is that even SMIs bounce through the AEX target
before transitioning to SMM.
 
> We get *out* from exceptions, hardware interrupts, or enclave-explicit
> EEXITs.  Did I miss any?  Remind me where the hardware lands the control
> flow in each of those exit cases.

And VMExits.  There are basically two cases: EEXIT and everything else.
EEXIT is a glorified indirect jump, e.g. %RBX holds the target %RIP.
Everything else is an Asynchronous Enclave Exit (AEX).  On an AEX, %RIP
is set to a value specified by EENTER/ERESUME, %RBP and %RSP are
restored to pre-enclave values and all other registers are loaded with
synthetic state.  The actual interrupt/exception/VMExit then triggers,
e.g. the %RIP on the stack for an exception is always the AEX target,
not the %RIP inside the enclave that actually faulted.
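
As a rough model of the two exit cases just described, the sketch below
computes the register state that the kernel and userspace get to see.
The struct and function names are hypothetical, and SYNTHETIC_VALUE is a
placeholder: the real synthetic register contents are defined by the
SDM and are not reproduced here.

```c
#include <stdint.h>

/* Placeholder for the synthetic state loaded on AEX (not the real values). */
#define SYNTHETIC_VALUE 0

/* The subset of user-visible registers this sketch tracks. */
struct user_regs {
	uint64_t rip, rsp, rbp;
	uint64_t rax, rbx, rcx;
};

/* What was in effect when EENTER/ERESUME was executed. */
struct enclave_entry_ctx {
	uint64_t aep;		/* AEX target specified at EENTER/ERESUME */
	uint64_t saved_rsp;	/* pre-enclave %RSP */
	uint64_t saved_rbp;	/* pre-enclave %RBP */
};

/*
 * Asynchronous Enclave Exit: %RIP becomes the AEX target, %RSP and %RBP
 * are restored, and everything else is synthetic.  The faulting %RIP
 * inside the enclave never appears here.
 */
static struct user_regs regs_after_aex(const struct enclave_entry_ctx *ctx)
{
	struct user_regs regs = {
		.rip = ctx->aep,
		.rsp = ctx->saved_rsp,
		.rbp = ctx->saved_rbp,
		.rax = SYNTHETIC_VALUE,
		.rbx = SYNTHETIC_VALUE,
		.rcx = SYNTHETIC_VALUE,
	};
	return regs;
}

/* EEXIT: effectively an indirect jump to the address held in %RBX. */
static uint64_t rip_after_eexit(uint64_t rbx)
{
	return rbx;
}
```

Note that regs_after_aex() takes no argument describing the fault: that
is the reason the %RIP pushed for an exception is always the AEX target.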

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:33                     ` Sean Christopherson
  2018-11-02 17:33                       ` Sean Christopherson
@ 2018-11-02 17:48                       ` Andy Lutomirski
  2018-11-02 17:48                         ` Andy Lutomirski
  2018-11-02 18:27                         ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-02 17:48 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Dave Hansen, Andrew Lutomirski, Linus Torvalds, Rich Felker,
	Jann Horn, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Fri, Nov 2, 2018 at 10:33 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Nov 02, 2018 at 10:13:23AM -0700, Dave Hansen wrote:
> > On 11/2/18 10:06 AM, Sean Christopherson wrote:
> > > On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> > >> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> > >>> What if rather than having userspace register an address for fixup, the
> > >>> kernel instead unconditionally does fixup on the ENCLU opcode?
> > >>
> > >> The problem is knowing what to do for the fixup.  If we have a simple
> > >> action to take that's universal, like backing up %RIP, or setting some
> > >> other register state, it's not bad.
> > >
> > > Isn't the EENTER/ERESUME behavior universal?  Or am I missing something?
> >
> > Could someone write down all the ways we get in and out of the enclave?
> >
> > I think we always get in from userspace calling EENTER or ERESUME.  We
> > can't ever enter directly from the kernel, like via an IRET from what I
> > understand.
>
> Correct, the only way to get into the enclave is EENTER or ERESUME.
> My understanding is that even SMIs bounce through the AEX target
> before transitioning to SMM.
>
> > We get *out* from exceptions, hardware interrupts, or enclave-explicit
> > EEXITs.  Did I miss any?  Remind me where the hardware lands the control
> > flow in each of those exit cases.
>
> And VMExits.  There are basically two cases: EEXIT and everything else.
> EEXIT is a glorified indirect jump, e.g. %RBX holds the target %RIP.
> Everything else is an Asynchronous Enclave Exit (AEX).  On an AEX, %RIP
> is set to a value specified by EENTER/ERESUME, %RBP and %RSP are
> restored to pre-enclave values and all other registers are loaded with
> synthetic state.  The actual interrupt/exception/VMExit then triggers,
> e.g. the %RIP on the stack for an exception is always the AEX target,
> not the %RIP inside the enclave that actually faulted.

So what exactly happens when an enclave accesses non-enclave memory
and takes a page fault, for example?  The SDM says that the #PF vector
and error code are stored in the SSA frame where the kernel can't see
them.  Is a real #PF then delivered?

I guess that, if the memory in question gets faulted in, then the
kernel resumes execution at the AEP address, which does ERESUME, and
the enclave resumes.  But if the access is bad, then the kernel
delivers a signal (or uses some other new mechanism), and then what
happens?  Is the enclave just considered dead?  Is user code supposed
to EENTER back into the enclave to tell it that it got an error?

This whole mechanism seems very complicated, and it's not clear
exactly what behavior user code wants.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 17:48                       ` Andy Lutomirski
  2018-11-02 17:48                         ` Andy Lutomirski
@ 2018-11-02 18:27                         ` Sean Christopherson
  2018-11-02 18:27                           ` Sean Christopherson
  2018-11-02 19:02                           ` Jann Horn
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Linus Torvalds, Rich Felker, Jann Horn, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> On Fri, Nov 2, 2018 at 10:33 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Fri, Nov 02, 2018 at 10:13:23AM -0700, Dave Hansen wrote:
> > > On 11/2/18 10:06 AM, Sean Christopherson wrote:
> > > > On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> > > >> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> > > >>> What if rather than having userspace register an address for fixup, the
> > > >>> kernel instead unconditionally does fixup on the ENCLU opcode?
> > > >>
> > > >> The problem is knowing what to do for the fixup.  If we have a simple
> > > >> action to take that's universal, like backing up %RIP, or setting some
> > > >> other register state, it's not bad.
> > > >
> > > > Isn't the EENTER/ERESUME behavior universal?  Or am I missing something?
> > >
> > > Could someone write down all the ways we get in and out of the enclave?
> > >
> > > I think we always get in from userspace calling EENTER or ERESUME.  We
> > > can't ever enter directly from the kernel, like via an IRET from what I
> > > understand.
> >
> > Correct, the only way to get into the enclave is EENTER or ERESUME.
> > My understanding is that even SMIs bounce through the AEX target
> > before transitioning to SMM.
> >
> > > We get *out* from exceptions, hardware interrupts, or enclave-explicit
> > > EEXITs.  Did I miss any?  Remind me where the hardware lands the control
> > > flow in each of those exit cases.
> >
> > And VMExits.  There are basically two cases: EEXIT and everything else.
> > EEXIT is a glorified indirect jump, e.g. %RBX holds the target %RIP.
> > Everything else is an Asynchronous Enclave Exit (AEX).  On an AEX, %RIP
> > is set to a value specified by EENTER/ERESUME, %RBP and %RSP are
> > restored to pre-enclave values and all other registers are loaded with
> > synthetic state.  The actual interrupt/exception/VMExit then triggers,
> > e.g. the %RIP on the stack for an exception is always the AEX target,
> > not the %RIP inside the enclave that actually faulted.
> 
> So what exactly happens when an enclave accesses non-enclave memory
> and takes a page fault, for example?  The SDM says that the #PF vector
> and error code are stored in the SSA frame where the kernel can't see
> them.  Is a real #PF then delivered?

Yes.  From the kernel's perspective a #PF occurred on the %RIP of the
AEX target.  This holds true for all AEX types, e.g. GUEST_RIP on VMExit
also points at the AEX target.  On an AEX, %RAX, %RBX and %RCX are set
to match the ERESUME parameters.  The idea is for userspace to have an
ENCLU at the AEX so that it automatically ERESUMEs the enclave after the
kernel handles the fault.  And the trampoline approach means the ucode
flows for exceptions, interrupts, VMExit, VMEnter, IRET, RSM, etc...
generally don't need to be SGX-aware.  The events themselves just need
to be redirected to the AEX target and then redo the event.
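
As a rough model of the AEX behaviour described above (an illustration
only, not kernel or SDK code), the synthetic register state can be
sketched in C.  The struct and function names are made up for this
example; only the ENCLU leaf numbers come from the SDM.  %RSP and %RBP
handling is left out to keep the sketch small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ENCLU leaf numbers as listed in the Intel SDM. */
enum { ENCLU_EENTER = 2, ENCLU_ERESUME = 3, ENCLU_EEXIT = 4 };

/* Hypothetical, simplified view of the registers discussed above. */
struct regs {
    uint64_t rip, rax, rbx, rcx;
};

/* Hypothetical record of what userspace passed to EENTER/ERESUME. */
struct enclave_entry {
    uint64_t tcs;   /* %RBX at entry: address of the TCS */
    uint64_t aep;   /* %RCX at entry: the AEX target     */
};

/*
 * Model of the state the outside world sees after an AEX: %RIP is the
 * AEX target and %RAX/%RBX/%RCX are preloaded so that an ENCLU placed
 * at the AEX target performs ERESUME on the same enclave thread.
 */
struct regs simulate_aex(const struct enclave_entry *e)
{
    struct regs r;

    assert(e != NULL);
    r.rip = e->aep;          /* exception frame reports the AEX target */
    r.rax = ENCLU_ERESUME;   /* leaf for the ENCLU at the AEX target   */
    r.rbx = e->tcs;          /* same TCS as the interrupted entry      */
    r.rcx = e->aep;          /* same AEX target for the next AEX       */
    return r;
}
```

In this model the faulting %RIP inside the enclave never appears
anywhere in struct regs, which is the point being made above: the
kernel only ever sees the AEX target.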

> I guess that, if the memory in question gets faulted in, then the
> kernel resumes execution at the AEP address, which does ERESUME, and
> the enclave resumes.  But if the access is bad, then the kernel
> delivers a signal (or uses some other new mechanism), and then what
> happens?  Is the enclave just considered dead?  Is user code supposed
> to EENTER back into the enclave to tell it that it got an error?

Completely depends on the enclave and its runtime.  A simple enclave
may never expect to encounter a bad access or #UD and so its runtime
would probably just kill it.  A test/development enclave might have
its runtime call back into the enclave to dump state on a fatal fault.

Complex runtimes, e.g. libraries that wrap unmodified applications,
will call back into the enclave so that the library's in-enclave fault
handler can decode what went wrong and take action accordingly, e.g.
request CPUID information if unmodified code tried to do CPUID.

> This whole mechanism seems very complicated, and it's not clear
> exactly what behavior user code wants.

No argument there.  That's why I like the approach of dumping the
exception to userspace without trying to do anything intelligent in
the kernel.  Userspace can then do whatever it wants AND we don't
have to worry about mucking with stacks.

One of the hiccups with the VDSO approach is that the enclave may
want to use the untrusted stack, i.e. the stack that has the VDSO's
stack frame.  For example, Intel's SDK uses the untrusted stack to
pass parameters for EEXIT, which means an AEX might occur with what
is effectively a bad stack from the VDSO's perspective.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 18:27                         ` Sean Christopherson
  2018-11-02 18:27                           ` Sean Christopherson
@ 2018-11-02 19:02                           ` Jann Horn
  2018-11-02 19:02                             ` Jann Horn
  2018-11-02 22:04                             ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Jann Horn @ 2018-11-02 19:02 UTC (permalink / raw)
  To: sean.j.christopherson
  Cc: Andy Lutomirski, Dave Hansen, Linus Torvalds, dalias,
	Dave Hansen, jethro, jarkko.sakkinen, Florian Weimer, Linux API,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
> On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > This whole mechanism seems very complicated, and it's not clear
> > exactly what behavior user code wants.
>
> No argument there.  That's why I like the approach of dumping the
> exception to userspace without trying to do anything intelligent in
> the kernel.  Userspace can then do whatever it wants AND we don't
> have to worry about mucking with stacks.
>
> One of the hiccups with the VDSO approach is that the enclave may
> want to use the untrusted stack, i.e. the stack that has the VDSO's
> stack frame.  For example, Intel's SDK uses the untrusted stack to
> pass parameters for EEXIT, which means an AEX might occur with what
> is effectively a bad stack from the VDSO's perspective.

What exactly does "uses the untrusted stack to pass parameters for
EEXIT" mean? I guess you're saying that the enclave is writing to
RSP+[0...some_positive_offset], and the written data needs to be
visible to the code outside the enclave afterwards?

In other words, the vDSO helper would have to not touch the stack
pointer (only using the 128-byte redzone to store spilled data, at
least across the enclave entry), and return by decrementing the stack
pointer by 8 immediately before returning (storing the return pointer
in the redzone)?

So you'd call the vDSO helper with a normal "call
vdso_helper_address", then the vDSO helper does "add rsp, 8", then the
vDSO helper does its magic, and then it returns with "sub rsp, 8" and
"ret"? That way you don't touch anything on the high-address side of
RSP while still avoiding running into CET problems. (I'm assuming that
you can use CET in a process that is hosting SGX enclaves?)
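
To make the sequence above concrete, here is a toy C simulation of it
(a sketch, not real vDSO code).  It models the stack as an array and
%RSP as a byte offset; struct cpu, do_call(), do_ret() and
vdso_helper_model() are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Toy machine state: a small stack plus %RSP and %RIP. */
struct cpu {
    uint64_t stack[STACK_WORDS]; /* one slot per 8 bytes           */
    uint64_t rsp;                /* byte offset into the toy stack */
    uint64_t rip;
};

/* Return the stack slot that holds the 8 bytes at byte offset addr. */
uint64_t *slot(struct cpu *c, uint64_t addr)
{
    assert(addr % 8 == 0 && addr / 8 < STACK_WORDS);
    return &c->stack[addr / 8];
}

/* "call target": push the return address, then jump. */
void do_call(struct cpu *c, uint64_t target, uint64_t ret_addr)
{
    c->rsp -= 8;
    *slot(c, c->rsp) = ret_addr;
    c->rip = target;
}

/* "ret": pop the return address into %RIP. */
void do_ret(struct cpu *c)
{
    c->rip = *slot(c, c->rsp);
    c->rsp += 8;
}

/*
 * The proposed helper body.  Returns the %RSP value that would be live
 * while the helper does its work (i.e. across the enclave entry).
 */
uint64_t vdso_helper_model(struct cpu *c)
{
    uint64_t rsp_during;

    c->rsp += 8;           /* add rsp, 8: return address is now at rsp-8 */
    rsp_during = c->rsp;   /* the helper's "magic" would happen here     */
    c->rsp -= 8;           /* sub rsp, 8                                 */
    do_ret(c);             /* ret                                        */
    return rsp_during;
}
```

In the model, %RSP during the helper's work equals the caller's
pre-call %RSP, and the return address survives only because nothing
writes to the 8 bytes just below %RSP in the meantime.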

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 19:02                           ` Jann Horn
  2018-11-02 19:02                             ` Jann Horn
@ 2018-11-02 22:04                             ` Sean Christopherson
  2018-11-02 22:04                               ` Sean Christopherson
  2018-11-02 23:27                               ` Jann Horn
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 22:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Dave Hansen, Linus Torvalds, dalias,
	Dave Hansen, jethro, jarkko.sakkinen, Florian Weimer, Linux API,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > This whole mechanism seems very complicated, and it's not clear
> > > exactly what behavior user code wants.
> >
> > No argument there.  That's why I like the approach of dumping the
> > exception to userspace without trying to do anything intelligent in
> > the kernel.  Userspace can then do whatever it wants AND we don't
> > have to worry about mucking with stacks.
> >
> > One of the hiccups with the VDSO approach is that the enclave may
> > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > pass parameters for EEXIT, which means an AEX might occur with what
> > is effectively a bad stack from the VDSO's perspective.
> 
> What exactly does "uses the untrusted stack to pass parameters for
> EEXIT" mean? I guess you're saying that the enclave is writing to
> RSP+[0...some_positive_offset], and the written data needs to be
> visible to the code outside the enclave afterwards?

As is, they actually do it the other way around, i.e. negative offsets
relative to the untrusted %RSP.  Going into the enclave there is no
reserved space on the stack.  The SDK uses EEXIT like a function call,
i.e. pushing parameters on the stack and making a call outside of the
enclave, hence the name out-call.  This allows the SDK to handle any
reasonable out-call without a priori knowledge of the application's
maximum out-call "size".


Rough outline of what happens in a non-faulting case.

1: Userspace executes EENTER
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER


2: Enclave does EEXIT to invoke out-call function

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at EEXIT


3: Userspace re-EENTERs enclave after handling EEXIT request

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at post-EEXIT EENTER


4: Enclave cleans up the stack

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP back at original EENTER
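
The four steps above can be mimicked with a small C model (purely
illustrative, not taken from the SDK).  The untrusted stack is an
array indexed in 8-byte words; struct ustack and the function names
are invented for this sketch, and the layout simply follows the
diagrams: func ID first, then param1..paramN at decreasing addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64

/* Toy untrusted stack; rsp is an index in 8-byte words, growing down. */
struct ustack {
    uint64_t mem[STACK_WORDS];
    size_t rsp;
};

/* Step 1: userspace enters the enclave with %RSP at rsp_at_eenter. */
void ustack_init(struct ustack *s, size_t rsp_at_eenter)
{
    assert(rsp_at_eenter <= STACK_WORDS);
    memset(s->mem, 0, sizeof(s->mem));
    s->rsp = rsp_at_eenter;
}

/* Enclave-side push at a negative offset from the untrusted %RSP. */
void enclave_push(struct ustack *s, uint64_t v)
{
    assert(s->rsp > 0);
    s->mem[--s->rsp] = v;
}

/*
 * Step 2: the enclave builds the out-call frame and "EEXITs".
 * Returns %RSP at EEXIT, which points at paramN.
 */
size_t enclave_outcall(struct ustack *s, uint64_t func_id,
                       const uint64_t *params, size_t n)
{
    enclave_push(s, func_id);
    for (size_t i = 0; i < n; i++)
        enclave_push(s, params[i]);
    return s->rsp;
}

/* Step 4: after re-entry (step 3), the enclave drops the frame again. */
void enclave_cleanup(struct ustack *s, size_t rsp_at_eenter)
{
    s->rsp = rsp_at_eenter;
}
```

Because the frame is sized at push time, the model needs no reserved
area below the EENTER-time %RSP, matching the "no a priori knowledge"
point above.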



In the faulting case, an AEX can occur while the enclave is pushing
parameters onto the stack for EEXIT.


1: Userspace executes EENTER
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER


2: AEX occurs during enclave prep for EEXIT

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX


3: Userspace re-EENTERs enclave to invoke enclave fault handler

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX
        | userspace stack  |
        -------------------- <-- %RSP at EENTER to fault handler


4: Enclave handles the fault, EEXITs back to userspace

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX
        | userspace stack  |
        -------------------- <-- %RSP at EEXIT from fault handler


5: Userspace pops its stack and ERESUMEs back to the enclave
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at ERESUME


6: Enclave finishes its EEXIT to invoke out-call function

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at EEXIT 
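
For the faulting case, the %RSP movement across the six steps can be
traced with a short C sketch (an illustration under assumptions, not
SDK code).  Offsets are in 8-byte words; the names are invented here,
and the model assumes the in-enclave fault handler leaves the
untrusted stack as it found it, as drawn in step 4.

```c
#include <assert.h>

/* One entry per step in the outline above. */
enum event {
    EV_EENTER,          /* 1: original EENTER                */
    EV_AEX,             /* 2: AEX while pushing parameters   */
    EV_EENTER_HANDLER,  /* 3: EENTER to the fault handler    */
    EV_EEXIT_HANDLER,   /* 4: EEXIT from the fault handler   */
    EV_ERESUME,         /* 5: ERESUME after userspace pops   */
    EV_EEXIT,           /* 6: final EEXIT for the out-call   */
    EV_COUNT
};

struct trace {
    long rsp[EV_COUNT]; /* untrusted %RSP at each step, in words */
};

/*
 * rsp0:              untrusted %RSP at the original EENTER
 * pushed_before_aex: words the enclave had pushed when the AEX hit
 * host_frame:        words userspace pushes before entering the handler
 * total_words:       func ID plus all parameters
 */
struct trace faulting_outcall(long rsp0, long pushed_before_aex,
                              long host_frame, long total_words)
{
    struct trace t;
    long rsp = rsp0;

    assert(pushed_before_aex >= 0 && pushed_before_aex <= total_words);
    assert(host_frame >= 0);

    t.rsp[EV_EENTER] = rsp;
    rsp -= pushed_before_aex;         /* partial out-call frame        */
    t.rsp[EV_AEX] = rsp;
    rsp -= host_frame;                /* userspace's own stack usage   */
    t.rsp[EV_EENTER_HANDLER] = rsp;
    t.rsp[EV_EEXIT_HANDLER] = rsp;    /* assumed: handler restores it  */
    rsp += host_frame;                /* userspace pops its frame      */
    t.rsp[EV_ERESUME] = rsp;
    rsp -= total_words - pushed_before_aex; /* remaining parameters    */
    t.rsp[EV_EEXIT] = rsp;
    return t;
}
```

The partial out-call frame sits between the original EENTER %RSP and
the AEX %RSP for the whole detour, so anything userspace pushes in
step 3 has to go below it.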
 
> In other words, the vDSO helper would have to not touch the stack
> pointer (only using the 128-byte redzone to store spilled data, at
> least across the enclave entry), and return by decrementing the stack
> pointer by 8 immediately before returning (storing the return pointer
> in the redzone)?
> 
> So you'd call the vDSO helper with a normal "call
> vdso_helper_address", then the vDSO helper does "add rsp, 8", then the
> vDSO helper does its magic, and then it returns with "sub rsp, 8" and
> "ret"? That way you don't touch anything on the high-address side of
> RSP while still avoiding running into CET problems. (I'm assuming that
> you can use CET in a process that is hosting SGX enclaves?)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 22:04                             ` Sean Christopherson
@ 2018-11-02 22:04                               ` Sean Christopherson
  2018-11-02 23:27                               ` Jann Horn
  1 sibling, 0 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-02 22:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Dave Hansen, Linus Torvalds, dalias,
	Dave Hansen, jethro, jarkko.sakkinen, Florian Weimer, Linux API,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > This whole mechanism seems very complicated, and it's not clear
> > > exactly what behavior user code wants.
> >
> > No argument there.  That's why I like the approach of dumping the
> > exception to userspace without trying to do anything intelligent in
> > the kernel.  Userspace can then do whatever it wants AND we don't
> > have to worry about mucking with stacks.
> >
> > One of the hiccups with the VDSO approach is that the enclave may
> > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > pass parameters for EEXIT, which means an AEX might occur with what
> > is effectively a bad stack from the VDSO's perspective.
> 
> What exactly does "uses the untrusted stack to pass parameters for
> EEXIT" mean? I guess you're saying that the enclave is writing to
> RSP+[0...some_positive_offset], and the written data needs to be
> visible to the code outside the enclave afterwards?

As is, they actually do it the other way around, i.e. negative offsets
relative to the untrusted %RSP.  Going into the enclave there is no
reserved space on the stack.  The SDK uses EEXIT like a function call,
i.e. pushing parameters on the stack and making an call outside of the
enclave, hence the name out-call.  This allows the SDK to handle any
reasonable out-call without a priori knowledge of the application's
maximum out-call "size".


Rough outline of what happens in a non-faulting case.

1: Userspace executes EENTER
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER


2: Enclave does EEXIT to invoke out-call function

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at EEXIT


3: Userspace re-EENTERs enclave after handling EEXIT request

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at post-EEXIT EENTER


4: Enclave cleans up the stack

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP back at original EENTER



In the faulting case, an AEX can occur while the enclave is pushing
parameters onto the stack for EEXIT.


1: Userspace executes EENTER
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER


2: AEX occurs during enclave prep for EEXIT

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX


3: Userspace re-EENTERs enclave to invoke enclave fault handler

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX
        | userspace stack  |
        -------------------- <-- %RSP at EENTER to fault handler


4: Enclave handles the fault, EEXITs back to userspace

        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at AEX
        | userspace stack  |
        -------------------- <-- %RSP at EEXIT from fault handler


5: Userspace pops its stack and ERESUMEs back to the enclave
        --------------------
        | userspace stack  | 
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              | 
        -------------------- <-- %RSP at ERESUME


6: Enclave finishes its EEXIT to invoke out-call function

        --------------------
        | userspace stack  |
        -------------------- <-- %RSP at original EENTER
        | out-call func ID |
        | param1           |
        | ...              |
        | paramN           |
        -------------------- <-- %RSP at EEXIT 
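
The faulting case above can be replayed as plain arithmetic on the
untrusted %RSP.  This is only a sketch under assumptions that are not in
the diagrams: every stack slot is 8 bytes, and the names ustack,
push_slots, pop_slots and faulting_outcall are made up for illustration.
Nothing here touches real SGX state.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT 8u  /* assumed size of one stack slot in bytes */

/* Hypothetical model of the untrusted stack pointer; no real SGX involved. */
struct ustack {
    uint64_t rsp;
};

static void push_slots(struct ustack *s, unsigned n) { s->rsp -= (uint64_t)SLOT * n; }
static void pop_slots(struct ustack *s, unsigned n)  { s->rsp += (uint64_t)SLOT * n; }

/*
 * Walk the faulting case.  The out-call needs 1 + nparams slots (func ID
 * plus parameters), the AEX hits after `pushed` of them are written, and
 * userspace uses `frame` slots of its own while invoking the enclave fault
 * handler.  Returns %RSP at the final EEXIT.
 */
static uint64_t faulting_outcall(uint64_t rsp_at_eenter, unsigned nparams,
                                 unsigned pushed, unsigned frame)
{
    struct ustack s = { rsp_at_eenter };
    unsigned total = 1 + nparams;
    uint64_t rsp_at_aex;

    assert(pushed <= total);

    push_slots(&s, pushed);          /* 2: enclave pushes, then AEX */
    rsp_at_aex = s.rsp;
    push_slots(&s, frame);           /* 3: userspace frame, EENTER to handler */
                                     /* 4: handler EEXITs at the same %RSP */
    pop_slots(&s, frame);            /* 5: userspace pops, ERESUME */
    assert(s.rsp == rsp_at_aex);     /* enclave resumes where it was interrupted */
    push_slots(&s, total - pushed);  /* 6: enclave finishes pushing, EEXIT */
    return s.rsp;
}
```

In this model the final %RSP is the same whether or not the AEX
happened, but only because userspace restored %RSP to its AEX value
before ERESUME.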
 
> In other words, the vDSO helper would have to not touch the stack
> pointer (only using the 128-byte redzone to store spilled data, at
> least across the enclave entry), and return by decrementing the stack
> pointer by 8 immediately before returning (storing the return pointer
> in the redzone)?
> 
> So you'd call the vDSO helper with a normal "call
> vdso_helper_address", then the vDSO helper does "add rsp, 8", then the
> vDSO helper does its magic, and then it returns with "sub rsp, 8" and
> "ret"? That way you don't touch anything on the high-address side of
> RSP while still avoiding running into CET problems. (I'm assuming that
> you can use CET in a process that is hosting SGX enclaves?)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
                   ` (4 preceding siblings ...)
  2018-11-01 19:06 ` Linus Torvalds
@ 2018-11-02 22:07 ` Jarkko Sakkinen
  2018-11-02 22:07   ` Jarkko Sakkinen
  2018-11-18  7:15 ` Jarkko Sakkinen
  6 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-02 22:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> If a handler is registered, then, if a synchronous exception happens
> (page fault, etc), the kernel would set up an exception frame as usual
> but, rather than checking for signal handlers, it would just call the
> registered handler.  That handler is expected to either handle the
> exception entirely on its own or to call one of two new syscalls to
> ask for normal signal delivery or to ask to retry the faulting
> instruction.

Why are the syscalls required? Couldn't the handler just have a return
value to indicate the appropriate action?
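
A rough sketch of what a return-value convention might look like.  The
enum fixup_action, the fixup_handler_t type, the fault_repaired flag and
the example policy are all hypothetical; nothing like this exists in the
kernel, and the handler arguments simply mirror the straw-man
register_exception_handler() prototype.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical return codes standing in for the two proposed syscalls. */
enum fixup_action {
    FIXUP_HANDLED,         /* handler dealt with the exception itself */
    FIXUP_DELIVER_SIGNAL,  /* fall back to normal signal delivery */
    FIXUP_RETRY            /* re-execute the faulting instruction */
};

typedef enum fixup_action (*fixup_handler_t)(int sig, siginfo_t *info, void *uctx);

/* Set by (hypothetical) library code once it has fixed the cause of a fault. */
static int fault_repaired;

/* Illustrative policy only: claim SIGSEGV, pass everything else through. */
static enum fixup_action example_handler(int sig, siginfo_t *info, void *uctx)
{
    (void)info;
    (void)uctx;
    if (sig != SIGSEGV)
        return FIXUP_DELIVER_SIGNAL;  /* not ours: let the signal path run */
    return fault_repaired ? FIXUP_RETRY : FIXUP_DELIVER_SIGNAL;
}
```

The kernel would still have to regain control after the handler returns
in order to act on the value, which is presumably what a return
trampoline or the proposed syscalls would provide.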

Another thing I'm wondering: what happens if a signal occurs inside
the exception handler?

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 23:22           ` Andy Lutomirski
  2018-11-01 23:22             ` Andy Lutomirski
  2018-11-02 16:30             ` Sean Christopherson
@ 2018-11-02 22:37             ` Jarkko Sakkinen
  2018-11-02 22:37               ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-02 22:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Rich Felker, Jann Horn, Dave Hansen,
	Christopherson, Sean J, Jethro Beekman, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Thu, Nov 01, 2018 at 04:22:55PM -0700, Andy Lutomirski wrote:
> On Thu, Nov 1, 2018 at 2:24 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Nov 1, 2018 at 12:31 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > See my other emails in this thread. You would register the *address*
> > > (in TLS) of a function pointer object pointing to the handler, rather
> > > than the function address of the handler. Then switching handler is
> > > just a single store in userspace, no syscalls involved.
> >
> > Yes.
> >
> > And for just EENTER, maybe that's the right model.
> >
> > If we want to generalize it to other thread-synchronous faults, it
> > needs way more information and a list of handlers, but if we limit the
> > thing to _only_ EENTER getting an SGX fault, then a single "this is
> > the fault handler" address is probably the right thing to do.
> 
> It sounds like you're saying that the kernel should know, *before*
> running any user fixup code, whether the fault in question is one that
> wants a fixup.  Sounds reasonable.
> 
> I think it would be nice, but not absolutely necessary, if user code
> didn't need to poke some value into TLS each time it ran a function
> that had a fixup.  With the poke-into-TLS approach, it looks a lot
> like rseq, and rseq doesn't nest very nicely.  I think we really want
> this mechanism to Just Work.  So we could maybe have a syscall that
> associates a list of fixups with a given range of text addresses.  We
> might want the kernel to automatically zap the fixups when the text in
> question is unmapped.

If we had a syscall to specify a list of fixups, that would do the
job. Right now essentially the only reason we require a vDSO is to
implement a single fixup for EENTER.

If this fixup stuff makes sense for other parts of the kernel,
introducing a vDSO for EENTER means essentially adding ABI to the kernel
that might possibly become legacy fast.
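
To make the "list of fixups for a range of text addresses" idea
concrete, here is a minimal sketch of the lookup such a table would
need, modelled loosely on the kernel's own exception tables.  The struct
user_fixup layout, find_fixup() and the addresses in example_table are
assumptions for illustration, not a proposed ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: a fault with IP in [start, end) resumes at `fixup`. */
struct user_fixup {
    uintptr_t start;
    uintptr_t end;
    uintptr_t fixup;
};

/* Return the fixup address covering `ip`, or 0 if no entry matches. */
static uintptr_t find_fixup(const struct user_fixup *tbl, size_t n, uintptr_t ip)
{
    for (size_t i = 0; i < n; i++) {
        if (ip >= tbl[i].start && ip < tbl[i].end)
            return tbl[i].fixup;
    }
    return 0;
}

/* Made-up addresses; the first entry covers a single 3-byte ENCLU. */
static const struct user_fixup example_table[] = {
    { 0x1000, 0x1003, 0x2000 },
    { 0x3000, 0x3010, 0x4000 },
};
```

A fault at an address outside every range returns 0, in which case
normal signal delivery would proceed.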

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 16:52                 ` Sean Christopherson
  2018-11-02 16:52                   ` Sean Christopherson
  2018-11-02 16:56                   ` Jethro Beekman
@ 2018-11-02 22:42                   ` Jarkko Sakkinen
  2018-11-02 22:42                     ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-02 22:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jethro Beekman, Andy Lutomirski, Linus Torvalds, Rich Felker,
	Jann Horn, Dave Hansen, Florian Weimer, Linux API, X86 ML,
	linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun,
	Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, Nov 02, 2018 at 09:52:04AM -0700, Sean Christopherson wrote:
> On Fri, Nov 02, 2018 at 04:37:10PM +0000, Jethro Beekman wrote:
> > On 2018-11-02 09:30, Sean Christopherson wrote:
> > >... The intended convention for EENTER is to have an ENCLU at the AEX target ...
> > >
> > >... to further enforce that the AEX target needs to be ENCLU.
> > 
> > Some SGX runtimes may want to use a different AEX target.
> 
> To what end?  Userspace gets no indication as to why the AEX occurred.
> And if exceptions are getting transferred to userspace the trampoline
> would effectively be handling only INTR, NMI, #MC and EPC #PF.

As I understand it, in some cases the run-time implementation needs to
run a handler implemented inside the enclave, i.e. the sequence would be:

1. #AEX
2. EENTER(in-enclave handler)
3. EEXIT(%rcx)
4. ERESUME
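
The sequence above can be written down as a tiny state machine.  This is
a sketch of the fault path only; the state and event names are invented
here, and normal (non-fault) EENTER/EEXIT flow is deliberately left out.

```c
/* Where the thread currently is, from the untrusted side's point of view. */
enum ctx { IN_ENCLAVE, AT_AEP, IN_HANDLER, HANDLER_DONE, INVALID };

/* The four events from the list above. */
enum ev { EV_AEX, EV_EENTER, EV_EEXIT, EV_ERESUME };

/* Return the next state, or INVALID for a transition the model rejects. */
static enum ctx step(enum ctx c, enum ev e)
{
    switch (c) {
    case IN_ENCLAVE:   /* 1. an AEX drops us at the AEX target */
        return e == EV_AEX ? AT_AEP : INVALID;
    case AT_AEP:       /* 2. EENTER the handler, or ERESUME right away */
        return e == EV_EENTER  ? IN_HANDLER :
               e == EV_ERESUME ? IN_ENCLAVE : INVALID;
    case IN_HANDLER:   /* 3. the in-enclave handler EEXITs */
        return e == EV_EEXIT ? HANDLER_DONE : INVALID;
    case HANDLER_DONE: /* 4. ERESUME the interrupted enclave code */
        return e == EV_ERESUME ? IN_ENCLAVE : INVALID;
    default:
        return INVALID;
    }
}
```

The direct AT_AEP to IN_ENCLAVE edge is the plain "ENCLU at the AEX
target" trampoline; the longer path is the in-enclave handler case.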

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 22:04                             ` Sean Christopherson
  2018-11-02 22:04                               ` Sean Christopherson
@ 2018-11-02 23:27                               ` Jann Horn
  2018-11-02 23:27                                 ` Jann Horn
  2018-11-02 23:32                                 ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Jann Horn @ 2018-11-02 23:27 UTC (permalink / raw)
  To: sean.j.christopherson
  Cc: Andy Lutomirski, Dave Hansen, Linus Torvalds, dalias,
	Dave Hansen, jethro, jarkko.sakkinen, Florian Weimer, Linux API,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Fri, Nov 2, 2018 at 11:04 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
> On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> > On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > > This whole mechanism seems very complicated, and it's not clear
> > > > exactly what behavior user code wants.
> > >
> > > No argument there.  That's why I like the approach of dumping the
> > > exception to userspace without trying to do anything intelligent in
> > > the kernel.  Userspace can then do whatever it wants AND we don't
> > > have to worry about mucking with stacks.
> > >
> > > One of the hiccups with the VDSO approach is that the enclave may
> > > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > > pass parameters for EEXIT, which means an AEX might occur with what
> > > is effectively a bad stack from the VDSO's perspective.
> >
> > What exactly does "uses the untrusted stack to pass parameters for
> > EEXIT" mean? I guess you're saying that the enclave is writing to
> > RSP+[0...some_positive_offset], and the written data needs to be
> > visible to the code outside the enclave afterwards?
>
> As is, they actually do it the other way around, i.e. negative offsets
> relative to the untrusted %RSP.  Going into the enclave there is no
> reserved space on the stack.  The SDK uses EEXIT like a function call,
> i.e. pushing parameters on the stack and making a call outside of the
> enclave, hence the name out-call.  This allows the SDK to handle any
> reasonable out-call without a priori knowledge of the application's
> maximum out-call "size".

But presumably this is bounded to be at most 128 bytes (the red zone
size), right? Otherwise this would be incompatible with
non-sigaltstack signal delivery.
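
For scale, here is the bound that question implies, as a sketch.  It
assumes the enclave leaves %RSP untouched and that the function ID and
each parameter take one 8-byte slot; the 128-byte figure is the x86-64
System V red zone.  The macro and function names are made up.

```c
#include <stdbool.h>
#include <stddef.h>

#define RED_ZONE_BYTES 128u  /* x86-64 System V ABI red zone below %RSP */
#define SLOT_BYTES     8u    /* assumed size of the func ID and of each param */

/* True if an out-call with `nparams` parameters fits without moving %RSP. */
static bool outcall_fits_red_zone(size_t nparams)
{
    /*
     * Compare slot counts rather than byte totals so that a huge nparams
     * cannot overflow: 1 + nparams <= 16 is the same as nparams < 16.
     */
    return nparams < RED_ZONE_BYTES / SLOT_BYTES;
}
```

Under those assumptions the limit would be a function ID plus 15
parameters.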

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 23:27                               ` Jann Horn
  2018-11-02 23:27                                 ` Jann Horn
@ 2018-11-02 23:32                                 ` Andy Lutomirski
  2018-11-02 23:32                                   ` Andy Lutomirski
                                                     ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-02 23:32 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christopherson, Sean J, Andrew Lutomirski, Dave Hansen,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Fri, Nov 2, 2018 at 4:28 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Nov 2, 2018 at 11:04 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> > > On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > > > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > > > This whole mechanism seems very complicated, and it's not clear
> > > > > exactly what behavior user code wants.
> > > >
> > > > No argument there.  That's why I like the approach of dumping the
> > > > exception to userspace without trying to do anything intelligent in
> > > > the kernel.  Userspace can then do whatever it wants AND we don't
> > > > have to worry about mucking with stacks.
> > > >
> > > > One of the hiccups with the VDSO approach is that the enclave may
> > > > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > > > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > > > pass parameters for EEXIT, which means an AEX might occur with what
> > > > is effectively a bad stack from the VDSO's perspective.
> > >
> > > What exactly does "uses the untrusted stack to pass parameters for
> > > EEXIT" mean? I guess you're saying that the enclave is writing to
> > > RSP+[0...some_positive_offset], and the written data needs to be
> > > visible to the code outside the enclave afterwards?
> >
> > As is, they actually do it the other way around, i.e. negative offsets
> > relative to the untrusted %RSP.  Going into the enclave there is no
> > reserved space on the stack.  The SDK uses EEXIT like a function call,
> > i.e. pushing parameters on the stack and making a call outside of the
> > enclave, hence the name out-call.  This allows the SDK to handle any
> > reasonable out-call without a priori knowledge of the application's
> > maximum out-call "size".
>
> But presumably this is bounded to be at most 128 bytes (the red zone
> size), right? Otherwise this would be incompatible with
> non-sigaltstack signal delivery.


I think Sean is saying that the enclave also updates RSP.

One might reasonably wonder how the SDK knows the offset from RSP to
the function ID.  Presumably using RBP?

Anyway, it seems like this is a mess and it's going to be quite hard
to do this in the vDSO due to a lack of any sane way to store the
return address.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 23:32                                 ` Andy Lutomirski
  2018-11-02 23:32                                   ` Andy Lutomirski
@ 2018-11-02 23:36                                   ` Jann Horn
  2018-11-02 23:36                                     ` Jann Horn
  2018-11-06 15:37                                   ` Sean Christopherson
  2 siblings, 1 reply; 163+ messages in thread
From: Jann Horn @ 2018-11-02 23:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: sean.j.christopherson, Dave Hansen, Linus Torvalds, dalias,
	Dave Hansen, jethro, jarkko.sakkinen, Florian Weimer, Linux API,
	the arch/x86 maintainers, linux-arch, kernel list,
	Peter Zijlstra, nhorman, npmccallum, serge.ayoun,
	shay.katz-zamir, linux-sgx, andriy.shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, carlos, adhemerval.zanella

On Sat, Nov 3, 2018 at 12:32 AM Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Nov 2, 2018 at 4:28 PM Jann Horn <jannh@google.com> wrote:
> > On Fri, Nov 2, 2018 at 11:04 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> > > > On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> > > > <sean.j.christopherson@intel.com> wrote:
> > > > > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > > > > This whole mechanism seems very complicated, and it's not clear
> > > > > > exactly what behavior user code wants.
> > > > >
> > > > > No argument there.  That's why I like the approach of dumping the
> > > > > exception to userspace without trying to do anything intelligent in
> > > > > the kernel.  Userspace can then do whatever it wants AND we don't
> > > > > have to worry about mucking with stacks.
> > > > >
> > > > > One of the hiccups with the VDSO approach is that the enclave may
> > > > > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > > > > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > > > > pass parameters for EEXIT, which means an AEX might occur with what
> > > > > is effectively a bad stack from the VDSO's perspective.
> > > >
> > > > What exactly does "uses the untrusted stack to pass parameters for
> > > > EEXIT" mean? I guess you're saying that the enclave is writing to
> > > > RSP+[0...some_positive_offset], and the written data needs to be
> > > > visible to the code outside the enclave afterwards?
> > >
> > > As is, they actually do it the other way around, i.e. negative offsets
> > > relative to the untrusted %RSP.  Going into the enclave there is no
> > > reserved space on the stack.  The SDK uses EEXIT like a function call,
> > > i.e. pushing parameters on the stack and making a call outside of the
> > > enclave, hence the name out-call.  This allows the SDK to handle any
> > > reasonable out-call without a priori knowledge of the application's
> > > maximum out-call "size".
> >
> > But presumably this is bounded to be at most 128 bytes (the red zone
> > size), right? Otherwise this would be incompatible with
> > non-sigaltstack signal delivery.
>
>
> I think Sean is saying that the enclave also updates RSP.

Ah, bleh, of course.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-02 23:32                                 ` Andy Lutomirski
  2018-11-02 23:32                                   ` Andy Lutomirski
  2018-11-02 23:36                                   ` Jann Horn
@ 2018-11-06 15:37                                   ` Sean Christopherson
  2018-11-06 15:37                                     ` Sean Christopherson
                                                       ` (2 more replies)
  2 siblings, 3 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-06 15:37 UTC (permalink / raw)
  To: Andy Lutomirski, Jann Horn
  Cc: Dave Hansen, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Fri, 2018-11-02 at 16:32 -0700, Andy Lutomirski wrote:
> On Fri, Nov 2, 2018 at 4:28 PM Jann Horn <jannh@google.com> wrote:
> > 
> > 
> > On Fri, Nov 2, 2018 at 11:04 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > 
> > > On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> > > > 
> > > > On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
> > > > <sean.j.christopherson@intel.com> wrote:
> > > > > 
> > > > > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > > > > 
> > > > > > This whole mechanism seems very complicated, and it's not clear
> > > > > > exactly what behavior user code wants.
> > > > > No argument there.  That's why I like the approach of dumping the
> > > > > exception to userspace without trying to do anything intelligent in
> > > > > the kernel.  Userspace can then do whatever it wants AND we don't
> > > > > have to worry about mucking with stacks.
> > > > > 
> > > > > One of the hiccups with the VDSO approach is that the enclave may
> > > > > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > > > > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > > > > pass parameters for EEXIT, which means an AEX might occur with what
> > > > > is effectively a bad stack from the VDSO's perspective.
> > > > What exactly does "uses the untrusted stack to pass parameters for
> > > > EEXIT" mean? I guess you're saying that the enclave is writing to
> > > > RSP+[0...some_positive_offset], and the written data needs to be
> > > > visible to the code outside the enclave afterwards?
> > > As is, they actually do it the other way around, i.e. negative offsets
> > > relative to the untrusted %RSP.  Going into the enclave there is no
> > > reserved space on the stack.  The SDK uses EEXIT like a function call,
> > > i.e. pushing parameters on the stack and making a call outside of the
> > > enclave, hence the name out-call.  This allows the SDK to handle any
> > > reasonable out-call without a priori knowledge of the application's
> > > maximum out-call "size".
> > But presumably this is bounded to be at most 128 bytes (the red zone
> > size), right? Otherwise this would be incompatible with
> > non-sigaltstack signal delivery.
> 
> I think Sean is saying that the enclave also updates RSP.

Yeah, the enclave saves/restores RSP from/to the current save state area.

> One might reasonably wonder how the SDK knows the offset from RSP to
> the function ID.  Presumably using RBP?

Here's pseudocode for how the SDK uses the untrusted stack, minus a
bunch of error checking and gory details.

The function ID and a pointer to a marshalling struct are passed to
the untrusted runtime via normal register params, e.g. RDI and RSI.
The marshalling struct is what's actually allocated on the untrusted
stack, like alloca() but more complex and explicit.  The marshalling
struct size is not artificially restricted by the SDK, e.g. AFAIK it
could span multiple 4k pages.


int sgx_out_call(const unsigned int func_index, void *marshalling_struct)
{
	struct sgx_encl_tls *tls = get_encl_tls();

	%RBP = tls->save_state_area[SSA_RBP];
	%RSP = tls->save_state_area[SSA_RSP];
	%RDI = func_index;
	%RSI = marshalling_struct;

	EEXIT

	/* magic elsewhere to get back here on an EENTER(OUT_CALL_RETURN) */
	return %RAX
}

void *sgx_alloc_untrusted_stack(size_t size)
{
	struct sgx_encl_tls *tls = get_encl_tls();
	struct sgx_out_call_context *context;
	void *tmp;

	/* create a frame on the trusted stack to hold the out-call context */
	tls->trusted_stack -= sizeof(struct sgx_out_call_context);

	/* save the untrusted %RSP into the out-call context */
	context = (struct sgx_out_call_context *)tls->trusted_stack;
	context->untrusted_stack = tls->save_state_area[SSA_RSP];

	/* allocate space on the untrusted stack */
	tmp = (void *)(tls->save_state_area[SSA_RSP] - size);
	tls->save_state_area[SSA_RSP] = tmp;

	return tmp;
}

void sgx_pop_untrusted_stack(void)
{
	struct sgx_encl_tls *tls = get_encl_tls();
	struct sgx_out_call_context *context;

	/* retrieve the current out-call context from the trusted stack */
	context = (struct sgx_out_call_context *)tls->trusted_stack;

	/* restore untrusted %RSP */
	tls->save_state_area[SSA_RSP] = context->untrusted_stack;

	/* pop the out-call context frame */
	tls->trusted_stack += sizeof(struct sgx_out_call_context);
}

int sgx_main(void)
{
	struct my_out_call_struct *params;
	int ret;

	params = sgx_alloc_untrusted_stack(sizeof(*params));

	params->0..N = XYZ;

	ret = sgx_out_call(DO_WORK, params);

	sgx_pop_untrusted_stack();

	return ret;
}
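
For anyone who wants to poke at the bookkeeping outside an enclave, the
pseudocode above can be modelled in plain C.  This is only a sketch under
stated assumptions: two ordinary arrays stand in for the untrusted and
trusted stacks, and struct encl_tls, struct out_call_context,
encl_tls_init() and the field names are hypothetical stand-ins, not the
SDK's real layout or API.  EEXIT and the register loads are left out
entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the per-thread enclave state; not the SDK's layout. */
struct encl_tls {
	uintptr_t ssa_rsp;	/* models save_state_area[SSA_RSP] */
	uintptr_t trusted_sp;	/* models tls->trusted_stack */
};

/* Out-call context that lives on the (modelled) trusted stack. */
struct out_call_context {
	uintptr_t untrusted_rsp;
};

/* Ordinary arrays stand in for the two stacks. */
static unsigned char untrusted_stack[16384];
static unsigned char trusted_stack[256];

static void encl_tls_init(struct encl_tls *tls)
{
	/* Both stacks grow down, so start at the end of each array. */
	tls->ssa_rsp = (uintptr_t)(untrusted_stack + sizeof(untrusted_stack));
	tls->trusted_sp = (uintptr_t)(trusted_stack + sizeof(trusted_stack));
}

static void *alloc_untrusted_stack(struct encl_tls *tls, size_t size)
{
	struct out_call_context ctx = { .untrusted_rsp = tls->ssa_rsp };

	/* the model only has finite stacks, so refuse to run off either one */
	assert(size <= tls->ssa_rsp - (uintptr_t)untrusted_stack);
	assert(sizeof(ctx) <= tls->trusted_sp - (uintptr_t)trusted_stack);

	/* push the saved untrusted RSP onto the trusted stack */
	tls->trusted_sp -= sizeof(ctx);
	memcpy((void *)tls->trusted_sp, &ctx, sizeof(ctx));

	/* move the saved untrusted RSP down by the requested size */
	tls->ssa_rsp -= size;
	return (void *)tls->ssa_rsp;
}

static void pop_untrusted_stack(struct encl_tls *tls)
{
	struct out_call_context ctx;

	/* restore the untrusted RSP and drop the context frame */
	memcpy(&ctx, (void *)tls->trusted_sp, sizeof(ctx));
	tls->ssa_rsp = ctx.untrusted_rsp;
	tls->trusted_sp += sizeof(ctx);
}
```

Calling encl_tls_init(), then alloc_untrusted_stack() with a size of 4096,
leaves the modelled RSP 4096 bytes below where it started, i.e. well past a
128-byte red zone, and pop_untrusted_stack() puts it back.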

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 16:57                                     ` Andy Lutomirski
@ 2018-11-06 16:57                                       ` Andy Lutomirski
  2018-11-06 17:03                                       ` Dave Hansen
  2018-11-06 17:19                                       ` Sean Christopherson
  2 siblings, 0 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 16:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Jann Horn, Dave Hansen, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella



> On Nov 6, 2018, at 7:37 AM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
>> On Fri, 2018-11-02 at 16:32 -0700, Andy Lutomirski wrote:
>>> On Fri, Nov 2, 2018 at 4:28 PM Jann Horn <jannh@google.com> wrote:
>>> 
>>> 
>>> On Fri, Nov 2, 2018 at 11:04 PM Sean Christopherson
>>> <sean.j.christopherson@intel.com> wrote:
>>>> 
>>>>> On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
>>>>> 
>>>>> On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
>>>>> <sean.j.christopherson@intel.com> wrote:
>>>>>> 
>>>>>>> On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
>>>>>>> 
>>>>>>> This whole mechanism seems very complicated, and it's not clear
>>>>>>> exactly what behavior user code wants.
>>>>>> No argument there.  That's why I like the approach of dumping the
>>>>>> exception to userspace without trying to do anything intelligent in
>>>>>> the kernel.  Userspace can then do whatever it wants AND we don't
>>>>>> have to worry about mucking with stacks.
>>>>>> 
>>>>>> One of the hiccups with the VDSO approach is that the enclave may
>>>>>> want to use the untrusted stack, i.e. the stack that has the VDSO's
>>>>>> stack frame.  For example, Intel's SDK uses the untrusted stack to
>>>>>> pass parameters for EEXIT, which means an AEX might occur with what
>>>>>> is effectively a bad stack from the VDSO's perspective.
>>>>> What exactly does "uses the untrusted stack to pass parameters for
>>>>> EEXIT" mean? I guess you're saying that the enclave is writing to
>>>>> RSP+[0...some_positive_offset], and the written data needs to be
>>>>> visible to the code outside the enclave afterwards?
>>>> As is, they actually do it the other way around, i.e. negative offsets
>>>> relative to the untrusted %RSP.  Going into the enclave there is no
>>>> reserved space on the stack.  The SDK uses EEXIT like a function call,
>>>> i.e. pushing parameters on the stack and making a call outside of the
>>>> enclave, hence the name out-call.  This allows the SDK to handle any
>>>> reasonable out-call without a priori knowledge of the application's
>>>> maximum out-call "size".
>>> But presumably this is bounded to be at most 128 bytes (the red zone
>>> size), right? Otherwise this would be incompatible with
>>> non-sigaltstack signal delivery.
>> 
>> I think Sean is saying that the enclave also updates RSP.
> 
> Yeah, the enclave saves/restores RSP from/to the current save state area.
> 
>> One might reasonably wonder how the SDK knows the offset from RSP to
>> the function ID.  Presumably using RBP?
> 
> Here's pseudocode for how the SDK uses the untrusted stack, minus a
> bunch of error checking and gory details.
> 
> The function ID and a pointer to a marshalling struct are passed to
> the untrusted runtime via normal register params, e.g. RDI and RSI.
> The marshalling struct is what's actually allocated on the untrusted
> stack, like alloca() but more complex and explicit.  The marshalling
> struct size is not artificially restricted by the SDK, e.g. AFAIK it
> could span multiple 4k pages.
> 
> 
> int sgx_out_call(const unsigned int func_index, void *marshalling_struct)
> {
>    struct sgx_encl_tls *tls = get_encl_tls();
> 
>    %RBP = tls->save_state_area[SSA_RBP];
>    %RSP = tls->save_state_area[SSA_RSP];
>    %RDI = func_index;
>    %RSI = marshalling_struct;
> 
>    EEXIT
> 
>    /* magic elsewhere to get back here on an EENTER(OUT_CALL_RETURN) */
>    return %RAX
> }
> 
> void *sgx_alloc_untrusted_stack(size_t size)
> {
>    struct sgx_encl_tls *tls = get_encl_tls();
>    struct sgx_out_call_context *context;
>    void *tmp;
> 
>    /* create a frame on the trusted stack to hold the out-call context */
>    tls->trusted_stack -= sizeof(struct sgx_out_call_context);
> 
>    /* save the untrusted %RSP into the out-call context */
>    context = (struct sgx_out_call_context *)tls->trusted_stack;
>    context->untrusted_stack = tls->save_state_area[SSA_RSP];
> 
>    /* allocate space on the untrusted stack */
>    tmp = (void *)(tls->save_state_area[SSA_RSP] - size);
>    tls->save_state_area[SSA_RSP] = tmp;
> 
>    return tmp;
> }
> 
> void sgx_pop_untrusted_stack(void)
> {
>    struct sgx_encl_tls *tls = get_encl_tls();
>    struct sgx_out_call_context *context;
> 
>    /* retrieve the current out-call context from the trusted stack */
>    context = (struct sgx_out_call_context *)tls->trusted_stack;
> 
>    /* restore untrusted %RSP */
>    tls->save_state_area[SSA_RSP] = context->untrusted_stack;
> 
>    /* pop the out-call context frame */
>    tls->trusted_stack += sizeof(struct sgx_out_call_context);
> }
> 
> int sgx_main(void)
> {
>    struct my_out_call_struct *params;
> 
>    params = sgx_alloc_untrusted_stack(sizeof(*params));
> 
>    params->0..N = XYZ;
> 
>    ret = sgx_out_call(DO_WORK, params);
> 
>    sgx_pop_untrusted_stack();
> 
>    return ret;
> }

So I guess the non-enclave code basically can’t trust its stack pointer
because of these shenanigans. And the AEP code has to live with the fact
that its RSP is basically arbitrary and probably can’t even be unwound
by a debugger?  And the EENTER code has to deal with the fact that its
red zone can be blatantly violated by the enclave?

I’m assuming it’s way too late for the SGX SDK to be changed to use a
normal RPC mechanism? I’m a bit disappointed that enclaves can even
manipulate outside state like this. I assume Intel had some reason
for making it possible, but still.
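
To make the red zone concern concrete, here is a small sketch of the
address arithmetic.  It assumes the x86-64 System V ABI red zone of 128
bytes below RSP, which a signal frame pushed on the same stack skips over.
The helper names overlaps_red_zone() and safe_from_signal_frame() are made
up for illustration; nothing here comes from the SDK or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RED_ZONE_SIZE 128u	/* x86-64 System V ABI */

/* True if [addr, addr + len) overlaps the red zone [rsp - 128, rsp). */
static bool overlaps_red_zone(uintptr_t rsp, uintptr_t addr, size_t len)
{
	assert(rsp >= RED_ZONE_SIZE);
	return len != 0 && addr < rsp && addr + len > rsp - RED_ZONE_SIZE;
}

/*
 * True if addr is at or above the bottom of the red zone for the given
 * rsp, i.e. a signal frame pushed on this stack for that rsp would start
 * below it and leave the data alone.
 */
static bool safe_from_signal_frame(uintptr_t rsp, uintptr_t addr)
{
	assert(rsp >= RED_ZONE_SIZE);
	return addr >= rsp - RED_ZONE_SIZE;
}
```

With these definitions, a 4096-byte buffer placed directly below the RSP
that EENTER was executed with overlaps the caller's red zone, and it is
only out of reach of a signal frame once RSP itself has been moved down to
the start of the buffer, which is what the enclave ends up doing.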

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 15:37                                   ` Sean Christopherson
  2018-11-06 15:37                                     ` Sean Christopherson
  2018-11-06 16:57                                     ` Andy Lutomirski
@ 2018-11-06 17:00                                     ` Dave Hansen
  2018-11-06 17:00                                       ` Dave Hansen
  2 siblings, 1 reply; 163+ messages in thread
From: Dave Hansen @ 2018-11-06 17:00 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski, Jann Horn
  Cc: Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On 11/6/18 7:37 AM, Sean Christopherson wrote:
> 
> void *sgx_alloc_untrusted_stack(size_t size)
> {
> 	struct sgx_encl_tls *tls = get_encl_tls();
> 	struct sgx_out_call_context *context;
> 	void *tmp;
> 
> 	/* create a frame on the trusted stack to hold the out-call context */
> 	tls->trusted_stack -= sizeof(struct sgx_out_call_context);
> 
> 	/* save the untrusted %RSP into the out-call context */
> 	context = (struct sgx_out_call_context *)tls->trusted_stack;
> 	context->untrusted_stack = tls->save_state_area[SSA_RSP];
> 
> 	/* allocate space on the untrusted stack */
> 	tmp = (void *)(tls->save_state_area[SSA_RSP] - size);
> 	tls->save_state_area[SSA_RSP] = tmp;
> 
> 	return tmp;
> }

Why does it bother to go to all the trouble of mucking with the
untrusted stack?  It could *easily* just leave it alone and do out-calls
if it needs to allocate memory for parameter storage.  Heck, that could
theoretically even be _on_ the stack if the untrusted runtime was being
clever.

The only downside would be that the untrusted runtime would have to keep
track of the space a bit more explicitly so it could be cleaned up if
the enclave didn't do it.
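
A rough sketch of what that tracking could look like on the untrusted
side, assuming the enclave requests parameter storage via an out-call
instead of moving RSP itself.  Everything here (struct untrusted_runtime,
ocall_alloc_params(), ocall_free_params(), runtime_cleanup()) is a
hypothetical illustration, not an existing SDK or kernel interface, and
it uses malloc() rather than stack space to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One buffer handed to the enclave for out-call parameters. */
struct tracked_buf {
	struct tracked_buf *next;
	void *data;
};

/* Per-enclave bookkeeping kept by the (hypothetical) untrusted runtime. */
struct untrusted_runtime {
	struct tracked_buf *live;	/* handed out, not yet freed */
};

/* Out-call handler: the enclave asks for size bytes of parameter storage. */
static void *ocall_alloc_params(struct untrusted_runtime *rt, size_t size)
{
	struct tracked_buf *b;

	assert(rt != NULL);
	b = malloc(sizeof(*b));
	if (!b)
		return NULL;
	b->data = malloc(size ? size : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->next = rt->live;
	rt->live = b;
	return b->data;
}

/* Out-call handler: the enclave is done with a buffer.  -1 if unknown. */
static int ocall_free_params(struct untrusted_runtime *rt, void *data)
{
	struct tracked_buf **pp;

	assert(rt != NULL);
	for (pp = &rt->live; *pp; pp = &(*pp)->next) {
		if ((*pp)->data == data) {
			struct tracked_buf *b = *pp;

			*pp = b->next;
			free(b->data);
			free(b);
			return 0;
		}
	}
	return -1;	/* never handed out: refuse rather than free it */
}

/* Release whatever the enclave never freed; returns how many were leaked. */
static size_t runtime_cleanup(struct untrusted_runtime *rt)
{
	size_t leaked = 0;

	assert(rt != NULL);
	while (rt->live) {
		struct tracked_buf *b = rt->live;

		rt->live = b->next;
		free(b->data);
		free(b);
		leaked++;
	}
	return leaked;
}
```

The cost is the explicit list: the runtime has to remember every buffer
so that runtime_cleanup() can reclaim the ones the enclave never returned,
which is the extra tracking mentioned above.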

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 16:57                                     ` Andy Lutomirski
  2018-11-06 16:57                                       ` Andy Lutomirski
@ 2018-11-06 17:03                                       ` Dave Hansen
  2018-11-06 17:03                                         ` Dave Hansen
  2018-11-06 17:19                                       ` Sean Christopherson
  2 siblings, 1 reply; 163+ messages in thread
From: Dave Hansen @ 2018-11-06 17:03 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Andy Lutomirski, Jann Horn, Linus Torvalds, Rich Felker,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/6/18 8:57 AM, Andy Lutomirski wrote:
> I’m assuming it’s way too late for the SGX SDK to be changed to use a
> normal RPC mechanism? I’m a bit disappointed that enclaves can even
> manipulate outside state like this. I assume Intel had some reason
> for making it possible, but still.

Just because it's architecturally possible doesn't mean it has to be a
part of the ABI for running enclaves under Linux.

It's not too late to change the SDK.  Intel does not and can not depend
on any behavior of Linux until code gets merged.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 17:19                                       ` Sean Christopherson
@ 2018-11-06 17:19                                         ` Sean Christopherson
  2018-11-06 18:20                                         ` Andy Lutomirski
  1 sibling, 0 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-06 17:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Jann Horn, Dave Hansen, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, 2018-11-06 at 08:57 -0800, Andy Lutomirski wrote:
>
> So I guess the non-enclave code basically can’t trust its stack pointer
> because of these shenanigans. And the AEP code has to live with the fact
> that its RSP is basically arbitrary and probably can’t even be unwound
> by a debugger?

The SDK provides a Python GDB plugin to hook into the out-call flow and
do more stack shenanigans.  From what I can tell it's fudging the stack
to make it look like a normal stack frame so the debugger can do its
thing.

> And the EENTER code has to deal with the fact that its red zone can be
> blatantly violated by the enclave?

That's my understanding of things.  So yeah, if it wasn't obvious before,
the trusted and untrusted parts of the SDK are very tightly coupled.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 18:20                                         ` Andy Lutomirski
@ 2018-11-06 18:20                                           ` Andy Lutomirski
  2018-11-06 18:41                                           ` Dave Hansen
  1 sibling, 0 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 18:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Jann Horn, Dave Hansen, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella




> On Nov 6, 2018, at 9:19 AM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
>> On Tue, 2018-11-06 at 08:57 -0800, Andy Lutomirski wrote:
>> 
>> So I guess the non-enclave code basically can’t trust its stack pointer
>> because of these shenanigans. And the AEP code has to live with the fact
>> that its RSP is basically arbitrary and probably can’t even be unwound
>> by a debugger?
> 
> The SDK provides a Python GDB plugin to hook into the out-call flow and
> do more stack shenanigans.  From what I can tell it's fudging the stack
> to make it look like a normal stack frame so the debugger can do its
> thing.
> 
>> And the EENTER code has to deal with the fact that its red zone can be
>> blatantly violated by the enclave?
> 
> That's my understanding of things.  So yeah, if it wasn't obvious before,
> the trusted and untrusted parts of the SDK are very tightly coupled.

Yuck. Just how far does this tight coupling go?  If there are enclaves
that play with, say, FSBASE or GSBASE, we’re going to start having
problems. And the SGX handling of PKRU is complicated at best.

I almost feel like the right solution is to call into SGX on its own
private stack or maybe even its own private address space.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 18:20                                         ` Andy Lutomirski
  2018-11-06 18:20                                           ` Andy Lutomirski
@ 2018-11-06 18:41                                           ` Dave Hansen
  2018-11-06 18:41                                             ` Dave Hansen
  2018-11-06 19:02                                             ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-06 18:41 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Andy Lutomirski, Jann Horn, Linus Torvalds, Rich Felker,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> I almost feel like the right solution is to call into SGX on its own
> private stack or maybe even its own private address space.

Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
enclave like its own "thread" with its own stack and its own set of
registers and context?  That seems like a much more workable model than
trying to weave it together with the EENTER context.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 18:41                                           ` Dave Hansen
  2018-11-06 18:41                                             ` Dave Hansen
@ 2018-11-06 19:02                                             ` Andy Lutomirski
  2018-11-06 19:02                                               ` Andy Lutomirski
                                                                 ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 19:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christopherson, Sean J, Andrew Lutomirski, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > I almost feel like the right solution is to call into SGX on its own
> > private stack or maybe even its own private address space.
>
> Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> enclave like its own "thread" with its own stack and its own set of
> registers and context?  That seems like a much more workable model than
> trying to weave it together with the EENTER context.

So maybe the API should be, roughly

sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave,
                                    struct host_state *state);
sgx_exit_reason_t sgx_resume_enclave(same args);

where host_state is something like:

struct host_state {
  unsigned long bp, sp, ax, bx, cx, dx, si, di;
};

and the values in host_state explicitly have nothing to do with the
actual host registers.  So, if you want to use the outcall mechanism,
you'd allocate some memory, point sp to that memory, call
sgx_enter_enclave(), and then read that memory to do the outcall.
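
A minimal sketch of that calling convention, purely for illustration.
Nothing here is an existing API: mock_sgx_enter_enclave() is a
hypothetical stand-in for the real entry stub, the exit-reason values
are invented, and the "enclave" is faked by a function that pushes one
outcall argument onto the memory that state->sp points at.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Register snapshot handed to the enclave.  Per the proposal, these
 * values have nothing to do with the real host registers. */
struct host_state {
    unsigned long bp, sp, ax, bx, cx, dx, si, di;
};

/* Hypothetical exit reasons; the names are assumptions. */
typedef enum { SGX_EXIT_DONE, SGX_EXIT_OUTCALL } sgx_exit_reason_t;

/* Hypothetical mock of the entry stub: pretends the enclave pushed one
 * outcall argument below state->sp and then asked for an outcall. */
static sgx_exit_reason_t mock_sgx_enter_enclave(void *enclave,
                                                struct host_state *state)
{
    unsigned long arg = 42;

    (void)enclave;
    state->sp -= sizeof(arg);                       /* move sp first */
    memcpy((void *)(uintptr_t)state->sp, &arg, sizeof(arg));
    return SGX_EXIT_OUTCALL;
}

/* Host side: allocate scratch memory, point sp at its top, enter, and
 * read the outcall argument back out of that memory. */
static unsigned long enter_and_read_outcall(void)
{
    static unsigned long scratch[64];
    struct host_state state;
    unsigned long arg = 0;

    memset(&state, 0, sizeof(state));
    state.sp = (unsigned long)(uintptr_t)&scratch[64];  /* one past the end */

    if (mock_sgx_enter_enclave(NULL, &state) == SGX_EXIT_OUTCALL)
        memcpy(&arg, (const void *)(uintptr_t)state.sp, sizeof(arg));
    return arg;
}
```

The point of the shape is that the caller's real stack is never
touched: the enclave only ever sees the scratch buffer.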

Actually implementing this would be distinctly nontrivial, and would
almost certainly need some degree of kernel help to avoid an explosion
when a signal gets delivered while we have host_state.sp loaded into
the actual SP register.  Maybe rseq could help with this?

The ISA here is IMO not well thought through.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 19:02                                             ` Andy Lutomirski
  2018-11-06 19:02                                               ` Andy Lutomirski
@ 2018-11-06 19:22                                               ` Dave Hansen
  2018-11-06 19:22                                                 ` Dave Hansen
  2018-11-06 20:12                                                 ` Andy Lutomirski
  2018-11-06 23:17                                               ` Rich Felker
  2 siblings, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-06 19:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopherson, Sean J, Jann Horn, Linus Torvalds, Rich Felker,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/6/18 11:02 AM, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 11/6/18 10:20 AM, Andy Lutomirski wrote:
>>> I almost feel like the right solution is to call into SGX on its own
>>> private stack or maybe even its own private address space.
>>
>> Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
>> enclave like its own "thread" with its own stack and its own set of
>> registers and context?  That seems like a much more workable model than
>> trying to weave it together with the EENTER context.
> 
> So maybe the API should be, roughly
> 
> sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> host_state *state);
> sgx_exit_reason_t sgx_resume_enclave(same args);
> 
> where host_state is something like:
> 
> struct host_state {
>   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> };
> 
> and the values in host_state explicitly have nothing to do with the
> actual host registers.  So, if you want to use the outcall mechanism,
> you'd allocate some memory, point sp to that memory, call
> sgx_enter_enclave(), and then read that memory to do the outcall.

Ah, so instead of the enclave rudely "hijacking" the EENTER context, we
have it nicely return and nicely _hint_ to the calling context what it
would like to do.  Then, the EENTER context can make a controlled
transition over to the requested context.

> Actually implementing this would be distinctly nontrivial, and would
> almost certainly need some degree of kernel help to avoid an explosion
> when a signal gets delivered while we have host_state.sp loaded into
> the actual SP register.  Maybe rseq could help with this?

As long as the memory pointed to by host_state.sp is valid and can hold
the signal frame (grows down without clobbering anything), what goes
boom?  The signal handling would push a signal frame and call the
handler.  It would have a shallow-looking stack, but the handler could
just do its normal business and return from the signal where the frame
would get popped and continue with %rsp=host_state.sp, blissfully
unaware of the signal ever having happened.
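
A rough userspace illustration of that claim, under loud assumptions:
it uses glibc's makecontext()/swapcontext() to run on a privately
allocated stack as a stand-in for "host_state.sp loaded into the real
SP", and has nothing to do with SGX itself.  The helper names are made
up for this sketch.  The handler records where its frame landed, and
the check confirms it landed inside the private stack.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define PRIVATE_STACK_SIZE (64 * 1024)

static volatile uintptr_t handler_sp;       /* address inside handler frame */
static volatile sig_atomic_t handler_ran;

static void on_signal(int sig)
{
    volatile char marker = 0;               /* lives on the current stack */

    (void)sig;
    handler_sp = (uintptr_t)&marker;
    handler_ran = 1;
}

/* Runs with SP inside the private stack; the signal frame is pushed
 * below the current SP and popped again when the handler returns. */
static void run_on_private_stack(void)
{
    raise(SIGUSR1);
}

/* Returns 1 if the handler ran on the private stack, 0 if not,
 * -1 on setup failure. */
static int signal_lands_on_private_stack(void)
{
    ucontext_t main_ctx, private_ctx;
    struct sigaction sa;
    char *stack = malloc(PRIVATE_STACK_SIZE);
    int inside;

    if (!stack)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;              /* no SA_ONSTACK: use current SP */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0 || getcontext(&private_ctx) != 0) {
        free(stack);
        return -1;
    }

    private_ctx.uc_stack.ss_sp = stack;
    private_ctx.uc_stack.ss_size = PRIVATE_STACK_SIZE;
    private_ctx.uc_link = &main_ctx;        /* resume here when done */
    makecontext(&private_ctx, run_on_private_stack, 0);
    if (swapcontext(&main_ctx, &private_ctx) != 0) {
        free(stack);
        return -1;
    }

    inside = handler_ran &&
             handler_sp >= (uintptr_t)stack &&
             handler_sp < (uintptr_t)stack + PRIVATE_STACK_SIZE;
    free(stack);
    return inside;
}
```

This only covers the benign case where nothing of value sits below
SP when the signal arrives.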

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 20:12                                                 ` Andy Lutomirski
@ 2018-11-06 20:12                                                   ` Andy Lutomirski
  2018-11-06 21:00                                                   ` Dave Hansen
  1 sibling, 0 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 20:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopherson, Sean J, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella



> On Nov 6, 2018, at 11:22 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 11/6/18 11:02 AM, Andy Lutomirski wrote:
>>> On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>>> On 11/6/18 10:20 AM, Andy Lutomirski wrote:
>>>> I almost feel like the right solution is to call into SGX on its own
>>>> private stack or maybe even its own private address space.
>>> 
>>> Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
>>> enclave like its own "thread" with its own stack and its own set of
>>> registers and context?  That seems like a much more workable model than
>>> trying to weave it together with the EENTER context.
>> 
>> So maybe the API should be, roughly
>> 
>> sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
>> host_state *state);
>> sgx_exit_reason_t sgx_resume_enclave(same args);
>> 
>> where host_state is something like:
>> 
>> struct host_state {
>>  unsigned long bp, sp, ax, bx, cx, dx, si, di;
>> };
>> 
>> and the values in host_state explicitly have nothing to do with the
>> actual host registers.  So, if you want to use the outcall mechanism,
>> you'd allocate some memory, point sp to that memory, call
>> sgx_enter_enclave(), and then read that memory to do the outcall.
> 
> Ah, so instead of the enclave rudely "hijacking" the EENTER context, we
> have it nicely return and nicely _hint_ to the calling context what it
> would like to do.  Then, the EENTER context can make a controlled
> transition over to the requested context.

Exactly. And existing enclaves keep working: their rudeness is just
magically translated into a hint!

> 
>> Actually implementing this would be distinctly nontrivial, and would
>> almost certainly need some degree of kernel help to avoid an explosion
>> when a signal gets delivered while we have host_state.sp loaded into
>> the actual SP register.  Maybe rseq could help with this?
> 
> As long as the memory pointed to by host_state.sp is valid and can hold
> the signal frame (grows down without clobbering anything), what goes
> boom?  The signal handling would push a signal frame and call the
> handler.  It would have a shallow-looking stack, but the handler could
> just do its normal business and return from the signal where the frame
> would get popped and continue with %rsp=host_state.sp, blissfully
> unaware of the signal ever having happened.

True, but what if we have a nasty enclave that writes to memory just
below SP *before* decrementing SP?

I suspect that rseq really can be used for this with only minimal-ish
modifications.  Or we could stick this in the vDSO with some appropriate
fixups in the kernel.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 20:12                                                 ` Andy Lutomirski
  2018-11-06 20:12                                                   ` Andy Lutomirski
@ 2018-11-06 21:00                                                   ` Dave Hansen
  2018-11-06 21:00                                                     ` Dave Hansen
  2018-11-06 21:07                                                     ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-06 21:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Christopherson, Sean J, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On 11/6/18 12:12 PM, Andy Lutomirski wrote:
> True, but what if we have a nasty enclave that writes to memory just
> below SP *before* decrementing SP?

Yeah, that would be unfortunate.  If an enclave did this (roughly):

	1. EENTER
	2. Hardware sets eenter_hwframe->sp = %sp
	3. Enclave runs... wants to do out-call
	4. Enclave sets up parameters:
		memcpy(&eenter_hwframe->sp[-offset], arg1, size);
		...
	5. Enclave sets eenter_hwframe->sp -= offset

If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
was on the stack.  The enclave could easily fix this by moving ->sp first.
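
The ordering hazard in steps 4 and 5 can be modelled without any real
signals.  This is a toy simulation, not SGX or kernel code: fake_stack,
deliver_signal() and the frame size are all invented for the sketch,
and "signal delivery" is just a function that scribbles a frame
immediately below the current sp.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 32
#define FRAME_WORDS 4            /* size of the pretend signal frame */
#define FRAME_FILL  0xdeadUL     /* what the pretend frame contains */

struct fake_stack {
    unsigned long mem[STACK_WORDS];
    size_t sp;                   /* index of lowest in-use word; grows down */
};

/* Models delivery plus return: a frame is written just below sp, and
 * sp itself is unchanged once the handler has returned. */
static void deliver_signal(struct fake_stack *s)
{
    for (size_t i = 1; i <= FRAME_WORDS; i++)
        s->mem[s->sp - i] = FRAME_FILL;
}

/* Step 4 before step 5: write the argument below sp, take a signal,
 * then move sp.  The signal frame overwrites the argument. */
static unsigned long write_then_move(unsigned long arg)
{
    struct fake_stack s = { .sp = STACK_WORDS };

    s.mem[s.sp - 1] = arg;
    deliver_signal(&s);
    s.sp -= 1;
    return s.mem[s.sp];
}

/* The fix: move sp first, so the argument sits at or above sp and the
 * signal frame lands below it. */
static unsigned long move_then_write(unsigned long arg)
{
    struct fake_stack s = { .sp = STACK_WORDS };

    s.sp -= 1;
    s.mem[s.sp] = arg;
    deliver_signal(&s);
    return s.mem[s.sp];
}
```

In this model write_then_move() hands back the frame filler instead of
the argument, while move_then_write() returns the argument intact.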

But, this is one of those "fun" parts of the ABI that I think we need to
talk about.  If we do this, we also basically require that the code
which handles asynchronous exits must *not* write to the stack.  That's
not hard because it's typically just a single ERESUME instruction, but
it *is* a requirement.

It means fun stuff like that you absolutely can't just async-exit to C code.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:00                                                   ` Dave Hansen
  2018-11-06 21:00                                                     ` Dave Hansen
@ 2018-11-06 21:07                                                     ` Andy Lutomirski
  2018-11-06 21:07                                                       ` Andy Lutomirski
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 21:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopherson, Sean J, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella



> On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
>> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
>> True, but what if we have a nasty enclave that writes to memory just
>> below SP *before* decrementing SP?
>
> Yeah, that would be unfortunate.  If an enclave did this (roughly):
>
>    1. EENTER
>    2. Hardware sets eenter_hwframe->sp = %sp
>    3. Enclave runs... wants to do out-call
>    4. Enclave sets up parameters:
>        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
>        ...
>    5. Enclave sets eenter_hwframe->sp -= offset
>
> If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> was on the stack.  The enclave could easily fix this by moving ->sp first.
>
> But, this is one of those "fun" parts of the ABI that I think we need to
> talk about.  If we do this, we also basically require that the code
> which handles asynchronous exits must *not* write to the stack.  That's
> not hard because it's typically just a single ERESUME instruction, but
> it *is* a requirement.
>

I was assuming that the async exit stuff was completely hidden by the
API.  The AEP code would decide whether the exit got fixed up by the
kernel (which may or may not be easy to tell: can the code even tell
without kernel help whether it was, say, an IRQ vs #UD?) and then either
do ERESUME or cause sgx_enter_enclave() to return with an appropriate
return value.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:07                                                     ` Andy Lutomirski
@ 2018-11-06 21:07                                                       ` Andy Lutomirski
  2018-11-06 21:41                                                       ` Andy Lutomirski
  2018-11-08 19:54                                                       ` Sean Christopherson
  2 siblings, 0 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 21:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopherson, Sean J, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella



> On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
>> True, but what if we have a nasty enclave that writes to memory just
>> below SP *before* decrementing SP?
> 
> Yeah, that would be unfortunate.  If an enclave did this (roughly):
> 
>    1. EENTER
>    2. Hardware sets eenter_hwframe->sp = %sp
>    3. Enclave runs... wants to do out-call
>    4. Enclave sets up parameters:
>        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
>        ...
>    5. Enclave sets eenter_hwframe->sp -= offset
> 
> If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> was on the stack.  The enclave could easily fix this by moving ->sp first.
> 
> But, this is one of those "fun" parts of the ABI that I think we need to
> talk about.  If we do this, we also basically require that the code
> which handles asynchronous exits must *not* write to the stack.  That's
> not hard because it's typically just a single ERESUME instruction, but
> it *is* a requirement.
> 

I was assuming that the async exit stuff was completely hidden by the API. The AEP code would decide whether the exit got fixed up by the kernel (which may or may not be easy to tell — can the code even tell without kernel help whether it was, say, an IRQ vs #UD?) and then either do ERESUME or cause sgx_enter_enclave() to return with an appropriate return value.



^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:07                                                     ` Andy Lutomirski
  2018-11-06 21:07                                                       ` Andy Lutomirski
@ 2018-11-06 21:41                                                       ` Andy Lutomirski
  2018-11-06 21:41                                                         ` Andy Lutomirski
  2018-11-06 21:59                                                         ` Sean Christopherson
  2018-11-08 19:54                                                       ` Sean Christopherson
  2 siblings, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 21:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Lutomirski, Christopherson, Sean J, Jann Horn,
	Linus Torvalds, Rich Felker, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 6, 2018 at 1:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
> >> True, but what if we have a nasty enclave that writes to memory just
> >> below SP *before* decrementing SP?
> >
> > Yeah, that would be unfortunate.  If an enclave did this (roughly):
> >
> >    1. EENTER
> >    2. Hardware sets eenter_hwframe->sp = %sp
> >    3. Enclave runs... wants to do out-call
> >    4. Enclave sets up parameters:
> >        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
> >        ...
> >    5. Enclave sets eenter_hwframe->sp -= offset
> >
> > If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> > was on the stack.  The enclave could easily fix this by moving ->sp first.
> >
> > But, this is one of those "fun" parts of the ABI that I think we need to
> > talk about.  If we do this, we also basically require that the code
> > which handles asynchronous exits must *not* write to the stack.  That's
> > not hard because it's typically just a single ERESUME instruction, but
> > it *is* a requirement.
> >
>
> I was assuming that the async exit stuff was completely hidden by the API. The AEP code would decide whether the exit got fixed up by the kernel (which may or may not be easy to tell — can the code even tell without kernel help whether it was, say, an IRQ vs #UD?) and then either do ERESUME or cause sgx_enter_enclave() to return with an appropriate return value.
>
>

Sean, how does the current SDK AEX handler decide whether to do
EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
like the *CPU* could give a big hint, but I don't see where there is
any architectural indication of why the AEX code got called or any
obvious way for the user code to know whether the exit was fixed up by
the kernel?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:41                                                       ` Andy Lutomirski
  2018-11-06 21:41                                                         ` Andy Lutomirski
@ 2018-11-06 21:59                                                         ` Sean Christopherson
  2018-11-06 21:59                                                           ` Sean Christopherson
  2018-11-06 23:00                                                           ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-06 21:59 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 1:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > 
> > > 
> > > On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > > 
> > > > 
> > > > On 11/6/18 12:12 PM, Andy Lutomirski wrote:
> > > > True, but what if we have a nasty enclave that writes to memory just
> > > > below SP *before* decrementing SP?
> > > Yeah, that would be unfortunate.  If an enclave did this (roughly):
> > > 
> > >    1. EENTER
> > >    2. Hardware sets eenter_hwframe->sp = %sp
> > >    3. Enclave runs... wants to do out-call
> > >    4. Enclave sets up parameters:
> > >        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
> > >        ...
> > >    5. Enclave sets eenter_hwframe->sp -= offset
> > > 
> > > If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> > > was on the stack.  The enclave could easily fix this by moving ->sp first.
> > > 
> > > But, this is one of those "fun" parts of the ABI that I think we need to
> > > talk about.  If we do this, we also basically require that the code
> > > which handles asynchronous exits must *not* write to the stack.  That's
> > > not hard because it's typically just a single ERESUME instruction, but
> > > it *is* a requirement.
> > > 
> > I was assuming that the async exit stuff was completely hidden by the API. The AEP code would decide whether the exit got fixed up by the kernel (which may or may not be easy to tell — can the
> > code even tell without kernel help whether it was, say, an IRQ vs #UD?) and then either do ERESUME or cause sgx_enter_enclave() to return with an appropriate return value.
> > 
> > 
> Sean, how does the current SDK AEX handler decide whether to do
> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> like the *CPU* could give a big hint, but I don't see where there is
> any architectural indication of why the AEX code got called or any
> obvious way for the user code to know whether the exit was fixed up by
> the kernel?

The SDK "unconditionally" does ERESUME at the AEP location, but that's
a bit misleading because its signal handler may muck with the context's
RIP, e.g. to abort the enclave on a fatal fault.

On an event/exception from within an enclave, the event is immediately
delivered after loading synthetic state and changing RIP to the AEP.
In other words, jamming CPU state is essentially a bunch of vectoring
ucode preamble, but from software's perspective it's a normal event
that happens to point at the AEP instead of somewhere in the enclave.
And because the signals the SDK cares about are all synchronous, the
SDK can simply hardcode ERESUME at the AEP since all of the fault logic
resides in its signal handler.  IRQs and whatnot simply trampoline back
into the enclave.

Userspace can do something funky instead of ERESUME, but only *after*
IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
case, after the trap handler has run.
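
As a rough illustration of the flow described above, here is a toy
state machine in C.  It is a sketch under assumptions, not SDK code:
the event classes, the "abort path" and all toy_* names are made up
for the example, and the only behaviour it models is "AEX always
points RIP at the AEP, a signal handler may rewrite the saved RIP for
synchronous faults, and the AEP itself is a bare ERESUME".

```c
#include <assert.h>

/* Hypothetical event classes; not taken from the SDK or the kernel. */
enum toy_event { EV_IRQ, EV_RECOVERABLE_FAULT, EV_FATAL_FAULT };

/* Where the saved user RIP points when the kernel returns to userspace. */
enum toy_rip { RIP_AEP, RIP_ABORT_PATH };

enum toy_outcome { OUT_ERESUMED, OUT_ENCLAVE_ABORTED };

/*
 * Model of the signal handler: it only ever sees synchronous faults.
 * Assumption: on a fatal fault it rewrites the saved RIP to an abort
 * path, otherwise it leaves the saved RIP pointing at the AEP.
 */
static enum toy_rip toy_signal_handler(enum toy_event ev, enum toy_rip saved_rip)
{
	assert(ev != EV_IRQ);	/* IRQs never reach the signal handler */
	if (ev == EV_FATAL_FAULT)
		return RIP_ABORT_PATH;
	return saved_rip;
}

/* One asynchronous exit, from the event to what userspace ends up doing. */
static enum toy_outcome toy_async_exit(enum toy_event ev)
{
	/* AEX: synthetic state is loaded and RIP is set to the AEP. */
	enum toy_rip rip = RIP_AEP;

	/* The trap handler runs first; only faults turn into a signal. */
	if (ev != EV_IRQ)
		rip = toy_signal_handler(ev, rip);

	/* Code at the AEP is a hardcoded ERESUME with no decision logic. */
	return rip == RIP_AEP ? OUT_ERESUMED : OUT_ENCLAVE_ABORTED;
}
```

In this model an IRQ and a recoverable fault both end in ERESUME, and
only the fatal fault leaves through the abort path, with the decision
made entirely in the signal handler rather than at the AEP.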

Jumping back a bit, how much do we care about preventing userspace
from doing stupid things?  I did a quick POC on the idea of hardcoding
fixup for the ENCLU opcode, and the basic idea checks out.  The code
is fairly minimal and doesn't impact the core functionality of the SDK.
They'd need to redo their trap handling to move it from the signal
handler to inline, but their stack shenanigans won't be any more broken
than they already are.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:59                                                         ` Sean Christopherson
  2018-11-06 21:59                                                           ` Sean Christopherson
@ 2018-11-06 23:00                                                           ` Andy Lutomirski
  2018-11-06 23:00                                                             ` Andy Lutomirski
  2018-11-06 23:35                                                             ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 23:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella



>> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>> 
>>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
>>>> On Tue, Nov 6, 2018 at 1:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>> 
>>>>> On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>>> 
>>>>> 
>>>>> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
>>>>> True, but what if we have a nasty enclave that writes to memory just
>>>>> below SP *before* decrementing SP?
>>>> Yeah, that would be unfortunate.  If an enclave did this (roughly):
>>>> 
>>>>    1. EENTER
>>>>    2. Hardware sets eenter_hwframe->sp = %sp
>>>>    3. Enclave runs... wants to do out-call
>>>>    4. Enclave sets up parameters:
>>>>        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
>>>>        ...
>>>>    5. Enclave sets eenter_hwframe->sp -= offset
>>>> 
>>>> If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
>>>> was on the stack.  The enclave could easily fix this by moving ->sp first.
>>>> 
>>>> But, this is one of those "fun" parts of the ABI that I think we need to
>>>> talk about.  If we do this, we also basically require that the code
>>>> which handles asynchronous exits must *not* write to the stack.  That's
>>>> not hard because it's typically just a single ERESUME instruction, but
>>>> it *is* a requirement.
>>> I was assuming that the async exit stuff was completely hidden by the API. The AEP code would decide whether the exit got fixed up by the kernel (which may or may not be easy to tell — can the
>>> code even tell without kernel help whether it was, say, an IRQ vs #UD?) and then either do ERESUME or cause sgx_enter_enclave() to return with an appropriate return value.
>> Sean, how does the current SDK AEX handler decide whether to do
>> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
>> like the *CPU* could give a big hint, but I don't see where there is
>> any architectural indication of why the AEX code got called or any
>> obvious way for the user code to know whether the exit was fixed up by
>> the kernel?
> 
> The SDK "unconditionally" does ERESUME at the AEP location, but that's
> bit misleading because its signal handler may muck with the context's
> RIP, e.g. to abort the enclave on a fatal fault.
> 
> On an event/exception from within an enclave, the event is immediately
> delivered after loading synthetic state and changing RIP to the AEP.
> In other words, jamming CPU state is essentially a bunch of vectoring
> ucode preamble, but from software's perspective it's a normal event
> that happens to point at the AEP instead of somewhere in the enclave.
> And because the signals the SDK cares about are all synchronous, the
> SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> resides in its signal handler.  IRQs and whatnot simply trampoline back
> into the enclave.
> 
> Userspace can do something funky instead of ERESUME, but only *after*
> IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> case, after the trap handler has run.
> 
> Jumping back a bit, how much do we care about preventing userspace
> from doing stupid things? 

My general feeling is that userspace should be allowed to do apparently stupid things. For example, as far as the kernel is concerned, Wine and DOSEMU are just user programs that do stupid things. Linux generally tries to provide a reasonably complete view of architectural behavior. This is in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may cause very odd behavior indeed. So magic fixups that do non-architectural things are not so great.

The flip side, of course, is that the architecture is arguably inherently erratic here, and it’s apparently impossible to have an SGX library with sane semantics without some kernel assistance.

So if we can make my straw man API work, perhaps with vDSO or rseq-like help, then the official SDK can use it, but less well behaved programs can still mostly work.  (Modulo Linux’s non-support for EINITTOKEN, of course.)

Thinking about it some more, the major sticking point may be finding the RIP and stack frame of EENTER in the AEP code or in its fixup. The vDSO can’t use TLS without serious hackery.  We could massively abuse WRFSBASE, but that’s really ugly.

(How does the Windows case work?  If there’s an exception after the untrusted stack allocation and before EEXIT and SEH tries to handle it, how does the unwinder figure out where to start?)

>  I did a quick POC on the idea of hardcoding
> fixup for the ENCLU opcode, and the basic idea checks out.  The code
> is fairly minimal and doesn't impact the core functionality of the SDK.
> They'd need to redo their trap handling to move it from the signal
> handler to inline, but their stack shenanigans won't be any more broken
> than they already are.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 19:02                                             ` Andy Lutomirski
  2018-11-06 19:02                                               ` Andy Lutomirski
  2018-11-06 19:22                                               ` Dave Hansen
@ 2018-11-06 23:17                                               ` Rich Felker
  2018-11-06 23:17                                                 ` Rich Felker
  2018-11-06 23:26                                                 ` Sean Christopherson
  2 siblings, 2 replies; 163+ messages in thread
From: Rich Felker @ 2018-11-06 23:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jann Horn, Linus Torvalds,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > I almost feel like the right solution is to call into SGX on its own
> > > private stack or maybe even its own private address space.
> >
> > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > enclave like its own "thread" with its own stack and its own set of
> > registers and context?  That seems like a much more workable model than
> > trying to weave it together with the EENTER context.
> 
> So maybe the API should be, roughly
> 
> sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> host_state *state);
> sgx_exit_reason_t sgx_resume_enclave(same args);
> 
> where host_state is something like:
> 
> struct host_state {
>   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> };
> 
> and the values in host_state explicitly have nothing to do with the
> actual host registers.  So, if you want to use the outcall mechanism,
> you'd allocate some memory, point sp to that memory, call
> sgx_enter_enclave(), and then read that memory to do the outcall.
> 
> Actually implementing this would be distinctly nontrivial, and would
> almost certainly need some degree of kernel help to avoid an explosion
> when a signal gets delivered while we have host_state.sp loaded into
> the actual SP register.  Maybe rseq could help with this?
> 
> The ISA here is IMO not well thought through.
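
For reference, the calling convention proposed in the quoted text can be
sketched in C roughly as follows.  This is only an illustration under
stated assumptions: mock_sgx_enter_enclave() and run_outcall() are
hypothetical names, no ENCLU is executed, and the mock merely imitates an
enclave that pushes one made-up outcall number onto the memory that
host_state.sp points at.  It also assumes an LP64 target such as x86-64
Linux, where unsigned long can hold a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Register image handed to the enclave; unrelated to the caller's real registers. */
struct host_state {
    unsigned long bp, sp, ax, bx, cx, dx, si, di;
};

typedef enum { SGX_EXIT_DONE, SGX_EXIT_OUTCALL } sgx_exit_reason_t;

/*
 * HYPOTHETICAL stand-in for the proposed entry point.  A real version would
 * execute ENCLU; this mock only imitates an enclave that pushes one outcall
 * number onto the stack described by state->sp and then exits.
 */
static sgx_exit_reason_t mock_sgx_enter_enclave(void *enclave,
                                                struct host_state *state)
{
    unsigned long outcall_nr = 42;  /* made-up request number */

    (void)enclave;
    state->sp -= sizeof(outcall_nr);
    memcpy((void *)(uintptr_t)state->sp, &outcall_nr, sizeof(outcall_nr));
    return SGX_EXIT_OUTCALL;
}

/* Do one enter/outcall round trip on a caller-provided scratch buffer. */
static unsigned long run_outcall(unsigned char *scratch, size_t len)
{
    struct host_state st = {0};
    unsigned long nr = 0;

    assert(scratch != NULL && len >= sizeof(nr));
    /* Stacks grow down, so sp starts at the end of the scratch buffer. */
    st.sp = (unsigned long)(uintptr_t)(scratch + len);
    if (mock_sgx_enter_enclave(NULL, &st) == SGX_EXIT_OUTCALL)
        memcpy(&nr, (void *)(uintptr_t)st.sp, sizeof(nr));
    return nr;
}
```

The caller's own stack pointer is never touched here, which is the point
of the proposal; the part the sketch leaves out is the hard one, namely
what happens if a signal arrives while host_state.sp is loaded into the
real SP register.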

Maybe I'm mistaken about some fundamentals here, but my understanding
of SGX is that the whole point is that the host application and the
code running in the enclave are mutually adversarial towards one
another. Do any or all of the proposed protocols here account for this
and fully protect the host application from malicious code in the
enclave? It seems that having control over the register file on exit
from the enclave is fundamentally problematic but I assume there must
be some way I'm missing that this is fixed up.

Rich

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 23:17                                               ` Rich Felker
  2018-11-06 23:17                                                 ` Rich Felker
@ 2018-11-06 23:26                                                 ` Sean Christopherson
  2018-11-06 23:26                                                   ` Sean Christopherson
  2018-11-07 21:27                                                   ` Rich Felker
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-06 23:26 UTC (permalink / raw)
  To: Rich Felker
  Cc: Andy Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > I almost feel like the right solution is to call into SGX on its own
> > > > private stack or maybe even its own private address space.
> > >
> > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > > enclave like its own "thread" with its own stack and its own set of
> > > registers and context?  That seems like a much more workable model than
> > > trying to weave it together with the EENTER context.
> > 
> > So maybe the API should be, roughly
> > 
> > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > host_state *state);
> > sgx_exit_reason_t sgx_resume_enclave(same args);
> > 
> > where host_state is something like:
> > 
> > struct host_state {
> >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > };
> > 
> > and the values in host_state explicitly have nothing to do with the
> > actual host registers.  So, if you want to use the outcall mechanism,
> > you'd allocate some memory, point sp to that memory, call
> > sgx_enter_enclave(), and then read that memory to do the outcall.
> > 
> > Actually implementing this would be distinctly nontrivial, and would
> > almost certainly need some degree of kernel help to avoid an explosion
> > when a signal gets delivered while we have host_state.sp loaded into
> > the actual SP register.  Maybe rseq could help with this?
> > 
> > The ISA here is IMO not well thought through.
> 
> Maybe I'm mistaken about some fundamentals here, but my understanding
> of SGX is that the whole point is that the host application and the
> code running in the enclave are mutually adversarial towards one
> another. Do any or all of the proposed protocols here account for this
> and fully protect the host application from malicious code in the
> enclave? It seems that having control over the register file on exit
> from the enclave is fundamentally problematic but I assume there must
> be some way I'm missing that this is fixed up.

SGX provides protections for the enclave but not the other way around.
The kernel has all of its normal non-SGX protections in place, but the
enclave can certainly wreak havoc on its userspace process.  The basic
design idea is that the enclave is a specialized .so that gets extra
security protections but is still effectively part of the overall
application, e.g. it has full access to its host userspace process'
virtual memory.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 23:00                                                           ` Andy Lutomirski
  2018-11-06 23:00                                                             ` Andy Lutomirski
@ 2018-11-06 23:35                                                             ` Sean Christopherson
  2018-11-06 23:35                                                               ` Sean Christopherson
  2018-11-06 23:39                                                               ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-06 23:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> 
> 
> >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >> 
> >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> >> Sean, how does the current SDK AEX handler decide whether to do
> >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> >> like the *CPU* could give a big hint, but I don't see where there is
> >> any architectural indication of why the AEX code got called or any
> >> obvious way for the user code to know whether the exit was fixed up by
> >> the kernel?
> > 
> > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > a bit misleading because its signal handler may muck with the context's
> > RIP, e.g. to abort the enclave on a fatal fault.
> > 
> > On an event/exception from within an enclave, the event is immediately
> > delivered after loading synthetic state and changing RIP to the AEP.
> > In other words, jamming CPU state is essentially a bunch of vectoring
> > ucode preamble, but from software's perspective it's a normal event
> > that happens to point at the AEP instead of somewhere in the enclave.
> > And because the signals the SDK cares about are all synchronous, the
> > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > into the enclave.
> > 
> > Userspace can do something funky instead of ERESUME, but only *after*
> > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > case, after the trap handler has run.
> > 
> > Jumping back a bit, how much do we care about preventing userspace
> > from doing stupid things? 
> 
> My general feeling is that userspace should be allowed to do apparently
> stupid things. For example, as far as the kernel is concerned, Wine and
> DOSEMU are just user programs that do stupid things. Linux generally tries
> to provide a reasonably complete view of architectural behavior. This is
> in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> cause very odd behavior indeed. So magic fixups that do non-architectural
> things are not so great.

Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
with a specific (ignored) prefix pattern?  I.e. effectively make the magic
fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
that the enclave can EEXIT to immediately after the EENTER location.
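
To make the opt-in idea concrete, here is a minimal C sketch of the check
a trap handler could apply to the bytes at the faulting RIP.  ENCLU is
encoded as 0F 01 D7; everything else is an assumption for illustration:
is_optin_enclu() is a hypothetical name, and 0x3e (the DS segment
override) is only a placeholder for "a specific (ignored) prefix", since
no prefix has been chosen and it is not established here that this byte
is actually ignored on ENCLU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ENCLU is the three-byte opcode 0F 01 D7. */
static const uint8_t ENCLU_OPCODE[3] = { 0x0f, 0x01, 0xd7 };

/* ASSUMPTION: placeholder opt-in prefix; the real choice is undecided. */
#define OPTIN_PREFIX 0x3e

/*
 * Return true only if the instruction bytes are <prefix> followed by ENCLU.
 * A bare ENCLU returns false, so it keeps today's signal-based behavior.
 */
static bool is_optin_enclu(const uint8_t *insn, size_t len)
{
    assert(insn != NULL);
    if (len < 1 + sizeof(ENCLU_OPCODE))
        return false;
    if (insn[0] != OPTIN_PREFIX)
        return false;
    return memcmp(insn + 1, ENCLU_OPCODE, sizeof(ENCLU_OPCODE)) == 0;
}
```

Code that never emits the prefix is unaffected, which is what makes the
fixup opt-in rather than a change to the architectural behavior of ENCLU.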

> (How does the Windows case work?  If there’s an exception after the untrusted
> stack allocation and before EEXIT and SEH tries to handle it, how does the
> unwinder figure out where to start?)

No clue, I'll ask and report back.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 23:35                                                             ` Sean Christopherson
  2018-11-06 23:35                                                               ` Sean Christopherson
@ 2018-11-06 23:39                                                               ` Andy Lutomirski
  2018-11-06 23:39                                                                 ` Andy Lutomirski
  2018-11-07  0:02                                                                 ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-06 23:39 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Andrew Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> >
> >
> > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >>
> > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > >> Sean, how does the current SDK AEX handler decide whether to do
> > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > >> like the *CPU* could give a big hint, but I don't see where there is
> > >> any architectural indication of why the AEX code got called or any
> > >> obvious way for the user code to know whether the exit was fixed up by
> > >> the kernel?
> > >
> > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > a bit misleading because its signal handler may muck with the context's
> > > RIP, e.g. to abort the enclave on a fatal fault.
> > >
> > > On an event/exception from within an enclave, the event is immediately
> > > delivered after loading synthetic state and changing RIP to the AEP.
> > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > ucode preamble, but from software's perspective it's a normal event
> > > that happens to point at the AEP instead of somewhere in the enclave.
> > > And because the signals the SDK cares about are all synchronous, the
> > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > into the enclave.
> > >
> > > Userspace can do something funky instead of ERESUME, but only *after*
> > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > case, after the trap handler has run.
> > >
> > > Jumping back a bit, how much do we care about preventing userspace
> > > from doing stupid things?
> >
> > My general feeling is that userspace should be allowed to do apparently
> > stupid things. For example, as far as the kernel is concerned, Wine and
> > DOSEMU are just user programs that do stupid things. Linux generally tries
> > to provide a reasonably complete view of architectural behavior. This is
> > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > cause very odd behavior indeed. So magic fixups that do non-architectural
> > things are not so great.
>
> Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> that the enclave can EEXIT to immediately after the EENTER location.
>

How does that even work, though?  On an AEX, RIP points to the ERESUME
instruction, not the EENTER instruction, so if we skip it we just end
up in lala land.

How averse would everyone be to making enclave entry be a syscall?
The user code would do sys_sgx_enter_enclave(), and the kernel would
stash away the register state (vm86()-style), point RIP to the vDSO's
ENCLU instruction, point RCX to another vDSO ENCLU instruction, and
SYSRET.  The trap handlers would understand what's going on and
restore register state accordingly.

On non-Meltdown hardware (hah!) this would even be fairly fast.

--Andy

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 23:39                                                               ` Andy Lutomirski
  2018-11-06 23:39                                                                 ` Andy Lutomirski
@ 2018-11-07  0:02                                                                 ` Sean Christopherson
  2018-11-07  0:02                                                                   ` Sean Christopherson
  2018-11-07  1:17                                                                   ` Andy Lutomirski
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-07  0:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> > >
> > >
> > > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > > >>
> > > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > > >> Sean, how does the current SDK AEX handler decide whether to do
> > > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > > >> like the *CPU* could give a big hint, but I don't see where there is
> > > >> any architectural indication of why the AEX code got called or any
> > > >> obvious way for the user code to know whether the exit was fixed up by
> > > >> the kernel?
> > > >
> > > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > > a bit misleading because its signal handler may muck with the context's
> > > > RIP, e.g. to abort the enclave on a fatal fault.
> > > >
> > > > On an event/exception from within an enclave, the event is immediately
> > > > delivered after loading synthetic state and changing RIP to the AEP.
> > > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > > ucode preamble, but from software's perspective it's a normal event
> > > > that happens to point at the AEP instead of somewhere in the enclave.
> > > > And because the signals the SDK cares about are all synchronous, the
> > > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > > into the enclave.
> > > >
> > > > Userspace can do something funky instead of ERESUME, but only *after*
> > > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > > case, after the trap handler has run.
> > > >
> > > > Jumping back a bit, how much do we care about preventing userspace
> > > > from doing stupid things?
> > >
> > > My general feeling is that userspace should be allowed to do apparently
> > > stupid things. For example, as far as the kernel is concerned, Wine and
> > > DOSEMU are just user programs that do stupid things. Linux generally tries
> > > to provide a reasonably complete view of architectural behavior. This is
> > > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > > cause very odd behavior indeed. So magic fixups that do non-architectural
> > > things are not so great.
> >
> > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > that the enclave can EEXIT to immediately after the EENTER location.
> >
> 
> How does that even work, though?  On an AEX, RIP points to the ERESUME
> instruction, not the EENTER instruction, so if we skip it we just end
> up in lala land.

Userspace would obviously need to be aware of the fixup behavior, but
it actually works out fairly nicely to have a separate path for ERESUME
fixup since a fault on EENTER is generally fatal, whereas a fault on
ERESUME might be recoverable.


do_eenter:
    mov     tcs, %rbx
    lea     async_exit, %rcx 
    mov     $EENTER, %rax
    ENCLU

/*
 * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
 * fault indicator, e.g. -EFAULT.
 */
eexit_or_eenter_fault:
    ret

async_exit:
    ENCLU

fixup_handler:
    <do fault stuff>
 
> How averse would everyone be to making enclave entry be a syscall?
> The user code would do sys_sgx_enter_enclave(), and the kernel would
> stash away the register state (vm86()-style), point RIP to the vDSO's
> ENCLU instruction, point RCX to another vDSO ENCLU instruction, and
> SYSRET.  The trap handlers would understand what's going on and
> restore register state accordingly.

Wouldn't that blast away any stack changes made by the enclave?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07  0:02                                                                 ` Sean Christopherson
  2018-11-07  0:02                                                                   ` Sean Christopherson
@ 2018-11-07  1:17                                                                   ` Andy Lutomirski
  2018-11-07  1:17                                                                     ` Andy Lutomirski
                                                                                       ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-07  1:17 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Andrew Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> > On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> > > >
> > > >
> > > > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > > > >>
> > > > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > > > >> Sean, how does the current SDK AEX handler decide whether to do
> > > > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > > > >> like the *CPU* could give a big hint, but I don't see where there is
> > > > >> any architectural indication of why the AEX code got called or any
> > > > >> obvious way for the user code to know whether the exit was fixed up by
> > > > >> the kernel?
> > > > >
> > > > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > > > > a bit misleading because its signal handler may muck with the context's
> > > > > RIP, e.g. to abort the enclave on a fatal fault.
> > > > >
> > > > > On an event/exception from within an enclave, the event is immediately
> > > > > delivered after loading synthetic state and changing RIP to the AEP.
> > > > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > > > ucode preamble, but from software's perspective it's a normal event
> > > > > that happens to point at the AEP instead of somewhere in the enclave.
> > > > > And because the signals the SDK cares about are all synchronous, the
> > > > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > > > into the enclave.
> > > > >
> > > > > Userspace can do something funky instead of ERESUME, but only *after*
> > > > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > > > case, after the trap handler has run.
> > > > >
> > > > > Jumping back a bit, how much do we care about preventing userspace
> > > > > from doing stupid things?
> > > >
> > > > My general feeling is that userspace should be allowed to do apparently
> > > > stupid things. For example, as far as the kernel is concerned, Wine and
> > > > DOSEMU are just user programs that do stupid things. Linux generally tries
> > > > to provide a reasonably complete view of architectural behavior. This is
> > > > > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > > > cause very odd behavior indeed. So magic fixups that do non-architectural
> > > > things are not so great.
> > >
> > > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > > that the enclave can EEXIT to immediately after the EENTER location.
> > >
> >
> > How does that even work, though?  On an AEX, RIP points to the ERESUME
> > instruction, not the EENTER instruction, so if we skip it we just end
> > up in lala land.
>
> Userspace would obviously need to be aware of the fixup behavior, but
> it actually works out fairly nicely to have a separate path for ERESUME
> fixup since a fault on EENTER is generally fatal, whereas a fault on
> ERESUME might be recoverable.
>

Hmm.

>
> do_eenter:
>     mov     tcs, %rbx
>     lea     async_exit, %rcx
>     mov     $EENTER, %rax
>     ENCLU

Or SOME_SILLY_PREFIX ENCLU?

>
> /*
>  * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
>  * fault indicator, e.g. -EFAULT.
>  */
> eexit_or_eenter_fault:
>     ret

But userspace wants to know whether it was a fault or not.  So I think
we either need two landing pads or we need to hijack a flag bit (are
there any known-zeroed flag bits after EEXIT?) to say whether it was a
fault.  And, if it was a fault, we should give the vector, the
sanitized error code, and possibly CR2.

>
> async_exit:
>     ENCLU

Same prefix here, right?

>
> fixup_handler:
>     <do fault stuff>

This whole thing is a bit odd, but not necessarily a terrible idea.

>
> > How averse would everyone be to making enclave entry be a syscall?
> > The user code would do sys_sgx_enter_enclave(), and the kernel would
> > stash away the register state (vm86()-style), point RIP to the vDSO's
> > ENCLU instruction, point RCX to another vDSO ENCLU instruction, and
> > SYSRET.  The trap handlers would understand what's going on and
> > restore register state accordingly.
>
> Wouldn't that blast away any stack changes made by the enclave?

Yes, but I was imagining that it would stash the registers into the
struct host_state thing I made up :)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07  1:17                                                                   ` Andy Lutomirski
  2018-11-07  1:17                                                                     ` Andy Lutomirski
@ 2018-11-07  6:47                                                                     ` Jethro Beekman
  2018-11-07  6:47                                                                       ` Jethro Beekman
  2018-11-07 15:34                                                                     ` Sean Christopherson
  2 siblings, 1 reply; 163+ messages in thread
From: Jethro Beekman @ 2018-11-07  6:47 UTC (permalink / raw)
  To: Andy Lutomirski, Christopherson, Sean J
  Cc: Dave Hansen, Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On 2018-11-07 02:17, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
>>
>> /*
>>   * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
>>   * fault indicator, e.g. -EFAULT.
>>   */
>> eexit_or_eenter_fault:
>>      ret
> 
> But userspace wants to know whether it was a fault or not.  So I think
> we either need two landing pads or we need to hijack a flag bit (are
> there any known-zeroed flag bits after EEXIT?) to say whether it was a
> fault.  And, if it was a fault, we should give the vector, the
> sanitized error code, and possibly CR2.

On AEX, %rax will contain ENCLU_LEAF_ERESUME (0x3). On EEXIT, %rax will 
contain ENCLU_LEAF_EEXIT (0x4).
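
A minimal C sketch of how userspace could tell the two cases apart from the
value left in %rax. The enum and function names are hypothetical and exist
only for illustration; the leaf numbers (0x3 and 0x4) are the ones given
above.

```c
#include <stdint.h>

/* ENCLU leaf numbers as quoted above. */
#define ENCLU_LEAF_ERESUME 0x3u
#define ENCLU_LEAF_EEXIT   0x4u

/* Hypothetical classification of how control left the enclave. */
enum enclave_exit_kind {
	EXIT_KIND_AEX,     /* asynchronous exit: %rax holds the ERESUME leaf */
	EXIT_KIND_EEXIT,   /* the enclave executed EEXIT */
	EXIT_KIND_UNKNOWN, /* any other value, e.g. a kernel-supplied fault code */
};

/* Map the %rax value seen after leaving the enclave to an exit kind. */
static enum enclave_exit_kind classify_exit(uint64_t rax)
{
	switch (rax) {
	case ENCLU_LEAF_ERESUME:
		return EXIT_KIND_AEX;
	case ENCLU_LEAF_EEXIT:
		return EXIT_KIND_EEXIT;
	default:
		return EXIT_KIND_UNKNOWN;
	}
}
```

The third case is kept separate so that a fault code written by the kernel
into %rax is not mistaken for either architectural value.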

--
Jethro Beekman | Fortanix


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3990 bytes --]

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07  1:17                                                                   ` Andy Lutomirski
  2018-11-07  1:17                                                                     ` Andy Lutomirski
  2018-11-07  6:47                                                                     ` Jethro Beekman
@ 2018-11-07 15:34                                                                     ` Sean Christopherson
  2018-11-07 15:34                                                                       ` Sean Christopherson
  2018-11-07 19:01                                                                       ` Sean Christopherson
  2 siblings, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-07 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 05:17:14PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> > > On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > > >
> > > > On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > > > > >>
> > > > > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > > > > >> Sean, how does the current SDK AEX handler decide whether to do
> > > > > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > > > > >> like the *CPU* could give a big hint, but I don't see where there is
> > > > > >> any architectural indication of why the AEX code got called or any
> > > > > >> obvious way for the user code to know whether the exit was fixed up by
> > > > > >> the kernel?
> > > > > >
> > > > > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > > > > > a bit misleading because its signal handler may muck with the context's
> > > > > > RIP, e.g. to abort the enclave on a fatal fault.
> > > > > >
> > > > > > On an event/exception from within an enclave, the event is immediately
> > > > > > delivered after loading synthetic state and changing RIP to the AEP.
> > > > > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > > > > ucode preamble, but from software's perspective it's a normal event
> > > > > > that happens to point at the AEP instead of somewhere in the enclave.
> > > > > > And because the signals the SDK cares about are all synchronous, the
> > > > > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > > > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > > > > into the enclave.
> > > > > >
> > > > > > Userspace can do something funky instead of ERESUME, but only *after*
> > > > > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > > > > case, after the trap handler has run.
> > > > > >
> > > > > > Jumping back a bit, how much do we care about preventing userspace
> > > > > > from doing stupid things?
> > > > >
> > > > > My general feeling is that userspace should be allowed to do apparently
> > > > > stupid things. For example, as far as the kernel is concerned, Wine and
> > > > > DOSEMU are just user programs that do stupid things. Linux generally tries
> > > > > to provide a reasonably complete view of architectural behavior. This is
> > > > > > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > > > > cause very odd behavior indeed. So magic fixups that do non-architectural
> > > > > things are not so great.
> > > >
> > > > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > > > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > > > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > > > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > > > that the enclave can EEXIT to immediately after the EENTER location.
> > > >
> > >
> > > How does that even work, though?  On an AEX, RIP points to the ERESUME
> > > instruction, not the EENTER instruction, so if we skip it we just end
> > > up in lala land.
> >
> > Userspace would obviously need to be aware of the fixup behavior, but
> > it actually works out fairly nicely to have a separate path for ERESUME
> > fixup since a fault on EENTER is generally fatal, whereas a fault on
> > ERESUME might be recoverable.
> >
> 
> Hmm.
> 
> >
> > do_eenter:
> >     mov     tcs, %rbx
> >     lea     async_exit, %rcx
> >     mov     $EENTER, %rax
> >     ENCLU
> 
> Or SOME_SILLY_PREFIX ENCLU?

Yeah, forgot to include that.

> >
> > /*
> >  * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
> >  * fault indicator, e.g. -EFAULT.
> >  */
> > eexit_or_eenter_fault:
> >     ret
> 
> But userspace wants to know whether it was a fault or not.  So I think
> we either need two landing pads or we need to hijack a flag bit (are
> there any known-zeroed flag bits after EEXIT?) to say whether it was a
> fault.  And, if it was a fault, we should give the vector, the
> sanitized error code, and possibly CR2.

As Jethro mentioned, RAX will always be 4 on a successful EEXIT, so we
can use RAX to indicate a fault.  That's what I was trying to imply with
EFAULT.  Here's the reg stuffing I use for the POC:

	regs->ax = EFAULT;
	regs->di = trapnr;
	regs->si = error_code;
	regs->dx = address;


Well-known RAX values also mean the kernel fault handlers only need to
look for SOME_SILLY_PREFIX ENCLU if RAX==2 || RAX==3, i.e. the fault
occurred on EENTER or in an enclave (RAX is set to ERESUME's leaf as
part of the asynchronous enclave exit flow).
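
The convention above could be sketched in C roughly as follows. This is an
illustration under assumptions, not the POC itself: every struct, enum and
function name is hypothetical, the register-to-field mapping simply mirrors
the reg stuffing shown above, and the fault marker is taken to be a positive
EFAULT from <errno.h>, as in that snippet.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define ENCLU_LEAF_EENTER  2u
#define ENCLU_LEAF_ERESUME 3u
#define ENCLU_LEAF_EEXIT   4u

/*
 * Kernel-side filter (hypothetical helper): only a fault taken with one of
 * these RAX values can have come from EENTER or from inside an enclave, so
 * only then is it worth checking for the prefixed ENCLU.
 */
static bool rax_may_need_enclu_fixup(uint64_t rax)
{
	return rax == ENCLU_LEAF_EENTER || rax == ENCLU_LEAF_ERESUME;
}

/* Hypothetical snapshot of the registers userspace sees at the landing pad. */
struct enclu_regs {
	uint64_t rax, rdi, rsi, rdx;
};

/* Fault details, mirroring regs->di, regs->si and regs->dx above. */
struct enclu_fault {
	uint64_t trapnr, error_code, address;
};

enum enclu_status {
	ENCLU_STATUS_EEXIT,      /* RAX == 4: the enclave exited normally */
	ENCLU_STATUS_FAULT,      /* RAX == EFAULT: the kernel fixed up a fault */
	ENCLU_STATUS_UNEXPECTED, /* anything else */
};

/* Decode the landing-pad registers; *fault is filled only on a fault. */
static enum enclu_status decode_enclu_regs(const struct enclu_regs *regs,
					   struct enclu_fault *fault)
{
	assert(regs && fault);

	if (regs->rax == ENCLU_LEAF_EEXIT)
		return ENCLU_STATUS_EEXIT;
	if (regs->rax != EFAULT)
		return ENCLU_STATUS_UNEXPECTED;

	fault->trapnr = regs->rdi;
	fault->error_code = regs->rsi;
	fault->address = regs->rdx;
	return ENCLU_STATUS_FAULT;
}
```

Because EFAULT differs from every ENCLU leaf value in question, a single
landing pad can distinguish a normal EEXIT from a fixed-up fault by looking
at RAX alone.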

> >
> > async_exit:
> >     ENCLU
> 
> Same prefix here, right?
> 
> >
> > fixup_handler:
> >     <do fault stuff>
> 
> This whole thing is a bit odd, but not necessarily a terrible idea.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 15:34                                                                     ` Sean Christopherson
@ 2018-11-07 15:34                                                                       ` Sean Christopherson
  2018-11-07 19:01                                                                       ` Sean Christopherson
  1 sibling, 0 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-07 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 05:17:14PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> > > On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > > >
> > > > On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > > > > >>
> > > > > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > > > > >> Sean, how does the current SDK AEX handler decide whether to do
> > > > > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > > > > >> like the *CPU* could give a big hint, but I don't see where there is
> > > > > >> any architectural indication of why the AEX code got called or any
> > > > > >> obvious way for the user code to know whether the exit was fixed up by
> > > > > >> the kernel?
> > > > > >
> > > > > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > > > > > a bit misleading because its signal handler may muck with the context's
> > > > > > RIP, e.g. to abort the enclave on a fatal fault.
> > > > > >
> > > > > > On an event/exception from within an enclave, the event is immediately
> > > > > > delivered after loading synthetic state and changing RIP to the AEP.
> > > > > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > > > > ucode preamble, but from software's perspective it's a normal event
> > > > > > that happens to point at the AEP instead of somewhere in the enclave.
> > > > > > And because the signals the SDK cares about are all synchronous, the
> > > > > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > > > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > > > > into the enclave.
> > > > > >
> > > > > > Userspace can do something funky instead of ERESUME, but only *after*
> > > > > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > > > > case, after the trap handler has run.
> > > > > >
> > > > > > Jumping back a bit, how much do we care about preventing userspace
> > > > > > from doing stupid things?
> > > > >
> > > > > My general feeling is that userspace should be allowed to do apparently
> > > > > stupid things. For example, as far as the kernel is concerned, Wine and
> > > > > DOSEMU are just user programs that do stupid things. Linux generally tries
> > > > > to provide a reasonably complete view of architectural behavior. This is
> > > > > > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > > > > cause very odd behavior indeed. So magic fixups that do non-architectural
> > > > > things are not so great.
> > > >
> > > > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > > > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > > > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > > > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > > > that the enclave can EEXIT to immediately after the EENTER location.
> > > >
> > >
> > > How does that even work, though?  On an AEX, RIP points to the ERESUME
> > > instruction, not the EENTER instruction, so if we skip it we just end
> > > up in lala land.
> >
> > Userspace would obviously need to be aware of the fixup behavior, but
> > it actually works out fairly nicely to have a separate path for ERESUME
> > fixup since a fault on EENTER is generally fatal, whereas a fault on
> > ERESUME might be recoverable.
> >
> 
> Hmm.
> 
> >
> > do_eenter:
> >     mov     tcs, %rbx
> >     lea     async_exit, %rcx
> >     mov     $EENTER, %rax
> >     ENCLU
> 
> Or SOME_SILLY_PREFIX ENCLU?

Yeah, forgot to include that.

> >
> > /*
> >  * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
> >  * fault indicator, e.g. -EFAULT.
> >  */
> > eexit_or_eenter_fault:
> >     ret
> 
> But userspace wants to know whether it was a fault or not.  So I think
> we either need two landing pads or we need to hijack a flag bit (are
> there any known-zeroed flag bits after EEXIT?) to say whether it was a
> fault.  And, if it was a fault, we should give the vector, the
> sanitized error code, and possibly CR2.

As Jethro mentioned, RAX will always be 4 on a successful EEXIT, so we
can use RAX to indicate a fault.  That's what I was trying to imply with
EFAULT.  Here's the reg stuffing I use for the POC:

	regs->ax = EFAULT;
	regs->di = trapnr;
	regs->si = error_code;
	regs->dx = address;


Well-known RAX values also means the kernel fault handlers only need to
look for SOME_SILLY_PREFIX ENCLU if RAX==2 || RAX==3, i.e. the fault
occurred on EENTER or in an enclave (RAX is set to ERESUME's leaf as
part of the asynchronous enclave exit flow).
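
To make the register convention above concrete, here is a minimal
userspace sketch of how a caller might tell a clean EEXIT from a fixed-up
fault once control returns past ENCLU.  It assumes the POC layout quoted
above (RAX = EFAULT, RDI = trapnr, RSI = error_code, RDX = address); the
struct and function names are hypothetical and not part of any SDK or
kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* ENCLU leaf numbers as used in this thread: EENTER=2, ERESUME=3, EEXIT=4. */
enum { SGX_EENTER = 2, SGX_ERESUME = 3, SGX_EEXIT = 4 };

/* Hypothetical snapshot of the registers userspace sees after ENCLU. */
struct enclu_result {
	uint64_t rax;	/* EEXIT leaf (4) on success, EFAULT on fixup */
	uint64_t rdi;	/* trap number, valid only on fault */
	uint64_t rsi;	/* error code, valid only on fault */
	uint64_t rdx;	/* faulting address, valid only on fault */
};

enum enclu_outcome { ENCLU_EXITED, ENCLU_FAULTED, ENCLU_UNKNOWN };

/* Classify the outcome purely from RAX, per the convention above. */
static enum enclu_outcome enclu_classify(const struct enclu_result *r)
{
	assert(r);
	if (r->rax == SGX_EEXIT)
		return ENCLU_EXITED;
	if (r->rax == EFAULT)
		return ENCLU_FAULTED;
	return ENCLU_UNKNOWN;
}
```

This only works because EFAULT (14 on Linux) cannot collide with the
EEXIT leaf value of 4; any other fault indicator would need the same
property.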

> >
> > async_exit:
> >     ENCLU
> 
> Same prefix here, right?
> 
> >
> > fixup_handler:
> >     <do fault stuff>
> 
> This whole thing is a bit odd, but not necessarily a terrible idea.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 15:34                                                                     ` Sean Christopherson
  2018-11-07 15:34                                                                       ` Sean Christopherson
@ 2018-11-07 19:01                                                                       ` Sean Christopherson
  2018-11-07 19:01                                                                         ` Sean Christopherson
  2018-11-07 20:56                                                                         ` Dave Hansen
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-07 19:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Wed, Nov 07, 2018 at 07:34:52AM -0800, Sean Christopherson wrote:
> On Tue, Nov 06, 2018 at 05:17:14PM -0800, Andy Lutomirski wrote:
> > On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> > > > <sean.j.christopherson@intel.com> wrote:
> > > > >
> > > > > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > > > > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > > > > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > > > > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > > > > that the enclave can EEXIT to immediately after the EENTER location.
> > > > >
> > > >
> > > > How does that even work, though?  On an AEX, RIP points to the ERESUME
> > > > instruction, not the EENTER instruction, so if we skip it we just end
> > > > up in lala land.
> > >
> > > Userspace would obviously need to be aware of the fixup behavior, but
> > > it actually works out fairly nicely to have a separate path for ERESUME
> > > > fixup since a fault on EENTER is generally fatal, whereas a fault on
> > > ERESUME might be recoverable.
> > >
> > 
> > Hmm.
> > 
> > >
> > > do_eenter:
> > >     mov     tcs, %rbx
> > >     lea     async_exit, %rcx
> > >     mov     $EENTER, %rax
> > >     ENCLU
> > 
> > Or SOME_SILLY_PREFIX ENCLU?
> 
> Yeah, forgot to include that.
> 
> > >
> > > /*
> > >  * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
> > >  * fault indicator, e.g. -EFAULT.
> > >  */
> > > eexit_or_eenter_fault:
> > >     ret
> > 
> > But userspace wants to know whether it was a fault or not.  So I think
> > we either need two landing pads or we need to hijack a flag bit (are
> > there any known-zeroed flag bits after EEXIT?) to say whether it was a
> > fault.  And, if it was a fault, we should give the vector, the
> > sanitized error code, and possibly CR2.
> 
> As Jethro mentioned, RAX will always be 4 on a successful EEXIT, so we
> can use RAX to indicate a fault.  That's what I was trying to imply with
> EFAULT.  Here's the reg stuffing I use for the POC:
> 
> 	regs->ax = EFAULT;
> 	regs->di = trapnr;
> 	regs->si = error_code;
> 	regs->dx = address;
> 
> 
> Well-known RAX values also means the kernel fault handlers only need to
> look for SOME_SILLY_PREFIX ENCLU if RAX==2 || RAX==3, i.e. the fault
> occurred on EENTER or in an enclave (RAX is set to ERESUME's leaf as
> part of the asynchronous enclave exit flow).

POC kernel code, 64-bit only.

Limiting this to 64-bit isn't necessary, but it makes the code prettier
and allows using REX as the magic prefix.  I like the idea of using REX
because it seems least likely to be repurposed for yet another new
feature.  I have no idea if 64-bit only will fly with the SDK folks.

Going off comments in similar code related to UMIP, we'd need to figure
out how to handle protection keys.


/* REX with all bits set, ignored by ENCLU. */
#define SGX_DO_ENCLU_FIXUP	0x4F

#define SGX_ENCLU_OPCODE0	0x0F
#define SGX_ENCLU_OPCODE1	0x01
#define SGX_ENCLU_OPCODE2	0xD7

/* ENCLU is a three-byte opcode, plus one byte for the magic prefix. */
#define SGX_ENCLU_FIXUP_INSN_LEN	4

static int sgx_detect_enclu(struct pt_regs *regs)
{
	unsigned char buf[SGX_ENCLU_FIXUP_INSN_LEN];

	/* Look for EENTER or ERESUME in RAX, 64-bit mode only. */
	if (!regs || (regs->ax != 2 && regs->ax != 3) || !user_64bit_mode(regs))
		return 0;

	if (copy_from_user(buf, (void __user *)(regs->ip), sizeof(buf)))
		return 0;

	if (buf[0] == SGX_DO_ENCLU_FIXUP &&
	    buf[1] == SGX_ENCLU_OPCODE0 &&
	    buf[2] == SGX_ENCLU_OPCODE1 &&
	    buf[3] == SGX_ENCLU_OPCODE2)
		return SGX_ENCLU_FIXUP_INSN_LEN;

	return 0;
}

bool sgx_fixup_enclu_fault(struct pt_regs *regs, int trapnr,
			   unsigned long error_code, unsigned long address)
{
	int insn_len;

	insn_len = sgx_detect_enclu(regs);
	if (!insn_len)
		return false;

	regs->ip += insn_len;
	regs->ax = EFAULT;
	regs->di = trapnr;
	regs->si = error_code;
	regs->dx = address;
	return true;
}
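
For anyone who wants to poke at the matching logic outside the kernel,
the byte comparison can be sketched as a plain userspace function.  This
is an illustration only: it takes a buffer and length instead of doing
copy_from_user() on regs->ip, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* REX prefix with all bits set (0x4F) followed by ENCLU (0F 01 D7). */
static const unsigned char fixup_enclu[] = { 0x4F, 0x0F, 0x01, 0xD7 };

/*
 * Hypothetical helper: returns true if the first bytes of 'code' are the
 * opt-in "REX ENCLU" pattern.  A bare ENCLU without the prefix does not
 * match, so it would fall back to normal signal delivery.
 */
static bool is_fixup_enclu(const unsigned char *code, size_t len)
{
	assert(code);
	return len >= sizeof(fixup_enclu) &&
	       memcmp(code, fixup_enclu, sizeof(fixup_enclu)) == 0;
}
```

On the userspace side, emitting the pattern would presumably be a matter
of `.byte 0x4f, 0x0f, 0x01, 0xd7` in the assembly stub.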

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 19:01                                                                       ` Sean Christopherson
  2018-11-07 19:01                                                                         ` Sean Christopherson
@ 2018-11-07 20:56                                                                         ` Dave Hansen
  2018-11-07 20:56                                                                           ` Dave Hansen
  2018-11-08 15:04                                                                           ` Jarkko Sakkinen
  1 sibling, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-07 20:56 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/7/18 11:01 AM, Sean Christopherson wrote:
> Going off comments in similar code related to UMIP, we'd need to figure
> out how to handle protection keys.

There are two options:
1. Don't depend on the userspace mapping.  Do get_user_pages() to find
   the instruction in the kernel direct map, and use that.
2. Do a WRPKRU that allows read access, do the read, then put PKRU back.
   This is a pain because of preemption and all that jazz.

Right now, we just let the prefetch instruction detection fail if you
mark it unreadable with pkeys.  Tough cookies, basically.  But, that's
just the kernel being nice.  Here you need the read for functionality, so
it's tougher.
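
As a rough illustration of the bit arithmetic behind option 2 (and only
that), the sketch below computes a PKRU value with read access enabled
for one key.  It assumes the architectural PKRU layout of two bits per
key (access-disable in the low bit, write-disable in the high bit); the
macro and function names are hypothetical, and the WRPKRU itself plus
the preemption handling are deliberately not shown.

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD_BIT		0x1u	/* access disable */
#define PKRU_WD_BIT		0x2u	/* write disable */
#define PKRU_BITS_PER_PKEY	2

/*
 * Hypothetical helper: return 'pkru' with the access-disable bit cleared
 * for 'pkey', leaving its write-disable bit and all other keys untouched.
 */
static uint32_t pkru_allow_read(uint32_t pkru, int pkey)
{
	assert(pkey >= 0 && pkey < 16);
	return pkru & ~(PKRU_AD_BIT << (pkey * PKRU_BITS_PER_PKEY));
}
```

The caller would save the old PKRU, write the relaxed value, read the
instruction bytes, and then restore the saved value.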

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 23:26                                                 ` Sean Christopherson
  2018-11-06 23:26                                                   ` Sean Christopherson
@ 2018-11-07 21:27                                                   ` Rich Felker
  2018-11-07 21:27                                                     ` Rich Felker
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Rich Felker @ 2018-11-07 21:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Tue, Nov 06, 2018 at 03:26:16PM -0800, Sean Christopherson wrote:
> On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> > On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > > >
> > > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > > I almost feel like the right solution is to call into SGX on its own
> > > > > private stack or maybe even its own private address space.
> > > >
> > > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > > > enclave like its own "thread" with its own stack and its own set of
> > > > registers and context?  That seems like a much more workable model than
> > > > trying to weave it together with the EENTER context.
> > > 
> > > So maybe the API should be, roughly
> > > 
> > > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > > host_state *state);
> > > sgx_exit_reason_t sgx_resume_enclave(same args);
> > > 
> > > where host_state is something like:
> > > 
> > > struct host_state {
> > >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > > };
> > > 
> > > and the values in host_state explicitly have nothing to do with the
> > > actual host registers.  So, if you want to use the outcall mechanism,
> > > you'd allocate some memory, point sp to that memory, call
> > > sgx_enter_enclave(), and then read that memory to do the outcall.
> > > 
> > > Actually implementing this would be distinctly nontrivial, and would
> > > almost certainly need some degree of kernel help to avoid an explosion
> > > when a signal gets delivered while we have host_state.sp loaded into
> > > the actual SP register.  Maybe rseq could help with this?
> > > 
> > > The ISA here is IMO not well thought through.
> > 
> > Maybe I'm mistaken about some fundamentals here, but my understanding
> > of SGX is that the whole point is that the host application and the
> > code running in the enclave are mutually adversarial towards one
> > another. Do any or all of the proposed protocols here account for this
> > and fully protect the host application from malicious code in the
> > enclave? It seems that having control over the register file on exit
> > from the enclave is fundamentally problematic but I assume there must
> > be some way I'm missing that this is fixed up.
> 
> SGX provides protections for the enclave but not the other way around.
> The kernel has all of its normal non-SGX protections in place, but the
> enclave can certainly wreak havoc on its userspace process.  The basic
> design idea is that the enclave is a specialized .so that gets extra
> security protections but is still effectively part of the overall
> application, e.g. it has full access to its host userspace process'
> virtual memory.

In that case it seems like the only way to use SGX that's not a gaping
security hole is to run the SGX enclave in its own fully-seccomp (or
equivalent) process, with no host application in the same address
space. Since the host application can't see the contents of the
enclave to make any determination of whether it's safe to run, running
it in the same address space only makes sense if the cpu provides
protection against unwanted accesses to the host's memory from the
enclave -- and according to you, it doesn't.

Rich

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 21:27                                                   ` Rich Felker
  2018-11-07 21:27                                                     ` Rich Felker
@ 2018-11-07 21:33                                                     ` Andy Lutomirski
  2018-11-07 21:33                                                       ` Andy Lutomirski
  2018-11-07 21:40                                                     ` Sean Christopherson
  2 siblings, 1 reply; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-07 21:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: Christopherson, Sean J, Andrew Lutomirski, Dave Hansen,
	Jann Horn, Linus Torvalds, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Wed, Nov 7, 2018 at 1:28 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Nov 06, 2018 at 03:26:16PM -0800, Sean Christopherson wrote:
> > On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> > > On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >
> > > > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > > > I almost feel like the right solution is to call into SGX on its own
> > > > > > private stack or maybe even its own private address space.
> > > > >
> > > > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > > > > enclave like its own "thread" with its own stack and its own set of
> > > > > registers and context?  That seems like a much more workable model than
> > > > > trying to weave it together with the EENTER context.
> > > >
> > > > So maybe the API should be, roughly
> > > >
> > > > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > > > host_state *state);
> > > > sgx_exit_reason_t sgx_resume_enclave(same args);
> > > >
> > > > where host_state is something like:
> > > >
> > > > struct host_state {
> > > >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > > > };
> > > >
> > > > and the values in host_state explicitly have nothing to do with the
> > > > actual host registers.  So, if you want to use the outcall mechanism,
> > > > you'd allocate some memory, point sp to that memory, call
> > > > sgx_enter_enclave(), and then read that memory to do the outcall.
> > > >
> > > > Actually implementing this would be distinctly nontrivial, and would
> > > > almost certainly need some degree of kernel help to avoid an explosion
> > > > when a signal gets delivered while we have host_state.sp loaded into
> > > > the actual SP register.  Maybe rseq could help with this?
> > > >
> > > > The ISA here is IMO not well thought through.
> > >
> > > Maybe I'm mistaken about some fundamentals here, but my understanding
> > > of SGX is that the whole point is that the host application and the
> > > code running in the enclave are mutually adversarial towards one
> > > another. Do any or all of the proposed protocols here account for this
> > > and fully protect the host application from malicious code in the
> > > enclave? It seems that having control over the register file on exit
> > > from the enclave is fundamentally problematic but I assume there must
> > > be some way I'm missing that this is fixed up.
> >
> > SGX provides protections for the enclave but not the other way around.
> > The kernel has all of its normal non-SGX protections in place, but the
> > enclave can certainly wreak havoc on its userspace process.  The basic
> > design idea is that the enclave is a specialized .so that gets extra
> > security protections but is still effectively part of the overall
> > application, e.g. it has full access to its host userspace process'
> > virtual memory.
>
> In that case it seems like the only way to use SGX that's not a gaping
> security hole is to run the SGX enclave in its own fully-seccomp (or
> equivalent) process, with no host application in the same address
> space. Since the host application can't see the contents of the
> enclave to make any determination of whether it's safe to run, running
> it in the same address space only makes sense if the cpu provides
> protection against unwanted accesses to the host's memory from the
> enclave -- and according to you, it doesn't.
>

I think the theory is that the enclave is shipped with the host application.

That being said, a way to run the enclave in an address space that has
basically nothing else (except an ENCLU instruction as a trampoline)
would be quite nice.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 21:33                                                     ` Andy Lutomirski
@ 2018-11-07 21:33                                                       ` Andy Lutomirski
  0 siblings, 0 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-07 21:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: Christopherson, Sean J, Andrew Lutomirski, Dave Hansen,
	Jann Horn, Linus Torvalds, Dave Hansen, Jethro Beekman,
	Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch,
	LKML, Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Wed, Nov 7, 2018 at 1:28 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Nov 06, 2018 at 03:26:16PM -0800, Sean Christopherson wrote:
> > On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> > > On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >
> > > > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > > > I almost feel like the right solution is to call into SGX on its own
> > > > > > private stack or maybe even its own private address space.
> > > > >
> > > > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > > > > enclave like its own "thread" with its own stack and its own set of
> > > > > registers and context?  That seems like a much more workable model than
> > > > > trying to weave it together with the EENTER context.
> > > >
> > > > So maybe the API should be, roughly
> > > >
> > > > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > > > host_state *state);
> > > > sgx_exit_reason_t sgx_resume_enclave(same args);
> > > >
> > > > where host_state is something like:
> > > >
> > > > struct host_state {
> > > >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > > > };
> > > >
> > > > and the values in host_state explicitly have nothing to do with the
> > > > actual host registers.  So, if you want to use the outcall mechanism,
> > > > you'd allocate some memory, point sp to that memory, call
> > > > sgx_enter_enclave(), and then read that memory to do the outcall.
> > > >
> > > > Actually implementing this would be distinctly nontrivial, and would
> > > > almost certainly need some degree of kernel help to avoid an explosion
> > > > when a signal gets delivered while we have host_state.sp loaded into
> > > > the actual SP register.  Maybe rseq could help with this?
> > > >
> > > > The ISA here is IMO not well thought through.
> > >
> > > Maybe I'm mistaken about some fundamentals here, but my understanding
> > > of SGX is that the whole point is that the host application and the
> > > code running in the enclave are mutually adversarial towards one
> > > another. Do any or all of the proposed protocols here account for this
> > > and fully protect the host application from malicious code in the
> > > enclave? It seems that having control over the register file on exit
> > > from the enclave is fundamentally problematic but I assume there must
> > > be some way I'm missing that this is fixed up.
> >
> > SGX provides protections for the enclave but not the other way around.
> > The kernel has all of its normal non-SGX protections in place, but the
> > enclave can certainly wreak havoc on its userspace process.  The basic
> > design idea is that the enclave is a specialized .so that gets extra
> > security protections but is still effectively part of the overall
> > application, e.g. it has full access to its host userspace process'
> > virtual memory.
>
> In that case it seems like the only way to use SGX that's not a gaping
> security hole is to run the SGX enclave in its own fully-seccomp (or
> equivalent) process, with no host application in the same address
> space. Since the host application can't see the contents of the
> enclave to make any determination of whether it's safe to run, running
> it in the same address space only makes sense if the cpu provides
> protection against unwanted accesses to the host's memory from the
> enclave -- and according to you, it doesn't.
>

I think the theory is that the enclave is shipped with the host application.

That being said, a way to run the enclave in an address space that has
basically nothing else (except an ENCLU instruction as a trampoline)
would be quite nice.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 21:27                                                   ` Rich Felker
  2018-11-07 21:27                                                     ` Rich Felker
  2018-11-07 21:33                                                     ` Andy Lutomirski
@ 2018-11-07 21:40                                                     ` Sean Christopherson
  2018-11-07 21:40                                                       ` Sean Christopherson
  2018-11-08 15:11                                                       ` Jarkko Sakkinen
  2 siblings, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-07 21:40 UTC (permalink / raw)
  To: Rich Felker
  Cc: Andy Lutomirski, Dave Hansen, Jann Horn, Linus Torvalds,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Wed, Nov 07, 2018 at 04:27:58PM -0500, Rich Felker wrote:
> On Tue, Nov 06, 2018 at 03:26:16PM -0800, Sean Christopherson wrote:
> > On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> > > On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >
> > > > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > > > I almost feel like the right solution is to call into SGX on its own
> > > > > > private stack or maybe even its own private address space.
> > > > >
> > > > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat the
> > > > > enclave like its own "thread" with its own stack and its own set of
> > > > > registers and context?  That seems like a much more workable model than
> > > > > trying to weave it together with the EENTER context.
> > > > 
> > > > So maybe the API should be, roughly
> > > > 
> > > > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > > > host_state *state);
> > > > sgx_exit_reason_t sgx_resume_enclave(same args);
> > > > 
> > > > where host_state is something like:
> > > > 
> > > > struct host_state {
> > > >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > > > };
> > > > 
> > > > and the values in host_state explicitly have nothing to do with the
> > > > actual host registers.  So, if you want to use the outcall mechanism,
> > > > you'd allocate some memory, point sp to that memory, call
> > > > sgx_enter_enclave(), and then read that memory to do the outcall.
> > > > 
> > > > Actually implementing this would be distinctly nontrivial, and would
> > > > almost certainly need some degree of kernel help to avoid an explosion
> > > > when a signal gets delivered while we have host_state.sp loaded into
> > > > the actual SP register.  Maybe rseq could help with this?
> > > > 
> > > > The ISA here is IMO not well thought through.
> > > 
> > > Maybe I'm mistaken about some fundamentals here, but my understanding
> > > of SGX is that the whole point is that the host application and the
> > > code running in the enclave are mutually adversarial towards one
> > > another. Do any or all of the proposed protocols here account for this
> > > and fully protect the host application from malicious code in the
> > > enclave? It seems that having control over the register file on exit
> > > from the enclave is fundamentally problematic but I assume there must
> > > be some way I'm missing that this is fixed up.
> > 
> > SGX provides protections for the enclave but not the other way around.
> > The kernel has all of its normal non-SGX protections in place, but the
> > enclave can certainly wreak havoc on its userspace process.  The basic
> > design idea is that the enclave is a specialized .so that gets extra
> > security protections but is still effectively part of the overall
> > application, e.g. it has full access to its host userspace process'
> > virtual memory.
> 
> In that case it seems like the only way to use SGX that's not a gaping
> security hole is to run the SGX enclave in its own fully-seccomp (or
> equivalent) process, with no host application in the same address
> space. Since the host application can't see the contents of the
> enclave to make any determination of whether it's safe to run, running
> it in the same address space only makes sense if the cpu provides
> protection against unwanted accesses to the host's memory from the
> enclave -- and according to you, it doesn't.

The enclave's code (and any initial data) isn't encrypted until the
pages are loaded into the Enclave Page Cache (EPC), which can only
be done by the kernel (via ENCLS[EADD]).  In other words, both the
kernel and userspace can vet the code/data before running an enclave.

Practically speaking, an enclave will be coupled with an untrusted
userspace runtime, i.e. its loader.  Enclaves are also measured
as part of their build process, and so the enclave loader needs to
know which pages to add to the measurement, and in what order.  I
guess technically speaking an enclave could have zero pages added
to its measurement, but that'd probably be a big red flag that said
enclave is up to something fishy.
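
To make the "which pages, and in what order" point concrete, here is a
minimal sketch of an order-sensitive measurement.  Everything in it is
hypothetical: struct page_desc and toy_measure() are made-up names, and
FNV-1a is only a toy stand-in for the real hardware digest.  The sketch
also assumes that every added page contributes its offset and that only
pages flagged as measured contribute their contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loader-side description of one enclave page. */
struct page_desc {
	uint64_t offset;            /* page offset inside the enclave */
	int measured;               /* nonzero: contents are measured */
	const unsigned char *data;  /* page contents */
	size_t len;                 /* number of content bytes */
};

/* Toy stand-in digest (FNV-1a); NOT what SGX hardware computes. */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= UINT64_C(0x100000001b3);
	}
	return h;
}

/* Fold the pages into one running digest, strictly in array order. */
static uint64_t toy_measure(const struct page_desc *pages, size_t n)
{
	uint64_t h = UINT64_C(0xcbf29ce484222325);

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Assumption of this sketch: every add records the offset. */
		h = fnv1a(h, &pages[i].offset, sizeof(pages[i].offset));
		/* Contents only count for pages flagged as measured. */
		if (pages[i].measured)
			h = fnv1a(h, pages[i].data, pages[i].len);
	}
	return h;
}
```

Because the digest is a running fold, swapping two pages or clearing a
page's measured flag yields a different result, which is why the loader
has to reproduce the build-time sequence exactly.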

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 20:56                                                                         ` Dave Hansen
  2018-11-07 20:56                                                                           ` Dave Hansen
@ 2018-11-08 15:04                                                                           ` Jarkko Sakkinen
  2018-11-08 15:04                                                                             ` Jarkko Sakkinen
  1 sibling, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-08 15:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sean Christopherson, Andy Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Wed, Nov 07, 2018 at 12:56:58PM -0800, Dave Hansen wrote:
> On 11/7/18 11:01 AM, Sean Christopherson wrote:
> > Going off comments in similar code related to UMIP, we'd need to figure
> > out how to handle protection keys.
> 
> There are two options:
> 1. Don't depend on the userspace mapping.  Do get_user_pages() to find
>    the instruction in the kernel direct map, and use that.
> 2. Do a WRPKRU that allows read access, do the read, then put PKRU back.
>    This is a pain because of preemption and all that jazz.
> 
> Right now, we just let the prefetch instruction detection fail if you
> mark it unreadable with pkeys.  Tough cookies, basically.  But, that's
> just the kernel being nice, but you need it for functionality, so it's
> tougher.

I would go with option 1 because it is the stable way to do it and we
can be 100% sure it does not conflict with pkeys.
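
For comparison, a minimal sketch of the value juggling that option 2
would involve.  It relies on the architectural PKRU layout (two bits per
key: access-disable at bit 2*key, write-disable at bit 2*key + 1).  The
helper names are hypothetical, and the sketch only computes values; the
actual WRPKRU, the read itself, and the preemption handling are left
out.

```c
#include <assert.h>
#include <stdint.h>

/* PKRU layout: AD (access disable) at bit 2*key, WD at bit 2*key + 1. */
#define PKRU_AD_BIT(key) (UINT32_C(1) << (2 * (key)))
#define PKRU_WD_BIT(key) (UINT32_C(1) << (2 * (key) + 1))

/* Hypothetical helper: nonzero if data reads through `key` are blocked. */
static int pkru_read_blocked(uint32_t pkru, unsigned int key)
{
	assert(key < 16);
	return (pkru & PKRU_AD_BIT(key)) != 0;
}

/*
 * Hypothetical helper: the temporary PKRU value that permits reads
 * through `key`.  Only the AD bit of that key is cleared; its WD bit
 * and all other keys are left as they were.  The caller keeps the old
 * value around so that it can be written back after the read.
 */
static uint32_t pkru_allow_read(uint32_t pkru, unsigned int key)
{
	assert(key < 16);
	return pkru & ~PKRU_AD_BIT(key);
}
```

Option 1 avoids all of this because the read never goes through the
user mapping in the first place.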

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-07 21:40                                                     ` Sean Christopherson
  2018-11-07 21:40                                                       ` Sean Christopherson
@ 2018-11-08 15:11                                                       ` Jarkko Sakkinen
  2018-11-08 15:11                                                         ` Jarkko Sakkinen
  1 sibling, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-08 15:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rich Felker, Andy Lutomirski, Dave Hansen, Jann Horn,
	Linus Torvalds, Dave Hansen, Jethro Beekman, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Wed, Nov 07, 2018 at 01:40:59PM -0800, Sean Christopherson wrote:
> > In that case it seems like the only way to use SGX that's not a gaping
> > security hole is to run the SGX enclave in its own fully-seccomp (or
> > equivalent) process, with no host application in the same address
> > space. Since the host application can't see the contents of the
> > enclave to make any determination of whether it's safe to run, running
> > it in the same address space only makes sense if the cpu provides
> > protection against unwanted accesses to the host's memory from the
> > enclave -- and according to you, it doesn't.
> 
> The enclave's code (and any initial data) isn't encrypted until the
> pages are loaded into the Enclave Page Cache (EPC), which can only
> be done by the kernel (via ENCLS[EADD]).  In other words, both the
> kernel and userspace can vet the code/data before running an enclave.
> 
> Practically speaking, an enclave will be coupled with an untrusted
> userspace runtime, i.e. it's loader.  Enclaves are also measured
> as part of their build process, and so the enclave loader needs to
> know which pages to add to the measurement, and in what order.  I
> guess technically speaking an enclave could have zero pages added
> to its measurement, but that'd probably be a big red flag that said
> enclave is up to something fishy.

IMHO the whole idea adds too much policy to the kernel, even if it is
doable. You can easily spawn the untrusted run-time and the enclave in
their own process.

Seccomp limits the syscall space and enclaves cannot do syscalls in the
first place. It is the URT that will do them on behalf of the enclave.

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-06 21:07                                                     ` Andy Lutomirski
  2018-11-06 21:07                                                       ` Andy Lutomirski
  2018-11-06 21:41                                                       ` Andy Lutomirski
@ 2018-11-08 19:54                                                       ` Sean Christopherson
  2018-11-08 19:54                                                         ` Sean Christopherson
  2018-11-08 20:05                                                         ` Andy Lutomirski
  2 siblings, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-08 19:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Andy Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Tue, Nov 06, 2018 at 01:07:54PM -0800, Andy Lutomirski wrote:
> 
> 
> > On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> >> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
> >> True, but what if we have a nasty enclave that writes to memory just
> >> below SP *before* decrementing SP?
> > 
> > Yeah, that would be unfortunate.  If an enclave did this (roughly):
> > 
> >    1. EENTER
> >    2. Hardware sets eenter_hwframe->sp = %sp
> >    3. Enclave runs... wants to do out-call
> >    4. Enclave sets up parameters:
> >        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
> >        ...
> >    5. Enclave sets eenter_hwframe->sp -= offset
> > 
> > If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> > was on the stack.  The enclave could easily fix this by moving ->sp first.
> > 
> > But, this is one of those "fun" parts of the ABI that I think we need to
> > talk about.  If we do this, we also basically require that the code
> > which handles asynchronous exits must *not* write to the stack.  That's
> > not hard because it's typically just a single ERESUME instruction, but
> > it *is* a requirement.
> > 
> 
> I was assuming that the async exit stuff was completely hidden by the
> API.  The AEP code would decide whether the exit got fixed up by the
> kernel (which may or may not be easy to tell — can the code even tell
> without kernel help whether it was, say, an IRQ vs #UD?) and then either
> do ERESUME or cause sgx_enter_enclave() to return with an appropriate
> return value.
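
The ordering hazard in the quoted steps 4 and 5 can be modeled in plain
C.  This is only a toy: struct toy_stack, FRAME_SIZE, and the function
names are made up for illustration, and "signal delivery" is modeled as
a memset() just below the saved stack pointer.

```c
#include <assert.h>
#include <string.h>

#define STACK_SIZE 64
#define FRAME_SIZE 16	/* bytes the simulated signal frame overwrites */

struct toy_stack {
	unsigned char mem[STACK_SIZE];
	size_t sp;	/* index of the current top; grows toward 0 */
};

/* Model signal delivery: the frame lands just below the saved sp. */
static void toy_deliver_signal(struct toy_stack *s)
{
	assert(s->sp >= FRAME_SIZE);
	memset(&s->mem[s->sp - FRAME_SIZE], 0xAA, FRAME_SIZE);
}

/* Unsafe order: write below sp, signal arrives, then move sp.
 * Returns 1 if the arguments survived, 0 if they were clobbered. */
static int outcall_write_then_move(struct toy_stack *s,
				   const void *arg, size_t len)
{
	assert(len <= s->sp);
	memcpy(&s->mem[s->sp - len], arg, len);	/* step 4 */
	toy_deliver_signal(s);			/* signal between 4 and 5 */
	s->sp -= len;				/* step 5 */
	return memcmp(&s->mem[s->sp], arg, len) == 0;
}

/* Safe order: move sp first, so any signal frame lands below the args. */
static int outcall_move_then_write(struct toy_stack *s,
				   const void *arg, size_t len)
{
	assert(len <= s->sp);
	s->sp -= len;
	toy_deliver_signal(s);			/* signal before the copy */
	memcpy(&s->mem[s->sp], arg, len);
	toy_deliver_signal(s);			/* signal after the copy */
	return memcmp(&s->mem[s->sp], arg, len) == 0;
}
```

In the first function the signal frame overlaps the freshly copied
arguments; in the second it always lands below them, whichever side of
the copy the signal arrives on.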

Ok, SDK folks came up with an idea that would allow them to use vDSO,
albeit with a bit of ugliness and potentially a ROP-attack issue.
Definitely some weirdness, but the weirdness is well contained, unlike
the magic prefix approach.

Provide two enter_enclave() vDSO "functions".  The first is a normal
function with a normal C interface.  The second is a blob of code that
is "called" and "returns" via indirect jmp, and can be used by SGX
runtimes that want to use the untrusted stack for out-calls from the
enclave.

For the indirect jmp "function", use %rbp to stash the return address
of the caller (either in %rbp itself or in memory pointed to by %rbp).
It works because hardware also saves/restores %rbp along with %rsp when
doing enclave transitions, and the SDK can live with %rbp being
off-limits.  Fault info is passed via registers.

Basic idea for the "functions" below.  The fixup stuff is obviously not
wired up correctly, just trying to convey the concept.



struct enclu_fault_info {
	unsigned int	leaf;
	unsigned int	trapnr;
	unsigned int	error_code;
	unsigned long	address;
};

int __vdso_enter_enclave(void *tcs, struct enclu_fault_info *fault_info)
{
	unsigned int leaf, trapnr;

	asm volatile (
		"lea	2f(%%rip), %%rcx\n\t"
		"1:	enclu\n\t"
		"jmp	3f\n\t"

		/* ERESUME trampoline */
		"2:	enclu\n\t"
		"ud2\n\t"

		/* out: */
		"3:\n"

		/* EENTER fixup */
		".pushsection .fixup,\"ax\"\n\t"
		"4:\n\t"
		"mov	%%eax, %%edi\n\t"
		"movl	$"__stringify(SGX_EENTER)", %%eax\n\t"
		"jmp	3b\n\t"
		".popsection\n\t"
		_ASM_EXTABLE_FAULT(1b, 4b)

		/* ERESUME FIXUP */
		".pushsection .fixup,\"ax\"\n\t"
		"5:\n\t"
		"mov	%%eax, %%edi\n\t"
		"movl	$"__stringify(SGX_ERESUME)", %%eax\n\t"
		"jmp	3b\n\t"
		".popsection\n\t"
		_ASM_EXTABLE_FAULT(2b, 5b)

		: "=a"(leaf), "=D" (trapnr)
		: "a" (SGX_EENTER), "b" (tcs)
		: "cc", "memory", "rcx", "rdx", "rsi", "r8", "r9", "r10",
		  "r11", "r12", "r13", "r14", "r15"
	);

	if (leaf == SGX_EEXIT)
		return 0;

	if (fault_info) {
		fault_info->leaf = leaf;
		fault_info->trapnr = trapnr;
		fault_info->error_code = 0;
		fault_info->address = 0;
	}

	return -EFAULT;
}


GLOBAL(__vdso_enter_enclave_no_stack)
        endbr64

        /* %rbp = return target, %rbx = tcs */
        leaq    3f(%rip), %rcx
        movl    $2, %eax
1:      enclu

        /* "return" to "caller" */
2:      jmp     *%rbp

        /* ERESUME trampoline */
3:      enclu
        ud2

        /* EENTER fixup handler */
4:      movq    %rax, %rdi
        movl    $2, %eax
        /* %rsi = error code, %rdx = address */
        jmp     2b

        /* ERESUME fixup handler */
5:      movq    %rax, %rdi
        movl    $3, %eax
        /* %rsi = error code, %rdx = address */
        jmp     2b
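
For illustration, a rough sketch of how a caller might interpret the
result of the C-interface variant.  The struct is copied from above and
the leaf numbers are the architectural ENCLU leaves (EENTER = 2,
ERESUME = 3, EEXIT = 4); enum enter_outcome and classify_enter_result()
are hypothetical names, and the -EFAULT convention is simply the one
used in the concept code, not a settled ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Copied from the concept code above. */
struct enclu_fault_info {
	unsigned int	leaf;
	unsigned int	trapnr;
	unsigned int	error_code;
	unsigned long	address;
};

/* Architectural ENCLU leaf numbers. */
enum { SGX_EENTER = 2, SGX_ERESUME = 3, SGX_EEXIT = 4 };

/* Hypothetical classification of what the caller saw. */
enum enter_outcome {
	ENCLAVE_EXITED,		/* enclave left via EEXIT */
	FAULT_ON_EENTER,	/* fault was fixed up at EENTER */
	FAULT_ON_ERESUME,	/* fault was fixed up at ERESUME */
	FAULT_UNKNOWN,		/* no fault_info, or an unexpected leaf */
};

/*
 * Map the return value of __vdso_enter_enclave() and the optional
 * fault_info onto an outcome.  Assumes the function returns 0 after
 * EEXIT and -EFAULT otherwise, as in the concept code.
 */
static enum enter_outcome
classify_enter_result(int ret, const struct enclu_fault_info *info)
{
	if (ret == 0)
		return ENCLAVE_EXITED;

	assert(ret == -EFAULT);
	if (info == NULL)
		return FAULT_UNKNOWN;

	switch (info->leaf) {
	case SGX_EENTER:
		return FAULT_ON_EENTER;
	case SGX_ERESUME:
		return FAULT_ON_ERESUME;
	default:
		return FAULT_UNKNOWN;
	}
}
```

A runtime would presumably look at trapnr next to decide whether to
re-enter the enclave or give up, but that policy is outside this
sketch.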

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 19:54                                                       ` Sean Christopherson
  2018-11-08 19:54                                                         ` Sean Christopherson
@ 2018-11-08 20:05                                                         ` Andy Lutomirski
  2018-11-08 20:05                                                           ` Andy Lutomirski
                                                                             ` (2 more replies)
  1 sibling, 3 replies; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-08 20:05 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Dave Hansen, Andrew Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Thu, Nov 8, 2018 at 11:54 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 06, 2018 at 01:07:54PM -0800, Andy Lutomirski wrote:
> >
> >
> > > On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > >> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
> > >> True, but what if we have a nasty enclave that writes to memory just
> > >> below SP *before* decrementing SP?
> > >
> > > Yeah, that would be unfortunate.  If an enclave did this (roughly):
> > >
> > >    1. EENTER
> > >    2. Hardware sets eenter_hwframe->sp = %sp
> > >    3. Enclave runs... wants to do out-call
> > >    4. Enclave sets up parameters:
> > >        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
> > >        ...
> > >    5. Enclave sets eenter_hwframe->sp -= offset
> > >
> > > If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
> > > was on the stack.  The enclave could easily fix this by moving ->sp first.
> > >
> > > But, this is one of those "fun" parts of the ABI that I think we need to
> > > talk about.  If we do this, we also basically require that the code
> > > which handles asynchronous exits must *not* write to the stack.  That's
> > > not hard because it's typically just a single ERESUME instruction, but
> > > it *is* a requirement.
> > >
> >
> > I was assuming that the async exit stuff was completely hidden by the
> > API.  The AEP code would decide whether the exit got fixed up by the
> > kernel (which may or may not be easy to tell -- can the code even tell
> > without kernel help whether it was, say, an IRQ vs #UD?) and then either
> > do ERESUME or cause sgx_enter_enclave() to return with an appropriate
> > return value.
>
> Ok, SDK folks came up with an idea that would allow them to use vDSO,
> albeit with a bit of ugliness and potentially a ROP-attack issue.
> Definitely some weirdness, but the weirdness is well contained, unlike
> the magic prefix approach.
>
> Provide two enter_enclave() vDSO "functions".  The first is a normal
> function with a normal C interface.  The second is a blob of code that
> is "called" and "returns" via indirect jmp, and can be used by SGX
> runtimes that want to use the untrusted stack for out-calls from the
> enclave.
>
> For the indirect jmp "function", use %rbp to stash the return address
> of the caller (either in %rbp itself or in memory pointed to by %rbp).
> It works because hardware also saves/restores %rbp along with %rsp when
> doing enclave transitions, and the SDK can live with %rbp being
> off-limits.  Fault info is passed via registers.

Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
not the most terrible thing in the world.  But could the SDK live with
something more like my suggestion where the vDSO supplies a normal
function that takes a struct containing registers that are visible to
the enclave?  This would make it extremely awkward for the enclave to
use the untrusted stack per se, but it would make it quite easy (I
think) for the untrusted part of the SDK to allocate some extra memory
and just tell the enclave that *that* memory is the stack.

AFAICS we do have two registers that genuinely are preserved: FSBASE
and GSBASE.  Which is a good thing, because otherwise SGX enablement
would currently be a privilege escalation issue due to making GSBASE
writable when it should not be.

This whole thing is a mess.  I'm starting to think that the cleanest
solution would be to provide a way to just tell the kernel that
certain RIP values have exception fixups.
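
As a very rough sketch of what "certain RIP values have exception
fixups" could mean, here is a user-memory table searched by faulting
RIP, loosely modelled on the kernel's own extable.  Everything here
(struct user_extable_entry, user_extable_lookup(), the sorted-table
assumption) is made up for illustration and is not a proposed ABI:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: "if a fault hits at `insn`, resume at `fixup`". */
struct user_extable_entry {
	uintptr_t insn;
	uintptr_t fixup;
};

/*
 * Binary search over a table sorted by ascending `insn`.
 * Returns the fixup address, or 0 if the faulting RIP has no entry
 * (in which case normal signal delivery would apply).
 */
static uintptr_t user_extable_lookup(const struct user_extable_entry *tbl,
				     size_t n, uintptr_t rip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == rip)
			return tbl[mid].fixup;
		if (tbl[mid].insn < rip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

A miss returning 0 stands in for "no fixup registered, deliver the
signal as usual"; how the table would actually be registered with the
kernel is left open.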

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 20:05                                                         ` Andy Lutomirski
  2018-11-08 20:05                                                           ` Andy Lutomirski
@ 2018-11-08 20:10                                                           ` Dave Hansen
  2018-11-08 20:10                                                             ` Dave Hansen
  2018-11-08 21:16                                                             ` Sean Christopherson
  2018-11-09  7:12                                                           ` Christoph Hellwig
  2 siblings, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-08 20:10 UTC (permalink / raw)
  To: Andy Lutomirski, Christopherson, Sean J
  Cc: Andrew Lutomirski, Jann Horn, Linus Torvalds, Rich Felker,
	Dave Hansen, Jethro Beekman, Jarkko Sakkinen, Florian Weimer,
	Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On 11/8/18 12:05 PM, Andy Lutomirski wrote:
> Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
> not the most terrible thing in the world.  But could the SDK live with
> something more like my suggestion where the vDSO supplies a normal
> function that takes a struct containing registers that are visible to
> the enclave?  This would make it extremely awkward for the enclave to
> use the untrusted stack per se, but it would make it quite easy (I
> think) for the untrusted part of the SDK to allocate some extra memory
> and just tell the enclave that *that* memory is the stack.

I really think the enclave should keep its grubby mitts off the
untrusted stack.  There are lots of ways to get memory, even with
stack-like semantics, that don't involve mucking with the stack itself.

I have not heard a good, hard argument for why there is an absolute
*need* to store things on the actual untrusted stack.

We could quite easily have the untrusted code just promise to allocate a
stack-sized virtual area (even derived from the stack rlimit size) and
pass that into the enclave for parameter use.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 20:10                                                           ` Dave Hansen
  2018-11-08 20:10                                                             ` Dave Hansen
@ 2018-11-08 21:16                                                             ` Sean Christopherson
  2018-11-08 21:16                                                               ` Sean Christopherson
  2018-11-08 21:50                                                               ` Dave Hansen
  1 sibling, 2 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-08 21:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Andrew Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Thu, Nov 08, 2018 at 12:10:30PM -0800, Dave Hansen wrote:
> On 11/8/18 12:05 PM, Andy Lutomirski wrote:
> > Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
> > not the most terrible thing in the world.  But could the SDK live with
> > something more like my suggestion where the vDSO supplies a normal
> > function that takes a struct containing registers that are visible to
> > the enclave?  This would make it extremely awkward for the enclave to
> > use the untrusted stack per se, but it would make it quite easy (I
> > think) for the untrusted part of the SDK to allocate some extra memory
> > and just tell the enclave that *that* memory is the stack.
> 
> I really think the enclave should keep its grubby mitts off the
> untrusted stack.  There are lots of ways to get memory, even with
> stack-like semantics, that don't involve mucking with the stack itself.
> 
> I have not heard a good, hard argument for why there is an absolute
> *need* to store things on the actual untrusted stack.

Convenience and performance are the only arguments I've heard, e.g. so
that allocating memory doesn't require an extra EEXIT->EENTER round trip.

> We could quite easily have the untrusted code just promise to allocate a
> stack-sized virtual area (even derived from the stack rlimit size) and
> pass that into the enclave for parameter use.

I agree more and more the further I dig.  AFAIK there is no need for
the enclave to actually load %rsp.  The initial EENTER can pass in the
base/top of the pseudo-stack and from there the enclave can manage it
purely in software.
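
Roughly what "manage it purely in software" might look like inside the
enclave runtime, as a sketch only.  struct pseudo_stack and the
pstack_*() helpers are hypothetical names, and the 16-byte rounding is
just an assumed convention:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor handed to the enclave on the initial EENTER. */
struct pseudo_stack {
	uint8_t *base;	/* lowest usable address */
	uint8_t *top;	/* one past the highest usable address */
	uint8_t *sp;	/* software stack pointer, grows down from top */
};

static void pstack_init(struct pseudo_stack *ps, void *mem, size_t len)
{
	ps->base = mem;
	ps->top = ps->base + len;
	ps->sp = ps->top;
}

/*
 * Reserve `size` bytes (rounded up to 16) and copy `src` into them.
 * The pointer is moved *before* the copy, so nothing ever lives below sp.
 * Returns NULL if the area is exhausted.
 */
static void *pstack_push(struct pseudo_stack *ps, const void *src, size_t size)
{
	size_t need = (size + 15) & ~(size_t)15;

	if (need < size || need > (size_t)(ps->sp - ps->base))
		return NULL;

	ps->sp -= need;
	memcpy(ps->sp, src, size);
	return ps->sp;
}

/* Release the most recent reservation of `size` bytes. */
static void pstack_pop(struct pseudo_stack *ps, size_t size)
{
	size_t need = (size + 15) & ~(size_t)15;

	if (need <= (size_t)(ps->top - ps->sp))
		ps->sp += need;
}
```

Because the pointer is dropped before the copy, this also sidesteps the
"write below SP before decrementing SP" ordering problem raised earlier
in the thread.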

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 21:16                                                             ` Sean Christopherson
  2018-11-08 21:16                                                               ` Sean Christopherson
@ 2018-11-08 21:50                                                               ` Dave Hansen
  2018-11-08 21:50                                                                 ` Dave Hansen
  2018-11-08 22:04                                                                 ` Sean Christopherson
  1 sibling, 2 replies; 163+ messages in thread
From: Dave Hansen @ 2018-11-08 21:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Andrew Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On 11/8/18 1:16 PM, Sean Christopherson wrote:
> On Thu, Nov 08, 2018 at 12:10:30PM -0800, Dave Hansen wrote:
>> On 11/8/18 12:05 PM, Andy Lutomirski wrote:
>>> Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
>>> not the most terrible thing in the world.  But could the SDK live with
>>> something more like my suggestion where the vDSO supplies a normal
>>> function that takes a struct containing registers that are visible to
>>> the enclave?  This would make it extremely awkward for the enclave to
>>> use the untrusted stack per se, but it would make it quite easy (I
>>> think) for the untrusted part of the SDK to allocate some extra memory
>>> and just tell the enclave that *that* memory is the stack.
>>
>> I really think the enclave should keep its grubby mitts off the
>> untrusted stack.  There are lots of ways to get memory, even with
>> stack-like semantics, that don't involve mucking with the stack itself.
>>
>> I have not heard a good, hard argument for why there is an absolute
>> *need* to store things on the actual untrusted stack.
> 
> Convenience and performance are the only arguments I've heard, e.g. so
> that allocating memory doesn't require an extra EEXIT->EENTER round trip.

Well, for the first access, it's going to cost a bunch of asynchronous
exits to fault in all the stack pages.  Instead of that, if you had a
single area, or an explicit out-call to allocate and populate the area,
you could do it in a single EEXIT and zero asynchronous exits for demand
page faults.

So, it might be convenient, but I'm rather suspicious of any performance
arguments.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 21:50                                                               ` Dave Hansen
  2018-11-08 21:50                                                                 ` Dave Hansen
@ 2018-11-08 22:04                                                                 ` Sean Christopherson
  2018-11-08 22:04                                                                   ` Sean Christopherson
  1 sibling, 1 reply; 163+ messages in thread
From: Sean Christopherson @ 2018-11-08 22:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Andrew Lutomirski, Jann Horn, Linus Torvalds,
	Rich Felker, Dave Hansen, Jethro Beekman, Jarkko Sakkinen,
	Florian Weimer, Linux API, X86 ML, linux-arch, LKML,
	Peter Zijlstra, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Carlos O'Donell,
	adhemerval.zanella

On Thu, Nov 08, 2018 at 01:50:31PM -0800, Dave Hansen wrote:
> On 11/8/18 1:16 PM, Sean Christopherson wrote:
> > On Thu, Nov 08, 2018 at 12:10:30PM -0800, Dave Hansen wrote:
> >> On 11/8/18 12:05 PM, Andy Lutomirski wrote:
> >>> Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
> >>> not the most terrible thing in the world.  But could the SDK live with
> >>> something more like my suggestion where the vDSO supplies a normal
> >>> function that takes a struct containing registers that are visible to
> >>> the enclave?  This would make it extremely awkward for the enclave to
> >>> use the untrusted stack per se, but it would make it quite easy (I
> >>> think) for the untrusted part of the SDK to allocate some extra memory
> >>> and just tell the enclave that *that* memory is the stack.
> >>
> >> I really think the enclave should keep its grubby mitts off the
> >> untrusted stack.  There are lots of ways to get memory, even with
> >> stack-like semantics, that don't involve mucking with the stack itself.
> >>
> >> I have not heard a good, hard argument for why there is an absolute
> >> *need* to store things on the actual untrusted stack.
> > 
> > Convenience and performance are the only arguments I've heard, e.g. so
> > that allocating memory doesn't require an extra EEXIT->EENTER round trip.
> 
> Well, for the first access, it's going to cost a bunch of asynchronous
> exits to fault in all the stack pages.  Instead of that, if you had a
> single area, or an explicit out-call to allocate and populate the area,
> you could do it in a single EEXIT and zero asynchronous exits for demand
> page faults.
> 
> So, it might be convenient, but I'm rather suspicious of any performance
> arguments.

Ya, I meant versus doing an EEXIT on every allocation, i.e. a very
naive allocation scheme.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-08 20:05                                                         ` Andy Lutomirski
  2018-11-08 20:05                                                           ` Andy Lutomirski
  2018-11-08 20:10                                                           ` Dave Hansen
@ 2018-11-09  7:12                                                           ` Christoph Hellwig
  2018-11-09  7:12                                                             ` Christoph Hellwig
  2 siblings, 1 reply; 163+ messages in thread
From: Christoph Hellwig @ 2018-11-09  7:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopherson, Sean J, Dave Hansen, Andrew Lutomirski,
	Jann Horn, Linus Torvalds, Rich Felker, Dave Hansen,
	Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API,
	X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman, npmccallum,
	Ayoun, Serge, shay.katz-zamir, linux-sgx, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Carlos O'Donell, adhemerval.zanella

On Thu, Nov 08, 2018 at 12:05:42PM -0800, Andy Lutomirski wrote:
> This whole thing is a mess.  I'm starting to think that the cleanest
> solution would be to provide a way to just tell the kernel that
> certain RIP values have exception fixups.

By far the cleanest solution would be to say that SGX is such a mess
that we are not going to support it at all.  It's not like it is a
must-have feature to start with.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-01 17:53 RFC: userspace exception fixups Andy Lutomirski
                   ` (5 preceding siblings ...)
  2018-11-02 22:07 ` Jarkko Sakkinen
@ 2018-11-18  7:15 ` Jarkko Sakkinen
  2018-11-18  7:18   ` Jarkko Sakkinen
                     ` (2 more replies)
  6 siblings, 3 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-18  7:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> Hi all-
> 
> The people working on SGX enablement are grappling with a somewhat
> annoying issue: the x86 EENTER instruction is used from user code and
> can, as part of its normal-ish operation, raise an exception.  It is
> also highly likely to be used from a library, and signal handling in
> libraries is unpleasant at best.
> 
> There's been some discussion of adding a vDSO entry point to wrap
> EENTER and do something sensible with the exceptions, but I'm
> wondering if a more general mechanism would be helpful.

I haven't really followed all of this discussion because I've been busy
working on the patch set but for me all of these approaches look awfully
complicated.

I'll throw in my own suggestion, and apologize if this has already been
suggested and discarded: return-to-AEP.

My idea is to do just a small extension to SGX AEX handling. At the
moment hardware will fill RAX, RBX and RCX with ERESUME parameters. We can
fill extend this by filling other three spare registers with exception
information.

AEP handler can then do whatever it wants to do with this information
or just do ERESUME.

In some ways this is a dumbed-down version of Sean's suggestion.

I think whatever the solution is, it should be lightweight, and this is
such a solution. Why? Because exception handling could then be used to
implement things other than just error handling, like a syscall wrapper
for the enclaves, in a nice and lean way.

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-18  7:15 ` Jarkko Sakkinen
@ 2018-11-18  7:18   ` Jarkko Sakkinen
  2018-11-18 13:02   ` Jarkko Sakkinen
  2018-11-19 15:29   ` Andy Lutomirski
  2 siblings, 0 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-18  7:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Sun, Nov 18, 2018 at 09:15:48AM +0200, Jarkko Sakkinen wrote:
> On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> > Hi all-
> > 
> > The people working on SGX enablement are grappling with a somewhat
> > annoying issue: the x86 EENTER instruction is used from user code and
> > can, as part of its normal-ish operation, raise an exception.  It is
> > also highly likely to be used from a library, and signal handling in
> > libraries is unpleasant at best.
> > 
> > There's been some discussion of adding a vDSO entry point to wrap
> > EENTER and do something sensible with the exceptions, but I'm
> > wondering if a more general mechanism would be helpful.
> 
> I haven't really followed all of this discussion because I've been busy
> working on the patch set but for me all of these approaches look awfully
> complicated.
> 
> I'll throw my own suggestion and apologize if this has been already
> suggested and discarded: return-to-AEP.
> 
> My idea is to do just a small extension to SGX AEX handling. At the
> moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
> fill extend this by filling other three spare registers with exception

s/fill extend/extend/

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-18  7:15 ` Jarkko Sakkinen
  2018-11-18  7:18   ` Jarkko Sakkinen
@ 2018-11-18 13:02   ` Jarkko Sakkinen
  2018-11-19  5:17     ` Jethro Beekman
  2018-11-19 15:29   ` Andy Lutomirski
  2 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-18 13:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Sun, Nov 18, 2018 at 09:15:48AM +0200, Jarkko Sakkinen wrote:
> On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> > Hi all-
> > 
> > The people working on SGX enablement are grappling with a somewhat
> > annoying issue: the x86 EENTER instruction is used from user code and
> > can, as part of its normal-ish operation, raise an exception.  It is
> > also highly likely to be used from a library, and signal handling in
> > libraries is unpleasant at best.
> > 
> > There's been some discussion of adding a vDSO entry point to wrap
> > EENTER and do something sensible with the exceptions, but I'm
> > wondering if a more general mechanism would be helpful.
> 
> I haven't really followed all of this discussion because I've been busy
> working on the patch set but for me all of these approaches look awfully
> complicated.
> 
> I'll throw my own suggestion and apologize if this has been already
> suggested and discarded: return-to-AEP.
> 
> My idea is to do just a small extension to SGX AEX handling. At the
> moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
> fill extend this by filling other three spare registers with exception
> information.
> 
> AEP handler can then do whatever it wants to do with this information
> or just do ERESUME.

A correction here. In practice this will add a requirement for slightly
more complicated AEP code (checking the regs for exceptions) than before,
not just the bytes for ENCLU.

E.g. the AEP handler flow should be along these lines:

1. #PF (or #UD, etc.) happens. The kernel fills the registers when it
   cannot handle the exception and returns to user space, i.e. to the
   AEP handler.
2. Check the registers containing exception information. If they have
   been filled, take whatever actions user space wants to take.
3. Otherwise, just ERESUME.
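
As a rough illustration of steps 1-3, here is a minimal sketch in C of
the decision the AEP handler would make. It is a user-space model only;
real AEP code would be a few instructions of assembly around ENCLU. The
use of RDX/RSI/RDI as the three spare registers, and the convention that
a nonzero RDX means "exception info valid", are assumptions made for the
example. The proposal does not name the registers.

```c
#include <stdint.h>

/* Hypothetical register snapshot as seen on arrival at the AEP. */
struct aep_regs {
    uint64_t rax; /* ENCLU leaf, 3 = ERESUME (set by hardware on AEX) */
    uint64_t rbx; /* TCS address (set by hardware on AEX) */
    uint64_t rcx; /* AEP address (set by hardware on AEX) */
    uint64_t rdx; /* assumed: nonzero if the kernel reported an exception */
    uint64_t rsi; /* assumed: exception vector */
    uint64_t rdi; /* assumed: error code or faulting address */
};

enum aep_action {
    AEP_ERESUME,          /* step 3: nothing reported, resume the enclave */
    AEP_HANDLE_EXCEPTION  /* step 2: user space acts on the exception */
};

/* Decide what the AEP handler should do with the current registers. */
static enum aep_action aep_decide(const struct aep_regs *r)
{
    if (r->rdx != 0)
        return AEP_HANDLE_EXCEPTION;
    return AEP_ERESUME;
}
```

With all three spare registers zero the model picks ERESUME, which
matches today's behavior of an AEP that contains only ENCLU.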

From my point of view this makes the AEP parameter useful. Its
standard use is just weird (it always points to a place containing just
the ENCLU bytes; why the heck does it even exist?).

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-18 13:02   ` Jarkko Sakkinen
@ 2018-11-19  5:17     ` Jethro Beekman
  2018-11-19 14:05       ` Jarkko Sakkinen
  0 siblings, 1 reply; 163+ messages in thread
From: Jethro Beekman @ 2018-11-19  5:17 UTC (permalink / raw)
  To: Jarkko Sakkinen, Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Florian Weimer, Linux API,
	Jann Horn, Linus Torvalds, X86 ML, linux-arch, LKML,
	Peter Zijlstra, Rich Felker, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 2346 bytes --]

On 2018-11-18 18:32, Jarkko Sakkinen wrote:
> On Sun, Nov 18, 2018 at 09:15:48AM +0200, Jarkko Sakkinen wrote:
>> On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
>>> Hi all-
>>>
>>> The people working on SGX enablement are grappling with a somewhat
>>> annoying issue: the x86 EENTER instruction is used from user code and
>>> can, as part of its normal-ish operation, raise an exception.  It is
>>> also highly likely to be used from a library, and signal handling in
>>> libraries is unpleasant at best.
>>>
>>> There's been some discussion of adding a vDSO entry point to wrap
>>> EENTER and do something sensible with the exceptions, but I'm
>>> wondering if a more general mechanism would be helpful.
>>
>> I haven't really followed all of this discussion because I've been busy
>> working on the patch set but for me all of these approaches look awfully
>> complicated.
>>
>> I'll throw my own suggestion and apologize if this has been already
>> suggested and discarded: return-to-AEP.
>>
>> My idea is to do just a small extension to SGX AEX handling. At the
>> moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
>> fill extend this by filling other three spare registers with exception
>> information.
>>
>> AEP handler can then do whatever it wants to do with this information
>> or just do ERESUME.
> 
> A correction here. In practice this will add a requirement to have a bit
> more complicated AEP code (check the regs for exceptions) than before
> and not just bytes for ENCLU.
> 
> e.g. AEP handler should be along the lines
> 
> 1. #PF (or #UD or) happens. Kernel fills the registers when it cannot
>     handle the exception and returns back to user space i.e. to the
>     AEP handler.
> 2. Check the registers containing exception information. If they have
>     been filled, take whatever actions user space wants to take.
> 3. Otherwise, just ERESUME.
> 
>  From my point of view this is making the AEP parameter useful. Its
> standard use is just weird (always point to a place just containing
> ENCLU bytes, why the heck it even exists).

I like this solution. Keeps things simple. One question: when an 
exception occurs, how does the kernel know whether to set special 
registers or send a signal?

--
Jethro Beekman | Fortanix



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3990 bytes --]

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-19  5:17     ` Jethro Beekman
@ 2018-11-19 14:05       ` Jarkko Sakkinen
  2018-11-19 14:59         ` Jarkko Sakkinen
  0 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-19 14:05 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Mon, Nov 19, 2018 at 05:17:26AM +0000, Jethro Beekman wrote:
> On 2018-11-18 18:32, Jarkko Sakkinen wrote:
> > On Sun, Nov 18, 2018 at 09:15:48AM +0200, Jarkko Sakkinen wrote:
> > > On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> > > > Hi all-
> > > > 
> > > > The people working on SGX enablement are grappling with a somewhat
> > > > annoying issue: the x86 EENTER instruction is used from user code and
> > > > can, as part of its normal-ish operation, raise an exception.  It is
> > > > also highly likely to be used from a library, and signal handling in
> > > > libraries is unpleasant at best.
> > > > 
> > > > There's been some discussion of adding a vDSO entry point to wrap
> > > > EENTER and do something sensible with the exceptions, but I'm
> > > > wondering if a more general mechanism would be helpful.
> > > 
> > > I haven't really followed all of this discussion because I've been busy
> > > working on the patch set but for me all of these approaches look awfully
> > > complicated.
> > > 
> > > I'll throw my own suggestion and apologize if this has been already
> > > suggested and discarded: return-to-AEP.
> > > 
> > > My idea is to do just a small extension to SGX AEX handling. At the
> > > moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
> > > fill extend this by filling other three spare registers with exception
> > > information.
> > > 
> > > AEP handler can then do whatever it wants to do with this information
> > > or just do ERESUME.
> > 
> > A correction here. In practice this will add a requirement to have a bit
> > more complicated AEP code (check the regs for exceptions) than before
> > and not just bytes for ENCLU.
> > 
> > e.g. AEP handler should be along the lines
> > 
> > 1. #PF (or #UD or) happens. Kernel fills the registers when it cannot
> >     handle the exception and returns back to user space i.e. to the
> >     AEP handler.
> > 2. Check the registers containing exception information. If they have
> >     been filled, take whatever actions user space wants to take.
> > 3. Otherwise, just ERESUME.
> > 
> >  From my point of view this is making the AEP parameter useful. Its
> > standard use is just weird (always point to a place just containing
> > ENCLU bytes, why the heck it even exists).
> 
> I like this solution. Keeps things simple. One question: when an exception
> occurs, how does the kernel know whether to set special registers or send a
> signal?

Yes, and AFAIK in many cases people want to do something other than
just a direct ERESUME in the AEP handler, so this would not be a major
bummer for user space either. If I remember correctly, you have such a
case?

You can check the cases that we have for SIGSEGV (namely EPCM conflict)
from Sean's patch 08/23.

I'm open to expanding the scope. That is the easy part once there is
consensus on the handling mechanism :-)

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-19 14:05       ` Jarkko Sakkinen
@ 2018-11-19 14:59         ` Jarkko Sakkinen
  0 siblings, 0 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-19 14:59 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Mon, Nov 19, 2018 at 04:05:43PM +0200, Jarkko Sakkinen wrote:
> On Mon, Nov 19, 2018 at 05:17:26AM +0000, Jethro Beekman wrote:
> > On 2018-11-18 18:32, Jarkko Sakkinen wrote:
> > > On Sun, Nov 18, 2018 at 09:15:48AM +0200, Jarkko Sakkinen wrote:
> > > > On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> > > > > Hi all-
> > > > > 
> > > > > The people working on SGX enablement are grappling with a somewhat
> > > > > annoying issue: the x86 EENTER instruction is used from user code and
> > > > > can, as part of its normal-ish operation, raise an exception.  It is
> > > > > also highly likely to be used from a library, and signal handling in
> > > > > libraries is unpleasant at best.
> > > > > 
> > > > > There's been some discussion of adding a vDSO entry point to wrap
> > > > > EENTER and do something sensible with the exceptions, but I'm
> > > > > wondering if a more general mechanism would be helpful.
> > > > 
> > > > I haven't really followed all of this discussion because I've been busy
> > > > working on the patch set but for me all of these approaches look awfully
> > > > complicated.
> > > > 
> > > > I'll throw my own suggestion and apologize if this has been already
> > > > suggested and discarded: return-to-AEP.
> > > > 
> > > > My idea is to do just a small extension to SGX AEX handling. At the
> > > > moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
> > > > fill extend this by filling other three spare registers with exception
> > > > information.
> > > > 
> > > > AEP handler can then do whatever it wants to do with this information
> > > > or just do ERESUME.
> > > 
> > > A correction here. In practice this will add a requirement to have a bit
> > > more complicated AEP code (check the regs for exceptions) than before
> > > and not just bytes for ENCLU.
> > > 
> > > e.g. AEP handler should be along the lines
> > > 
> > > 1. #PF (or #UD or) happens. Kernel fills the registers when it cannot
> > >     handle the exception and returns back to user space i.e. to the
> > >     AEP handler.
> > > 2. Check the registers containing exception information. If they have
> > >     been filled, take whatever actions user space wants to take.
> > > 3. Otherwise, just ERESUME.
> > > 
> > >  From my point of view this is making the AEP parameter useful. Its
> > > standard use is just weird (always point to a place just containing
> > > ENCLU bytes, why the heck it even exists).
> > 
> > I like this solution. Keeps things simple. One question: when an exception
> > occurs, how does the kernel know whether to set special registers or send a
> > signal?
> 
> Yes, and AFAIK people do in many cases people want to do something else
> than just direct ERESUME in AEP handler so would neither be a major
> bummer for user space. If I remember correctly you have such?
> 
> You can check the cases that we have for SIGSEGV (namely EPCM conflict)
> from Sean's patch 08/23.
> 
> I'm open for expanding the scope. It is the easy part after there is
> consensus for the handling mechanism :-)

Not sure if it is a good idea or not, but maybe we could even have a new
ioctl, in addition to the enclave construction ioctls, that you use to
specify per enclave what you want to get. SIGSEGV could be the fallback
behavior if you do not "register" for any exceptions.
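
A minimal sketch of that idea in C, modelling only the bookkeeping: a
per-enclave mask of exception vectors that user space has asked for, with
ordinary signal delivery as the fallback. The struct, the function names
and the mask layout are all hypothetical; no such ioctl exists in the
patch set under discussion.

```c
#include <stdint.h>

/* Architectural x86 exception vector numbers. */
#define VEC_UD 6   /* invalid opcode */
#define VEC_GP 13  /* general protection */
#define VEC_PF 14  /* page fault */

/* Hypothetical per-enclave state that the proposed ioctl would fill in. */
struct enclave_exc_cfg {
    uint32_t vector_mask; /* bit N set: report vector N via the AEP */
};

enum delivery {
    DELIVER_SIGNAL, /* fallback: normal signal (SIGSEGV in the example above) */
    DELIVER_TO_AEP  /* registered: pass exception info to the AEP handler */
};

/* Model of the registration call: opt in to one exception vector. */
static void enclave_register_exception(struct enclave_exc_cfg *cfg,
                                       unsigned int vector)
{
    if (vector < 32)
        cfg->vector_mask |= UINT32_C(1) << vector;
}

/* Model of the kernel-side decision when an exception hits the enclave. */
static enum delivery enclave_route_exception(const struct enclave_exc_cfg *cfg,
                                             unsigned int vector)
{
    if (vector < 32 && (cfg->vector_mask & (UINT32_C(1) << vector)) != 0)
        return DELIVER_TO_AEP;
    return DELIVER_SIGNAL;
}
```

An enclave that never registers anything keeps a zero mask, so every
exception takes the signal path and existing behavior is unchanged.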

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-18  7:15 ` Jarkko Sakkinen
  2018-11-18  7:18   ` Jarkko Sakkinen
  2018-11-18 13:02   ` Jarkko Sakkinen
@ 2018-11-19 15:29   ` Andy Lutomirski
  2018-11-19 16:02     ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-19 15:29 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andrew Lutomirski, Dave Hansen, Christopherson, Sean J,
	Jethro Beekman, Florian Weimer, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	Rich Felker, nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir,
	linux-sgx, Andy Shevchenko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov

On Sat, Nov 17, 2018 at 11:16 PM Jarkko Sakkinen
<jarkko.sakkinen@linux.intel.com> wrote:
>
> On Thu, Nov 01, 2018 at 10:53:40AM -0700, Andy Lutomirski wrote:
> > Hi all-
> >
> > The people working on SGX enablement are grappling with a somewhat
> > annoying issue: the x86 EENTER instruction is used from user code and
> > can, as part of its normal-ish operation, raise an exception.  It is
> > also highly likely to be used from a library, and signal handling in
> > libraries is unpleasant at best.
> >
> > There's been some discussion of adding a vDSO entry point to wrap
> > EENTER and do something sensible with the exceptions, but I'm
> > wondering if a more general mechanism would be helpful.
>
> I haven't really followed all of this discussion because I've been busy
> working on the patch set but for me all of these approaches look awfully
> complicated.
>
> I'll throw my own suggestion and apologize if this has been already
> suggested and discarded: return-to-AEP.
>
> My idea is to do just a small extension to SGX AEX handling. At the
> moment hardware will RAX, RBX and RCX with ERESUME parameters. We can
> fill extend this by filling other three spare registers with exception
> information.

I have two issues with this approach:

1. The kernel needs some way to know *when* to apply this fixup.
Decoding the instruction stream and doing it to all exceptions that
hit an ENCLU instruction seems like a poor design.

2. It starts exposing what looks like a more generic exception
handling mechanism to userspace, except that it's nonsensical for
anything other than ENCLU.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-19 15:29   ` Andy Lutomirski
@ 2018-11-19 16:02     ` Jarkko Sakkinen
  2018-11-19 17:00       ` Andy Lutomirski
  0 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-19 16:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> 1. The kernel needs some way to know *when* to apply this fixup.
> Decoding the instruction stream and doing it to all exceptions that
> hit an ENCLU instruction seems like a poor design.

I'm not sure why you would ever need to do any type of fixup, as the idea
is just to return to the AEP, i.e. for chosen exceptions (EPCM, #UD) the
AEP would work the same way as for exceptions that the kernel can deal
with, except that the exception information is filled into registers.

> 2. It starts exposing what looks like a more generic exception
> handling mechanism to userspace, except that it's nonsensical for
> anything other than ENCLU.

Well, I see user space, namely the run-time, as the host for the
enclave, i.e. a middle-man that provides services such as emulating
instructions.

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-19 16:02     ` Jarkko Sakkinen
@ 2018-11-19 17:00       ` Andy Lutomirski
  2018-11-20 10:11         ` Jarkko Sakkinen
  0 siblings, 1 reply; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-19 17:00 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andrew Lutomirski, Dave Hansen, Christopherson, Sean J,
	Jethro Beekman, Florian Weimer, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	Rich Felker, nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir,
	linux-sgx, Andy Shevchenko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov

On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen
<jarkko.sakkinen@linux.intel.com> wrote:
>
> On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> > 1. The kernel needs some way to know *when* to apply this fixup.
> > Decoding the instruction stream and doing it to all exceptions that
> > hit an ENCLU instruction seems like a poor design.
>
> I'm not sure why you would ever need to do any type of fixup as the idea
> is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP
> would work the same way as for exceptions that the kernel can deal with
> except filling the exception information to registers.

Sure, but how does the kernel know when to do that and when to send a
signal?  I don't really like decoding the instruction stream to figure
it out.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-19 17:00       ` Andy Lutomirski
@ 2018-11-20 10:11         ` Jarkko Sakkinen
  2018-11-20 15:19           ` Andy Lutomirski
                             ` (2 more replies)
  0 siblings, 3 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-20 10:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Mon, Nov 19, 2018 at 09:00:08AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> >
> > On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> > > 1. The kernel needs some way to know *when* to apply this fixup.
> > > Decoding the instruction stream and doing it to all exceptions that
> > > hit an ENCLU instruction seems like a poor design.
> >
> > I'm not sure why you would ever need to do any type of fixup as the idea
> > is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP
> > would work the same way as for exceptions that the kernel can deal with
> > except filling the exception information to registers.
> 
> Sure, but how does the kernel know when to do that and when to send a
> signal?  I don't really like decoding the instruction stream to figure
> it out.

Hmm... why do you have to decode the instruction stream to find that out?
It would just depend on the exception type (#GP with EPCM, #UD). Or are
you saying that the kernel would need to SIGSEGV if there is in fact an
ENCLU, so that there is no infinite trap loop? Sorry, I'm a bit lost as
to where this decoding requirement comes from in the first place. I
understand how it is used in Sean's proposal...

Anyway, this option can probably be discarded without further
consideration, because apparently single stepping can cause a #DB SS
fault if the AEP handler is anything other than a single instruction.

It seems to me that, by ruling out options, the vDSO option is what is
left. I don't like it, but at least it works...

/Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-20 10:11         ` Jarkko Sakkinen
@ 2018-11-20 15:19           ` Andy Lutomirski
  2018-11-20 22:55             ` Jarkko Sakkinen
  2018-11-20 18:09           ` Sean Christopherson
  2018-11-20 22:46           ` Jarkko Sakkinen
  2 siblings, 1 reply; 163+ messages in thread
From: Andy Lutomirski @ 2018-11-20 15:19 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andrew Lutomirski, Dave Hansen, Christopherson, Sean J,
	Jethro Beekman, Florian Weimer, Linux API, Jann Horn,
	Linus Torvalds, X86 ML, linux-arch, LKML, Peter Zijlstra,
	Rich Felker, nhorman, npmccallum, Ayoun, Serge, shay.katz-zamir,
	linux-sgx, Andy Shevchenko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov

On Tue, Nov 20, 2018 at 2:11 AM Jarkko Sakkinen
<jarkko.sakkinen@linux.intel.com> wrote:
>
> On Mon, Nov 19, 2018 at 09:00:08AM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen
> > <jarkko.sakkinen@linux.intel.com> wrote:
> > >
> > > On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> > > > 1. The kernel needs some way to know *when* to apply this fixup.
> > > > Decoding the instruction stream and doing it to all exceptions that
> > > > hit an ENCLU instruction seems like a poor design.
> > >
> > > I'm not sure why you would ever need to do any type of fixup as the idea
> > > is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP
> > > would work the same way as for exceptions that the kernel can deal with
> > > except filling the exception information to registers.
> >
> > Sure, but how does the kernel know when to do that and when to send a
> > signal?  I don't really like decoding the instruction stream to figure
> > it out.
>
> Hmm... why you have to decode instruction stream to find that out? Would
> just depend on exception type (#GP with EPCM, #UD).

What is "#GP with EPCM"?  We certainly don't want to react to #UD in
general by mucking with some regs and retrying -- that will infinite
loop and confuse everyone.  I'm not even 100% convinced that decoding
the insn stream is useful -- AEP can point to something that isn't
ENCLU.

IOW the kernel needs to know *when* to apply this special behavior.
Sadly there is no bit in the exception frame that says "came from
SGX".
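
One way to give the kernel that knowledge, following the suggestion
earlier in the thread of telling it that certain RIP values have
exception fixups, is a lookup keyed on the faulting RIP. The sketch
below is a plain C model of such a table. The struct layout and function
name are hypothetical, and how the table would be registered is left
open. For SGX the registered addresses would presumably be the EENTER
site and the ENCLU at the AEP, since the RIP reported after an AEX is
the AEP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-registered fixup entry. */
struct user_fixup {
    uint64_t fault_rip; /* address of an instruction that is allowed to fault */
    uint64_t fixup_rip; /* where execution continues instead of a signal */
};

/*
 * Return the entry matching the faulting RIP, or NULL if the fault should
 * go down the normal signal path. A linear scan is enough for a sketch.
 */
static const struct user_fixup *find_user_fixup(const struct user_fixup *tbl,
                                                size_t n, uint64_t rip)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].fault_rip == rip)
            return &tbl[i];
    }
    return NULL;
}
```

The point of keying on RIP is that no instruction decoding is needed:
a fault at an unregistered address simply finds no entry and is handled
as it is today.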

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-20 10:11         ` Jarkko Sakkinen
  2018-11-20 15:19           ` Andy Lutomirski
@ 2018-11-20 18:09           ` Sean Christopherson
  2018-11-20 22:46           ` Jarkko Sakkinen
  2 siblings, 0 replies; 163+ messages in thread
From: Sean Christopherson @ 2018-11-20 18:09 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andy Lutomirski, Dave Hansen, Jethro Beekman, Florian Weimer,
	Linux API, Jann Horn, Linus Torvalds, X86 ML, linux-arch, LKML,
	Peter Zijlstra, Rich Felker, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov

On Tue, Nov 20, 2018 at 12:11:33PM +0200, Jarkko Sakkinen wrote:
> On Mon, Nov 19, 2018 at 09:00:08AM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen
> > <jarkko.sakkinen@linux.intel.com> wrote:
> > >
> > > On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> > > > 1. The kernel needs some way to know *when* to apply this fixup.
> > > > Decoding the instruction stream and doing it to all exceptions that
> > > > hit an ENCLU instruction seems like a poor design.
> > >
> > > I'm not sure why you would ever need to do any type of fixup as the idea
> > > is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP
> > > would work the same way as for exceptions that the kernel can deal with
> > > except filling the exception information to registers.
> > 
> > Sure, but how does the kernel know when to do that and when to send a
> > signal?  I don't really like decoding the instruction stream to figure
> > it out.
> 
> Hmm... why you have to decode instruction stream to find that out? Would
> just depend on exception type (#GP with EPCM, #UD).

#PF w/ PFEC_SGX is the only exception that indicates a fault is related
to SGX.  Theoretically we could avoid decoding by using a magic value
for the AEP itself and doing even more magic fixup, but that wouldn't
help for faults that occur on EENTER, which can be generic #GPs due to
loss of EPC on SGX1 systems. 
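
To make the limitation concrete, here is a small C sketch of the only
check the exception itself allows. The SGX bit is bit 15 of the x86 page
fault error code; the macro and function names are made up for the
example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 exception vector numbers. */
#define VEC_GP 13
#define VEC_PF 14

/* Bit 15 of the page fault error code: SGX access-control violation. */
#define PF_ERR_SGX (UINT32_C(1) << 15)

/* True only when the exception is explicitly tagged as SGX-related. */
static bool fault_is_tagged_sgx(unsigned int vector, uint32_t error_code)
{
    return vector == VEC_PF && (error_code & PF_ERR_SGX) != 0;
}
```

A #GP raised by EENTER after EPC loss returns false here, which is the
case described above: nothing in the exception marks it as SGX-related.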

> Or are you saying
> that kernel should need to SIGSEGV if there is in fact ENCLU so that
> there is no infinite trap loop? Sorry, I'm a bit lost here that where
> does this decoding requirement comes from in the first place. I
> understand how it is used in Sean's proposal...
> 
> Anyway, this option can be probably discarded without further
> consideration because apparently single stepping can cause #DB SS fault
> if AEP handler is anything else than a single instruction.

Not that it matters, but we could satisfy the "one instruction"
requirement if the fixup changed RIP to point at an ENCLU for #DBs.

> For me it seems that by ruling out options, vDSO option is what is
> left. I don't like it but at least it works...
> 
> /Jarkko

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: RFC: userspace exception fixups
  2018-11-20 10:11         ` Jarkko Sakkinen
  2018-11-20 15:19           ` Andy Lutomirski
  2018-11-20 18:09           ` Sean Christopherson
@ 2018-11-20 22:46           ` Jarkko Sakkinen
  2 siblings, 0 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-20 22:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Tue, Nov 20, 2018 at 12:11:33PM +0200, Jarkko Sakkinen wrote:
> On Mon, Nov 19, 2018 at 09:00:08AM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen
> > <jarkko.sakkinen@linux.intel.com> wrote:
> > >
> > > On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote:
> > > > 1. The kernel needs some way to know *when* to apply this fixup.
> > > > Decoding the instruction stream and doing it to all exceptions that
> > > > hit an ENCLU instruction seems like a poor design.
> > >
> > > I'm not sure why you would ever need to do any type of fixup as the idea
> > > is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP
> > > would work the same way as for exceptions that the kernel can deal with
> > > except filling the exception information to registers.
> > 
> > Sure, but how does the kernel know when to do that and when to send a
> > signal?  I don't really like decoding the instruction stream to figure
> > it out.
> 
> Hmm... why do you have to decode the instruction stream to find that out?
> It would just depend on the exception type (#GP with EPCM, #UD). Or are
> you saying that the kernel would need to SIGSEGV if there is in fact an
> ENCLU so that there is no infinite trap loop? Sorry, I'm a bit lost as to
> where this decoding requirement comes from in the first place. I
> understand how it is used in Sean's proposal...
> 
> Anyway, this option can be probably discarded without further
> consideration because apparently single stepping can cause #DB SS fault
> if AEP handler is anything else than a single instruction.
> 
> For me it seems that by ruling out options, vDSO option is what is
> left. I don't like it but at least it works...

The relevant section in the SDM is 43.2.6, but I started to wonder why
that would even be a problem in the dumbed-down return-to-AEP approach.
If you are single-step debugging, isn't that what you want? You would
continue single stepping in the AEP handler...

I still don't understand where the need for decoding the instruction
stream comes in with this dumbed-down approach. There is no RIP
manipulation or anything like that involved at all.

With this reconsideration I would keep this as one option at least.

/Jarkko


* Re: RFC: userspace exception fixups
  2018-11-20 15:19           ` Andy Lutomirski
@ 2018-11-20 22:55             ` Jarkko Sakkinen
  2018-11-21  5:17               ` Jethro Beekman
  0 siblings, 1 reply; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-20 22:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Jethro Beekman,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Tue, Nov 20, 2018 at 07:19:37AM -0800, Andy Lutomirski wrote:
> What is "#GP with EPCM"?  We certainly don't want to react to #UD in

A typo. I meant #PF with PF_SGX set, i.e. an EPCM conflict.

> general by mucking with some regs and retrying -- that will infinite
> loop and confuse everyone.  I'm not even 100% convinced that decoding
> the insn stream is useful -- AEP can point to something that isn't
> ENCLU.

In my return-to-AEP approach the whole point was not to do any decoding
but instead to always have something more in the AEP handler than just
ENCLU.

No instruction decoding. No RIP manipulation.

> IOW the kernel needs to know *when* to apply this special behavior.
> Sadly there is no bit in the exception frame that says "came from
> SGX".

/Jarkko


* Re: RFC: userspace exception fixups
  2018-11-20 22:55             ` Jarkko Sakkinen
@ 2018-11-21  5:17               ` Jethro Beekman
  2018-11-21 15:17                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 163+ messages in thread
From: Jethro Beekman @ 2018-11-21  5:17 UTC (permalink / raw)
  To: Jarkko Sakkinen, Andy Lutomirski
  Cc: Dave Hansen, Christopherson, Sean J, Florian Weimer, Linux API,
	Jann Horn, Linus Torvalds, X86 ML, linux-arch, LKML,
	Peter Zijlstra, Rich Felker, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov

On 2018-11-21 04:25, Jarkko Sakkinen wrote:
> On Tue, Nov 20, 2018 at 07:19:37AM -0800, Andy Lutomirski wrote:
>> general by mucking with some regs and retrying -- that will infinite
>> loop and confuse everyone.  I'm not even 100% convinced that decoding
>> the insn stream is useful -- AEP can point to something that isn't
>> ENCLU.
> 
> In my return-to-AEP approach the whole point was not to do any decoding
> but instead to always have something more in the AEP handler than just
> ENCLU.
> 
> No instruction decoding. No RIP manipulation.
> 
>> IOW the kernel needs to know *when* to apply this special behavior.
>> Sadly there is no bit in the exception frame that says "came from
>> SGX".

Jarkko, can you please explain your solution in detail? The CPU receives
an exception. This will be handled by the kernel exception handler. What 
information does the kernel exception handler use to determine whether 
to deliver the exception as a regular signal to the process, or whether 
to set the special registers values for userspace and just continue 
executing the process manually?

--
Jethro Beekman | Fortanix



* Re: RFC: userspace exception fixups
  2018-11-21  5:17               ` Jethro Beekman
@ 2018-11-21 15:17                 ` Jarkko Sakkinen
  2018-11-24 17:07                   ` Jarkko Sakkinen
  2018-11-26 14:35                   ` Sean Christopherson
  0 siblings, 2 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-21 15:17 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Wed, Nov 21, 2018 at 05:17:32AM +0000, Jethro Beekman wrote:
> Jarkko, can you please explain your solution in detail? The CPU receives an
> exception. This will be handled by the kernel exception handler. What
> information does the kernel exception handler use to determine whether to
> deliver the exception as a regular signal to the process, or whether to set
> the special registers values for userspace and just continue executing the
> process manually?

Currently we throw SIGSEGV when PF_SGX is set, right? In my solution that
would be turned into just doing an iret to the AEP, with the addition that
three registers get the exception data (type, reason, addr). No decoding
or RIP adjusting involved.
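
A minimal sketch of that dispatch, written as a user-space C model rather
than kernel code. The struct, the function name and the choice of
rdi/rsi/rdx as the three registers are assumptions made for illustration;
nothing in this thread fixes which registers would be used. The PF_SGX bit
position (bit 15 of the page-fault error code) is architectural.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 15 of the x86 page-fault error code: fault signaled by the EPCM. */
#define PF_SGX (1u << 15)
#define VEC_PF 14u /* #PF vector number */

/* Hypothetical model of the user registers the kernel would iret with. */
struct user_regs {
	uint64_t rip; /* after an enclave exit this already holds the AEP */
	uint64_t rdi; /* assumed slot for the exception type (vector) */
	uint64_t rsi; /* assumed slot for the reason (error code) */
	uint64_t rdx; /* assumed slot for the faulting address */
};

enum action { SEND_SIGSEGV, RETURN_TO_AEP };

/*
 * Dispose of a user-mode page fault that the kernel could not resolve.
 * Only the error code is consulted: no instruction decoding, and rip is
 * left exactly as the hardware reported it.
 */
enum action dispose_page_fault(struct user_regs *regs, uint32_t error_code,
			       uint64_t addr)
{
	assert(regs != NULL);

	if (!(error_code & PF_SGX))
		return SEND_SIGSEGV; /* ordinary fault: signal as today */

	regs->rdi = VEC_PF;     /* type */
	regs->rsi = error_code; /* reason */
	regs->rdx = addr;       /* addr */
	return RETURN_TO_AEP;
}
```

The model only covers the #PF case, because that is the one fault where the
error code itself says "EPCM".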

That would mean that you would actually have to implement an AEP handler
rather than just have ENCLU there.
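
As a rough illustration of what "more than just ENCLU" could mean on the
user side, here is a sketch of the decision such a handler would make,
modeled as plain C. Everything named here is hypothetical: the
AEP_NO_EXCEPTION marker, the struct and the function are assumptions, and
how the handler would tell a plain asynchronous exit from a kernel-reported
exception is not settled anywhere in this thread. A real handler would be
a small assembly stub at the AEP that ends in ENCLU[ERESUME] on the resume
path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed marker meaning "plain asynchronous exit, no exception data". */
#define AEP_NO_EXCEPTION UINT64_MAX

enum aep_action {
	AEP_RESUME, /* re-enter the enclave with ENCLU[ERESUME] */
	AEP_REPORT, /* hand the exception to the enclave runtime */
};

/* Hypothetical record built from the three registers seen at the AEP. */
struct aep_exception {
	uint64_t type;   /* exception vector */
	uint64_t reason; /* error code */
	uint64_t addr;   /* faulting address */
};

/*
 * Decide what to do on arrival at the AEP. The three arguments stand in
 * for the three registers the kernel would have filled before the iret.
 */
enum aep_action aep_dispatch(uint64_t type, uint64_t reason, uint64_t addr,
			     struct aep_exception *out)
{
	assert(out != NULL);

	if (type == AEP_NO_EXCEPTION)
		return AEP_RESUME; /* e.g. an interrupt: just resume */

	out->type = type;
	out->reason = reason;
	out->addr = addr;
	return AEP_REPORT;
}
```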

I've also proposed that perhaps for SGX also #UD should be propagated
this way because for some instructions you need outside help to emulate
"non-enclave" environment.

That is all I have drafted together so far. I'll try to finish v18 this
week with other stuff and refine further next week (unless someone gives
an obvious reason why this doesn't work, which might well happen because
I haven't gone too deep with my analysis yet for lack of time).

/Jarkko


* Re: RFC: userspace exception fixups
  2018-11-21 15:17                 ` Jarkko Sakkinen
@ 2018-11-24 17:07                   ` Jarkko Sakkinen
  2018-11-26 14:35                   ` Sean Christopherson
  1 sibling, 0 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-24 17:07 UTC (permalink / raw)
  To: Jethro Beekman
  Cc: Andy Lutomirski, Dave Hansen, Christopherson, Sean J,
	Florian Weimer, Linux API, Jann Horn, Linus Torvalds, X86 ML,
	linux-arch, LKML, Peter Zijlstra, Rich Felker, nhorman,
	npmccallum, Ayoun, Serge, shay.katz-zamir, linux-sgx,
	Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov

On Wed, Nov 21, 2018 at 05:17:34PM +0200, Jarkko Sakkinen wrote:
> On Wed, Nov 21, 2018 at 05:17:32AM +0000, Jethro Beekman wrote:
> > Jarkko, can you please explain your solution in detail? The CPU receives an
> > exception. This will be handled by the kernel exception handler. What
> > information does the kernel exception handler use to determine whether to
> > deliver the exception as a regular signal to the process, or whether to set
> > the special registers values for userspace and just continue executing the
> > process manually?
> 
> Currently we throw SIGSEGV when PF_SGX is set, right? In my solution that
> would be turned into just doing an iret to the AEP, with the addition that
> three registers get the exception data (type, reason, addr). No decoding
> or RIP adjusting involved.
> 
> That would mean that you would actually have to implement an AEP handler
> rather than just have ENCLU there.
> 
> I've also proposed that perhaps for SGX also #UD should be propagated
> this way because for some instructions you need outside help to emulate
> "non-enclave" environment.
> 
> That is all I have drafted together so far. I'll try to finish v18 this
> week with other stuff and refine further next week (unless someone gives
> an obvious reason why this doesn't work, which might well happen because
> I haven't gone too deep with my analysis yet for lack of time).

The obvious con of this approach is that if you single step the code,
the whole AEP handler would also be single stepped every time. That is
probably a big enough con that it is better to go with the vDSO approach
anyhow...

/Jarkko


* Re: RFC: userspace exception fixups
  2018-11-21 15:17                 ` Jarkko Sakkinen
  2018-11-24 17:07                   ` Jarkko Sakkinen
@ 2018-11-26 14:35                   ` Sean Christopherson
  2018-11-26 22:06                     ` Jarkko Sakkinen
  1 sibling, 1 reply; 163+ messages in thread
From: Sean Christopherson @ 2018-11-26 14:35 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Jethro Beekman, Andy Lutomirski, Dave Hansen, Florian Weimer,
	Linux API, Jann Horn, Linus Torvalds, X86 ML, linux-arch, LKML,
	Peter Zijlstra, Rich Felker, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov

On Wed, Nov 21, 2018 at 05:17:34PM +0200, Jarkko Sakkinen wrote:
> On Wed, Nov 21, 2018 at 05:17:32AM +0000, Jethro Beekman wrote:
> > Jarkko, can you please explain your solution in detail? The CPU receives an
> > exception. This will be handled by the kernel exception handler. What
> > information does the kernel exception handler use to determine whether to
> > deliver the exception as a regular signal to the process, or whether to set
> > the special registers values for userspace and just continue executing the
> > process manually?
> 
> Currently we throw SIGSEGV when PF_SGX is set, right? In my solution that
> would be turned into just doing an iret to the AEP, with the addition that
> three registers get the exception data (type, reason, addr). No decoding
> or RIP adjusting involved.
> 
> That would mean that you would actually have to implement an AEP handler
> rather than just have ENCLU there.
> 
> I've also proposed that perhaps for SGX also #UD should be propagated
> this way because for some instructions you need outside help to emulate
> "non-enclave" environment.

And how would you determine the #UD is related to SGX?  Hardware doesn't
provide any indication that a #UD (or any other fault) is related to SGX
or occurred in an enclave.  The only fault that is special-cased in a
non-virtualized environment is #PF signaled by the EPCM, which gets the
PF_SGX bit set in the error code.
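
To make the asymmetry concrete, here is a small C sketch of what the
hardware-provided fault information can and cannot tell the kernel. The
error-code bit positions and vector numbers are the architectural x86
values; the helper name and its signature are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits. */
enum {
	PF_PROT  = 1u << 0,  /* protection violation (page was present) */
	PF_WRITE = 1u << 1,  /* write access */
	PF_USER  = 1u << 2,  /* fault taken in user mode */
	PF_RSVD  = 1u << 3,  /* reserved bit set in a paging entry */
	PF_INSTR = 1u << 4,  /* instruction fetch */
	PF_PK    = 1u << 5,  /* protection-key violation */
	PF_SGX   = 1u << 15, /* signaled by the EPCM */
};

#define VEC_UD 6u  /* #UD pushes no error code */
#define VEC_PF 14u /* #PF pushes the error code above */

/*
 * Hypothetical helper: true only when the vector and error code alone
 * tie the fault to SGX. For #UD there is nothing to inspect, so the
 * answer is always false.
 */
bool fault_is_tagged_sgx(unsigned int vector, bool has_error_code,
			 uint32_t error_code)
{
	if (vector == VEC_PF && has_error_code)
		return (error_code & PF_SGX) != 0;
	return false;
}
```

For example, fault_is_tagged_sgx(VEC_PF, true, PF_SGX | PF_USER) is true,
while fault_is_tagged_sgx(VEC_UD, false, 0) is false no matter where the
#UD was raised.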

> That is all I have drafted together so far. I'll try to finish v18 this
> week with other stuff and refine further next week (unless someone gives
> an obvious reason why this doesn't work, which might well happen because
> I haven't gone too deep with my analysis yet for lack of time).
> 
> /Jarkko


* Re: RFC: userspace exception fixups
  2018-11-26 14:35                   ` Sean Christopherson
@ 2018-11-26 22:06                     ` Jarkko Sakkinen
  0 siblings, 0 replies; 163+ messages in thread
From: Jarkko Sakkinen @ 2018-11-26 22:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jethro Beekman, Andy Lutomirski, Dave Hansen, Florian Weimer,
	Linux API, Jann Horn, Linus Torvalds, X86 ML, linux-arch, LKML,
	Peter Zijlstra, Rich Felker, nhorman, npmccallum, Ayoun, Serge,
	shay.katz-zamir, linux-sgx, Andy Shevchenko, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov

On Mon, Nov 26, 2018 at 06:35:34AM -0800, Sean Christopherson wrote:
> And how would you determine the #UD is related to SGX?  Hardware doesn't
> provide any indication that a #UD (or any other fault) is related to SGX
> or occurred in an enclave.  The only fault that is special-cased in a
> non-virtualized environment is #PF signaled by the EPCM, which gets the
> PF_SGX bit set in the error code.

Could you not detect #UD from the address where it happened? The kernel
knows where enclaves are mapped. BTW, how does the Intel run-time emulate
opcodes currently?

Anyway, I've fully discarded the whole idea because implementing single
stepping without a well-defined AEP handler is nasty. I think vDSOs are
the only viable path, at least that I'm aware of...

/Jarkko
