All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] x86/pv: Multiple fixes to SYSCALL/SYSENTER handling (XSA-339 followup)
@ 2020-09-23 10:18 Andrew Cooper
  2020-09-23 10:18 ` [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-09-23 10:18 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Andy Lutomirski, Manuel Bouyer

Patches 1 and 2 are a consequence of trying to get the Linux x86 selftests to
pass even when running under Xen.

Patches 3 and XSA-339 were further fallout from trying to put in place testing
to cover all aspects of the PV fast system call entrypoints.

Patch 3 was almost an XSA itself, but was ultimately argued as not affecting
any known PV guest.  It turns out that this is only true because of c/s
dba899de14 in 2018, which did fix a real userspace => VM DoS on NetBSD.

All fixes need backporting.

Andrew Cooper (3):
  x86/pv: Don't deliver #GP for a SYSENTER with NT set
  x86/pv: Don't clobber NT on return-to-guest
  x86/pv: Inject #UD for missing SYSCALL callbacks

 xen/arch/x86/x86_64/compat/entry.S |  2 +-
 xen/arch/x86/x86_64/entry.S        | 31 +++++++++++++++++++++----------
 2 files changed, 22 insertions(+), 11 deletions(-)

-- 
2.11.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set
  2020-09-23 10:18 [PATCH 0/3] x86/pv: Multiple fixes to SYSCALL/SYSENTER handling (XSA-339 followup) Andrew Cooper
@ 2020-09-23 10:18 ` Andrew Cooper
  2020-09-24 13:55   ` Jan Beulich
  2020-09-23 10:18 ` [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest Andrew Cooper
  2020-09-23 10:18 ` [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks Andrew Cooper
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2020-09-23 10:18 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Andy Lutomirski

It is a matter of guest kernel policy what to do with offending userspace, and
terminating said userspace may not be the action chosen.

Linux explicitly tolerates this case.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: fdac951560 ("x86: clear EFLAGS.NT in SYSENTER entry path")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Andy Lutomirski <luto@kernel.org>
---
 xen/arch/x86/x86_64/entry.S | 1 -
 1 file changed, 1 deletion(-)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 71a00e846b..44a110b9c8 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -313,7 +313,6 @@ UNLIKELY_START(nz, sysenter_nt_set)
         pushfq
         andl  $~X86_EFLAGS_NT,(%rsp)
         popfq
-        xorl  %eax,%eax
 UNLIKELY_END(sysenter_nt_set)
         testq %rax,%rax
         leal  (,%rcx,TBF_INTERRUPT),%ecx
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest
  2020-09-23 10:18 [PATCH 0/3] x86/pv: Multiple fixes to SYSCALL/SYSENTER handling (XSA-339 followup) Andrew Cooper
  2020-09-23 10:18 ` [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set Andrew Cooper
@ 2020-09-23 10:18 ` Andrew Cooper
  2020-09-24 13:57   ` Jan Beulich
  2020-09-23 10:18 ` [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks Andrew Cooper
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2020-09-23 10:18 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Andy Lutomirski

A 64bit IRET can restore NT - the faulting case is when NT is set in the live
flags.  This change had an unintended consequence of causing the NT flag to
spontaneously disappear from guest context whenever a interrupt/exception
occurred.

In combination with a SYSENTER which sets both TF and NT, Xen's handling of
the #DB exceptions clears NT before it is even recorded suitably in the guest
kernel's view of what userspace was doing.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 0e47f92b0 ("x86: force EFLAGS.IF on when exiting to PV guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Andy Lutomirski <luto@kernel.org>
---
 xen/arch/x86/x86_64/compat/entry.S | 2 +-
 xen/arch/x86/x86_64/entry.S        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 73619f57ca..3b2136b272 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -119,7 +119,7 @@ compat_process_trap:
 /* %rbx: struct vcpu, interrupts disabled */
 ENTRY(compat_restore_all_guest)
         ASSERT_INTERRUPTS_DISABLED
-        mov   $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),%r11d
+        mov   $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11d
         and   UREGS_eflags(%rsp),%r11d
 
 .macro alt_cr4_pv32
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 44a110b9c8..000eb9722b 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -182,7 +182,7 @@ restore_all_guest:
         jz    iret_exit_to_guest
 
         movq  24(%rsp),%r11           # RFLAGS
-        andq  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),%r11
+        andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
         orq   $X86_EFLAGS_IF,%r11
 
         /* Don't use SYSRET path if the return address is not canonical. */
@@ -213,7 +213,7 @@ restore_all_guest:
         movq  8(%rsp), %rcx           # RIP
 /* No special register assumptions. */
 iret_exit_to_guest:
-        andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
+        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
         orl   $X86_EFLAGS_IF,24(%rsp)
         addq  $8,%rsp
 .Lft0:  iretq
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-09-23 10:18 [PATCH 0/3] x86/pv: Multiple fixes to SYSCALL/SYSENTER handling (XSA-339 followup) Andrew Cooper
  2020-09-23 10:18 ` [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set Andrew Cooper
  2020-09-23 10:18 ` [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest Andrew Cooper
@ 2020-09-23 10:18 ` Andrew Cooper
  2020-09-24 14:56   ` Jan Beulich
  2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
  2 siblings, 2 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-09-23 10:18 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Andy Lutomirski, Manuel Bouyer

Despite appearing to be a deliberate design choice of early PV64, the
resulting behaviour for unregistered SYSCALL callbacks creates an untenable
testability problem for Xen.  Furthermore, the behaviour is undocumented,
bizarre, and inconsistent with related behaviour in Xen, and very liable
introduce a security vulnerability into a PV guest if the author hasn't
studied Xen's assembly code in detail.

There are two different bugs here.

1) The current logic confuses the registered entrypoints, and may deliver a
   SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
   entrypoint is registered.

   This has been the case ever since 2007 (c/s cd75d47348b) but up until
   2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
   a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.

   Xen would malfunction under these circumstances, if it were a PV guest.
   Linux would as well, but PVOps has always registered both entrypoints and
   discarded the Xen-provided selectors.  NetBSD really does malfunction as a
   consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).

2) In the case that neither SYSCALL callbacks are registered, the guest will
   be crashed when userspace executes a SYSCALL instruction, which is a
   userspace => kernel DoS.

   This has been the case ever since the introduction of 64bit PV support, but
   behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
   #GP/#UD in userspace before the callback is registered, and are therefore
   safe by default.

This change does constitute a change in the PV ABI, for corner cases of a PV
guest kernel registering neither callback, or not registering the 32bit
callback when running on AMD/Hygon hardware.

It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
SYSENTER (safe by default, until explicitly enabled), as well as native
hardware (always delivered to the single applicable callback).

Most importantly however, and the primary reason for the change, is that it
lets us actually test the PV entrypoints to prove correct behaviour.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Manuel Bouyer <bouyer@antioche.eu.org>

Manuel: This will result in a corner case change for NetBSD.

At the moment on native, 32bit userspace on 64bit NetBSD will get #UD (Intel,
etc), or an explicit -ENOSYS (AMD, etc) when trying to execute a 32bit SYSCALL
instruction.

After this change, a 64bit PV VM will consistently see #UD (like on Intel, etc
hardware) even when running on AMD/Hygon hardware (as Xsyscall32 isn't
registered with Xen), rather than following Xsyscall into the proper system
call path.
---
 xen/arch/x86/x86_64/entry.S | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 000eb9722b..baab6d8755 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -26,18 +26,30 @@
 /* %rbx: struct vcpu */
 ENTRY(switch_to_kernel)
         leaq  VCPU_trap_bounce(%rbx),%rdx
-        /* TB_eip = (32-bit syscall && syscall32_addr) ?
-         *          syscall32_addr : syscall_addr */
-        xor   %eax,%eax
+
+        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
+        mov   VCPU_syscall32_addr(%rbx), %ecx
+        mov   VCPU_syscall_addr(%rbx), %rax
         cmpw  $FLAT_USER_CS32,UREGS_cs(%rsp)
-        cmoveq VCPU_syscall32_addr(%rbx),%rax
-        testq %rax,%rax
-        cmovzq VCPU_syscall_addr(%rbx),%rax
-        movq  %rax,TRAPBOUNCE_eip(%rdx)
+        cmove %rcx, %rax
+
         /* TB_flags = VGCF_syscall_disables_events ? TBF_INTERRUPT : 0 */
         btl   $_VGCF_syscall_disables_events,VCPU_guest_context_flags(%rbx)
         setc  %cl
         leal  (,%rcx,TBF_INTERRUPT),%ecx
+
+        test  %rax, %rax
+UNLIKELY_START(z, syscall_no_callback) /* TB_eip == 0 => #UD */
+        movq  VCPU_trap_ctxt(%rbx), %rdi
+        movl  $X86_EXC_UD, UREGS_entry_vector(%rsp)
+        subl  $2, UREGS_rip(%rsp)
+        movl  X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_eip(%rdi), %eax
+        testb $4, X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_flags(%rdi)
+        setnz %cl
+        leal  TBF_EXCEPTION(, %rcx, TBF_INTERRUPT), %ecx
+UNLIKELY_END(syscall_no_callback)
+
+        movq  %rax,TRAPBOUNCE_eip(%rdx)
         movb  %cl,TRAPBOUNCE_flags(%rdx)
         call  create_bounce_frame
         andl  $~X86_EFLAGS_DF,UREGS_eflags(%rsp)
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set
  2020-09-23 10:18 ` [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set Andrew Cooper
@ 2020-09-24 13:55   ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2020-09-24 13:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné, Wei Liu, Andy Lutomirski

On 23.09.2020 12:18, Andrew Cooper wrote:
> It is a matter of guest kernel policy what to do with offending userspace, and
> terminating said userspace may not be the action chosen.
> 
> Linux explicitly tolerates this case.
> 
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: fdac951560 ("x86: clear EFLAGS.NT in SYSENTER entry path")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest
  2020-09-23 10:18 ` [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest Andrew Cooper
@ 2020-09-24 13:57   ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2020-09-24 13:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné, Wei Liu, Andy Lutomirski

On 23.09.2020 12:18, Andrew Cooper wrote:
> A 64bit IRET can restore NT - the faulting case is when NT is set in the live
> flags.  This change had an unintended consequence of causing the NT flag to
> spontaneously disappear from guest context whenever a interrupt/exception
> occurred.
> 
> In combination with a SYSENTER which sets both TF and NT, Xen's handling of
> the #DB exceptions clears NT before it is even recorded suitably in the guest
> kernel's view of what userspace was doing.
> 
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: 0e47f92b0 ("x86: force EFLAGS.IF on when exiting to PV guests")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-09-23 10:18 ` [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks Andrew Cooper
@ 2020-09-24 14:56   ` Jan Beulich
  2020-09-28 13:05     ` Andrew Cooper
  2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2020-09-24 14:56 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Roger Pau Monné, Wei Liu, Andy Lutomirski, Manuel Bouyer

On 23.09.2020 12:18, Andrew Cooper wrote:
> Despite appearing to be a deliberate design choice of early PV64, the
> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
> testability problem for Xen.  Furthermore, the behaviour is undocumented,
> bizarre, and inconsistent with related behaviour in Xen, and very liable
> introduce a security vulnerability into a PV guest if the author hasn't
> studied Xen's assembly code in detail.
> 
> There are two different bugs here.
> 
> 1) The current logic confuses the registered entrypoints, and may deliver a
>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>    entrypoint is registered.
> 
>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.

I'm not sure what you derive the last half sentence from. To a 32-bit
PV guest, nothing can make things look like being 64-bit. And as you
did say in your 2018 change, FLAT_KERNEL_SS == FLAT_USER_SS32.

As to the "confusion" of entry points - before the compat mode entry
path was introduced, a 64-bit guest could only register a single
entry point. Hence guests at the time had to multiplex 32- and 64-bit
user mode entry from this one code path. In order to avoid regressing
any such guest, the falling back to using the 64-bit entry point was
chosen. Effectively what you propose is to regress such guests now,
rather than back then.

>    Xen would malfunction under these circumstances, if it were a PV guest.
>    Linux would as well, but PVOps has always registered both entrypoints and
>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
> 
> 2) In the case that neither SYSCALL callbacks are registered, the guest will
>    be crashed when userspace executes a SYSCALL instruction, which is a
>    userspace => kernel DoS.
> 
>    This has been the case ever since the introduction of 64bit PV support, but
>    behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
>    #GP/#UD in userspace before the callback is registered, and are therefore
>    safe by default.

I agree this part is an improvement.

> This change does constitute a change in the PV ABI, for corner cases of a PV
> guest kernel registering neither callback, or not registering the 32bit
> callback when running on AMD/Hygon hardware.
> 
> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
> SYSENTER (safe by default, until explicitly enabled), as well as native
> hardware (always delivered to the single applicable callback).

Albeit an OS running natively and setting EFER.SCE is obliged to set both
entry points; they can't have one without the other (and not be vulnerable).
Since it's unclear what the PV equivalent of EFER.SCE is, I don't think
comparing this particular aspect of the behavior makes a lot of sense.

> Most importantly however, and the primary reason for the change, is that it
> lets us actually test the PV entrypoints to prove correct behaviour.

You mean "test the absence of PV entry points" here?

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -26,18 +26,30 @@
>  /* %rbx: struct vcpu */
>  ENTRY(switch_to_kernel)
>          leaq  VCPU_trap_bounce(%rbx),%rdx
> -        /* TB_eip = (32-bit syscall && syscall32_addr) ?
> -         *          syscall32_addr : syscall_addr */
> -        xor   %eax,%eax
> +
> +        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
> +        mov   VCPU_syscall32_addr(%rbx), %ecx
> +        mov   VCPU_syscall_addr(%rbx), %rax
>          cmpw  $FLAT_USER_CS32,UREGS_cs(%rsp)
> -        cmoveq VCPU_syscall32_addr(%rbx),%rax
> -        testq %rax,%rax
> -        cmovzq VCPU_syscall_addr(%rbx),%rax
> -        movq  %rax,TRAPBOUNCE_eip(%rdx)
> +        cmove %rcx, %rax
> +
>          /* TB_flags = VGCF_syscall_disables_events ? TBF_INTERRUPT : 0 */
>          btl   $_VGCF_syscall_disables_events,VCPU_guest_context_flags(%rbx)
>          setc  %cl
>          leal  (,%rcx,TBF_INTERRUPT),%ecx
> +
> +        test  %rax, %rax
> +UNLIKELY_START(z, syscall_no_callback) /* TB_eip == 0 => #UD */
> +        movq  VCPU_trap_ctxt(%rbx), %rdi
> +        movl  $X86_EXC_UD, UREGS_entry_vector(%rsp)
> +        subl  $2, UREGS_rip(%rsp)
> +        movl  X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_eip(%rdi), %eax

I guess you mean "movq ..., %rax"? Iirc 32-bit guests don't even get through
here.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-09-24 14:56   ` Jan Beulich
@ 2020-09-28 13:05     ` Andrew Cooper
  2020-09-28 15:35       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2020-09-28 13:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xen-devel, Roger Pau Monné, Wei Liu, Andy Lutomirski, Manuel Bouyer

On 24/09/2020 15:56, Jan Beulich wrote:
> On 23.09.2020 12:18, Andrew Cooper wrote:
>> Despite appearing to be a deliberate design choice of early PV64, the
>> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
>> testability problem for Xen.  Furthermore, the behaviour is undocumented,
>> bizarre, and inconsistent with related behaviour in Xen, and very liable
>> introduce a security vulnerability into a PV guest if the author hasn't
>> studied Xen's assembly code in detail.
>>
>> There are two different bugs here.
>>
>> 1) The current logic confuses the registered entrypoints, and may deliver a
>>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>>    entrypoint is registered.
>>
>>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
> I'm not sure what you derive the last half sentence from. To a 32-bit
> PV guest, nothing can make things look like being 64-bit.

Right, but what part of this discussion is relevant to 32bit PV guests,
when we're discussing junk data being passed to the 64bit SYSCALL entry?

> And as you
> did say in your 2018 change, FLAT_KERNEL_SS == FLAT_USER_SS32.

And? Mode is determined by CS, not SS.  A kernel suffering this failure
will find a CS claiming to be FLAT_RING1_DS/RPL3, and not
FLAT_COMPAT_USER_CS.

Even if we presume for a moment that multiplexing was a sensible plan,
there were 13 years where you couldn't rationally distinguish the two
conditions.

Considering the very obvious chaos which occurs when you try to
HYPERCALL_iret with the bogus frame, either noone ever encountered it,
or everyone used the Linux way which was to blindly overwrite Xen's
selectors with the knowledge (and by this, I mean expectation) that the
two entrypoints distinguished the originating mode.

Linux doesn't go wrong because it registers both entrypoints, but
anything else using similar logic (and only one registered entrypoint)
would end up returning to 32bit userspace in 64bit mode.

> As to the "confusion" of entry points - before the compat mode entry
> path was introduced, a 64-bit guest could only register a single
> entry point.

The fact that MSR_LSTAR and MSR_CSTAR are separate in the AMD64 spec is
a very good hint that that is how software should/would expect things to
behave.

The timing and content of c/s 02410e06fea7, which introduced the first
use of SYSCALL, looks suspiciously like it was designed to the Intel
manual, seeing as it failed to configure MSR_CSTAR entirely.

The CSTAR "fix" came later in c/s 6c94cfd1491 "Various bug fixes", which
introduced the confusion of the two entrypoints, and still hadn't been
tested on AMD as it would return to 32bit userspace in 64bit mode.

c/s 091e799a840c was the commit which introduced the syscall entrypoint.

> Hence guests at the time had to multiplex 32- and 64-bit
> user mode entry from this one code path. In order to avoid regressing
> any such guest, the falling back to using the 64-bit entry point was
> chosen. Effectively what you propose is to regress such guests now,
> rather than back then.

I completely believe that you deliberately avoided changing the existing
behaviour at the time.

I just don't find it credible that the multiplexing was a deliberate and
informed design choice originally, when it looks very much like an
accident, and was so broken for more than a decade following.

I'm not trying to ascribe blame.  I can see exactly how this happened,
especially given how broken 32bit SYSCALL was AMD (how many OSes
realised they needed to have #DB in a task gate to be safe, considering
that basically the same bug took everyone by surprise a couple of years
ago).  32bit code never used SYSCALL, so multiplexing never got used in
practice, which is why the bugs remained hidden for 13 years, and which
is why changing the behaviour now doesn't break anything.

>>    Xen would malfunction under these circumstances, if it were a PV guest.
>>    Linux would as well, but PVOps has always registered both entrypoints and
>>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
>>
>> 2) In the case that neither SYSCALL callbacks are registered, the guest will
>>    be crashed when userspace executes a SYSCALL instruction, which is a
>>    userspace => kernel DoS.
>>
>>    This has been the case ever since the introduction of 64bit PV support, but
>>    behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
>>    #GP/#UD in userspace before the callback is registered, and are therefore
>>    safe by default.
> I agree this part is an improvement.
>
>> This change does constitute a change in the PV ABI, for corner cases of a PV
>> guest kernel registering neither callback, or not registering the 32bit
>> callback when running on AMD/Hygon hardware.
>>
>> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
>> SYSENTER (safe by default, until explicitly enabled), as well as native
>> hardware (always delivered to the single applicable callback).
> Albeit an OS running natively and setting EFER.SCE is obliged to set both
> entry points; they can't have one without the other (and not be vulnerable).
> Since it's unclear what the PV equivalent of EFER.SCE is, I don't think
> comparing this particular aspect of the behavior makes a lot of sense.

A 32bit PV guest doesn't have EFER.SCE, but the act of registering the
SYSCALL32 entry point "enables" a 32bit SYSCALL to work (i.e. not #GP).

Neither type of PV guest has MSR_SYSENTER_CS, but the act of registering
the SYSENTER entry point "enables" SYSENTER to work (i.e. not #GP, if
you can call the fairly bogus resulting state "working".)

The asymmetry here is that a 64bit PV guest will be DoS'd before it
registers at least one of the two SYSCALL entry points.

>> Most importantly however, and the primary reason for the change, is that it
>> lets us actually test the PV entrypoints to prove correct behaviour.
> You mean "test the absence of PV entry points" here?

No, but that sentence clearly didn't come out right.

What I meant to say was that it lets us sensibly test all of Xen's fast
system call entrypoints, under all of the states which a PV guest can
configure.

>
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -26,18 +26,30 @@
>>  /* %rbx: struct vcpu */
>>  ENTRY(switch_to_kernel)
>>          leaq  VCPU_trap_bounce(%rbx),%rdx
>> -        /* TB_eip = (32-bit syscall && syscall32_addr) ?
>> -         *          syscall32_addr : syscall_addr */
>> -        xor   %eax,%eax
>> +
>> +        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
>> +        mov   VCPU_syscall32_addr(%rbx), %ecx
>> +        mov   VCPU_syscall_addr(%rbx), %rax
>>          cmpw  $FLAT_USER_CS32,UREGS_cs(%rsp)
>> -        cmoveq VCPU_syscall32_addr(%rbx),%rax
>> -        testq %rax,%rax
>> -        cmovzq VCPU_syscall_addr(%rbx),%rax
>> -        movq  %rax,TRAPBOUNCE_eip(%rdx)
>> +        cmove %rcx, %rax
>> +
>>          /* TB_flags = VGCF_syscall_disables_events ? TBF_INTERRUPT : 0 */
>>          btl   $_VGCF_syscall_disables_events,VCPU_guest_context_flags(%rbx)
>>          setc  %cl
>>          leal  (,%rcx,TBF_INTERRUPT),%ecx
>> +
>> +        test  %rax, %rax
>> +UNLIKELY_START(z, syscall_no_callback) /* TB_eip == 0 => #UD */
>> +        movq  VCPU_trap_ctxt(%rbx), %rdi
>> +        movl  $X86_EXC_UD, UREGS_entry_vector(%rsp)
>> +        subl  $2, UREGS_rip(%rsp)
>> +        movl  X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_eip(%rdi), %eax
> I guess you mean "movq ..., %rax"? Iirc 32-bit guests don't even get through
> here.

Oops - you're quite right.  This only passes testing because XTF is
linked at 1M.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-09-28 13:05     ` Andrew Cooper
@ 2020-09-28 15:35       ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2020-09-28 15:35 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Roger Pau Monné, Wei Liu, Andy Lutomirski, Manuel Bouyer

On 28.09.2020 15:05, Andrew Cooper wrote:
> On 24/09/2020 15:56, Jan Beulich wrote:
>> On 23.09.2020 12:18, Andrew Cooper wrote:
>>> Despite appearing to be a deliberate design choice of early PV64, the
>>> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
>>> testability problem for Xen.  Furthermore, the behaviour is undocumented,
>>> bizarre, and inconsistent with related behaviour in Xen, and very liable
>>> introduce a security vulnerability into a PV guest if the author hasn't
>>> studied Xen's assembly code in detail.
>>>
>>> There are two different bugs here.
>>>
>>> 1) The current logic confuses the registered entrypoints, and may deliver a
>>>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>>>    entrypoint is registered.
>>>
>>>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>>>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>>>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
>> I'm not sure what you derive the last half sentence from. To a 32-bit
>> PV guest, nothing can make things look like being 64-bit.
> 
> Right, but what part of this discussion is relevant to 32bit PV guests,
> when we're discussing junk data being passed to the 64bit SYSCALL entry?

To me your text doesn't make it clear that you only talk about 64-bit
guests there. Talking about 32-bit user space doesn't imply a 64-bit
kernel to me.

>> And as you
>> did say in your 2018 change, FLAT_KERNEL_SS == FLAT_USER_SS32.
> 
> And? Mode is determined by CS, not SS.  A kernel suffering this failure
> will find a CS claiming to be FLAT_RING1_DS/RPL3, and not
> FLAT_COMPAT_USER_CS.

Yet how does this make anyone looking think of this being a 64-bit
entry then?

As to CS vs SS - I think the canonical CPL is SS.DPL, and since
for SS DPL == RPL, also SS.RPL. At least that's what we use in a
number of places.

> Even if we presume for a moment that multiplexing was a sensible plan,
> there were 13 years where you couldn't rationally distinguish the two
> conditions.
> 
> Considering the very obvious chaos which occurs when you try to
> HYPERCALL_iret with the bogus frame, either noone ever encountered it,
> or everyone used the Linux way which was to blindly overwrite Xen's
> selectors with the knowledge (and by this, I mean expectation) that the
> two entrypoints distinguished the originating mode.
> 
> Linux doesn't go wrong because it registers both entrypoints, but
> anything else using similar logic (and only one registered entrypoint)
> would end up returning to 32bit userspace in 64bit mode.
> 
>> As to the "confusion" of entry points - before the compat mode entry
>> path was introduced, a 64-bit guest could only register a single
>> entry point.
> 
> The fact that MSR_LSTAR and MSR_CSTAR are separate in the AMD64 spec is
> a very good hint that that is how software should/would expect things to
> behave.
> 
> The timing and content of c/s 02410e06fea7, which introduced the first
> use of SYSCALL, looks suspiciously like it was designed to the Intel
> manual, seeing as it failed to configure MSR_CSTAR entirely.
> 
> The CSTAR "fix" came later in c/s 6c94cfd1491 "Various bug fixes", which
> introduced the confusion of the two entrypoints, and still hadn't been
> tested on AMD as it would return to 32bit userspace in 64bit mode.
> 
> c/s 091e799a840c was the commit which introduced the syscall entrypoint.
> 
>> Hence guests at the time had to multiplex 32- and 64-bit
>> user mode entry from this one code path. In order to avoid regressing
>> any such guest, the falling back to using the 64-bit entry point was
>> chosen. Effectively what you propose is to regress such guests now,
>> rather than back then.
> 
> I completely believe that you deliberately avoided changing the existing
> behaviour at the time.
> 
> I just don't find it credible that the multiplexing was a deliberate and
> informed design choice originally, when it looks very much like an
> accident, and was so broken for more than a decade following.

The problem here is once again the lack of documentation of the ABI.
As such, the behavior of the implementation, of however good or bad
intent, has been the reference. And hence I don't see us changing the
behavior as a viable thing.

If you can get e.g. Roger to support your position and provide an
ack to this change, I guess I'm willing to accept the change going in
as it is. But I'm afraid I can't give it my R-b.

>>> This change does constitute a change in the PV ABI, for corner cases of a PV
>>> guest kernel registering neither callback, or not registering the 32bit
>>> callback when running on AMD/Hygon hardware.
>>>
>>> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
>>> SYSENTER (safe by default, until explicitly enabled), as well as native
>>> hardware (always delivered to the single applicable callback).
>> Albeit an OS running natively and setting EFER.SCE is obliged to set both
>> entry points; they can't have one without the other (and not be vulnerable).
>> Since it's unclear what the PV equivalent of EFER.SCE is, I don't think
>> comparing this particular aspect of the behavior makes a lot of sense.
> 
> A 32bit PV guest doesn't have EFER.SCE, but the act of registering the
> SYSCALL32 entry point "enables" a 32bit SYSCALL to work (i.e. not #GP).
> 
> Neither type of PV guest has MSR_SYSENTER_CS, but the act of registering
> the SYSENTER entry point "enables" SYSENTER to work (i.e. not #GP, if
> you can call the fairly bogus resulting state "working".)
> 
> The asymmetry here is that a 64bit PV guest will be DoS'd before it
> registers at least one of the two SYSCALL entry points.

Well, that's the part of the change that we look to agree to make. Our
disagreement is merely about what is to happen when only a 64-bit
entry point gets registered. I was wondering whether you wouldn't be
willing to split the change, so that the uncontroversial part can go
in. But I do realize that's extra work on your part, which
- considering the position you take - you may view as pointless effort.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-09-23 10:18 ` [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks Andrew Cooper
  2020-09-24 14:56   ` Jan Beulich
@ 2020-10-09 11:53   ` Andrew Cooper
  2020-10-09 12:40     ` Manuel Bouyer
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-10-09 11:53 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Andy Lutomirski, Manuel Bouyer

Despite appearing to be a deliberate design choice of early PV64, the
resulting behaviour for unregistered SYSCALL callbacks creates an untenable
testability problem for Xen.  Furthermore, the behaviour is undocumented,
bizarre, and inconsistent with related behaviour in Xen, and very liable
introduce a security vulnerability into a PV guest if the author hasn't
studied Xen's assembly code in detail.

There are two different bugs here.

1) The current logic confuses the registered entrypoints, and may deliver a
   SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
   entrypoint is registered.

   This has been the case ever since 2007 (c/s cd75d47348b) but up until
   2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
   a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.

   Xen would malfunction under these circumstances, if it were a PV guest.
   Linux would as well, but PVOps has always registered both entrypoints and
   discarded the Xen-provided selectors.  NetBSD really does malfunction as a
   consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).

2) In the case that neither SYSCALL callbacks are registered, the guest will
   be crashed when userspace executes a SYSCALL instruction, which is a
   userspace => kernel DoS.

   This has been the case ever since the introduction of 64bit PV support, but
   behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
   #GP/#UD in userspace before the callback is registered, and are therefore
   safe by default.

This change does constitute a change in the PV ABI, for corner cases of a PV
guest kernel registering neither callback, or not registering the 32bit
callback when running on AMD/Hygon hardware.

It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
SYSENTER (safe by default, until explicitly enabled), as well as native
hardware (always delivered to the single applicable callback).

Most importantly however, and the primary reason for the change, is that it
lets us sensibly test the fast system call entrypoints under all states a PV
guest can construct, to prove correct behaviour.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Manuel Bouyer <bouyer@antioche.eu.org>

v2:
 * Drop unnecessary instruction suffixes
 * Don't truncate #UD entrypoint to 32 bits

Manuel: This will result in a corner case change for NetBSD.

At the moment on native, 32bit userspace on 64bit NetBSD will get #UD (Intel,
etc), or an explicit -ENOSYS (AMD, etc) when trying to execute a 32bit SYSCALL
instruction.

After this change, a 64bit PV VM will consistently see #UD (like on Intel, etc
hardware) even when running on AMD/Hygon hardware (as Xsyscall32 isn't
registered with Xen), rather than following Xsyscall into the proper system
call path.
---
 xen/arch/x86/x86_64/entry.S | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 000eb9722b..aaf8402f93 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -26,18 +26,30 @@
 /* %rbx: struct vcpu */
 ENTRY(switch_to_kernel)
         leaq  VCPU_trap_bounce(%rbx),%rdx
-        /* TB_eip = (32-bit syscall && syscall32_addr) ?
-         *          syscall32_addr : syscall_addr */
-        xor   %eax,%eax
+
+        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
+        mov   VCPU_syscall32_addr(%rbx), %ecx
+        mov   VCPU_syscall_addr(%rbx), %rax
         cmpw  $FLAT_USER_CS32,UREGS_cs(%rsp)
-        cmoveq VCPU_syscall32_addr(%rbx),%rax
-        testq %rax,%rax
-        cmovzq VCPU_syscall_addr(%rbx),%rax
-        movq  %rax,TRAPBOUNCE_eip(%rdx)
+        cmove %rcx, %rax
+
         /* TB_flags = VGCF_syscall_disables_events ? TBF_INTERRUPT : 0 */
         btl   $_VGCF_syscall_disables_events,VCPU_guest_context_flags(%rbx)
         setc  %cl
         leal  (,%rcx,TBF_INTERRUPT),%ecx
+
+        test  %rax, %rax
+UNLIKELY_START(z, syscall_no_callback) /* TB_eip == 0 => #UD */
+        mov   VCPU_trap_ctxt(%rbx), %rdi
+        movl  $X86_EXC_UD, UREGS_entry_vector(%rsp)
+        subl  $2, UREGS_rip(%rsp)
+        mov   X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_eip(%rdi), %rax
+        testb $4, X86_EXC_UD * TRAPINFO_sizeof + TRAPINFO_flags(%rdi)
+        setnz %cl
+        lea   TBF_EXCEPTION(, %rcx, TBF_INTERRUPT), %ecx
+UNLIKELY_END(syscall_no_callback)
+
+        movq  %rax,TRAPBOUNCE_eip(%rdx)
         movb  %cl,TRAPBOUNCE_flags(%rdx)
         call  create_bounce_frame
         andl  $~X86_EFLAGS_DF,UREGS_eflags(%rsp)
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
@ 2020-10-09 12:40     ` Manuel Bouyer
  2020-10-09 12:50       ` Andrew Cooper
  2020-10-14 14:16     ` Roger Pau Monné
  2020-10-14 16:28     ` Roger Pau Monné
  2 siblings, 1 reply; 18+ messages in thread
From: Manuel Bouyer @ 2020-10-09 12:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné, Wei Liu, Andy Lutomirski

On Fri, Oct 09, 2020 at 12:53:01PM +0100, Andrew Cooper wrote:
> Despite appearing to be a deliberate design choice of early PV64, the
> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
> testability problem for Xen.  Furthermore, the behaviour is undocumented,
> bizarre, and inconsistent with related behaviour in Xen, and very liable
> introduce a security vulnerability into a PV guest if the author hasn't
> studied Xen's assembly code in detail.
> 
> There are two different bugs here.
> 
> 1) The current logic confuses the registered entrypoints, and may deliver a
>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>    entrypoint is registered.
> 
>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
> 
>    Xen would malfunction under these circumstances, if it were a PV guest.
>    Linux would as well, but PVOps has always registered both entrypoints and
>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).

What do you mean with «malfunction» ? A 64bits guest can run 32bit code
just fine, this is part of our daily regression tests.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-09 12:40     ` Manuel Bouyer
@ 2020-10-09 12:50       ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-10-09 12:50 UTC (permalink / raw)
  To: Manuel Bouyer
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné, Wei Liu, Andy Lutomirski

On 09/10/2020 13:40, Manuel Bouyer wrote:
> On Fri, Oct 09, 2020 at 12:53:01PM +0100, Andrew Cooper wrote:
>> Despite appearing to be a deliberate design choice of early PV64, the
>> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
>> testability problem for Xen.  Furthermore, the behaviour is undocumented,
>> bizarre, and inconsistent with related behaviour in Xen, and very liable
>> introduce a security vulnerability into a PV guest if the author hasn't
>> studied Xen's assembly code in detail.
>>
>> There are two different bugs here.
>>
>> 1) The current logic confuses the registered entrypoints, and may deliver a
>>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>>    entrypoint is registered.
>>
>>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
>>
>>    Xen would malfunction under these circumstances, if it were a PV guest.
>>    Linux would as well, but PVOps has always registered both entrypoints and
>>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
> What do you mean with «malfunction» ? A 64bits guest can run 32bit code
> just fine, this is part of our daily regression tests.

Right, but your 32bit code never executes the SYSCALL instruction,
because it is hardwired as -ENOSYS on native, and doesn't work on Intel
hardware at all.

Under Xen, this enters the regular syscall path (buggy but benign), and
before the selector fix two years ago, would (AFAICT) eventually try to
HYPERCALL_iret with the bogus selectors, and hit the failsafe callback,
which is a straight panic().

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
  2020-10-09 12:40     ` Manuel Bouyer
@ 2020-10-14 14:16     ` Roger Pau Monné
  2020-10-14 14:20       ` Manuel Bouyer
  2020-10-14 15:17       ` Andrew Cooper
  2020-10-14 16:28     ` Roger Pau Monné
  2 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2020-10-14 14:16 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski, Manuel Bouyer

On Fri, Oct 09, 2020 at 12:53:01PM +0100, Andrew Cooper wrote:
> Despite appearing to be a deliberate design choice of early PV64, the
> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
> testability problem for Xen.  Furthermore, the behaviour is undocumented,
> bizarre, and inconsistent with related behaviour in Xen, and very liable
> introduce a security vulnerability into a PV guest if the author hasn't
> studied Xen's assembly code in detail.
> 
> There are two different bugs here.
> 
> 1) The current logic confuses the registered entrypoints, and may deliver a
>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>    entrypoint is registered.
> 
>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
> 
>    Xen would malfunction under these circumstances, if it were a PV guest.
>    Linux would as well, but PVOps has always registered both entrypoints and
>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
> 
> 2) In the case that neither SYSCALL callbacks are registered, the guest will
>    be crashed when userspace executes a SYSCALL instruction, which is a
>    userspace => kernel DoS.
> 
>    This has been the case ever since the introduction of 64bit PV support, but
>    behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
>    #GP/#UD in userspace before the callback is registered, and are therefore
>    safe by default.

This seems fairly reasonable, as it turns a guest crash into an #UD
AFAICT.

> This change does constitute a change in the PV ABI, for corner cases of a PV
> guest kernel registering neither callback, or not registering the 32bit
> callback when running on AMD/Hygon hardware.

Is there any place suitable to document this behavior?

> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
> SYSENTER (safe by default, until explicitly enabled), as well as native
> hardware (always delivered to the single applicable callback).
> 
> Most importantly however, and the primary reason for the change, is that it
> lets us sensibly test the fast system call entrypoints under all states a PV
> guest can construct, to prove correct behaviour.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Manuel Bouyer <bouyer@antioche.eu.org>
> 
> v2:
>  * Drop unnecessary instruction suffixes
>  * Don't truncate #UD entrypoint to 32 bits
> 
> Manuel: This will result in a corner case change for NetBSD.
> 
> At the moment on native, 32bit userspace on 64bit NetBSD will get #UD (Intel,
> etc), or an explicit -ENOSYS (AMD, etc) when trying to execute a 32bit SYSCALL
> instruction.
> 
> After this change, a 64bit PV VM will consistently see #UD (like on Intel, etc
> hardware) even when running on AMD/Hygon hardware (as Xsyscall32 isn't
> registered with Xen), rather than following Xsyscall into the proper system
> call path.

Would this result in a regression for NetBSD then? Is it fine to see
#UD regardless of the platform? It's not clear to me from the text
above whether this change will cause issues with NetBSD.

Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-14 14:16     ` Roger Pau Monné
@ 2020-10-14 14:20       ` Manuel Bouyer
  2020-10-14 14:26         ` Andrew Cooper
  2020-10-14 15:17       ` Andrew Cooper
  1 sibling, 1 reply; 18+ messages in thread
From: Manuel Bouyer @ 2020-10-14 14:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski

On Wed, Oct 14, 2020 at 04:16:20PM +0200, Roger Pau Monné wrote:
> [...]
> Would this result in a regression for NetBSD then? Is it fine to see
> #UD regardless of the platform? It's not clear to me from the text
> above whether this change will cause issues with NetBSD.

AFAIK this should not cause any issue. If I understand it properly,
SYSCALL in a 32bit context would not work in any case on Intel CPUs.
The patch just makes if fail on AMD cpus the same way it fails on Intel.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-14 14:20       ` Manuel Bouyer
@ 2020-10-14 14:26         ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-10-14 14:26 UTC (permalink / raw)
  To: Manuel Bouyer, Roger Pau Monné
  Cc: Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski

On 14/10/2020 15:20, Manuel Bouyer wrote:
> On Wed, Oct 14, 2020 at 04:16:20PM +0200, Roger Pau Monné wrote:
>> [...]
>> Would this result in a regression for NetBSD then? Is it fine to see
>> #UD regardless of the platform? It's not clear to me from the text
>> above whether this change will cause issues with NetBSD.
> AFAIK this should not cause any issue. If I understand it properly,
> SYSCALL in a 32bit context would not work in any case on Intel CPUs.
> The patch just makes if fail on AMD cpus the same way it fails on Intel.

Correct.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-14 14:16     ` Roger Pau Monné
  2020-10-14 14:20       ` Manuel Bouyer
@ 2020-10-14 15:17       ` Andrew Cooper
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-10-14 15:17 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski, Manuel Bouyer

On 14/10/2020 15:16, Roger Pau Monné wrote:
>> This change does constitute a change in the PV ABI, for corner cases of a PV
>> guest kernel registering neither callback, or not registering the 32bit
>> callback when running on AMD/Hygon hardware.
> Is there any place suitable to document this behavior?

In the short term, my XTF test which will eventually get into CI.

Longer term, my theoretical future where I've described some of this
stuff in docs/guest-guide/

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
  2020-10-09 12:40     ` Manuel Bouyer
  2020-10-14 14:16     ` Roger Pau Monné
@ 2020-10-14 16:28     ` Roger Pau Monné
  2020-10-14 17:41       ` Andrew Cooper
  2 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2020-10-14 16:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski, Manuel Bouyer

On Fri, Oct 09, 2020 at 12:53:01PM +0100, Andrew Cooper wrote:
> Despite appearing to be a deliberate design choice of early PV64, the
> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
> testability problem for Xen.  Furthermore, the behaviour is undocumented,
> bizarre, and inconsistent with related behaviour in Xen, and very liable
> introduce a security vulnerability into a PV guest if the author hasn't
> studied Xen's assembly code in detail.
> 
> There are two different bugs here.
> 
> 1) The current logic confuses the registered entrypoints, and may deliver a
>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>    entrypoint is registered.
> 
>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
> 
>    Xen would malfunction under these circumstances, if it were a PV guest.
>    Linux would as well, but PVOps has always registered both entrypoints and
>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
> 
> 2) In the case that neither SYSCALL callbacks are registered, the guest will
>    be crashed when userspace executes a SYSCALL instruction, which is a
>    userspace => kernel DoS.
> 
>    This has been the case ever since the introduction of 64bit PV support, but
>    behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
>    #GP/#UD in userspace before the callback is registered, and are therefore
>    safe by default.
> 
> This change does constitute a change in the PV ABI, for corner cases of a PV
> guest kernel registering neither callback, or not registering the 32bit
> callback when running on AMD/Hygon hardware.
> 
> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
> SYSENTER (safe by default, until explicitly enabled), as well as native
> hardware (always delivered to the single applicable callback).
> 
> Most importantly however, and the primary reason for the change, is that it
> lets us sensibly test the fast system call entrypoints under all states a PV
> guest can construct, to prove correct behaviour.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Manuel Bouyer <bouyer@antioche.eu.org>
> 
> v2:
>  * Drop unnecessary instruction suffixes
>  * Don't truncate #UD entrypoint to 32 bits
> 
> Manuel: This will result in a corner case change for NetBSD.
> 
> At the moment on native, 32bit userspace on 64bit NetBSD will get #UD (Intel,
> etc), or an explicit -ENOSYS (AMD, etc) when trying to execute a 32bit SYSCALL
> instruction.
> 
> After this change, a 64bit PV VM will consistently see #UD (like on Intel, etc
> hardware) even when running on AMD/Hygon hardware (as Xsyscall32 isn't
> registered with Xen), rather than following Xsyscall into the proper system
> call path.
> ---
>  xen/arch/x86/x86_64/entry.S | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index 000eb9722b..aaf8402f93 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -26,18 +26,30 @@
>  /* %rbx: struct vcpu */
>  ENTRY(switch_to_kernel)
>          leaq  VCPU_trap_bounce(%rbx),%rdx
> -        /* TB_eip = (32-bit syscall && syscall32_addr) ?
> -         *          syscall32_addr : syscall_addr */
> -        xor   %eax,%eax
> +
> +        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
> +        mov   VCPU_syscall32_addr(%rbx), %ecx

This being an unsigned long field, shouldn't you use %rcx here?

Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86/pv: Inject #UD for missing SYSCALL callbacks
  2020-10-14 16:28     ` Roger Pau Monné
@ 2020-10-14 17:41       ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2020-10-14 17:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, Jan Beulich, Wei Liu, Andy Lutomirski, Manuel Bouyer

On 14/10/2020 17:28, Roger Pau Monné wrote:
> On Fri, Oct 09, 2020 at 12:53:01PM +0100, Andrew Cooper wrote:
>> Despite appearing to be a deliberate design choice of early PV64, the
>> resulting behaviour for unregistered SYSCALL callbacks creates an untenable
>> testability problem for Xen.  Furthermore, the behaviour is undocumented,
>> bizarre, and inconsistent with related behaviour in Xen, and very liable
>> introduce a security vulnerability into a PV guest if the author hasn't
>> studied Xen's assembly code in detail.
>>
>> There are two different bugs here.
>>
>> 1) The current logic confuses the registered entrypoints, and may deliver a
>>    SYSCALL from 32bit userspace to the 64bit entry, when only a 64bit
>>    entrypoint is registered.
>>
>>    This has been the case ever since 2007 (c/s cd75d47348b) but up until
>>    2018 (c/s dba899de14) the wrong selectors would be handed to the guest for
>>    a 32bit SYSCALL entry, making it appear as if it a 64bit entry all along.
>>
>>    Xen would malfunction under these circumstances, if it were a PV guest.
>>    Linux would as well, but PVOps has always registered both entrypoints and
>>    discarded the Xen-provided selectors.  NetBSD really does malfunction as a
>>    consequence (benignly now, but a VM DoS before the 2018 Xen selector fix).
>>
>> 2) In the case that neither SYSCALL callbacks are registered, the guest will
>>    be crashed when userspace executes a SYSCALL instruction, which is a
>>    userspace => kernel DoS.
>>
>>    This has been the case ever since the introduction of 64bit PV support, but
>>    behaves unlike all other SYSCALL/SYSENTER callbacks in Xen, which yield
>>    #GP/#UD in userspace before the callback is registered, and are therefore
>>    safe by default.
>>
>> This change does constitute a change in the PV ABI, for corner cases of a PV
>> guest kernel registering neither callback, or not registering the 32bit
>> callback when running on AMD/Hygon hardware.
>>
>> It brings the behaviour in line with PV32 SYSCALL/SYSENTER, and PV64
>> SYSENTER (safe by default, until explicitly enabled), as well as native
>> hardware (always delivered to the single applicable callback).
>>
>> Most importantly however, and the primary reason for the change, is that it
>> lets us sensibly test the fast system call entrypoints under all states a PV
>> guest can construct, to prove correct behaviour.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>> CC: Andy Lutomirski <luto@kernel.org>
>> CC: Manuel Bouyer <bouyer@antioche.eu.org>
>>
>> v2:
>>  * Drop unnecessary instruction suffixes
>>  * Don't truncate #UD entrypoint to 32 bits
>>
>> Manuel: This will result in a corner case change for NetBSD.
>>
>> At the moment on native, 32bit userspace on 64bit NetBSD will get #UD (Intel,
>> etc), or an explicit -ENOSYS (AMD, etc) when trying to execute a 32bit SYSCALL
>> instruction.
>>
>> After this change, a 64bit PV VM will consistently see #UD (like on Intel, etc
>> hardware) even when running on AMD/Hygon hardware (as Xsyscall32 isn't
>> registered with Xen), rather than following Xsyscall into the proper system
>> call path.
>> ---
>>  xen/arch/x86/x86_64/entry.S | 26 +++++++++++++++++++-------
>>  1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
>> index 000eb9722b..aaf8402f93 100644
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -26,18 +26,30 @@
>>  /* %rbx: struct vcpu */
>>  ENTRY(switch_to_kernel)
>>          leaq  VCPU_trap_bounce(%rbx),%rdx
>> -        /* TB_eip = (32-bit syscall && syscall32_addr) ?
>> -         *          syscall32_addr : syscall_addr */
>> -        xor   %eax,%eax
>> +
>> +        /* TB_eip = 32-bit syscall ? syscall32_addr : syscall_addr */
>> +        mov   VCPU_syscall32_addr(%rbx), %ecx
> This being an unsigned long field, shouldn't you use %rcx here?

Yes I should.  Sorry - thought I'd fixed all of these.  I'll ad higher
half handlers to the XTF test.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-10-14 17:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 10:18 [PATCH 0/3] x86/pv: Multiple fixes to SYSCALL/SYSENTER handling (XSA-339 followup) Andrew Cooper
2020-09-23 10:18 ` [PATCH 1/3] x86/pv: Don't deliver #GP for a SYSENTER with NT set Andrew Cooper
2020-09-24 13:55   ` Jan Beulich
2020-09-23 10:18 ` [PATCH 2/3] x86/pv: Don't clobber NT on return-to-guest Andrew Cooper
2020-09-24 13:57   ` Jan Beulich
2020-09-23 10:18 ` [PATCH 3/3] x86/pv: Inject #UD for missing SYSCALL callbacks Andrew Cooper
2020-09-24 14:56   ` Jan Beulich
2020-09-28 13:05     ` Andrew Cooper
2020-09-28 15:35       ` Jan Beulich
2020-10-09 11:53   ` [PATCH v2] " Andrew Cooper
2020-10-09 12:40     ` Manuel Bouyer
2020-10-09 12:50       ` Andrew Cooper
2020-10-14 14:16     ` Roger Pau Monné
2020-10-14 14:20       ` Manuel Bouyer
2020-10-14 14:26         ` Andrew Cooper
2020-10-14 15:17       ` Andrew Cooper
2020-10-14 16:28     ` Roger Pau Monné
2020-10-14 17:41       ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.