All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
@ 2017-05-19 16:14 Roman Penyaev
  2017-05-21  3:31 ` Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Roman Penyaev @ 2017-05-19 16:14 UTC (permalink / raw)
  To: Roman Pen, Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov,
	kvm, linux-kernel

Hi folks,

After experiencing guest double faults (sometimes triple faults) on
3.16 guest kernels with the following common pattern:

[459395.776124] PANIC: double fault, error_code: 0x0
[459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti:
ffff88003cbc2000
[459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>]
__do_page_fault+0x1f/0x540
[459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
[459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX:
ffffffff81539927
[459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI:
00007ffe0bc9c0a8
[459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09:
0000000000063f1b
[459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12:
0000000000000028
[459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15:
00007ffe0bc9c7c0
[459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000)
knlGS:0000000000000000
[459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4:
00000000000407f0
[459395.776658] Stack:
[459395.776661]  0000000000000000 0000000000000000 0000000000000000
0000000000000000
[459395.776666]  0000000000000000 0000000000000000 0000000000000000
0000000000000000
[459395.776670]  0000000000000000 0000000000000000 0000000000000000
[459395.776674] Call Trace:
[459395.776676]  <UNK>
[459395.776678] Code:
[459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41
57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00
00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
[459395.776716] Kernel panic - not syncing: Machine halted.
[459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9
ffffffff8170203e
[459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008
ffff88003f402f28
[459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008
0000000000000046
[459395.780061] Call Trace:
[459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
[459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
[459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
[459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
[459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
[459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
[459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540

we found out that all kernel backtraces have userspace RSP, where
userspace memory has normal timer, page fault or virtio interrupts
trail:

(the following RSP pointer does not belong to this particular crash
 above, but it does not matter, symptoms are always the same)

crash> rd -s 00007f6cb9556768 100
    7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
    7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
    7f6cb9556788:  00007f6cf983399a 0000000000000000
    7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
    7f6cb95567a8:  00007f6cfaa21270 0000000000000000
    7f6cb95567b8:  00007f6ca4031880
                                    ffffffffffffff7e  IRQ,
~0xffffffffffffff7e = 0x81
    7f6cb95567c8:  00007f6cfa817ae1 0000000000000033  RIP; CS
    7f6cb95567d8:  0000000000000202 00007f6cb95567f0  EFLAGS; RSP
    7f6cb95567e8:  000000000000002b                   SS
                                    00007f6cb00318e0
    7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
    7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
    7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
    7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
    7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
    7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
    7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
    7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
    7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
    7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50

It turned out to be that CPU does not change SS:RSP and took interrupt
on userspace stack (BTW init_tss and gdb_page are not corrupted).
That is completely weird.

Next step was to trace VMCB before and after VMRUN to understand the
exact state seen by real CPU.  VMCB was traced when wrong CPU state is
observed: RIP is to kernel and RSP is from userspace.  The following
is a diff of VMCB, where
--- is the state before VMRUN and
+++ is the state after VMRUN:

 -      event_inj = 0x80000081,
 +      event_inj = 0x0,

        ...

        cs = {
 -        selector = 0x33,
 -        attrib = 0xafb,
 +        selector = 0x10,
 +        attrib = 0x29b,
          limit = 0xffffffff,
          base = 0x0
        },
        ss = {
          selector = 0x2b,
          attrib = 0x0,
          limit = 0xffffffff,
          base = 0x0
        },

        ...

        cpl = 0x0,              <<<<<< WTF?

        ...

 -      rip = 0x7f6cfa817ae1,
 +      rip = 0xffffffff8152b690,

        ...

 -      rsp = 0x7f6cb95567f0,
 +      rsp = 0x7f6cb9556768,


The execution scenario is the following:

  1. Userspace code was suspended.
  2. Virtio 0x80000081 interrupt was injected and VMRUN called again.
  3. HW CPU took virtual interrupt, but did not switch the stack because
     CPL *was* already set to 0.
  4. KABOOOOOM.

So the stack switch does not happen because "the processor performs an
automatic stack switch when a control transfer causes a change in
privilege levels to occur" (AMD manual), and in this case privilege
level was not changed.

The question remains who and why changed CPL to 0?

Further tracking of VMCB states before and after VMRUN showed, that
CPL becomes 0 when VMEXIT happens with the following SS segment:

          ss = {
            selector = 0x2b,
            attrib = 0x400,
            limit = 0xffffffff,
            base = 0x0
          },

          cpl = 0x3

Then on next VMRUN VMCB looks as the following:

          ss = {
            selector = 0x2b,
            attrib = 0x0,            <<< dropped to 0
            limit = 0xffffffff,
            base = 0x0
          },

          cpl = 0x0,                 <<< dropped to 0

Obviously it was changed between VMRUN calls.  The following backtrace
shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:

  CPU: 55 PID: 59531 Comm: kvm
  [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
  [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
  [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
  [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
  [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0

SS segment which came from QEMU had the following struct members:

       SS->base      = 0
       SS->limit     = ffffffff
       SS->selector  = 2b
       SS->type      = 0
       SS->present   = 0
       SS->dpl       = 0
       SS->db        = 0
       SS->s         = 0
       SS->l         = 0
       SS->g         = 0
       SS->avl       = 0
       SS->unusable  = 1

Indeed, on last VMEXIT SS segment does not have (P) present bit set in
segment attributes:

   (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
   $1 = 0

So when on VMEXIT we have such SS state (P bit is not set) and QEMU
just decides to synchronize registers the following happens on QEMU
side:

   kvm_cpu_synchronize_state():
       kvm_arch_get_registers():
           ...
           get_seg():
              if (rhs->unusable) {
                  lhs->flags = 0;    <<< SS is unusable [(P) is not set)
                                     <<< all attributes are dropped to 0.
               }
       cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty


On next VCPU enter registers will be sent from QEMU to KVM back, in order
to sync the state.  In its turn svm_set_segment() on KVM side has the
following code:

    if (var->unusable)    <<< Yes, SS is unusable, attrib is dropped to 0
        s->attrib = 0;
    ...
    if (seg == VCPU_SREG_SS)
        svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                              Now CPL is also dropped to 0


We see two possible fixes:

1. Simple one, KVM SVM side, which makes sure that CPL is not updated
   if segment is unusable:

   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1549,7 +1549,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
            * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
            * would entail passing the CPL to userspace and back.
            */
   -       if (seg == VCPU_SREG_SS)
   +       if (seg == VCPU_SREG_SS && !var->unusable)
                   svm->vmcb->save.cpl = (s->attrib >>
SVM_SELECTOR_DPL_SHIFT) & 3;

2. A bit complicated, which makes sure the CPL field is preserved across
   KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
   functionality symmethric:

   KVM SVM side:
   -------------

   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
            * would entail passing the CPL to userspace and back.
            */
           if (seg == VCPU_SREG_SS)
   -               svm->vmcb->save.cpl = (s->attrib >>
SVM_SELECTOR_DPL_SHIFT) & 3;
   +               svm->vmcb->save.cpl = (var->dpl & 3);

           mark_dirty(svm->vmcb, VMCB_SEG);
   }

   QEMU side:
   ----------

   --- a/target/i386/kvm.c
   +++ b/target/i386/kvm.c
   @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
        get_seg(&env->segs[R_FS], &sregs.fs);
        get_seg(&env->segs[R_GS], &sregs.gs);
        get_seg(&env->segs[R_SS], &sregs.ss);
   +    if (sregs.ss.unusable)
   +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;

        get_seg(&env->tr, &sregs.tr);
        get_seg(&env->ldt, &sregs.ldt);


Current email is an RFC since for us is not fully clear is it really
needed to preserve DPL across KVM_SET/GET_SREGS calls when segment
is unusable.  E.g. there was a commit:

4cae9c97967a ("target-i386: kvm: clear unusable segments' flags in migration")

which in purpose drops all segment flags to zero on QEMU side in order
to fix guests migration.

--
Roman

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-19 16:14 [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present Roman Penyaev
@ 2017-05-21  3:31 ` Andy Lutomirski
  2017-05-21  7:53   ` Roman Penyaev
  2017-05-30 14:47 ` Paolo Bonzini
  2017-05-30 15:13 ` Paolo Bonzini
  2 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2017-05-21  3:31 UTC (permalink / raw)
  To: Roman Penyaev, Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov,
	kvm, linux-kernel, Borislav Petkov

On 05/19/2017 09:14 AM, Roman Penyaev wrote:
> Hi folks,
> 
> After experiencing guest double faults (sometimes triple faults) on
> 3.16 guest kernels with the following common pattern:
> 
> [459395.776124] PANIC: double fault, error_code: 0x0
> [459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
> [459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
> [459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti:
> ffff88003cbc2000
> [459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>]
> __do_page_fault+0x1f/0x540
> [459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
> [459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX:
> ffffffff81539927
> [459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI:
> 00007ffe0bc9c0a8
> [459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09:
> 0000000000063f1b
> [459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12:
> 0000000000000028
> [459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15:
> 00007ffe0bc9c7c0
> [459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000)
> knlGS:0000000000000000
> [459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4:
> 00000000000407f0
> [459395.776658] Stack:
> [459395.776661]  0000000000000000 0000000000000000 0000000000000000
> 0000000000000000
> [459395.776666]  0000000000000000 0000000000000000 0000000000000000
> 0000000000000000
> [459395.776670]  0000000000000000 0000000000000000 0000000000000000
> [459395.776674] Call Trace:
> [459395.776676]  <UNK>
> [459395.776678] Code:
> [459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41
> 57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00
> 00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
> [459395.776716] Kernel panic - not syncing: Machine halted.
> [459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
> [459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
> [459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9
> ffffffff8170203e
> [459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008
> ffff88003f402f28
> [459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008
> 0000000000000046
> [459395.780061] Call Trace:
> [459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
> [459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
> [459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
> [459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
> [459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
> [459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
> [459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540
> 
> we found out that all kernel backtraces have userspace RSP, where
> userspace memory has normal timer, page fault or virtio interrupts
> trail:
> 
> (the following RSP pointer does not belong to this particular crash
>   above, but it does not matter, symptoms are always the same)
> 
> crash> rd -s 00007f6cb9556768 100
>      7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
>      7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
>      7f6cb9556788:  00007f6cf983399a 0000000000000000
>      7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
>      7f6cb95567a8:  00007f6cfaa21270 0000000000000000
>      7f6cb95567b8:  00007f6ca4031880
>                                      ffffffffffffff7e  IRQ,
> ~0xffffffffffffff7e = 0x81
>      7f6cb95567c8:  00007f6cfa817ae1 0000000000000033  RIP; CS
>      7f6cb95567d8:  0000000000000202 00007f6cb95567f0  EFLAGS; RSP
>      7f6cb95567e8:  000000000000002b                   SS
>                                      00007f6cb00318e0
>      7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
>      7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
>      7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
>      7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
>      7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
>      7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
>      7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
>      7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
>      7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
>      7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50
> 
> It turned out to be that CPU does not change SS:RSP and took interrupt
> on userspace stack (BTW init_tss and gdb_page are not corrupted).
> That is completely weird.
> 
> Next step was to trace VMCB before and after VMRUN to understand the
> exact state seen by real CPU.  VMCB was traced when wrong CPU state is
> observed: RIP is to kernel and RSP is from userspace.  The following
> is a diff of VMCB, where
> --- is the state before VMRUN and
> +++ is the state after VMRUN:
> 
>   -      event_inj = 0x80000081,
>   +      event_inj = 0x0,
> 
>          ...
> 
>          cs = {
>   -        selector = 0x33,
>   -        attrib = 0xafb,
>   +        selector = 0x10,
>   +        attrib = 0x29b,
>            limit = 0xffffffff,
>            base = 0x0
>          },
>          ss = {
>            selector = 0x2b,
>            attrib = 0x0,
>            limit = 0xffffffff,
>            base = 0x0
>          },
> 
>          ...
> 
>          cpl = 0x0,              <<<<<< WTF?
> 
>          ...
> 
>   -      rip = 0x7f6cfa817ae1,
>   +      rip = 0xffffffff8152b690,
> 
>          ...
> 
>   -      rsp = 0x7f6cb95567f0,
>   +      rsp = 0x7f6cb9556768,
> 
> 
> The execution scenario is the following:
> 
>    1. Userspace code was suspended.
>    2. Virtio 0x80000081 interrupt was injected and VMRUN called again.
>    3. HW CPU took virtual interrupt, but did not switch the stack because
>       CPL *was* already set to 0.
>    4. KABOOOOOM.
> 
> So the stack switch does not happen because "the processor performs an
> automatic stack switch when a control transfer causes a change in
> privilege levels to occur" (AMD manual), and in this case privilege
> level was not changed.
> 
> The question remains who and why changed CPL to 0?
> 
> Further tracking of VMCB states before and after VMRUN showed, that
> CPL becomes 0 when VMEXIT happens with the following SS segment:
> 
>            ss = {
>              selector = 0x2b,
>              attrib = 0x400,
>              limit = 0xffffffff,
>              base = 0x0
>            },
> 
>            cpl = 0x3
> 
> Then on next VMRUN VMCB looks as the following:
> 
>            ss = {
>              selector = 0x2b,
>              attrib = 0x0,            <<< dropped to 0
>              limit = 0xffffffff,
>              base = 0x0
>            },
> 
>            cpl = 0x0,                 <<< dropped to 0
> 
> Obviously it was changed between VMRUN calls.  The following backtrace
> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
> 
>    CPU: 55 PID: 59531 Comm: kvm
>    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
> 
> SS segment which came from QEMU had the following struct members:
> 
>         SS->base      = 0
>         SS->limit     = ffffffff
>         SS->selector  = 2b
>         SS->type      = 0
>         SS->present   = 0
>         SS->dpl       = 0
>         SS->db        = 0
>         SS->s         = 0
>         SS->l         = 0
>         SS->g         = 0
>         SS->avl       = 0
>         SS->unusable  = 1
> 
> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
> segment attributes:
> 
>     (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>     $1 = 0

Huh?  How is that even possible?  It should not be possible to actually 
run the vCPU with a non-NULL SS that isn't present.  How would you cause 
it to happen?

Unless... is this the sysret_ss_attrs issue?  This is a 3.16.something 
guest, and maybe the sysret_ss_attrs code never got backported.  I bet 
that the user code running when the virtio interrupt hits is in the vDSO.

> 
> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
> just decides to synchronize registers the following happens on QEMU
> side:
> 
>     kvm_cpu_synchronize_state():
>         kvm_arch_get_registers():
>             ...
>             get_seg():
>                if (rhs->unusable) {
>                    lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>                                       <<< all attributes are dropped to 0.
>                 }
>         cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
> 

Looks like the bug is in QEMU, then, right?  Couldn't you just fix this 
code in QEMU by, say, deleting it?  If it's actually needed for some 
reason, then at least sanitize the flags correctly rather than 
corrupting the DPL.  (I'm guessing the real issue here is that migration 
from AMD to Intel fails without this, but migration from AMD to Intel is 
highly dubious regardless.)

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-21  3:31 ` Andy Lutomirski
@ 2017-05-21  7:53   ` Roman Penyaev
  2017-05-21 20:19     ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Penyaev @ 2017-05-21  7:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov, kvm,
	linux-kernel, Borislav Petkov

On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On 05/19/2017 09:14 AM, Roman Penyaev wrote:
>>
>> Hi folks,
>>
>> After experiencing guest double faults (sometimes triple faults) on
>> 3.16 guest kernels with the following common pattern:
>>

[cut]

>>
>> Further tracking of VMCB states before and after VMRUN showed, that
>> CPL becomes 0 when VMEXIT happens with the following SS segment:
>>
>>            ss = {
>>              selector = 0x2b,
>>              attrib = 0x400,
>>              limit = 0xffffffff,
>>              base = 0x0
>>            },
>>
>>            cpl = 0x3
>>
>> Then on next VMRUN VMCB looks as the following:
>>
>>            ss = {
>>              selector = 0x2b,
>>              attrib = 0x0,            <<< dropped to 0
>>              limit = 0xffffffff,
>>              base = 0x0
>>            },
>>
>>            cpl = 0x0,                 <<< dropped to 0
>>
>> Obviously it was changed between VMRUN calls.  The following backtrace
>> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
>>
>>    CPU: 55 PID: 59531 Comm: kvm
>>    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>>    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>>    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>>    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>>    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
>>
>> SS segment which came from QEMU had the following struct members:
>>
>>         SS->base      = 0
>>         SS->limit     = ffffffff
>>         SS->selector  = 2b
>>         SS->type      = 0
>>         SS->present   = 0
>>         SS->dpl       = 0
>>         SS->db        = 0
>>         SS->s         = 0
>>         SS->l         = 0
>>         SS->g         = 0
>>         SS->avl       = 0
>>         SS->unusable  = 1
>>
>> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
>> segment attributes:
>>
>>     (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>>     $1 = 0
>
>
> Huh?  How is that even possible?  It should not be possible to actually run
> the vCPU with a non-NULL SS that isn't present.

That is utterly good question :)  I do not know.  According to my shallow
understanding (P) bit is only a hint for CPU that corresponding segment was
read from gdt and now is cached in private CPU registers (attributes).
Am I right?

At least what I see that it is quite often the case when we exit from VMRUN
with segment not present then VMRUN is resumed and on next vmexit segment has
correct attributes.

> How would you cause it to happen?

We run fio and iperf tests in guests for a couple of days.  Nothing more,
nothing special.  Guests are bare debians with 3.16 kernels.

>
> Unless... is this the sysret_ss_attrs issue?

What is the issue?  This one

https://lkml.org/lkml/2015/4/24/770

??

We tried to disassemble interrupted userspace code to find any visible
pattern - no syscalls around.

> This is a 3.16.something
> guest, and maybe the sysret_ss_attrs code never got backported.
> I bet that the user code running when the virtio interrupt hits is in
> the vDSO.

No, user code is interrupted in random places, e.g. the following is the
print from a driver which intercepts kvm_vcpu_run() and catches wrong
CPL when user code was interrupted:

    @@@@@ [63208] WTF? RIP=83f2a0, RSP=7fffc630e0c0, CPL=0, fix CPL=3
    @@@@@ [63205] WTF? RIP=9ae258, RSP=7ffe49f6a260, CPL=0, fix CPL=3
    @@@@@ [63215] WTF? RIP=dadee1, RSP=7ffc9ee1b508, CPL=0, fix CPL=3
    @@@@@ [63217] WTF? RIP=83f89c, RSP=7ffd6a1287e0, CPL=0, fix CPL=3
    @@@@@ [21521] WTF? RIP=7fa85b68e1e7, RSP=7ffd6d408ab0, CPL=0, fix CPL=3

As you can see sometimes RIP belongs to lower addresses, but vDSO mappings
are higher and start from 7ff.


>>
>> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
>> just decides to synchronize registers the following happens on QEMU
>> side:
>>
>>     kvm_cpu_synchronize_state():
>>         kvm_arch_get_registers():
>>             ...
>>             get_seg():
>>                if (rhs->unusable) {
>>                    lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>>                                       <<< all attributes are dropped to 0.
>>                 }
>>         cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
>>
>
> Looks like the bug is in QEMU, then, right?

KVM SVM restores CPL from unusable selector, obviously this is not nice.

arch/x86/kvm/svm.c:svm_set_segment():

   if (var->unusable)
      s->attrib = 0;
   ...
   if (seg == VCPU_SREG_SS)
       svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;


Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
from KVM) is correct.

So it is not clear what is the proper way to fix that.   What is clear is
that CPL is set to 0 because of this game with registers on both sides.
Now the question is what side to fix or probably both.


> Couldn't you just fix this code
> in QEMU by, say, deleting it?

Certainly, but would be nice to listen to KVM maintainers.  At least the issue
is clear and what is left is a proper one-line fix :)

--
Roman


> If it's actually needed for some reason, then
> at least sanitize the flags correctly rather than corrupting the DPL.  (I'm
> guessing the real issue here is that migration from AMD to Intel fails
> without this, but migration from AMD to Intel is highly dubious regardless.)
>
> --Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-21  7:53   ` Roman Penyaev
@ 2017-05-21 20:19     ` Andy Lutomirski
  2017-05-24 19:19       ` Roman Penyaev
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2017-05-21 20:19 UTC (permalink / raw)
  To: Roman Penyaev, Andy Lutomirski
  Cc: Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov, kvm,
	linux-kernel, Borislav Petkov, Paolo Bonzini

On 05/21/2017 12:53 AM, Roman Penyaev wrote:
> On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> On 05/19/2017 09:14 AM, Roman Penyaev wrote:
>>>
>>> Hi folks,
>>>
>>> After experiencing guest double faults (sometimes triple faults) on
>>> 3.16 guest kernels with the following common pattern:
>>>
> 
> [cut]
> 
>>>
>>> Further tracking of VMCB states before and after VMRUN showed, that
>>> CPL becomes 0 when VMEXIT happens with the following SS segment:
>>>
>>>             ss = {
>>>               selector = 0x2b,
>>>               attrib = 0x400,
>>>               limit = 0xffffffff,
>>>               base = 0x0
>>>             },
>>>
>>>             cpl = 0x3
>>>
>>> Then on next VMRUN VMCB looks as the following:
>>>
>>>             ss = {
>>>               selector = 0x2b,
>>>               attrib = 0x0,            <<< dropped to 0
>>>               limit = 0xffffffff,
>>>               base = 0x0
>>>             },
>>>
>>>             cpl = 0x0,                 <<< dropped to 0
>>>
>>> Obviously it was changed between VMRUN calls.  The following backtrace
>>> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
>>>
>>>     CPU: 55 PID: 59531 Comm: kvm
>>>     [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>>>     [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>>>     [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>>>     [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>>>     [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
>>>
>>> SS segment which came from QEMU had the following struct members:
>>>
>>>          SS->base      = 0
>>>          SS->limit     = ffffffff
>>>          SS->selector  = 2b
>>>          SS->type      = 0
>>>          SS->present   = 0
>>>          SS->dpl       = 0
>>>          SS->db        = 0
>>>          SS->s         = 0
>>>          SS->l         = 0
>>>          SS->g         = 0
>>>          SS->avl       = 0
>>>          SS->unusable  = 1
>>>
>>> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
>>> segment attributes:
>>>
>>>      (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>>>      $1 = 0
>>
>>
>> Huh?  How is that even possible?  It should not be possible to actually run
>> the vCPU with a non-NULL SS that isn't present.
> 
> That is utterly good question :)  I do not know.  According to my shallow
> understanding (P) bit is only a hint for CPU that corresponding segment was
> read from gdt and now is cached in private CPU registers (attributes).
> Am I right?
> 
> At least what I see that it is quite often the case when we exit from VMRUN
> with segment not present then VMRUN is resumed and on next vmexit segment has
> correct attributes.
> 
>> How would you cause it to happen?
> 
> We run fio and iperf tests in guests for a couple of days.  Nothing more,
> nothing special.  Guests are bare debians with 3.16 kernels.
> 
>>
>> Unless... is this the sysret_ss_attrs issue?
> 
> What is the issue?  This one
> 
> https://lkml.org/lkml/2015/4/24/770

Yes.

But I was thinking about it wrong, since this is probably 64-bit 
userspace, not 32-bit userspace.  Here's my theory:

1. User task A does a syscall.  It's not in kernel mode with SS != 0.

2. The scheduler runs and switches to task B.  SS != 0.

2. Kernel enters user mode for task B.

3. User task B gets interrupted.  Kernel ends up running with SS = 0.

4. Kernel switches back to task A.  SS == 0.

5. Kernel does SYSRET.  SS == __USER_DS, but SS's attributes are messed up.

6. QEMU does whatever it does that inspires it to zap SS's attributes.

7. Boom.

If task B were 32-bit, then the vDSO would fix up SS, so there would 
only be a 1-instruction window for problems.

To check this theory, you could try backporting this to the guest and 
seeing if the problem goes away:

commit 61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
Author: Andy Lutomirski <luto@kernel.org>
Date:   Sun Apr 26 16:47:59 2015 -0700

     x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue


>>>
>>> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
>>> just decides to synchronize registers the following happens on QEMU
>>> side:
>>>
>>>      kvm_cpu_synchronize_state():
>>>          kvm_arch_get_registers():
>>>              ...
>>>              get_seg():
>>>                 if (rhs->unusable) {
>>>                     lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>>>                                        <<< all attributes are dropped to 0.
>>>                  }
>>>          cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
>>>
>>
>> Looks like the bug is in QEMU, then, right?
> 
> KVM SVM restores CPL from unusable selector, obviously this is not nice.

I would imagine that QEMU shouldn't be feeding KVM such a selector. 
Also, there's an invariant that SS.DPL == CPL, at least most of the 
time, although this SYSRET issue may be the exception.

Paolo, what's the intended behavior here?  Is the bug in KVM or in QEMU?

> 
> arch/x86/kvm/svm.c:svm_set_segment():
> 
>     if (var->unusable)
>        s->attrib = 0;
>     ...
>     if (seg == VCPU_SREG_SS)
>         svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
> 
> 
> Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
> from KVM) is correct.
> 
> So it is not clear what is the proper way to fix that.   What is clear is
> that CPL is set to 0 because of this game with registers on both sides.
> Now the question is what side to fix or probably both.
> 
> 
>> Couldn't you just fix this code
>> in QEMU by, say, deleting it?
> 
> Certainly, but would be nice to listen to KVM maintainers.  At least the issue
> is clear and what is left is a proper one-line fix :)

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-21 20:19     ` Andy Lutomirski
@ 2017-05-24 19:19       ` Roman Penyaev
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Penyaev @ 2017-05-24 19:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov, kvm,
	linux-kernel, Borislav Petkov, Radim Krčmář

On Sun, May 21, 2017 at 10:19 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>>>
>>> Unless... is this the sysret_ss_attrs issue?
>>
>>
>> What is the issue?  This one
>>
>> https://lkml.org/lkml/2015/4/24/770
>
>
> Yes.
>
> But I was thinking about it wrong, since this is probably 64-bit userspace,

sorry, I forgot to mention that userspace is indeed 64-bit.

> not 32-bit userspace.  Here's my theory:
>
> 1. User task A does a syscall.  It's not in kernel mode with SS != 0.
>
> 2. The scheduler runs and switches to task B.  SS != 0.
>
> 2. Kernel enters user mode for task B.
>
> 3. User task B gets interrupted.  Kernel ends up running with SS = 0.
>
> 4. Kernel switches back to task A.  SS == 0.
>
> 5. Kernel does SYSRET.  SS == __USER_DS, but SS's attributes are messed up.
>
> 6. QEMU does whatever it does that inspires it to zap SS's attributes.
>
> 7. Boom.
>
> If task B were 32-bit, then the vDSO would fix up SS, so there would only be
> a 1-instruction window for problems.
>
> To check this theory, you could try backporting this to the guest and seeing
> if the problem goes away:
>
> commit 61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
> Author: Andy Lutomirski <luto@kernel.org>
> Date:   Sun Apr 26 16:47:59 2015 -0700
>
>     x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue


Yes, that is exactly what is happening.  I 've backported your patch on 3.16.
That explains everything.  Why bug is not reproduced on >= 4.1 guest kernels
and why we fall out from VMRUN with SS.attributes == 0x400, i.e. P bit is
not set (because of "AMD CPUs have a misfeature").


>>> Looks like the bug is in QEMU, then, right?
>>
>>
>> KVM SVM restores CPL from unusable selector, obviously this is not nice.
>
>
> I would imagine that QEMU shouldn't be feeding KVM such a selector. Also,
> there's an invariant that SS.DPL == CPL, at least most of the time, although
> this SYSRET issue may be the exception.
>
> Paolo, what's the intended behavior here?  Is the bug in KVM or in QEMU?

So, along with Andrew's workaround for the kernel, it seems that virtualization
side should be fixed accordingly to workaround AMD behaviour.

Guys, any ping?

--
Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-19 16:14 [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present Roman Penyaev
  2017-05-21  3:31 ` Andy Lutomirski
@ 2017-05-30 14:47 ` Paolo Bonzini
  2017-05-30 17:35   ` Roman Penyaev
  2017-05-30 15:13 ` Paolo Bonzini
  2 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2017-05-30 14:47 UTC (permalink / raw)
  To: Roman Penyaev, Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel



On 19/05/2017 18:14, Roman Penyaev wrote:
> 2. A bit complicated, which makes sure the CPL field is preserved across
>    KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
>    functionality symmethric:

I think I prefer this solution.

>    KVM SVM side:
>    -------------
> 
>    --- a/arch/x86/kvm/svm.c
>    +++ b/arch/x86/kvm/svm.c
>    @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>             * would entail passing the CPL to userspace and back.
>             */
>            if (seg == VCPU_SREG_SS)
>    -               svm->vmcb->save.cpl = (s->attrib >>
> SVM_SELECTOR_DPL_SHIFT) & 3;
>    +               svm->vmcb->save.cpl = (var->dpl & 3);
> 
>            mark_dirty(svm->vmcb, VMCB_SEG);
>    }

I wonder why svm_set_segment is setting s->attrib = 0 at all.  The 
manual only mentions checking P=0.  What about something like:

	s->base = var->base;
	s->limit = var->limit;
	s->selector = var->selector;
	s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
	s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
	s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
	s->attrib |= (var->present && !var->unusable) << SVM_SELECTOR_P_SHIFT;
	s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
	s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
	s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
	s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;

>    QEMU side:
>    ----------
> 
>    --- a/target/i386/kvm.c
>    +++ b/target/i386/kvm.c
>    @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
>         get_seg(&env->segs[R_FS], &sregs.fs);
>         get_seg(&env->segs[R_GS], &sregs.gs);
>         get_seg(&env->segs[R_SS], &sregs.ss);
>    +    if (sregs.ss.unusable)
>    +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;
> 
>         get_seg(&env->tr, &sregs.tr);
>         get_seg(&env->ldt, &sregs.ldt);

I think what QEMU should do is, in get_seg

	if (rhs->unusable) {
	    lhs->flags &= ~DESC_P_MASK;
	} else {
	    ...
	}

This would preserve the SS.DPL field.  This should still work fine with
QEMU commit 4cae9c9 (the loading side would set lhs->unusable).

Thanks,

Paolo

> 
> Current email is an RFC since for us is not fully clear is it really
> needed to preserve DPL across KVM_SET/GET_SREGS calls when segment
> is unusable.  E.g. there was a commit:
> 
> 4cae9c97967a ("target-i386: kvm: clear unusable segments' flags in migration")
> 
> which in purpose drops all segment flags to zero on QEMU side in order
> to fix guests migration.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-19 16:14 [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present Roman Penyaev
  2017-05-21  3:31 ` Andy Lutomirski
  2017-05-30 14:47 ` Paolo Bonzini
@ 2017-05-30 15:13 ` Paolo Bonzini
  2017-05-30 15:58   ` Roman Penyaev
  2 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2017-05-30 15:13 UTC (permalink / raw)
  To: Roman Penyaev, Mikhail Sennikovskii, Gleb Natapov, kvm,
	linux-kernel, Andy Lutomirski



On 19/05/2017 18:14, Roman Penyaev wrote:
> 
> 1. Simple one, KVM SVM side, which makes sure that CPL is not updated
>    if segment is unusable:
> 
>    --- a/arch/x86/kvm/svm.c
>    +++ b/arch/x86/kvm/svm.c
>    @@ -1549,7 +1549,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>             * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
>             * would entail passing the CPL to userspace and back.
>             */
>    -       if (seg == VCPU_SREG_SS)
>    +       if (seg == VCPU_SREG_SS && !var->unusable)
>                    svm->vmcb->save.cpl = (s->attrib >>
> SVM_SELECTOR_DPL_SHIFT) & 3;

Based on the discussion between you and Andy, my understanding is that
it would not be enough to ensure that the attributes are preserved
across a roundtrip through KVM_GET_SEGMENT and KVM_SET_SEGMENT.  We need
a workaround in the hypervisor if we don't want to pass the CPL to
userspace and back.

Maybe if 1) in 64-bit mode 2) SS.P=0 3) SS selector != 0, then the CPL
can be taken from SS.RPL?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 15:13 ` Paolo Bonzini
@ 2017-05-30 15:58   ` Roman Penyaev
  2017-05-30 16:05     ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Penyaev @ 2017-05-30 15:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel, Andy Lutomirski

On Tue, May 30, 2017 at 5:13 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 19/05/2017 18:14, Roman Penyaev wrote:
>>
>> 1. Simple one, KVM SVM side, which makes sure that CPL is not updated
>>    if segment is unusable:
>>
>>    --- a/arch/x86/kvm/svm.c
>>    +++ b/arch/x86/kvm/svm.c
>>    @@ -1549,7 +1549,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>             * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
>>             * would entail passing the CPL to userspace and back.
>>             */
>>    -       if (seg == VCPU_SREG_SS)
>>    +       if (seg == VCPU_SREG_SS && !var->unusable)
>>                    svm->vmcb->save.cpl = (s->attrib >>
>> SVM_SELECTOR_DPL_SHIFT) & 3;
>
> Based on the discussion between you and Andy, my understanding is that
> it would not be enough to ensure that the attributes are preserved
> across a roundtrip through KVM_GET_SEGMENT and KVM_SET_SEGMENT.  We need
> a workaround in the hypervisor if we don't want to pass the CPL to
> userspace and back.

We just need to decide where to store CPL on the way to userspace
and back and unconditionally follow that convention, regardless what
we have in unusable or present flags.

> Maybe if 1) in 64-bit mode 2) SS.P=0 3) SS selector != 0, then the CPL
> can be taken from SS.RPL?

Huh, I just want to show the history of changes of CPL value:

original, CPL is taken from CS.DPL:
-----------------------------------
commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
Author: Avi Kivity <avi@qumranet.com>
Date:   Sun Dec 10 02:21:36 2006 -0800

+       if (seg == VCPU_SREG_CS)
+               vcpu->svm->vmcb->save.cpl
+                       = (vcpu->svm->vmcb->save.cs.attrib
+                          >> SVM_SELECTOR_DPL_SHIFT) & 3;


then use RPL rather than DPL (not so much in description):
-----------------------------------
commit ea5e97e8bf1d56a4d9461c39e082b9c31a7be4ff
Author: Kevin Wolf <kwolf@redhat.com>
Date:   Wed Feb 8 14:34:40 2012 +0100


+   svm->vmcb->save.cpl = svm->vmcb->save.cs.selector & 0x3;


then get CPL from SS.DPL:
-----------------------------------
commit ae9fedc793c4d98aa9bb298585b2b9246096ce65
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed May 14 09:39:49 2014 +0200

+   svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;


Indeed, what is left is eventually take it from SS.RPL. J.
But jokes aside,  with your last patch you seems fixed a race problem
when "CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS".  You also touched svm_get_segment()
which does the following:

case VCPU_SREG_SS:
     ...
     var->dpl = to_svm(vcpu)->vmcb->save.cpl;

So even CPU returned the following state:

     ss = {
         selector = 0x2b,
         attrib = 0x400,
         limit = 0xffffffff,
         base = 0x0
     },

     cpl = 0x3

We will have CPL in var->dpl, and it seems ok.  All we need is not
to lose it on the way kernel->userspace->kernel.

--
Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 15:58   ` Roman Penyaev
@ 2017-05-30 16:05     ` Paolo Bonzini
  2017-05-30 16:31       ` Gi-Oh Kim
  2017-06-15 21:44       ` Andy Lutomirski
  0 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2017-05-30 16:05 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel, Andy Lutomirski



On 30/05/2017 17:58, Roman Penyaev wrote:
> Indeed, what is left is eventually take it from SS.RPL. J.

Ahah! :)  But I only suggested that in specific cases.

> But jokes aside,  with your last patch you seems fixed a race problem
> when "CS.RPL is not equal to the CPL in the few instructions between
> setting CR0.PE and reloading CS".

Yes, exactly.  The symptom was a crash (triple fault) when you kept
interrupting with "info cpus" a guest that repeatedly went to protected
mode and back to real mode.

> We will have CPL in var->dpl, and it seems ok.  All we need is not
> to lose it on the way kernel->userspace->kernel.

You're right.  So what do you think of the other suggestion (svm.c
doesn't clear attributes for unusable registers, QEMU only clears P for
unusable registers)?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 16:05     ` Paolo Bonzini
@ 2017-05-30 16:31       ` Gi-Oh Kim
  2017-06-15 21:44       ` Andy Lutomirski
  1 sibling, 0 replies; 17+ messages in thread
From: Gi-Oh Kim @ 2017-05-30 16:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Roman Penyaev, Mikhail Sennikovskii, Gleb Natapov, kvm,
	linux-kernel, Andy Lutomirski

Hi,

I found a code to set unusable flag of segment register incorrectly.
https://lkml.org/lkml/2017/5/30/459
I guess above patch and current discussion could be related.

I guess following sequence could happen.
1. svm_get_segment() sets var->unusable of Stack Segment incorrectly
2. svm_set_segment() clears both of s->attrib and svm->vmcb->save.cpl.

Is it possible scenario?


On Tue, May 30, 2017 at 6:05 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 30/05/2017 17:58, Roman Penyaev wrote:
>> Indeed, what is left is eventually take it from SS.RPL. J.
>
> Ahah! :)  But I only suggested that in specific cases.
>
>> But jokes aside,  with your last patch you seems fixed a race problem
>> when "CS.RPL is not equal to the CPL in the few instructions between
>> setting CR0.PE and reloading CS".
>
> Yes, exactly.  The symptom was a crash (triple fault) when you kept
> interrupting with "info cpus" a guest that repeatedly went to protected
> mode and back to real mode.
>
>> We will have CPL in var->dpl, and it seems ok.  All we need is not
>> to lose it on the way kernel->userspace->kernel.
>
> You're right.  So what do you think of the other suggestion (svm.c
> doesn't clear attributes for unusable registers, QEMU only clears P for
> unusable registers)?
>
> Thanks,
>
> Paolo



-- 
Best regards,
Gi-Oh Kim
TEL: 0176 2697 8962

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 14:47 ` Paolo Bonzini
@ 2017-05-30 17:35   ` Roman Penyaev
  2017-05-30 21:09     ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Penyaev @ 2017-05-30 17:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel

On Tue, May 30, 2017 at 4:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 19/05/2017 18:14, Roman Penyaev wrote:
>> 2. A bit complicated, which makes sure the CPL field is preserved across
>>    KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
>>    functionality symmethric:
>
> I think I prefer this solution.
>
>>    KVM SVM side:
>>    -------------
>>
>>    --- a/arch/x86/kvm/svm.c
>>    +++ b/arch/x86/kvm/svm.c
>>    @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>             * would entail passing the CPL to userspace and back.
>>             */
>>            if (seg == VCPU_SREG_SS)
>>    -               svm->vmcb->save.cpl = (s->attrib >>
>> SVM_SELECTOR_DPL_SHIFT) & 3;
>>    +               svm->vmcb->save.cpl = (var->dpl & 3);
>>
>>            mark_dirty(svm->vmcb, VMCB_SEG);
>>    }
>
> I wonder why svm_set_segment is setting s->attrib = 0 at all.  The
> manual only mentions checking P=0.  What about something like:
>
>         s->base = var->base;
>         s->limit = var->limit;
>         s->selector = var->selector;
>         s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
>         s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
>         s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
>         s->attrib |= (var->present && !var->unusable) << SVM_SELECTOR_P_SHIFT;
>         s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
>         s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
>         s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
>         s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;

Do we care about compatibility issues?  I mean can any old qemu send
us "garbage" in other members of 'var' structure if 'var->unused' == 1 ?

Oh, it seems we require one more field in 'struct kvm_segment' for CPL.

>>    QEMU side:
>>    ----------
>>
>>    --- a/target/i386/kvm.c
>>    +++ b/target/i386/kvm.c
>>    @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
>>         get_seg(&env->segs[R_FS], &sregs.fs);
>>         get_seg(&env->segs[R_GS], &sregs.gs);
>>         get_seg(&env->segs[R_SS], &sregs.ss);
>>    +    if (sregs.ss.unusable)
>>    +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;
>>
>>         get_seg(&env->tr, &sregs.tr);
>>         get_seg(&env->ldt, &sregs.ldt);
>
> I think what QEMU should do is, in get_seg
>
>         if (rhs->unusable) {
>             lhs->flags &= ~DESC_P_MASK;
>
> This would preserve the SS.DPL field.  This should still work fine with
> QEMU commit 4cae9c9 (the loading side would set lhs->unusable).

Indeed, it will preserve the *old* SS.DPL field, but will not take the *new*
one from kvm side.  And what if extend get_seg() with additional 'segtype'
argument:

static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs,
int segtype)
{
...
         if (rhs->unusable) {
             /* Clear P and DPL bits */
             lhs->flags &= ~(DESC_P_MASK | (3 << DESC_DPL_SHIFT));
             if (segtype == R_SS)
                /* Set DPL */
                lhs->flags |= rhs->dpl << DESC_DPL_SHIFT;
         }

Then we always keep convention and keep dpl alive along the way U->K->U to
restore it as cpl.

--
Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 17:35   ` Roman Penyaev
@ 2017-05-30 21:09     ` Paolo Bonzini
  2017-05-31 10:17       ` Roman Penyaev
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2017-05-30 21:09 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel



On 30/05/2017 19:35, Roman Penyaev wrote:
> On Tue, May 30, 2017 at 4:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 19/05/2017 18:14, Roman Penyaev wrote:
>>> 2. A bit complicated, which makes sure the CPL field is preserved across
>>>    KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
>>>    functionality symmethric:
>>
>> I think I prefer this solution.
>>
>>>    KVM SVM side:
>>>    -------------
>>>
>>>    --- a/arch/x86/kvm/svm.c
>>>    +++ b/arch/x86/kvm/svm.c
>>>    @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>>             * would entail passing the CPL to userspace and back.
>>>             */
>>>            if (seg == VCPU_SREG_SS)
>>>    -               svm->vmcb->save.cpl = (s->attrib >>
>>> SVM_SELECTOR_DPL_SHIFT) & 3;
>>>    +               svm->vmcb->save.cpl = (var->dpl & 3);
>>>
>>>            mark_dirty(svm->vmcb, VMCB_SEG);
>>>    }
>>
>> I wonder why svm_set_segment is setting s->attrib = 0 at all.  The
>> manual only mentions checking P=0.  What about something like:
>>
>>         s->base = var->base;
>>         s->limit = var->limit;
>>         s->selector = var->selector;
>>         s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
>>         s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
>>         s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
>>         s->attrib |= (var->present && !var->unusable) << SVM_SELECTOR_P_SHIFT;
>>         s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
>>         s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
>>         s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
>>         s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
> 
> Do we care about compatibility issues?  I mean can any old qemu send
> us "garbage" in other members of 'var' structure if 'var->unused' == 1 ?

That shouldn't matter, the processor shouldn't use them if P=0.

> Oh, it seems we require one more field in 'struct kvm_segment' for CPL.

Why?  The point is exactly to use SS's var->dpl.

>>>    QEMU side:
>>>    ----------
>>>
>>>    --- a/target/i386/kvm.c
>>>    +++ b/target/i386/kvm.c
>>>    @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
>>>         get_seg(&env->segs[R_FS], &sregs.fs);
>>>         get_seg(&env->segs[R_GS], &sregs.gs);
>>>         get_seg(&env->segs[R_SS], &sregs.ss);
>>>    +    if (sregs.ss.unusable)
>>>    +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;
>>>
>>>         get_seg(&env->tr, &sregs.tr);
>>>         get_seg(&env->ldt, &sregs.ldt);
>>
>> I think what QEMU should do is, in get_seg
>>
>>         if (rhs->unusable) {
>>             lhs->flags &= ~DESC_P_MASK;
>>
>> This would preserve the SS.DPL field.  This should still work fine with
>> QEMU commit 4cae9c9 (the loading side would set lhs->unusable).
> 
> Indeed, it will preserve the *old* SS.DPL field, but will not take the *new*
> one from kvm side.  And what if extend get_seg() with additional 'segtype'
> argument:

Or:

    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
                 (rhs->present * DESC_P_MASK) |
                 (rhs->dpl << DESC_DPL_SHIFT) |
                 (rhs->db << DESC_B_SHIFT) |
                 (rhs->s * DESC_S_MASK) |
                 (rhs->l << DESC_L_SHIFT) |
                 (rhs->g * DESC_G_MASK) |
                 (rhs->avl * DESC_AVL_MASK);
    if (rhs->unusable) {
        lhs->flags = 0;
    }

which could also be simply

   lhs->flags = ... |
                ((rhs->present && !rhs->unusable) * DESC_P_MASK) | ...;

as in the KVM code.
`
> Then we always keep convention and keep dpl alive along the way U->K->U to
> restore it as cpl.

Yes, exactly.

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 21:09     ` Paolo Bonzini
@ 2017-05-31 10:17       ` Roman Penyaev
  2017-05-31 10:50         ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Penyaev @ 2017-05-31 10:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel

On Tue, May 30, 2017 at 11:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 30/05/2017 19:35, Roman Penyaev wrote:
>> On Tue, May 30, 2017 at 4:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 19/05/2017 18:14, Roman Penyaev wrote:
>>>> 2. A bit complicated, which makes sure the CPL field is preserved across
>>>>    KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
>>>>    functionality symmethric:
>>>
>>> I think I prefer this solution.
>>>
>>>>    KVM SVM side:
>>>>    -------------
>>>>
>>>>    --- a/arch/x86/kvm/svm.c
>>>>    +++ b/arch/x86/kvm/svm.c
>>>>    @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>>>             * would entail passing the CPL to userspace and back.
>>>>             */
>>>>            if (seg == VCPU_SREG_SS)
>>>>    -               svm->vmcb->save.cpl = (s->attrib >>
>>>> SVM_SELECTOR_DPL_SHIFT) & 3;
>>>>    +               svm->vmcb->save.cpl = (var->dpl & 3);
>>>>
>>>>            mark_dirty(svm->vmcb, VMCB_SEG);
>>>>    }
>>>
>>> I wonder why svm_set_segment is setting s->attrib = 0 at all.  The
>>> manual only mentions checking P=0.  What about something like:
>>>
>>>         s->base = var->base;
>>>         s->limit = var->limit;
>>>         s->selector = var->selector;
>>>         s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
>>>         s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
>>>         s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
>>>         s->attrib |= (var->present && !var->unusable) << SVM_SELECTOR_P_SHIFT;
>>>         s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
>>>         s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
>>>         s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
>>>         s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
>>
>> Do we care about compatibility issues?  I mean can any old qemu send
>> us "garbage" in other members of 'var' structure if 'var->unused' == 1 ?
>
> That shouldn't matter, the processor shouldn't use them if P=0.

Could you please point me where did you find that?  E.g. what I see in
AMD manual 24593—Rev. 3.28—March 2017, section "Segment State in the VMCB",
top of the page 453:

  NOTE: For the Stack Segment attributes, P is observed in legacy and
        compatibility mode. In 64-bit mode, P is ignored because all
        stack segments are treated as present.

So I am confused.

>> Oh, it seems we require one more field in 'struct kvm_segment' for CPL.
>
> Why?  The point is exactly to use SS's var->dpl.

Yes, yes, let's use dpl as it is used now on svm_get_segment().

>
>>>>    QEMU side:
>>>>    ----------
>>>>
>>>>    --- a/target/i386/kvm.c
>>>>    +++ b/target/i386/kvm.c
>>>>    @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
>>>>         get_seg(&env->segs[R_FS], &sregs.fs);
>>>>         get_seg(&env->segs[R_GS], &sregs.gs);
>>>>         get_seg(&env->segs[R_SS], &sregs.ss);
>>>>    +    if (sregs.ss.unusable)
>>>>    +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;
>>>>
>>>>         get_seg(&env->tr, &sregs.tr);
>>>>         get_seg(&env->ldt, &sregs.ldt);
>>>
>>> I think what QEMU should do is, in get_seg
>>>
>>>         if (rhs->unusable) {
>>>             lhs->flags &= ~DESC_P_MASK;
>>>
>>> This would preserve the SS.DPL field.  This should still work fine with
>>> QEMU commit 4cae9c9 (the loading side would set lhs->unusable).
>>
>> Indeed, it will preserve the *old* SS.DPL field, but will not take the *new*
>> one from kvm side.  And what if extend get_seg() with additional 'segtype'
>> argument:
>
> Or:
>
>     lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
>                  (rhs->present * DESC_P_MASK) |
>                  (rhs->dpl << DESC_DPL_SHIFT) |
>                  (rhs->db << DESC_B_SHIFT) |
>                  (rhs->s * DESC_S_MASK) |
>                  (rhs->l << DESC_L_SHIFT) |
>                  (rhs->g * DESC_G_MASK) |
>                  (rhs->avl * DESC_AVL_MASK);
>     if (rhs->unusable) {
>         lhs->flags = 0;
>     }
>
> which could also be simply
>
>    lhs->flags = ... |
>                 ((rhs->present && !rhs->unusable) * DESC_P_MASK) | ...;
>
> as in the KVM code.

True.  Fully symmetric.  So something like that:

Kernel:
-------
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d09bc3e7882c..ecb76d9bf0cb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1466,6 +1466,7 @@ static void svm_get_segment(struct kvm_vcpu *vcpu,
                 */
                if (var->unusable)
                        var->db = 0;
+               /* This is symmetric with svm_set_segment() */
                var->dpl = to_svm(vcpu)->vmcb->save.cpl;
                break;
        }
@@ -1610,18 +1611,14 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
        s->base = var->base;
        s->limit = var->limit;
        s->selector = var->selector;
-       if (var->unusable)
-               s->attrib = 0;
-       else {
-               s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
-               s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
-               s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
-               s->attrib |= (var->present & 1) << SVM_SELECTOR_P_SHIFT;
-               s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
-               s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
-               s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
-               s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
-       }
+       s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
+       s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
+       s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
+       s->attrib |= ((var->present & 1) && !var->unusable) <<
SVM_SELECTOR_P_SHIFT;
+       s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
+       s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
+       s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
+       s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;

        /*
         * This is always accurate, except if SYSRET returned to a segment
@@ -1630,7 +1627,8 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
         * would entail passing the CPL to userspace and back.
         */
        if (seg == VCPU_SREG_SS)
-               svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
+               /* This is symmetric with svm_get_segment() */
+               svm->vmcb->save.cpl = (var->dpl & 3);

        mark_dirty(svm->vmcb, VMCB_SEG);
 }


QEMU:
-----
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 011d4a55b136..faee904d9d59 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -1300,18 +1300,14 @@ static void get_seg(SegmentCache *lhs, const
struct kvm_segment *rhs)
     lhs->selector = rhs->selector;
     lhs->base = rhs->base;
     lhs->limit = rhs->limit;
-    if (rhs->unusable) {
-        lhs->flags = 0;
-    } else {
-        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
-                     (rhs->present * DESC_P_MASK) |
-                     (rhs->dpl << DESC_DPL_SHIFT) |
-                     (rhs->db << DESC_B_SHIFT) |
-                     (rhs->s * DESC_S_MASK) |
-                     (rhs->l << DESC_L_SHIFT) |
-                     (rhs->g * DESC_G_MASK) |
-                     (rhs->avl * DESC_AVL_MASK);
-    }
+    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
+                 ((rhs->present && !rhs->unusable) * DESC_P_MASK) |
+                 (rhs->dpl << DESC_DPL_SHIFT) |
+                 (rhs->db << DESC_B_SHIFT) |
+                 (rhs->s * DESC_S_MASK) |
+                 (rhs->l << DESC_L_SHIFT) |
+                 (rhs->g * DESC_G_MASK) |
+                 (rhs->avl * DESC_AVL_MASK);
 }

--
Roman

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-31 10:17       ` Roman Penyaev
@ 2017-05-31 10:50         ` Paolo Bonzini
  0 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2017-05-31 10:50 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: Mikhail Sennikovskii, Gleb Natapov, kvm, linux-kernel



----- Original Message -----
> From: "Roman Penyaev" <roman.penyaev@profitbricks.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Mikhail Sennikovskii" <mikhail.sennikovskii@profitbricks.com>, "Gleb Natapov" <gleb@kernel.org>,
> kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> Sent: Wednesday, May 31, 2017 12:17:01 PM
> Subject: Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
> 
> On Tue, May 30, 2017 at 11:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 30/05/2017 19:35, Roman Penyaev wrote:
> >> On Tue, May 30, 2017 at 4:47 PM, Paolo Bonzini <pbonzini@redhat.com>
> >> wrote:
> >>>
> >>>
> >>> On 19/05/2017 18:14, Roman Penyaev wrote:
> >>>> 2. A bit complicated, which makes sure the CPL field is preserved across
> >>>>    KVM_GET/SET_SREGS calls and makes svm_set_segment() and
> >>>>    svm_get_segment()
> >>>>    functionality symmethric:
> >>>
> >>> I think I prefer this solution.
> >>>
> >>>>    KVM SVM side:
> >>>>    -------------
> >>>>
> >>>>    --- a/arch/x86/kvm/svm.c
> >>>>    +++ b/arch/x86/kvm/svm.c
> >>>>    @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu
> >>>>    *vcpu,
> >>>>             * would entail passing the CPL to userspace and back.
> >>>>             */
> >>>>            if (seg == VCPU_SREG_SS)
> >>>>    -               svm->vmcb->save.cpl = (s->attrib >>
> >>>> SVM_SELECTOR_DPL_SHIFT) & 3;
> >>>>    +               svm->vmcb->save.cpl = (var->dpl & 3);
> >>>>
> >>>>            mark_dirty(svm->vmcb, VMCB_SEG);
> >>>>    }
> >>>
> >>> I wonder why svm_set_segment is setting s->attrib = 0 at all.  The
> >>> manual only mentions checking P=0.  What about something like:
> >>>
> >>>         s->base = var->base;
> >>>         s->limit = var->limit;
> >>>         s->selector = var->selector;
> >>>         s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
> >>>         s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
> >>>         s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
> >>>         s->attrib |= (var->present && !var->unusable) <<
> >>>         SVM_SELECTOR_P_SHIFT;
> >>>         s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
> >>>         s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
> >>>         s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
> >>>         s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
> >>
> >> Do we care about compatibility issues?  I mean can any old qemu send
> >> us "garbage" in other members of 'var' structure if 'var->unused' == 1 ?
> >
> > That shouldn't matter, the processor shouldn't use them if P=0.
> 
> Could you please point me where did you find that?  E.g. what I see in
> AMD manual 24593—Rev. 3.28—March 2017, section "Segment State in the VMCB",
> top of the page 453:
> 
>   NOTE: For the Stack Segment attributes, P is observed in legacy and
>         compatibility mode. In 64-bit mode, P is ignored because all
>         stack segments are treated as present.

You're right and in fact the same applies to unusable=1 on Intel.  But
on the other hand, if the garbage got there somehow (e.g. via SMM) it's
the right thing to use it.

> True.  Fully symmetric.  So something like that:
> 
> Kernel:
> -------
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d09bc3e7882c..ecb76d9bf0cb 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1466,6 +1466,7 @@ static void svm_get_segment(struct kvm_vcpu *vcpu,
>                  */
>                 if (var->unusable)
>                         var->db = 0;
> +               /* This is symmetric with svm_set_segment() */
>                 var->dpl = to_svm(vcpu)->vmcb->save.cpl;
>                 break;
>         }
> @@ -1610,18 +1611,14 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>         s->base = var->base;
>         s->limit = var->limit;
>         s->selector = var->selector;
> -       if (var->unusable)
> -               s->attrib = 0;
> -       else {
> -               s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
> -               s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
> -               s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
> -               s->attrib |= (var->present & 1) << SVM_SELECTOR_P_SHIFT;
> -               s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
> -               s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
> -               s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
> -               s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
> -       }
> +       s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
> +       s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
> +       s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
> +       s->attrib |= ((var->present & 1) && !var->unusable) <<
> SVM_SELECTOR_P_SHIFT;
> +       s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
> +       s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
> +       s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
> +       s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
> 
>         /*
>          * This is always accurate, except if SYSRET returned to a segment
> @@ -1630,7 +1627,8 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>          * would entail passing the CPL to userspace and back.
>          */
>         if (seg == VCPU_SREG_SS)
> -               svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) &
> 3;
> +               /* This is symmetric with svm_get_segment() */
> +               svm->vmcb->save.cpl = (var->dpl & 3);
> 
>         mark_dirty(svm->vmcb, VMCB_SEG);
>  }
> 
> 
> QEMU:
> -----
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index 011d4a55b136..faee904d9d59 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -1300,18 +1300,14 @@ static void get_seg(SegmentCache *lhs, const
> struct kvm_segment *rhs)
>      lhs->selector = rhs->selector;
>      lhs->base = rhs->base;
>      lhs->limit = rhs->limit;
> -    if (rhs->unusable) {
> -        lhs->flags = 0;
> -    } else {
> -        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
> -                     (rhs->present * DESC_P_MASK) |
> -                     (rhs->dpl << DESC_DPL_SHIFT) |
> -                     (rhs->db << DESC_B_SHIFT) |
> -                     (rhs->s * DESC_S_MASK) |
> -                     (rhs->l << DESC_L_SHIFT) |
> -                     (rhs->g * DESC_G_MASK) |
> -                     (rhs->avl * DESC_AVL_MASK);
> -    }
> +    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
> +                 ((rhs->present && !rhs->unusable) * DESC_P_MASK) |
> +                 (rhs->dpl << DESC_DPL_SHIFT) |
> +                 (rhs->db << DESC_B_SHIFT) |
> +                 (rhs->s * DESC_S_MASK) |
> +                 (rhs->l << DESC_L_SHIFT) |
> +                 (rhs->g * DESC_G_MASK) |
> +                 (rhs->avl * DESC_AVL_MASK);
>  }


Yes, I think both are the right thing to do.

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-05-30 16:05     ` Paolo Bonzini
  2017-05-30 16:31       ` Gi-Oh Kim
@ 2017-06-15 21:44       ` Andy Lutomirski
  2017-06-16  8:44         ` Roman Penyaev
  2017-06-16 16:40         ` Paolo Bonzini
  1 sibling, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2017-06-15 21:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Roman Penyaev, Mikhail Sennikovskii, Gleb Natapov, kvm list,
	linux-kernel

On Tue, May 30, 2017 at 9:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 30/05/2017 17:58, Roman Penyaev wrote:
>> Indeed, what is left is eventually take it from SS.RPL. J.
>
> Ahah! :)  But I only suggested that in specific cases.
>
>> But jokes aside,  with your last patch you seems fixed a race problem
>> when "CS.RPL is not equal to the CPL in the few instructions between
>> setting CR0.PE and reloading CS".
>
> Yes, exactly.  The symptom was a crash (triple fault) when you kept
> interrupting with "info cpus" a guest that repeatedly went to protected
> mode and back to real mode.
>
>> We will have CPL in var->dpl, and it seems ok.  All we need is not
>> to lose it on the way kernel->userspace->kernel.
>
> You're right.  So what do you think of the other suggestion (svm.c
> doesn't clear attributes for unusable registers, QEMU only clears P for
> unusable registers)?

AMD CPUs really allow setting RPL in MSR_*STAR to something other than
3 and then blindly copy the result to SS.DPL when SYSRET happens?
Ugh!

I wonder if we can sweep that particular problem under the rug by
saying that, as a KVM guest, you can't program STAR.RPL != 3?  Or
would that require us to set an intercept that we don't want to set?

Alternatively, is there ever a case where CPL == 3, SS.DPL != 3 and
non-root code can observe the fact that SS.DPL != 3?  If not, maybe
KVM could just change SS.DPL to 3 whenever it reads out SS if CPL ==
3.  Then CPL really could live in the SS state even on SVM.  In other
words, if a weird guest forces SS.RPL ! = 3 by programming garbage
into *STAR and doing SYSRET, could that guest tell the difference if
we non-deterministically changed SS.DPL back to 3 out from under it?
Or is there some nasty case in which SS.DPL == 0, CPL == 3, SS is
valid and you're in compat mode, and you expect stack access to fail
because SS.DPL < CPL?

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-06-15 21:44       ` Andy Lutomirski
@ 2017-06-16  8:44         ` Roman Penyaev
  2017-06-16 16:40         ` Paolo Bonzini
  1 sibling, 0 replies; 17+ messages in thread
From: Roman Penyaev @ 2017-06-16  8:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Mikhail Sennikovskii, Gleb Natapov, kvm list,
	linux-kernel

On Thu, Jun 15, 2017 at 11:44 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Tue, May 30, 2017 at 9:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 30/05/2017 17:58, Roman Penyaev wrote:
>>> Indeed, what is left is eventually take it from SS.RPL. J.
>>
>> Ahah! :)  But I only suggested that in specific cases.
>>
>>> But jokes aside,  with your last patch you seems fixed a race problem
>>> when "CS.RPL is not equal to the CPL in the few instructions between
>>> setting CR0.PE and reloading CS".
>>
>> Yes, exactly.  The symptom was a crash (triple fault) when you kept
>> interrupting with "info cpus" a guest that repeatedly went to protected
>> mode and back to real mode.
>>
>>> We will have CPL in var->dpl, and it seems ok.  All we need is not
>>> to lose it on the way kernel->userspace->kernel.
>>
>> You're right.  So what do you think of the other suggestion (svm.c
>> doesn't clear attributes for unusable registers, QEMU only clears P for
>> unusable registers)?
>
> AMD CPUs really allow setting RPL in MSR_*STAR to something other than
> 3 and then blindly copy the result to SS.DPL when SYSRET happens?
> Ugh!

Hm, MSR_*START ? I don't know.  The original problem was that AMD CPU
allows you to set any CPL in VMCB.CPL, which can be totally different
from CS.RPL.  And VMCB.CPL was changed from QEMU userspace side.

> I wonder if we can sweep that particular problem under the rug by
> saying that, as a KVM guest, you can't program STAR.RPL != 3?  Or
> would that require us to set an intercept that we don't want to set?

Couple of weeks ago I sent modified patches to kvm/svm and to QEMU
as well.  Sorry, I forgot you to add to CC, so here are the links:

https://patchwork.kernel.org/patch/9758889/
https://www.mail-archive.com/qemu-devel@nongnu.org/msg454368.html

(who knows why it is so difficult to find cached patch in those
 mail archives? always different resources)

> Alternatively, is there ever a case where CPL == 3, SS.DPL != 3 and
> non-root code can observe the fact that SS.DPL != 3?  If not, maybe
> KVM could just change SS.DPL to 3 whenever it reads out SS if CPL ==
> 3.  Then CPL really could live in the SS state even on SVM.  In other
> words, if a weird guest forces SS.RPL ! = 3 by programming garbage
> into *STAR and doing SYSRET, could that guest tell the difference if
> we non-deterministically changed SS.DPL back to 3 out from under it?
> Or is there some nasty case in which SS.DPL == 0, CPL == 3, SS is
> valid and you're in compat mode, and you expect stack access to fail
> because SS.DPL < CPL?

The idea of those patches above is simple: CPL is fetched from VMCB.CPL
and then the value is stored in SS.DPL field, then is sent to userspace
side.  Userspace side (QEMU) does not touch SS.DPL even if segment is
not present.  So in few words: do not touch, corrupt, spoil CPL on the
way kernel->userspace->kernel.  That is symmetric and guarantees us
that VMCB.CPL will be correctly restored from SS.DPL.

--
Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
  2017-06-15 21:44       ` Andy Lutomirski
  2017-06-16  8:44         ` Roman Penyaev
@ 2017-06-16 16:40         ` Paolo Bonzini
  1 sibling, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2017-06-16 16:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Roman Penyaev, Mikhail Sennikovskii, Gleb Natapov, kvm list,
	linux-kernel



On 15/06/2017 23:44, Andy Lutomirski wrote:
> On Tue, May 30, 2017 at 9:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 30/05/2017 17:58, Roman Penyaev wrote:
>>> We will have CPL in var->dpl, and it seems ok.  All we need is not
>>> to lose it on the way kernel->userspace->kernel.
>>
>> You're right.  So what do you think of the other suggestion (svm.c
>> doesn't clear attributes for unusable registers, QEMU only clears P for
>> unusable registers)?
> 
> AMD CPUs really allow setting RPL in MSR_*STAR to something other than
> 3 and then blindly copy the result to SS.DPL when SYSRET happens?
> Ugh!

For AMD, "a data-segment-descriptor DPL field is ignored in 64-bit mode"
(4.8.2).  This is unlike Intel, where SS.DPL is the CPL.

After SYSRET, CPL is always 3, even if CS.RPL != 3.

> Alternatively, is there ever a case where CPL == 3, SS.DPL != 3 and
> non-root code can observe the fact that SS.DPL != 3?  If not, maybe
> KVM could just change SS.DPL to 3 whenever it reads out SS if CPL ==
> 3.  Then CPL really could live in the SS state even on SVM.

Currently that's almost what happens, except the "migration" of the CPL
field into SS.DPL only happens when going through QEMU.

> In other
> words, if a weird guest forces SS.RPL ! = 3 by programming garbage
> into *STAR and doing SYSRET, could that guest tell the difference if
> we non-deterministically changed SS.DPL back to 3 out from under it?
> Or is there some nasty case in which SS.DPL == 0, CPL == 3, SS is
> valid and you're in compat mode, and you expect stack access to fail
> because SS.DPL < CPL?

No, any case where STAR is programmed with RPL != 3 is garbage.

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-06-16 16:40 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-19 16:14 [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present Roman Penyaev
2017-05-21  3:31 ` Andy Lutomirski
2017-05-21  7:53   ` Roman Penyaev
2017-05-21 20:19     ` Andy Lutomirski
2017-05-24 19:19       ` Roman Penyaev
2017-05-30 14:47 ` Paolo Bonzini
2017-05-30 17:35   ` Roman Penyaev
2017-05-30 21:09     ` Paolo Bonzini
2017-05-31 10:17       ` Roman Penyaev
2017-05-31 10:50         ` Paolo Bonzini
2017-05-30 15:13 ` Paolo Bonzini
2017-05-30 15:58   ` Roman Penyaev
2017-05-30 16:05     ` Paolo Bonzini
2017-05-30 16:31       ` Gi-Oh Kim
2017-06-15 21:44       ` Andy Lutomirski
2017-06-16  8:44         ` Roman Penyaev
2017-06-16 16:40         ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.