[RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

* [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
@ 2017-05-19 16:14 Roman Penyaev
  2017-05-21  3:31 ` Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Roman Penyaev @ 2017-05-19 16:14 UTC (permalink / raw)
  To: Roman Pen, Mikhail Sennikovskii, Paolo Bonzini, Gleb Natapov,
	kvm, linux-kernel

Hi folks,

After experiencing guest double faults (sometimes triple faults) on
3.16 guest kernels with the following common pattern:

[459395.776124] PANIC: double fault, error_code: 0x0
[459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti:
ffff88003cbc2000
[459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>]
__do_page_fault+0x1f/0x540
[459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
[459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX:
ffffffff81539927
[459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI:
00007ffe0bc9c0a8
[459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09:
0000000000063f1b
[459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12:
0000000000000028
[459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15:
00007ffe0bc9c7c0
[459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000)
knlGS:0000000000000000
[459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4:
00000000000407f0
[459395.776658] Stack:
[459395.776661]  0000000000000000 0000000000000000 0000000000000000
0000000000000000
[459395.776666]  0000000000000000 0000000000000000 0000000000000000
0000000000000000
[459395.776670]  0000000000000000 0000000000000000 0000000000000000
[459395.776674] Call Trace:
[459395.776676]  <UNK>
[459395.776678] Code:
[459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41
57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00
00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
[459395.776716] Kernel panic - not syncing: Machine halted.
[459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9
ffffffff8170203e
[459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008
ffff88003f402f28
[459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008
0000000000000046
[459395.780061] Call Trace:
[459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
[459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
[459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
[459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
[459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
[459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
[459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540

we found out that all kernel backtraces have userspace RSP, where
userspace memory has normal timer, page fault or virtio interrupts
trail:

(the following RSP pointer does not belong to this particular crash
 above, but it does not matter, symptoms are always the same)

crash> rd -s 00007f6cb9556768 100
    7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
    7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
    7f6cb9556788:  00007f6cf983399a 0000000000000000
    7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
    7f6cb95567a8:  00007f6cfaa21270 0000000000000000
    7f6cb95567b8:  00007f6ca4031880
                                    ffffffffffffff7e  IRQ,
~0xffffffffffffff7e = 0x81
    7f6cb95567c8:  00007f6cfa817ae1 0000000000000033  RIP; CS
    7f6cb95567d8:  0000000000000202 00007f6cb95567f0  EFLAGS; RSP
    7f6cb95567e8:  000000000000002b                   SS
                                    00007f6cb00318e0
    7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
    7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
    7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
    7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
    7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
    7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
    7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
    7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
    7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
    7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50

It turned out to be that CPU does not change SS:RSP and took interrupt
on userspace stack (BTW init_tss and gdb_page are not corrupted).
That is completely weird.

Next step was to trace VMCB before and after VMRUN to understand the
exact state seen by real CPU.  VMCB was traced when wrong CPU state is
observed: RIP is to kernel and RSP is from userspace.  The following
is a diff of VMCB, where
--- is the state before VMRUN and
+++ is the state after VMRUN:

 -      event_inj = 0x80000081,
 +      event_inj = 0x0,

        ...

        cs = {
 -        selector = 0x33,
 -        attrib = 0xafb,
 +        selector = 0x10,
 +        attrib = 0x29b,
          limit = 0xffffffff,
          base = 0x0
        },
        ss = {
          selector = 0x2b,
          attrib = 0x0,
          limit = 0xffffffff,
          base = 0x0
        },

        ...

        cpl = 0x0,              <<<<<< WTF?

        ...

 -      rip = 0x7f6cfa817ae1,
 +      rip = 0xffffffff8152b690,

        ...

 -      rsp = 0x7f6cb95567f0,
 +      rsp = 0x7f6cb9556768,

The execution scenario is the following:

  1. Userspace code was suspended.
  2. Virtio 0x80000081 interrupt was injected and VMRUN called again.
  3. HW CPU took virtual interrupt, but did not switch the stack because
     CPL *was* already set to 0.
  4. KABOOOOOM.

So the stack switch does not happen because "the processor performs an
automatic stack switch when a control transfer causes a change in
privilege levels to occur" (AMD manual), and in this case privilege
level was not changed.

The question remains who and why changed CPL to 0?

Further tracking of VMCB states before and after VMRUN showed, that
CPL becomes 0 when VMEXIT happens with the following SS segment:

          ss = {
            selector = 0x2b,
            attrib = 0x400,
            limit = 0xffffffff,
            base = 0x0
          },

          cpl = 0x3

Then on next VMRUN VMCB looks as the following:

          ss = {
            selector = 0x2b,
            attrib = 0x0,            <<< dropped to 0
            limit = 0xffffffff,
            base = 0x0
          },

          cpl = 0x0,                 <<< dropped to 0

Obviously it was changed between VMRUN calls.  The following backtrace
shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:

  CPU: 55 PID: 59531 Comm: kvm
  [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
  [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
  [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
  [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
  [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0

SS segment which came from QEMU had the following struct members:

       SS->base      = 0
       SS->limit     = ffffffff
       SS->selector  = 2b
       SS->type      = 0
       SS->present   = 0
       SS->dpl       = 0
       SS->db        = 0
       SS->s         = 0
       SS->l         = 0
       SS->g         = 0
       SS->avl       = 0
       SS->unusable  = 1

Indeed, on last VMEXIT SS segment does not have (P) present bit set in
segment attributes:

   (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
   $1 = 0

So when on VMEXIT we have such SS state (P bit is not set) and QEMU
just decides to synchronize registers the following happens on QEMU
side:

   kvm_cpu_synchronize_state():
       kvm_arch_get_registers():
           ...
           get_seg():
              if (rhs->unusable) {
                  lhs->flags = 0;    <<< SS is unusable [(P) is not set)
                                     <<< all attributes are dropped to 0.
               }
       cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty

On next VCPU enter registers will be sent from QEMU to KVM back, in order
to sync the state.  In its turn svm_set_segment() on KVM side has the
following code:

    if (var->unusable)    <<< Yes, SS is unusable, attrib is dropped to 0
        s->attrib = 0;
    ...
    if (seg == VCPU_SREG_SS)
        svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                              Now CPL is also dropped to 0

We see two possible fixes:

1. Simple one, KVM SVM side, which makes sure that CPL is not updated
   if segment is unusable:

   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1549,7 +1549,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
            * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
            * would entail passing the CPL to userspace and back.
            */
   -       if (seg == VCPU_SREG_SS)
   +       if (seg == VCPU_SREG_SS && !var->unusable)
                   svm->vmcb->save.cpl = (s->attrib >>
SVM_SELECTOR_DPL_SHIFT) & 3;

2. A bit complicated, which makes sure the CPL field is preserved across
   KVM_GET/SET_SREGS calls and makes svm_set_segment() and svm_get_segment()
   functionality symmethric:

   KVM SVM side:
   -------------

   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1999,7 +1999,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
            * would entail passing the CPL to userspace and back.
            */
           if (seg == VCPU_SREG_SS)
   -               svm->vmcb->save.cpl = (s->attrib >>
SVM_SELECTOR_DPL_SHIFT) & 3;
   +               svm->vmcb->save.cpl = (var->dpl & 3);

           mark_dirty(svm->vmcb, VMCB_SEG);
   }

   QEMU side:
   ----------

   --- a/target/i386/kvm.c
   +++ b/target/i386/kvm.c
   @@ -1979,6 +1979,8 @@ static int kvm_get_sregs(X86CPU *cpu)
        get_seg(&env->segs[R_FS], &sregs.fs);
        get_seg(&env->segs[R_GS], &sregs.gs);
        get_seg(&env->segs[R_SS], &sregs.ss);
   +    if (sregs.ss.unusable)
   +        env->segs[R_SS].flags |= sregs.ss.dpl << DESC_DPL_SHIFT;

        get_seg(&env->tr, &sregs.tr);
        get_seg(&env->ldt, &sregs.ldt);

Current email is an RFC since for us is not fully clear is it really
needed to preserve DPL across KVM_SET/GET_SREGS calls when segment
is unusable.  E.g. there was a commit:

4cae9c97967a ("target-i386: kvm: clear unusable segments' flags in migration")

which in purpose drops all segment flags to zero on QEMU side in order
to fix guests migration.

--
Roman

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread