From: Kashyap Chamarthy <kchamart@redhat.com>
To: kvm@vger.kernel.org, jan.kiszka@siemens.com
Cc: dgilbert@redhat.com
Subject: [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 *rebooting*
Date: Mon, 16 Feb 2015 21:40:13 +0100
Message-ID: <20150216204013.GI21838@tesla.redhat.com>
I can observe this on only one of the Intel Xeon machines (48 CPUs,
1 TB memory), but it is very reliably reproducible there.
Reproducer:
- Ensure the physical host (L0) and the guest hypervisor (L1) are both
  running a 3.20.0-0.rc0.git5.1 kernel (I used Fedora Rawhide's build).
  Preferably use an Intel Xeon machine, as that's where I could
  reproduce this issue; I could not reproduce it on a Haswell machine.
- Boot an L2 guest: Run `qemu-sanity-check --accel=kvm` in L1 (or
your own preferred method to boot an L2 KVM guest).
- In a different terminal, attached to L1's serial console, observe L1
  reboot.
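The setup checks in the first step can be condensed into a small shell
sketch (a hypothetical helper, not part of the report; the sysfs path
assumes the kvm_intel module, and the qemu-sanity-check step is left
commented because it must be run from inside L1):

```shell
#!/bin/sh
# Pre-flight checks for the reproducer: run on L0 and again inside L1,
# then boot the L2 guest from within L1.
check_host() {
    # Both L0 and L1 must run the same Rawhide kernel build.
    echo "kernel: $(uname -r)"   # expect 3.20.0-0.rc0.git5.1.fc23.x86_64
    # Nested VMX must be enabled on L0 for L1 to act as a hypervisor.
    if [ -r /sys/module/kvm_intel/parameters/nested ]; then
        echo "nested: $(cat /sys/module/kvm_intel/parameters/nested)"
    else
        echo "nested: kvm_intel not loaded"
    fi
}

check_host
# Then, inside L1 only:
#   qemu-sanity-check --accel=kvm
```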
The only thing I notice in `dmesg` (on L0) is the trace below.
_However_, this trace does not appear when the L1 reboot is triggered
while watching `dmesg -w` (i.e. waiting for new messages) as I boot an
L2 guest -- which means the trace below is not the root cause of L1
being rebooted. When L2 gets rebooted, all you observe is one of the
"vcpu0 unhandled rdmsr: 0x1a6" messages shown below.
. . .
[Feb16 13:44] ------------[ cut here ]------------
[ +0.004632] WARNING: CPU: 4 PID: 1837 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]()
[ +0.009835] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ip6table_filter ip6_tables cfg80211 rfkill iTCO_wdt iTCO_vendor_support ipmi_devintf gpio_ich dcdbas coretemp kvm_intel kvm crc32c_intel ipmi_ssif serio_raw acpi_power_meter ipmi_si tpm_tis ipmi_msghandler tpm lpc_ich i7core_edac mfd_core edac_core acpi_cpufreq shpchp wmi mgag200 i2c_algo_bit drm_kms_helper ttm ata_generic drm pata_acpi megaraid_sas bnx2
[ +0.050289] CPU: 4 PID: 1837 Comm: qemu-system-x86 Not tainted 3.20.0-0.rc0.git5.1.fc23.x86_64 #1
[ +0.008902] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.8.2 10/25/2012
[ +0.007469] 0000000000000000 00000000ee6c0c54 ffff88bf60bf7c18 ffffffff818760f7
[ +0.007542] 0000000000000000 0000000000000000 ffff88bf60bf7c58 ffffffff810ab80a
[ +0.007519] ffff88ff625b8000 ffff883f55f9b000 0000000000000000 0000000000000014
[ +0.007489] Call Trace:
[ +0.002471] [<ffffffff818760f7>] dump_stack+0x4c/0x65
[ +0.005152] [<ffffffff810ab80a>] warn_slowpath_common+0x8a/0xc0
[ +0.006020] [<ffffffff810ab93a>] warn_slowpath_null+0x1a/0x20
[ +0.005851] [<ffffffffa130957e>] nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]
[ +0.006974] [<ffffffffa130c5f7>] ? vmx_handle_exit+0x1e7/0xcb2 [kvm_intel]
[ +0.006999] [<ffffffffa02ca972>] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[ +0.007239] [<ffffffffa130992a>] vmx_queue_exception+0x10a/0x150 [kvm_intel]
[ +0.007136] [<ffffffffa02cb30b>] kvm_arch_vcpu_ioctl_run+0x106b/0x1b50 [kvm]
[ +0.007162] [<ffffffffa02ca972>] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[ +0.007241] [<ffffffff8110760d>] ? trace_hardirqs_on+0xd/0x10
[ +0.005864] [<ffffffffa02b2df6>] ? vcpu_load+0x26/0x70 [kvm]
[ +0.005761] [<ffffffff81103c0f>] ? lock_release_holdtime.part.29+0xf/0x200
[ +0.006979] [<ffffffffa02c5f88>] ? kvm_arch_vcpu_load+0x58/0x210 [kvm]
[ +0.006634] [<ffffffffa02b3203>] kvm_vcpu_ioctl+0x383/0x7e0 [kvm]
[ +0.006197] [<ffffffff81027b9d>] ? native_sched_clock+0x2d/0xa0
[ +0.006026] [<ffffffff810d5fc6>] ? creds_are_invalid.part.1+0x16/0x50
[ +0.006537] [<ffffffff810d6021>] ? creds_are_invalid+0x21/0x30
[ +0.005930] [<ffffffff813a61da>] ? inode_has_perm.isra.48+0x2a/0xa0
[ +0.006365] [<ffffffff8128c7b8>] do_vfs_ioctl+0x2e8/0x530
[ +0.005496] [<ffffffff8128ca81>] SyS_ioctl+0x81/0xa0
[ +0.005065] [<ffffffff8187f8e9>] system_call_fastpath+0x12/0x17
[ +0.006014] ---[ end trace 2f24e0820b44f686 ]---
[ +5.870886] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[ +0.004991] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
[Feb16 14:18] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[ +0.004998] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
. . .
Version
-------
The exact versions below were used on both L0 and L1:
$ uname -r; rpm -q qemu-system-x86
3.20.0-0.rc0.git5.1.fc23.x86_64
qemu-system-x86-2.2.0-5.fc22.x86_64
Other info
----------
- Unpacking kernel-3.20.0-0.rc0.git5.1.fc23.src.rpm and looking at
  arch/x86/kvm/vmx.c, line 9190 is the WARN_ON_ONCE() below, shown
  with its surrounding code:
[. . .]
9178 * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
9179 * and modify vmcs12 to make it see what it would expect to see there if
9180 * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
9181 */
9182 static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
9183 u32 exit_intr_info,
9184 unsigned long exit_qualification)
9185 {
9186 struct vcpu_vmx *vmx = to_vmx(vcpu);
9187 struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
9188
9189 /* trying to cancel vmlaunch/vmresume is a bug */
9190 WARN_ON_ONCE(vmx->nested.nested_run_pending);
9191
9192 leave_guest_mode(vcpu);
9193 prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
9194 exit_qualification);
9195
9196 vmx_load_vmcs01(vcpu);
9197
9198 if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
9199 && nested_exit_intr_ack_set(vcpu)) {
9200 int irq = kvm_cpu_get_interrupt(vcpu);
9201 WARN_ON(irq < 0);
9202 vmcs12->vm_exit_intr_info = irq |
9203 INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
9204 }
- The above line 9190 was introduced in this commit:
$ git log -S'WARN_ON_ONCE(vmx->nested.nested_run_pending)' \
-- ./arch/x86/kvm/vmx.c
commit 5f3d5799974b89100268ba813cec8db7bd0693fb
Author: Jan Kiszka <jan.kiszka@siemens.com>
Date: Sun Apr 14 12:12:46 2013 +0200
KVM: nVMX: Rework event injection and recovery
The basic idea is to always transfer the pending event injection on
vmexit into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1, i.e. if we enter
prepare_vmcs12.
vmcs12_save_pending_events takes care to transfer pending L0 events into
the queue of L1. That is mandatory as L1 may decide to switch the guest
state completely, invalidating or preserving the pending events for
later injection (including on a different node, once we support
migration).
This concept is based on the rule that a pending vmlaunch/vmresume is
not canceled. Otherwise, we would risk to lose injected events or leak
them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
entry of nested_vmx_vmexit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
- `dmesg`, `dmidecode`, and `x86info -a` details of L0 and L1 are here:
  https://kashyapc.fedorapeople.org/virt/Info-L0-Intel-Xeon-and-L1-nVMX-test/
--
/kashyap
Thread overview: 24+ messages
2015-02-16 20:40 Kashyap Chamarthy [this message]
2015-02-17 6:02 ` [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 *rebooting* Jan Kiszka
2015-02-17 11:24 ` Kashyap Chamarthy
2015-02-17 18:00 ` Bandan Das
2015-02-17 18:07 ` Jan Kiszka
2015-02-18 10:20 ` Kashyap Chamarthy
2015-02-18 16:42 ` Paolo Bonzini
2015-02-19 12:07 ` Kashyap Chamarthy
2015-02-19 15:01 ` Radim Krčmář
2015-02-19 16:02 ` Radim Krčmář
2015-02-19 16:07 ` Radim Krčmář
2015-02-19 21:10 ` Kashyap Chamarthy
2015-02-19 22:28 ` Kashyap Chamarthy
2015-02-20 16:14 ` Radim Krčmář
2015-02-20 19:45 ` Kashyap Chamarthy
2015-02-22 15:46 ` Kashyap Chamarthy
2015-02-23 13:56 ` Radim Krčmář
2015-02-23 16:14 ` Kashyap Chamarthy
2015-02-23 17:09 ` Kashyap Chamarthy
2015-02-23 18:05 ` Kashyap Chamarthy
2015-02-24 16:30 ` [PATCH] KVM: nVMX: mask unrestricted_guest if disabled on L0 Radim Krčmář
2015-02-24 16:39 ` Jan Kiszka
2015-02-24 18:32 ` Bandan Das
2015-02-25 15:50 ` Kashyap Chamarthy