From: Kashyap Chamarthy
Subject: [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 *rebooting*
Date: Mon, 16 Feb 2015 21:40:13 +0100
Message-ID: <20150216204013.GI21838@tesla.redhat.com>
Cc: dgilbert@redhat.com
To: kvm@vger.kernel.org, jan.kiszka@siemens.com

I can observe this only on one of the Intel Xeon machines (which has 48
CPUs and 1 TB of memory), but there it is very reliably reproducible.

Reproducer:

  - Ensure the physical host (L0) and the guest hypervisor (L1) are
    running the 3.20.0-0.rc0.git5.1 kernel (I used the one from Fedora
    Rawhide). Preferably use an Intel Xeon machine, as that is where I
    could reproduce this issue; I could not reproduce it on a Haswell
    machine.

  - Boot an L2 guest: run `qemu-sanity-check --accel=kvm` in L1 (or
    your own preferred method to boot an L2 KVM guest). A rough command
    sketch follows this list.

  - On a different terminal, which has the serial console for L1:
    observe L1 reboot.
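In case it is useful, here is roughly the command sequence I use for the
above, as a sketch rather than an exact transcript. It assumes L1 is a
libvirt-managed guest; the domain name 'l1-guest' is only a placeholder
for whatever the L1 guest is called in your setup:

  # On L0: confirm nested VMX is enabled and note the running kernel
  $ cat /sys/module/kvm_intel/parameters/nested   # expected to print Y
  $ uname -r

  # On L0, terminal 1: attach to the serial console of the L1 guest
  # ('l1-guest' is a placeholder for your L1 libvirt domain name)
  $ virsh console l1-guest

  # Inside L1, terminal 2: boot a throwaway L2 KVM guest
  $ qemu-sanity-check --accel=kvm

  # On L0, terminal 3: wait for new kernel messages while L2 boots
  $ dmesg -w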
The only thing I notice in `dmesg` (on L0) is the trace below.
_However_, this trace does not occur when an L1 reboot is triggered
while you watch `dmesg -w` (to wait for new messages) as I boot an L2
guest, which means the trace below is not the root cause of L1 being
rebooted. When the L2 gets rebooted, what you observe is just one of
these "vcpu0 unhandled rdmsr: 0x1a6" messages below.

. . .

[Feb16 13:44] ------------[ cut here ]------------
[  +0.004632] WARNING: CPU: 4 PID: 1837 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]()
[  +0.009835] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ip6table_filter ip6_tables cfg80211 rfkill iTCO_wdt iTCO_vendor_support ipmi_devintf gpio_ich dcdbas coretemp kvm_intel kvm crc32c_intel ipmi_ssif serio_raw acpi_power_meter ipmi_si tpm_tis ipmi_msghandler tpm lpc_ich i7core_edac mfd_core edac_core acpi_cpufreq shpchp wmi mgag200 i2c_algo_bit drm_kms_helper ttm ata_generic drm pata_acpi megaraid_sas bnx2
[  +0.050289] CPU: 4 PID: 1837 Comm: qemu-system-x86 Not tainted 3.20.0-0.rc0.git5.1.fc23.x86_64 #1
[  +0.008902] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.8.2 10/25/2012
[  +0.007469]  0000000000000000 00000000ee6c0c54 ffff88bf60bf7c18 ffffffff818760f7
[  +0.007542]  0000000000000000 0000000000000000 ffff88bf60bf7c58 ffffffff810ab80a
[  +0.007519]  ffff88ff625b8000 ffff883f55f9b000 0000000000000000 0000000000000014
[  +0.007489] Call Trace:
[  +0.002471]  [] dump_stack+0x4c/0x65
[  +0.005152]  [] warn_slowpath_common+0x8a/0xc0
[  +0.006020]  [] warn_slowpath_null+0x1a/0x20
[  +0.005851]  [] nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]
[  +0.006974]  [] ? vmx_handle_exit+0x1e7/0xcb2 [kvm_intel]
[  +0.006999]  [] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[  +0.007239]  [] vmx_queue_exception+0x10a/0x150 [kvm_intel]
[  +0.007136]  [] kvm_arch_vcpu_ioctl_run+0x106b/0x1b50 [kvm]
[  +0.007162]  [] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[  +0.007241]  [] ? trace_hardirqs_on+0xd/0x10
[  +0.005864]  [] ? vcpu_load+0x26/0x70 [kvm]
[  +0.005761]  [] ? lock_release_holdtime.part.29+0xf/0x200
[  +0.006979]  [] ? kvm_arch_vcpu_load+0x58/0x210 [kvm]
[  +0.006634]  [] kvm_vcpu_ioctl+0x383/0x7e0 [kvm]
[  +0.006197]  [] ? native_sched_clock+0x2d/0xa0
[  +0.006026]  [] ? creds_are_invalid.part.1+0x16/0x50
[  +0.006537]  [] ? creds_are_invalid+0x21/0x30
[  +0.005930]  [] ? inode_has_perm.isra.48+0x2a/0xa0
[  +0.006365]  [] do_vfs_ioctl+0x2e8/0x530
[  +0.005496]  [] SyS_ioctl+0x81/0xa0
[  +0.005065]  [] system_call_fastpath+0x12/0x17
[  +0.006014] ---[ end trace 2f24e0820b44f686 ]---
[  +5.870886] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[  +0.004991] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[  +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
[Feb16 14:18] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[  +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[  +0.004998] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6

. . .

Version
-------

The exact versions below were used on L0 and L1:

  $ uname -r; rpm -q qemu-system-x86
  3.20.0-0.rc0.git5.1.fc23.x86_64
  qemu-system-x86-2.2.0-5.fc22.x86_64

Other info
----------

- Unpacking kernel-3.20.0-0.rc0.git5.1.fc23.src.rpm and looking at
  arch/x86/kvm/vmx.c, line 9190 is the WARN_ON_ONCE below, shown with
  contextual code:

  [. . .]
  9178  * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
  9179  * and modify vmcs12 to make it see what it would expect to see there if
  9180  * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
  9181  */
  9182 static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
  9183                               u32 exit_intr_info,
  9184                               unsigned long exit_qualification)
  9185 {
  9186         struct vcpu_vmx *vmx = to_vmx(vcpu);
  9187         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
  9188
  9189         /* trying to cancel vmlaunch/vmresume is a bug */
  9190         WARN_ON_ONCE(vmx->nested.nested_run_pending);
  9191
  9192         leave_guest_mode(vcpu);
  9193         prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
  9194                        exit_qualification);
  9195
  9196         vmx_load_vmcs01(vcpu);
  9197
  9198         if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
  9199             && nested_exit_intr_ack_set(vcpu)) {
  9200                 int irq = kvm_cpu_get_interrupt(vcpu);
  9201                 WARN_ON(irq < 0);
  9202                 vmcs12->vm_exit_intr_info = irq |
  9203                              INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
  9204         }

- The above line 9190 was introduced in this commit:

  $ git log -S'WARN_ON_ONCE(vmx->nested.nested_run_pending)' \
      -- ./arch/x86/kvm/vmx.c

  commit 5f3d5799974b89100268ba813cec8db7bd0693fb
  Author: Jan Kiszka
  Date:   Sun Apr 14 12:12:46 2013 +0200

      KVM: nVMX: Rework event injection and recovery

      The basic idea is to always transfer the pending event injection on
      vmexit into the architectural state of the VCPU and then drop it from
      there if it turns out that we left L2 to enter L1, i.e. if we enter
      prepare_vmcs12.

      vmcs12_save_pending_events takes care to transfer pending L0 events
      into the queue of L1. That is mandatory as L1 may decide to switch the
      guest state completely, invalidating or preserving the pending events
      for later injection (including on a different node, once we support
      migration).

      This concept is based on the rule that a pending vmlaunch/vmresume is
      not canceled. Otherwise, we would risk to lose injected events or leak
      them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
      entry of nested_vmx_vmexit.

      Signed-off-by: Jan Kiszka
      Signed-off-by: Gleb Natapov

- `dmesg`, `dmidecode`, `x86info -a` details of L0 and L1 are here:
  https://kashyapc.fedorapeople.org/virt/Info-L0-Intel-Xeon-and-L1-nVMX-test/

--
/kashyap