From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kiszka
Subject: Re: [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 *rebooting*
Date: Tue, 17 Feb 2015 07:02:14 +0100
Message-ID: <54E2D966.9070706@siemens.com>
References: <20150216204013.GI21838@tesla.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Cc: dgilbert@redhat.com
To: Kashyap Chamarthy, kvm@vger.kernel.org
Received: from goliath.siemens.de ([192.35.17.28]:55661 "EHLO goliath.siemens.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751707AbbBQGCX (ORCPT ); Tue, 17 Feb 2015 01:02:23 -0500
In-Reply-To: <20150216204013.GI21838@tesla.redhat.com>
Sender: kvm-owner@vger.kernel.org

On 2015-02-16 21:40, Kashyap Chamarthy wrote:
> I can observe this only on one of the Intel Xeon machines (which has 48
> CPUs and 1TB memory), but very reliably reproducible.
>
>
> Reproducer:
>
>   - Just ensure physical host (L0) and guest hypervisor (L1) are running
>     3.20.0-0.rc0.git5.1 Kernel (I used from Fedora's Rawhide).
>     Preferably on an Intel Xeon machine - as that's where I could
>     reproduce this issue, not on a Haswell machine
>   - Boot an L2 guest: Run `qemu-sanity-check --accel=kvm` in L1 (or
>     your own preferred method to boot an L2 KVM guest).
>   - On a different terminal, which has serial console for L1: observe L1
>     reboot
>
>
> The only thing I notice in `dmesg` (on L0) is this trace. _However_ this
> trace does not occur when an L1 reboot is triggered while you watch
> `dmesg -w` (to wait for new messages) as I boot an L2 guest -- which
> means, the below trace is not the root cause of L1 being rebooted. When
> the L2 gets rebooted, what you observe is just one of these messages
> "vcpu0 unhandled rdmsr: 0x1a6" below
>
> . . .
> [Feb16 13:44] ------------[ cut here ]------------
> [ +0.004632] WARNING: CPU: 4 PID: 1837 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]()
> [ +0.009835] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ip6table_filter ip6_tables cfg80211 rfkill iTCO_wdt iTCO_vendor_support ipmi_devintf gpio_ich dcdbas coretemp kvm_intel kvm crc32c_intel ipmi_ssif serio_raw acpi_power_meter ipmi_si tpm_tis ipmi_msghandler tpm lpc_ich i7core_edac mfd_core edac_core acpi_cpufreq shpchp wmi mgag200 i2c_algo_bit drm_kms_helper ttm ata_generic drm pata_acpi megaraid_sas bnx2
> [ +0.050289] CPU: 4 PID: 1837 Comm: qemu-system-x86 Not tainted 3.20.0-0.rc0.git5.1.fc23.x86_64 #1
> [ +0.008902] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.8.2 10/25/2012
> [ +0.007469] 0000000000000000 00000000ee6c0c54 ffff88bf60bf7c18 ffffffff818760f7
> [ +0.007542] 0000000000000000 0000000000000000 ffff88bf60bf7c58 ffffffff810ab80a
> [ +0.007519] ffff88ff625b8000 ffff883f55f9b000 0000000000000000 0000000000000014
> [ +0.007489] Call Trace:
> [ +0.002471] [] dump_stack+0x4c/0x65
> [ +0.005152] [] warn_slowpath_common+0x8a/0xc0
> [ +0.006020] [] warn_slowpath_null+0x1a/0x20
> [ +0.005851] [] nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]
> [ +0.006974] [] ? vmx_handle_exit+0x1e7/0xcb2 [kvm_intel]
> [ +0.006999] [] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
> [ +0.007239] [] vmx_queue_exception+0x10a/0x150 [kvm_intel]
> [ +0.007136] [] kvm_arch_vcpu_ioctl_run+0x106b/0x1b50 [kvm]
> [ +0.007162] [] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
> [ +0.007241] [] ? trace_hardirqs_on+0xd/0x10
> [ +0.005864] [] ? vcpu_load+0x26/0x70 [kvm]
> [ +0.005761] [] ? lock_release_holdtime.part.29+0xf/0x200
> [ +0.006979] [] ? kvm_arch_vcpu_load+0x58/0x210 [kvm]
> [ +0.006634] [] kvm_vcpu_ioctl+0x383/0x7e0 [kvm]
> [ +0.006197] [] ? native_sched_clock+0x2d/0xa0
> [ +0.006026] [] ? creds_are_invalid.part.1+0x16/0x50
> [ +0.006537] [] ? creds_are_invalid+0x21/0x30
> [ +0.005930] [] ? inode_has_perm.isra.48+0x2a/0xa0
> [ +0.006365] [] do_vfs_ioctl+0x2e8/0x530
> [ +0.005496] [] SyS_ioctl+0x81/0xa0
> [ +0.005065] [] system_call_fastpath+0x12/0x17
> [ +0.006014] ---[ end trace 2f24e0820b44f686 ]---
> [ +5.870886] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
> [ +0.004991] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
> [ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
> [Feb16 14:18] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
> [ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
> [ +0.004998] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
> . . .
>
>
> Version
> -------
>
> Exact below versions were used on L0 and L1:
>
>   $ uname -r; rpm -q qemu-system-x86
>   3.20.0-0.rc0.git5.1.fc23.x86_64
>   qemu-system-x86-2.2.0-5.fc22.x86_64
>
>
>
> Other info
> ----------
>
>   - Unpacking the kernel-3.20.0-0.rc0.git5.1.fc23.src.rpm and looking at
>     this file, arch/x86/kvm/vmx.c, line 9190 is below, with contextual
>     code:
>
>     [. . .]
>     9178  * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
>     9179  * and modify vmcs12 to make it see what it would expect to see there if
>     9180  * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
>     9181  */
>     9182 static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>     9183                               u32 exit_intr_info,
>     9184                               unsigned long exit_qualification)
>     9185 {
>     9186         struct vcpu_vmx *vmx = to_vmx(vcpu);
>     9187         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>     9188
>     9189         /* trying to cancel vmlaunch/vmresume is a bug */
>     9190         WARN_ON_ONCE(vmx->nested.nested_run_pending);
>     9191
>     9192         leave_guest_mode(vcpu);
>     9193         prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
>     9194                        exit_qualification);
>     9195
>     9196         vmx_load_vmcs01(vcpu);
>     9197
>     9198         if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
>     9199             && nested_exit_intr_ack_set(vcpu)) {
>     9200                 int irq = kvm_cpu_get_interrupt(vcpu);
>     9201                 WARN_ON(irq < 0);
>     9202                 vmcs12->vm_exit_intr_info = irq |
>     9203                         INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
>     9204         }
>
>
>   - The above line 9190 was introduced in this commit:
>
>     $ git log -S'WARN_ON_ONCE(vmx->nested.nested_run_pending)' \
>         -- ./arch/x86/kvm/vmx.c
>     commit 5f3d5799974b89100268ba813cec8db7bd0693fb
>     Author: Jan Kiszka
>     Date:   Sun Apr 14 12:12:46 2013 +0200
>
>     KVM: nVMX: Rework event injection and recovery
>
>     The basic idea is to always transfer the pending event injection on
>     vmexit into the architectural state of the VCPU and then drop it from
>     there if it turns out that we left L2 to enter L1, i.e. if we enter
>     prepare_vmcs12.
>
>     vmcs12_save_pending_events takes care to transfer pending L0 events into
>     the queue of L1. That is mandatory as L1 may decide to switch the guest
>     state completely, invalidating or preserving the pending events for
>     later injection (including on a different node, once we support
>     migration).
>
>     This concept is based on the rule that a pending vmlaunch/vmresume is
>     not canceled. Otherwise, we would risk to lose injected events or leak
>     them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
>     entry of nested_vmx_vmexit.
>
>     Signed-off-by: Jan Kiszka
>     Signed-off-by: Gleb Natapov
>
>
>   - `dmesg`, `dmidecode`, `x86info -a` details of L0 and L1 here
>
>       https://kashyapc.fedorapeople.org/virt/Info-L0-Intel-Xeon-and-L1-nVMX-test/
>

Does enable_apicv make a difference?

Is this a regression caused by the commit, or do you only see it with
very recent kvm.git?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
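
For the enable_apicv question above, a minimal sketch of how the current
setting could be read on L0 (and on L1). It assumes that kvm_intel is loaded
and exposes its parameters under /sys/module/kvm_intel/parameters/; the helper
name read_kvm_intel_param is hypothetical and not part of any existing tool.

```python
from pathlib import Path
from typing import Optional

# Assumption: kvm_intel module parameters are exposed at this sysfs location.
KVM_INTEL_PARAMS = Path("/sys/module/kvm_intel/parameters")


def read_kvm_intel_param(name: str, base: Path = KVM_INTEL_PARAMS) -> Optional[str]:
    """Return the raw value of a kvm_intel parameter, or None if it is not exposed."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        # Module not loaded, parameter absent, or file not readable.
        return None


if __name__ == "__main__":
    # Print the two parameters most relevant to this report.
    for name in ("enable_apicv", "nested"):
        value = read_kvm_intel_param(name)
        print(f"{name}: {value if value is not None else 'not available'}")
```

Testing with APICv off would presumably mean reloading kvm_intel with
enable_apicv=0 while no guests are running, then repeating the reproducer.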