From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 19 Nov 2015 03:41:03 -0000
From: lickdragon <1494350@bugs.launchpad.net>
References: <20150910143810.8783.51069.malonedeb@wampee.canonical.com>
Message-Id: <20151119034103.20434.1913.malone@soybean.canonical.com>
Subject: [Qemu-devel] [Bug 1494350] Re: QEMU: causes vCPU steal time overflow on live migration
Reply-To: Bug 1494350 <1494350@bugs.launchpad.net>
To: qemu-devel@nongnu.org

Hello,

I read in the thread:

    "Applied to kvm/queue.  Thanks Marcelo, and thanks David for the
    review.

    Paolo"

But I cannot find where the patch enters the qemu git repo:

http://git.qemu.org/?p=qemu.git&a=search&h=HEAD&st=author&s=Tosatti

Is it not there yet?

Thanks!
C.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1494350

Title:
  QEMU: causes vCPU steal time overflow on live migration

Status in QEMU:
  New

Bug description:

I'm pasting in text from Debian Bug 785557
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557) because I
couldn't find this issue reported here. It is present in QEMU 2.3, but I
haven't tested later versions. Perhaps someone else will find this bug
and confirm for later versions. (Or I will when I have time!)

----------------------------------------------------------------------

Hi,

I'm trying to debug an issue we're having with some debian.org machines
running in QEMU 2.1.2 instances (see [1] for more background). In short,
after a live migration, guests running Debian Jessie (linux 3.16) stop
accounting CPU time properly.
/proc/stat in the guest shows no increase in user and system time anymore
(regardless of workload), and what stands out are extremely large values
for steal time:

% cat /proc/stat
cpu  2400 0 1842 650879168 2579640 0 25 136562317270 0 0
cpu0 1366 0 1028 161392988 1238598 0 11 383803090749 0 0
cpu1 294 0 240 162582008 639105 0 8 39686436048 0 0
cpu2 406 0 338 163331066 383867 0 4 333994238765 0 0
cpu3 332 0 235 163573105 318069 0 1 1223752959076 0 0
intr 355773871 33 10 0 0 0 0 3 0 1 0 0 36 144 0 0 1638612 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5001741 41 0 8516993 0 3669582 [long run of zero counters trimmed]
ctxt 837862829
btime 1431642967
processes 8529939
procs_running 1
procs_blocked 0
softirq 225193331 2 77532878 172 7250024 819289 0 54 33739135 176552 105675225

Reading the memory pointed to by the steal time MSRs pre- and
post-migration, I can see that post-migration the high bytes are set to
0xff:

(qemu) xp /8b 0x1fc0cfc0
000000001fc0cfc0: 0x94 0x57 0x77 0xf5 0xff 0xff 0xff 0xff

The "jump" in steal time happens when the guest is resumed on the
receiving side.

I've also been able to consistently reproduce this on a Ganeti cluster at
work, using QEMU 2.1.3 and kernels 3.16 and 4.0 in the guests. The issue
goes away if I disable the steal time MSR using
`-cpu qemu64,-kvm_steal_time`.

So, it looks to me as if the steal time MSR is not set/copied properly
during live migration, although AFAICT this should be the case after
917367aa968fd4fef29d340e0c7ec8c608dffaab.
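For illustration, those all-0xff high bytes are exactly the signature of
an unsigned 64-bit subtraction that wrapped around. A minimal userspace
sketch (not KVM code; the two constants are taken from the systemtap
trace further down in this report):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* Constants from the trace below: last_steal was snapshotted while
     * another thread's sched_info was current, so it exceeds this
     * thread's own run_delay. */
    uint64_t last_steal = 7856949;  /* ns */
    uint64_t run_delay  = 40304;    /* ns */

    /* Unsigned 64-bit subtraction cannot go negative; it wraps modulo
     * 2^64, yielding a huge delta whose high bytes are all 0xff. */
    uint64_t delta = run_delay - last_steal;

    printf("delta = %" PRIu64 " ns\n", delta);
    /* Prints: delta = 18446744073701734971 ns */
    return 0;
}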
After investigating a bit more, it looks like the issue comes from an
overflow in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023):

static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
        u64 delta;

        if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                return;

        delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;

Using systemtap with the attached script to trace KVM execution on the
receiving host kernel, we can see that shortly before marking the vCPUs
as runnable on a migrated KVM instance with 2 vCPUs, the following
happens (** marks lines of interest):

**     0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_load
       0 vhost-18446(18447):  -> kvm_arch_vcpu_should_kick
       5 vhost-18446(18447):  <- kvm_arch_vcpu_should_kick
      23 qemu-system-x86(18446):  <- kvm_arch_vcpu_load
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_ioctl
       2 qemu-system-x86(18446):  <- kvm_arch_vcpu_ioctl
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_put
       2 qemu-system-x86(18446):   -> kvm_put_guest_fpu
       3 qemu-system-x86(18446):   <- kvm_put_guest_fpu
       4 qemu-system-x86(18446):  <- kvm_arch_vcpu_put
**     0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_load
       1 qemu-system-x86(18446):  <- kvm_arch_vcpu_load
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_ioctl
       1 qemu-system-x86(18446):  <- kvm_arch_vcpu_ioctl
       0 qemu-system-x86(18446):  -> kvm_arch_vcpu_put
       1 qemu-system-x86(18446):   -> kvm_put_guest_fpu
       2 qemu-system-x86(18446):   <- kvm_put_guest_fpu
       3 qemu-system-x86(18446):  <- kvm_arch_vcpu_put
**     0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns
       0 qemu-system-x86(18449):  -> kvm_arch_vcpu_load
**     7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns
      10 qemu-system-x86(18449):  <- kvm_arch_vcpu_load
**     0 qemu-system-x86(18449):  -> kvm_arch_vcpu_ioctl_run
       4 qemu-system-x86(18449):   -> kvm_arch_vcpu_runnable
       6 qemu-system-x86(18449):   <- kvm_arch_vcpu_runnable
   ...
       0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns
       0 qemu-system-x86(18448):  -> kvm_arch_vcpu_load
**    34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns
      40 qemu-system-x86(18448):  <- kvm_arch_vcpu_load
**     0 qemu-system-x86(18448):  -> kvm_arch_vcpu_ioctl_run
       5 qemu-system-x86(18448):   -> kvm_arch_vcpu_runnable

Now, what's really interesting is that current->sched_info.run_delay gets
reset because the tasks (threads) using the vCPUs change, and thus have a
different current->sched_info: it looks like task 18446 created the two
vCPUs, and then they were handed over to 18448 and 18449 respectively.
This is also verified by the fact that during the overflow, both vCPUs
have the old steal time of the last vcpu_load of task 18446. However,
according to Documentation/virtual/kvm/api.txt:

 - vcpu ioctls: These query and set attributes that control the operation
   of a single virtual cpu.

   Only run vcpu ioctls from the same thread that was used to create the
   vcpu.

So it seems qemu is doing something that it shouldn't: calling vCPU
ioctls from a thread that didn't create the vCPU. Note that this probably
happens on every QEMU startup, but is not visible because the guest
kernel zeroes out the steal time on boot.
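The per-thread nature of run_delay is easy to verify from userspace: each
thread of a process has its own sched_info, exposed as the second field
of /proc/<pid>/task/<tid>/schedstat (see
Documentation/scheduler/sched-stats.txt). A small sketch, assuming the
kernel is built with schedstat support (build with `gcc -pthread`):

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Print this thread's run_delay: the second field of
 * /proc/self/task/<tid>/schedstat, i.e. time spent waiting on a
 * runqueue, which is what accumulate_steal_time() reads via
 * current->sched_info. */
static void print_run_delay(const char *who)
{
    char path[64];
    unsigned long long on_cpu, run_delay, timeslices;
    long tid = syscall(SYS_gettid);

    snprintf(path, sizeof(path), "/proc/self/task/%ld/schedstat", tid);
    FILE *f = fopen(path, "r");
    if (f && fscanf(f, "%llu %llu %llu",
                    &on_cpu, &run_delay, &timeslices) == 3)
        printf("%s (tid %ld): run_delay=%llu ns\n", who, tid, run_delay);
    if (f)
        fclose(f);
}

static void *vcpu_thread(void *arg)
{
    (void)arg;
    /* A freshly created thread starts with its own, near-zero
     * sched_info -- the "reset" observed in the trace above. */
    print_run_delay("vcpu thread");
    return NULL;
}

int main(void)
{
    pthread_t t;

    print_run_delay("main thread");
    pthread_create(&t, NULL, vcpu_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}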
There are at least two ways to mitigate the issue without a kernel
recompilation (a kernel-side sketch of a third follows below):

 - The first one is to disable the steal time propagation from host to
   guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will
   short-circuit accumulate_steal_time() due to
   (vcpu->arch.st.msr_val & KVM_MSR_ENABLED) and will completely disable
   steal time reporting in the guest, which may not be desired if people
   rely on it to detect CPU congestion.

 - The other one is using the following systemtap script to prevent the
   steal time counter from overflowing by dropping the problematic
   samples (WARNING: systemtap guru mode required, use at your own risk):

   probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
     if (@defined($delta) && $delta < 0) {
       printk(4, "kvm: steal time delta < 0, dropping")
       $delta = 0
     }
   }

Note that not all *guests* handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is *decreasing* on every
clock tick.
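For completeness, the same guard can be expressed as a kernel patch
instead of a live probe. This is a hypothetical, untested sketch, not
necessarily the fix that eventually lands upstream: the clamp mirrors the
systemtap workaround above, and the two trailing assignments are assumed
from the 3.16-era arch/x86/kvm/x86.c, whose quote earlier in this report
stops at the delta computation:

static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
        u64 delta;

        if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                return;

        /* Drop the sample if run_delay went "backwards" relative to
         * the snapshot, e.g. because a different thread took it;
         * otherwise the unsigned subtraction wraps around. */
        if (current->sched_info.run_delay < vcpu->arch.st.last_steal)
                delta = 0;
        else
                delta = current->sched_info.run_delay -
                        vcpu->arch.st.last_steal;

        vcpu->arch.st.last_steal = current->sched_info.run_delay;
        vcpu->arch.st.accum_steal = delta;
}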
--------------------------------------------

Revised statement:

> Now, what's really interesting is that current->sched_info.run_delay
> gets reset because the tasks (threads) using the vCPUs change, and
> thus have a different current->sched_info: it looks like task 18446
> created the two vCPUs, and then they were handed over to 18448 and
> 18449 respectively. This is also verified by the fact that during the
> overflow, both vCPUs have the old steal time of the last vcpu_load of
> task 18446. However, according to Documentation/virtual/kvm/api.txt:

The above is not entirely accurate: the vCPUs were created by the threads
that are used to run them (18448 and 18449 respectively), it's just that
the main thread is issuing ioctls during initialization, as illustrated
by the strace output on a different process:

[ vCPU #0 thread creating vCPU #0 (fd 20) ]
[pid  1861] ioctl(14, KVM_CREATE_VCPU, 0) = 20
[pid  1861] ioctl(20, KVM_X86_SETUP_MCE, 0x7fbd3ca40cd8) = 0
[pid  1861] ioctl(20, KVM_SET_CPUID2, 0x7fbd3ca40ce0) = 0
[pid  1861] ioctl(20, KVM_SET_SIGNAL_MASK, 0x7fbd380008f0) = 0

[ vCPU #1 thread creating vCPU #1 (fd 21) ]
[pid  1862] ioctl(14, KVM_CREATE_VCPU, 0x1) = 21
[pid  1862] ioctl(21, KVM_X86_SETUP_MCE, 0x7fbd37ffdcd8) = 0
[pid  1862] ioctl(21, KVM_SET_CPUID2, 0x7fbd37ffdce0) = 0
[pid  1862] ioctl(21, KVM_SET_SIGNAL_MASK, 0x7fbd300008f0) = 0

[ Main thread calling kvm_arch_put_registers() on vCPU #0 ]
[pid  1859] ioctl(20, KVM_SET_REGS, 0x7ffc98aac230) = 0
[pid  1859] ioctl(20, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd38001000) = 0
[pid  1859] ioctl(20, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
[pid  1859] ioctl(20, KVM_SET_SREGS, 0x7ffc98aac050) = 0
[pid  1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aab820) = 87
[pid  1859] ioctl(20, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
[pid  1859] ioctl(20, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
[pid  1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
[pid  1859] ioctl(20, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
[pid  1859] ioctl(20, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

[ Main thread calling kvm_arch_put_registers() on vCPU #1 ]
[pid  1859] ioctl(21, KVM_SET_REGS, 0x7ffc98aac230) = 0
[pid  1859] ioctl(21, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd30001000) = 0
[pid  1859] ioctl(21, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
[pid  1859] ioctl(21, KVM_SET_SREGS, 0x7ffc98aac050) = 0
[pid  1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aab820) = 87
[pid  1859] ioctl(21, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
[pid  1859] ioctl(21, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
[pid  1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
[pid  1859] ioctl(21, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
[pid  1859] ioctl(21, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

Using systemtap again, I noticed that the main thread's run_delay is
copied to last_steal only after the main thread issues a KVM_SET_MSRS
ioctl that enables the steal time MSR (see linux
3.16.7-ckt11-1/arch/x86/kvm/x86.c:2162). Taking an educated guess, I
reverted the following qemu commits:

commit 0e5035776df31380a44a1a851850d110b551ecb6
Author: Marcelo Tosatti
Date:   Tue Sep 3 18:55:16 2013 -0300

    fix steal time MSR vmsd callback to proper opaque type

    Convert steal time MSR vmsd callback pointer to proper X86CPU type.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Paolo Bonzini

commit 917367aa968fd4fef29d340e0c7ec8c608dffaab
Author: Marcelo Tosatti
Date:   Tue Feb 19 23:27:20 2013 -0300

    target-i386: kvm: save/restore steal time MSR

    Read and write steal time MSR, so that reporting is functional
    across migration.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Gleb Natapov

and the steal time jump on migration went away. However, steal time was
not reported at all after migration, which is expected after reverting
917367aa.

So it seems that after 917367aa, the steal time MSR is correctly saved
and copied to the receiving side, but then it is restored by the main
thread (probably during cpu_synchronize_all_post_init()), causing the
overflow when the vCPU threads are unpaused.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1494350/+subscriptions