* XP machine freeze
@ 2015-03-16 15:10 Saso Slavicic
  2015-03-19  0:51 ` Marcelo Tosatti
  2015-03-22 15:31 ` Brad Campbell
  0 siblings, 2 replies; 25+ messages in thread
From: Saso Slavicic @ 2015-03-16 15:10 UTC (permalink / raw)
  To: kvm

Hi,

I'm fairly experienced with KVM (CentOS 5/6), running about a dozen servers
with 20-30 different (Linux & MS platform) systems.
I have one Windows XP machine that acts very strangely: it freezes. My
monitoring reports ping timeouts for the VM, and the machine spins 2 or 3
cores at 100% CPU. The interesting thing is that once you open the console,
it suddenly starts working again. You can see the clock catching up, as if
it had been frozen in time, and everything works normally once the timer
has caught up. It usually happens about once a month, although it happened
both yesterday and today.

This machine is on CentOS 6, qemu-kvm-0.12.1.2-2.448.el6_6, kernel
2.6.32-504.3.3.el6.x86_64.
I was able to do some debugging while the machine was frozen, so I have
some things to work with:

# virsh qemu-monitor-command --hmp DBserver 'info cpus'
* CPU #0: pc=0x0000000080501fdd thread_id=32595
  CPU #1: pc=0x00000000806e7a9b thread_id=32596
  CPU #2: pc=0x00000000ba2da162 (halted) thread_id=32597
  CPU #3: pc=0x00000000ba2da162 (halted) thread_id=32598

In both yesterday's and today's events, CPU0 was stopped at
0x0000000080501fdd. I've disassembled the function and got this:

 0x0000000080501fb5:  int3
 0x0000000080501fb6:  mov    %edi,%edi
 0x0000000080501fb8:  push   %ebp
 0x0000000080501fb9:  mov    %esp,%ebp
 0x0000000080501fbb:  push   %esi
 0x0000000080501fbc:  mov    %fs:0x20,%eax
 0x0000000080501fc2:  mov    0x8(%ebp),%ecx
 0x0000000080501fc5:  lea    -0x1(%ecx),%esi
 0x0000000080501fc8:  test   %esi,%ecx
 0x0000000080501fca:  lea    0x7ec(%eax),%edx
 0x0000000080501fd0:  pop    %esi
 0x0000000080501fd1:  je     0x80501fdd
 0x0000000080501fd3:  lea    0x7a0(%eax),%edx
 0x0000000080501fd9:  jmp    0x80501fdd
 *0x0000000080501fdb:  pause
 0x0000000080501fdd:  cmpl   $0x0,(%edx)
 0x0000000080501fe0:  jne    0x80501fdb
 0x0000000080501fe2:  pop    %ebp
 0x0000000080501fe3:  ret    $0x4
 0x0000000080501fe6:  int3

mov %edi,%edi is clearly the start of some function. From what I've been
able to understand, the code fetches the _KPRCB structure (%fs:0x20) and
then spins in a loop between fdb and fe0, waiting for the value that EDX
points to (0xffdff8c0, PacketBarrier?) to become 0. However, $pc always
shows the fdd address. Shouldn't it move between fdb and fe0? It looks as
if it were stuck at fdd.

# virsh qemu-monitor-command --hmp DBserver 'info registers'
 EAX=ffdff120 EBX=c06ddf58 ECX=0000000e EDX=ffdff8c0
 ESI=be6e3921 EDI=c06ddf60 EBP=ba4ff708 ESP=ba4ff708
 EIP=80501fdd EFL=00000202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
 CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
 SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 DS =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
 FS =0030 ffdff000 00001fff 00c09300 DPL=0 DS   [-WA]
 GS =0000 00000000 000fffff 00000000
 LDT=0000 00000000 000fffff 00000000
 TR =0028 80042000 000020ab 00008b00 DPL=0 TSS32-busy
 GDT=     8003f000 000003ff
 IDT=     8003f400 000007ff
 CR0=8001003b CR2=dbbec000 CR3=0b3c0020 CR4=000006f8
 DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
 DR6=ffff0ff0 DR7=00000400
 FCW=027f FSW=0020 [ST=0] FTW=00 MXCSR=00001fa0
 FPR0=8053632b003c1658 c048 FPR1=e1e0c048bf80f6ab 76f8
 FPR2=e1e0000000000000 0023 FPR3=0b017c30003c1658 0000
 FPR4=0000003bba1a7604 1e64 FPR5=0007268c00000000 003b
 FPR6=000002020000001b 2684 FPR7=e3e0a9b4e1b50de4 ca0b
 XMM00=0000000000a1fc95000000000020027f
 XMM01=0000ffff00001fa000001c4c00000001
 XMM02=000000000000c0488053632b003c1658
 XMM03=00000000000076f8e1e0c048bf80f6ab
 XMM04=0000000000000023e1e0000000000000
 XMM05=00000000000000000b017c30003c1658
 XMM06=0000000000001e640000003bba1a7604
 XMM07=000000000000003b0007268c00000000

Clearly, the value at the address in EDX is not 0:

[root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
00000000ffdff8c0: 0x0e

[root@linux ~]# virt-manager

[root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
00000000ffdff8c0: 0x00

However, as soon as the VM console is opened and the machine starts running
again, the value at the address in EDX is set to 0 and the loop exits.
Does anybody recognize what function that is? What could possibly cause
opening the console and moving the mouse a little to unfreeze the machine?
The VM has the .81 virtio drivers from the Fedora repo at the moment.
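
The address arithmetic in the disassembly can be checked against the
register dump. The sketch below is a minimal Python model of the
instructions from 0x80501fc2 to 0x80501fd9 only. The function names are
illustrative, and reading ECX as a bitmask of CPUs is an assumption
suggested by the lea/test pair, not something the dump confirms.

```python
def polled_address(prcb: int, arg: int) -> int:
    """Model of 0x80501fc2..0x80501fd9: pick the address the loop polls.

    prcb is the value loaded from %fs:0x20 (EAX); arg is 0x8(%ebp) (ECX).
    """
    # lea -0x1(%ecx),%esi ; test %esi,%ecx
    # ZF is set when arg has at most one bit set.
    at_most_one_bit = (arg & (arg - 1)) == 0
    # je keeps prcb+0x7ec; otherwise EDX is overwritten with prcb+0x7a0.
    offset = 0x7EC if at_most_one_bit else 0x7A0
    return (prcb + offset) & 0xFFFFFFFF


def set_bits(mask: int) -> list[int]:
    """Bit positions set in mask (CPU numbers, if ECX really is a CPU mask)."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


if __name__ == "__main__":
    # EAX and ECX as shown by 'info registers' during the freeze.
    print(hex(polled_address(0xFFDFF120, 0x0E)))
    print(set_bits(0x0E))
```

With the posted values this yields 0xffdff8c0, which matches EDX, and bits
1, 2 and 3. The byte read at that address during the freeze was also 0x0e.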

The configuration of the machine is pretty standard:

<!--
WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
OVERWRITTEN AND LOST. Changes to this xml configuration should be made
using:
  virsh edit DBserver
or other application using the libvirt API.
-->

 <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>DBserver</name>
  <uuid>e42b4cf2-7264-515f-4d24-6267eaa24be8</uuid>
  <memory unit='KiB'>3145728</memory>
  <currentMemory unit='KiB'>3145728</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <os>
    <type arch='x86_64' machine='rhel6.6.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu>
    <topology sockets='1' cores='4' threads='4'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/drbd1'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/disk/by-id/usb-WD_Ext_HDD_1021_574D415A4138353838383731-0:0'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
    </controller>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a6:92:ca'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes'/>
    <video>
      <model type='vga' vram='9216' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
 </domain>

The config above has already been changed: I first experimented with
removing the USB tablet (and installing VMware mouse drivers), turning on
'x-data-plane', and so on, hoping to solve the problem. Is there anything
else I can check the next time the machine freezes?
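
One cheap check during a freeze is to sample 'info cpus' repeatedly and see
which program counters each running vCPU visits. The following is a minimal
Python sketch under stated assumptions: it relies on the 'info cpus' line
format shown in this message, the function names are hypothetical, and
sample_domain() needs a running libvirt domain, so it is defined but not
called here.

```python
import re
import subprocess
import time

# Matches one line of HMP 'info cpus' output in the format shown in this thread.
CPU_LINE = re.compile(
    r"^\*?\s*CPU #(?P<cpu>\d+): pc=(?P<pc>0x[0-9a-fA-F]+)"
    r"(?P<halted> \(halted\))? thread_id=\d+\s*$"
)


def parse_info_cpus(text):
    """Map CPU index -> (pc, halted) for every CPU line found in text."""
    cpus = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line)
        if match:
            halted = match["halted"] is not None
            cpus[int(match["cpu"])] = (int(match["pc"], 16), halted)
    return cpus


def pcs_seen(samples):
    """Collect the distinct pc values seen per CPU while it was not halted."""
    seen = {}
    for sample in samples:
        for cpu, (pc, halted) in sample.items():
            if not halted:
                seen.setdefault(cpu, set()).add(pc)
    return seen


def sample_domain(domain, count=10, delay=0.5):
    """Run 'info cpus' count times through virsh and parse each result."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["virsh", "qemu-monitor-command", "--hmp", domain, "info cpus"],
            check=True, capture_output=True, text=True,
        ).stdout
        samples.append(parse_info_cpus(out))
        time.sleep(delay)
    return samples
```

Calling pcs_seen(sample_domain("DBserver")) on the host would then list, for
each non-halted vCPU, every pc observed. A vCPU whose set stays at one or two
addresses over many samples is probably spinning in a tight loop.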

Regards,
Saso Slavicic



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-16 15:10 XP machine freeze Saso Slavicic
@ 2015-03-19  0:51 ` Marcelo Tosatti
  2015-03-30 16:19   ` Saso Slavicic
  2015-03-22 15:31 ` Brad Campbell
  1 sibling, 1 reply; 25+ messages in thread
From: Marcelo Tosatti @ 2015-03-19  0:51 UTC (permalink / raw)
  To: Saso Slavicic; +Cc: kvm

On Mon, Mar 16, 2015 at 04:10:40PM +0100, Saso Slavicic wrote:
> Hi,
> 
> I'm fairly experienced with KVM (Centos 5/6), running about a dozen servers
> with 20-30 different (Linux & MS platform) systems.
> I have one Windows XP machine that acts very strangely - it freezes. I get
> ping timeout for the VM from my monitoring and the machine spins 2 or 3
> cores using all the cpu. Now the interesting thing that happens is that once
> you open the console, it suddenly starts working again. You can see the
> clock catching up as it was frozen in time and everything works normally
> once the timer catches up. It usually happens probably about once a month,
> although it happened yesterday and today again.
> 
> This machine is on Centos 6, qemu-kvm-0.12.1.2-2.448.el6_6, kernel
> 2.6.32-504.3.3.el6.x86_64.
> I was able to do some debugging when the machine was frozen, so I got some
> things to work with:
> 
> # virsh qemu-monitor-command --hmp DBserver 'info cpus'
> * CPU #0: pc=0x0000000080501fdd thread_id=32595
>   CPU #1: pc=0x00000000806e7a9b thread_id=32596
>   CPU #2: pc=0x00000000ba2da162 (halted) thread_id=32597
>   CPU #3: pc=0x00000000ba2da162 (halted) thread_id=32598
> 
> Now, in both yesterday's and today's event the CPU0 was stopped at
> 0x0000000080501fdd. I've disassembled the function and got this:
> 
>  0x0000000080501fb5:  int3
>  0x0000000080501fb6:  mov    %edi,%edi
>  0x0000000080501fb8:  push   %ebp
>  0x0000000080501fb9:  mov    %esp,%ebp
>  0x0000000080501fbb:  push   %esi
>  0x0000000080501fbc:  mov    %fs:0x20,%eax
>  0x0000000080501fc2:  mov    0x8(%ebp),%ecx
>  0x0000000080501fc5:  lea    -0x1(%ecx),%esi
>  0x0000000080501fc8:  test   %esi,%ecx
>  0x0000000080501fca:  lea    0x7ec(%eax),%edx
>  0x0000000080501fd0:  pop    %esi
>  0x0000000080501fd1:  je     0x80501fdd
>  0x0000000080501fd3:  lea    0x7a0(%eax),%edx
>  0x0000000080501fd9:  jmp    0x80501fdd
>  *0x0000000080501fdb:  pause
>  0x0000000080501fdd:  cmpl   $0x0,(%edx)
>  0x0000000080501fe0:  jne    0x80501fdb
>  0x0000000080501fe2:  pop    %ebp
>  0x0000000080501fe3:  ret    $0x4
>  0x0000000080501fe6:  int3
> 
> Mov %edi,%edi is clearly the start of some function. From what I've been
> able to understand, the code fetches _KPRCB structure (%fs:0x20) and then
> does a spinlock between fdb and fe0 checking for PacketBarrier (?) in EDX
> (0xffdff8c0). Now, $pc always shows fdd address, shouldn't it jump between
> fdb and fe0, it seems as if it was stuck at fdd?
> 
> # virsh qemu-monitor-command --hmp DBserver 'info registers'
>  EAX=ffdff120 EBX=c06ddf58 ECX=0000000e EDX=ffdff8c0
>  ESI=be6e3921 EDI=c06ddf60 EBP=ba4ff708 ESP=ba4ff708
>  EIP=80501fdd EFL=00000202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>  ES =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
>  CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
>  SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>  DS =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
>  FS =0030 ffdff000 00001fff 00c09300 DPL=0 DS   [-WA]
>  GS =0000 00000000 000fffff 00000000
>  LDT=0000 00000000 000fffff 00000000
>  TR =0028 80042000 000020ab 00008b00 DPL=0 TSS32-busy
>  GDT=     8003f000 000003ff
>  IDT=     8003f400 000007ff
>  CR0=8001003b CR2=dbbec000 CR3=0b3c0020 CR4=000006f8
>  DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
>  DR6=ffff0ff0 DR7=00000400
>  FCW=027f FSW=0020 [ST=0] FTW=00 MXCSR=00001fa0
>  FPR0=8053632b003c1658 c048 FPR1=e1e0c048bf80f6ab 76f8
>  FPR2=e1e0000000000000 0023 FPR3=0b017c30003c1658 0000
>  FPR4=0000003bba1a7604 1e64 FPR5=0007268c00000000 003b
>  FPR6=000002020000001b 2684 FPR7=e3e0a9b4e1b50de4 ca0b
>  XMM00=0000000000a1fc95000000000020027f
> XMM01=0000ffff00001fa000001c4c00000001
>  XMM02=000000000000c0488053632b003c1658
> XMM03=00000000000076f8e1e0c048bf80f6ab
>  XMM04=0000000000000023e1e0000000000000
> XMM05=00000000000000000b017c30003c1658
>  XMM06=0000000000001e640000003bba1a7604
> XMM07=000000000000003b0007268c00000000
> 
> Clearly, the address in EDX is not 0:
> 
> [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
> 00000000ffdff8c0: 0x0e
> 
> [root@linux ~]# virt-manager
> 
> [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
> 00000000ffdff8c0: 0x00
> 
> However as soon as the VM console is opened and machine starts, the address
> in EDX is set to 0 and the loop is broken.
> Does anybody recognize what function that is? What could possibly happen
> that opening the console and moving the mouse a little, unfreezes the
> machine?
> VM has .81 virtio drivers from Fedora repo at the moment.

Generate a Windows dump? 

https://support.microsoft.com/en-us/kb/254649

https://support.microsoft.com/en-us/kb/972110
Step 7: Generate a complete crash dump file or a kernel crash dump file
by using an NMI on a Windows-based system

(you can inject NMIs via QEMU monitor).
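
On the host side, the NMI can also be sent through libvirt rather than by
typing into the monitor. A minimal Python sketch, assuming the
'virsh inject-nmi' subcommand is available on the host and that the guest
has already been configured to bugcheck on NMI as the KB articles describe;
the helper names are hypothetical.

```python
import subprocess


def inject_nmi_argv(domain: str) -> list[str]:
    """Build the virsh command line that sends an NMI to the given domain."""
    return ["virsh", "inject-nmi", domain]


def inject_nmi(domain: str) -> None:
    """Send the NMI; raises CalledProcessError if virsh reports a failure."""
    subprocess.run(inject_nmi_argv(domain), check=True)
```

inject_nmi("DBserver") would be run while the guest is frozen, before
opening the console, so that the dump captures the stuck state.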



* Re: XP machine freeze
  2015-03-16 15:10 XP machine freeze Saso Slavicic
  2015-03-19  0:51 ` Marcelo Tosatti
@ 2015-03-22 15:31 ` Brad Campbell
  2015-03-30 21:11   ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-03-22 15:31 UTC (permalink / raw)
  To: Saso Slavicic, kvm

On 16/03/15 23:10, Saso Slavicic wrote:
> Hi,
>
> I'm fairly experienced with KVM (Centos 5/6), running about a dozen servers
> with 20-30 different (Linux & MS platform) systems.
> I have one Windows XP machine that acts very strangely - it freezes. I get
> ping timeout for the VM from my monitoring and the machine spins 2 or 3
> cores using all the cpu. Now the interesting thing that happens is that once
> you open the console, it suddenly starts working again. You can see the
> clock catching up as it was frozen in time and everything works normally
> once the timer catches up. It usually happens probably about once a month,
> although it happened yesterday and today again.
>


Just a me too. I reported this on the 6th of Feb as "Windows XP guest latch
up on KVM with recent kernels" and I've been bisecting ever since. I'm
on my third round of bisections. The first one was completely inconclusive,
leading me to believe I'd made a wrong turn. I aborted the second one,
believing I'd made a bad call and hadn't let it run long enough. This time
around I'm leaving it more than 50 hours before declaring good. The first
bisect log mirrors this one so far, though.

Here's the completed first bisect log:
brad@srv:/raid10/src/linux$ cat ../bisect.log
git bisect start
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# good: [f2d7e4d4398092d14fb039cb4d38e502d3f019ee] checkpatch: add fix_insert_line and fix_delete_line helpers
git bisect good f2d7e4d4398092d14fb039cb4d38e502d3f019ee
# good: [c309bfa9b481e7dbd3e1ab819271bf3009f44859] Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd
git bisect good c309bfa9b481e7dbd3e1ab819271bf3009f44859
# bad: [433ab34d26e29d0f036c3f514a09ae96f973d8c5] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 433ab34d26e29d0f036c3f514a09ae96f973d8c5
# good: [d27c0d90184a13e9e9f28c38e84f889a259f6b5f] Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good d27c0d90184a13e9e9f28c38e84f889a259f6b5f
# good: [0680eb1f485ba5aac2ee02c9f0622239c9a4b16c] timekeeping: Another fix to the VSYSCALL_OLD update_vsyscall
git bisect good 0680eb1f485ba5aac2ee02c9f0622239c9a4b16c
# good: [d8c66f62992dac3a92cbc5f16791557100c7a068] asus-wmi: Disable acpi-video backlight on desktop machines
git bisect good d8c66f62992dac3a92cbc5f16791557100c7a068
# bad: [92075f9f640dc3fde91b833c08fbc921b1649088] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 92075f9f640dc3fde91b833c08fbc921b1649088
# bad: [605f884d05cc0de8c3bde36281d58216011f51a5] Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86
git bisect bad 605f884d05cc0de8c3bde36281d58216011f51a5
# bad: [49899007b9401486421c99bb269db89b88136e47] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
git bisect bad 49899007b9401486421c99bb269db89b88136e47
# bad: [53b95d6341c142a02538e41bdf1405ef8888bf8b] Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux
git bisect bad 53b95d6341c142a02538e41bdf1405ef8888bf8b
# good: [00fefb9cf2b5493a86912de55ba912bdfae4a207] aio: use iovec array rather than the single one
git bisect good 00fefb9cf2b5493a86912de55ba912bdfae4a207
# bad: [ed9814d85810c27670987b40c77e8a07105838fe] locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
git bisect bad ed9814d85810c27670987b40c77e8a07105838fe
# bad: [566709bd627caf933ab8edffaf598203a0c5c8b2] locks: don't call locks_release_private from locks_copy_lock
git bisect bad 566709bd627caf933ab8edffaf598203a0c5c8b2


Here's my current bisect log thus far.

brad@srv:/raid10/src/linux$ git bisect log
git bisect start
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# good: [f2d7e4d4398092d14fb039cb4d38e502d3f019ee] checkpatch: add fix_insert_line and fix_delete_line helpers
git bisect good f2d7e4d4398092d14fb039cb4d38e502d3f019ee
# good: [c309bfa9b481e7dbd3e1ab819271bf3009f44859] Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd
git bisect good c309bfa9b481e7dbd3e1ab819271bf3009f44859
# bad: [433ab34d26e29d0f036c3f514a09ae96f973d8c5] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 433ab34d26e29d0f036c3f514a09ae96f973d8c5
# bad: [d27c0d90184a13e9e9f28c38e84f889a259f6b5f] Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad d27c0d90184a13e9e9f28c38e84f889a259f6b5f

No help I'm afraid, but at least I can conclusively say that 3.16 is 
good, and 3.17 is bad.

Lockups generally only take a couple of days on this machine, so it's 
not that slow to reproduce.
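
Since the lockup takes a variable time to appear, it can help to compare
the two bisect runs mechanically for commits that received opposite
verdicts. A minimal Python sketch: the function names are hypothetical, and
the sample strings are short excerpts of the two logs above.

```python
import re

# Only the executable 'git bisect good|bad <sha>' lines are considered;
# the '# good:' / '# bad:' comment lines are ignored.
VERDICT = re.compile(r"^git bisect (good|bad) ([0-9a-f]{7,40})$")


def verdicts(log_text):
    """Map commit hash -> 'good' or 'bad' for one bisect log."""
    result = {}
    for line in log_text.splitlines():
        match = VERDICT.match(line.strip())
        if match:
            result[match.group(2)] = match.group(1)
    return result


def conflicts(log_a, log_b):
    """Commits judged differently in the two logs: sha -> (verdict_a, verdict_b)."""
    a, b = verdicts(log_a), verdicts(log_b)
    return {sha: (a[sha], b[sha]) for sha in a.keys() & b.keys() if a[sha] != b[sha]}


FIRST_RUN = """\
git bisect good c309bfa9b481e7dbd3e1ab819271bf3009f44859
git bisect bad 433ab34d26e29d0f036c3f514a09ae96f973d8c5
git bisect good d27c0d90184a13e9e9f28c38e84f889a259f6b5f
"""

SECOND_RUN = """\
git bisect good c309bfa9b481e7dbd3e1ab819271bf3009f44859
git bisect bad 433ab34d26e29d0f036c3f514a09ae96f973d8c5
git bisect bad d27c0d90184a13e9e9f28c38e84f889a259f6b5f
"""

if __name__ == "__main__":
    print(conflicts(FIRST_RUN, SECOND_RUN))
```

On these excerpts the only conflict is d27c0d90184a, which was marked good
in the first run and bad in the current one.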

Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


* RE: XP machine freeze
  2015-03-19  0:51 ` Marcelo Tosatti
@ 2015-03-30 16:19   ` Saso Slavicic
  0 siblings, 0 replies; 25+ messages in thread
From: Saso Slavicic @ 2015-03-30 16:19 UTC (permalink / raw)
  To: kvm

> From: Marcelo Tosatti [mailto:mtosatti@redhat.com] 
> Sent: Thursday, March 19, 2015 1:52 AM
> 
> Generate a Windows dump? 
> 
> https://support.microsoft.com/en-us/kb/254649
> 
> https://support.microsoft.com/en-us/kb/972110
> Step 7: Generate a complete crash dump file or a kernel crash dump file by
> using an NMI on a Windows-based system
> 
> (you can inject NMIs via QEMU monitor).

Hi, thanks for the hint. Somehow I felt I needed to BSOD it, but I didn't
know where to look.

It happened again today (about 14 days after I enabled the NMI crash dump),
and I got a kernel memory dump.

Stack trace shows (I've enabled noisy sym):

STACK_TEXT:
 8054e610 806eaea3 00000080 004f4454 00000000 nt!KeBugCheckEx+0x1b
 8054e65c 805426c4 00000000 ba33c0c8 805517da hal!HalHandleNMI+0x195
 8054e65c 80501fdd 00000000 ba33c0c8 805517da nt!KiTrap02+0xf8
 b36dcaf0 804fb39d 0000000e 00000001 00000004 nt!KiIpiStallOnPacketTargets+0x27
 b36dcb08 8054a690 00000001 00000001 8a852d20 nt!KeFlushEntireTb+0x79
 b36dcb2c 80509247 0007e901 00000000 00000000 nt!MiReserveSystemPtes+0x70
 b36dcb74 b753d9f8 8a852d20 00000000 00000001 nt!MmMapLockedPagesSpecifyCache+0x101
 b36dcb94 b75337f8 8a852d20 00000010 8a2cb2a4 tcpip!TcpipBufferVirtualAddress+0x24
 b36dcbb4 b75589f5 00022885 8a2cb230 0000403f tcpip!XsumSendChain+0x44
 b36dcc00 b7534d35 86e36a9a 00004000 8a80abd4 tcpip!TCPSend+0x4f1
 b36dcc28 b75344a5 00000001 00000000 00004000 tcpip!TdiSend+0x1c7
 b36dcc5c b74bc14a 8a80a990 8ad2ce4c 8a80a990 tcpip!TCPSendData+0x83
 b36dcc88 aff10eb4 8a8a5528 8a80a990 8a804000 netbt!NTSend+0x1e1
 b36dccb0 aff1c8b2 8a80ac98 aff185ef 00000000 srv!SrvStartSend2+0x16e
 b36dccdc aff55dcd aff211f4 8a80a020 8ab4c718 srv!SrvFsdRestartLargeReadAndX+0x2ae
 b36dcd7c aff10836 8a80a028 8ab4c6e0 aff250d8 srv!SrvSmbReadAndX+0x3fe
 b36dcd88 aff250d8 00000000 8ab41528 00000000 srv!SrvProcessSmb+0xb7
 b36dcdac 805cffee 8a80a020 00000000 00000000 srv!WorkerThread+0x11e
 b36dcddc 8054620e aff25024 8ab4c6e0 00000000 nt!PspSystemThreadStartup+0x34
 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

So the function in question is KiIpiStallOnPacketTargets, which was running
before the NMI crash handler took over (see the return address on the
KiTrap02 frame). That is where the machine was frozen again today:

# virsh qemu-monitor-command --hmp DBserver 'info cpus'
* CPU #0: pc=0x0000000080501fdd thread_id=6675
  CPU #1: pc=0x000000008051afc1 thread_id=6676
  CPU #2: pc=0x0000000080545e6e thread_id=6677
  CPU #3: pc=0x0000000080545e6e thread_id=6678

I have to mention that this particular machine is a P2V-converted XP. The
original physical machine already had a Core2Quad, so there wasn't any need
to change the HAL or kernel image. I believe it uses the ntkrpamp kernel
image (that's what was listed in the debugger). I didn't do anything special
during the conversion: I switched to IDE and then installed the virtio
drivers. There doesn't seem to be any leftover driver from the physical
machine in the stack trace...
Is there anything else I could check that could help debug this issue?
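
To double-check which modules appear on the stack, the module!symbol pairs
can be pulled out of the STACK_TEXT mechanically. A minimal Python sketch:
the function names are hypothetical, and SAMPLE is a shortened excerpt of
the trace above, not the full dump.

```python
import re

# Matches WinDbg-style 'module!Symbol+0xoffset' anywhere in the text, so it
# also works when a long frame has been wrapped onto a second line.
FRAME = re.compile(r"(?P<module>\w+)!(?P<symbol>\w+)\+0x[0-9a-fA-F]+")


def frames(stack_text):
    """Return (module, symbol) pairs in stack order, innermost frame first."""
    return [(m["module"], m["symbol"]) for m in FRAME.finditer(stack_text)]


def modules(stack_text):
    """Return each module once, in order of first appearance."""
    seen = []
    for module, _symbol in frames(stack_text):
        if module not in seen:
            seen.append(module)
    return seen


SAMPLE = """\
 8054e610 806eaea3 00000080 004f4454 00000000 nt!KeBugCheckEx+0x1b
 8054e65c 805426c4 00000000 ba33c0c8 805517da hal!HalHandleNMI+0x195
 8054e65c 80501fdd 00000000 ba33c0c8 805517da nt!KiTrap02+0xf8
 b36dcaf0 804fb39d 0000000e 00000001 00000004
nt!KiIpiStallOnPacketTargets+0x27
 b36dcbb4 b75589f5 00022885 8a2cb230 0000403f tcpip!XsumSendChain+0x44
 b36dcc88 aff10eb4 8a8a5528 8a80a990 8a804000 netbt!NTSend+0x1e1
 b36dccb0 aff1c8b2 8a80ac98 aff185ef 00000000 srv!SrvStartSend2+0x16e
"""

if __name__ == "__main__":
    print(modules(SAMPLE))
```

For this excerpt the modules are nt, hal, tcpip, netbt and srv, all of which
are stock Windows components rather than third-party drivers.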

Best Regards,
Saso Slavicic



* Re: XP machine freeze
  2015-03-22 15:31 ` Brad Campbell
@ 2015-03-30 21:11   ` Paolo Bonzini
  2015-03-31  0:27     ` Brad Campbell
  2015-04-13  4:07     ` Brad Campbell
  0 siblings, 2 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-03-30 21:11 UTC (permalink / raw)
  To: Brad Campbell, Saso Slavicic, kvm



On 22/03/2015 16:31, Brad Campbell wrote:
> 
> 
> No help I'm afraid, but at least I can conclusively say that 3.16 is
> good, and 3.17 is bad.

Can you try more specifically around the first KVM pull request?  That
would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
bad)?

Thanks,

Paolo

> Lockups generally only take a couple of days on this machine, so it's
> not that slow to reproduce.


* Re: XP machine freeze
  2015-03-30 21:11   ` Paolo Bonzini
@ 2015-03-31  0:27     ` Brad Campbell
  2015-03-31  6:29       ` Saso Slavicic
  2015-04-13  4:07     ` Brad Campbell
  1 sibling, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-03-31  0:27 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm

On 31/03/15 05:11, Paolo Bonzini wrote:
>
>
> On 22/03/2015 16:31, Brad Campbell wrote:
>>
>>
>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>> good, and 3.17 is bad.
>
> Can you try more specifically around the first KVM pull request?  That
> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
> bad)?
>


G'day Paolo,

I can and will. Right now I'm toward the end of the bisect run I 
detailed in my previous e-mail, however disturbingly git bisect 
visualize shows no kvm commits at all. I'm beginning to think that 
something non-deterministic is just making this easier or harder to hit. 
Bad kernels usually go bad in a day at most, good kernels I've been 
leaving for up to 5 days.

Here's where the log is at right now. visualize shows 68 commits and I 
can't see any of them being even remotely related to kvm.

brad@srv:/raid10/src/linux$ git bisect log
git bisect start
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# good: [f2d7e4d4398092d14fb039cb4d38e502d3f019ee] checkpatch: add fix_insert_line and fix_delete_line helpers
git bisect good f2d7e4d4398092d14fb039cb4d38e502d3f019ee
# good: [c309bfa9b481e7dbd3e1ab819271bf3009f44859] Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd
git bisect good c309bfa9b481e7dbd3e1ab819271bf3009f44859
# bad: [433ab34d26e29d0f036c3f514a09ae96f973d8c5] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 433ab34d26e29d0f036c3f514a09ae96f973d8c5
# bad: [d27c0d90184a13e9e9f28c38e84f889a259f6b5f] Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad d27c0d90184a13e9e9f28c38e84f889a259f6b5f
# bad: [913847586290d5de22659e2a6195d91ff24d5aa6] Merge branch 'linux-3.17' of git://anongit.freedesktop.org/git/nouveau/linux-2.6
git bisect bad 913847586290d5de22659e2a6195d91ff24d5aa6
# good: [d1e458fe671baf1e60afafc88bda090202a412f1] svcrdma: remove rdma_create_qp() failure recovery logic
git bisect good d1e458fe671baf1e60afafc88bda090202a412f1
# bad: [023f78b02c729070116fa3a7ebd4107a032d3f5c] Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
git bisect bad 023f78b02c729070116fa3a7ebd4107a032d3f5c
# bad: [ad1f5caf34390bb20fdbb4eaf71b0494e89936f0] Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
git bisect bad ad1f5caf34390bb20fdbb4eaf71b0494e89936f0

If someone could give me some hard tests to do along the lines of what 
Saso is up to I could probably get that done faster. With the right bad 
kernel I can reproduce this lockup in a matter of hours.

Regards,
Brad



* RE: XP machine freeze
  2015-03-31  0:27     ` Brad Campbell
@ 2015-03-31  6:29       ` Saso Slavicic
  2015-03-31  7:18         ` Brad Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Saso Slavicic @ 2015-03-31  6:29 UTC (permalink / raw)
  To: 'Brad Campbell', 'Paolo Bonzini', kvm

> From: Brad Campbell
> Sent: Tuesday, March 31, 2015 2:28 AM
>
>
> If someone could give me some hard tests to do along the lines of what
> Saso is up to I could probably get that done faster. With the right
> bad kernel I can reproduce this lockup in a matter of hours.

Hi,

My machine usually (but not always) locks up during backup. At around 3AM, a
Samba machine (actually a KVM guest on the same server) mounts C$ over CIFS
and starts copying files off it. The last stack trace also shows network
code. Is your machine actively working over the network (sharing files)?

Regards,
Saso Slavicic



* Re: XP machine freeze
  2015-03-31  6:29       ` Saso Slavicic
@ 2015-03-31  7:18         ` Brad Campbell
  2015-03-31  8:56           ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-03-31  7:18 UTC (permalink / raw)
  To: Saso Slavicic, 'Paolo Bonzini', kvm


On 31/03/15 14:29, Saso Slavicic wrote:
>> From: Brad Campbell
>> Sent: Tuesday, March 31, 2015 2:28 AM
>>
>>
>> If someone could give me some hard tests to do along the lines of what
>> Saso is up to I could probably get that done faster. With the right
>> bad kernel I can reproduce this lockup in a matter of hours.
> Hi,
>
> My machine usually (but not always) locks during backup. At around 3AM, a
> samba machine (a kvm machine on the same server actually) cifs mounts C$ and
> starts copying files off of it. The last stacktrace also shows network code.
> Is your machine actively working over network (sharing files)?
>
Better than that, it's recording h264 rtsp streams from 3 CCTV cameras, 
so there is a constant network load of about 1.5-2MB/s (bytes not bits).
Come to think of it, out of the 3 XP VM's I have that are an identical 
config and actually come from the same qcow2 base image this is the one 
that hits the network hard. The other 2 hardly touch the network.

virtio network interface. I can get it to lock up in hours with the 
right kernel, and repeat lockups after unlocking it with virt-viewer are 
usually less than an hour at most.

My issue is my first bisect proved to be inconclusive, and the second 
one is about 3 steps from done, but there are no kvm commits in the 
current set under investigation.

I *know* that 3.15.6 was good as I ran that kernel for months, it all 
started when I upgraded to a 3.18 and I think I've narrowed it down, but 
like I said the bisects are just not falling out as plausible, and at 5 
days for a good and up to 24 hours for a bad it's slow going.

I'll finish this bisect and then have a crack at the good/bad range 
suggested by Paolo. The issue is being a production box I have to 
schedule the re-boots.

I'm just not sure bisection is the right answer to tracking this down. I 
just don't have the background to know what to poke to try and debug 
this any other way.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-31  7:18         ` Brad Campbell
@ 2015-03-31  8:56           ` Paolo Bonzini
  2015-03-31 11:16             ` Brad Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2015-03-31  8:56 UTC (permalink / raw)
  To: Brad Campbell, Saso Slavicic, kvm



On 31/03/2015 09:18, Brad Campbell wrote:
> Better than that, it's recording h264 rtsp streams from 3 CCTV cameras,
> so there is a constant network load of about 1.5-2MB/s (bytes not bits).
> Come to think of it, out of the 3 XP VM's I have that are an identical
> config and actually come from the same qcow2 base image this is the one
> that hits the network hard. The other 2 hardly touch the network.
> 
> virtio network interface. I can get it to lock up in hours with the
> right kernel, and repeat lockups after unlocking it with virt-viewer are
> usually less than an hour at most.

Then it's not so weird that you have no KVM left in your bisect log.

We can look at smaller suspicious parts of the repository, until we
find the good one.  For example, after testing before/after the KVM
merge, you could test before/after the net-next merge (that would be
commit f4f142ed4ef835709c7e6d12eaca10d190bcebed presumed good, and
commit ae045e2455429c418a418a3376301a9e5753a0a8 presumed bad).

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-31  8:56           ` Paolo Bonzini
@ 2015-03-31 11:16             ` Brad Campbell
  2015-03-31 11:23               ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-03-31 11:16 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm


On 31/03/15 16:56, Paolo Bonzini wrote:
>
> On 31/03/2015 09:18, Brad Campbell wrote:
>> Better than that, it's recording h264 rtsp streams from 3 CCTV cameras,
>> so there is a constant network load of about 1.5-2MB/s (bytes not bits).
>> Come to think of it, out of the 3 XP VM's I have that are an identical
>> config and actually come from the same qcow2 base image this is the one
>> that hits the network hard. The other 2 hardly touch the network.
>>
>> virtio network interface. I can get it to lock up in hours with the
>> right kernel, and repeat lockups after unlocking it with virt-viewer are
>> usually less than an hour at most.
> Then it's not so weird that you have no KVM left in your bisect log.
>
> We can look at smaller suspicious parts of the repository, until we
> find the good one.  For example, after testing before/after the KVM
> merge, you could test before/after the net-next merge (that would be
> commit f4f142ed4ef835709c7e6d12eaca10d190bcebed presumed good, and
> commit ae045e2455429c418a418a3376301a9e5753a0a8 presumed bad).
>
> Paolo
>
Right, so now rather than being a pain on my production machine I know 
what to concentrate on with my staging machine to see if I can produce a 
pathological test case. Maybe an XP VM running iPerf. Easter is coming 
up, so I'll have some time to dedicate to the task.

If you look at the bisect point I'm currently at it's a mix of i2c and 
arm. The only vaguely relevant (as far as I can see) commit is the 
addition of the getrandom() syscall, so my bisect is looking dodgy at 
best. If I can come up with a better test case on a non-critical box 
then I'll be in a better position to try and help get to the bottom of 
the issue.

I can at least replicate the actual VM and conditions on similar 
hardware to try and reproduce it.

Regards,
Brad

-- 

Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-31 11:16             ` Brad Campbell
@ 2015-03-31 11:23               ` Paolo Bonzini
  2015-04-04 10:55                 ` Brad Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2015-03-31 11:23 UTC (permalink / raw)
  To: Brad Campbell, Saso Slavicic, kvm



On 31/03/2015 13:16, Brad Campbell wrote:
> 
> If you look at the bisect point I'm currently at it's a mix of i2c and
> arm. The only vaguely relevant (as far as I can see) commit is the
> addition of the getrandom() syscall, so my bisect is looking dodgy at
> best. If I can come up with a better test case on a non-critical box
> then I'll be in a better position to try and help get to the bottom of
> the issue.

Yes, the bisect went wrong somewhere.  Still, the 'bad' commit leaves
both a net-next and a KVM merge, so it could be worse. :)

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-31 11:23               ` Paolo Bonzini
@ 2015-04-04 10:55                 ` Brad Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Brad Campbell @ 2015-04-04 10:55 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm


On 31/03/15 19:23, Paolo Bonzini wrote:
>
> On 31/03/2015 13:16, Brad Campbell wrote:
>> If you look at the bisect point I'm currently at it's a mix of i2c and
>> arm. The only vaguely relevant (as far as I can see) commit is the
>> addition of the getrandom() syscall, so my bisect is looking dodgy at
>> best. If I can come up with a better test case on a non-critical box
>> then I'll be in a better position to try and help get to the bottom of
>> the issue.
> Yes, the bisect went wrong somewhere.  Still, the 'bad' commit leaves
> both a net-next and a KVM merge, so it could be worse. :)
>
So the bisect went horribly wrong _again_ and I've been completely 
unable to reproduce this problem on a test or staging machine (tried 
both), so I'm down to trying the 4 commits you suggested (pre/post KVM & 
pre/post net) to see if I can find bookends for another targeted bisect.

I'm 23 hours into the pre-kvm commit, so I probably need another week or 
two to at least identify some new bisect points as I really want to 
leave it run for 4 or 5 days for a good kernel.

I tried running iperf in various automated incarnations to speed up the 
determination of a bad kernel, but it made absolutely no difference at 
all to the fault time. The other thing that occurred to me is of course 
I'm sucking in about 1.5MB/s through the network and immediately 
streaming it out to disk, so it's entirely possible it may be disk 
related too.

I'll keep plugging away. In the mean time if anyone has any ideas I'm 
all ears.

Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can train 
Americans to stand at the edge of the pool and throw them fish.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-03-30 21:11   ` Paolo Bonzini
  2015-03-31  0:27     ` Brad Campbell
@ 2015-04-13  4:07     ` Brad Campbell
  2015-04-13 12:38       ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-04-13  4:07 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm


On 31/03/15 05:11, Paolo Bonzini wrote:
>
> On 22/03/2015 16:31, Brad Campbell wrote:
>>
>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>> good, and 3.17 is bad.
> Can you try more specifically around the first KVM pull request?  That
> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
> bad)?
>
>


G'day Paolo.

I can confirm that the fault appears to lie between good and bad as 
specified above.
Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-04-13  4:07     ` Brad Campbell
@ 2015-04-13 12:38       ` Paolo Bonzini
  2015-04-13 12:45         ` Brad Campbell
                           ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-04-13 12:38 UTC (permalink / raw)
  To: Brad Campbell, Saso Slavicic, kvm, Radim Krčmář



On 13/04/2015 06:07, Brad Campbell wrote:
> 
> On 31/03/15 05:11, Paolo Bonzini wrote:
>>
>> On 22/03/2015 16:31, Brad Campbell wrote:
>>>
>>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>>> good, and 3.17 is bad.
>> Can you try more specifically around the first KVM pull request?  That
>> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
>> bad)?
>>
>>
> 
> 
> G'day Paolo.
> 
> I can confirm that the fault appears to lie between good and bad as
> specified above.
> Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now.

Thanks!  Remember to bisect only with arch/x86/kvm.
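
A rough sketch of what restricting the bisect to that directory could
look like. The GOOD and BAD values are the abbreviated hashes quoted
above; the helper name and the GIT override variable are assumptions
made for illustration only, not something used elsewhere in this thread.

```shell
# Sketch only: the helper name and the GIT override are illustrative
# assumptions. Source this file, then call the helper from the top of a
# kernel git tree.
GIT=${GIT:-git}
GOOD=c9b88e958182   # presumed good, as quoted above
BAD=8533ce727188    # presumed bad, as quoted above

# Start a bisect that only considers commits touching arch/x86/kvm.
# "git bisect start" takes the bad revision first, then the good one,
# then an optional "--" followed by the paths to limit the search to.
start_kvm_bisect() {
    "$GIT" bisect start "$BAD" "$GOOD" -- arch/x86/kvm
}
```

With the path limit in place, git only stops on commits in the range
that modify files under arch/x86/kvm, which should shrink the number of
kernels that need a multi-day test run.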

Also:

1) Brad, I see you are on AMD.  Have you ever reproduced it on Intel?
Saso, are you on AMD as well?

If so, the most likely culprit is this:

commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Mar 27 11:29:28 2014 +0100

    KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
    
    Despite the provisions to emulate up to 130 consecutive instructions, in
    practice KVM will emulate just one before exiting handle_invalid_guest_state,
    because x86_emulate_instruction always sets KVM_REQ_EVENT.
    
    However, we only need to do this if an interrupt could be injected,
    which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
    away; b) if the interrupt flag has just been set (other instructions
    than STI can set it without enabling an interrupt shadow).
    
    This cuts another 700-900 cycles from the cost of emulating an
    instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
    before the patch on kvm-unit-tests, 925-1700 afterwards).
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I would first try this one, and see if it is bad.

Radim, do you think this could cause a missed interrupt injection
after Windows does a TPR write?

2) For bisection feel free to "git bisect skip" the following:

03916db9348c079d8d214f971cc114bb51c6b869 Replace NR_VMX_MSR with its definition
9a2a05b9ed618b1bb6d4cbec0c2e1f80d6636609 KVM: nVMX: clean up nested_release_vmcs12 and code around it
4fa7734c62cdd8c07edd54fa5a5e91482273071a KVM: nVMX: fix lifetime issues for vmcs02
c9cdd085bb75226879fd468b88e2e7eb467325b7 KVM: x86: Defining missing x86 vectors
0123be429fef40f067e5b1811576c3994229f59e KVM: x86: Assertions to check no overrun in MSR lists
296f047502f1b3ddfd63adbc192624ce80740081 KVM: vmx: remove duplicate vmx_mpx_supported() prototype
963fee1656603ce2e91ebb988cd5a92f2af41369 KVM: nVMX: Fix virtual interrupt delivery injection
6cbc5f5a80a9ae5a80bc81efc574b5a85bfd4a84 KVM: nSVM: Set correct port for IOIO interception evaluation
6493f1574e898b46370e2b2315836d76a1980f2c KVM: nSVM: Fix IOIO size reported on emulation
9bf418335e24da995ea682a028926d7e1036be6f KVM: nSVM: Fix IOIO bitmap evaluation
62baf44cad3bc6b37115cc21e4228fe53d4f3474 KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1
5381417f6a51293e7b8af1eb18aefa5d47976a71 KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM
2996fca0690f03a5220203588f4a0d8c5acba2b0 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
560b7ee12ca5e1ebc1675d7eb4008bb22708277a KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
3dcdf3ec6e48d918741ea11349d4436d0c5aac93 KVM: nVMX: Allow to disable CR3 access interception
3dbcd8da7b564194f93271b003a1c46ef404cbdb KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
bc39c4db7110f88f338cbbabe53d3e43c7400a59 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE
e4aa5288ff07766d101751de9a8420d666c61735 KVM: x86: Fix constant value of VM_{EXIT_SAVE,ENTRY_LOAD}_DEBUG_CONTROLS
42cbc04fd3b5e3f9b011bf9fa3ce0b3d1e10b58b x86/kvm: Resolve shadow warnings in macro expansion
b55a8144d1807f9e74c51cb584f0dd198483d86c x86/kvm: Resolve shadow warning from min macro
98eff52ab5c0ff5cb96940a93e99a1aeb2f11c89 KVM: x86: Fix lapic.c debug prints
9f6226a762c7ae02f6a23a3d4fc552dafa57ea23 arch: x86: kvm: x86.c: Cleaning up variable is set more than once
80112c89ed872c725e7dc39ccf6c37d1a585e161 KVM: Synthesize G bit for all segments.
27e6fb5dae2819d17f38dc9224692b771e989981 KVM: vmx: vmx instructions handling does not consider cs.l
bdc907222c5e4edd848da0c031deb55b59f1cf9a KVM: emulate: fix harmless typo in MMX decoding
10e38fc7cabc668738e6a7b7b57cbcddb2234440 KVM: x86: Emulator flag for instruction that only support 16-bit addresses in real mode
68efa764f3429f2bd71f431e91c04b0bcb7d34f1 KVM: x86: Emulator support for #UD on CPL>0

The following can be skipped assuming you are on 32-bit XP:

1e32c07955b43e7f827174bf320ed35971117275 KVM: vmx: handle_cr ignores 32/64-bit mode
a449c7aa51e10c9bde0ea9bee4e682d6d067ebab KVM: x86: Hypercall handling does not considers opsize correctly
5777392e83c96e3a0799dd2985598e0fc76cf4aa KVM: x86: check DR6/7 high-bits are clear only on long-mode
a825f5cc4a8455663562809748240169cb9bc2c0 KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX
140bad89fd25db1aab60f80ed7874e9a9bdbae3b KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 KVM: x86: bit-ops emulation ignores offset on 64-bit
2eedcac8a97cef43c9c5236398fc8c9d0fd9cc0c KVM: x86: Loading segments on 64-bit mode may be wrong
e37a75a13cdae5deaa2ea2cbf8d55b5dd08638b6 KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR

And I think the following can be skipped safely too:

9e8919ae793f4edfaa29694a70f71a515ae9942a KVM: x86: Inter-privilege level ret emulation is not implemeneted
3b32004a66e96e17d2a031c08d3304245c506dfc KVM: x86: movnti minimum op size of 32-bit is not kept
606b1c3e87597c2d6c9f3eb833a7251262390295 KVM: x86: sgdt and sidt are not privilaged
7fe864dc942c041cc4f56e287c4025d54a8e6c1e KVM: x86: Mark VEX-prefix instructions emulation as unimplemented
22d48b2d2aa0b078816eaa1e15e485811a2d03fa KVM: svm: writes to MSR_K7_HWCR generates GPE in guest

and if on AMD:

98eb2f8b145cee711984d42eff5d6f19b6b1df69 KVM: vmx: speed up emulation of invalid guest state
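
One possible way to feed the lists above to git in bulk, as a sketch
only. It assumes the "<hash> <subject>" lines have been pasted into a
text file; the helper name, the GIT override and the file name in the
usage note are hypothetical.

```shell
# Sketch only: helper name and GIT override are illustrative assumptions.
# Must be used after "git bisect start" has been run in the kernel tree.
GIT=${GIT:-git}

# Read lines of the form "<hash> <subject>" on stdin and mark each hash
# as untestable for the bisect in progress.
skip_listed_commits() {
    while read -r hash _subject; do
        [ -n "$hash" ] || continue   # ignore blank lines
        "$GIT" bisect skip "$hash"
    done
}
```

Usage would be something like "skip_listed_commits < skip-list.txt",
where skip-list.txt is a hypothetical file holding the lists. Skipped
commits are not tested, so if the culprit sits next to one of them git
may report a small range of candidates instead of a single commit.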



This is the remaining set of commits.  Unfortunately I couldn't get it
down to 32 or less, but at least it cleans up the picture a bit.  And
I do not see anything except the commit I mentioned above:

d6e8c8545651b05a86c5b9d29d2fe11ad4cbb9aa KVM: x86: set rflags.rf during fault injection
b9a1ecb909e8f772934cc4bf1f164124c9fbb0d0 KVM: x86: Setting rflags.rf during rep-string emulation
6f43ed01e87c8a8dbd8c826eaf0f714c1342c039 KVM: x86: DR6/7.RTM cannot be written
4161a569065b17954848069d5209182083ce876b KVM: x86: emulator injects #DB when RFLAGS.RF is set
6c6cb69b8e974049cca2cc4480052fb9e7df767b KVM: x86: Cleanup of rflags.rf cleaning
4467c3f1ad16e3640e2b61e1a5e0bd55281a925d KVM: x86: Clear rflags.rf on emulated instructions
163b135e7b09e9158f7eb0aa74e716865e3005d2 KVM: x86: popf emulation should not change RF
bb663c7ada380f3c89c2f83fdbe2b3626621385d KVM: x86: Clearing rflags.rf upon skipped emulated instruction
44583cba9188b29b20ceeefe8ae23ad19e26d9a4 KVM: x86: use kvm_read_guest_page for emulator accesses
719d5a9b2487e0562f178f61e323c3dc18a8b200 KVM: x86: ensure emulator fetches do not span multiple pages
17052f16a51af6d8f4b7eee0631af675ac204f65 KVM: emulate: put pointers in the fetch_cache
9506d57de3bc8277a4e306e0d439976862f68c6d KVM: emulate: avoid per-byte copying in instruction fetches
5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
285ca9e948fa047e51fe47082528034de5369e8d KVM: emulate: speed up do_insn_fetch
41061cdb98a0bec464278b4db8e894a3121671f5 KVM: emulate: do not initialize memopp
573e80fe04db1aa44e8303037f65716ba5c3a343 KVM: emulate: rework seg_override
c44b4c6ab80eef3a9c52c7b3f0c632942e6489aa KVM: emulate: clean up initializations in init_decode_cache
02357bdc8c30a60cd33dd438f851c1306c34f435 KVM: emulate: cleanup decode_modrm
685bbf4ac406364a84a1d4237b4970dc570fd4cb KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
1498507a47867596de158d4db8728e92385a4919 KVM: emulate: move init_decode_cache to emulate.c
f5f87dfbc777f89148c3c66438741139845d3ac6 KVM: emulate: simplify writeback
54cfdb3e95d4f70409a7d3432a42cffc9a232be7 KVM: emulate: speed up emulated moves
d40a6898e50c2589ca3d345ef5ca6671e2b35b1a KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
e24186e097b80c5995ff75e1bbcd541d09c9e42b KVM: emulate: move around some checks
6addfc42992be4b073c39137ecfdf4b2aa2d487f KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
37ccdcbe0757196ec98c0dcf9754bec8423807a5 KVM: x86: return all bits from get_interrupt_shadow
5f7552d4a56c21a882c9854ac63c6eb73ca7d7c8 KVM: x86: Pending interrupt may be delivered after INIT
0d3da0d26e3c3515997c99451ce3b0ad1a69a36c KVM: x86: fix TSC matching
ee212297cd425620867d4398d55d068c4203768c KVM: x86: Wrong emulation on 'xadd X, X'
968889771749d8e730d794deed2bd2e363a98a54 KVM: emulate: simplify BitOp handling
a5457e7bcf9a76ec5c2de5d311d9b0d3b724edc6 KVM: emulate: POP SS triggers a MOV SS shadow too
32e94d0696c26c6ba4f3ff53e70f6e0e825979bc KVM: x86: smsw emulation is incorrect in 64-bit mode
aaa05f2437b9450f30b301db962ec4d45ec90fbb KVM: x86: Return error on cmpxchg16b emulation
67f4d4288c353734d29c45f6725971c71af96791 KVM: x86: rdpmc emulation checks the counter incorrectly
37c564f2854bf75969d0ac26e03f5cf2bb7d639f KVM: x86: cmpxchg emulation should compare in reverse order

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-04-13 12:38       ` Paolo Bonzini
@ 2015-04-13 12:45         ` Brad Campbell
  2015-04-13 14:02           ` Paolo Bonzini
  2015-04-13 12:47         ` Saso Slavicic
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-04-13 12:45 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm, Radim Krčmář


On 13/04/15 20:38, Paolo Bonzini wrote:
>
> On 13/04/2015 06:07, Brad Campbell wrote:
>> On 31/03/15 05:11, Paolo Bonzini wrote:
>>> On 22/03/2015 16:31, Brad Campbell wrote:
>>>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>>>> good, and 3.17 is bad.
>>> Can you try more specifically around the first KVM pull request?  That
>>> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
>>> bad)?
>>>
>>>
>>
>> G'day Paolo.
>>
>> I can confirm that the fault appears to lie between good and bad as
>> specified above.
>> Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now.
> Thanks!  Remember to bisect only with arch/x86/kvm.
>
> Also:
>
> 1) Brad, I see you are on AMD.  Have you ever reproduced it on Intel?
> Saso, are you on AMD as well?
>
> If so, the most likely culprit is this:
>
> commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Thu Mar 27 11:29:28 2014 +0100

G'day Paolo,

Yes, on AMD and I've tried hard to reproduce it on Intel and been unable 
to thus far.

Now you mention it may be AMD specific, I have a spare motherboard and 
processor sitting in a drawer. I'll bolt it together tomorrow and see if 
I can reproduce it on another AMD machine. Two machines should let me 
test it twice as fast.

I got a fail this afternoon, so I'm due to reboot tonight. I'll just 
revert that one suspect commit from a known bad kernel and see if that 
cleans it up. If not then I'll work through the remainder of the 
information in your mail. I really appreciate the attention you've paid 
to this, it has been a frustrating bug for me because I'm in a position 
of not knowing what I don't know, and obviously doing something wrong in 
very long bisection processes.
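
For reference, the revert test described above could be sketched
roughly like this. The commit hash is the suspect named earlier in the
thread; the helper name and the GIT override are assumptions for
illustration, and the revert may not apply cleanly on every later
kernel.

```shell
# Sketch only: helper name and GIT override are illustrative assumptions.
# Run from a kernel tree checked out at a known bad version.
GIT=${GIT:-git}
SUSPECT=6addfc42992be4b073c39137ecfdf4b2aa2d487f

# Create a new commit that undoes the suspect change, keeping the
# default revert commit message. Conflicts, if any, need manual fixing.
revert_suspect() {
    "$GIT" revert --no-edit "$SUSPECT"
}
```

If the rebuilt kernel then survives as long as a known good one, that
points at the suspect commit without needing a full bisect.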

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: XP machine freeze
  2015-04-13 12:38       ` Paolo Bonzini
  2015-04-13 12:45         ` Brad Campbell
@ 2015-04-13 12:47         ` Saso Slavicic
  2015-04-13 13:33         ` Radim Krčmář
  2015-04-13 13:34         ` Nadav Amit
  3 siblings, 0 replies; 25+ messages in thread
From: Saso Slavicic @ 2015-04-13 12:47 UTC (permalink / raw)
  To: 'Paolo Bonzini', 'Brad Campbell',
	kvm, 'Radim Krčmář'

> From: Paolo Bonzini
> Sent: Monday, April 13, 2015 2:39 PM
> Subject: Re: XP machine freeze
>
> ...
> Also:
> 
> 1) Brad, I see you are on AMD.  Have you ever reproduced it on Intel?
> Saso, are you on AMD as well?

No, this is an Intel machine:

Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

Regards,
Saso Slavicic


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-04-13 12:38       ` Paolo Bonzini
  2015-04-13 12:45         ` Brad Campbell
  2015-04-13 12:47         ` Saso Slavicic
@ 2015-04-13 13:33         ` Radim Krčmář
  2015-04-13 13:34         ` Nadav Amit
  3 siblings, 0 replies; 25+ messages in thread
From: Radim Krčmář @ 2015-04-13 13:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Brad Campbell, Saso Slavicic, kvm

2015-04-13 14:38+0200, Paolo Bonzini:
> If so, the most likely culprit is this:
> 
> commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Thu Mar 27 11:29:28 2014 +0100
> 
>     KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
| [...]
> 
> I would first try this one, and see if it is bad.
> 
> Radim, do you think this could cause a missed interrupt injection
> after Windows does a TPR write?

I don't think it could, all changes to TPR/ISR should call
apic_update_ppr, which sets KVM_REQ_EVENT when needed ...

I'll take a look at what could have gone wrong.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: XP machine freeze
  2015-04-13 12:38       ` Paolo Bonzini
                           ` (2 preceding siblings ...)
  2015-04-13 13:33         ` Radim Krčmář
@ 2015-04-13 13:34         ` Nadav Amit
  2015-04-13 14:01           ` Paolo Bonzini
  3 siblings, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2015-04-13 13:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Brad Campbell, Saso Slavicic, kvm, Radim Krčmář

Paolo,

I hope I am not misleading or interrupting, and I am obviously very biased —
but couldn’t it be related to the issue that patch f210f7572bed ("KVM: x86:
Fix lost interrupt on irr_pending race") deals with?

I got this issue first when I upgraded to 3.17 in my testing environment,
since apparently a race got worse due to patch 56cc2406d68c. Did anyone try
3.19 that has this fix?

Regards,
Nadav

Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 13/04/2015 06:07, Brad Campbell wrote:
>> On 31/03/15 05:11, Paolo Bonzini wrote:
>>> On 22/03/2015 16:31, Brad Campbell wrote:
>>>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>>>> good, and 3.17 is bad.
>>> Can you try more specifically around the first KVM pull request?  That
>>> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
>>> bad)?
>> 
>> 
>> G'day Paolo.
>> 
>> I can confirm that the fault appears to lie between good and bad as
>> specified above.
>> Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now.
> 
> Thanks!  Remember to bisect only with arch/x86/kvm.
> 
> Also:
> 
> 1) Brad, I see you are on AMD.  Have you ever reproduced it on Intel?
> Saso, are you on AMD as well?
> 
> If so, the most likely culprit is this:
> 
> commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Thu Mar 27 11:29:28 2014 +0100
> 
>    KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
> 
>    Despite the provisions to emulate up to 130 consecutive instructions, in
>    practice KVM will emulate just one before exiting handle_invalid_guest_state,
>    because x86_emulate_instruction always sets KVM_REQ_EVENT.
> 
>    However, we only need to do this if an interrupt could be injected,
>    which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
>    away; b) if the interrupt flag has just been set (other instructions
>    than STI can set it without enabling an interrupt shadow).
> 
>    This cuts another 700-900 cycles from the cost of emulating an
>    instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
>    before the patch on kvm-unit-tests, 925-1700 afterwards).
> 
>    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> I would first try this one, and see if it is bad.
> 
> Radim, do you think this could cause a missed interrupt injection
> after Windows does a TPR write?
> 
> 2) For bisection feel free to "git bisect skip" the following:
> 
> 03916db9348c079d8d214f971cc114bb51c6b869 Replace NR_VMX_MSR with its definition
> 9a2a05b9ed618b1bb6d4cbec0c2e1f80d6636609 KVM: nVMX: clean up nested_release_vmcs12 and code around it
> 4fa7734c62cdd8c07edd54fa5a5e91482273071a KVM: nVMX: fix lifetime issues for vmcs02
> c9cdd085bb75226879fd468b88e2e7eb467325b7 KVM: x86: Defining missing x86 vectors
> 0123be429fef40f067e5b1811576c3994229f59e KVM: x86: Assertions to check no overrun in MSR lists
> 296f047502f1b3ddfd63adbc192624ce80740081 KVM: vmx: remove duplicate vmx_mpx_supported() prototype
> 963fee1656603ce2e91ebb988cd5a92f2af41369 KVM: nVMX: Fix virtual interrupt delivery injection
> 6cbc5f5a80a9ae5a80bc81efc574b5a85bfd4a84 KVM: nSVM: Set correct port for IOIO interception evaluation
> 6493f1574e898b46370e2b2315836d76a1980f2c KVM: nSVM: Fix IOIO size reported on emulation
> 9bf418335e24da995ea682a028926d7e1036be6f KVM: nSVM: Fix IOIO bitmap evaluation
> 62baf44cad3bc6b37115cc21e4228fe53d4f3474 KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1
> 5381417f6a51293e7b8af1eb18aefa5d47976a71 KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM
> 2996fca0690f03a5220203588f4a0d8c5acba2b0 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
> 560b7ee12ca5e1ebc1675d7eb4008bb22708277a KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
> 3dcdf3ec6e48d918741ea11349d4436d0c5aac93 KVM: nVMX: Allow to disable CR3 access interception
> 3dbcd8da7b564194f93271b003a1c46ef404cbdb KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
> bc39c4db7110f88f338cbbabe53d3e43c7400a59 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE
> e4aa5288ff07766d101751de9a8420d666c61735 KVM: x86: Fix constant value of VM_{EXIT_SAVE,ENTRY_LOAD}_DEBUG_CONTROLS
> 42cbc04fd3b5e3f9b011bf9fa3ce0b3d1e10b58b x86/kvm: Resolve shadow warnings in macro expansion
> b55a8144d1807f9e74c51cb584f0dd198483d86c x86/kvm: Resolve shadow warning from min macro
> 98eff52ab5c0ff5cb96940a93e99a1aeb2f11c89 KVM: x86: Fix lapic.c debug prints
> 9f6226a762c7ae02f6a23a3d4fc552dafa57ea23 arch: x86: kvm: x86.c: Cleaning up variable is set more than once
> 80112c89ed872c725e7dc39ccf6c37d1a585e161 KVM: Synthesize G bit for all segments.
> 27e6fb5dae2819d17f38dc9224692b771e989981 KVM: vmx: vmx instructions handling does not consider cs.l
> bdc907222c5e4edd848da0c031deb55b59f1cf9a KVM: emulate: fix harmless typo in MMX decoding
> 10e38fc7cabc668738e6a7b7b57cbcddb2234440 KVM: x86: Emulator flag for instruction that only support 16-bit addresses in real mode
> 68efa764f3429f2bd71f431e91c04b0bcb7d34f1 KVM: x86: Emulator support for #UD on CPL>0
> 
> The following can be skipped assuming you are on 32-bit XP:
> 
> 1e32c07955b43e7f827174bf320ed35971117275 KVM: vmx: handle_cr ignores 32/64-bit mode
> a449c7aa51e10c9bde0ea9bee4e682d6d067ebab KVM: x86: Hypercall handling does not considers opsize correctly
> 5777392e83c96e3a0799dd2985598e0fc76cf4aa KVM: x86: check DR6/7 high-bits are clear only on long-mode
> a825f5cc4a8455663562809748240169cb9bc2c0 KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX
> 140bad89fd25db1aab60f80ed7874e9a9bdbae3b KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
> 7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 KVM: x86: bit-ops emulation ignores offset on 64-bit
> 2eedcac8a97cef43c9c5236398fc8c9d0fd9cc0c KVM: x86: Loading segments on 64-bit mode may be wrong
> e37a75a13cdae5deaa2ea2cbf8d55b5dd08638b6 KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR
> 
> And I think the following can be skipped safely too:
> 
> 9e8919ae793f4edfaa29694a70f71a515ae9942a KVM: x86: Inter-privilege level ret emulation is not implemeneted
> 3b32004a66e96e17d2a031c08d3304245c506dfc KVM: x86: movnti minimum op size of 32-bit is not kept
> 606b1c3e87597c2d6c9f3eb833a7251262390295 KVM: x86: sgdt and sidt are not privilaged
> 7fe864dc942c041cc4f56e287c4025d54a8e6c1e KVM: x86: Mark VEX-prefix instructions emulation as unimplemented
> 22d48b2d2aa0b078816eaa1e15e485811a2d03fa KVM: svm: writes to MSR_K7_HWCR generates GPE in guest
> 
> and if on AMD:
> 
> 98eb2f8b145cee711984d42eff5d6f19b6b1df69 KVM: vmx: speed up emulation of invalid guest state
> 
> 
> 
> This is the remaining set of commits.  Unfortunately I couldn't get it
> down to 32 or less, but at least it cleans up the picture a bit.  And
> I do not see anything except the commit I mentioned above:
> 
> d6e8c8545651b05a86c5b9d29d2fe11ad4cbb9aa KVM: x86: set rflags.rf during fault injection
> b9a1ecb909e8f772934cc4bf1f164124c9fbb0d0 KVM: x86: Setting rflags.rf during rep-string emulation
> 6f43ed01e87c8a8dbd8c826eaf0f714c1342c039 KVM: x86: DR6/7.RTM cannot be written
> 4161a569065b17954848069d5209182083ce876b KVM: x86: emulator injects #DB when RFLAGS.RF is set
> 6c6cb69b8e974049cca2cc4480052fb9e7df767b KVM: x86: Cleanup of rflags.rf cleaning
> 4467c3f1ad16e3640e2b61e1a5e0bd55281a925d KVM: x86: Clear rflags.rf on emulated instructions
> 163b135e7b09e9158f7eb0aa74e716865e3005d2 KVM: x86: popf emulation should not change RF
> bb663c7ada380f3c89c2f83fdbe2b3626621385d KVM: x86: Clearing rflags.rf upon skipped emulated instruction
> 44583cba9188b29b20ceeefe8ae23ad19e26d9a4 KVM: x86: use kvm_read_guest_page for emulator accesses
> 719d5a9b2487e0562f178f61e323c3dc18a8b200 KVM: x86: ensure emulator fetches do not span multiple pages
> 17052f16a51af6d8f4b7eee0631af675ac204f65 KVM: emulate: put pointers in the fetch_cache
> 9506d57de3bc8277a4e306e0d439976862f68c6d KVM: emulate: avoid per-byte copying in instruction fetches
> 5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
> 285ca9e948fa047e51fe47082528034de5369e8d KVM: emulate: speed up do_insn_fetch
> 41061cdb98a0bec464278b4db8e894a3121671f5 KVM: emulate: do not initialize memopp
> 573e80fe04db1aa44e8303037f65716ba5c3a343 KVM: emulate: rework seg_override
> c44b4c6ab80eef3a9c52c7b3f0c632942e6489aa KVM: emulate: clean up initializations in init_decode_cache
> 02357bdc8c30a60cd33dd438f851c1306c34f435 KVM: emulate: cleanup decode_modrm
> 685bbf4ac406364a84a1d4237b4970dc570fd4cb KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
> 1498507a47867596de158d4db8728e92385a4919 KVM: emulate: move init_decode_cache to emulate.c
> f5f87dfbc777f89148c3c66438741139845d3ac6 KVM: emulate: simplify writeback
> 54cfdb3e95d4f70409a7d3432a42cffc9a232be7 KVM: emulate: speed up emulated moves
> d40a6898e50c2589ca3d345ef5ca6671e2b35b1a KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
> e24186e097b80c5995ff75e1bbcd541d09c9e42b KVM: emulate: move around some checks
> 6addfc42992be4b073c39137ecfdf4b2aa2d487f KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
> 37ccdcbe0757196ec98c0dcf9754bec8423807a5 KVM: x86: return all bits from get_interrupt_shadow
> 5f7552d4a56c21a882c9854ac63c6eb73ca7d7c8 KVM: x86: Pending interrupt may be delivered after INIT
> 0d3da0d26e3c3515997c99451ce3b0ad1a69a36c KVM: x86: fix TSC matching
> ee212297cd425620867d4398d55d068c4203768c KVM: x86: Wrong emulation on 'xadd X, X'
> 968889771749d8e730d794deed2bd2e363a98a54 KVM: emulate: simplify BitOp handling
> a5457e7bcf9a76ec5c2de5d311d9b0d3b724edc6 KVM: emulate: POP SS triggers a MOV SS shadow too
> 32e94d0696c26c6ba4f3ff53e70f6e0e825979bc KVM: x86: smsw emulation is incorrect in 64-bit mode
> aaa05f2437b9450f30b301db962ec4d45ec90fbb KVM: x86: Return error on cmpxchg16b emulation
> 67f4d4288c353734d29c45f6725971c71af96791 KVM: x86: rdpmc emulation checks the counter incorrectly
> 37c564f2854bf75969d0ac26e03f5cf2bb7d639f KVM: x86: cmpxchg emulation should compare in reverse order
> 
> Thanks,
> 
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: XP machine freeze
  2015-04-13 13:34         ` Nadav Amit
@ 2015-04-13 14:01           ` Paolo Bonzini
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-04-13 14:01 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Brad Campbell, Saso Slavicic, kvm, Radim Krčmář



On 13/04/2015 15:34, Nadav Amit wrote:
> Paolo,
> 
> I hope I am not misleading or interrupting, and I am obviously very biased —
> but couldn’t it be related to the issue that patch f210f7572bed ("KVM: x86:
> Fix lost interrupt on irr_pending race") deals with?
> 
> I got this issue first when I upgraded to 3.17 in my testing environment,
> since apparently a race got worse due to patch 56cc2406d68c. Did anyone try
> 3.19 that has this fix?

That's a much better guess than mine.  Especially because it would also
explain how Saso is reproducing it on CentOS 6 (but still less easily
than Brad who has 56cc2406d68c0f09505c389e276f27a99f495cbd).

Paolo

> Regards,
> Nadav
> 
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>>
>>
>> On 13/04/2015 06:07, Brad Campbell wrote:
>>> On 31/03/15 05:11, Paolo Bonzini wrote:
>>>> On 22/03/2015 16:31, Brad Campbell wrote:
>>>>> No help I'm afraid, but at least I can conclusively say that 3.16 is
>>>>> good, and 3.17 is bad.
>>>> Can you try more specifically around the first KVM pull request?  That
>>>> would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed
>>>> bad)?
>>>
>>>
>>> G'day Paolo.
>>>
>>> I can confirm that the fault appears to lie between good and bad as
>>> specified above.
>>> Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now.
>>
>> Thanks!  Remember to bisect only with arch/x86/kvm.
>>
>> Also:
>>
>> 1) Brad, I see you are on AMD.  Have you ever reproduced it on Intel?
>> Saso, are you on AMD as well?
>>
>> If so, the most likely culprit is this:
>>
>> commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date:   Thu Mar 27 11:29:28 2014 +0100
>>
>>    KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
>>
>>    Despite the provisions to emulate up to 130 consecutive instructions, in
>>    practice KVM will emulate just one before exiting handle_invalid_guest_state,
>>    because x86_emulate_instruction always sets KVM_REQ_EVENT.
>>
>>    However, we only need to do this if an interrupt could be injected,
>>    which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
>>    away; b) if the interrupt flag has just been set (other instructions
>>    than STI can set it without enabling an interrupt shadow).
>>
>>    This cuts another 700-900 cycles from the cost of emulating an
>>    instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
>>    before the patch on kvm-unit-tests, 925-1700 afterwards).
>>
>>    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> I would first try this one, and see if it is bad.
>>
>> Radim, do you think this could cause a missed interrupt injection
>> after Windows does a TPR write?
>>
>> 2) For bisection feel free to "git bisect skip" the following:
>>
>> 03916db9348c079d8d214f971cc114bb51c6b869 Replace NR_VMX_MSR with its definition
>> 9a2a05b9ed618b1bb6d4cbec0c2e1f80d6636609 KVM: nVMX: clean up nested_release_vmcs12 and code around it
>> 4fa7734c62cdd8c07edd54fa5a5e91482273071a KVM: nVMX: fix lifetime issues for vmcs02
>> c9cdd085bb75226879fd468b88e2e7eb467325b7 KVM: x86: Defining missing x86 vectors
>> 0123be429fef40f067e5b1811576c3994229f59e KVM: x86: Assertions to check no overrun in MSR lists
>> 296f047502f1b3ddfd63adbc192624ce80740081 KVM: vmx: remove duplicate vmx_mpx_supported() prototype
>> 963fee1656603ce2e91ebb988cd5a92f2af41369 KVM: nVMX: Fix virtual interrupt delivery injection
>> 6cbc5f5a80a9ae5a80bc81efc574b5a85bfd4a84 KVM: nSVM: Set correct port for IOIO interception evaluation
>> 6493f1574e898b46370e2b2315836d76a1980f2c KVM: nSVM: Fix IOIO size reported on emulation
>> 9bf418335e24da995ea682a028926d7e1036be6f KVM: nSVM: Fix IOIO bitmap evaluation
>> 62baf44cad3bc6b37115cc21e4228fe53d4f3474 KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1
>> 5381417f6a51293e7b8af1eb18aefa5d47976a71 KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM
>> 2996fca0690f03a5220203588f4a0d8c5acba2b0 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
>> 560b7ee12ca5e1ebc1675d7eb4008bb22708277a KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
>> 3dcdf3ec6e48d918741ea11349d4436d0c5aac93 KVM: nVMX: Allow to disable CR3 access interception
>> 3dbcd8da7b564194f93271b003a1c46ef404cbdb KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
>> bc39c4db7110f88f338cbbabe53d3e43c7400a59 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE
>> e4aa5288ff07766d101751de9a8420d666c61735 KVM: x86: Fix constant value of VM_{EXIT_SAVE,ENTRY_LOAD}_DEBUG_CONTROLS
>> 42cbc04fd3b5e3f9b011bf9fa3ce0b3d1e10b58b x86/kvm: Resolve shadow warnings in macro expansion
>> b55a8144d1807f9e74c51cb584f0dd198483d86c x86/kvm: Resolve shadow warning from min macro
>> 98eff52ab5c0ff5cb96940a93e99a1aeb2f11c89 KVM: x86: Fix lapic.c debug prints
>> 9f6226a762c7ae02f6a23a3d4fc552dafa57ea23 arch: x86: kvm: x86.c: Cleaning up variable is set more than once
>> 80112c89ed872c725e7dc39ccf6c37d1a585e161 KVM: Synthesize G bit for all segments.
>> 27e6fb5dae2819d17f38dc9224692b771e989981 KVM: vmx: vmx instructions handling does not consider cs.l
>> bdc907222c5e4edd848da0c031deb55b59f1cf9a KVM: emulate: fix harmless typo in MMX decoding
>> 10e38fc7cabc668738e6a7b7b57cbcddb2234440 KVM: x86: Emulator flag for instruction that only support 16-bit addresses in real mode
>> 68efa764f3429f2bd71f431e91c04b0bcb7d34f1 KVM: x86: Emulator support for #UD on CPL>0
>>
>> The following can be skipped assuming you are on 32-bit XP:
>>
>> 1e32c07955b43e7f827174bf320ed35971117275 KVM: vmx: handle_cr ignores 32/64-bit mode
>> a449c7aa51e10c9bde0ea9bee4e682d6d067ebab KVM: x86: Hypercall handling does not considers opsize correctly
>> 5777392e83c96e3a0799dd2985598e0fc76cf4aa KVM: x86: check DR6/7 high-bits are clear only on long-mode
>> a825f5cc4a8455663562809748240169cb9bc2c0 KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX
>> 140bad89fd25db1aab60f80ed7874e9a9bdbae3b KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
>> 7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 KVM: x86: bit-ops emulation ignores offset on 64-bit
>> 2eedcac8a97cef43c9c5236398fc8c9d0fd9cc0c KVM: x86: Loading segments on 64-bit mode may be wrong
>> e37a75a13cdae5deaa2ea2cbf8d55b5dd08638b6 KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR
>>
>> And I think the following can be skipped safely too:
>>
>> 9e8919ae793f4edfaa29694a70f71a515ae9942a KVM: x86: Inter-privilege level ret emulation is not implemeneted
>> 3b32004a66e96e17d2a031c08d3304245c506dfc KVM: x86: movnti minimum op size of 32-bit is not kept
>> 606b1c3e87597c2d6c9f3eb833a7251262390295 KVM: x86: sgdt and sidt are not privilaged
>> 7fe864dc942c041cc4f56e287c4025d54a8e6c1e KVM: x86: Mark VEX-prefix instructions emulation as unimplemented
>> 22d48b2d2aa0b078816eaa1e15e485811a2d03fa KVM: svm: writes to MSR_K7_HWCR generates GPE in guest
>>
>> and if on AMD:
>>
>> 98eb2f8b145cee711984d42eff5d6f19b6b1df69 KVM: vmx: speed up emulation of invalid guest state
>>
>>
>>
>> This is the remaining set of commits.  Unfortunately I couldn't get it
>> down to 32 or less, but at least it cleans up the picture a bit.  And
>> I do not see anything except the commit I mentioned above:
>>
>> d6e8c8545651b05a86c5b9d29d2fe11ad4cbb9aa KVM: x86: set rflags.rf during fault injection
>> b9a1ecb909e8f772934cc4bf1f164124c9fbb0d0 KVM: x86: Setting rflags.rf during rep-string emulation
>> 6f43ed01e87c8a8dbd8c826eaf0f714c1342c039 KVM: x86: DR6/7.RTM cannot be written
>> 4161a569065b17954848069d5209182083ce876b KVM: x86: emulator injects #DB when RFLAGS.RF is set
>> 6c6cb69b8e974049cca2cc4480052fb9e7df767b KVM: x86: Cleanup of rflags.rf cleaning
>> 4467c3f1ad16e3640e2b61e1a5e0bd55281a925d KVM: x86: Clear rflags.rf on emulated instructions
>> 163b135e7b09e9158f7eb0aa74e716865e3005d2 KVM: x86: popf emulation should not change RF
>> bb663c7ada380f3c89c2f83fdbe2b3626621385d KVM: x86: Clearing rflags.rf upon skipped emulated instruction
>> 44583cba9188b29b20ceeefe8ae23ad19e26d9a4 KVM: x86: use kvm_read_guest_page for emulator accesses
>> 719d5a9b2487e0562f178f61e323c3dc18a8b200 KVM: x86: ensure emulator fetches do not span multiple pages
>> 17052f16a51af6d8f4b7eee0631af675ac204f65 KVM: emulate: put pointers in the fetch_cache
>> 9506d57de3bc8277a4e306e0d439976862f68c6d KVM: emulate: avoid per-byte copying in instruction fetches
>> 5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
>> 285ca9e948fa047e51fe47082528034de5369e8d KVM: emulate: speed up do_insn_fetch
>> 41061cdb98a0bec464278b4db8e894a3121671f5 KVM: emulate: do not initialize memopp
>> 573e80fe04db1aa44e8303037f65716ba5c3a343 KVM: emulate: rework seg_override
>> c44b4c6ab80eef3a9c52c7b3f0c632942e6489aa KVM: emulate: clean up initializations in init_decode_cache
>> 02357bdc8c30a60cd33dd438f851c1306c34f435 KVM: emulate: cleanup decode_modrm
>> 685bbf4ac406364a84a1d4237b4970dc570fd4cb KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
>> 1498507a47867596de158d4db8728e92385a4919 KVM: emulate: move init_decode_cache to emulate.c
>> f5f87dfbc777f89148c3c66438741139845d3ac6 KVM: emulate: simplify writeback
>> 54cfdb3e95d4f70409a7d3432a42cffc9a232be7 KVM: emulate: speed up emulated moves
>> d40a6898e50c2589ca3d345ef5ca6671e2b35b1a KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
>> e24186e097b80c5995ff75e1bbcd541d09c9e42b KVM: emulate: move around some checks
>> 6addfc42992be4b073c39137ecfdf4b2aa2d487f KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
>> 37ccdcbe0757196ec98c0dcf9754bec8423807a5 KVM: x86: return all bits from get_interrupt_shadow
>> 5f7552d4a56c21a882c9854ac63c6eb73ca7d7c8 KVM: x86: Pending interrupt may be delivered after INIT
>> 0d3da0d26e3c3515997c99451ce3b0ad1a69a36c KVM: x86: fix TSC matching
>> ee212297cd425620867d4398d55d068c4203768c KVM: x86: Wrong emulation on 'xadd X, X'
>> 968889771749d8e730d794deed2bd2e363a98a54 KVM: emulate: simplify BitOp handling
>> a5457e7bcf9a76ec5c2de5d311d9b0d3b724edc6 KVM: emulate: POP SS triggers a MOV SS shadow too
>> 32e94d0696c26c6ba4f3ff53e70f6e0e825979bc KVM: x86: smsw emulation is incorrect in 64-bit mode
>> aaa05f2437b9450f30b301db962ec4d45ec90fbb KVM: x86: Return error on cmpxchg16b emulation
>> 67f4d4288c353734d29c45f6725971c71af96791 KVM: x86: rdpmc emulation checks the counter incorrectly
>> 37c564f2854bf75969d0ac26e03f5cf2bb7d639f KVM: x86: cmpxchg emulation should compare in reverse order
>>
>> Thanks,
>>
>> Paolo
> 
> 


* Re: XP machine freeze
  2015-04-13 12:45         ` Brad Campbell
@ 2015-04-13 14:02           ` Paolo Bonzini
  2015-04-13 14:25             ` Brad Campbell
  2015-04-19 15:27             ` Brad Campbell
  0 siblings, 2 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-04-13 14:02 UTC (permalink / raw)
  To: Brad Campbell, Saso Slavicic, kvm, Radim Krčmář



On 13/04/2015 14:45, Brad Campbell wrote:
> G'day Paolo,
> 
> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
> to thus far.
> 
> Now you mention it may be AMD specific, I have a spare motherboard and
> processor sitting in a drawer. I'll bolt it together tomorrow and see if
> I can reproduce it on another AMD machine. Two machines should let me
> test it twice as fast.
> 
> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
> revert that one suspect commit from a known bad kernel and see if that
> cleans it up. If not then I'll work through the remainder of the
> information in your mail. I really appreciate the attention you've paid
> to this, it has been a frustrating bug for me because I'm in a position
> of not knowing what I don't know, and obviously doing something wrong in
> very long bisection processes.

Actually, if you have time to change your course of action, please
revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.

Paolo


* Re: XP machine freeze
  2015-04-13 14:02           ` Paolo Bonzini
@ 2015-04-13 14:25             ` Brad Campbell
  2015-04-19 15:27             ` Brad Campbell
  1 sibling, 0 replies; 25+ messages in thread
From: Brad Campbell @ 2015-04-13 14:25 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm, Radim Krčmář


On 13/04/15 22:02, Paolo Bonzini wrote:
>
> Actually, if you have time to change your course of action, please
> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>
>

Ok, I've done just that. Started on a 3.17 vanilla base and just applied 
that commit. Let's see what happens over the next week.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.



* Re: XP machine freeze
  2015-04-13 14:02           ` Paolo Bonzini
  2015-04-13 14:25             ` Brad Campbell
@ 2015-04-19 15:27             ` Brad Campbell
  2015-04-19 15:48               ` Nadav Amit
  1 sibling, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-04-19 15:27 UTC (permalink / raw)
  To: Paolo Bonzini, Saso Slavicic, kvm, Radim Krčmář


On 13/04/15 22:02, Paolo Bonzini wrote:
>
> On 13/04/2015 14:45, Brad Campbell wrote:
>> G'day Paolo,
>>
>> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
>> to thus far.
>>
>> Now you mention it may be AMD specific, I have a spare motherboard and
>> processor sitting in a drawer. I'll bolt it together tomorrow and see if
>> I can reproduce it on another AMD machine. Two machines should let me
>> test it twice as fast.
>>
>> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
>> revert that one suspect commit from a known bad kernel and see if that
>> cleans it up. If not then I'll work through the remainder of the
>> information in your mail. I really appreciate the attention you've paid
>> to this, it has been a frustrating bug for me because I'm in a position
>> of not knowing what I don't know, and obviously doing something wrong in
>> very long bisection processes.
> Actually, if you have time to change your course of action, please
> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>
> Paolo
>
Ok, I think we have a winner. Patch manually plopped on top of vanilla 
3.17. It has never gone for anywhere near this long on a bad kernel.

brad@srv:~$ uptime
  23:24:48 up 6 days,  1:01,  3 users,  load average: 1.48, 1.95, 2.48

So this patch went into the kernel during the 3.19 release cycle? 
Affected kernels 3.16-3.18?

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.



* Re: XP machine freeze
  2015-04-19 15:27             ` Brad Campbell
@ 2015-04-19 15:48               ` Nadav Amit
  2015-04-19 16:50                 ` Brad Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2015-04-19 15:48 UTC (permalink / raw)
  To: Brad Campbell, Paolo Bonzini
  Cc: Saso Slavicic, kvm list, Radim Krčmář

Brad Campbell <lists2009@fnarfbargle.com> wrote:

> 
> On 13/04/15 22:02, Paolo Bonzini wrote:
>> On 13/04/2015 14:45, Brad Campbell wrote:
>>> G'day Paolo,
>>> 
>>> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
>>> to thus far.
>>> 
>>> Now you mention it may be AMD specific, I have a spare motherboard and
>>> processor sitting in a drawer. I'll bolt it together tomorrow and see if
>>> I can reproduce it on another AMD machine. Two machines should let me
>>> test it twice as fast.
>>> 
>>> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
>>> revert that one suspect commit from a known bad kernel and see if that
>>> cleans it up. If not then I'll work through the remainder of the
>>> information in your mail. I really appreciate the attention you've paid
>>> to this, it has been a frustrating bug for me because I'm in a position
>>> of not knowing what I don't know, and obviously doing something wrong in
>>> very long bisection processes.
>> Actually, if you have time to change your course of action, please
>> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
>> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>> 
>> Paolo
> Ok, I think we have a winner. Patch manually plopped on top of vanilla 3.17. It has never gone for anywhere near this long on a bad kernel.
> 
> brad@srv:~$ uptime
> 23:24:48 up 6 days,  1:01,  3 users,  load average: 1.48, 1.95, 2.48
> 
> So this patch went into the kernel during the 3.19 release cycle? Affected kernels 3.16-3.18?

Actually, the original bug seemed to be introduced by commit
33e4c68656a2e461b296ce714ec322978de85412 "KVM: Optimize searching for
highest IRR". So the bug goes all the way back to 2.6.32. The race that this
patch fixes just became more apparent (i.e., likely to happen) on 3.16. It
is fixed in 3.19.
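
To make the failure mode concrete, here is a toy model of how a cached
"something is pending" flag can fall out of sync with the bitmap it
summarizes. This is only a sketch, not the kernel code: the `Lapic` class,
the vector numbers, and the exact interleaving replayed in
`racy_interleaving` are assumptions chosen for illustration. See the commit
message of f210f7572bed for the real sequence.

```python
class Lapic:
    """Toy model: a pending-vector set plus a cached 'anything pending?' flag."""

    def __init__(self):
        self.irr = set()          # vectors currently requested
        self.irr_pending = False  # cache consulted before scanning irr

    def search_irr(self):
        # Full scan: highest requested vector, or -1 if none.
        return max(self.irr) if self.irr else -1

    def find_highest_irr(self):
        # Fast path: trust the cached flag and skip the scan when it is clear.
        if not self.irr_pending:
            return -1
        return self.search_irr()


def racy_interleaving(apic, old_vec, new_vec):
    """Replay one unlucky ordering of a setter and a clearer, step by step.

    The ordering is hypothetical; it stands in for two code paths that
    update the flag and the set without synchronizing with each other.
    """
    apic.irr_pending = True          # setter: raises the flag first ...
    apic.irr.discard(old_vec)        # clearer: drops the vector it serviced
    vec = apic.search_irr()          # clearer: scans and finds nothing
    apic.irr.add(new_vec)            # setter: ... and adds its vector second
    apic.irr_pending = (vec != -1)   # clearer: writes back a stale "nothing pending"


apic = Lapic()
apic.irr.add(0x31)
apic.irr_pending = True
racy_interleaving(apic, old_vec=0x31, new_vec=0x41)

print(hex(apic.search_irr()))      # the new vector is present in the set
print(apic.find_highest_irr())     # but the fast path does not see it

# A later, unrelated interrupt raises the flag again and unblocks delivery.
apic.irr.add(0x51)
apic.irr_pending = True
print(hex(apic.find_highest_irr()))
```

In this model the stuck vector only becomes visible again once some other
interrupt arrives, which would be consistent with the reports in this thread
that opening the console wakes the guest up.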

I guess Paolo would push it to stable now. Right?

Regards,
Nadav



* Re: XP machine freeze
  2015-04-19 15:48               ` Nadav Amit
@ 2015-04-19 16:50                 ` Brad Campbell
  2015-04-19 17:16                   ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Brad Campbell @ 2015-04-19 16:50 UTC (permalink / raw)
  To: Nadav Amit, Paolo Bonzini
  Cc: Saso Slavicic, kvm list, Radim Krčmář


On 19/04/15 23:48, Nadav Amit wrote:
> Brad Campbell <lists2009@fnarfbargle.com> wrote:
>
>> On 13/04/15 22:02, Paolo Bonzini wrote:
>>> On 13/04/2015 14:45, Brad Campbell wrote:
>>>> G'day Paolo,
>>>>
>>>> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
>>>> to thus far.
>>>>
>>>> Now you mention it may be AMD specific, I have a spare motherboard and
>>>> processor sitting in a drawer. I'll bolt it together tomorrow and see if
>>>> I can reproduce it on another AMD machine. Two machines should let me
>>>> test it twice as fast.
>>>>
>>>> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
>>>> revert that one suspect commit from a known bad kernel and see if that
>>>> cleans it up. If not then I'll work through the remainder of the
>>>> information in your mail. I really appreciate the attention you've paid
>>>> to this, it has been a frustrating bug for me because I'm in a position
>>>> of not knowing what I don't know, and obviously doing something wrong in
>>>> very long bisection processes.
>>> Actually, if you have time to change your course of action, please
>>> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
>>> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>>>
>>> Paolo
>> Ok, I think we have a winner. Patch manually plopped on top of vanilla 3.17. It has never gone for anywhere near this long on a bad kernel.
>>
>> brad@srv:~$ uptime
>> 23:24:48 up 6 days,  1:01,  3 users,  load average: 1.48, 1.95, 2.48
>>
>> So this patch went into the kernel during the 3.19 release cycle? Affected kernels 3.16-3.18?
> Actually, the original bug seemed to be introduced by commit
> 33e4c68656a2e461b296ce714ec322978de85412 "KVM: Optimize searching for
> highest IRR". So the bug goes all the way back to 2.6.32. The race that this
> patch fixes just became more apparent (i.e., likely to happen) on 3.16. It
> is fixed in 3.19.

And I can confidently state that over the years I've seen this happen a 
number of times, but in each case I was using qemu with an SDL console 
as a user-interactive VM, and moving the mouse would restore network 
connectivity. It was obviously seriously exacerbated by something that 
went into 3.16.

I really appreciate the assistance in pinning this down. At the next 
excuse for a reboot I'll upgrade the server to a 3.19.x kernel and call 
it done.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.



* Re: XP machine freeze
  2015-04-19 16:50                 ` Brad Campbell
@ 2015-04-19 17:16                   ` Paolo Bonzini
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-04-19 17:16 UTC (permalink / raw)
  To: Brad Campbell, Nadav Amit
  Cc: Saso Slavicic, kvm list, Radim Krčmář



On 19/04/2015 18:50, Brad Campbell wrote:
> And I can confidently state that over the years I've seen this happen a
> number of times, but in each case I was using qemu with an SDL console
> as a user-interactive VM, and moving the mouse would restore network
> connectivity. It was obviously seriously exacerbated by something that
> went into 3.16.

Yes, it was; it's right there in the commit message for commit f210f7572bed:

> commit 56cc2406d68c ("KVM: nVMX: fix "acknowledge interrupt on exit"
> when APICv is in use") changed the behavior of apic_clear_irr [...]
> Nonetheless, it appears the race might even occur prior to this commit:

Note that there is another regression between 3.17 and 3.19 (visible as
migration failures with XFS in the guest), so I'd keep 3.17.x for a
while in production.

The hard disk holding my test system's root filesystem died on Friday so
it may take a few days before I actually prepare the fix for stable@ and
also apply the other pending patches.  Nevertheless, no data was lost
(/home is safe) so it's just a matter of finding some time to reinstall
the OSes.

Paolo


end of thread, other threads:[~2015-04-19 17:16 UTC | newest]

Thread overview: 25+ messages
2015-03-16 15:10 XP machine freeze Saso Slavicic
2015-03-19  0:51 ` Marcelo Tosatti
2015-03-30 16:19   ` Saso Slavicic
2015-03-22 15:31 ` Brad Campbell
2015-03-30 21:11   ` Paolo Bonzini
2015-03-31  0:27     ` Brad Campbell
2015-03-31  6:29       ` Saso Slavicic
2015-03-31  7:18         ` Brad Campbell
2015-03-31  8:56           ` Paolo Bonzini
2015-03-31 11:16             ` Brad Campbell
2015-03-31 11:23               ` Paolo Bonzini
2015-04-04 10:55                 ` Brad Campbell
2015-04-13  4:07     ` Brad Campbell
2015-04-13 12:38       ` Paolo Bonzini
2015-04-13 12:45         ` Brad Campbell
2015-04-13 14:02           ` Paolo Bonzini
2015-04-13 14:25             ` Brad Campbell
2015-04-19 15:27             ` Brad Campbell
2015-04-19 15:48               ` Nadav Amit
2015-04-19 16:50                 ` Brad Campbell
2015-04-19 17:16                   ` Paolo Bonzini
2015-04-13 12:47         ` Saso Slavicic
2015-04-13 13:33         ` Radim Krčmář
2015-04-13 13:34         ` Nadav Amit
2015-04-13 14:01           ` Paolo Bonzini
