* VMs freezing when host is running 4.14
@ 2017-11-21 16:18 Marc Haber
  2017-11-22  9:39 ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2017-11-21 16:18 UTC (permalink / raw)
  To: linux-kernel

Hi,

I am running Debian stable with home-built kernels on a number of KVM
hosts and a bigger number of KVM VMs. With 4.14, I have an interesting
phenomenon on _one_ of my hosts, while all other hosts run fine. All
systems are reasonably similar to each other.

On the affected host, VMs freeze at a rate of about two or three per day.
They just stop dead in their tracks, console and serial console become
unresponsive, ping stops, they don't react to virsh shutdown, only to
virsh destroy. They do, however, take a noticeable part of CPU resources
when they're in this state, up to a full CPU core.  What's left in
syslog of a VM is unsuspicious, the host logs don't have anything
uncommon. When I start a VM that allocates a lot of memory (like the 8
GB Windows VM that I use for bookkeeping and taxes), it happens that two
to five of the Linux VMs freeze in the same second.

The affected host is a Thinkpad T520 with 16 Gig of RAM and an Intel(R)
Core(TM) i5-2520M that is neither under noticeable load nor under memory
pressure (it can afford 8 Gig of disk cache).

When I boot the host back to 4.13.11, things are just fine and the
machine is chugging away painlessly for days. When I boot 4.14, the VM
freeze phenomenon usually appears in the first 24 hours.

The other VM hosts (ranging from a small PC Engines APU to a bigger 48
Gig Server in Housing) run just fine with 4.14 as well.

Any ideas?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-21 16:18 VMs freezing when host is running 4.14 Marc Haber
@ 2017-11-22  9:39 ` Marc Haber
  2017-11-22 15:04   ` 王金浦
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2017-11-22  9:39 UTC (permalink / raw)
  To: linux-kernel

On Tue, Nov 21, 2017 at 05:18:21PM +0100, Marc Haber wrote:
> On the affected host, VMs freeze at a rate about two or three per day.
> They just stop dead in their tracks, console and serial console become
> unresponsive, ping stops, they don't react to virsh shutdown, only to
> virsh destroy.

I was able to obtain a log of a VM before it became unresponsive. Here
we go:

Nov 22 08:19:01 weave kernel: double fault: 0000 [#1] PREEMPT SMP
Nov 22 08:19:01 weave kernel: Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc sg aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds virtio_balloon virtio_console led_class qemu_fw_cfg ip_tables x_tables autofs4 ext4 mbcache jbd2 fscrypto usbhid sr_mod cdrom virtio_blk virtio_net ata_generic crc32c_intel ehci_pci ehci_hcd usbcore usb_common floppy i2c_piix4 virtio_pci virtio_ring virtio ata_piix i2c_core libata
Nov 22 08:19:01 weave kernel: CPU: 1 PID: 8795 Comm: debsecan Not tainted 4.14.1-zgsrv20080 #3
Nov 22 08:19:01 weave kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Nov 22 08:19:01 weave kernel: task: ffff88001ef0adc0 task.stack: ffffc900001fc000
Nov 22 08:19:01 weave kernel: RIP: 0010:kvm_async_pf_task_wait+0x167/0x200
Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffa10 EFLAGS: 00000202
Nov 22 08:19:01 weave kernel: RAX: ffff88001fd11cc0 RBX: ffffc900001ffa30 RCX: 0000000000000002
Nov 22 08:19:01 weave kernel: RDX: 0140000000000000 RSI: ffffffff8173514b RDI: ffffffff819bdd80
Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffaa0 R08: 0000000000193fc0 R09: ffff880000000000
Nov 22 08:19:01 weave kernel: R10: ffffc900001ffac0 R11: 0000000000000000 R12: ffffc900001ffa40
Nov 22 08:19:01 weave kernel: R13: 0000000000000be8 R14: ffffffff819bdd80 R15: ffffea0000193f80
Nov 22 08:19:01 weave kernel: FS:  00007f97e25dd700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
Nov 22 08:19:01 weave kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 22 08:19:01 weave kernel: CR2: 0000000000483001 CR3: 0000000015df7000 CR4: 00000000000406e0
Nov 22 08:19:01 weave kernel: Call Trace:
Nov 22 08:19:01 weave kernel: do_async_page_fault+0x6b/0x70
Nov 22 08:19:01 weave kernel: ? do_async_page_fault+0x6b/0x70
Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
Nov 22 08:19:01 weave kernel: RIP: 0010:clear_page_rep+0x7/0x10
Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffb88 EFLAGS: 00010246
Nov 22 08:19:01 weave kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000200
Nov 22 08:19:01 weave kernel: RDX: ffff88001ef0adc0 RSI: 0000000000193f80 RDI: ffff8800064fe000
Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffc50 R08: 0000000000193fc0 R09: ffff880000000000
Nov 22 08:19:01 weave kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
Nov 22 08:19:01 weave kernel: R13: ffff88001ffd5500 R14: ffffc900001ffce8 R15: ffffea0000193f80
Nov 22 08:19:01 weave kernel: ? get_page_from_freelist+0x8c3/0xaf0
Nov 22 08:19:01 weave kernel: ? __mem_cgroup_threshold+0x8a/0x130
Nov 22 08:19:01 weave kernel: ? free_pcppages_bulk+0x3f6/0x410
Nov 22 08:19:01 weave kernel: __alloc_pages_nodemask+0xe4/0xe20
Nov 22 08:19:01 weave kernel: ? free_hot_cold_page_list+0x2b/0x50
Nov 22 08:19:01 weave kernel: ? release_pages+0x2b7/0x360
Nov 22 08:19:01 weave kernel: ? mem_cgroup_commit_charge+0x7a/0x520
Nov 22 08:19:01 weave kernel: ? account_entity_enqueue+0x95/0xc0
Nov 22 08:19:01 weave kernel: alloc_pages_vma+0x7f/0x1e0
Nov 22 08:19:01 weave kernel: __handle_mm_fault+0x9cb/0xf20
Nov 22 08:19:01 weave kernel: handle_mm_fault+0xb2/0x1f0
Nov 22 08:19:01 weave kernel: __do_page_fault+0x1f2/0x440
Nov 22 08:19:01 weave kernel: do_page_fault+0x22/0x30
Nov 22 08:19:01 weave kernel: do_async_page_fault+0x4c/0x70
Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
Nov 22 08:19:01 weave kernel: RIP: 0033:0x56434ef679d8
Nov 22 08:19:01 weave kernel: RSP: 002b:00007ffd6b48ad80 EFLAGS: 00010206
Nov 22 08:19:01 weave kernel: RAX: 00000000000000eb RBX: 000000000000001d RCX: aaaaaaaaaaaaaaab
Nov 22 08:19:01 weave kernel: RDX: 000056434f5eb300 RSI: 000000000000000f RDI: 000056434f3ca6c0
Nov 22 08:19:01 weave kernel: RBP: 00000000000000ec R08: 00007f97e2453000 R09: 000056434f5eb3ea
Nov 22 08:19:01 weave kernel: R10: 000056434f5eb3eb R11: 000056434f4510a0 R12: 000000000000003a
Nov 22 08:19:01 weave kernel: R13: 000056434f3ca500 R14: 000056434f451240 R15: 00007f97e1024750
Nov 22 08:19:01 weave kernel: Code: f7 49 89 9d a0 d1 9b 81 48 89 55 98 4c 8d 63 10 e8 4f 02 53 00 eb 20 48 83 7d 98 00 74 3a e8 21 6e 06 00 80 7d c0 00 74 3f fb f4 <fa> 66 66 90 66 66 90 e8 7d 6f 06 00 80 7d c0 00 75 da 48 8d b5
Nov 22 08:19:01 weave kernel: RIP: kvm_async_pf_task_wait+0x167/0x200 RSP: ffffc900001ffa10
Nov 22 08:19:01 weave kernel: ---[ end trace 4701012ee256be25 ]---
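
If I decode the Code: bytes correctly, the bytes around the faulting
instruction (fb f4 <fa>) are sti; hlt; cli, so the fault seems to hit
right where kvm_async_pf_task_wait() re-enables interrupts around a halt.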

Does that help?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-22  9:39 ` Marc Haber
@ 2017-11-22 15:04   ` 王金浦
  2017-11-22 15:52     ` Marc Haber
  2017-12-01 14:43     ` Marc Haber
  0 siblings, 2 replies; 18+ messages in thread
From: 王金浦 @ 2017-11-22 15:04 UTC (permalink / raw)
  To: Marc Haber; +Cc: LKML, KVM-ML (kvm@vger.kernel.org)

+cc kvm

2017-11-22 10:39 GMT+01:00 Marc Haber <mh+linux-kernel@zugschlus.de>:
> On Tue, Nov 21, 2017 at 05:18:21PM +0100, Marc Haber wrote:
>> On the affected host, VMs freeze at a rate about two or three per day.
>> They just stop dead in their tracks, console and serial console become
>> unresponsive, ping stops, they don't react to virsh shutdown, only to
>> virsh destroy.
>
> I was able to obtain a log of a VM before it became unresponsive. here
> we go:
>
> Nov 22 08:19:01 weave kernel: double fault: 0000 [#1] PREEMPT SMP
> Nov 22 08:19:01 weave kernel: Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc sg aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds virtio_balloon virtio_console led_class qemu_fw_cfg ip_tables x_tables autofs4 ext4 mbcache jbd2 fscrypto usbhid sr_mod cdrom virtio_blk virtio_net ata_generic crc32c_intel ehci_pci ehci_hcd usbcore usb_common floppy i2c_piix4 virtio_pci virtio_ring virtio ata_piix i2c_core libata
> Nov 22 08:19:01 weave kernel: CPU: 1 PID: 8795 Comm: debsecan Not tainted 4.14.1-zgsrv20080 #3
> Nov 22 08:19:01 weave kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Nov 22 08:19:01 weave kernel: task: ffff88001ef0adc0 task.stack: ffffc900001fc000
> Nov 22 08:19:01 weave kernel: RIP: 0010:kvm_async_pf_task_wait+0x167/0x200
> Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffa10 EFLAGS: 00000202
> Nov 22 08:19:01 weave kernel: RAX: ffff88001fd11cc0 RBX: ffffc900001ffa30 RCX: 0000000000000002
> Nov 22 08:19:01 weave kernel: RDX: 0140000000000000 RSI: ffffffff8173514b RDI: ffffffff819bdd80
> Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffaa0 R08: 0000000000193fc0 R09: ffff880000000000
> Nov 22 08:19:01 weave kernel: R10: ffffc900001ffac0 R11: 0000000000000000 R12: ffffc900001ffa40
> Nov 22 08:19:01 weave kernel: R13: 0000000000000be8 R14: ffffffff819bdd80 R15: ffffea0000193f80
> Nov 22 08:19:01 weave kernel: FS:  00007f97e25dd700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
> Nov 22 08:19:01 weave kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Nov 22 08:19:01 weave kernel: CR2: 0000000000483001 CR3: 0000000015df7000 CR4: 00000000000406e0
> Nov 22 08:19:01 weave kernel: Call Trace:
> Nov 22 08:19:01 weave kernel: do_async_page_fault+0x6b/0x70
> Nov 22 08:19:01 weave kernel: ? do_async_page_fault+0x6b/0x70
> Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> Nov 22 08:19:01 weave kernel: RIP: 0010:clear_page_rep+0x7/0x10
> Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffb88 EFLAGS: 00010246
> Nov 22 08:19:01 weave kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000200
> Nov 22 08:19:01 weave kernel: RDX: ffff88001ef0adc0 RSI: 0000000000193f80 RDI: ffff8800064fe000
> Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffc50 R08: 0000000000193fc0 R09: ffff880000000000
> Nov 22 08:19:01 weave kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
> Nov 22 08:19:01 weave kernel: R13: ffff88001ffd5500 R14: ffffc900001ffce8 R15: ffffea0000193f80
> Nov 22 08:19:01 weave kernel: ? get_page_from_freelist+0x8c3/0xaf0
> Nov 22 08:19:01 weave kernel: ? __mem_cgroup_threshold+0x8a/0x130
> Nov 22 08:19:01 weave kernel: ? free_pcppages_bulk+0x3f6/0x410
> Nov 22 08:19:01 weave kernel: __alloc_pages_nodemask+0xe4/0xe20
> Nov 22 08:19:01 weave kernel: ? free_hot_cold_page_list+0x2b/0x50
> Nov 22 08:19:01 weave kernel: ? release_pages+0x2b7/0x360
> Nov 22 08:19:01 weave kernel: ? mem_cgroup_commit_charge+0x7a/0x520
> Nov 22 08:19:01 weave kernel: ? account_entity_enqueue+0x95/0xc0
> Nov 22 08:19:01 weave kernel: alloc_pages_vma+0x7f/0x1e0
> Nov 22 08:19:01 weave kernel: __handle_mm_fault+0x9cb/0xf20
> Nov 22 08:19:01 weave kernel: handle_mm_fault+0xb2/0x1f0
> Nov 22 08:19:01 weave kernel: __do_page_fault+0x1f2/0x440
> Nov 22 08:19:01 weave kernel: do_page_fault+0x22/0x30
> Nov 22 08:19:01 weave kernel: do_async_page_fault+0x4c/0x70
> Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> Nov 22 08:19:01 weave kernel: RIP: 0033:0x56434ef679d8
> Nov 22 08:19:01 weave kernel: RSP: 002b:00007ffd6b48ad80 EFLAGS: 00010206
> Nov 22 08:19:01 weave kernel: RAX: 00000000000000eb RBX: 000000000000001d RCX: aaaaaaaaaaaaaaab
> Nov 22 08:19:01 weave kernel: RDX: 000056434f5eb300 RSI: 000000000000000f RDI: 000056434f3ca6c0
> Nov 22 08:19:01 weave kernel: RBP: 00000000000000ec R08: 00007f97e2453000 R09: 000056434f5eb3ea
> Nov 22 08:19:01 weave kernel: R10: 000056434f5eb3eb R11: 000056434f4510a0 R12: 000000000000003a
> Nov 22 08:19:01 weave kernel: R13: 000056434f3ca500 R14: 000056434f451240 R15: 00007f97e1024750
> Nov 22 08:19:01 weave kernel: Code: f7 49 89 9d a0 d1 9b 81 48 89 55 98 4c 8d 63 10 e8 4f 02 53 00 eb 20 48 83 7d 98 00 74 3a e8 21 6e 06 00 80 7d c0 00 74 3f fb f4 <fa> 66 66 90 66 66 90 e8 7d 6f 06 00 80 7d c0 00 75 da 48 8d b5
> Nov 22 08:19:01 weave kernel: RIP: kvm_async_pf_task_wait+0x167/0x200 RSP: ffffc900001ffa10
> Nov 22 08:19:01 weave kernel: ---[ end trace 4701012ee256be25 ]---
>
> Does that help?
>
So are all guest kernels 4.14, or are there also older kernels?

> Greetings
> Marc

Regards,
Jack

>
> --
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-22 15:04   ` 王金浦
@ 2017-11-22 15:52     ` Marc Haber
  2017-11-22 16:43       ` Radim Krčmář
  2017-12-01 14:43     ` Marc Haber
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Haber @ 2017-11-22 15:52 UTC (permalink / raw)
  To: Marc Haber, LKML, KVM-ML (kvm@vger.kernel.org)

On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> So all guest kernels are 4.14, or also other older kernel?

Guest kernels are also 4.14, but the issue disappears when the host is
downgraded to an older kernel. I therefore reckoned that the guest
kernel doesn't matter, but that was before I saw the trace in the log.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-22 15:52     ` Marc Haber
@ 2017-11-22 16:43       ` Radim Krčmář
  2017-11-23 15:20         ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Radim Krčmář @ 2017-11-22 16:43 UTC (permalink / raw)
  To: Marc Haber; +Cc: LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li

2017-11-22 16:52+0100, Marc Haber:
> On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > So all guest kernels are 4.14, or also other older kernel?
> 
> Guest kernels are also 4.14, but the issue disappears when the host is
> downgraded to an older kernel. I therefore reckoned that the guest
> kernel doesn't matter, but that was before I saw the trace in the log.

The two most suspicious patches since 4.13 (which I assume works) are

  664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
  injected")

and

  9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
  and "Page Ready" exceptions simultaneously")

please try reverting them to see if it helps,

thanks.
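
Something like this on top of v4.14.x should do (untested sketch; the
commit ids are the ones above, the branch name is just an example):

  git checkout -b kvm-reverts v4.14.1
  git revert 664f8e26b00c
  git revert 9a6e7c39810e
  # rebuild and boot the host kernel, then watch whether the VMs still freeze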

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-22 16:43       ` Radim Krčmář
@ 2017-11-23 15:20         ` Marc Haber
  2017-11-23 15:59           ` Radim Krčmář
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2017-11-23 15:20 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li

On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
> 2017-11-22 16:52+0100, Marc Haber:
> > On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > > So all guest kernels are 4.14, or also other older kernel?
> > 
> > Guest kernels are also 4.14, but the issue disappears when the host is
> > downgraded to an older kernel. I therefore reckoned that the guest
> > kernel doesn't matter, but that was before I saw the trace in the log.
> 
> The two most suspicious patches since 4.13 (which I assume works) are
> 
>   664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
>   injected")

That one does not revert cleanly; the line in question seems to have
been removed a bit later.

Reject is:
141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
--- arch/x86/kvm/vmx.c
+++ arch/x86/kvm/vmx.c
@@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        unsigned nr = vcpu->arch.exception.nr;
        bool has_error_code = vcpu->arch.exception.has_error_code;
-       bool reinject = vcpu->arch.exception.injected;
+       bool reinject = vcpu->arch.exception.reinject;
        u32 error_code = vcpu->arch.exception.error_code;
        u32 intr_info = nr | INTR_INFO_VALID_MASK;

> 
> and
> 
>   9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
>   and "Page Ready" exceptions simultaneously")
> 
> please try reverting them to see if it helps,

That one reverted cleanly. I am now running the new kernel on the
affected machine, and I think that a second machine has joined the
ranks of affected machines.

Would this matter on the host only or on the guests as well?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-23 15:20         ` Marc Haber
@ 2017-11-23 15:59           ` Radim Krčmář
  2017-11-23 16:18             ` Liran Alon
  0 siblings, 1 reply; 18+ messages in thread
From: Radim Krčmář @ 2017-11-23 15:59 UTC (permalink / raw)
  To: Marc Haber; +Cc: LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li

2017-11-23 16:20+0100, Marc Haber:
> On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
> > 2017-11-22 16:52+0100, Marc Haber:
> > > On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > > > So all guest kernels are 4.14, or also other older kernel?
> > > 
> > > Guest kernels are also 4.14, but the issue disappears when the host is
> > > downgraded to an older kernel. I therefore reckoned that the guest
> > > kernel doesn't matter, but that was before I saw the trace in the log.
> > 
> > The two most suspicious patches since 4.13 (which I assume works) are
> > 
> >   664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
> >   injected")
> 
> That one does not revert cleanly, the line in questions seems to have
> been removed a bit later.
> 
> Reject is:
> 141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
> --- arch/x86/kvm/vmx.c
> +++ arch/x86/kvm/vmx.c
> @@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
>         unsigned nr = vcpu->arch.exception.nr;
>         bool has_error_code = vcpu->arch.exception.has_error_code;
> -       bool reinject = vcpu->arch.exception.injected;
> +       bool reinject = vcpu->arch.exception.reinject;
>         u32 error_code = vcpu->arch.exception.error_code;
>         u32 intr_info = nr | INTR_INFO_VALID_MASK;

This line can be deleted, as reinject isn't used in the function.

Btw. there have already been many fixes from Liran Alon for that patch,
and your case could be the one addressed in
https://www.spinics.net/lists/kvm/msg159158.html

The patch is incorrect, but you might be able to see only its benefits.

> > and
> > 
> >   9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
> >   and "Page Ready" exceptions simultaneously")
> > 
> > please try reverting them to see if it helps,
> 
> That one reverted cleanly. I am now running the new kernel on the
> affected machine, and I think that a second machine has joined the
> market of being affected.

That one had much lower chances of being the culprit.

> Would this matter on the host only or on the guests as well?

Only on the host.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-23 15:59           ` Radim Krčmář
@ 2017-11-23 16:18             ` Liran Alon
  2017-11-23 16:24               ` Radim Krčmář
  0 siblings, 1 reply; 18+ messages in thread
From: Liran Alon @ 2017-11-23 16:18 UTC (permalink / raw)
  To: Radim Krčmář, Marc Haber
  Cc: LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li



On 23/11/17 17:59, Radim Krčmář wrote:
> 2017-11-23 16:20+0100, Marc Haber:
>> On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
>>> 2017-11-22 16:52+0100, Marc Haber:
>>>> On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
>>>>> So all guest kernels are 4.14, or also other older kernel?
>>>>
>>>> Guest kernels are also 4.14, but the issue disappears when the host is
>>>> downgraded to an older kernel. I therefore reckoned that the guest
>>>> kernel doesn't matter, but that was before I saw the trace in the log.
>>>
>>> The two most suspicious patches since 4.13 (which I assume works) are
>>>
>>>    664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
>>>    injected")
>>
>> That one does not revert cleanly, the line in questions seems to have
>> been removed a bit later.
>>
>> Reject is:
>> 141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
>> --- arch/x86/kvm/vmx.c
>> +++ arch/x86/kvm/vmx.c
>> @@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
>>          struct vcpu_vmx *vmx = to_vmx(vcpu);
>>          unsigned nr = vcpu->arch.exception.nr;
>>          bool has_error_code = vcpu->arch.exception.has_error_code;
>> -       bool reinject = vcpu->arch.exception.injected;
>> +       bool reinject = vcpu->arch.exception.reinject;
>>          u32 error_code = vcpu->arch.exception.error_code;
>>          u32 intr_info = nr | INTR_INFO_VALID_MASK;
>
> This line one can be deleted as reinject isn't used in the function.
>
> Btw. there have been already many fixes from Liran Alon for that patch
> and your case could be the one adressed in
> https://www.spinics.net/lists/kvm/msg159158.html
>
> The patch is incorrect, but you might be able to see only its benefits.

Actually I would first attempt to check this patch of mine:
https://www.spinics.net/lists/kvm/msg159062.html
It fixes a bug where an L2 exception is accidentally delivered into L1.

Regards,
-Liran
>
>>> and
>>>
>>>    9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
>>>    and "Page Ready" exceptions simultaneously")
>>>
>>> please try reverting them to see if it helps,
>>
>> That one reverted cleanly. I am now running the new kernel on the
>> affected machine, and I think that a second machine has joined the
>> market of being affected.
>
> That one had much lower chances of being the culprit.
>
>> Would this matter on the host only or on the guests as well?
>
> Only on the host.
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-23 16:18             ` Liran Alon
@ 2017-11-23 16:24               ` Radim Krčmář
  2017-11-23 16:26                 ` Liran Alon
  0 siblings, 1 reply; 18+ messages in thread
From: Radim Krčmář @ 2017-11-23 16:24 UTC (permalink / raw)
  To: Liran Alon; +Cc: Marc Haber, LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li

2017-11-23 18:18+0200, Liran Alon:
> On 23/11/17 17:59, Radim Krčmář wrote:
> > Btw. there have been already many fixes from Liran Alon for that patch
> > and your case could be the one adressed in
> > https://www.spinics.net/lists/kvm/msg159158.html
> > 
> > The patch is incorrect, but you might be able to see only its benefits.
> 
> Actually I would first attempt to check this patch of mine:
> https://www.spinics.net/lists/kvm/msg159062.html
> It fixes a bug of a L2 exception accidentally being delivered into L1.

Marc's guest didn't have kvm/vbox/... module loaded, so I assumed that
we're not running L2.  Do you suspect it also fixes something else?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-23 16:24               ` Radim Krčmář
@ 2017-11-23 16:26                 ` Liran Alon
  2017-11-24  8:08                   ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Liran Alon @ 2017-11-23 16:26 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Marc Haber, LKML, KVM-ML (kvm@vger.kernel.org), Wanpeng Li



On 23/11/17 18:24, Radim Krčmář wrote:
> 2017-11-23 18:18+0200, Liran Alon:
>> On 23/11/17 17:59, Radim Krčmář wrote:
>>> Btw. there have been already many fixes from Liran Alon for that patch
>>> and your case could be the one adressed in
>>> https://www.spinics.net/lists/kvm/msg159158.html
>>>
>>> The patch is incorrect, but you might be able to see only its benefits.
>>
>> Actually I would first attempt to check this patch of mine:
>> https://www.spinics.net/lists/kvm/msg159062.html
>> It fixes a bug of a L2 exception accidentally being delivered into L1.
>
> Marc's guest didn't have kvm/vbox/... module loaded, so I assumed that
> we're not running L2.  Do you suspect it also fixes something else?
>
> Thanks.
>
Oh I see.
If there is no nested guest, then no, my fix here probably won't help.

-Liran

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-23 16:26                 ` Liran Alon
@ 2017-11-24  8:08                   ` Marc Haber
  0 siblings, 0 replies; 18+ messages in thread
From: Marc Haber @ 2017-11-24  8:08 UTC (permalink / raw)
  To: Radim Krčmář, LKML, KVM-ML (kvm@vger.kernel.org),
	Wanpeng Li

On Thu, Nov 23, 2017 at 06:26:36PM +0200, Liran Alon wrote:
> If there is no nested guest so no. My fix here probably won't help.

I can confirm that I am not running nested virt; the host is running
directly on the APU. I also have three other machines that are running
flawlessly with 4.14, and another virtualization host, a "real" server
with a somewhat dated AMD Opteron 1389, that has the same issue. The
machine that first showed the issue is Intel, so this is not a
vendor-specific issue.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-11-22 15:04   ` 王金浦
  2017-11-22 15:52     ` Marc Haber
@ 2017-12-01 14:43     ` Marc Haber
  2018-01-08  9:10       ` Marc Haber
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Haber @ 2017-12-01 14:43 UTC (permalink / raw)
  To: 王金浦; +Cc: LKML, KVM-ML (kvm@vger.kernel.org)

4.14.3 is still affected.

I am still bisecting between 4.13 and 4.14, with 5 steps to go. I am
defining a kernel as "good" if it survives 24 hours on the hosts.
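
For the record, the procedure is the obvious one (sketched from memory,
so the exact commands may differ slightly):

  git bisect start v4.14 v4.13
  # build and install the candidate kernel on the host, reboot, run the VMs
  git bisect good   # if no VM froze within 24 hours
  git bisect bad    # if a VM froze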

Greetings
Marc


On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> 
> +cc kvm
> 
> 2017-11-22 10:39 GMT+01:00 Marc Haber <mh+linux-kernel@zugschlus.de>:
> > On Tue, Nov 21, 2017 at 05:18:21PM +0100, Marc Haber wrote:
> >> On the affected host, VMs freeze at a rate about two or three per day.
> >> They just stop dead in their tracks, console and serial console become
> >> unresponsive, ping stops, they don't react to virsh shutdown, only to
> >> virsh destroy.
> >
> > I was able to obtain a log of a VM before it became unresponsive. here
> > we go:
> >
> > Nov 22 08:19:01 weave kernel: double fault: 0000 [#1] PREEMPT SMP
> > Nov 22 08:19:01 weave kernel: Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc sg aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds virtio_balloon virtio_console led_class qemu_fw_cfg ip_tables x_tables autofs4 ext4 mbcache jbd2 fscrypto usbhid sr_mod cdrom virtio_blk virtio_net ata_generic crc32c_intel ehci_pci ehci_hcd usbcore usb_common floppy i2c_piix4 virtio_pci virtio_ring virtio ata_piix i2c_core libata
> > Nov 22 08:19:01 weave kernel: CPU: 1 PID: 8795 Comm: debsecan Not tainted 4.14.1-zgsrv20080 #3
> > Nov 22 08:19:01 weave kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > Nov 22 08:19:01 weave kernel: task: ffff88001ef0adc0 task.stack: ffffc900001fc000
> > Nov 22 08:19:01 weave kernel: RIP: 0010:kvm_async_pf_task_wait+0x167/0x200
> > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffa10 EFLAGS: 00000202
> > Nov 22 08:19:01 weave kernel: RAX: ffff88001fd11cc0 RBX: ffffc900001ffa30 RCX: 0000000000000002
> > Nov 22 08:19:01 weave kernel: RDX: 0140000000000000 RSI: ffffffff8173514b RDI: ffffffff819bdd80
> > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffaa0 R08: 0000000000193fc0 R09: ffff880000000000
> > Nov 22 08:19:01 weave kernel: R10: ffffc900001ffac0 R11: 0000000000000000 R12: ffffc900001ffa40
> > Nov 22 08:19:01 weave kernel: R13: 0000000000000be8 R14: ffffffff819bdd80 R15: ffffea0000193f80
> > Nov 22 08:19:01 weave kernel: FS:  00007f97e25dd700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
> > Nov 22 08:19:01 weave kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Nov 22 08:19:01 weave kernel: CR2: 0000000000483001 CR3: 0000000015df7000 CR4: 00000000000406e0
> > Nov 22 08:19:01 weave kernel: Call Trace:
> > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x6b/0x70
> > Nov 22 08:19:01 weave kernel: ? do_async_page_fault+0x6b/0x70
> > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > Nov 22 08:19:01 weave kernel: RIP: 0010:clear_page_rep+0x7/0x10
> > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffb88 EFLAGS: 00010246
> > Nov 22 08:19:01 weave kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000200
> > Nov 22 08:19:01 weave kernel: RDX: ffff88001ef0adc0 RSI: 0000000000193f80 RDI: ffff8800064fe000
> > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffc50 R08: 0000000000193fc0 R09: ffff880000000000
> > Nov 22 08:19:01 weave kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
> > Nov 22 08:19:01 weave kernel: R13: ffff88001ffd5500 R14: ffffc900001ffce8 R15: ffffea0000193f80
> > Nov 22 08:19:01 weave kernel: ? get_page_from_freelist+0x8c3/0xaf0
> > Nov 22 08:19:01 weave kernel: ? __mem_cgroup_threshold+0x8a/0x130
> > Nov 22 08:19:01 weave kernel: ? free_pcppages_bulk+0x3f6/0x410
> > Nov 22 08:19:01 weave kernel: __alloc_pages_nodemask+0xe4/0xe20
> > Nov 22 08:19:01 weave kernel: ? free_hot_cold_page_list+0x2b/0x50
> > Nov 22 08:19:01 weave kernel: ? release_pages+0x2b7/0x360
> > Nov 22 08:19:01 weave kernel: ? mem_cgroup_commit_charge+0x7a/0x520
> > Nov 22 08:19:01 weave kernel: ? account_entity_enqueue+0x95/0xc0
> > Nov 22 08:19:01 weave kernel: alloc_pages_vma+0x7f/0x1e0
> > Nov 22 08:19:01 weave kernel: __handle_mm_fault+0x9cb/0xf20
> > Nov 22 08:19:01 weave kernel: handle_mm_fault+0xb2/0x1f0
> > Nov 22 08:19:01 weave kernel: __do_page_fault+0x1f2/0x440
> > Nov 22 08:19:01 weave kernel: do_page_fault+0x22/0x30
> > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x4c/0x70
> > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > Nov 22 08:19:01 weave kernel: RIP: 0033:0x56434ef679d8
> > Nov 22 08:19:01 weave kernel: RSP: 002b:00007ffd6b48ad80 EFLAGS: 00010206
> > Nov 22 08:19:01 weave kernel: RAX: 00000000000000eb RBX: 000000000000001d RCX: aaaaaaaaaaaaaaab
> > Nov 22 08:19:01 weave kernel: RDX: 000056434f5eb300 RSI: 000000000000000f RDI: 000056434f3ca6c0
> > Nov 22 08:19:01 weave kernel: RBP: 00000000000000ec R08: 00007f97e2453000 R09: 000056434f5eb3ea
> > Nov 22 08:19:01 weave kernel: R10: 000056434f5eb3eb R11: 000056434f4510a0 R12: 000000000000003a
> > Nov 22 08:19:01 weave kernel: R13: 000056434f3ca500 R14: 000056434f451240 R15: 00007f97e1024750
> > Nov 22 08:19:01 weave kernel: Code: f7 49 89 9d a0 d1 9b 81 48 89 55 98 4c 8d 63 10 e8 4f 02 53 00 eb 20 48 83 7d 98 00 74 3a e8 21 6e 06 00 80 7d c0 00 74 3f fb f4 <fa> 66 66 90 66 66 90 e8 7d 6f 06 00 80 7d c0 00 75 da 48 8d b5
> > Nov 22 08:19:01 weave kernel: RIP: kvm_async_pf_task_wait+0x167/0x200 RSP: ffffc900001ffa10
> > Nov 22 08:19:01 weave kernel: ---[ end trace 4701012ee256be25 ]---
> >
> > Does that help?
> >
> So all guest kernels are 4.14, or also other older kernel?
> 
> > Greetings
> > Marc
> 
> Regards,
> Jack
> 
> >
> > --
> > -----------------------------------------------------------------------------
> > Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> > Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> > Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2017-12-01 14:43     ` Marc Haber
@ 2018-01-08  9:10       ` Marc Haber
  2018-02-11 13:39         ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2018-01-08  9:10 UTC (permalink / raw)
  To: 王金浦, LKML, KVM-ML (kvm@vger.kernel.org)

Hi,

it's been five weeks since I gave you the last information about this
issue. Alas, I don't have a solution yet, only reports:

- The bisect between 4.13 and 4.14 ended up on a one-character fix in a
  comment, so that was a total waste.
- The issue is present in all recent kernels up to 4.15-rc5; I didn't
  try any newer 4.15 version yet.
- 4.13-rc4 seems good
- 4.13-rc5 is the earliest kernel that shows the issue. I am at a loss
  to understand why a bug introduced during the 4.13 RC phase could
  _not_ be present in the 4.13 release but reappear in 4.14. I didn't
  try any 4.14 rc versions but suspect that those are all bad as well.

I will now start bisecting between 4.13-rc4 and 4.13-rc5, which is
"roughly 7 steps"; a kernel is "good" if it survived at least 72 hours
(as I found out that 24 hours might not be long enough).

I am still open to any suggestions that might help in identifying this
issue, which now affects five of my six systems that do KVM
virtualization one way or the other. I have in the meantime experienced
file system corruption and data loss (and do have backups).

Greetings
Marc

On Fri, Dec 01, 2017 at 03:43:58PM +0100, Marc Haber wrote:
> On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > +cc kvm
> > 
> > 2017-11-22 10:39 GMT+01:00 Marc Haber <mh+linux-kernel@zugschlus.de>:
> > > On Tue, Nov 21, 2017 at 05:18:21PM +0100, Marc Haber wrote:
> > >> On the affected host, VMs freeze at a rate about two or three per day.
> > >> They just stop dead in their tracks, console and serial console become
> > >> unresponsive, ping stops, they don't react to virsh shutdown, only to
> > >> virsh destroy.
> > >
> > > I was able to obtain a log of a VM before it became unresponsive. here
> > > we go:
> > >
> > > Nov 22 08:19:01 weave kernel: double fault: 0000 [#1] PREEMPT SMP
> > > Nov 22 08:19:01 weave kernel: Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc sg aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds virtio_balloon virtio_console led_class qemu_fw_cfg ip_tables x_tables autofs4 ext4 mbcache jbd2 fscrypto usbhid sr_mod cdrom virtio_blk virtio_net ata_generic crc32c_intel ehci_pci ehci_hcd usbcore usb_common floppy i2c_piix4 virtio_pci virtio_ring virtio ata_piix i2c_core libata
> > > Nov 22 08:19:01 weave kernel: CPU: 1 PID: 8795 Comm: debsecan Not tainted 4.14.1-zgsrv20080 #3
> > > Nov 22 08:19:01 weave kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > > Nov 22 08:19:01 weave kernel: task: ffff88001ef0adc0 task.stack: ffffc900001fc000
> > > Nov 22 08:19:01 weave kernel: RIP: 0010:kvm_async_pf_task_wait+0x167/0x200
> > > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffa10 EFLAGS: 00000202
> > > Nov 22 08:19:01 weave kernel: RAX: ffff88001fd11cc0 RBX: ffffc900001ffa30 RCX: 0000000000000002
> > > Nov 22 08:19:01 weave kernel: RDX: 0140000000000000 RSI: ffffffff8173514b RDI: ffffffff819bdd80
> > > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffaa0 R08: 0000000000193fc0 R09: ffff880000000000
> > > Nov 22 08:19:01 weave kernel: R10: ffffc900001ffac0 R11: 0000000000000000 R12: ffffc900001ffa40
> > > Nov 22 08:19:01 weave kernel: R13: 0000000000000be8 R14: ffffffff819bdd80 R15: ffffea0000193f80
> > > Nov 22 08:19:01 weave kernel: FS:  00007f97e25dd700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
> > > Nov 22 08:19:01 weave kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Nov 22 08:19:01 weave kernel: CR2: 0000000000483001 CR3: 0000000015df7000 CR4: 00000000000406e0
> > > Nov 22 08:19:01 weave kernel: Call Trace:
> > > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x6b/0x70
> > > Nov 22 08:19:01 weave kernel: ? do_async_page_fault+0x6b/0x70
> > > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > > Nov 22 08:19:01 weave kernel: RIP: 0010:clear_page_rep+0x7/0x10
> > > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffb88 EFLAGS: 00010246
> > > Nov 22 08:19:01 weave kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000200
> > > Nov 22 08:19:01 weave kernel: RDX: ffff88001ef0adc0 RSI: 0000000000193f80 RDI: ffff8800064fe000
> > > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffc50 R08: 0000000000193fc0 R09: ffff880000000000
> > > Nov 22 08:19:01 weave kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
> > > Nov 22 08:19:01 weave kernel: R13: ffff88001ffd5500 R14: ffffc900001ffce8 R15: ffffea0000193f80
> > > Nov 22 08:19:01 weave kernel: ? get_page_from_freelist+0x8c3/0xaf0
> > > Nov 22 08:19:01 weave kernel: ? __mem_cgroup_threshold+0x8a/0x130
> > > Nov 22 08:19:01 weave kernel: ? free_pcppages_bulk+0x3f6/0x410
> > > Nov 22 08:19:01 weave kernel: __alloc_pages_nodemask+0xe4/0xe20
> > > Nov 22 08:19:01 weave kernel: ? free_hot_cold_page_list+0x2b/0x50
> > > Nov 22 08:19:01 weave kernel: ? release_pages+0x2b7/0x360
> > > Nov 22 08:19:01 weave kernel: ? mem_cgroup_commit_charge+0x7a/0x520
> > > Nov 22 08:19:01 weave kernel: ? account_entity_enqueue+0x95/0xc0
> > > Nov 22 08:19:01 weave kernel: alloc_pages_vma+0x7f/0x1e0
> > > Nov 22 08:19:01 weave kernel: __handle_mm_fault+0x9cb/0xf20
> > > Nov 22 08:19:01 weave kernel: handle_mm_fault+0xb2/0x1f0
> > > Nov 22 08:19:01 weave kernel: __do_page_fault+0x1f2/0x440
> > > Nov 22 08:19:01 weave kernel: do_page_fault+0x22/0x30
> > > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x4c/0x70
> > > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > > Nov 22 08:19:01 weave kernel: RIP: 0033:0x56434ef679d8
> > > Nov 22 08:19:01 weave kernel: RSP: 002b:00007ffd6b48ad80 EFLAGS: 00010206
> > > Nov 22 08:19:01 weave kernel: RAX: 00000000000000eb RBX: 000000000000001d RCX: aaaaaaaaaaaaaaab
> > > Nov 22 08:19:01 weave kernel: RDX: 000056434f5eb300 RSI: 000000000000000f RDI: 000056434f3ca6c0
> > > Nov 22 08:19:01 weave kernel: RBP: 00000000000000ec R08: 00007f97e2453000 R09: 000056434f5eb3ea
> > > Nov 22 08:19:01 weave kernel: R10: 000056434f5eb3eb R11: 000056434f4510a0 R12: 000000000000003a
> > > Nov 22 08:19:01 weave kernel: R13: 000056434f3ca500 R14: 000056434f451240 R15: 00007f97e1024750
> > > Nov 22 08:19:01 weave kernel: Code: f7 49 89 9d a0 d1 9b 81 48 89 55 98 4c 8d 63 10 e8 4f 02 53 00 eb 20 48 83 7d 98 00 74 3a e8 21 6e 06 00 80 7d c0 00 74 3f fb f4 <fa> 66 66 90 66 66 90 e8 7d 6f 06 00 80 7d c0 00 75 da 48 8d b5
> > > Nov 22 08:19:01 weave kernel: RIP: kvm_async_pf_task_wait+0x167/0x200 RSP: ffffc900001ffa10
> > > Nov 22 08:19:01 weave kernel: ---[ end trace 4701012ee256be25 ]---
> > >
> > > Does that help?
> > >
> > So all guest kernels are 4.14, or also other older kernel?
> > 
> > > Greetings
> > > Marc
> > 
> > Regards,
> > Jack
> > 
> > >
> > > --
> > > -----------------------------------------------------------------------------
> > > Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> > > Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> > > Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
> 
> -- 
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2018-01-08  9:10       ` Marc Haber
@ 2018-02-11 13:39         ` Marc Haber
  2018-02-14  2:04           ` Josh Poimboeuf
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2018-02-11 13:39 UTC (permalink / raw)
  To: 王金浦, LKML, KVM-ML (kvm@vger.kernel.org)

Hi,

after nine weeks in total of bisecting, broken filesystems, and service
outages (thankfully on unimportant systems), 4.15 seems to have fixed the
issue. After going to 4.15, the crashes never happened again.

They have, however, happened with each and every 4.14 release I tried,
which I stopped doing with 4.14.15 on Jan 28.

This means, for me, that the issue is fixed and that I have just wasted
nine weeks of time.

For you, this means that you have a crippling, data-eating issue in the
current long-term release kernel. I do sincerely hope that I never have
to lay my eyes on any 4.14 kernel and hope that no major distribution
will release with this version.

Greetings
Marc


On Mon, Jan 08, 2018 at 10:10:25AM +0100, Marc Haber wrote:
> it's been five weeks since I gave you the last information about this
> issue. Alas, I don't have a solution yet, only reports:
> 
> - The bisect between 4.13 and 4.14 ended up on a one-character fix in a
>   comment, so that was a total waste.
> - The issue is present in all recent kernels up to 4.15-rc5, I didn't
>   try any newer 4.15 version yet.
> - 4.13-rc4 seems good
> - 4.13-rc5 is the earliest kernel that shows the issue. I am at a loss
>   to understand why a bug introduced during the 4.13 RC phase could
>   _not_ be present in the 4.13 release but reappear in 4.14. I didn't
>   try any 4.14 rc versions but suspect that those are all bad as well.
> 
> I will now start bisecting between 4.13-rc4 and 4.13-rc5, which is
> "roughly 7 steps"; a kernel is "good" if it survived at least 72 hours
> (as I found out that 24 hours might not be long enough).
> 
> I am still open to any suggestions that might help in identifying this
> issue which now affects five of my six systems that to KVM
> virtualization one way or the other. I have in the mean time experienced
> file system corruption and data loss (and do have backups).
> 
> Greetings
> Marc
> 
> On Fri, Dec 01, 2017 at 03:43:58PM +0100, Marc Haber wrote:
> > On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > > +cc kvm
> > > 
> > > 2017-11-22 10:39 GMT+01:00 Marc Haber <mh+linux-kernel@zugschlus.de>:
> > > > On Tue, Nov 21, 2017 at 05:18:21PM +0100, Marc Haber wrote:
> > > >> On the affected host, VMs freeze at a rate about two or three per day.
> > > >> They just stop dead in their tracks, console and serial console become
> > > >> unresponsive, ping stops, they don't react to virsh shutdown, only to
> > > >> virsh destroy.
> > > >
> > > > I was able to obtain a log of a VM before it became unresponsive. here
> > > > we go:
> > > >
> > > > Nov 22 08:19:01 weave kernel: double fault: 0000 [#1] PREEMPT SMP
> > > > Nov 22 08:19:01 weave kernel: Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc sg aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds virtio_balloon virtio_console led_class qemu_fw_cfg ip_tables x_tables autofs4 ext4 mbcache jbd2 fscrypto usbhid sr_mod cdrom virtio_blk virtio_net ata_generic crc32c_intel ehci_pci ehci_hcd usbcore usb_common floppy i2c_piix4 virtio_pci virtio_ring virtio ata_piix i2c_core libata
> > > > Nov 22 08:19:01 weave kernel: CPU: 1 PID: 8795 Comm: debsecan Not tainted 4.14.1-zgsrv20080 #3
> > > > Nov 22 08:19:01 weave kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > > > Nov 22 08:19:01 weave kernel: task: ffff88001ef0adc0 task.stack: ffffc900001fc000
> > > > Nov 22 08:19:01 weave kernel: RIP: 0010:kvm_async_pf_task_wait+0x167/0x200
> > > > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffa10 EFLAGS: 00000202
> > > > Nov 22 08:19:01 weave kernel: RAX: ffff88001fd11cc0 RBX: ffffc900001ffa30 RCX: 0000000000000002
> > > > Nov 22 08:19:01 weave kernel: RDX: 0140000000000000 RSI: ffffffff8173514b RDI: ffffffff819bdd80
> > > > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffaa0 R08: 0000000000193fc0 R09: ffff880000000000
> > > > Nov 22 08:19:01 weave kernel: R10: ffffc900001ffac0 R11: 0000000000000000 R12: ffffc900001ffa40
> > > > Nov 22 08:19:01 weave kernel: R13: 0000000000000be8 R14: ffffffff819bdd80 R15: ffffea0000193f80
> > > > Nov 22 08:19:01 weave kernel: FS:  00007f97e25dd700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
> > > > Nov 22 08:19:01 weave kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > Nov 22 08:19:01 weave kernel: CR2: 0000000000483001 CR3: 0000000015df7000 CR4: 00000000000406e0
> > > > Nov 22 08:19:01 weave kernel: Call Trace:
> > > > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x6b/0x70
> > > > Nov 22 08:19:01 weave kernel: ? do_async_page_fault+0x6b/0x70
> > > > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > > > Nov 22 08:19:01 weave kernel: RIP: 0010:clear_page_rep+0x7/0x10
> > > > Nov 22 08:19:01 weave kernel: RSP: 0000:ffffc900001ffb88 EFLAGS: 00010246
> > > > Nov 22 08:19:01 weave kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000200
> > > > Nov 22 08:19:01 weave kernel: RDX: ffff88001ef0adc0 RSI: 0000000000193f80 RDI: ffff8800064fe000
> > > > Nov 22 08:19:01 weave kernel: RBP: ffffc900001ffc50 R08: 0000000000193fc0 R09: ffff880000000000
> > > > Nov 22 08:19:01 weave kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
> > > > Nov 22 08:19:01 weave kernel: R13: ffff88001ffd5500 R14: ffffc900001ffce8 R15: ffffea0000193f80
> > > > Nov 22 08:19:01 weave kernel: ? get_page_from_freelist+0x8c3/0xaf0
> > > > Nov 22 08:19:01 weave kernel: ? __mem_cgroup_threshold+0x8a/0x130
> > > > Nov 22 08:19:01 weave kernel: ? free_pcppages_bulk+0x3f6/0x410
> > > > Nov 22 08:19:01 weave kernel: __alloc_pages_nodemask+0xe4/0xe20
> > > > Nov 22 08:19:01 weave kernel: ? free_hot_cold_page_list+0x2b/0x50
> > > > Nov 22 08:19:01 weave kernel: ? release_pages+0x2b7/0x360
> > > > Nov 22 08:19:01 weave kernel: ? mem_cgroup_commit_charge+0x7a/0x520
> > > > Nov 22 08:19:01 weave kernel: ? account_entity_enqueue+0x95/0xc0
> > > > Nov 22 08:19:01 weave kernel: alloc_pages_vma+0x7f/0x1e0
> > > > Nov 22 08:19:01 weave kernel: __handle_mm_fault+0x9cb/0xf20
> > > > Nov 22 08:19:01 weave kernel: handle_mm_fault+0xb2/0x1f0
> > > > Nov 22 08:19:01 weave kernel: __do_page_fault+0x1f2/0x440
> > > > Nov 22 08:19:01 weave kernel: do_page_fault+0x22/0x30
> > > > Nov 22 08:19:01 weave kernel: do_async_page_fault+0x4c/0x70
> > > > Nov 22 08:19:01 weave kernel: async_page_fault+0x22/0x30
> > > > Nov 22 08:19:01 weave kernel: RIP: 0033:0x56434ef679d8
> > > > Nov 22 08:19:01 weave kernel: RSP: 002b:00007ffd6b48ad80 EFLAGS: 00010206
> > > > Nov 22 08:19:01 weave kernel: RAX: 00000000000000eb RBX: 000000000000001d RCX: aaaaaaaaaaaaaaab
> > > > Nov 22 08:19:01 weave kernel: RDX: 000056434f5eb300 RSI: 000000000000000f RDI: 000056434f3ca6c0
> > > > Nov 22 08:19:01 weave kernel: RBP: 00000000000000ec R08: 00007f97e2453000 R09: 000056434f5eb3ea
> > > > Nov 22 08:19:01 weave kernel: R10: 000056434f5eb3eb R11: 000056434f4510a0 R12: 000000000000003a
> > > > Nov 22 08:19:01 weave kernel: R13: 000056434f3ca500 R14: 000056434f451240 R15: 00007f97e1024750
> > > > Nov 22 08:19:01 weave kernel: Code: f7 49 89 9d a0 d1 9b 81 48 89 55 98 4c 8d 63 10 e8 4f 02 53 00 eb 20 48 83 7d 98 00 74 3a e8 21 6e 06 00 80 7d c0 00 74 3f fb f4 <fa> 66 66 90 66 66 90 e8 7d 6f 06 00 80 7d c0 00 75 da 48 8d b5
> > > > Nov 22 08:19:01 weave kernel: RIP: kvm_async_pf_task_wait+0x167/0x200 RSP: ffffc900001ffa10
> > > > Nov 22 08:19:01 weave kernel: ---[ end trace 4701012ee256be25 ]---
> > > >
> > > > Does that help?
> > > >
> > > So all guest kernels are 4.14, or also other older kernel?
> > > 
> > > > Greetings
> > > > Marc
> > > 
> > > Regards,
> > > Jack
> > > 
> > > >
> > > > --
> > > > -----------------------------------------------------------------------------
> > > > Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> > > > Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> > > > Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
> > 
> > -- 
> > -----------------------------------------------------------------------------
> > Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> > Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> > Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
> 
> -- 
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2018-02-11 13:39         ` Marc Haber
@ 2018-02-14  2:04           ` Josh Poimboeuf
  2018-02-14  2:56             ` Haozhong Zhang
  2018-02-14 10:23             ` Paolo Bonzini
  0 siblings, 2 replies; 18+ messages in thread
From: Josh Poimboeuf @ 2018-02-14  2:04 UTC (permalink / raw)
  To: Marc Haber
  Cc: 王金浦, LKML, KVM-ML (kvm@vger.kernel.org),
	x86, Paolo Bonzini

On Sun, Feb 11, 2018 at 02:39:41PM +0100, Marc Haber wrote:
> Hi,
> 
> after in total nine weeks of bisecting, broken filesystems, service
> outages (thankfully on unportant systems), 4.15 seems to have fixed the
> issue. After going to 4.15, the crashes never happened again.
> 
> They have, however, happened with each and every 4.14 release I tried,
> which I stopped doing with 4.14.15 on Jan 28.
> 
> This means, for me, that the issue is fixed and that I have just wasted
> nine weeks of time.
> 
> For you, this means that you have a crippling, data-eating issue in the
> current long-term releae kernel. I do sincerely hope that I never have
> to lay my eye on any 4.14 kernel and hope that no major distribution
> will release with this version.

I saw something similar today, also in kvm_async_pf_task_wait().  I had
-tip in the guest (based on 4.16.0-rc1) and Fedora
4.14.16-300.fc27.x86_64 on the host.

In my case the double fault seemed to be caused by a corrupt CR3.  The
#DF hit when trying to call schedule() with the bad CR3, immediately
after enabling interrupts.  So there could be something in the interrupt
path which is corrupting CR3, but I audited that path in the guest
kernel (which had PTI enabled) and it looks straightforward.

The reason I think CR3 was corrupt is because some earlier stack traces
(which I forced as part of my unwinder testing) showed a kernel CR3
value of 0x130ada005, but the #DF output showed it as 0x2212006.  And
that also explains why a call instruction would result in a #DF.  But I
have no clue *how* it got corrupted.

I don't know if the bug is in the host (4.14) or the guest (4.16).
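
For reference, the window I mean is the wait loop in
kvm_async_pf_task_wait() (arch/x86/kernel/kvm.c); slightly simplified
(from memory, so take it as a sketch rather than verbatim source), it
looks like:

	for (;;) {
		if (!n.halted)
			prepare_to_swait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
		if (hlist_unhashed(&n.link))
			break;				/* "page ready" already arrived */

		if (!n.halted) {
			local_irq_enable();
			schedule();			/* <-- #DF'd here with the bad CR3 */
			local_irq_disable();
		} else {
			native_safe_halt();		/* sti; hlt */
			local_irq_disable();		/* cli */
		}
	}

so the fault fires in the small window right after interrupts are
re-enabled, before or inside the call to schedule().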

[ 4031.436692] PANIC: double fault, error_code: 0x0
[ 4031.439937] CPU: 1 PID: 1227 Comm: kworker/1:1 Not tainted 4.16.0-rc1+ #12
[ 4031.440632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[ 4031.441475] Workqueue: events netstamp_clear
[ 4031.441897] RIP: 0010:kvm_async_pf_task_wait+0x19d/0x250
[ 4031.442411] RSP: 0000:ffffc90000f5fbc0 EFLAGS: 00000202
[ 4031.442916] RAX: ffff880136048000 RBX: ffffc90000f5fbe0 RCX: 0000000000000006
[ 4031.443601] RDX: 0000000000000006 RSI: ffff880136048a40 RDI: ffff880136048000
[ 4031.444285] RBP: ffffc90000f5fc90 R08: 000005212156f5cf R09: 0000000000000000
[ 4031.444966] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000f5fbf0
[ 4031.445650] R13: 0000000000002e88 R14: ffffffff82ad6360 R15: 0000000000000000
[ 4031.446335] FS:  0000000000000000(0000) GS:ffff88013a800000(0000) knlGS:0000000000000000
[ 4031.447104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4031.447659] CR2: 0000000000006001 CR3: 0000000002212006 CR4: 00000000001606e0
[ 4031.448354] Call Trace:
[ 4031.448602]  ? kvm_clock_read+0x1f/0x30
[ 4031.448985]  ? prepare_to_swait+0x1d/0x70
[ 4031.449384]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 4031.449845]  ? do_async_page_fault+0x67/0x90
[ 4031.450283]  do_async_page_fault+0x67/0x90
[ 4031.450684]  async_page_fault+0x25/0x50
[ 4031.451067] RIP: 0010:text_poke+0x60/0x250
[ 4031.451528] RSP: 0000:ffffc90000f5fd78 EFLAGS: 00010286
[ 4031.452082] RAX: ffffea0000000000 RBX: ffffea000005f200 RCX: ffffffff817c88e9
[ 4031.452843] RDX: 0000000000000001 RSI: ffffc90000f5fdbf RDI: ffffffff817c88e4
[ 4031.453568] RBP: ffffffff817c88e4 R08: 0000000000000000 R09: 0000000000000001
[ 4031.454283] R10: ffffc90000f5fdf0 R11: 104eab8665f42bc7 R12: 0000000000000001
[ 4031.455010] R13: ffffc90000f5fdbf R14: ffffffff817c98e4 R15: 0000000000000000
[ 4031.455719]  ? dev_gro_receive+0x3f4/0x6f0
[ 4031.456123]  ? netif_receive_skb_internal+0x24/0x380
[ 4031.456641]  ? netif_receive_skb_internal+0x29/0x380
[ 4031.457202]  ? netif_receive_skb_internal+0x24/0x380
[ 4031.457743]  ? text_poke+0x28/0x250
[ 4031.458084]  ? netif_receive_skb_internal+0x24/0x380
[ 4031.458567]  ? netif_receive_skb_internal+0x25/0x380
[ 4031.459046]  text_poke_bp+0x55/0xe0
[ 4031.459393]  arch_jump_label_transform+0x90/0xf0
[ 4031.459842]  __jump_label_update+0x63/0x70
[ 4031.460243]  static_key_enable_cpuslocked+0x54/0x80
[ 4031.460713]  static_key_enable+0x16/0x20
[ 4031.461096]  process_one_work+0x266/0x6d0
[ 4031.461506]  worker_thread+0x3a/0x390
[ 4031.462328]  ? process_one_work+0x6d0/0x6d0
[ 4031.463299]  kthread+0x121/0x140
[ 4031.464122]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 4031.465070]  ret_from_fork+0x3a/0x50
[ 4031.465859] Code: 89 58 08 4c 89 f7 49 89 9d 20 35 ad 82 48 89 95 58 ff ff ff 4c 8d 63 10 e8 61 e4 8d 00 eb 22 e8 7a d7 0b 00 fb 66 0f 1f 44 00 00 <e8> be 64 8d 00 fa 66 0f 1f 44 00 00 e8 12 a0 0b 00 e8 cd 7a 0e 
[ 4031.468817] Kernel panic - not syncing: Machine halted.

-- 
Josh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2018-02-14  2:04           ` Josh Poimboeuf
@ 2018-02-14  2:56             ` Haozhong Zhang
  2018-02-14  3:17               ` Josh Poimboeuf
  2018-02-14 10:23             ` Paolo Bonzini
  1 sibling, 1 reply; 18+ messages in thread
From: Haozhong Zhang @ 2018-02-14  2:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Marc Haber, 王金浦,
	LKML, KVM-ML (kvm@vger.kernel.org),
	x86, Paolo Bonzini

On 02/13/18 20:04 -0600, Josh Poimboeuf wrote:
> On Sun, Feb 11, 2018 at 02:39:41PM +0100, Marc Haber wrote:
> > Hi,
> > 
> > after in total nine weeks of bisecting, broken filesystems, service
> > outages (thankfully on unimportant systems), 4.15 seems to have fixed the
> > issue. After going to 4.15, the crashes never happened again.
> > 
> > They have, however, happened with each and every 4.14 release I tried,
> > which I stopped doing with 4.14.15 on Jan 28.
> > 
> > This means, for me, that the issue is fixed and that I have just wasted
> > nine weeks of time.
> > 
> > For you, this means that you have a crippling, data-eating issue in the
> > current long-term release kernel. I do sincerely hope that I never have
> > to lay my eye on any 4.14 kernel and hope that no major distribution
> > will release with this version.
> 
> I saw something similar today, also in kvm_async_pf_task_wait().  I had
> -tip in the guest (based on 4.16.0-rc1) and Fedora
> 4.14.16-300.fc27.x86_64 on the host.

Could you check whether this patch [1] on the host kernel fixes your
issue?  Paolo has sent it to the stable tree, and it may take a while
to be merged into the next 4.14 stable release (perhaps this week).

[1] https://patchwork.kernel.org/patch/10155177/

Haozhong

> 
> In my case the double fault seemed to be caused by a corrupt CR3.  The
> #DF hit when trying to call schedule() with the bad CR3, immediately
> after enabling interrupts.  So there could be something in the interrupt
> path which is corrupting CR3, but I audited that path in the guest
> kernel (which had PTI enabled) and it looks straightforward.
> 
> The reason I think CR3 was corrupt is because some earlier stack traces
> (which I forced as part of my unwinder testing) showed a kernel CR3
> value of 0x130ada005, but the #DF output showed it as 0x2212006.  And
> that also explains why a call instruction would result in a #DF.  But I
> have no clue *how* it got corrupted.
> 
> I don't know if the bug is in the host (4.14) or the guest (4.16).
> 
> [ 4031.436692] PANIC: double fault, error_code: 0x0
> [ 4031.439937] CPU: 1 PID: 1227 Comm: kworker/1:1 Not tainted 4.16.0-rc1+ #12
> [ 4031.440632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [ 4031.441475] Workqueue: events netstamp_clear
> [ 4031.441897] RIP: 0010:kvm_async_pf_task_wait+0x19d/0x250
> [ 4031.442411] RSP: 0000:ffffc90000f5fbc0 EFLAGS: 00000202
> [ 4031.442916] RAX: ffff880136048000 RBX: ffffc90000f5fbe0 RCX: 0000000000000006
> [ 4031.443601] RDX: 0000000000000006 RSI: ffff880136048a40 RDI: ffff880136048000
> [ 4031.444285] RBP: ffffc90000f5fc90 R08: 000005212156f5cf R09: 0000000000000000
> [ 4031.444966] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000f5fbf0
> [ 4031.445650] R13: 0000000000002e88 R14: ffffffff82ad6360 R15: 0000000000000000
> [ 4031.446335] FS:  0000000000000000(0000) GS:ffff88013a800000(0000) knlGS:0000000000000000
> [ 4031.447104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4031.447659] CR2: 0000000000006001 CR3: 0000000002212006 CR4: 00000000001606e0
> [ 4031.448354] Call Trace:
> [ 4031.448602]  ? kvm_clock_read+0x1f/0x30
> [ 4031.448985]  ? prepare_to_swait+0x1d/0x70
> [ 4031.449384]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 4031.449845]  ? do_async_page_fault+0x67/0x90
> [ 4031.450283]  do_async_page_fault+0x67/0x90
> [ 4031.450684]  async_page_fault+0x25/0x50
> [ 4031.451067] RIP: 0010:text_poke+0x60/0x250
> [ 4031.451528] RSP: 0000:ffffc90000f5fd78 EFLAGS: 00010286
> [ 4031.452082] RAX: ffffea0000000000 RBX: ffffea000005f200 RCX: ffffffff817c88e9
> [ 4031.452843] RDX: 0000000000000001 RSI: ffffc90000f5fdbf RDI: ffffffff817c88e4
> [ 4031.453568] RBP: ffffffff817c88e4 R08: 0000000000000000 R09: 0000000000000001
> [ 4031.454283] R10: ffffc90000f5fdf0 R11: 104eab8665f42bc7 R12: 0000000000000001
> [ 4031.455010] R13: ffffc90000f5fdbf R14: ffffffff817c98e4 R15: 0000000000000000
> [ 4031.455719]  ? dev_gro_receive+0x3f4/0x6f0
> [ 4031.456123]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.456641]  ? netif_receive_skb_internal+0x29/0x380
> [ 4031.457202]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.457743]  ? text_poke+0x28/0x250
> [ 4031.458084]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.458567]  ? netif_receive_skb_internal+0x25/0x380
> [ 4031.459046]  text_poke_bp+0x55/0xe0
> [ 4031.459393]  arch_jump_label_transform+0x90/0xf0
> [ 4031.459842]  __jump_label_update+0x63/0x70
> [ 4031.460243]  static_key_enable_cpuslocked+0x54/0x80
> [ 4031.460713]  static_key_enable+0x16/0x20
> [ 4031.461096]  process_one_work+0x266/0x6d0
> [ 4031.461506]  worker_thread+0x3a/0x390
> [ 4031.462328]  ? process_one_work+0x6d0/0x6d0
> [ 4031.463299]  kthread+0x121/0x140
> [ 4031.464122]  ? kthread_create_worker_on_cpu+0x70/0x70
> [ 4031.465070]  ret_from_fork+0x3a/0x50
> [ 4031.465859] Code: 89 58 08 4c 89 f7 49 89 9d 20 35 ad 82 48 89 95 58 ff ff ff 4c 8d 63 10 e8 61 e4 8d 00 eb 22 e8 7a d7 0b 00 fb 66 0f 1f 44 00 00 <e8> be 64 8d 00 fa 66 0f 1f 44 00 00 e8 12 a0 0b 00 e8 cd 7a 0e 
> [ 4031.468817] Kernel panic - not syncing: Machine halted.
> 
> -- 
> Josh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2018-02-14  2:56             ` Haozhong Zhang
@ 2018-02-14  3:17               ` Josh Poimboeuf
  0 siblings, 0 replies; 18+ messages in thread
From: Josh Poimboeuf @ 2018-02-14  3:17 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Marc Haber, 王金浦,
	LKML, KVM-ML (kvm@vger.kernel.org),
	x86, Paolo Bonzini

On Wed, Feb 14, 2018 at 10:56:35AM +0800, Haozhong Zhang wrote:
> On 02/13/18 20:04 -0600, Josh Poimboeuf wrote:
> > On Sun, Feb 11, 2018 at 02:39:41PM +0100, Marc Haber wrote:
> > > Hi,
> > > 
> > > after in total nine weeks of bisecting, broken filesystems, service
> > > outages (thankfully on unimportant systems), 4.15 seems to have fixed the
> > > issue. After going to 4.15, the crashes never happened again.
> > > 
> > > They have, however, happened with each and every 4.14 release I tried,
> > > which I stopped doing with 4.14.15 on Jan 28.
> > > 
> > > This means, for me, that the issue is fixed and that I have just wasted
> > > nine weeks of time.
> > > 
> > > For you, this means that you have a crippling, data-eating issue in the
> > > current long-term release kernel. I do sincerely hope that I never have
> > > to lay my eye on any 4.14 kernel and hope that no major distribution
> > > will release with this version.
> > 
> > I saw something similar today, also in kvm_async_pf_task_wait().  I had
> > -tip in the guest (based on 4.16.0-rc1) and Fedora
> > 4.14.16-300.fc27.x86_64 on the host.
> 
> Could you check whether this patch [1] on the host kernel fixes your
> issue?  Paolo has sent it to the stable tree, and it may take a while
> to be merged into the next 4.14 stable release (perhaps this week).
> 
> [1] https://patchwork.kernel.org/patch/10155177/

Unfortunately I don't know how to recreate it; the VM had been sitting
idle for over an hour when it happened.  I'll leave it running
overnight without the patch and see if it happens again.

-- 
Josh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMs freezing when host is running 4.14
  2018-02-14  2:04           ` Josh Poimboeuf
  2018-02-14  2:56             ` Haozhong Zhang
@ 2018-02-14 10:23             ` Paolo Bonzini
  1 sibling, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2018-02-14 10:23 UTC (permalink / raw)
  To: Josh Poimboeuf, Marc Haber
  Cc: 王金浦, LKML, KVM-ML (kvm@vger.kernel.org), x86

On 14/02/2018 03:04, Josh Poimboeuf wrote:
> On Sun, Feb 11, 2018 at 02:39:41PM +0100, Marc Haber wrote:
>> Hi,
>>
>> after in total nine weeks of bisecting, broken filesystems, service
>> outages (thankfully on unimportant systems), 4.15 seems to have fixed the
>> issue. After going to 4.15, the crashes never happened again.
>>
>> They have, however, happened with each and every 4.14 release I tried,
>> which I stopped doing with 4.14.15 on Jan 28.
>>
>> This means, for me, that the issue is fixed and that I have just wasted
>> nine weeks of time.
>>
>> For you, this means that you have a crippling, data-eating issue in the
>> current long-term release kernel. I do sincerely hope that I never have
>> to lay my eye on any 4.14 kernel and hope that no major distribution
>> will release with this version.
> 
> I saw something similar today, also in kvm_async_pf_task_wait().  I had
> -tip in the guest (based on 4.16.0-rc1) and Fedora
> 4.14.16-300.fc27.x86_64 on the host.

Hi Josh/Marc,

this is fixed by

    commit 2a266f23550be997d783f27e704b9b40c4010292
    Author: Haozhong Zhang <haozhong.zhang@intel.com>
    Date:   Wed Jan 10 21:44:42 2018 +0800

    KVM MMU: check pending exception before injecting APF
    
    For example, when two APF's for page ready happen after one exit and
    the first one becomes pending, the second one will result in #DF.
    Instead, just handle the second page fault synchronously.
    
    Reported-by: Ross Zwisler <zwisler@gmail.com>
    Message-ID: <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
    Reported-by: Alec Blayne <ab@tevsa.net>
    Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

and it will be in 4.14.20.  Unfortunately I only heard about this issue
last week.
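
For anyone following along, the change is essentially a one-line guard on
the host side.  The sketch below paraphrases the logic described in the
commit message rather than quoting the exact hunk (if I remember
correctly, the check lives in KVM's can_do_async_pf() helper in 4.14):

/* Host side (KVM): may this guest fault be handled asynchronously? */
static bool can_do_async_pf(struct kvm_vcpu *vcpu)
{
	if (unlikely(!lapic_in_kernel(vcpu) ||
		     kvm_event_needs_reinjection(vcpu) ||
		     vcpu->arch.exception.pending))	/* <-- new check */
		return false;	/* handle the fault synchronously instead */

	return kvm_x86_ops->interrupt_allowed(vcpu);
}

Per the commit message, injecting a second async PF while the first one
is still pending is what stacked up into the #DF seen in the guest; with
the added check, the second fault is simply handled synchronously.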

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-02-14 10:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 16:18 VMs freezing when host is running 4.14 Marc Haber
2017-11-22  9:39 ` Marc Haber
2017-11-22 15:04   ` 王金浦
2017-11-22 15:52     ` Marc Haber
2017-11-22 16:43       ` Radim Krčmář
2017-11-23 15:20         ` Marc Haber
2017-11-23 15:59           ` Radim Krčmář
2017-11-23 16:18             ` Liran Alon
2017-11-23 16:24               ` Radim Krčmář
2017-11-23 16:26                 ` Liran Alon
2017-11-24  8:08                   ` Marc Haber
2017-12-01 14:43     ` Marc Haber
2018-01-08  9:10       ` Marc Haber
2018-02-11 13:39         ` Marc Haber
2018-02-14  2:04           ` Josh Poimboeuf
2018-02-14  2:56             ` Haozhong Zhang
2018-02-14  3:17               ` Josh Poimboeuf
2018-02-14 10:23             ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.