All of lore.kernel.org
 help / color / mirror / Atom feed
* Hit a hard lockup BUG while started VM with passthrough NIC
@ 2017-09-21  7:35 Gonglei (Arei)
  2017-09-21 15:00 ` Alex Williamson
  0 siblings, 1 reply; 4+ messages in thread
From: Gonglei (Arei) @ 2017-09-21  7:35 UTC (permalink / raw)
  To: Paolo Bonzini, alex.williamson; +Cc: kvm, Zhanghailiang, Huangweidong (C)


Hi Paolo, Alex,

We hit a hard lockup bug in our tests while started VM with passthrough NIC.
Unfortunately, we didn't get the available vmcore to dig into. And it is quite difficult
To reproduce, actually, we only hit this problem once.
Does anyone hit such problem ? Or any idea ?
(For the complete log, please see the attachment).

[93137.701139] ------------[ cut here ]------------
[93137.701146] WARNING: at kernel/workqueue.c:2131 process_one_work+0x2e8/0x470()
[93137.701147] Modules linked in: ext4 jbd2 ixgbe(O) dev_connlimit(O) igb_uio(OE) uio bum(O) ip_set nfnetlink prio(O) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre signo_catch(O) hotpatch(OE) kboxdriver(O) ipmi_devintf ipmi_si ipmi_msghandler kbox(O) mlx5_ib i40e(OE) ib_core ib_addr vxlan coretemp crc32_pclmul ip6_udp_tunnel crc32c_intel udp_tunnel ptp kvm_intel(O) ghash_clmulni_intel aesni_intel dca kvm(O) pps_core lrw gf128mul glue_helper sg ablk_helper cryptd pcspkr mlx5_core acpi_cpufreq shpchp acpi_power_meter acpi_pad remote_trigger(O) nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd sd_mod crc_
 t10dif crct10dif_generic
[93137.701171]  crct10dif_pclmul crct10dif_common mpt3sas raid_class ahci scsi_transport_sas libahci libata dm_mod [last unloaded: ixgbe]
[93137.701178] CPU: 1 PID: 119983 Comm: kworker/0:7 Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
[93137.701179] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
[93137.701183]  0000000000000000 000000005abbce18 ffff880225cb7dd0 ffffffff8164791f
[93137.701185]  ffff880225cb7e08 ffffffff8107b220 ffff88282066c4a8 ffff880868ed3c90
[93137.701186]  ffff881ffec16080 ffff881ffec1c000 0000000000000000 ffff880225cb7e18
[93137.701188]  ffffffff8107b36a ffff880225cb7e60 ffffffff8109de58 ffff881ffec16098
[93137.701189]  ffff880868ed3c00 ffff881ffec16098 ffff880868ed3cc0 ffff880649b09980
[93137.701190]  ffff880868ed3c90 ffff881ffec16080 ffff880225cb7ec0 ffffffff8109eabb
[93137.701191]  ffff880225cb7fd8 0000000000016840 ffff880649b09980 ffff880649b09980
[93137.701193]  ffff880649b09980 ffff8809065bfd38 ffff880868ed3c90 ffffffff8109e9a0
[93137.701194]  0000000000000000 0000000000000000 ffff880225cb7f48 ffffffff810a61ff
[93137.701195]  0000000000000000 ffff880225cb7f68 ffff880868ed3c90 ffff881f00000000
[93137.701196]  ffff882900000000 ffff880225cb7ef8 ffff880225cb7ef8 ffff881f00000000
[93137.701198]  ffff881f00000000 ffff880225cb7f18 ffff880225cb7f18 000000005abbce18
[93137.701199]  ffffffff810a6130 0000000000000000 0000000000000000 ffff8809065bfd38
[93137.701200]  ffffffff81657b98 0000000000000000 0000000000000000 0000000000000000
[93137.701202]  0000000000000000 ffff8809065bfd38 ffffffff810a6130 0000000000000000
[93137.701203]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[93137.701204]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[93137.701205]  ffffffffffffffff 0000000000000000 0000000000000010 0000000000000202
[93137.701207]  ffff880225cb7f58 0000000000000018
[93137.701208] Call Trace:
[93137.701212]  [<ffffffff8164791f>] dump_stack+0x19/0x1b
[93137.701215]  [<ffffffff8107b220>] warn_slowpath_common+0x70/0xb0
[93137.701216]  [<ffffffff8107b36a>] warn_slowpath_null+0x1a/0x20
[93137.701218]  [<ffffffff8109de58>] process_one_work+0x2e8/0x470
[93137.701219]  [<ffffffff8109eabb>] worker_thread+0x11b/0x400
[93137.701221]  [<ffffffff8109e9a0>] ? rescuer_thread+0x400/0x400
[93137.701223]  [<ffffffff810a61ff>] kthread+0xcf/0xe0
[93137.701224]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
[93137.701227]  [<ffffffff81657b98>] ret_from_fork+0x58/0x90
[93137.701228]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
[93137.701229] ---[ end trace 22f458263368a87f ]---
[93139.432279] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 9
[93139.442787] ixgbe 0000:5e:00.1 eth7: VF 9 requested invalid api version 3
[93139.444842] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
[93141.699745] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
[93141.699998] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
[93141.701442] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
[93141.701584] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
[93141.701720] vfio-pci 0000:5e:11.5: irq 242 for MSI/MSI-X
[93142.279728] kvm [186763]: vcpu0 ignored wrmsr: 0x1c9 data 3
[93142.279736] kvm [186763]: vcpu0 ignored wrmsr: 0x1a6 data 11
[93142.279738] kvm [186763]: vcpu0 ignored wrmsr: 0x1a7 data 11
[93142.279740] kvm [186763]: vcpu0 ignored wrmsr: 0x3f6 data 11
[93142.279743] kvm [186763]: vcpu0 ignored wrmsr: 0x3f7 data 11
[93145.421969] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 20
[93145.432473] ixgbe 0000:5e:00.1 eth7: VF 20 requested invalid api version 3
[93145.434743] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
[93156.975002] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
[93156.975280] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
[93156.977146] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
[93156.977386] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
[93156.977654] vfio-pci 0000:5e:11.7: irq 256 for MSI/MSI-X
[93160.455435] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
[93160.455710] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
[93160.457794] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
[93160.458043] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
[93160.458177] vfio-pci 0000:5e:12.1: irq 268 for MSI/MSI-X
[93171.874151] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
[93171.874258] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
[93171.875976] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
[93171.876078] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
[93171.876154] vfio-pci 0000:5e:11.1: irq 272 for MSI/MSI-X
[93172.575386] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
[93172.575518] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
[93172.577258] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
[93172.577379] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
[93172.577450] vfio-pci 0000:5e:12.3: irq 281 for MSI/MSI-X
[93185.691658] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
[93185.691783] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
[93185.693707] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
[93185.693805] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
[93185.693875] vfio-pci 0000:5e:15.1: irq 283 for MSI/MSI-X
[93599.715862] [sched_delayed] sched: RT throttling activated
[93692.404740] !!kbox:catch rlock on cpu 16!starting process!!
[93692.405008] [kbox] num_online_cpus: 63, cur_cpu: 16
[93692.405009] cpu 16 will send nmi, the cpumask: 0xfffffffffffefffe
[93692.417396] smp_nmi_call_function is finished. wait: 1, time: -1, timeout: 996500, cpumask: 0x0
[93692.419042] kbox rlock cpu 16 :sending stop and flush finished
[93698.011186] CPU: 16 PID: 154635 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
[93698.011188] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
[93698.011190] task: ffff88280451cc80 ti: ffff88292b160000 task.ti: ffff88292b160000
[93698.011216] RIP: 0010:[<ffffffffa0589ea6>]  [<ffffffffa0589ea6>] vmx_vcpu_run+0x666/0x750 [kvm_intel]
[93698.011217] RSP: 0018:ffff88292b163cf0  EFLAGS: 00000082
[93698.011218] RAX: 0000000080000202 RBX: ffff8825bffdb400 RCX: ffff8825bffdb400
[93698.011218] RDX: 0000000000000200 RSI: ffff8825bffdb400 RDI: ffff8825bffdb400
[93698.011218] RBP: ffff88292b163d50 R08: ffff7fffffffffff R09: 0000000000000000
[93698.011219] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[93698.011220] R13: 00007f7202374808 R14: 00007f71c2375000 R15: 000000000000007f
[93698.011220] FS:  00007ff6f48d7700(0000) GS:ffff883ffca00000(0000) knlGS:ffff88013fcc0000
[93698.011221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[93698.011221] CR2: 00007f3903526000 CR3: 0000002808a78000 CR4: 00000000003427e0
[93698.011222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[93698.011222] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[93698.011222] Stack:
[93698.011225]  0000000000000000 ffff8825bffdb400 0000000000000000 ffffffff816585c0
[93698.011226]  00029516f44e1ac2 ffff8825bffdb400 00000000dde5f3e3 ffff8825bffdb400
[93698.011226]  0000000000000000 0000000000000003 ffffffffa05855f0 0000000000000010
[93698.011227]  ffff88292b163dd0 ffffffffa03ea569 000000ef00000000 ffff8825bfa62000
[93698.011227]  ffff8825c2b2f2d8 ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048
[93698.011228]  0000000000000001 00000000dde5f3e3 ffffffffa04103b5 ffff8825bffdb400
[93698.011229]  ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048 0000000000000001
[93698.011229]  ffff88292b163e18 ffffffffa03f2285 ffffffee7ffbfaff 00000000dde5f3e3
[93698.011230]  ffff8825bffdb400 ffff8828115bef40 0000000000000000 ffff88279d5be680
[93698.011231]  ffff88280451cc80 ffff88292b163eb0 ffffffffa03d9b51 0000000000000000
[93698.011231]  0000000100000e40 0000000000000001 ffff882800f19b90 ffff88280451cc80
[93698.011232]  ffff882800f19b98 ffff882800f19ba0 0000000000000000 0000000000000000
[93698.011233]  0000000000000001 ffff8827ff05bc10 00000000dde5f3e3 ffff88279d5be680
[93698.011233]  ffff883fcceff590 0000000000000000 0000000000000000 0000000000000001
[93698.011234]  ffff88292b163f28 ffffffff811fdf65 00000000dde5f3e3 ffff8827ff05bc00
[93698.011234]  0000000000000008 0000000000000002 ffff88292b163f10 ffffffffa03e7dc5
[93698.011235]  0000000000000000 ffff88292b163f58 00000000dde5f3e3 ffff88279d5be680
[93698.011236]  000000000000001e 000000000000ae80 0000000000000000 ffff88292b163f78
[93698.011236]  ffffffff811fe1e1 0000000000000000 0000000102d4d010 00000000dde5f3e3
[93698.011237]  0000000000000000 0000000002d4d010 000000000000ae80 000000000082b4e0
[93698.011238]  0000000000000000 00007ff6f48d69f0 ffffffff81657c49 0000000000000246
[93698.011238]  00007ff7012c7000 00000000000000ff 00000000008456f0 0000000000000010
[93698.011239]  ffffffffffffffff 0000000000000000 000000000000ae80 000000000000001e
[93698.011239]  0000000000000010 00007ff6fbdcb7b7 0000000000000033 0000000000000246
[93698.011240] Call Trace:
[93698.011246]  [<ffffffff816585c0>] ? irq_entries_start+0x300/0x400
[93698.011249]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
[93698.011289]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
[93698.011306]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
[93698.011314]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
[93698.011319]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
[93698.011322]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
[93698.011328]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
[93698.011329]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
[93698.011331]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
[93698.011338] Code: 8b 80 3c 02 00 00 a8 10 0f 84 2f fa ff ff e9 12 ff ff ff 66 90 85 c0 0f 89 c1 fc ff ff 48 8b 5d a8 48 89 df e8 1c 2c e5 ff cd 02 <48> 89 df e8 32 2c e5 ff e9 a6 fc ff ff 49 89 f8 49 c1 e8 0d 41 
[93698.011804] kbox:catch rlock!end process!
[93698.014242] collected_len = 893933, LOG_BUF_LEN_LOCAL = 1048576
[93700.363926] !!kbox:catch rlock on cpu 25!starting process!!
[93700.363926] rlock:another rlock on another cpu 25
[93700.363928] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 25
[93700.363933] CPU: 25 PID: 155622 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
[93700.363934] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
[93700.363936]  ffffffff8187c808 000000001c178ae5 ffff883ffcc45ae8 ffffffff8164791f
[93700.363937]  ffff883ffcc45b68 ffffffff8164105f 0000000000000010 ffff883ffcc45b78
[93700.363938]  ffff883ffcc45b18 000000001c178ae5 0000000000000000 0000000000000019
[93700.363938]  0000000000000003 0000000000000000 ffffffff81cc8bfc ffffffff81cff00d
[93700.363939]  0000000000000019 0000000000000000 ffff883ffcc45c40 0000000000000000
[93700.363940]  ffff883ffcc45b80 ffffffff8111e871 ffff883fd07944a8 ffff883ffcc45bf8
[93700.363940]  ffffffff81162091 000000001c178ae5 ffff883fd07944a8 fffffff9f2462404
[93700.363941]  00000000f2462404 0000000000000001 ffff883ffcc4bbe0 ffff883ffcc45bf0
[93700.363941]  000000001c178ae5 0000000000000001 ffff883ffcc4b9e0 ffff883fd07944a8
[93700.363942]  0000000000000021 ffff883ffcc4bbe0 ffff883ffcc45c08 ffffffff81162b64
[93700.363943]  ffff883ffcc45e40 ffffffff810327f8 ffff883ffcc4c330 00000064fcc45c36
[93700.363943]  ffff883ffcc45ef8 00000000fcc45c68 0000000200000000 ffffc90000000000
[93700.363944]  ffffc90029494800 ffff883ffcc45d00 ffffffffa08689e9 0000000000000000
[93700.363944]  ffff883ffcc45c88 ffffffff8130ab22 ffffc900294932f8 000000060dba31b0
[93700.363945]  0000000005080021 ffffffff8130bb13 0000000000000000 0000000000000000
[93700.363946]  0000003f00000000 3235ffff8130dbf4 000000000001ffff 0000000000000000
[93700.363946]  0000000000000000 ffffc900294932d8 0000000000001528 ffff883fd44d9f80
[93700.363947]  00000000000557ec ffff883ffcc45d90 ffffffff813029d1 ffffc900181f0fff
[93700.363947]  ffffc900181f1000 ffffc900181f0fff ffffc900181f1000 ffffffff81962c98
[93700.363948]  ffffc900181f0fff ffff881ffe8d8008 00003ffffffff000 ffffc900181f1000
[93700.363949]  ffffc900181f0fff ffffc900181f1000 0000000000000000 00000000557ec02c
[93700.363949]  ffff883fcac74534 0000000004000000 ffffc900181f0000 ffff883ffcc45d90
[93700.363950]  ffffffff811a97c1 ffff883ffcc45de8 ffffffff813a4954 0000000000000014
[93700.363950]  0000000000000014 0000000000000000 00000001813a27ff ffff883fca557758
[93700.363951] Call Trace:
[93700.363960]  <NMI>  [<ffffffff8164791f>] dump_stack+0x19/0x1b
[93700.363964]  [<ffffffff8164105f>] panic+0xd8/0x214
[93700.363968]  [<ffffffff8111e871>] watchdog_overflow_callback+0xd1/0xe0
[93700.363972]  [<ffffffff81162091>] __perf_event_overflow+0xa1/0x250
[93700.363974]  [<ffffffff81162b64>] perf_event_overflow+0x14/0x20
[93700.363977]  [<ffffffff810327f8>] intel_pmu_handle_irq+0x1e8/0x470
[93700.363992]  [<ffffffff8130ab22>] ? put_dec+0x72/0x90
[93700.363993]  [<ffffffff8130bb13>] ? number.isra.2+0x323/0x360
[93700.363995]  [<ffffffff813029d1>] ? ioremap_page_range+0x241/0x320
[93700.363999]  [<ffffffff811a97c1>] ? unmap_kernel_range_noflush+0x11/0x20
[93700.364003]  [<ffffffff813a4954>] ? ghes_copy_tofrom_phys+0x124/0x210
[93700.364005]  [<ffffffff813a4ae0>] ? ghes_read_estatus+0xa0/0x190
[93700.364009]  [<ffffffff81650f9b>] perf_event_nmi_handler+0x2b/0x50
[93700.364010]  [<ffffffff81650619>] nmi_handle.isra.0+0x69/0xb0
[93700.364012]  [<ffffffff81650831>] do_nmi+0x1d1/0x410
[93700.364013]  [<ffffffff8164fa53>] end_repeat_nmi+0x1e/0x2e
[93700.364027]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
[93700.364029]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
[93700.364031]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
[93700.364033]  <<EOE>>  [<ffffffff81658830>] ? uv_bau_message_intr1+0x80/0x80
[93700.364035]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
[93700.364074]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
[93700.364087]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
[93700.364095]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
[93700.364100]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
[93700.364103]  [<ffffffff810e7b72>] ? do_futex+0x122/0x5b0
[93700.364105]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
[93700.364107]  [<ffffffff81235833>] ? anon_inode_getfile+0xd3/0x170
[93700.364113]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
[93700.364113]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
[93700.364115]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
[93701.422730] Shutting down cpus with NMI
[93701.608323] rlock even has been record!
[93701.608324] kbox: notify die begin
[93701.608324] kbox: no notify die func register. no need to notify
[93720.962506] kbox rlock callback end!


Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Hit a hard lockup BUG while started VM with passthrough NIC
  2017-09-21  7:35 Hit a hard lockup BUG while started VM with passthrough NIC Gonglei (Arei)
@ 2017-09-21 15:00 ` Alex Williamson
  2017-09-21 15:10   ` Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Alex Williamson @ 2017-09-21 15:00 UTC (permalink / raw)
  To: Gonglei (Arei); +Cc: Paolo Bonzini, kvm, Zhanghailiang, Huangweidong (C)

On Thu, 21 Sep 2017 07:35:20 +0000
"Gonglei (Arei)" <arei.gonglei@huawei.com> wrote:

> Hi Paolo, Alex,
> 
> We hit a hard lockup bug in our tests while started VM with passthrough NIC.
> Unfortunately, we didn't get the available vmcore to dig into. And it is quite difficult
> To reproduce, actually, we only hit this problem once.
> Does anyone hit such problem ? Or any idea ?
> (For the complete log, please see the attachment).

This is not an upstream or a RHEL kernel, so I don't have the sources
to do much analysis.  Based on the kernel version number, I'm guessing
this is some derivative of a RHEL-7.2 kernel.  Can it be reproduced on
upstream (this is an upstream list)?  The ghes functions in the dump
might indicate might indicate a hardware error was triggered and the
firmware logs might provide more indication of the problem.  Thanks,

Alex

> [93137.701139] ------------[ cut here ]------------
> [93137.701146] WARNING: at kernel/workqueue.c:2131 process_one_work+0x2e8/0x470()
> [93137.701147] Modules linked in: ext4 jbd2 ixgbe(O) dev_connlimit(O) igb_uio(OE) uio bum(O) ip_set nfnetlink prio(O) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre signo_catch(O) hotpatch(OE) kboxdriver(O) ipmi_devintf ipmi_si ipmi_msghandler kbox(O) mlx5_ib i40e(OE) ib_core ib_addr vxlan coretemp crc32_pclmul ip6_udp_tunnel crc32c_intel udp_tunnel ptp kvm_intel(O) ghash_clmulni_intel aesni_intel dca kvm(O) pps_core lrw gf128mul glue_helper sg ablk_helper cryptd pcspkr mlx5_core acpi_cpufreq shpchp acpi_power_meter acpi_pad remote_trigger(O) nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd sd_mod cr
 c_t10dif crct10dif_generic
> [93137.701171]  crct10dif_pclmul crct10dif_common mpt3sas raid_class ahci scsi_transport_sas libahci libata dm_mod [last unloaded: ixgbe]
> [93137.701178] CPU: 1 PID: 119983 Comm: kworker/0:7 Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93137.701179] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93137.701183]  0000000000000000 000000005abbce18 ffff880225cb7dd0 ffffffff8164791f
> [93137.701185]  ffff880225cb7e08 ffffffff8107b220 ffff88282066c4a8 ffff880868ed3c90
> [93137.701186]  ffff881ffec16080 ffff881ffec1c000 0000000000000000 ffff880225cb7e18
> [93137.701188]  ffffffff8107b36a ffff880225cb7e60 ffffffff8109de58 ffff881ffec16098
> [93137.701189]  ffff880868ed3c00 ffff881ffec16098 ffff880868ed3cc0 ffff880649b09980
> [93137.701190]  ffff880868ed3c90 ffff881ffec16080 ffff880225cb7ec0 ffffffff8109eabb
> [93137.701191]  ffff880225cb7fd8 0000000000016840 ffff880649b09980 ffff880649b09980
> [93137.701193]  ffff880649b09980 ffff8809065bfd38 ffff880868ed3c90 ffffffff8109e9a0
> [93137.701194]  0000000000000000 0000000000000000 ffff880225cb7f48 ffffffff810a61ff
> [93137.701195]  0000000000000000 ffff880225cb7f68 ffff880868ed3c90 ffff881f00000000
> [93137.701196]  ffff882900000000 ffff880225cb7ef8 ffff880225cb7ef8 ffff881f00000000
> [93137.701198]  ffff881f00000000 ffff880225cb7f18 ffff880225cb7f18 000000005abbce18
> [93137.701199]  ffffffff810a6130 0000000000000000 0000000000000000 ffff8809065bfd38
> [93137.701200]  ffffffff81657b98 0000000000000000 0000000000000000 0000000000000000
> [93137.701202]  0000000000000000 ffff8809065bfd38 ffffffff810a6130 0000000000000000
> [93137.701203]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701204]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701205]  ffffffffffffffff 0000000000000000 0000000000000010 0000000000000202
> [93137.701207]  ffff880225cb7f58 0000000000000018
> [93137.701208] Call Trace:
> [93137.701212]  [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93137.701215]  [<ffffffff8107b220>] warn_slowpath_common+0x70/0xb0
> [93137.701216]  [<ffffffff8107b36a>] warn_slowpath_null+0x1a/0x20
> [93137.701218]  [<ffffffff8109de58>] process_one_work+0x2e8/0x470
> [93137.701219]  [<ffffffff8109eabb>] worker_thread+0x11b/0x400
> [93137.701221]  [<ffffffff8109e9a0>] ? rescuer_thread+0x400/0x400
> [93137.701223]  [<ffffffff810a61ff>] kthread+0xcf/0xe0
> [93137.701224]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701227]  [<ffffffff81657b98>] ret_from_fork+0x58/0x90
> [93137.701228]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701229] ---[ end trace 22f458263368a87f ]---
> [93139.432279] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 9
> [93139.442787] ixgbe 0000:5e:00.1 eth7: VF 9 requested invalid api version 3
> [93139.444842] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93141.699745] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.699998] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701442] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.701584] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701720] vfio-pci 0000:5e:11.5: irq 242 for MSI/MSI-X
> [93142.279728] kvm [186763]: vcpu0 ignored wrmsr: 0x1c9 data 3
> [93142.279736] kvm [186763]: vcpu0 ignored wrmsr: 0x1a6 data 11
> [93142.279738] kvm [186763]: vcpu0 ignored wrmsr: 0x1a7 data 11
> [93142.279740] kvm [186763]: vcpu0 ignored wrmsr: 0x3f6 data 11
> [93142.279743] kvm [186763]: vcpu0 ignored wrmsr: 0x3f7 data 11
> [93145.421969] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 20
> [93145.432473] ixgbe 0000:5e:00.1 eth7: VF 20 requested invalid api version 3
> [93145.434743] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93156.975002] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.975280] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977146] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.977386] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977654] vfio-pci 0000:5e:11.7: irq 256 for MSI/MSI-X
> [93160.455435] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.455710] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.457794] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.458043] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.458177] vfio-pci 0000:5e:12.1: irq 268 for MSI/MSI-X
> [93171.874151] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.874258] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.875976] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.876078] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.876154] vfio-pci 0000:5e:11.1: irq 272 for MSI/MSI-X
> [93172.575386] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.575518] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577258] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.577379] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577450] vfio-pci 0000:5e:12.3: irq 281 for MSI/MSI-X
> [93185.691658] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.691783] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693707] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.693805] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693875] vfio-pci 0000:5e:15.1: irq 283 for MSI/MSI-X
> [93599.715862] [sched_delayed] sched: RT throttling activated
> [93692.404740] !!kbox:catch rlock on cpu 16!starting process!!
> [93692.405008] [kbox] num_online_cpus: 63, cur_cpu: 16
> [93692.405009] cpu 16 will send nmi, the cpumask: 0xfffffffffffefffe
> [93692.417396] smp_nmi_call_function is finished. wait: 1, time: -1, timeout: 996500, cpumask: 0x0
> [93692.419042] kbox rlock cpu 16 :sending stop and flush finished
> [93698.011186] CPU: 16 PID: 154635 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93698.011188] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93698.011190] task: ffff88280451cc80 ti: ffff88292b160000 task.ti: ffff88292b160000
> [93698.011216] RIP: 0010:[<ffffffffa0589ea6>]  [<ffffffffa0589ea6>] vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93698.011217] RSP: 0018:ffff88292b163cf0  EFLAGS: 00000082
> [93698.011218] RAX: 0000000080000202 RBX: ffff8825bffdb400 RCX: ffff8825bffdb400
> [93698.011218] RDX: 0000000000000200 RSI: ffff8825bffdb400 RDI: ffff8825bffdb400
> [93698.011218] RBP: ffff88292b163d50 R08: ffff7fffffffffff R09: 0000000000000000
> [93698.011219] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [93698.011220] R13: 00007f7202374808 R14: 00007f71c2375000 R15: 000000000000007f
> [93698.011220] FS:  00007ff6f48d7700(0000) GS:ffff883ffca00000(0000) knlGS:ffff88013fcc0000
> [93698.011221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [93698.011221] CR2: 00007f3903526000 CR3: 0000002808a78000 CR4: 00000000003427e0
> [93698.011222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [93698.011222] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [93698.011222] Stack:
> [93698.011225]  0000000000000000 ffff8825bffdb400 0000000000000000 ffffffff816585c0
> [93698.011226]  00029516f44e1ac2 ffff8825bffdb400 00000000dde5f3e3 ffff8825bffdb400
> [93698.011226]  0000000000000000 0000000000000003 ffffffffa05855f0 0000000000000010
> [93698.011227]  ffff88292b163dd0 ffffffffa03ea569 000000ef00000000 ffff8825bfa62000
> [93698.011227]  ffff8825c2b2f2d8 ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048
> [93698.011228]  0000000000000001 00000000dde5f3e3 ffffffffa04103b5 ffff8825bffdb400
> [93698.011229]  ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048 0000000000000001
> [93698.011229]  ffff88292b163e18 ffffffffa03f2285 ffffffee7ffbfaff 00000000dde5f3e3
> [93698.011230]  ffff8825bffdb400 ffff8828115bef40 0000000000000000 ffff88279d5be680
> [93698.011231]  ffff88280451cc80 ffff88292b163eb0 ffffffffa03d9b51 0000000000000000
> [93698.011231]  0000000100000e40 0000000000000001 ffff882800f19b90 ffff88280451cc80
> [93698.011232]  ffff882800f19b98 ffff882800f19ba0 0000000000000000 0000000000000000
> [93698.011233]  0000000000000001 ffff8827ff05bc10 00000000dde5f3e3 ffff88279d5be680
> [93698.011233]  ffff883fcceff590 0000000000000000 0000000000000000 0000000000000001
> [93698.011234]  ffff88292b163f28 ffffffff811fdf65 00000000dde5f3e3 ffff8827ff05bc00
> [93698.011234]  0000000000000008 0000000000000002 ffff88292b163f10 ffffffffa03e7dc5
> [93698.011235]  0000000000000000 ffff88292b163f58 00000000dde5f3e3 ffff88279d5be680
> [93698.011236]  000000000000001e 000000000000ae80 0000000000000000 ffff88292b163f78
> [93698.011236]  ffffffff811fe1e1 0000000000000000 0000000102d4d010 00000000dde5f3e3
> [93698.011237]  0000000000000000 0000000002d4d010 000000000000ae80 000000000082b4e0
> [93698.011238]  0000000000000000 00007ff6f48d69f0 ffffffff81657c49 0000000000000246
> [93698.011238]  00007ff7012c7000 00000000000000ff 00000000008456f0 0000000000000010
> [93698.011239]  ffffffffffffffff 0000000000000000 000000000000ae80 000000000000001e
> [93698.011239]  0000000000000010 00007ff6fbdcb7b7 0000000000000033 0000000000000246
> [93698.011240] Call Trace:
> [93698.011246]  [<ffffffff816585c0>] ? irq_entries_start+0x300/0x400
> [93698.011249]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93698.011289]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93698.011306]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93698.011314]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93698.011319]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93698.011322]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93698.011328]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93698.011329]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93698.011331]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93698.011338] Code: 8b 80 3c 02 00 00 a8 10 0f 84 2f fa ff ff e9 12 ff ff ff 66 90 85 c0 0f 89 c1 fc ff ff 48 8b 5d a8 48 89 df e8 1c 2c e5 ff cd 02 <48> 89 df e8 32 2c e5 ff e9 a6 fc ff ff 49 89 f8 49 c1 e8 0d 41 
> [93698.011804] kbox:catch rlock!end process!
> [93698.014242] collected_len = 893933, LOG_BUF_LEN_LOCAL = 1048576
> [93700.363926] !!kbox:catch rlock on cpu 25!starting process!!
> [93700.363926] rlock:another rlock on another cpu 25
> [93700.363928] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 25
> [93700.363933] CPU: 25 PID: 155622 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93700.363934] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93700.363936]  ffffffff8187c808 000000001c178ae5 ffff883ffcc45ae8 ffffffff8164791f
> [93700.363937]  ffff883ffcc45b68 ffffffff8164105f 0000000000000010 ffff883ffcc45b78
> [93700.363938]  ffff883ffcc45b18 000000001c178ae5 0000000000000000 0000000000000019
> [93700.363938]  0000000000000003 0000000000000000 ffffffff81cc8bfc ffffffff81cff00d
> [93700.363939]  0000000000000019 0000000000000000 ffff883ffcc45c40 0000000000000000
> [93700.363940]  ffff883ffcc45b80 ffffffff8111e871 ffff883fd07944a8 ffff883ffcc45bf8
> [93700.363940]  ffffffff81162091 000000001c178ae5 ffff883fd07944a8 fffffff9f2462404
> [93700.363941]  00000000f2462404 0000000000000001 ffff883ffcc4bbe0 ffff883ffcc45bf0
> [93700.363941]  000000001c178ae5 0000000000000001 ffff883ffcc4b9e0 ffff883fd07944a8
> [93700.363942]  0000000000000021 ffff883ffcc4bbe0 ffff883ffcc45c08 ffffffff81162b64
> [93700.363943]  ffff883ffcc45e40 ffffffff810327f8 ffff883ffcc4c330 00000064fcc45c36
> [93700.363943]  ffff883ffcc45ef8 00000000fcc45c68 0000000200000000 ffffc90000000000
> [93700.363944]  ffffc90029494800 ffff883ffcc45d00 ffffffffa08689e9 0000000000000000
> [93700.363944]  ffff883ffcc45c88 ffffffff8130ab22 ffffc900294932f8 000000060dba31b0
> [93700.363945]  0000000005080021 ffffffff8130bb13 0000000000000000 0000000000000000
> [93700.363946]  0000003f00000000 3235ffff8130dbf4 000000000001ffff 0000000000000000
> [93700.363946]  0000000000000000 ffffc900294932d8 0000000000001528 ffff883fd44d9f80
> [93700.363947]  00000000000557ec ffff883ffcc45d90 ffffffff813029d1 ffffc900181f0fff
> [93700.363947]  ffffc900181f1000 ffffc900181f0fff ffffc900181f1000 ffffffff81962c98
> [93700.363948]  ffffc900181f0fff ffff881ffe8d8008 00003ffffffff000 ffffc900181f1000
> [93700.363949]  ffffc900181f0fff ffffc900181f1000 0000000000000000 00000000557ec02c
> [93700.363949]  ffff883fcac74534 0000000004000000 ffffc900181f0000 ffff883ffcc45d90
> [93700.363950]  ffffffff811a97c1 ffff883ffcc45de8 ffffffff813a4954 0000000000000014
> [93700.363950]  0000000000000014 0000000000000000 00000001813a27ff ffff883fca557758
> [93700.363951] Call Trace:
> [93700.363960]  <NMI>  [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93700.363964]  [<ffffffff8164105f>] panic+0xd8/0x214
> [93700.363968]  [<ffffffff8111e871>] watchdog_overflow_callback+0xd1/0xe0
> [93700.363972]  [<ffffffff81162091>] __perf_event_overflow+0xa1/0x250
> [93700.363974]  [<ffffffff81162b64>] perf_event_overflow+0x14/0x20
> [93700.363977]  [<ffffffff810327f8>] intel_pmu_handle_irq+0x1e8/0x470
> [93700.363992]  [<ffffffff8130ab22>] ? put_dec+0x72/0x90
> [93700.363993]  [<ffffffff8130bb13>] ? number.isra.2+0x323/0x360
> [93700.363995]  [<ffffffff813029d1>] ? ioremap_page_range+0x241/0x320
> [93700.363999]  [<ffffffff811a97c1>] ? unmap_kernel_range_noflush+0x11/0x20
> [93700.364003]  [<ffffffff813a4954>] ? ghes_copy_tofrom_phys+0x124/0x210
> [93700.364005]  [<ffffffff813a4ae0>] ? ghes_read_estatus+0xa0/0x190
> [93700.364009]  [<ffffffff81650f9b>] perf_event_nmi_handler+0x2b/0x50
> [93700.364010]  [<ffffffff81650619>] nmi_handle.isra.0+0x69/0xb0
> [93700.364012]  [<ffffffff81650831>] do_nmi+0x1d1/0x410
> [93700.364013]  [<ffffffff8164fa53>] end_repeat_nmi+0x1e/0x2e
> [93700.364027]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364029]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364031]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364033]  <<EOE>>  [<ffffffff81658830>] ? uv_bau_message_intr1+0x80/0x80
> [93700.364035]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93700.364074]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93700.364087]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93700.364095]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93700.364100]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93700.364103]  [<ffffffff810e7b72>] ? do_futex+0x122/0x5b0
> [93700.364105]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93700.364107]  [<ffffffff81235833>] ? anon_inode_getfile+0xd3/0x170
> [93700.364113]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93700.364113]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93700.364115]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93701.422730] Shutting down cpus with NMI
> [93701.608323] rlock even has been record!
> [93701.608324] kbox: notify die begin
> [93701.608324] kbox: no notify die func register. no need to notify
> [93720.962506] kbox rlock callback end!
> 
> 
> Regards,
> -Gonglei
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Hit a hard lockup BUG while started VM with passthrough NIC
  2017-09-21 15:00 ` Alex Williamson
@ 2017-09-21 15:10   ` Paolo Bonzini
  2017-09-23  2:46     ` Gonglei (Arei)
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2017-09-21 15:10 UTC (permalink / raw)
  To: Alex Williamson, Gonglei (Arei); +Cc: kvm, Zhanghailiang, Huangweidong (C)

On 21/09/2017 17:00, Alex Williamson wrote:
>>
>> We hit a hard lockup bug in our tests while started VM with passthrough NIC.
>> Unfortunately, we didn't get the available vmcore to dig into. And it is quite difficult
>> To reproduce, actually, we only hit this problem once.
>> Does anyone hit such problem ? Or any idea ?
>> (For the complete log, please see the attachment).
> This is not an upstream or a RHEL kernel, so I don't have the sources
> to do much analysis.  Based on the kernel version number, I'm guessing
> this is some derivative of a RHEL-7.2 kernel.  Can it be reproduced on
> upstream (this is an upstream list)?  The ghes functions in the dump
> might indicate might indicate a hardware error was triggered and the
> firmware logs might provide more indication of the problem.  Thanks,

I don't recall anything with passthrough, but it looks like this was
caused by perf running at the same time as KVM (perf_event_overflow in
the second backtrace, "cd 02" aka NMI in the first backtrace's code
hexdump).

There have been some perf bugs related to nested in RHEL7.2 and RHEL7.3
(start L1, do "perf top" in another terminal, boom when you start L2)
and they seem to be fixed in RHEL7.4.  However, we never tried bisecting
for the fix.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Hit a hard lockup BUG while started VM with passthrough NIC
  2017-09-21 15:10   ` Paolo Bonzini
@ 2017-09-23  2:46     ` Gonglei (Arei)
  0 siblings, 0 replies; 4+ messages in thread
From: Gonglei (Arei) @ 2017-09-23  2:46 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Williamson
  Cc: kvm, Zhanghailiang, Huangweidong (C), longpeng


> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Thursday, September 21, 2017 11:11 PM
> To: Alex Williamson; Gonglei (Arei)
> Cc: kvm@vger.kernel.org; Zhanghailiang; Huangweidong (C)
> Subject: Re: Hit a hard lockup BUG while started VM with passthrough NIC
> 
> On 21/09/2017 17:00, Alex Williamson wrote:
> >>
> >> We hit a hard lockup bug in our tests while started VM with passthrough
> NIC.
> >> Unfortunately, we didn't get the available vmcore to dig into. And it is quite
> difficult
> >> To reproduce, actually, we only hit this problem once.
> >> Does anyone hit such problem ? Or any idea ?
> >> (For the complete log, please see the attachment).
> > This is not an upstream or a RHEL kernel, so I don't have the sources
> > to do much analysis.  Based on the kernel version number, I'm guessing
> > this is some derivative of a RHEL-7.2 kernel.  Can it be reproduced on
> > upstream (this is an upstream list)?  The ghes functions in the dump
> > might indicate might indicate a hardware error was triggered and the
> > firmware logs might provide more indication of the problem.  Thanks,
> 
> I don't recall anything with passthrough, but it looks like this was
> caused by perf running at the same time as KVM (perf_event_overflow in
> the second backtrace, "cd 02" aka NMI in the first backtrace's code
> hexdump).
> 
> There have been some perf bugs related to nested in RHEL7.2 and RHEL7.3
> (start L1, do "perf top" in another terminal, boom when you start L2)
> and they seem to be fixed in RHEL7.4.  However, we never tried bisecting
> for the fix.
> 
Thanks for your feedback. 

1) From the backtrace, we can see the current vcpu is of non-root mode and then vm exit
because an NMI. 

  <NMI>  [<ffffffff8164791f>] dump_stack+0x19/0x1b
  [<ffffffff8164105f>] panic+0xd8/0x214
  [<ffffffff8111e871>] watchdog_overflow_callback+0xd1/0xe0
  [<ffffffff81162091>] __perf_event_overflow+0xa1/0x250
  [<ffffffff81162b64>] perf_event_overflow+0x14/0x20
  [<ffffffff810327f8>] intel_pmu_handle_irq+0x1e8/0x470
  [<ffffffff8130ab22>] ? put_dec+0x72/0x90
  [<ffffffff8130bb13>] ? number.isra.2+0x323/0x360
  [<ffffffff813029d1>] ? ioremap_page_range+0x241/0x320
  [<ffffffff811a97c1>] ? unmap_kernel_range_noflush+0x11/0x20
  [<ffffffff813a4954>] ? ghes_copy_tofrom_phys+0x124/0x210
  [<ffffffff813a4ae0>] ? ghes_read_estatus+0xa0/0x190
  [<ffffffff81650f9b>] perf_event_nmi_handler+0x2b/0x50
  [<ffffffff81650619>] nmi_handle.isra.0+0x69/0xb0
  [<ffffffff81650831>] do_nmi+0x1d1/0x410
  [<ffffffff8164fa53>] end_repeat_nmi+0x1e/0x2e
  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
  <<EOE>>  [<ffffffff81658830>] ? uv_bau_message_intr1+0x80/0x80
  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
  [<ffffffff810e7b72>] ? do_futex+0x122/0x5b0
  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
  [<ffffffff81235833>] ? anon_inode_getfile+0xd3/0x170
  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
Shutting down cpus with NMI

2) As we all know, the kmod is of irq_disabled status in non-root mode (vcpu_enter_guest),
 But if any interrupts occur or pending, the guest will vmexit to handle the interrupt. 

So, we can infer that there isn't interrupt pending during 20s (hard lockup period) according
above two reasons.

But, what about timer interrupt? Our host OS is CentOS 7.2, whose HZ is 1000. That
means timer interrupt will be occurred per second...

So we guess that host's timer maybe the root cause. And we found some clues:  

###num:1, record len:203

console region: kernel reboot time:20170911201235-43cfdkernel reboot time:20170911201235-43cfd
[15141.708423] collected_len = 354176, LOG_BUF_LEN_LOCAL = 1048576
kernel reboot time:20170911201235-43cfd   ---------- 2017 


After kernel panic:

###num:2, record len:114

console region: rlock time:20571104194044-c5c73
rlock time:20571104194044-c5c73        ------------ 2057
rlock time:20571104194044-c5c73

Why the system timer is changed to 40 years later after panic ? Is integer overflow of hrtimer? 
And then cause the hrtimer expired? 

Thanks,
-Gonglei

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-09-23  2:46 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-21  7:35 Hit a hard lockup BUG while started VM with passthrough NIC Gonglei (Arei)
2017-09-21 15:00 ` Alex Williamson
2017-09-21 15:10   ` Paolo Bonzini
2017-09-23  2:46     ` Gonglei (Arei)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.