* [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
@ 2017-03-28 19:58 ` Radha Mohan
  0 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-28 19:58 UTC (permalink / raw)
  To: qemu-devel, kvmarm

Hi,
I am seeing an issue with qemu-system-aarch64 when using pflash
(booting the kernel via a UEFI BIOS).

Host kernel: 4.11.0-rc3-next-20170323
Qemu version: v2.9.0-rc1

Command used:
./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
/root/flash1.img
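
For reference, flash0.img and flash1.img are the usual pair of 64 MiB
pflash images for the virt machine; a typical way to create them is
sketched below (the AAVMF/edk2 firmware path is distro-specific and
only an assumption here):

  # flash0.img: UEFI firmware code, padded to the 64 MiB the virt machine expects
  dd if=/dev/zero of=/root/flash0.img bs=1M count=64
  dd if=/usr/share/AAVMF/AAVMF_CODE.fd of=/root/flash0.img conv=notrunc
  # flash1.img: an empty UEFI variable store of the same size
  dd if=/dev/zero of=/root/flash1.img bs=1M count=64

flash0.img carries the firmware code, and flash1.img is the writable
variable store that the guest UEFI updates at runtime.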


As soon as the guest kernel boots, the host starts to stall and prints
the messages below, and the system never recovers. I can neither power
off the guest nor the host, so I have to resort to an external power
reset of the host.

==================
[  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
[kworker/25:1:454]
[  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
mdio_thunder mdio_cavium

[  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
4.11.0-rc3-next-20170323 #1
[  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
[  116.207010] Workqueue: events netstamp_clear
[  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
[  116.207020] PC is at smp_call_function_many+0x284/0x2e8
[  116.207023] LR is at smp_call_function_many+0x244/0x2e8
[  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
pstate: 80000145
[  116.207028] sp : ffff801f901a7be0
[  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
[  116.207036] x27: ffff000009139434 x26: 0000000000000080
[  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
[  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
[  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
[  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
[  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
[  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
[  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
[  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
[  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
[  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
[  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
[  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
[  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003

[  150.443116] INFO: rcu_sched self-detected stall on CPU
[  150.448261]  25-...: (14997 ticks this GP)
idle=47a/140000000000001/0 softirq=349/349 fqs=7495
[  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
[  150.451123]  25-...: (14997 ticks this GP)
idle=47a/140000000000001/0 softirq=349/349 fqs=7495
[  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
[  150.451136] Task dump for CPU 25:
[  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
[  150.451155] Workqueue: events netstamp_clear
[  150.451158] Call trace:
[  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
[  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
[  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
[  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
[  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
[  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
[  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
[  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
[  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
[  150.482242] Task dump for CPU 25:
[  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
[  150.482259] Workqueue: events netstamp_clear
[  150.482264] Call trace:
[  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
[  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
[  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
[  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
[  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
[  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
[  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
[  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
[  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
[  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
[  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
[  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
[  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
[  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
[  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
[  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
[  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
[  150.482360] 7aa0: 0000000000000003 ffff801f6d682578
[  150.482364] 7ac0: 0000000000000000 0000000000000020 0000000000000020 ffffffff00000000
[  150.482367] 7ae0: 0000000000000000 0000000000000000 0000000000000002 ffff801f901a7d30
[  150.482371] 7b00: 00000000000009f0 00000000366e2f04 0000000000000018 0000000058dabe4c
[  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08 ffff0000082e81d8 0000ffffb77c9028
[  150.482378] 7b40: 0000000000000a03 ffff000009138e38 ffff801f6d5cff08 ffff801f6d5cff00
[  150.482382] 7b60: ffff000008e11e00 0000000000000001 ffff0000081565d0 0000000000000000
[  150.482386] 7b80: 0000000000000080 ffff000009139434 ffff000009139000 ffff801f901a7be0
[  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0 ffff000008156ecc 0000000080000145
[  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68 ffffffffffffffff ffff000008156e8c
[  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
[  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
[  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
[  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
[  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
[  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
[  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
[  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
[  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
[  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
[  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
[  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
[  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
[  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50

====================================

I am observing that this usually happens when the guest tries to bring
up or use the default virtio-net interface. And I am unable to
reproduce this when booting the guest kernel directly, without the
UEFI BIOS.
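
For comparison, the direct boot that does not show the problem is along
these lines (the kernel image, initrd and root= values here are
placeholders, not the exact files used):

  ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm \
      -M virt,gic_version=3 -nographic -smp 1 -m 2048 \
      -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
      -device virtio-blk-device,drive=hd0 \
      -kernel /root/Image -initrd /root/initrd.img \
      -append "root=/dev/vda1 console=ttyAMA0"
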
Has anyone observed a similar issue?

regards,
Radha Mohan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-28 19:58 ` Radha Mohan
@ 2017-03-28 20:16   ` Christoffer Dall
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoffer Dall @ 2017-03-28 20:16 UTC (permalink / raw)
  To: Radha Mohan; +Cc: qemu-devel, kvmarm

Hi Radha,

On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
> Hi,
> I am seeing an issue with qemu-system-aarch64 when using pflash
> (booting kernel via UEFI bios).
> 
> Host kernel: 4.11.0-rc3-next-20170323
> Qemu version: v2.9.0-rc1
> 
> Command used:
> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
> /root/flash1.img
> 
> 
> As soon as the guest kernel boots the host starts to stall and prints
> the below messages. And the system never recovers. I can neither
> poweroff the guest nor the host. So I have resort to external power
> reset of the host.
> 
> ==================
> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
> [kworker/25:1:454]
> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
> mdio_thunder mdio_cavium
> 
> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
> 4.11.0-rc3-next-20170323 #1
> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
> [  116.207010] Workqueue: events netstamp_clear
> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
> pstate: 80000145
> [  116.207028] sp : ffff801f901a7be0
> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
> 
> [  150.443116] INFO: rcu_sched self-detected stall on CPU
> [  150.448261]  25-...: (14997 ticks this GP)
> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  150.451123]  25-...: (14997 ticks this GP)
> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
> [  150.451136] Task dump for CPU 25:
> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
> [  150.451155] Workqueue: events netstamp_clear
> [  150.451158] Call trace:
> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
> [  150.482242] Task dump for CPU 25:
> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
> [  150.482259] Workqueue: events netstamp_clear
> [  150.482264] Call trace:
> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
> [  150.482360] 7aa0:
> 0000000000000003 ffff801f6d682578
> [  150.482364] 7ac0: 0000000000000000 0000000000000020
> 0000000000000020 ffffffff00000000
> [  150.482367] 7ae0: 0000000000000000 0000000000000000
> 0000000000000002 ffff801f901a7d30
> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
> 0000000000000018 0000000058dabe4c
> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
> ffff0000082e81d8 0000ffffb77c9028
> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
> ffff801f6d5cff08 ffff801f6d5cff00
> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
> ffff0000081565d0 0000000000000000
> [  150.482386] 7b80: 0000000000000080 ffff000009139434
> ffff000009139000 ffff801f901a7be0
> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
> ffff000008156ecc 0000000080000145
> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
> ffffffffffffffff ffff000008156e8c
> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> 
> ====================================
> 
> I am observing that this usually happens when the guest tries to
> bringup or use the default virtio-net interface.
> And I am unable to reproduce this when directly booting the guest
> kernel without UEFI BIOS.
> So anyone observed similar issue ?
> 

I haven't seen this myself.

Have you by any chance tried other versions of QEMU and the host
kernel with the same setup?

Thanks for reporting this,
-Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-28 20:16   ` Christoffer Dall
@ 2017-03-28 20:24     ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-28 20:24 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: qemu-devel, kvmarm

On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
> Hi Radha,
>
> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>> Hi,
>> I am seeing an issue with qemu-system-aarch64 when using pflash
>> (booting kernel via UEFI bios).
>>
>> Host kernel: 4.11.0-rc3-next-20170323
>> Qemu version: v2.9.0-rc1
>>
>> Command used:
>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>> /root/flash1.img
>>
>>
>> As soon as the guest kernel boots the host starts to stall and prints
>> the below messages. And the system never recovers. I can neither
>> poweroff the guest nor the host. So I have resort to external power
>> reset of the host.
>>
>> ==================
>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>> [kworker/25:1:454]
>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>> mdio_thunder mdio_cavium
>>
>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>> 4.11.0-rc3-next-20170323 #1
>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>> [  116.207010] Workqueue: events netstamp_clear
>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>> pstate: 80000145
>> [  116.207028] sp : ffff801f901a7be0
>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>
>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>> [  150.448261]  25-...: (14997 ticks this GP)
>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>> [  150.451123]  25-...: (14997 ticks this GP)
>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>> [  150.451136] Task dump for CPU 25:
>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>> [  150.451155] Workqueue: events netstamp_clear
>> [  150.451158] Call trace:
>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>> [  150.482242] Task dump for CPU 25:
>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>> [  150.482259] Workqueue: events netstamp_clear
>> [  150.482264] Call trace:
>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>> [  150.482360] 7aa0:
>> 0000000000000003 ffff801f6d682578
>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>> 0000000000000020 ffffffff00000000
>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>> 0000000000000002 ffff801f901a7d30
>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>> 0000000000000018 0000000058dabe4c
>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>> ffff0000082e81d8 0000ffffb77c9028
>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>> ffff801f6d5cff08 ffff801f6d5cff00
>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>> ffff0000081565d0 0000000000000000
>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>> ffff000009139000 ffff801f901a7be0
>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>> ffff000008156ecc 0000000080000145
>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>> ffffffffffffffff ffff000008156e8c
>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>
>> ====================================
>>
>> I am observing that this usually happens when the guest tries to
>> bringup or use the default virtio-net interface.
>> And I am unable to reproduce this when directly booting the guest
>> kernel without UEFI BIOS.
>> So anyone observed similar issue ?
>>
>
> I haven't seen this myself.
>
> Have you tried with other versions of QEMU and the host kernel by any
> chance using the same setup?

Earlier QEMU versions (at least 2.5, which is the default on Ubuntu
16.04) have a different issue: the same thing happens when I use
vhost=on, but without vhost, with plain virtio-net, I didn't see the
issue. The Armband folks have also reported the vhost issue here:
https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
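
Roughly, the comparison looks like this; the tap device and the script
options below are placeholders for illustration, not the exact setup:

  # plain virtio-net, no vhost: this case looked fine on the older QEMU
  ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm \
      -M virt,gic_version=3 -nographic -smp 1 -m 2048 \
      -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
      -device virtio-blk-device,drive=hd0 \
      -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=off \
      -device virtio-net-device,netdev=net0 \
      -pflash /root/flash0.img -pflash /root/flash1.img
  # changing vhost=off to vhost=on on the -netdev line is the case that
  # stalls the host with the older QEMU versions mentioned above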

So I am beginning to think the issue could be related to the flash
emulation, but I wanted to check on the mailing list whether anyone
else is seeing it. But if it is the flash emulation, how can it hose
the host like this?

I will also try with QEMU 2.8 and see.

>
> Thanks for reporting this,
> -Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-28 20:24     ` Radha Mohan
@ 2017-03-29 18:17       ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-29 18:17 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: qemu-devel, kvmarm

On Tue, Mar 28, 2017 at 1:24 PM, Radha Mohan <mohun106@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>> Hi Radha,
>>
>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>> Hi,
>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>> (booting kernel via UEFI bios).
>>>
>>> Host kernel: 4.11.0-rc3-next-20170323
>>> Qemu version: v2.9.0-rc1
>>>
>>> Command used:
>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>> /root/flash1.img
>>>
>>>
>>> As soon as the guest kernel boots the host starts to stall and prints
>>> the below messages. And the system never recovers. I can neither
>>> poweroff the guest nor the host. So I have resort to external power
>>> reset of the host.
>>>
>>> ==================
>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>> [kworker/25:1:454]
>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>> mdio_thunder mdio_cavium
>>>
>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>> 4.11.0-rc3-next-20170323 #1
>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>> [  116.207010] Workqueue: events netstamp_clear
>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>> pstate: 80000145
>>> [  116.207028] sp : ffff801f901a7be0
>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>
>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>> [  150.448261]  25-...: (14997 ticks this GP)
>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [  150.451123]  25-...: (14997 ticks this GP)
>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>> [  150.451136] Task dump for CPU 25:
>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>> [  150.451155] Workqueue: events netstamp_clear
>>> [  150.451158] Call trace:
>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>> [  150.482242] Task dump for CPU 25:
>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>> [  150.482259] Workqueue: events netstamp_clear
>>> [  150.482264] Call trace:
>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>> [  150.482360] 7aa0:
>>> 0000000000000003 ffff801f6d682578
>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>> 0000000000000020 ffffffff00000000
>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>> 0000000000000002 ffff801f901a7d30
>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>> 0000000000000018 0000000058dabe4c
>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>> ffff0000082e81d8 0000ffffb77c9028
>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>> ffff801f6d5cff08 ffff801f6d5cff00
>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>> ffff0000081565d0 0000000000000000
>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>> ffff000009139000 ffff801f901a7be0
>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>> ffff000008156ecc 0000000080000145
>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>> ffffffffffffffff ffff000008156e8c
>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>
>>> ====================================
>>>
>>> I am observing that this usually happens when the guest tries to
>>> bringup or use the default virtio-net interface.
>>> And I am unable to reproduce this when directly booting the guest
>>> kernel without UEFI BIOS.
>>> So anyone observed similar issue ?
>>>
>>
>> I haven't seen this myself.
>>
>> Have you tried with other versions of QEMU and the host kernel by any
>> chance using the same setup?
>
> The earlier qemu version (atleast 2.5 which is default on ubuntu
> 16.04) have different issue. The same thing happens when I use
> vhost=on but without vhost and with virtio-net didn't see the issue.
> The Armband folks have also reported the vhost issue here
> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>
> So I am beginning to think the issue could be related to the flash
> emulation but wanted to check on ml if anyone else is seeing it. But
> if it is flash emulation how can it hose the host like this ?
>
> I will also try with qemu 2.8 and see.

Same issue with qemu versions 2.8.0 and 2.7.1.

>
>>
>> Thanks for reporting this,
>> -Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 18:17       ` Radha Mohan
@ 2017-03-29 18:34         ` Peter Maydell
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Maydell @ 2017-03-29 18:34 UTC (permalink / raw)
  To: Radha Mohan; +Cc: Christoffer Dall, QEMU Developers, kvmarm

On 29 March 2017 at 19:17, Radha Mohan <mohun106@gmail.com> wrote:
>> I will also try with qemu 2.8 and see.
>
> Same issue with 2.8.0 and 2.7.1 qemu version.

Thanks for checking -- good to know it's not a regression
on the QEMU side. (If you can soft-lockup the host it's
clearly a host kernel bug, but it might have been that
we changed something recently to provoke it.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-28 20:24     ` Radha Mohan
@ 2017-03-29 18:56       ` Christoffer Dall
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoffer Dall @ 2017-03-29 18:56 UTC (permalink / raw)
  To: Radha Mohan; +Cc: qemu-devel, kvmarm

On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
> > Hi Radha,
> >
> > On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
> >> Hi,
> >> I am seeing an issue with qemu-system-aarch64 when using pflash
> >> (booting kernel via UEFI bios).
> >>
> >> Host kernel: 4.11.0-rc3-next-20170323
> >> Qemu version: v2.9.0-rc1
> >>
> >> Command used:
> >> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
> >> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
> >> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
> >> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
> >> /root/flash1.img
> >>
> >>
> >> As soon as the guest kernel boots the host starts to stall and prints
> >> the below messages. And the system never recovers. I can neither
> >> poweroff the guest nor the host. So I have resort to external power
> >> reset of the host.
> >>
> >> ==================
> >> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
> >> [kworker/25:1:454]
> >> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
> >> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
> >> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
> >> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> >> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
> >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
> >> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
> >> mdio_thunder mdio_cavium
> >>
> >> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
> >> 4.11.0-rc3-next-20170323 #1
> >> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
> >> [  116.207010] Workqueue: events netstamp_clear
> >> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
> >> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
> >> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
> >> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
> >> pstate: 80000145
> >> [  116.207028] sp : ffff801f901a7be0
> >> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
> >> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
> >> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
> >> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
> >> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
> >> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
> >> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
> >> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
> >> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
> >> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
> >> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
> >> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
> >> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
> >> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
> >> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
> >>
> >> [  150.443116] INFO: rcu_sched self-detected stall on CPU
> >> [  150.448261]  25-...: (14997 ticks this GP)
> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> >> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
> >> [  150.451123]  25-...: (14997 ticks this GP)
> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> >> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
> >> [  150.451136] Task dump for CPU 25:
> >> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
> >> [  150.451155] Workqueue: events netstamp_clear
> >> [  150.451158] Call trace:
> >> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
> >> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> >> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> >> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
> >> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> >> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> >> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
> >> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> >> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
> >> [  150.482242] Task dump for CPU 25:
> >> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
> >> [  150.482259] Workqueue: events netstamp_clear
> >> [  150.482264] Call trace:
> >> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
> >> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
> >> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
> >> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
> >> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
> >> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
> >> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
> >> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
> >> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
> >> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
> >> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
> >> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
> >> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
> >> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
> >> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
> >> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
> >> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
> >> [  150.482360] 7aa0:
> >> 0000000000000003 ffff801f6d682578
> >> [  150.482364] 7ac0: 0000000000000000 0000000000000020
> >> 0000000000000020 ffffffff00000000
> >> [  150.482367] 7ae0: 0000000000000000 0000000000000000
> >> 0000000000000002 ffff801f901a7d30
> >> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
> >> 0000000000000018 0000000058dabe4c
> >> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
> >> ffff0000082e81d8 0000ffffb77c9028
> >> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
> >> ffff801f6d5cff08 ffff801f6d5cff00
> >> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
> >> ffff0000081565d0 0000000000000000
> >> [  150.482386] 7b80: 0000000000000080 ffff000009139434
> >> ffff000009139000 ffff801f901a7be0
> >> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
> >> ffff000008156ecc 0000000080000145
> >> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
> >> ffffffffffffffff ffff000008156e8c
> >> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
> >> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
> >> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
> >> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
> >> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
> >> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
> >> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
> >> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> >> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> >> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
> >> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> >> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> >> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
> >> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> >>
> >> ====================================
> >>
> >> I am observing that this usually happens when the guest tries to
> >> bringup or use the default virtio-net interface.
> >> And I am unable to reproduce this when directly booting the guest
> >> kernel without UEFI BIOS.
> >> So anyone observed similar issue ?
> >>
> >
> > I haven't seen this myself.
> >
> > Have you tried with other versions of QEMU and the host kernel by any
> > chance using the same setup?
> 
> The earlier qemu version (atleast 2.5 which is default on ubuntu
> 16.04) have different issue. The same thing happens when I use
> vhost=on but without vhost and with virtio-net didn't see the issue.
> The Armband folks have also reported the vhost issue here
> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564

So I just ran your setup using the flash emulation on the ThunderX in
packet.net and my host is happily continuing to run hackbench.
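
(A sketch of how to keep the host loaded for this kind of check -- plain
hackbench from rt-tests with its default arguments, looped from a shell;
the loop itself is my own addition, not from this thread:)

# run a scheduler-heavy workload on the host while the guest boots
while true; do hackbench; done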

> 
> So I am beginning to think the issue could be related to the flash
> emulation but wanted to check on ml if anyone else is seeing it. But
> if it is flash emulation how can it hose the host like this ?

Why do you think it's related to the flash emulation?

From what you're telling me it sounds much more like a vhost regression.

Can you try v4.10 of the host kernel with vhost=on?
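
(For reference, a minimal sketch of what such a run could look like,
adapted from the command line earlier in the thread -- the tap interface
name, the MAC address, and the assumption that tap0 is already set up on
the host are mine, not from this thread:)

# vhost=on is an option of the tap backend, so an explicit
# -netdev tap,...,vhost=on replaces the default user-mode NIC here
./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M virt,gic_version=3 \
    -nographic -smp 1 -m 2048 \
    -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
    -device virtio-blk-device,drive=hd0 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on \
    -device virtio-net-device,netdev=net0,mac=52:54:00:12:34:56 \
    -pflash /root/flash0.img -pflash /root/flash1.img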

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 18:56       ` Christoffer Dall
@ 2017-03-29 20:51         ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-29 20:51 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: qemu-devel, kvmarm

On Wed, Mar 29, 2017 at 11:56 AM, Christoffer Dall <cdall@linaro.org> wrote:
> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>> > Hi Radha,
>> >
>> > On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>> >> Hi,
>> >> I am seeing an issue with qemu-system-aarch64 when using pflash
>> >> (booting kernel via UEFI bios).
>> >>
>> >> Host kernel: 4.11.0-rc3-next-20170323
>> >> Qemu version: v2.9.0-rc1
>> >>
>> >> Command used:
>> >> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>> >> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>> >> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>> >> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>> >> /root/flash1.img
>> >>
>> >>
>> >> As soon as the guest kernel boots the host starts to stall and prints
>> >> the below messages. And the system never recovers. I can neither
>> >> poweroff the guest nor the host. So I have resort to external power
>> >> reset of the host.
>> >>
>> >> ==================
>> >> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>> >> [kworker/25:1:454]
>> >> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>> >> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>> >> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>> >> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>> >> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>> >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>> >> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>> >> mdio_thunder mdio_cavium
>> >>
>> >> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>> >> 4.11.0-rc3-next-20170323 #1
>> >> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>> >> [  116.207010] Workqueue: events netstamp_clear
>> >> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>> >> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>> >> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>> >> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>> >> pstate: 80000145
>> >> [  116.207028] sp : ffff801f901a7be0
>> >> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>> >> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>> >> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>> >> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>> >> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>> >> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>> >> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>> >> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>> >> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>> >> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>> >> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>> >> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>> >> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>> >> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>> >> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>> >>
>> >> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>> >> [  150.448261]  25-...: (14997 ticks this GP)
>> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>> >> [  150.451123]  25-...: (14997 ticks this GP)
>> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>> >> [  150.451136] Task dump for CPU 25:
>> >> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>> >> [  150.451155] Workqueue: events netstamp_clear
>> >> [  150.451158] Call trace:
>> >> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>> >> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> >> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> >> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> >> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> >> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> >> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>> >> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>> >> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>> >> [  150.482242] Task dump for CPU 25:
>> >> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>> >> [  150.482259] Workqueue: events netstamp_clear
>> >> [  150.482264] Call trace:
>> >> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>> >> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>> >> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>> >> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>> >> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>> >> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>> >> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>> >> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>> >> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>> >> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>> >> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>> >> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>> >> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>> >> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>> >> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>> >> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>> >> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>> >> [  150.482360] 7aa0:
>> >> 0000000000000003 ffff801f6d682578
>> >> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>> >> 0000000000000020 ffffffff00000000
>> >> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>> >> 0000000000000002 ffff801f901a7d30
>> >> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>> >> 0000000000000018 0000000058dabe4c
>> >> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>> >> ffff0000082e81d8 0000ffffb77c9028
>> >> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>> >> ffff801f6d5cff08 ffff801f6d5cff00
>> >> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>> >> ffff0000081565d0 0000000000000000
>> >> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>> >> ffff000009139000 ffff801f901a7be0
>> >> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>> >> ffff000008156ecc 0000000080000145
>> >> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>> >> ffffffffffffffff ffff000008156e8c
>> >> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>> >> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>> >> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>> >> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>> >> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>> >> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>> >> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>> >> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> >> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> >> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> >> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> >> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> >> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>> >> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>> >>
>> >> ====================================
>> >>
>> >> I am observing that this usually happens when the guest tries to
>> >> bringup or use the default virtio-net interface.
>> >> And I am unable to reproduce this when directly booting the guest
>> >> kernel without UEFI BIOS.
>> >> So anyone observed similar issue ?
>> >>
>> >
>> > I haven't seen this myself.
>> >
>> > Have you tried with other versions of QEMU and the host kernel by any
>> > chance using the same setup?
>>
>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>> 16.04) have different issue. The same thing happens when I use
>> vhost=on but without vhost and with virtio-net didn't see the issue.
>> The Armband folks have also reported the vhost issue here
>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>
> So I just ran your setup using the flash emulation on the ThunderX in
> packet.net and my host is happily continuing to run hackbench.

One more thing I see: the Ubuntu cloud images run a cloud-init service
during boot that does a lot of network initialization, and the bug
appears while that kind of activity is going on.
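
(If it helps to confirm whether cloud-init's network setup is the
trigger, its network configuration can be disabled inside the guest
image -- a sketch; the file name 99-disable-network-config.cfg is an
arbitrary choice of mine, and this assumes a cloud-init version that
honours the "network: {config: disabled}" stanza:)

# inside the guest image, before booting it
cat > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg <<'EOF'
network: {config: disabled}
EOF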

>
>>
>> So I am beginning to think the issue could be related to the flash
>> emulation but wanted to check on ml if anyone else is seeing it. But
>> if it is flash emulation how can it hose the host like this ?
>
> Why do you think it's related to the flash emulation?

Because when I skip pflash and launch the kernel directly with qemu's
"-kernel" option, everything works fine.

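(For comparison, a sketch of the direct-kernel variant being described --
the Image path and the -append arguments are assumptions, not taken from
this thread:)

# direct-kernel boot, bypassing the UEFI/pflash path
./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M virt,gic_version=3 \
    -nographic -smp 1 -m 2048 \
    -kernel /root/Image -append 'root=/dev/vda1 console=ttyAMA0' \
    -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
    -device virtio-blk-device,drive=hd0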

>
> From what you're telling me it sounds much more like a vhost regression.
>
> Can you try with v4.10 of the kernel with vhost=on ?

OK, I will try. I do think the issue could also be on the host kernel
side, with something just triggering it.
The newer qemu sets up a default virtio-net interface. Does this use
vhost by default?
And is there a way not to have a default virtio-net interface at all?
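
(One way to drop the default NIC, as far as I know -- a sketch using the
qemu 2.x option set; whether that is the right knob here is a question
for the list:)

# -net none suppresses the default network device;
# -nodefaults would strip the other default devices as well
./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M virt,gic_version=3 \
    -nographic -smp 1 -m 2048 -net none \
    -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
    -device virtio-blk-device,drive=hd0 \
    -pflash /root/flash0.img -pflash /root/flash1.img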

I am not specifying any vhost option in my command.
My observation with vhost=on was with qemu 2.5 and the 4.4 kernel from Ubuntu 16.04.

Another observation: occasionally during these trials the guest hangs
at the point shown below and a host CPU goes into a soft lockup after
some time. I cannot kill qemu either from its monitor (Ctrl-A C, then
"quit") or with the kill command on the host.

------------- cut -----------------
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[    0.000000] Booting Linux on physical CPU 0x0
------------- end ------------------


>
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 20:51         ` Radha Mohan
@ 2017-03-29 21:06           ` Christoffer Dall
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoffer Dall @ 2017-03-29 21:06 UTC (permalink / raw)
  To: Radha Mohan; +Cc: qemu-devel, kvmarm

On Wed, Mar 29, 2017 at 01:51:19PM -0700, Radha Mohan wrote:
> On Wed, Mar 29, 2017 at 11:56 AM, Christoffer Dall <cdall@linaro.org> wrote:
> > On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
> >> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
> >> > Hi Radha,
> >> >
> >> > On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
> >> >> Hi,
> >> >> I am seeing an issue with qemu-system-aarch64 when using pflash
> >> >> (booting kernel via UEFI bios).
> >> >>
> >> >> Host kernel: 4.11.0-rc3-next-20170323
> >> >> Qemu version: v2.9.0-rc1
> >> >>
> >> >> Command used:
> >> >> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
> >> >> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
> >> >> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
> >> >> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
> >> >> /root/flash1.img
> >> >>
> >> >>
> >> >> As soon as the guest kernel boots the host starts to stall and prints
> >> >> the below messages. And the system never recovers. I can neither
> >> >> poweroff the guest nor the host. So I have resort to external power
> >> >> reset of the host.
> >> >>
> >> >> ==================
> >> >> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
> >> >> [kworker/25:1:454]
> >> >> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
> >> >> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
> >> >> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
> >> >> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> >> >> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
> >> >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
> >> >> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
> >> >> mdio_thunder mdio_cavium
> >> >>
> >> >> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
> >> >> 4.11.0-rc3-next-20170323 #1
> >> >> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
> >> >> [  116.207010] Workqueue: events netstamp_clear
> >> >> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
> >> >> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
> >> >> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
> >> >> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
> >> >> pstate: 80000145
> >> >> [  116.207028] sp : ffff801f901a7be0
> >> >> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
> >> >> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
> >> >> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
> >> >> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
> >> >> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
> >> >> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
> >> >> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
> >> >> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
> >> >> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
> >> >> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
> >> >> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
> >> >> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
> >> >> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
> >> >> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
> >> >> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
> >> >>
> >> >> [  150.443116] INFO: rcu_sched self-detected stall on CPU
> >> >> [  150.448261]  25-...: (14997 ticks this GP)
> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> >> >> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
> >> >> [  150.451123]  25-...: (14997 ticks this GP)
> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
> >> >> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
> >> >> [  150.451136] Task dump for CPU 25:
> >> >> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
> >> >> [  150.451155] Workqueue: events netstamp_clear
> >> >> [  150.451158] Call trace:
> >> >> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
> >> >> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> >> >> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> >> >> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
> >> >> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> >> >> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> >> >> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
> >> >> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> >> >> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
> >> >> [  150.482242] Task dump for CPU 25:
> >> >> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
> >> >> [  150.482259] Workqueue: events netstamp_clear
> >> >> [  150.482264] Call trace:
> >> >> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
> >> >> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
> >> >> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
> >> >> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
> >> >> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
> >> >> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
> >> >> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
> >> >> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
> >> >> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
> >> >> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
> >> >> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
> >> >> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
> >> >> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
> >> >> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
> >> >> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
> >> >> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
> >> >> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
> >> >> [  150.482360] 7aa0:
> >> >> 0000000000000003 ffff801f6d682578
> >> >> [  150.482364] 7ac0: 0000000000000000 0000000000000020
> >> >> 0000000000000020 ffffffff00000000
> >> >> [  150.482367] 7ae0: 0000000000000000 0000000000000000
> >> >> 0000000000000002 ffff801f901a7d30
> >> >> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
> >> >> 0000000000000018 0000000058dabe4c
> >> >> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
> >> >> ffff0000082e81d8 0000ffffb77c9028
> >> >> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
> >> >> ffff801f6d5cff08 ffff801f6d5cff00
> >> >> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
> >> >> ffff0000081565d0 0000000000000000
> >> >> [  150.482386] 7b80: 0000000000000080 ffff000009139434
> >> >> ffff000009139000 ffff801f901a7be0
> >> >> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
> >> >> ffff000008156ecc 0000000080000145
> >> >> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
> >> >> ffffffffffffffff ffff000008156e8c
> >> >> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
> >> >> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
> >> >> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
> >> >> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
> >> >> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
> >> >> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
> >> >> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
> >> >> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
> >> >> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
> >> >> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
> >> >> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
> >> >> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
> >> >> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
> >> >> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
> >> >>
> >> >> ====================================
> >> >>
> >> >> I am observing that this usually happens when the guest tries to
> >> >> bringup or use the default virtio-net interface.
> >> >> And I am unable to reproduce this when directly booting the guest
> >> >> kernel without UEFI BIOS.
> >> >> So anyone observed similar issue ?
> >> >>
> >> >
> >> > I haven't seen this myself.
> >> >
> >> > Have you tried with other versions of QEMU and the host kernel by any
> >> > chance using the same setup?
> >>
> >> The earlier qemu version (atleast 2.5 which is default on ubuntu
> >> 16.04) have different issue. The same thing happens when I use
> >> vhost=on but without vhost and with virtio-net didn't see the issue.
> >> The Armband folks have also reported the vhost issue here
> >> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
> >
> > So I just ran your setup using the flash emulation on the ThunderX in
> > packet.net and my host is happily continuing to run hackbench.
> 
> One more thing I see is the ubuntu cloud images run a cloud-init
> service during boot and seems to do lot of network init, the bug
> appears when doing such things.
> 
> >
> >>
> >> So I am beginning to think the issue could be related to the flash
> >> emulation but wanted to check on ml if anyone else is seeing it. But
> >> if it is flash emulation how can it hose the host like this ?
> >
> > Why do you think it's related to the flash emulation?
> 
> Because without pflash method and directly launching kernel using
> "-kernel" option of qemu works ok.

With the same kernel binary which is otherwise used inside the cloud
image when booting with QEMU?

Make sure you verify this conclusion by repeating tries tens of times at
least, just to make sure we're not misdiagnosing based on chance.
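
For reference, a minimal sketch of the two boot methods being compared; the
base command is the one quoted above, while the kernel/initrd paths and the
root= argument in the second form are placeholders, not values taken from
this thread:

  # UEFI/pflash boot (as in the quoted command)
  ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M virt,gic_version=3 \
      -nographic -smp 1 -m 2048 \
      -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
      -device virtio-blk-device,drive=hd0 \
      -pflash /root/flash0.img -pflash /root/flash1.img

  # Direct-kernel boot of the same image, bypassing the UEFI firmware
  # (kernel/initrd extracted from the cloud image; paths are placeholders)
  ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M virt,gic_version=3 \
      -nographic -smp 1 -m 2048 \
      -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
      -device virtio-blk-device,drive=hd0 \
      -kernel /root/Image -initrd /root/initrd.img \
      -append "console=ttyAMA0 root=/dev/vda1"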

> 
> 
> >
> > From what you're telling me it sounds much more like a vhost regression.
> >
> > Can you try with v4.10 of the kernel with vhost=on ?
> 
> Ok will try. 

Cool.

> I do think the issue could also be host kernel side and
> something is triggering.
> The newer qemu is having a default virtio-net interface. Does this use
> vhost by default ?
> And is there a way not to have a default virtio-net interface.

I really don't know, I'm not an expert on QEMU or dealing with virtio
vhost config options.
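
For what it's worth, a sketch of how this is commonly checked and controlled
from the host side; the options below are standard QEMU/Linux ones and are
assumptions about this setup, not something verified in this thread:

  # On the host: vhost-net shows up as a module and as per-guest kernel threads
  lsmod | grep vhost
  ps -ef | grep '\[vhost'

  # To drop QEMU's implicit user-mode NIC entirely, add:
  #   -net none
  # To create the NIC explicitly and control vhost deliberately, add something like:
  #   -netdev tap,id=net0,vhost=on -device virtio-net-device,netdev=net0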

> 
> I am not specifying any vhost in my command.
> My observation with vhost=on was with qemu-2.5 and 4.4 kernel from Ubuntu 16.04.
> 
> Another observation I have is, occasionally during the trials the
> guest hangs at this point and host cpu goes into a softlockup after
> sometime. I cannot kill the qemu either through qemu Ctrl-A+C+Q or
> kill command from host.
> 
> ------------- cut -----------------
> EFI stub: Booting Linux Kernel...
> EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services and installing virtual address map...
> [    0.000000] Booting Linux on physical CPU 0x0
> ------------- end ------------------
> 
> 

My best bet based on what you're telling me above is that it's related
to memory accesses somehow; with vhost the host kernel will access guest
memory and with pflash emulation I think we're mapping host regions
read-only and executing code in the guest from those regions.  Perhaps
we end up corrupting some memory somewhere.
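
One quick way to sanity-check that part of the theory, assuming nothing else
is special about the setup: the QEMU monitor (reachable with Ctrl-A C under
-nographic) can dump the guest memory map and show how the two flash banks
are modelled:

  (qemu) info mtree

The two 64M pflash banks should appear at the bottom of the guest physical
address map as read-only ROM-device regions that the guest executes the UEFI
code from.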

Is this a different hardware platform from the ThunderX in packet.net ?

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 21:06           ` Christoffer Dall
@ 2017-03-29 21:36             ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-29 21:36 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: qemu-devel, kvmarm

On Wed, Mar 29, 2017 at 2:06 PM, Christoffer Dall <cdall@linaro.org> wrote:
> On Wed, Mar 29, 2017 at 01:51:19PM -0700, Radha Mohan wrote:
>> On Wed, Mar 29, 2017 at 11:56 AM, Christoffer Dall <cdall@linaro.org> wrote:
>> > On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> >> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>> >> > Hi Radha,
>> >> >
>> >> > On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>> >> >> Hi,
>> >> >> I am seeing an issue with qemu-system-aarch64 when using pflash
>> >> >> (booting kernel via UEFI bios).
>> >> >>
>> >> >> Host kernel: 4.11.0-rc3-next-20170323
>> >> >> Qemu version: v2.9.0-rc1
>> >> >>
>> >> >> Command used:
>> >> >> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>> >> >> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>> >> >> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>> >> >> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>> >> >> /root/flash1.img
>> >> >>
>> >> >>
>> >> >> As soon as the guest kernel boots the host starts to stall and prints
>> >> >> the below messages. And the system never recovers. I can neither
>> >> >> poweroff the guest nor the host. So I have resort to external power
>> >> >> reset of the host.
>> >> >>
>> >> >> ==================
>> >> >> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>> >> >> [kworker/25:1:454]
>> >> >> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>> >> >> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>> >> >> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>> >> >> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>> >> >> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>> >> >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>> >> >> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>> >> >> mdio_thunder mdio_cavium
>> >> >>
>> >> >> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>> >> >> 4.11.0-rc3-next-20170323 #1
>> >> >> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>> >> >> [  116.207010] Workqueue: events netstamp_clear
>> >> >> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>> >> >> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>> >> >> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>> >> >> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>> >> >> pstate: 80000145
>> >> >> [  116.207028] sp : ffff801f901a7be0
>> >> >> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>> >> >> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>> >> >> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>> >> >> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>> >> >> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>> >> >> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>> >> >> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>> >> >> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>> >> >> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>> >> >> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>> >> >> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>> >> >> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>> >> >> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>> >> >> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>> >> >> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>> >> >>
>> >> >> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>> >> >> [  150.448261]  25-...: (14997 ticks this GP)
>> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> >> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>> >> >> [  150.451123]  25-...: (14997 ticks this GP)
>> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> >> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>> >> >> [  150.451136] Task dump for CPU 25:
>> >> >> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>> >> >> [  150.451155] Workqueue: events netstamp_clear
>> >> >> [  150.451158] Call trace:
>> >> >> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>> >> >> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> >> >> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> >> >> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> >> >> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> >> >> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> >> >> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>> >> >> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>> >> >> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>> >> >> [  150.482242] Task dump for CPU 25:
>> >> >> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>> >> >> [  150.482259] Workqueue: events netstamp_clear
>> >> >> [  150.482264] Call trace:
>> >> >> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>> >> >> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>> >> >> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>> >> >> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>> >> >> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>> >> >> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>> >> >> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>> >> >> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>> >> >> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>> >> >> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>> >> >> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>> >> >> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>> >> >> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>> >> >> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>> >> >> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>> >> >> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>> >> >> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>> >> >> [  150.482360] 7aa0:
>> >> >> 0000000000000003 ffff801f6d682578
>> >> >> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>> >> >> 0000000000000020 ffffffff00000000
>> >> >> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>> >> >> 0000000000000002 ffff801f901a7d30
>> >> >> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>> >> >> 0000000000000018 0000000058dabe4c
>> >> >> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>> >> >> ffff0000082e81d8 0000ffffb77c9028
>> >> >> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>> >> >> ffff801f6d5cff08 ffff801f6d5cff00
>> >> >> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>> >> >> ffff0000081565d0 0000000000000000
>> >> >> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>> >> >> ffff000009139000 ffff801f901a7be0
>> >> >> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>> >> >> ffff000008156ecc 0000000080000145
>> >> >> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>> >> >> ffffffffffffffff ffff000008156e8c
>> >> >> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>> >> >> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>> >> >> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>> >> >> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>> >> >> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>> >> >> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>> >> >> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>> >> >> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>> >> >> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>> >> >> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>> >> >> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>> >> >> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>> >> >> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>> >> >> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>> >> >>
>> >> >> ====================================
>> >> >>
>> >> >> I am observing that this usually happens when the guest tries to
>> >> >> bringup or use the default virtio-net interface.
>> >> >> And I am unable to reproduce this when directly booting the guest
>> >> >> kernel without UEFI BIOS.
>> >> >> So anyone observed similar issue ?
>> >> >>
>> >> >
>> >> > I haven't seen this myself.
>> >> >
>> >> > Have you tried with other versions of QEMU and the host kernel by any
>> >> > chance using the same setup?
>> >>
>> >> The earlier qemu version (atleast 2.5 which is default on ubuntu
>> >> 16.04) have different issue. The same thing happens when I use
>> >> vhost=on but without vhost and with virtio-net didn't see the issue.
>> >> The Armband folks have also reported the vhost issue here
>> >> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>> >
>> > So I just ran your setup using the flash emulation on the ThunderX in
>> > packet.net and my host is happily continuing to run hackbench.
>>
>> One more thing I see is the ubuntu cloud images run a cloud-init
>> service during boot and seems to do lot of network init, the bug
>> appears when doing such things.
>>
>> >
>> >>
>> >> So I am beginning to think the issue could be related to the flash
>> >> emulation but wanted to check on ml if anyone else is seeing it. But
>> >> if it is flash emulation how can it hose the host like this ?
>> >
>> > Why do you think it's related to the flash emulation?
>>
>> Because without pflash method and directly launching kernel using
>> "-kernel" option of qemu works ok.
>
> With the same kernel binary which is otherwise used inside the cloud
> image when booting with QEMU?

In this case the guest kernel is 4.11; the one from the cloud image is 4.10.

>
> Make sure you verify this conclusion by repeating tries tens of times at
> least, just to make sure we're not misdiagnosing based on chance.

Ok, agreed. Initially I was assuming something broke at the top of the
tree as I moved from the stock qemu 2.5 to master, but it turned out
that isn't it. I will have to try some combinations and get back to you guys.

>
>>
>>
>> >
>> > From what you're telling me it sounds much more like a vhost regression.
>> >
>> > Can you try with v4.10 of the kernel with vhost=on ?
>>
>> Ok will try.
>
> Cool.
>
>> I do think the issue could also be host kernel side and
>> something is triggering.
>> The newer qemu is having a default virtio-net interface. Does this use
>> vhost by default ?
>> And is there a way not to have a default virtio-net interface.
>
> I really don't know, I'm not an expert on QEMU or dealing with virtio
> vhost config options.
>
>>
>> I am not specifying any vhost in my command.
>> My observation with vhost=on was with qemu-2.5 and 4.4 kernel from Ubuntu 16.04.
>>
>> Another observation I have is, occasionally during the trials the
>> guest hangs at this point and host cpu goes into a softlockup after
>> sometime. I cannot kill the qemu either through qemu Ctrl-A+C+Q or
>> kill command from host.
>>
>> ------------- cut -----------------
>> EFI stub: Booting Linux Kernel...
>> EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
>> EFI stub: Using DTB from configuration table
>> EFI stub: Exiting boot services and installing virtual address map...
>> [    0.000000] Booting Linux on physical CPU 0x0
>> ------------- end ------------------
>>
>>
>
> My best bet based on what you're telling me above is that it's related
> to memory accesses somehow; with vhost the host kernel will access guest
> memory and with pflash emulation I think we're mapping host regions
> read-only and executing code in the guest from those regions.  Perhaps
> we end up corrupting some memory somewhere.
>
> Is this a different hardware platform from the ThunderX in packet.net ?

The chip is the same; it just differs in the onboard ethernet ports and PCIe slots.

>
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 18:56       ` Christoffer Dall
@ 2017-03-30 10:51         ` Marc Zyngier
  -1 siblings, 0 replies; 28+ messages in thread
From: Marc Zyngier @ 2017-03-30 10:51 UTC (permalink / raw)
  To: Christoffer Dall, Radha Mohan; +Cc: qemu-devel, kvmarm

On 29/03/17 19:56, Christoffer Dall wrote:
> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>>> Hi Radha,
>>>
>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>>> Hi,
>>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>>> (booting kernel via UEFI bios).
>>>>
>>>> Host kernel: 4.11.0-rc3-next-20170323
>>>> Qemu version: v2.9.0-rc1
>>>>
>>>> Command used:
>>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>>> /root/flash1.img
>>>>
>>>>
>>>> As soon as the guest kernel boots the host starts to stall and prints
>>>> the below messages. And the system never recovers. I can neither
>>>> poweroff the guest nor the host. So I have resort to external power
>>>> reset of the host.
>>>>
>>>> ==================
>>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>>> [kworker/25:1:454]
>>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>>> mdio_thunder mdio_cavium
>>>>
>>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>>> 4.11.0-rc3-next-20170323 #1
>>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>>> [  116.207010] Workqueue: events netstamp_clear
>>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>>> pstate: 80000145
>>>> [  116.207028] sp : ffff801f901a7be0
>>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>>
>>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>>> [  150.448261]  25-...: (14997 ticks this GP)
>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>>> [  150.451123]  25-...: (14997 ticks this GP)
>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>>> [  150.451136] Task dump for CPU 25:
>>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>>> [  150.451155] Workqueue: events netstamp_clear
>>>> [  150.451158] Call trace:
>>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>>> [  150.482242] Task dump for CPU 25:
>>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>>> [  150.482259] Workqueue: events netstamp_clear
>>>> [  150.482264] Call trace:
>>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>>> [  150.482360] 7aa0:
>>>> 0000000000000003 ffff801f6d682578
>>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>>> 0000000000000020 ffffffff00000000
>>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>>> 0000000000000002 ffff801f901a7d30
>>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>>> 0000000000000018 0000000058dabe4c
>>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>>> ffff0000082e81d8 0000ffffb77c9028
>>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>>> ffff801f6d5cff08 ffff801f6d5cff00
>>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>>> ffff0000081565d0 0000000000000000
>>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>>> ffff000009139000 ffff801f901a7be0
>>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>>> ffff000008156ecc 0000000080000145
>>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>>> ffffffffffffffff ffff000008156e8c
>>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>
>>>> ====================================
>>>>
>>>> I am observing that this usually happens when the guest tries to
>>>> bringup or use the default virtio-net interface.
>>>> And I am unable to reproduce this when directly booting the guest
>>>> kernel without UEFI BIOS.
>>>> So anyone observed similar issue ?
>>>>
>>>
>>> I haven't seen this myself.
>>>
>>> Have you tried with other versions of QEMU and the host kernel by any
>>> chance using the same setup?
>>
>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>> 16.04) have different issue. The same thing happens when I use
>> vhost=on but without vhost and with virtio-net didn't see the issue.
>> The Armband folks have also reported the vhost issue here
>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
> 
> So I just ran your setup using the flash emulation on the ThunderX in
> packet.net and my host is happily continuing to run hackbench.
> 
>>
>> So I am beginning to think the issue could be related to the flash
>> emulation but wanted to check on ml if anyone else is seeing it. But
>> if it is flash emulation how can it hose the host like this ?
> 
> Why do you think it's related to the flash emulation?
> 
> From what you're telling me it sounds much more like a vhost regression.
> 
> Can you try with v4.10 of the kernel with vhost=on ?

I'm unsure about this being directly related to vhost. vhost seems to
be the trigger, but the issue seems to be that after an
smp_call_function_many() call some of the target CPUs have either
stopped responding or missed the IPI, and the IPI initiator keeps
waiting forever.

I'm not seeing any of this on the HW I have access to, but this could
be symptomatic of an SGI delivery problem.

Any chance you could try the following patchlet on the host kernel?

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index c132f29322cc..727c7dc7592a 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -617,6 +617,15 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
 	 */
 	smp_wmb();
 
+	/* Is it a broadcast? */
+	if (cpumask_weight(mask) == (num_online_cpus() - 1)) {
+		gic_write_sgi1r(1UL << ICC_SGI1R_IRQ_ROUTING_MODE_BIT |
+				irq << ICC_SGI1R_SGI_ID_SHIFT);
+		isb();
+		return;
+	}
+
+
 	for_each_cpu(cpu, mask) {
 		unsigned long cluster_id = cpu_logical_map(cpu) & ~0xffUL;
 		u16 tlist;

This shouldn't make any difference (other than being slightly faster in the
broadcast case, and slower in every other case), but I'm prepared to try anything...

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-29 18:56       ` Christoffer Dall
@ 2017-03-30 16:47         ` Laszlo Ersek
  -1 siblings, 0 replies; 28+ messages in thread
From: Laszlo Ersek @ 2017-03-30 16:47 UTC (permalink / raw)
  To: Christoffer Dall, Radha Mohan
  Cc: qemu-devel, kvmarm, Wei Huang, Drew Jones, Michael S. Tsirkin,
	Jason Wang

On 03/29/17 20:56, Christoffer Dall wrote:
> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>>> Hi Radha,
>>>
>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>>> Hi,
>>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>>> (booting kernel via UEFI bios).
>>>>
>>>> Host kernel: 4.11.0-rc3-next-20170323
>>>> Qemu version: v2.9.0-rc1
>>>>
>>>> Command used:
>>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>>> /root/flash1.img
>>>>
>>>>
>>>> As soon as the guest kernel boots the host starts to stall and prints
>>>> the below messages. And the system never recovers. I can neither
>>>> poweroff the guest nor the host. So I have resort to external power
>>>> reset of the host.
>>>>
>>>> ==================
>>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>>> [kworker/25:1:454]
>>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>>> mdio_thunder mdio_cavium
>>>>
>>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>>> 4.11.0-rc3-next-20170323 #1
>>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>>> [  116.207010] Workqueue: events netstamp_clear
>>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>>> pstate: 80000145
>>>> [  116.207028] sp : ffff801f901a7be0
>>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>>
>>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>>> [  150.448261]  25-...: (14997 ticks this GP)
>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>>> [  150.451123]  25-...: (14997 ticks this GP)
>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>>> [  150.451136] Task dump for CPU 25:
>>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>>> [  150.451155] Workqueue: events netstamp_clear
>>>> [  150.451158] Call trace:
>>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>>> [  150.482242] Task dump for CPU 25:
>>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>>> [  150.482259] Workqueue: events netstamp_clear
>>>> [  150.482264] Call trace:
>>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>>> [  150.482360] 7aa0:
>>>> 0000000000000003 ffff801f6d682578
>>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>>> 0000000000000020 ffffffff00000000
>>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>>> 0000000000000002 ffff801f901a7d30
>>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>>> 0000000000000018 0000000058dabe4c
>>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>>> ffff0000082e81d8 0000ffffb77c9028
>>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>>> ffff801f6d5cff08 ffff801f6d5cff00
>>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>>> ffff0000081565d0 0000000000000000
>>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>>> ffff000009139000 ffff801f901a7be0
>>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>>> ffff000008156ecc 0000000080000145
>>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>>> ffffffffffffffff ffff000008156e8c
>>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>
>>>> ====================================
>>>>
>>>> I am observing that this usually happens when the guest tries to
>>>> bringup or use the default virtio-net interface.
>>>> And I am unable to reproduce this when directly booting the guest
>>>> kernel without UEFI BIOS.
>>>> So anyone observed similar issue ?
>>>>
>>>
>>> I haven't seen this myself.
>>>
>>> Have you tried with other versions of QEMU and the host kernel by any
>>> chance using the same setup?
>>
>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>> 16.04) have different issue. The same thing happens when I use
>> vhost=on but without vhost and with virtio-net didn't see the issue.
>> The Armband folks have also reported the vhost issue here
>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
> 
> So I just ran your setup using the flash emulation on the ThunderX in
> packet.net and my host is happily continuing to run hackbench.
> 
>>
>> So I am beginning to think the issue could be related to the flash
>> emulation but wanted to check on ml if anyone else is seeing it. But
>> if it is flash emulation how can it hose the host like this ?
> 
> Why do you think it's related to the flash emulation?
> 
> From what you're telling me it sounds much more like a vhost regression.
> 
> Can you try with v4.10 of the kernel with vhost=on ?

Does this QEMU pull req help by any chance?

  [Qemu-devel] [PULL 0/2] vhost, pc: fixes
  http://lists.nongnu.org/archive/html/qemu-devel/2017-03/msg06103.html

It says "Region caching is still causing issues around reset". (Region
caching was a recent, very intrusive, virtio change in QEMU that sort of
blew up everything and developers have been dealing with the fallout
ever since. <-- my personal account anyway)

In another (separate, downstream) discussion, I explained that the
VirtioNetDxe driver in OVMF and ArmVirtQemu does a virtio reset on the
virtio network device in an ExitBootServices() callback; that is, when
the kernel's UEFI stub takes control of the system. Resetting the virtio
device is supposed to cause the host to de-configure the device and to
forget all about it. If that doesn't occur correctly, the guest kernel
can definitely end up in a crappy state (the host might still treat
memory areas that were originally allocated by the firmware, but have
since been overwritten by the guest kernel, as belonging to the virtio
device).
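
For context, a rough sketch only (not the VirtioNetDxe code), assuming
the NIC sits on the virtio-mmio transport the 'virt' machine provides:
the reset at ExitBootServices() essentially amounts to the write below,
and once the status reads back as zero the backend -- including any
vhost worker -- must stop using the memory the firmware had handed to
the device. The function name and the pre-mapped "base" pointer are
assumptions for the sake of the example:

#include <stdint.h>

#define VIRTIO_MMIO_STATUS	0x070	/* writing 0 triggers a device reset */

/* Illustrative only: "base" is assumed to be an already-mapped
 * virtio-mmio register window for the NIC. */
static void virtio_mmio_reset(volatile uint8_t *base)
{
	volatile uint32_t *status =
		(volatile uint32_t *)(base + VIRTIO_MMIO_STATUS);

	*status = 0;			/* request the reset */
	while (*status != 0)		/* virtio 1.0: wait until it reads 0 */
		;
	/*
	 * From here on the device (and the vhost backend behind it) must
	 * not touch any guest memory it was previously given -- the
	 * property that appears to be violated in this failure.
	 */
}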

In addition, I found some relevant-looking, recent vhost commits in the
kernel:

275bf960ac69 vhost: better detection of available buffers
0ed005ce02fa vhost_net: tx batching
635b8c8ecdd2 tap: Renaming tap related APIs, data structures, macros
9a393b5d5988 tap: tap as an independent module
e3b56cdd4351 vhost: try avoiding avail index access when getting
             descriptor
f88949138058 vhost: introduce O(1) vq metadata cache

I think the symptoms could come from a strange interplay between recent
QEMU and kernel changes that gets tickled the wrong way by the
firmware's virtio-net driver.

Thanks
Laszlo

> 
> Thanks,
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-30 16:47         ` Laszlo Ersek
@ 2017-03-31 23:16           ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-03-31 23:16 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Christoffer Dall, qemu-devel, kvmarm, Wei Huang, Drew Jones,
	Michael S. Tsirkin, Jason Wang

On Thu, Mar 30, 2017 at 9:47 AM, Laszlo Ersek <lersek@redhat.com> wrote:
> On 03/29/17 20:56, Christoffer Dall wrote:
>> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>>>> Hi Radha,
>>>>
>>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>>>> Hi,
>>>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>>>> (booting kernel via UEFI bios).
>>>>>
>>>>> Host kernel: 4.11.0-rc3-next-20170323
>>>>> Qemu version: v2.9.0-rc1
>>>>>
>>>>> Command used:
>>>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>>>> /root/flash1.img
>>>>>
>>>>>
>>>>> As soon as the guest kernel boots the host starts to stall and prints
>>>>> the below messages. And the system never recovers. I can neither
>>>>> poweroff the guest nor the host. So I have resort to external power
>>>>> reset of the host.
>>>>>
>>>>> ==================
>>>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>>>> [kworker/25:1:454]
>>>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>>>> mdio_thunder mdio_cavium
>>>>>
>>>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>>>> 4.11.0-rc3-next-20170323 #1
>>>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>>>> [  116.207010] Workqueue: events netstamp_clear
>>>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>>>> pstate: 80000145
>>>>> [  116.207028] sp : ffff801f901a7be0
>>>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>>>
>>>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>>>> [  150.448261]  25-...: (14997 ticks this GP)
>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>>>> [  150.451123]  25-...: (14997 ticks this GP)
>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>>>> [  150.451136] Task dump for CPU 25:
>>>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>> [  150.451155] Workqueue: events netstamp_clear
>>>>> [  150.451158] Call trace:
>>>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>>>> [  150.482242] Task dump for CPU 25:
>>>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>> [  150.482259] Workqueue: events netstamp_clear
>>>>> [  150.482264] Call trace:
>>>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>>>> [  150.482360] 7aa0:
>>>>> 0000000000000003 ffff801f6d682578
>>>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>>>> 0000000000000020 ffffffff00000000
>>>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>>>> 0000000000000002 ffff801f901a7d30
>>>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>>>> 0000000000000018 0000000058dabe4c
>>>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>>>> ffff0000082e81d8 0000ffffb77c9028
>>>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>>>> ffff801f6d5cff08 ffff801f6d5cff00
>>>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>>>> ffff0000081565d0 0000000000000000
>>>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>>>> ffff000009139000 ffff801f901a7be0
>>>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>>>> ffff000008156ecc 0000000080000145
>>>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>>>> ffffffffffffffff ffff000008156e8c
>>>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>>
>>>>> ====================================
>>>>>
>>>>> I am observing that this usually happens when the guest tries to
>>>>> bringup or use the default virtio-net interface.
>>>>> And I am unable to reproduce this when directly booting the guest
>>>>> kernel without UEFI BIOS.
>>>>> So anyone observed similar issue ?
>>>>>
>>>>
>>>> I haven't seen this myself.
>>>>
>>>> Have you tried with other versions of QEMU and the host kernel by any
>>>> chance using the same setup?
>>>
>>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>>> 16.04) have different issue. The same thing happens when I use
>>> vhost=on but without vhost and with virtio-net didn't see the issue.
>>> The Armband folks have also reported the vhost issue here
>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>>
>> So I just ran your setup using the flash emulation on the ThunderX in
>> packet.net and my host is happily continuing to run hackbench.
>>
>>>
>>> So I am beginning to think the issue could be related to the flash
>>> emulation but wanted to check on ml if anyone else is seeing it. But
>>> if it is flash emulation how can it hose the host like this ?
>>
>> Why do you think it's related to the flash emulation?
>>
>> From what you're telling me it sounds much more like a vhost regression.
>>
>> Can you try with v4.10 of the kernel with vhost=on ?
>
> Does this QEMU pull req help by any chance?

I will try all the suggestions above and get back. It would take a few
days as I have a few priority things to clear off my plate.

>
>   [Qemu-devel] [PULL 0/2] vhost, pc: fixes
>   http://lists.nongnu.org/archive/html/qemu-devel/2017-03/msg06103.html
>
> It says "Region caching is still causing issues around reset". (Region
> caching was a recent, very intrusive, virtio change in QEMU that sort of
> blew up everything and developers have been dealing with the fallout
> ever since. <-- my personal account anyway)
>
> In another (separate, downstream) discussion, I explained that the
> VirtioNetDxe driver in OVMF and ArmVirtQemu does a virtio reset on the
> virtio network device in an ExitBootServices() callback; that is, when
> the kernel's UEFI stub takes control of the system. Resetting the virtio
> device is supposed to cause the host to de-configure the device and to
> forget all about it. If that doesn't occur correctly, the guest kernel
> can definitely end up in a crappy state (the host might still consider
> memory areas belong to the virtio device that were originally allocated
> by the firmware, but are now overwritten by the guest kernel).
>
> In addition, I found some relevant-looking, recent vhost commits in the
> kernel:
>
> 275bf960ac69 vhost: better detection of available buffers
> 0ed005ce02fa vhost_net: tx batching
> 635b8c8ecdd2 tap: Renaming tap related APIs, data structures, macros
> 9a393b5d5988 tap: tap as an independent module
> e3b56cdd4351 vhost: try avoiding avail index access when getting
>              descriptor
> f88949138058 vhost: introduce O(1) vq metadata cache
>
> I think the symptoms could be a strange interplay between recent QEMU
> and kernel changes that are tickled by the firmware's virtio-net driver
> the wrong way.
>
> Thanks
> Laszlo
>
>>
>> Thanks,
>> -Christoffer
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-31 23:16           ` Radha Mohan
@ 2017-04-05 19:12             ` Radha Mohan
  -1 siblings, 0 replies; 28+ messages in thread
From: Radha Mohan @ 2017-04-05 19:12 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Christoffer Dall, qemu-devel, kvmarm, Wei Huang, Drew Jones,
	Michael S. Tsirkin, Jason Wang

On Fri, Mar 31, 2017 at 4:16 PM, Radha Mohan <mohun106@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 9:47 AM, Laszlo Ersek <lersek@redhat.com> wrote:
>> On 03/29/17 20:56, Christoffer Dall wrote:
>>> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>>>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>>>>> Hi Radha,
>>>>>
>>>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>>>>> Hi,
>>>>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>>>>> (booting kernel via UEFI bios).
>>>>>>
>>>>>> Host kernel: 4.11.0-rc3-next-20170323
>>>>>> Qemu version: v2.9.0-rc1
>>>>>>
>>>>>> Command used:
>>>>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>>>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>>>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>>>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>>>>> /root/flash1.img
>>>>>>
>>>>>>
>>>>>> As soon as the guest kernel boots the host starts to stall and prints
>>>>>> the below messages. And the system never recovers. I can neither
>>>>>> poweroff the guest nor the host. So I have resort to external power
>>>>>> reset of the host.
>>>>>>
>>>>>> ==================
>>>>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>>>>> [kworker/25:1:454]
>>>>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>>>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>>>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>>>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>>>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>>>>> mdio_thunder mdio_cavium
>>>>>>
>>>>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>>>>> 4.11.0-rc3-next-20170323 #1
>>>>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>>>>> [  116.207010] Workqueue: events netstamp_clear
>>>>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>>>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>>>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>>>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>>>>> pstate: 80000145
>>>>>> [  116.207028] sp : ffff801f901a7be0
>>>>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>>>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>>>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>>>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>>>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>>>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>>>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>>>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>>>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>>>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>>>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>>>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>>>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>>>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>>>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>>>>
>>>>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>>>>> [  150.448261]  25-...: (14997 ticks this GP)
>>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>>>>> [  150.451123]  25-...: (14997 ticks this GP)
>>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>>>>> [  150.451136] Task dump for CPU 25:
>>>>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>>> [  150.451155] Workqueue: events netstamp_clear
>>>>>> [  150.451158] Call trace:
>>>>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>>>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>>>>> [  150.482242] Task dump for CPU 25:
>>>>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>>> [  150.482259] Workqueue: events netstamp_clear
>>>>>> [  150.482264] Call trace:
>>>>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>>>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>>>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>>>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>>>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>>>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>>>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>>>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>>>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>>>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>>>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>>>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>>>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>>>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>>>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>>>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>>>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>>>>> [  150.482360] 7aa0:
>>>>>> 0000000000000003 ffff801f6d682578
>>>>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>>>>> 0000000000000020 ffffffff00000000
>>>>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>>>>> 0000000000000002 ffff801f901a7d30
>>>>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>>>>> 0000000000000018 0000000058dabe4c
>>>>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>>>>> ffff0000082e81d8 0000ffffb77c9028
>>>>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>>>>> ffff801f6d5cff08 ffff801f6d5cff00
>>>>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>>>>> ffff0000081565d0 0000000000000000
>>>>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>>>>> ffff000009139000 ffff801f901a7be0
>>>>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>>>>> ffff000008156ecc 0000000080000145
>>>>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>>>>> ffffffffffffffff ffff000008156e8c
>>>>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>>>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>>>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>>>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>>>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>>>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>>>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>>>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>>>
>>>>>> ====================================
>>>>>>
>>>>>> I am observing that this usually happens when the guest tries to
>>>>>> bringup or use the default virtio-net interface.
>>>>>> And I am unable to reproduce this when directly booting the guest
>>>>>> kernel without UEFI BIOS.
>>>>>> So anyone observed similar issue ?
>>>>>>
>>>>>
>>>>> I haven't seen this myself.
>>>>>
>>>>> Have you tried with other versions of QEMU and the host kernel by any
>>>>> chance using the same setup?
>>>>
>>>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>>>> 16.04) have different issue. The same thing happens when I use
>>>> vhost=on but without vhost and with virtio-net didn't see the issue.
>>>> The Armband folks have also reported the vhost issue here
>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>>>
>>> So I just ran your setup using the flash emulation on the ThunderX in
>>> packet.net and my host is happily continuing to run hackbench.
>>>
>>>>
>>>> So I am beginning to think the issue could be related to the flash
>>>> emulation but wanted to check on ml if anyone else is seeing it. But
>>>> if it is flash emulation how can it hose the host like this ?
>>>
>>> Why do you think it's related to the flash emulation?
>>>
>>> From what you're telling me it sounds much more like a vhost regression.
>>>
>>> Can you try with v4.10 of the kernel with vhost=on ?
>>
>> Does this QEMU pull req help by any chance?
>
> I will try all the suggestions above and get back. It will take a few
> days, as I have a few priority items to clear off my plate.
>

I tried the suggestions but nothing seems to work. We are also debugging
this from a chip-errata angle, as we found a few anomalies.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
  2017-03-30 10:51         ` Marc Zyngier
@ 2017-04-07 20:05           ` Wei Huang
  -1 siblings, 0 replies; 28+ messages in thread
From: Wei Huang @ 2017-04-07 20:05 UTC (permalink / raw)
  To: Marc Zyngier, Christoffer Dall, Radha Mohan
  Cc: qemu-devel, kvmarm, Laszlo Ersek, Jason Wang



On 03/30/2017 05:51 AM, Marc Zyngier wrote:
> On 29/03/17 19:56, Christoffer Dall wrote:
>> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall <cdall@linaro.org> wrote:
>>>> Hi Radha,
>>>>
>>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>>>>> Hi,
>>>>> I am seeing an issue with qemu-system-aarch64 when using pflash
>>>>> (booting kernel via UEFI bios).
>>>>>
>>>>> Host kernel: 4.11.0-rc3-next-20170323
>>>>> Qemu version: v2.9.0-rc1
>>>>>
>>>>> Command used:
>>>>> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>>>>> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>>>>> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>>>>> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>>>>> /root/flash1.img
>>>>>
>>>>>
>>>>> As soon as the guest kernel boots the host starts to stall and prints
>>>>> the below messages. And the system never recovers. I can neither
>>>>> poweroff the guest nor the host. So I have resort to external power
>>>>> reset of the host.
>>>>>
>>>>> ==================
>>>>> [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>>>>> [kworker/25:1:454]
>>>>> [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>>>>> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>>>>> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>>>>> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>>> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>>>>> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>>>>> mdio_thunder mdio_cavium
>>>>>
>>>>> [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>>>>> 4.11.0-rc3-next-20170323 #1
>>>>> [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>>>>> [  116.207010] Workqueue: events netstamp_clear
>>>>> [  116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>>>>> [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
>>>>> [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
>>>>> [  116.207026] pc : [<ffff000008156ecc>] lr : [<ffff000008156e8c>]
>>>>> pstate: 80000145
>>>>> [  116.207028] sp : ffff801f901a7be0
>>>>> [  116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>>>>> [  116.207036] x27: ffff000009139434 x26: 0000000000000080
>>>>> [  116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>>>>> [  116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>>>>> [  116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>>>>> [  116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>>>>> [  116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>>>>> [  116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>>>>> [  116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>>>>> [  116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>>>>> [  116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>>>>> [  116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>>>>> [  116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>>>>> [  116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>>>>> [  116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>>>>>
>>>>> [  150.443116] INFO: rcu_sched self-detected stall on CPU
>>>>> [  150.448261]  25-...: (14997 ticks this GP)
>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>> [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>>>>> [  150.451123]  25-...: (14997 ticks this GP)
>>>>> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>>>>> [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>>>>> [  150.451136] Task dump for CPU 25:
>>>>> [  150.451138] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>> [  150.451155] Workqueue: events netstamp_clear
>>>>> [  150.451158] Call trace:
>>>>> [  150.451164] [<ffff000008086188>] __switch_to+0x90/0xa8
>>>>> [  150.451172] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>> [  150.451175] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>> [  150.451178] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>> [  150.451181] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>> [  150.451183] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>> [  150.451187] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>> [  150.451190] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>> [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
>>>>> [  150.482242] Task dump for CPU 25:
>>>>> [  150.482245] kworker/25:1    R  running task        0   454      2 0x00000002
>>>>> [  150.482259] Workqueue: events netstamp_clear
>>>>> [  150.482264] Call trace:
>>>>> [  150.482271] [<ffff00000808a530>] dump_backtrace+0x0/0x2b0
>>>>> [  150.482277] [<ffff00000808a804>] show_stack+0x24/0x30
>>>>> [  150.482281] [<ffff0000080fb750>] sched_show_task+0x128/0x178
>>>>> [  150.482285] [<ffff0000080fd298>] dump_cpu_task+0x48/0x58
>>>>> [  150.482288] [<ffff0000081f81e4>] rcu_dump_cpu_stacks+0xa0/0xe8
>>>>> [  150.482297] [<ffff00000813983c>] rcu_check_callbacks+0x774/0x938
>>>>> [  150.482305] [<ffff00000813fcb4>] update_process_times+0x34/0x60
>>>>> [  150.482314] [<ffff000008151b80>] tick_sched_handle.isra.7+0x38/0x70
>>>>> [  150.482319] [<ffff000008151c04>] tick_sched_timer+0x4c/0x98
>>>>> [  150.482324] [<ffff000008140510>] __hrtimer_run_queues+0xd8/0x2b8
>>>>> [  150.482328] [<ffff000008141180>] hrtimer_interrupt+0xa8/0x228
>>>>> [  150.482334] [<ffff0000087f2a2c>] arch_timer_handler_phys+0x3c/0x50
>>>>> [  150.482341] [<ffff00000812c194>] handle_percpu_devid_irq+0x8c/0x230
>>>>> [  150.482344] [<ffff000008126174>] generic_handle_irq+0x34/0x50
>>>>> [  150.482347] [<ffff000008126898>] __handle_domain_irq+0x68/0xc0
>>>>> [  150.482351] [<ffff0000080817e4>] gic_handle_irq+0xc4/0x170
>>>>> [  150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>>>>> [  150.482360] 7aa0:
>>>>> 0000000000000003 ffff801f6d682578
>>>>> [  150.482364] 7ac0: 0000000000000000 0000000000000020
>>>>> 0000000000000020 ffffffff00000000
>>>>> [  150.482367] 7ae0: 0000000000000000 0000000000000000
>>>>> 0000000000000002 ffff801f901a7d30
>>>>> [  150.482371] 7b00: 00000000000009f0 00000000366e2f04
>>>>> 0000000000000018 0000000058dabe4c
>>>>> [  150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>>>>> ffff0000082e81d8 0000ffffb77c9028
>>>>> [  150.482378] 7b40: 0000000000000a03 ffff000009138e38
>>>>> ffff801f6d5cff08 ffff801f6d5cff00
>>>>> [  150.482382] 7b60: ffff000008e11e00 0000000000000001
>>>>> ffff0000081565d0 0000000000000000
>>>>> [  150.482386] 7b80: 0000000000000080 ffff000009139434
>>>>> ffff000009139000 ffff801f901a7be0
>>>>> [  150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>>>>> ffff000008156ecc 0000000080000145
>>>>> [  150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>>>>> ffffffffffffffff ffff000008156e8c
>>>>> [  150.482397] [<ffff000008082ff4>] el1_irq+0xb4/0x140
>>>>> [  150.482401] [<ffff000008156ecc>] smp_call_function_many+0x284/0x2e8
>>>>> [  150.482405] [<ffff000008157020>] kick_all_cpus_sync+0x30/0x38
>>>>> [  150.482409] [<ffff00000897c6cc>] aarch64_insn_patch_text+0xec/0xf8
>>>>> [  150.482415] [<ffff000008095978>] arch_jump_label_transform+0x60/0x98
>>>>> [  150.482420] [<ffff0000081f593c>] __jump_label_update+0x8c/0xa8
>>>>> [  150.482423] [<ffff0000081f6088>] jump_label_update+0x58/0xe8
>>>>> [  150.482429] [<ffff0000081f6240>] static_key_slow_inc+0x128/0x138
>>>>> [  150.482434] [<ffff0000081f6284>] static_key_enable+0x34/0x60
>>>>> [  150.482438] [<ffff000008843268>] netstamp_clear+0x68/0x80
>>>>> [  150.482441] [<ffff0000080e49a0>] process_one_work+0x158/0x478
>>>>> [  150.482444] [<ffff0000080e4d10>] worker_thread+0x50/0x4a8
>>>>> [  150.482448] [<ffff0000080ebd78>] kthread+0x108/0x138
>>>>> [  150.482451] [<ffff0000080836c0>] ret_from_fork+0x10/0x50
>>>>>
>>>>> ====================================
>>>>>
>>>>> I am observing that this usually happens when the guest tries to
>>>>> bringup or use the default virtio-net interface.
>>>>> And I am unable to reproduce this when directly booting the guest
>>>>> kernel without UEFI BIOS.
>>>>> So anyone observed similar issue ?
>>>>>
>>>>
>>>> I haven't seen this myself.
>>>>
>>>> Have you tried with other versions of QEMU and the host kernel by any
>>>> chance using the same setup?
>>>
>>> The earlier qemu version (atleast 2.5 which is default on ubuntu
>>> 16.04) have different issue. The same thing happens when I use
>>> vhost=on but without vhost and with virtio-net didn't see the issue.
>>> The Armband folks have also reported the vhost issue here
>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>>
>> So I just ran your setup using the flash emulation on the ThunderX in
>> packet.net and my host is happily continuing to run hackbench.
>>
>>>
>>> So I am beginning to think the issue could be related to the flash
>>> emulation but wanted to check on ml if anyone else is seeing it. But
>>> if it is flash emulation how can it hose the host like this ?
>>
>> Why do you think it's related to the flash emulation?
>>
>> From what you're telling me it sounds much more like a vhost regression.
>>
>> Can you try with v4.10 of the kernel with vhost=on ?
> 
> I'm unsure about this being directly related to vhost. vhost seems to 
> be the trigger, but the issue seems that after a smp_call_function_many()
> call, some of the target CPUs have either stopped responding, or missed
> the IPI. The IPI initiator keeps on waiting forever.
> 
> I'm not seeing any of this on the HW I have access to, but this could 
> be symptomatic of a SGI delivery problem.
> 
> Any chance you could try the following patchlet on the host kernel?

Marc,

I have seen this problem on my ThunderX system and have been debugging it
for a while. I did try your patch, but it didn't help (the host system
can't boot correctly with it applied).

Beyond that, here are the findings I have so far:

1) This issue is triggered when virtio-net is present in AAVMF. If I
remove virtio-net from AAVMF, everything works normally. To me it looks
like an issue related to virtio-net/vhost-net re-entrancy. Another
approach I used to verify this theory was to boot the guest VM without
AAVMF; that worked out fine as well.

2) I couldn't reproduce this problem on other platforms, such as APM
Mustang. So far this problem has only been seen on the ThunderX platform.
In my opinion the main HW differences are the GICv3 and the SMMU.

3) I was able to trace it down to the following two commits. In fact, I
am almost 100% sure it is caused by 7235acd.
    04b96e5 vhost: lockless enqueuing
    7235acd vhost: simplify work flushing

4) This issue seems to be present in kernels all the way back to 4.5: I
applied 7235acd on a working 4.5 kernel, tested it on ThunderX, and the
test failed.

5) A quick (non-real) fix is to add a printk() before the
wait_for_completion(&flush.wait_event) call in vhost.c. This printk()
makes the problem go away consistently. This observation leads me to
believe that there is a synchronization issue triggered by 7235acd (in
some corner cases).

6) wait_for_completion() is used to wait for the vhost worker thread to
call complete(). My debugging showed that complete() was indeed called by
the worker thread; however, wait_for_completion() never received the
signal, so the two sides ended up stuck waiting on each other.

In summary, this problem is possibly caused by a missing interrupt,
leading to a sort of deadlock. If you have other patches, I can help test
them.
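
(For reference: a minimal sketch of the flush pattern described in
points 5) and 6) above, loosely modeled on the post-7235acd code in
drivers/vhost/vhost.c. This is a simplified reconstruction to show where
the wait can hang, not the exact kernel source; the flush_sketch() and
flush_work_fn() names are made up for this sketch.)

/*
 * Sketch only: assumes the vhost worker infrastructure declared in
 * drivers/vhost/vhost.h (struct vhost_dev, struct vhost_work,
 * vhost_work_init(), vhost_work_queue()).
 */
#include <linux/completion.h>
#include <linux/kernel.h>
#include "vhost.h"

struct flush_struct {
	struct vhost_work work;        /* queued onto the vhost worker     */
	struct completion wait_event;  /* the flushing thread sleeps here  */
};

/* Runs in the vhost worker thread: signals the flusher that all work
 * queued before this item has been processed. */
static void flush_work_fn(struct vhost_work *work)
{
	struct flush_struct *s = container_of(work, struct flush_struct, work);

	complete(&s->wait_event);
}

/* Runs in the flushing thread (e.g. on device reset/teardown). */
static void flush_sketch(struct vhost_dev *dev)
{
	struct flush_struct flush;

	init_completion(&flush.wait_event);
	vhost_work_init(&flush.work, flush_work_fn);
	vhost_work_queue(dev, &flush.work);

	/* Point 5) above: inserting a printk() here hides the hang.
	 * If the worker's complete()/wake-up is lost, this waits forever. */
	wait_for_completion(&flush.wait_event);
}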

Thanks,
-Wei

> 
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index c132f29322cc..727c7dc7592a 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -617,6 +617,15 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
>  	 */
>  	smp_wmb();
>  
> +	/* Is it a broadcast? */
> +	if (cpumask_weight(mask) == (num_online_cpus() - 1)) {
> +		gic_write_sgi1r(1UL << ICC_SGI1R_IRQ_ROUTING_MODE_BIT |
> +				irq << ICC_SGI1R_SGI_ID_SHIFT);
> +		isb();
> +		return;
> +	}
> +
> +
>  	for_each_cpu(cpu, mask) {
>  		unsigned long cluster_id = cpu_logical_map(cpu) & ~0xffUL;
>  		u16 tlist;
> 
> This shouldn't make any difference (other than being slightly faster in the
> broadcast case, and slower in every other), but I'm prepared to try anything...
> 
> Thanks,
> 
> 	M.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-04-07 20:05 UTC | newest]

Thread overview: 28+ messages
2017-03-28 19:58 [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash Radha Mohan
2017-03-28 19:58 ` Radha Mohan
2017-03-28 20:16 ` [Qemu-devel] " Christoffer Dall
2017-03-28 20:16   ` Christoffer Dall
2017-03-28 20:24   ` [Qemu-devel] " Radha Mohan
2017-03-28 20:24     ` Radha Mohan
2017-03-29 18:17     ` [Qemu-devel] " Radha Mohan
2017-03-29 18:17       ` Radha Mohan
2017-03-29 18:34       ` [Qemu-devel] " Peter Maydell
2017-03-29 18:34         ` Peter Maydell
2017-03-29 18:56     ` [Qemu-devel] " Christoffer Dall
2017-03-29 18:56       ` Christoffer Dall
2017-03-29 20:51       ` [Qemu-devel] " Radha Mohan
2017-03-29 20:51         ` Radha Mohan
2017-03-29 21:06         ` [Qemu-devel] " Christoffer Dall
2017-03-29 21:06           ` Christoffer Dall
2017-03-29 21:36           ` [Qemu-devel] " Radha Mohan
2017-03-29 21:36             ` Radha Mohan
2017-03-30 10:51       ` [Qemu-devel] " Marc Zyngier
2017-03-30 10:51         ` Marc Zyngier
2017-04-07 20:05         ` [Qemu-devel] " Wei Huang
2017-04-07 20:05           ` Wei Huang
2017-03-30 16:47       ` [Qemu-devel] " Laszlo Ersek
2017-03-30 16:47         ` Laszlo Ersek
2017-03-31 23:16         ` [Qemu-devel] " Radha Mohan
2017-03-31 23:16           ` Radha Mohan
2017-04-05 19:12           ` [Qemu-devel] " Radha Mohan
2017-04-05 19:12             ` Radha Mohan
