From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50594) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ctLGL-0005hL-W3 for qemu-devel@nongnu.org; Wed, 29 Mar 2017 17:36:19 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ctLGJ-0008P0-Oe for qemu-devel@nongnu.org; Wed, 29 Mar 2017 17:36:18 -0400
Received: from mail-pg0-x241.google.com ([2607:f8b0:400e:c05::241]:34962) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1ctLGJ-0008Os-H5 for qemu-devel@nongnu.org; Wed, 29 Mar 2017 17:36:15 -0400
Received: by mail-pg0-x241.google.com with SMTP id g2so4875179pge.2 for ; Wed, 29 Mar 2017 14:36:15 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20170329210615.GB1020@cbox>
References: <20170328201623.GD20211@cbox> <20170329185630.GA4908@cbox> <20170329210615.GB1020@cbox>
From: Radha Mohan
Date: Wed, 29 Mar 2017 14:36:13 -0700
Message-ID:
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] host stalls when qemu-system-aarch64 with kvm and pflash
To: Christoffer Dall
Cc: qemu-devel@nongnu.org, "kvmarm@lists.cs.columbia.edu"

On Wed, Mar 29, 2017 at 2:06 PM, Christoffer Dall wrote:
> On Wed, Mar 29, 2017 at 01:51:19PM -0700, Radha Mohan wrote:
>> On Wed, Mar 29, 2017 at 11:56 AM, Christoffer Dall wrote:
>> > On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> >> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall wrote:
>> >> > Hi Radha,
>> >> >
>> >> > On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
>> >> >> Hi,
>> >> >> I am seeing an issue with qemu-system-aarch64 when using pflash
>> >> >> (booting the kernel via the UEFI bios).
>> >> >>
>> >> >> Host kernel: 4.11.0-rc3-next-20170323
>> >> >> Qemu version: v2.9.0-rc1
>> >> >>
>> >> >> Command used:
>> >> >> ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
>> >> >> virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
>> >> >> if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
>> >> >> virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
>> >> >> /root/flash1.img
>> >> >>
>> >> >>
>> >> >> As soon as the guest kernel boots, the host starts to stall and prints
>> >> >> the below messages, and the system never recovers. I can neither
>> >> >> power off the guest nor the host, so I have to resort to an external
>> >> >> power reset of the host.
>> >> >>
>> >> >> ==================
>> >> >> [ 116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
>> >> >> [kworker/25:1:454]
>> >> >> [ 116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
>> >> >> shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
>> >> >> sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
>> >> >> d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>> >> >> raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
>> >> >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
>> >> >> ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
>> >> >> mdio_thunder mdio_cavium
>> >> >>
>> >> >> [ 116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
>> >> >> 4.11.0-rc3-next-20170323 #1
>> >> >> [ 116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 23 2017
>> >> >> [ 116.207010] Workqueue: events netstamp_clear
>> >> >> [ 116.207015] task: ffff801f906b5400 task.stack: ffff801f901a4000
>> >> >> [ 116.207020] PC is at smp_call_function_many+0x284/0x2e8
>> >> >> [ 116.207023] LR is at smp_call_function_many+0x244/0x2e8
>> >> >> [ 116.207026] pc : [] lr : []
>> >> >> pstate: 80000145
>> >> >> [ 116.207028] sp : ffff801f901a7be0
>> >> >> [ 116.207030] x29: ffff801f901a7be0 x28: ffff000009139000
>> >> >> [ 116.207036] x27: ffff000009139434 x26: 0000000000000080
>> >> >> [ 116.207041] x25: 0000000000000000 x24: ffff0000081565d0
>> >> >> [ 116.207047] x23: 0000000000000001 x22: ffff000008e11e00
>> >> >> [ 116.207052] x21: ffff801f6d5cff00 x20: ffff801f6d5cff08
>> >> >> [ 116.207057] x19: ffff000009138e38 x18: 0000000000000a03
>> >> >> [ 116.207063] x17: 0000ffffb77c9028 x16: ffff0000082e81d8
>> >> >> [ 116.207068] x15: 00003d0d6dd44d08 x14: 0036312196549b4a
>> >> >> [ 116.207073] x13: 0000000058dabe4c x12: 0000000000000018
>> >> >> [ 116.207079] x11: 00000000366e2f04 x10: 00000000000009f0
>> >> >> [ 116.207084] x9 : ffff801f901a7d30 x8 : 0000000000000002
>> >> >> [ 116.207089] x7 : 0000000000000000 x6 : 0000000000000000
>> >> >> [ 116.207095] x5 : ffffffff00000000 x4 : 0000000000000020
>> >> >> [ 116.207100] x3 : 0000000000000020 x2 : 0000000000000000
>> >> >> [ 116.207105] x1 : ffff801f6d682578 x0 : 0000000000000003
>> >> >>
>> >> >> [ 150.443116] INFO: rcu_sched self-detected stall on CPU
>> >> >> [ 150.448261] 25-...: (14997 ticks this GP)
>> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> >> [ 150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
>> >> >> [ 150.451123] 25-...: (14997 ticks this GP)
>> >> >> idle=47a/140000000000001/0 softirq=349/349 fqs=7495
>> >> >> [ 150.451124] (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
>> >> >> [ 150.451136] Task dump for CPU 25:
>> >> >> [ 150.451138] kworker/25:1 R running task 0 454 2 0x00000002
>> >> >> [ 150.451155] Workqueue: events netstamp_clear
>> >> >> [ 150.451158] Call trace:
>> >> >> [ 150.451164] [] __switch_to+0x90/0xa8
>> >> >> [ 150.451172] [] static_key_slow_inc+0x128/0x138
>> >> >> [ 150.451175] [] static_key_enable+0x34/0x60
>> >> >> [ 150.451178] [] netstamp_clear+0x68/0x80
>> >> >> [ 150.451181] [] process_one_work+0x158/0x478
>> >> >> [ 150.451183] [] worker_thread+0x50/0x4a8
>> >> >> [ 150.451187] [] kthread+0x108/0x138
>> >> >> [ 150.451190] [] ret_from_fork+0x10/0x50
>> >> >> [ 150.477451] (t=15008 jiffies g=805 c=804 q=8384)
>> >> >> [ 150.482242] Task dump for CPU 25:
>> >> >> [ 150.482245] kworker/25:1 R running task 0 454 2 0x00000002
>> >> >> [ 150.482259] Workqueue: events netstamp_clear
>> >> >> [ 150.482264] Call trace:
>> >> >> [ 150.482271] [] dump_backtrace+0x0/0x2b0
>> >> >> [ 150.482277] [] show_stack+0x24/0x30
>> >> >> [ 150.482281] [] sched_show_task+0x128/0x178
>> >> >> [ 150.482285] [] dump_cpu_task+0x48/0x58
>> >> >> [ 150.482288] [] rcu_dump_cpu_stacks+0xa0/0xe8
>> >> >> [ 150.482297] [] rcu_check_callbacks+0x774/0x938
>> >> >> [ 150.482305] [] update_process_times+0x34/0x60
>> >> >> [ 150.482314] [] tick_sched_handle.isra.7+0x38/0x70
>> >> >> [ 150.482319] [] tick_sched_timer+0x4c/0x98
>> >> >> [ 150.482324] [] __hrtimer_run_queues+0xd8/0x2b8
>> >> >> [ 150.482328] [] hrtimer_interrupt+0xa8/0x228
>> >> >> [ 150.482334] [] arch_timer_handler_phys+0x3c/0x50
>> >> >> [ 150.482341] [] handle_percpu_devid_irq+0x8c/0x230
>> >> >> [ 150.482344] [] generic_handle_irq+0x34/0x50
>> >> >> [ 150.482347] [] __handle_domain_irq+0x68/0xc0
>> >> >> [ 150.482351] [] gic_handle_irq+0xc4/0x170
>> >> >> [ 150.482356] Exception stack(0xffff801f901a7ab0 to 0xffff801f901a7be0)
>> >> >> [ 150.482360] 7aa0:
>> >> >> 0000000000000003 ffff801f6d682578
>> >> >> [ 150.482364] 7ac0: 0000000000000000 0000000000000020
>> >> >> 0000000000000020 ffffffff00000000
>> >> >> [ 150.482367] 7ae0: 0000000000000000 0000000000000000
>> >> >> 0000000000000002 ffff801f901a7d30
>> >> >> [ 150.482371] 7b00: 00000000000009f0 00000000366e2f04
>> >> >> 0000000000000018 0000000058dabe4c
>> >> >> [ 150.482375] 7b20: 0036312196549b4a 00003d0d6dd44d08
>> >> >> ffff0000082e81d8 0000ffffb77c9028
>> >> >> [ 150.482378] 7b40: 0000000000000a03 ffff000009138e38
>> >> >> ffff801f6d5cff08 ffff801f6d5cff00
>> >> >> [ 150.482382] 7b60: ffff000008e11e00 0000000000000001
>> >> >> ffff0000081565d0 0000000000000000
>> >> >> [ 150.482386] 7b80: 0000000000000080 ffff000009139434
>> >> >> ffff000009139000 ffff801f901a7be0
>> >> >> [ 150.482390] 7ba0: ffff000008156e8c ffff801f901a7be0
>> >> >> ffff000008156ecc 0000000080000145
>> >> >> [ 150.482394] 7bc0: ffff801f901a7be0 ffff000008156e68
>> >> >> ffffffffffffffff ffff000008156e8c
>> >> >> [ 150.482397] [] el1_irq+0xb4/0x140
>> >> >> [ 150.482401] [] smp_call_function_many+0x284/0x2e8
>> >> >> [ 150.482405] [] kick_all_cpus_sync+0x30/0x38
>> >> >> [ 150.482409] [] aarch64_insn_patch_text+0xec/0xf8
>> >> >> [ 150.482415] [] arch_jump_label_transform+0x60/0x98
>> >> >> [ 150.482420] [] __jump_label_update+0x8c/0xa8
>> >> >> [ 150.482423] [] jump_label_update+0x58/0xe8
>> >> >> [ 150.482429] [] static_key_slow_inc+0x128/0x138
>> >> >> [ 150.482434] [] static_key_enable+0x34/0x60
>> >> >> [ 150.482438] [] netstamp_clear+0x68/0x80
>> >> >> [ 150.482441] [] process_one_work+0x158/0x478
>> >> >> [ 150.482444] [] worker_thread+0x50/0x4a8
>> >> >> [ 150.482448] [] kthread+0x108/0x138
>> >> >> [ 150.482451] [] ret_from_fork+0x10/0x50
>> >> >>
>> >> >> ====================================
>> >> >>
>> >> >> I am observing that this usually happens when the guest tries to
>> >> >> bring up or use the default virtio-net interface.
>> >> >> And I am unable to reproduce this when directly booting the guest
>> >> >> kernel without the UEFI BIOS.
>> >> >> Has anyone observed a similar issue?
>> >> >>
>> >> >
>> >> > I haven't seen this myself.
>> >> >
>> >> > Have you tried with other versions of QEMU and the host kernel by any
>> >> > chance using the same setup?
>> >>
>> >> The earlier qemu version (at least 2.5, which is the default on Ubuntu
>> >> 16.04) has a different issue. The same thing happens when I use
>> >> vhost=on, but without vhost and with virtio-net I didn't see the issue.
>> >> The Armband folks have also reported the vhost issue here:
>> >> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564
>> >
>> > So I just ran your setup using the flash emulation on the ThunderX in
>> > packet.net and my host is happily continuing to run hackbench.
>>
>> One more thing I see is that the ubuntu cloud images run a cloud-init
>> service during boot which seems to do a lot of network init; the bug
>> appears when it is doing such things.
>>
>> >
>> >>
>> >> So I am beginning to think the issue could be related to the flash
>> >> emulation but wanted to check on the ml if anyone else is seeing it. But
>> >> if it is flash emulation, how can it hose the host like this?
>> >
>> > Why do you think it's related to the flash emulation?
>>
>> Because without the pflash method, directly launching the kernel using
>> the "-kernel" option of qemu works ok.
>
> With the same kernel binary which is otherwise used inside the cloud
> image when booting with QEMU?

In this case the guest kernel is 4.11; the one from the cloud img is 4.10.

>
> Make sure you verify this conclusion by repeating tries tens of times at
> least, just to make sure we're not misdiagnosing based on chance.

Ok, agreed. Initially I was assuming something broke at the top of the tree
as I moved from stock qemu 2.5 to master, but it turned out that isn't it.
I will have to try some combinations and get back to you guys.

>
>>
>>
>> >
>> > From what you're telling me it sounds much more like a vhost regression.
>> >
>> > Can you try with v4.10 of the kernel with vhost=on?
>>
>> Ok, will try.
>
> Cool.
>
>> I do think the issue could also be on the host kernel side and
>> something is triggering it.
>> The newer qemu has a default virtio-net interface. Does this use
>> vhost by default?
>> And is there a way not to have a default virtio-net interface?
>
> I really don't know, I'm not an expert on QEMU or dealing with virtio
> vhost config options.
>
>>
>> I am not specifying any vhost in my command.
>> My observation with vhost=on was with qemu-2.5 and the 4.4 kernel from Ubuntu 16.04.
>>
>> Another observation I have is that, occasionally during the trials, the
>> guest hangs at this point and a host cpu goes into a softlockup after
>> some time. I cannot kill the qemu either through qemu Ctrl-A+C+Q or the
>> kill command from the host.
>>
>> ------------- cut -----------------
>> EFI stub: Booting Linux Kernel...
>> EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
>> EFI stub: Using DTB from configuration table
>> EFI stub: Exiting boot services and installing virtual address map...
>> [ 0.000000] Booting Linux on physical CPU 0x0
>> ------------- end ------------------
>>
>>
>
> My best bet based on what you're telling me above is that it's related
> to memory accesses somehow; with vhost the host kernel will access guest
> memory and with pflash emulation I think we're mapping host regions
> read-only and executing code in the guest from those regions. Perhaps
> we end up corrupting some memory somewhere.
>
> Is this a different hardware platform from the ThunderX in packet.net?

The chip is the same, just different in the onboard ethernet ports and PCIe
slots.
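
For the next round of tries I plan to take the implicit NIC out of the picture
and control vhost explicitly, roughly along these lines (tap0 is just a
placeholder for whatever host tap interface I end up using; I still need to
double-check the exact option spellings against qemu 2.9):

./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm \
    -M virt,gic_version=3 -nographic -smp 1 -m 2048 \
    -drive if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img \
    -device virtio-blk-device,drive=hd0 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=off \
    -device virtio-net-device,netdev=net0 \
    -pflash /root/flash0.img -pflash /root/flash1.img

and then the same with vhost=on to compare, plus a run with "-net none" and no
-netdev at all to check whether the pure pflash path is clean. Looking for a
vhost-<qemu pid> kernel thread on the host (ps -ef | grep vhost-) should tell
me whether a given run is actually using vhost.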
>
> Thanks,
> -Christoffer