From: James Harvey <jamespharvey20@gmail.com>
To: kvm@vger.kernel.org
Subject: 5.2.11+ Regression: > nproc/2 lockups during initramfs
Date: Sun, 8 Sep 2019 06:37:43 -0400
Message-ID: <CA+X5Wn4CbU305tDeu4UM=rBEzVyVgf0+YLsx70RtUJMZCFhXXw@mail.gmail.com>

Host is up-to-date Arch Linux, except for the linux package, which I
downgraded to narrow this down: 5.2.10 works, 5.2.11 through 5.2.13 do
not.  QEMU is 4.1.0; downgrading it to 4.0.0 made no difference.

Host is dual E5-2690 v1 Xeons.  With hyperthreading, 32 logical cores.
I've always been able to boot qemu with "-smp
cpus=30,cores=15,threads=1,sockets=2".  I leave 2 free for host
responsiveness.
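
For reference, the -smp value above follows from the host's logical CPU
count.  A minimal sketch of that arithmetic, assuming the same layout
(threads=1, vCPUs split evenly across sockets); the function name
smp_arg and its parameters are made up for illustration and are not
part of QEMU:

```python
def smp_arg(host_logical_cpus: int, reserve: int = 2, sockets: int = 2) -> str:
    """Build a QEMU -smp value that leaves `reserve` host CPUs unused.

    Hypothetical helper: it only reproduces the arithmetic in this report.
    """
    vcpus = host_logical_cpus - reserve
    if vcpus < sockets or vcpus % sockets != 0:
        raise ValueError("vCPU count must divide evenly across sockets")
    cores = vcpus // sockets  # cores per socket, one thread per core
    return f"cpus={vcpus},cores={cores},threads=1,sockets={sockets}"


print(smp_arg(32))  # cpus=30,cores=15,threads=1,sockets=2
```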

Upgrading the host kernel from 5.2.10 to 5.2.11 causes the VM to lock
up while loading the initramfs about 90-95% of the time.  (Probably a
race condition, since some boots succeed.)  On the host, QEMU then
shows nVmCPUs*100% CPU usage, so around 3000% for 30 cpus.
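
The CPU figure can be checked without top.  A rough sketch, assuming
Linux's /proc/<pid>/stat layout (utime and stime are fields 14 and 15,
counted in clock ticks); cpu_ticks and cpu_percent are illustrative
names, not an existing tool:

```python
import os


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # after the last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, seconds, ticks_per_second=None):
    """CPU usage over an interval, where 100.0 means one fully busy CPU."""
    if ticks_per_second is None:
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


# 30 spinning vCPU threads at 100 ticks/s accumulate 3000 ticks per second.
print(cpu_percent(0, 3000, 1.0, ticks_per_second=100))  # 3000.0
```

Reading /proc/<pid>/stat for the qemu process twice, a known number of
seconds apart, and passing both cpu_ticks() results to cpu_percent()
gives the same per-process figure.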

If I back down to "cpus=16,cores=8", it always boots.  If I increase
to "cpus=18,cores=9", it goes back to locking up 90-95% of the time.
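
The "> nproc/2" in the subject comes from these data points.  A small
sketch of that comparison on this 32-thread host; note that only 16, 18
and 30 vCPUs are reported here, so the exact boundary (e.g. 17) is an
assumption, and over_half is just an illustrative name:

```python
def over_half(vcpus: int, host_logical_cpus: int = 32) -> bool:
    """True if the guest has more vCPUs than half the host's logical CPUs."""
    return 2 * vcpus > host_logical_cpus


# vCPU counts mentioned in this report.
for vcpus in (16, 18, 30):
    label = "over nproc/2" if over_half(vcpus) else "at or below nproc/2"
    print(vcpus, label)
```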

Omitting "-accel kvm" lets the VM boot on a 5.2.11 host without issue.
Combined with the fact that the only package I need to downgrade to
avoid the hang is linux (to 5.2.10), I think this must be a KVM issue.

Using a build of QEMU with debug symbols gives:
* gdb backtrace: http://ix.io/1UyO
* 11 seconds of attaching strace to locked up qemu (167K): http://ix.io/1UyP
* strace from the beginning of starting a qemu that locks up (8MB):
https://filebin.ca/4uI15ztGAarw/strace.qemu.from.start
** This definitely changed timings, and it became harder to replicate,
to where I'd guess 20-30% of boots hang
** Interestingly, the strace only collected data for 5 seconds, even
though qemu continued at full CPU usage much longer.  Don't know what
to make of that, especially because the first strace was attached to
an already locked up qemu that had gone well past 5 seconds.

Just as strace changed the timings, attaching GDB to a running qemu
(which pauses it) and then simply running "continue" has gotten it
"unstuck" immediately.

I've let this run for 14 hours; once it goes to full CPU usage, it
never comes out.

If booting from the September 2019 Arch ISO, it hangs right after the
ISO's UEFI bootloader selects Arch Linux, then the screen goes black.

If booting from grub/systemd, it hangs right after "Loading Initial Ramdisk..."

Thread overview: 5+ messages
2019-09-08 10:37 James Harvey [this message]
2019-09-10 18:32 ` 5.2.11+ Regression: > nproc/2 lockups during initramfs Sean Christopherson
2019-09-12  7:59   ` James Harvey
2019-09-17 13:36     ` Paolo Bonzini
2019-09-17 13:41       ` Greg KH
