* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
@ 2020-10-26 13:10 ` bugzilla-daemon
2020-10-26 13:10 ` bugzilla-daemon
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-10-26 13:10 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
--- Comment #1 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Created attachment 293197
--> https://bugzilla.kernel.org/attachment.cgi?id=293197&action=edit
RHEL 8.2 host (AMD) - full log
--
You are receiving this mail because:
You are watching the assignee of the bug.
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-10-26 13:10 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
--- Comment #2 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Created attachment 293199
--> https://bugzilla.kernel.org/attachment.cgi?id=293199&action=edit
Fedora 32 host (AMD) - full log
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-10-26 20:05 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Regression|No |Yes
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-02 16:16 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.1-arch1-1 |5.9.3-arch1-1
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-02 16:24 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
--- Comment #3 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Clarification: the issue seems to appear only on AMD CPUs. Across several runs,
tests in the "AMD[0] rack" suffer from the soft lockup above, but the same
workload passes on machines from the "Intel[1] rack".
[0] AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[1] Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (family: 0x6, model: 0x3a, stepping: 0x9)
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-09 10:59 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.3-arch1-1 |5.9.6-arch1-1
--- Comment #4 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Results with kernel 5.9.6:
[ 4.353614] PCI: Using configuration type 1 for extended access
[ 4.361708] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[ 4.363625] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[ 64.373614] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 64.376918] rcu: 3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=6000 last_accelerate: 0000/e77e dyntick_enabled: 0
[ 64.376918] (detected by 0, t=18002 jiffies, g=-1123, q=62)
[ 64.376918] Sending NMI from CPU 0 to CPUs 3:
[ 244.390281] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 244.393584] rcu: 3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=24002 last_accelerate: 0000/ba73 dyntick_enabled: 0
[ 244.393584] (detected by 0, t=72007 jiffies, g=-1123, q=62)
[ 244.393584] Sending NMI from CPU 0 to CPUs 3:
[ 424.406947] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 424.410251] rcu: 3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=42004 last_accelerate: 0000/8d68 dyntick_enabled: 0
[ 424.410251] (detected by 0, t=126012 jiffies, g=-1123, q=62)
[ 424.410251] Sending NMI from CPU 0 to CPUs 3:
qemu-system-x86_64: terminating on signal 15 from pid 31982 (timeout)
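As a rough cross-check of the stall reports above, the following sketch pulls
the timestamp and the t= jiffies value out of each "detected by" line and
estimates the guest's tick rate from the first and last report. The function
names and the regular expression are assumptions made for illustration and are
not part of any kernel tooling; the sample lines are copied from the log above.

```python
import re

# Matches lines such as:
# [ 64.376918] (detected by 0, t=18002 jiffies, g=-1123, q=62)
STALL_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+\(detected by (\d+), t=(\d+) jiffies")


def parse_stalls(log_text):
    """Return (timestamp_seconds, detecting_cpu, jiffies) for each stall report."""
    return [(float(ts), int(cpu), int(j)) for ts, cpu, j in STALL_RE.findall(log_text)]


def estimate_hz(stalls):
    """Estimate ticks per second from the first and last stall report."""
    (t0, _, j0), (t1, _, j1) = stalls[0], stalls[-1]
    return (j1 - j0) / (t1 - t0)


# Sample lines taken from the 5.9.6 log in this comment.
SAMPLE = """\
[ 64.376918] (detected by 0, t=18002 jiffies, g=-1123, q=62)
[ 244.393584] (detected by 0, t=72007 jiffies, g=-1123, q=62)
[ 424.410251] (detected by 0, t=126012 jiffies, g=-1123, q=62)
"""

stalls = parse_stalls(SAMPLE)
hz = estimate_hz(stalls)
print(f"estimated tick rate: {hz:.1f} Hz")
print(f"first stall duration: {stalls[0][2] / hz:.1f} s")
```

With these three reports the estimate comes out at about 300 ticks per second,
so t=18002 corresponds to roughly 60 seconds. That matches the gap between the
last normal boot message (around 4.36 s) and the first stall report (around
64.37 s), i.e. CPU 3 appears to have been stuck since early boot.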
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-12 10:02 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.6-arch1-1 |5.9.8-arch1-1
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-21 16:19 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.8-arch1-1 |5.9.9-arch1-1
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-11-27 10:21 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.9-arch1-1 |5.9.10-arch1-1
--- Comment #5 from Frantisek Sumsal (frantisek@sumsal.cz) ---
I noticed there's an MSR access error when bringing secondary CPUs online,
which may be relevant:
[ 3.969876] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
[ 3.973256] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user
pointer sanitization
[ 3.976544] Spectre V2 : Mitigation: Full AMD retpoline
[ 3.979874] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on
context switch
[ 3.983210] Spectre V2 : mitigation: Enabling conditional Indirect Branch
Prediction Barrier
[ 3.986544] Speculative Store Bypass: Mitigation: Speculative Store Bypass
disabled via prctl and seccomp
[ 3.990704] Freeing SMP alternatives memory: 32K
[ 3.997866] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model:
0x2, stepping: 0x0)
[ 4.001938] Performance Events: Fam15h core perfctr, AMD PMU driver.
[ 4.003261] ... version: 0
[ 4.006576] ... bit width: 48
[ 4.009900] ... generic registers: 6
[ 4.013234] ... value mask: 0000ffffffffffff
[ 4.016567] ... max period: 00007fffffffffff
[ 4.019900] ... fixed-purpose events: 0
[ 4.023233] ... event mask: 000000000000003f
[ 4.026887] rcu: Hierarchical SRCU implementation.
[ 4.030952] smp: Bringing up secondary CPUs ...
[ 4.034030] x86: Booting SMP configuration:
[ 4.036581] .... node #0, CPUs: #1
[ 1.328014] kvm-clock: cpu 1, msr 8801041, secondary cpu clock
[ 1.328014] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 1.328014] unchecked MSR access error: WRMSR to 0x48 (tried to write
0x0000000000000000) at rIP: 0xffffffff9da6c984 (native_write_msr+0x4/0x20)
[ 1.328014] Call Trace:
[ 1.328014] x86_spec_ctrl_setup_ap+0x34/0x50
[ 1.328014] identify_secondary_cpu+0x6c/0x80
[ 1.328014] smp_store_cpu_info+0x45/0x50
[ 1.328014] start_secondary+0x58/0x160
[ 1.328014] secondary_startup_64+0xb6/0xc0
[ 6.088346] kvm-guest: stealtime: cpu 1, msr 1e66e080
[ 6.094247] #2
[ 1.328014] kvm-clock: cpu 2, msr 8801081, secondary cpu clock
[ 1.328014] smpboot: CPU 2 Converting physical 0 to logical die 2
[ 6.123987] kvm-guest: stealtime: cpu 2, msr 1e6ae080
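For reference, MSR 0x48 is IA32_SPEC_CTRL, the speculation control register,
which fits the x86_spec_ctrl_setup_ap frame in the trace. Whether this error is
connected to the stall is not established here. The sketch below shows one way
to pick such errors out of a boot log and label the register; the lookup table
is deliberately tiny and, like the function name, is an assumption for
illustration only.

```python
import re

# Tiny, illustrative lookup table; only the two entries below are included.
KNOWN_MSRS = {0x48: "IA32_SPEC_CTRL", 0x49: "IA32_PRED_CMD"}

MSR_RE = re.compile(
    r"unchecked MSR access error: (WRMSR to|RDMSR from) (0x[0-9a-fA-F]+)"
    r"(?: \(tried to write (0x[0-9a-fA-F]+)\))?"
)


def parse_msr_errors(log_text):
    """Return one dict per 'unchecked MSR access error' found in the log."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    results = []
    for op, msr, value in MSR_RE.findall(flat):
        index = int(msr, 16)
        results.append({
            "op": op.split()[0],                       # "WRMSR" or "RDMSR"
            "msr": index,
            "name": KNOWN_MSRS.get(index, "unknown"),
            "value": int(value, 16) if value else None,
        })
    return results


# Sample copied from the log above, including the mail line wrap.
SAMPLE = """\
[ 1.328014] unchecked MSR access error: WRMSR to 0x48 (tried to write
0x0000000000000000) at rIP: 0xffffffff9da6c984 (native_write_msr+0x4/0x20)
"""

errors = parse_msr_errors(SAMPLE)
for err in errors:
    print(err)
```

In the log above the error is reported once, for CPU 1; the CPU 2 bring-up
lines that follow show no such message.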
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-12-01 8:39 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
taz.007@zoho.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |taz.007@zoho.com
--- Comment #6 from taz.007@zoho.com ---
I've got a similar stack trace, but in a completely different context: this
is a small HTPC box, no KVM involved, Intel CPU, 32-bit.
déc 01 05:01:20 Aspire kernel: watchdog: BUG: soft lockup - CPU#2 stuck for
22s! [sshd:20874]
déc 01 05:01:20 Aspire kernel: Modules linked in: mptcp_diag tcp_diag udp_diag
raw_diag inet_diag rpcsec_gss_krb5 md4 cmac nls_utf8 cifs libdes dns_resolver
fscache fuse hwmon_vid nouveau ath5k snd_hda_codec_hdmi ath mxm_wmi ttm
snd_hda_codec_realtek mac80211 snd_hda_codec_generic drm_kms_helper
ledtrig_audio snd_hda_intel cfg80211 snd_intel_dspcfg mousedev cec input_leds
snd_hda_codec rc_core snd_hda_core syscopyarea rfkill hid_generic sysfillrect
snd_hwdep wmi_bmof libarc4 snd_pcm sysimgblt snd_timer fb_sys_fops coretemp
usbhid uas hid pcspkr i2c_algo_bit usb_storage snd nv_tco soundcore forcedeth
i2c_nforce2 wmi evdev nfsd auth_rpcgss tcp_bbr nfs_acl lockd grace sunrpc sg
drm nfs_ssc agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
ohci_pci ehci_pci ehci_hcd ohci_hcd
déc 01 05:01:20 Aspire kernel: CPU: 2 PID: 20874 Comm: sshd Tainted: G D
5.9.9-arch1-1 #1
déc 01 05:01:20 Aspire kernel: Hardware name: Acer Aspire R3610/FMCP7A-ION-LE,
BIOS P01-A4 11/03/2009
déc 01 05:01:20 Aspire kernel: EIP: queued_spin_lock_slowpath+0x42/0x200
déc 01 05:01:20 Aspire kernel: Code: 8b 01 0f b6 d2 c1 e2 08 30 e4 09 d0 a9 00
01 ff ff 0f 85 21 01 00 00 85 c0 74 15 8b 01 84 c0 74 0f 8d b4 26 00 00 00 00
f3 90 <8b> 01 84 c0 75 f8 b8 01 00 00 00 66 89 01 64 ff 05 c0 1e bd c6 c3
déc 01 05:01:20 Aspire kernel: EAX: 00000101 EBX: df4f1ea4 ECX: c6bf5e68 EDX:
00000000
déc 01 05:01:20 Aspire kernel: ESI: 00000001 EDI: df4f1e58 EBP: df4f1dcc ESP:
df4f1dc8
déc 01 05:01:20 Aspire kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
EFLAGS: 00000202
déc 01 05:01:20 Aspire kernel: CR0: 80050033 CR2: 0048a107 CR3: 218a8000 CR4:
000006d0
déc 01 05:01:20 Aspire kernel: Call Trace:
déc 01 05:01:20 Aspire kernel: ? _raw_spin_lock+0x2c/0x30
déc 01 05:01:20 Aspire kernel: __change_page_attr_set_clr+0x45/0x740
déc 01 05:01:20 Aspire kernel: ? _vm_unmap_aliases.part.0+0x114/0x130
déc 01 05:01:20 Aspire kernel: change_page_attr_set_clr+0xd0/0x2a0
déc 01 05:01:20 Aspire kernel: set_memory_ro+0x1b/0x20
déc 01 05:01:20 Aspire kernel: bpf_prog_select_runtime+0x16c/0x1b0
déc 01 05:01:20 Aspire kernel: bpf_migrate_filter+0xe2/0x130
déc 01 05:01:20 Aspire kernel: bpf_prog_create_from_user+0x147/0x190
déc 01 05:01:20 Aspire kernel: ? hardlockup_detector_perf_cleanup+0x70/0x70
déc 01 05:01:20 Aspire kernel: do_seccomp+0x22d/0x9a0
déc 01 05:01:20 Aspire kernel: ? security_task_prctl+0x38/0x90
déc 01 05:01:20 Aspire kernel: prctl_set_seccomp+0x27/0x40
déc 01 05:01:20 Aspire kernel: __ia32_sys_prctl+0x87/0x4f0
déc 01 05:01:20 Aspire kernel: __do_fast_syscall_32+0x40/0x70
déc 01 05:01:20 Aspire kernel: do_fast_syscall_32+0x29/0x60
déc 01 05:01:20 Aspire kernel: do_SYSENTER_32+0x15/0x20
déc 01 05:01:20 Aspire kernel: entry_SYSENTER_32+0x9f/0xf2
déc 01 05:01:20 Aspire kernel: EIP: 0xb7fd0549
déc 01 05:01:20 Aspire kernel: Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01
10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34
cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
déc 01 05:01:20 Aspire kernel: EAX: ffffffda EBX: 00000016 ECX: 00000002 EDX:
004f4d5c
déc 01 05:01:20 Aspire kernel: ESI: 00000000 EDI: 00000001 EBP: b7b6ae1c ESP:
bffff6ec
déc 01 05:01:20 Aspire kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
EFLAGS: 00000292
kernel 5.9.9-arch1-1
Feel free to tell me if it's unrelated, and I'll open a new bug report about
it.
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-12-04 11:57 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.9.10-arch1-1 |5.9.11-arch2-1
* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
From: bugzilla-daemon @ 2020-12-23 12:31 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=209867
Frantisek Sumsal (frantisek@sumsal.cz) changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |CODE_FIX
--- Comment #7 from Frantisek Sumsal (frantisek@sumsal.cz) ---
So, it looks like either the issue was resolved in kernel 5.9.12+ (currently on
5.9.14), or the upgrade of the hypervisors to CentOS 8.3 (4.18.0-240.1.1.el8_3)
helped. Unfortunately, I have no easy way to check which one of them is the
real fix here.
As for you, Taz, please open a new bug if you still encounter the issue you
mentioned, so it won't get forgotten.
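For anyone comparing their own setup against the versions mentioned above, here
is a minimal sketch that checks whether a kernel release string is at least
5.9.12. It only helps with the guest-kernel half of the question, since it is
unconfirmed which change was the fix. The helper names are hypothetical; the
release strings in the loop are the ones mentioned in this bug.

```python
import re


def kernel_version(release):
    """Extract (major, minor, patch) from a string such as '5.9.11-arch2-1'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def at_least(release, minimum=(5, 9, 12)):
    """True if the release is at or above the given (major, minor, patch)."""
    return kernel_version(release) >= minimum


# Releases mentioned in this bug; on a live system platform.release() could be
# passed in instead.
for rel in ("5.9.1-arch1-1", "5.9.11-arch2-1", "5.9.14"):
    print(rel, at_least(rel))
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where
"5.9.9" would sort after "5.9.12".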