* [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP
@ 2020-10-26 12:13 bugzilla-daemon
  2020-10-26 13:10 ` [Bug 209867] " bugzilla-daemon
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-10-26 12:13 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

            Bug ID: 209867
           Summary: CPU soft lockup/stall with nested KVM and SMP
           Product: Virtualization
           Version: unspecified
    Kernel Version: 5.9.1-arch1-1
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: frantisek@sumsal.cz
        Regression: No

Hello,

During my systemd CI adventures I've encountered an issue with kernel 5.9.x
where the boot freezes at completely random moments because of a CPU soft
lockup. From my testing it seems to be reproducible with nested KVM & SMP > 1
(it does happen with SMP == 1 as well, but not always) - see [0].

Reproducer is quite straightforward - enable nested KVM on the host, create a
VM, and create a nested KVM VM in that VM. During my testing I used Vagrant[1]
(with libvirt backend) for the outer VM, and an image generated by mkosi[2] for
the inner VM. Both VMs run the same kernel version.

Hosts:
 * several AMD & Intel servers with RHEL 8.2 (4.18.0-193.19.1.el8_2)
 * AMD desktop with Fedora 32 (5.6.2-300.fc32.x86_64)

The behavior was consistent on all hosts.
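
Since the reproducer depends on nested KVM being enabled on the host, a quick check of the host setting can rule out a misconfiguration before chasing the lockup. The following is a minimal sketch, assuming the usual sysfs module parameter files for `kvm_intel` and `kvm_amd`; the function names are hypothetical and not part of the original report.

```python
from pathlib import Path

# Module parameter files exposed by the kvm_intel / kvm_amd modules when loaded.
NESTED_PARAMS = {
    "kvm_intel": Path("/sys/module/kvm_intel/parameters/nested"),
    "kvm_amd": Path("/sys/module/kvm_amd/parameters/nested"),
}


def nested_enabled(raw: str) -> bool:
    """Interpret the parameter value; kernels report either Y/N or 1/0."""
    return raw.strip() in {"Y", "y", "1"}


def check_host() -> dict:
    """Return {module_name: enabled} for every KVM vendor module that is loaded."""
    result = {}
    for module, path in NESTED_PARAMS.items():
        if path.exists():
            result[module] = nested_enabled(path.read_text())
    return result


if __name__ == "__main__":
    status = check_host()
    if not status:
        print("no KVM vendor module loaded")
    for module, enabled in status.items():
        print(f"{module}: nested={'on' if enabled else 'off'}")
```

Run it on the bare-metal host and again inside the outer VM; both levels need KVM available for the inner guest to use hardware acceleration.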

Desktop results:
# qemu-system-x86_64 -net none -smp 2 -m 512 -nographic -machine accel=kvm \
    -enable-kvm -cpu host -kernel /boot/vmlinuz-linux \
    -initrd /boot/initramfs-linux.img \
    -append 'debug rw console=ttyS0 root=/dev/sda1' \
    -drive format=raw,file=image.raw
...
[    4.602193] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[    5.538763] random: crng init done
[   28.635398] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd:1]
[   28.638215] Modules linked in: drm agpgart ip_tables x_tables ext4
crc32c_generic crc16 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi
crc32_pclmul crc32c_intel serio_raw atkbd libps2 aesni_intel glue_helper
crypto_simd cryptd ata_piix floppy i8042 serio
[   28.642668] CPU: 2 PID: 1 Comm: systemd Not tainted 5.9.1-arch1-1 #1
[   28.648865] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
ArchLinux 1.14.0-1 04/01/2014
[   28.648865] RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0
[   28.655420] Code: c3 0d 3d 00 3b 05 61 0a 83 01 89 c7 0f 83 f4 fd ff ff 48
63 c7 49 8b 55 00 48 03 14 c5 00 19 81 8b 8b 42 08 a8 01 74 09 f3 90 <8b> 42 08
a8 01 75 f7 eb c9 48 c7 c2 60 45 d7 8b 48 89 ee 44 89 ff
[   28.655420] RSP: 0018:ffffacf800013b18 EFLAGS: 00000202
[   28.668750] RAX: 0000000000000011 RBX: 0000000000000000 RCX:
0000000000000000
[   28.668750] RDX: ffff91c39da333e0 RSI: 0000000000000000 RDI:
0000000000000000
[   28.668750] RBP: 0000000000000003 R08: 0000000000000000 R09:
0000000000000000
[   28.668750] R10: 0000000000000140 R11: 0000000000000002 R12:
0000000000000000
[   28.682088] R13: ffff91c39db2d340 R14: 0000000000000140 R15:
ffff91c39db2d348
[   28.682088] FS:  00007fd4cbc04340(0000) GS:ffff91c39db00000(0000)
knlGS:0000000000000000
[   28.682088] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.682088] CR2: 00007fd4cc9e8520 CR3: 000000001fa8e000 CR4:
0000000000350ee0
[   28.695419] Call Trace:
[   28.695419]  ? __flush_tlb_all+0x30/0x30
[   28.695419]  ? __flush_tlb_all+0x30/0x30
[   28.695419]  on_each_cpu+0x43/0xb0
[   28.695419]  __purge_vmap_area_lazy+0x5d/0x670
[   28.695419]  ? do_jit+0xbdf/0x1cd0
[   28.708758]  ? purge_fragmented_blocks+0xbd/0x1a0
[   28.708758]  _vm_unmap_aliases.part.0+0x110/0x140
[   28.708758]  change_page_attr_set_clr+0xb9/0x1c0
[   28.708758]  set_memory_ro+0x26/0x30
[   28.708758]  bpf_int_jit_compile+0x407/0x42b
[   28.708758]  bpf_prog_select_runtime+0x101/0x1a0
[   28.708758]  bpf_prog_load+0x49a/0x8e0
[   28.722089]  __do_sys_bpf+0x2dd/0x1ea0
[   28.722089]  do_syscall_64+0x33/0x40
[   28.722089]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   28.722089] RIP: 0033:0x7fd4cc91ed5d
[   28.722089] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d e3 70 0c 00 f7 d8 64 89 01 48
[   28.735427] RSP: 002b:00007ffe18ec4218 EFLAGS: 00000246 ORIG_RAX:
0000000000000141
[   28.735427] RAX: ffffffffffffffda RBX: 000055cea99a9ab0 RCX:
00007fd4cc91ed5d
[   28.735427] RDX: 0000000000000070 RSI: 00007ffe18ec4220 RDI:
0000000000000005
[   28.748752] RBP: 0000000000000000 R08: 0070756f7267632f R09:
0000000800000008
[   28.748752] R10: 0000000000000000 R11: 0000000000000246 R12:
000055cea999fb20
[   28.748752] R13: 0000000000000001 R14: 0000000000000001 R15:
000055cea99836a0
[   56.635397] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd:1]
[   56.638254] Modules linked in: drm agpgart ip_tables x_tables ext4
crc32c_generic crc16 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi
crc32_pclmul crc32c_intel serio_raw atkbd libps2 aesni_intel glue_helper
crypto_simd cryptd ata_piix floppy i8042 serio
[   56.642094] CPU: 2 PID: 1 Comm: systemd Tainted: G             L   
5.9.1-arch1-1 #1
[   56.648798] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
ArchLinux 1.14.0-1 04/01/2014
[   56.648798] RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0
[   56.655444] Code: c3 0d 3d 00 3b 05 61 0a 83 01 89 c7 0f 83 f4 fd ff ff 48
63 c7 49 8b 55 00 48 03 14 c5 00 19 81 8b 8b 42 08 a8 01 74 09 f3 90 <8b> 42 08
a8 01 75 f7 eb c9 48 c7 c2 60 45 d7 8b 48 89 ee 44 89 ff
[   56.655444] RSP: 0018:ffffacf800013b18 EFLAGS: 00000202
[   56.655444] RAX: 0000000000000011 RBX: 0000000000000000 RCX:
0000000000000000
[   56.668871] RDX: ffff91c39da333e0 RSI: 0000000000000000 RDI:
0000000000000000
[   56.668871] RBP: 0000000000000003 R08: 0000000000000000 R09:
0000000000000000
[   56.668871] R10: 0000000000000140 R11: 0000000000000002 R12:
0000000000000000
[   56.668871] R13: ffff91c39db2d340 R14: 0000000000000140 R15:
ffff91c39db2d348
[   56.668871] FS:  00007fd4cbc04340(0000) GS:ffff91c39db00000(0000)
knlGS:0000000000000000
[   56.682244] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.682244] CR2: 00007fd4cc9e8520 CR3: 000000001fa8e000 CR4:
0000000000350ee0
[   56.682244] Call Trace:
[   56.682244]  ? __flush_tlb_all+0x30/0x30
[   56.682244]  ? __flush_tlb_all+0x30/0x30
[   56.682244]  on_each_cpu+0x43/0xb0
[   56.682244]  __purge_vmap_area_lazy+0x5d/0x670
[   56.695525]  ? do_jit+0xbdf/0x1cd0
[   56.695525]  ? purge_fragmented_blocks+0xbd/0x1a0
[   56.695525]  _vm_unmap_aliases.part.0+0x110/0x140
[   56.695525]  change_page_attr_set_clr+0xb9/0x1c0
[   56.695525]  set_memory_ro+0x26/0x30
[   56.695525]  bpf_int_jit_compile+0x407/0x42b
[   56.695525]  bpf_prog_select_runtime+0x101/0x1a0
[   56.708855]  bpf_prog_load+0x49a/0x8e0
[   56.708855]  __do_sys_bpf+0x2dd/0x1ea0
[   56.708855]  do_syscall_64+0x33/0x40
[   56.708855]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   56.708855] RIP: 0033:0x7fd4cc91ed5d
[   56.708855] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d e3 70 0c 00 f7 d8 64 89 01 48
[   56.722223] RSP: 002b:00007ffe18ec4218 EFLAGS: 00000246 ORIG_RAX:
0000000000000141
[   56.722223] RAX: ffffffffffffffda RBX: 000055cea99a9ab0 RCX:
00007fd4cc91ed5d
[   56.722223] RDX: 0000000000000070 RSI: 00007ffe18ec4220 RDI:
0000000000000005
[   56.722223] RBP: 0000000000000000 R08: 0070756f7267632f R09:
0000000800000008
[   56.735526] R10: 0000000000000000 R11: 0000000000000246 R12:
000055cea999fb20
[   56.735526] R13: 0000000000000001 R14: 0000000000000001 R15:
000055cea99836a0
...
[   64.578716] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   64.578716]  (detected by 2, t=18002 jiffies, g=-207, q=1973)
[   64.578716] rcu: All QSes seen, last rcu_preempt kthread activity 18002
(4294896513-4294878511), jiffies_till_next_fqs=2, root ->qsmask 0x0
[   64.578716] rcu: rcu_preempt kthread starved for 18002 jiffies! g-207 f0x2
RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=0
[   64.588745] rcu:     Unless rcu_preempt kthread gets sufficient CPU time,
OOM is now expected behavior.
[   64.588745] rcu: RCU grace-period kthread stack dump:
[   64.588745] task:rcu_preempt     state:R stack:    0 pid:   11 ppid:     2
flags:0x00004000
[   64.602163] Call Trace:
[   64.602163]  __schedule+0x292/0x830
[   64.602163]  schedule+0x46/0xf0
[   64.602163]  schedule_timeout+0x99/0x170
[   64.602163]  ? __next_timer_interrupt+0x100/0x100
[   64.602163]  rcu_gp_kthread+0x5a4/0xbe0
[   64.602163]  ? __note_gp_changes+0x190/0x190
[   64.602163]  kthread+0x142/0x160
[   64.602163]  ? __kthread_bind_mask+0x60/0x60
[   64.615482]  ret_from_fork+0x22/0x30
...

Server results:
...
[   32.051205] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[   32.051237] Modules linked in:
[   32.051237] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.1-arch1-1 #1
[   32.051237] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
ArchLinux 1.14.0-1 04/01/2014
[   32.051237] RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0
[   32.051237] Code: c3 0d 3d 00 3b 05 61 0a 83 01 89 c7 0f 83 f4 fd ff ff 48
63 c7 49 8b 55 00 48 03 14 c5 00 19 41 bb 8b 42 08 a8 01 74 09 f3 90 <8b> 42 08
a8 01 75 f7 eb c9 48 c7 c2 60 45 97 bb 48 89 ee 44 89 ff
[   32.051237] RSP: 0018:ffffa661c0013d98 EFLAGS: 00000202
[   32.051237] RAX: 0000000000000011 RBX: 0000000000000000 RCX:
0000000000000004
[   32.051237] RDX: ffff9e955db320a0 RSI: 0000000000000000 RDI:
0000000000000004
[   32.051237] RBP: 0000000000000007 R08: 0000000000000000 R09:
0000000000000004
[   32.051237] R10: 0000000000000005 R11: 0000000000000005 R12:
0000000000000000
[   32.051237] R13: ffff9e955da2d340 R14: 0000000000000140 R15:
ffff9e955da2d348
[   32.051237] FS:  0000000000000000(0000) GS:ffff9e955da00000(0000)
knlGS:0000000000000000
[   32.051237] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.051237] CR2: 0000000000000000 CR3: 000000001840e000 CR4:
00000000000406f0
[   32.051237] Call Trace:
[   32.051237]  ? _raw_spin_unlock+0x16/0x30
[   32.051237]  ? text_poke_loc_init+0x160/0x160
[   32.051237]  ? text_poke_loc_init+0x160/0x160
[   32.051237]  on_each_cpu+0x43/0xb0
[   32.051237]  text_poke_bp_batch+0x1d7/0x200
[   32.051237]  text_poke_finish+0x1b/0x26
[   32.051237]  arch_jump_label_transform_apply+0x16/0x30
[   32.051237]  static_key_slow_inc_cpuslocked+0x7a/0x90
[   32.051237]  static_key_slow_inc+0x16/0x20
[   32.051237]  ? kvm_init_platform+0x16/0x16
[   32.051237]  activate_jump_labels+0x2f/0x32
[   32.051237]  do_one_initcall+0x59/0x234
[   32.051237]  kernel_init_freeable+0x1b0/0x1f5
[   32.051237]  ? rest_init+0xbf/0xbf
[   32.051237]  kernel_init+0xa/0x111
[   32.051237]  ret_from_fork+0x22/0x30
...
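
Both traces above stall at the same kernel symbol even though the call paths differ (BPF JIT on the desktop, jump-label patching on the server). When comparing many CI logs, that kind of pattern is easier to spot mechanically. The following is a rough sketch, assuming the log format shown above; the regular expressions and function names are illustrative only.

```python
import re
from collections import Counter

# "watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd:1]"
# \s+ tolerates the line wrapping that mail clients add to long log lines.
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+)\s+stuck\s+for\s+(?P<secs>\d+)s!"
    r"\s*\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)
# Kernel-mode RIP lines only (code segment 0010), e.g.
# "RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0"
RIP_RE = re.compile(r"RIP: 0010:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def soft_lockups(log: str) -> list:
    """Return (cpu, seconds, comm, pid) for every soft lockup report."""
    return [
        (int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
        for m in LOCKUP_RE.finditer(log)
    ]


def kernel_rip_symbols(log: str) -> Counter:
    """Count the kernel symbols at which CPUs were reported stuck."""
    return Counter(m["symbol"] for m in RIP_RE.finditer(log))


SAMPLE = """\
[   28.635398] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd:1]
[   28.648865] RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0
[   28.722089] RIP: 0033:0x7fd4cc91ed5d
[   32.051205] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[   32.051237] RIP: 0010:smp_call_function_many_cond+0x2a3/0x2f0
"""

if __name__ == "__main__":
    for report in soft_lockups(SAMPLE):
        print(report)
    print(kernel_rip_symbols(SAMPLE).most_common())
```

The user-space `RIP: 0033:` lines are skipped on purpose, since only the kernel-side location says where the CPU is spinning.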


Frankly, I'm at my wits' end, as I've been noticing similar issues since kernel
5.8.x and still can't pinpoint what's going on (again, see [0]), thus my
apologies if I filed this under the wrong component.

Thank you.


[0]
https://github.com/systemd/systemd-centos-ci/pull/295#issuecomment-682519585
[1]
Vagrant.configure("2") do |config|
  config.vm.box = "generic/arch"
  config.vm.provider :libvirt do |libvirt|
    libvirt.cpus = 4
    libvirt.memory = "2048"
    libvirt.driver = "kvm"
    libvirt.nested = true
    libvirt.cpu_mode = "host-model"
    libvirt.random :model => "random"
  end
end

[2] # mkosi -b -d arch --qemu-headless -t gpt_ext4

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
@ 2020-10-26 13:10 ` bugzilla-daemon
  2020-10-26 13:10 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-10-26 13:10 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

--- Comment #1 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Created attachment 293197
  --> https://bugzilla.kernel.org/attachment.cgi?id=293197&action=edit
RHEL 8.2 host (AMD) - full log


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
  2020-10-26 13:10 ` [Bug 209867] " bugzilla-daemon
@ 2020-10-26 13:10 ` bugzilla-daemon
  2020-10-26 20:05 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-10-26 13:10 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

--- Comment #2 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Created attachment 293199
  --> https://bugzilla.kernel.org/attachment.cgi?id=293199&action=edit
Fedora 32 host (AMD) - full log


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
  2020-10-26 13:10 ` [Bug 209867] " bugzilla-daemon
  2020-10-26 13:10 ` bugzilla-daemon
@ 2020-10-26 20:05 ` bugzilla-daemon
  2020-11-02 16:16 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-10-26 20:05 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Regression|No                          |Yes


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (2 preceding siblings ...)
  2020-10-26 20:05 ` bugzilla-daemon
@ 2020-11-02 16:16 ` bugzilla-daemon
  2020-11-02 16:24 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-02 16:16 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.1-arch1-1               |5.9.3-arch1-1


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (3 preceding siblings ...)
  2020-11-02 16:16 ` bugzilla-daemon
@ 2020-11-02 16:24 ` bugzilla-daemon
  2020-11-09 10:59 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-02 16:24 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

--- Comment #3 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Clarification: the issue seems to appear only on AMD CPUs. I went through
several runs, and tests in the "AMD[0] rack" suffer from the soft lockup above,
but the same workload passes on machines from the "Intel[1] rack".

[0] AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[1] Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (family: 0x6, model: 0x3a,
stepping: 0x9)


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (4 preceding siblings ...)
  2020-11-02 16:24 ` bugzilla-daemon
@ 2020-11-09 10:59 ` bugzilla-daemon
  2020-11-12 10:02 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-09 10:59 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.3-arch1-1               |5.9.6-arch1-1

--- Comment #4 from Frantisek Sumsal (frantisek@sumsal.cz) ---
Results with kernel 5.9.6:

[    4.353614] PCI: Using configuration type 1 for extended access
[    4.361708] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    4.363625] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[   64.373614] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   64.376918] rcu:     3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=6000 last_accelerate: 0000/e77e dyntick_enabled: 0
[   64.376918]  (detected by 0, t=18002 jiffies, g=-1123, q=62)
[   64.376918] Sending NMI from CPU 0 to CPUs 3:
[  244.390281] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  244.393584] rcu:     3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=24002 last_accelerate: 0000/ba73 dyntick_enabled: 0
[  244.393584]  (detected by 0, t=72007 jiffies, g=-1123, q=62)
[  244.393584] Sending NMI from CPU 0 to CPUs 3:
[  424.406947] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  424.410251] rcu:     3-...0: (0 ticks this GP) idle=95a/1/0x4000000000000000
softirq=18/18 fqs=42004 last_accelerate: 0000/8d68 dyntick_enabled: 0
[  424.410251]  (detected by 0, t=126012 jiffies, g=-1123, q=62)
[  424.410251] Sending NMI from CPU 0 to CPUs 3:
qemu-system-x86_64: terminating on signal 15 from pid 31982 (timeout)
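
The `t=` values in the stall reports are in jiffies, so converting them to seconds requires the guest's tick rate (HZ), which the log does not state. It can be estimated from two consecutive reports by dividing the jiffies delta by the timestamp delta. A small sketch of that arithmetic, using values copied from the log above (the function names are hypothetical):

```python
def infer_hz(jiffies_a: int, stamp_a: float, jiffies_b: int, stamp_b: float) -> float:
    """Estimate HZ from two RCU stall reports: jiffies delta / seconds delta."""
    return (jiffies_b - jiffies_a) / (stamp_b - stamp_a)


def jiffies_to_seconds(jiffies: int, hz: int) -> float:
    """Convert a jiffies count to seconds for a given tick rate."""
    return jiffies / hz


# (t in jiffies, dmesg timestamp in seconds) for the three reports in the log.
REPORTS = [(18002, 64.373614), (72007, 244.390281), (126012, 424.406947)]

if __name__ == "__main__":
    (j0, s0), (j1, s1) = REPORTS[0], REPORTS[1]
    hz = round(infer_hz(j0, s0, j1, s1))
    print(f"estimated HZ: {hz}")
    for jiffies, stamp in REPORTS:
        stalled = jiffies_to_seconds(jiffies, hz)
        # Subtracting the stall length from the timestamp approximates when
        # the grace period started.
        print(f"t={jiffies}: stalled ~{stalled:.1f}s, started near {stamp - stalled:.2f}s")
```

With the estimated tick rate, all three reports point back to roughly the same start time, shortly after the last HugeTLB message, which suggests CPU 3 never made progress after that point.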


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (5 preceding siblings ...)
  2020-11-09 10:59 ` bugzilla-daemon
@ 2020-11-12 10:02 ` bugzilla-daemon
  2020-11-21 16:19 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-12 10:02 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.6-arch1-1               |5.9.8-arch1-1


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (6 preceding siblings ...)
  2020-11-12 10:02 ` bugzilla-daemon
@ 2020-11-21 16:19 ` bugzilla-daemon
  2020-11-27 10:21 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-21 16:19 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.8-arch1-1               |5.9.9-arch1-1


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (7 preceding siblings ...)
  2020-11-21 16:19 ` bugzilla-daemon
@ 2020-11-27 10:21 ` bugzilla-daemon
  2020-12-01  8:39 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-11-27 10:21 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.9-arch1-1               |5.9.10-arch1-1

--- Comment #5 from Frantisek Sumsal (frantisek@sumsal.cz) ---
I noticed there's an MSR access error when trying to online secondary CPUs,
which may be relevant:

[    3.969876] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
[    3.973256] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user
pointer sanitization
[    3.976544] Spectre V2 : Mitigation: Full AMD retpoline
[    3.979874] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on
context switch
[    3.983210] Spectre V2 : mitigation: Enabling conditional Indirect Branch
Prediction Barrier
[    3.986544] Speculative Store Bypass: Mitigation: Speculative Store Bypass
disabled via prctl and seccomp
[    3.990704] Freeing SMP alternatives memory: 32K
[    3.997866] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model:
0x2, stepping: 0x0)
[    4.001938] Performance Events: Fam15h core perfctr, AMD PMU driver.
[    4.003261] ... version:                0
[    4.006576] ... bit width:              48
[    4.009900] ... generic registers:      6
[    4.013234] ... value mask:             0000ffffffffffff
[    4.016567] ... max period:             00007fffffffffff
[    4.019900] ... fixed-purpose events:   0
[    4.023233] ... event mask:             000000000000003f
[    4.026887] rcu: Hierarchical SRCU implementation.
[    4.030952] smp: Bringing up secondary CPUs ...
[    4.034030] x86: Booting SMP configuration:

[    4.036581] .... node  #0, CPUs:      #1
[    1.328014] kvm-clock: cpu 1, msr 8801041, secondary cpu clock
[    1.328014] smpboot: CPU 1 Converting physical 0 to logical die 1
[    1.328014] unchecked MSR access error: WRMSR to 0x48 (tried to write
0x0000000000000000) at rIP: 0xffffffff9da6c984 (native_write_msr+0x4/0x20)
[    1.328014] Call Trace:
[    1.328014]  x86_spec_ctrl_setup_ap+0x34/0x50
[    1.328014]  identify_secondary_cpu+0x6c/0x80
[    1.328014]  smp_store_cpu_info+0x45/0x50
[    1.328014]  start_secondary+0x58/0x160
[    1.328014]  secondary_startup_64+0xb6/0xc0
[    6.088346] kvm-guest: stealtime: cpu 1, msr 1e66e080
[    6.094247]  #2
[    1.328014] kvm-clock: cpu 2, msr 8801081, secondary cpu clock
[    1.328014] smpboot: CPU 2 Converting physical 0 to logical die 2
[    6.123987] kvm-guest: stealtime: cpu 2, msr 1e6ae080
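
MSR 0x48 is the architectural index of IA32_SPEC_CTRL (speculation control), which is consistent with `x86_spec_ctrl_setup_ap` appearing in the call trace. To pick such errors out of a longer boot log, something like the following sketch could be used; the lookup table is deliberately tiny, and the regular expression and function name are assumptions based on the message format shown above.

```python
import re

# Only the two speculation-control MSRs are listed here; extend as needed.
KNOWN_MSRS = {
    0x48: "IA32_SPEC_CTRL",
    0x49: "IA32_PRED_CMD",
}

# \s+ tolerates the line wrapping that mail clients add to long log lines.
MSR_ERR_RE = re.compile(
    r"unchecked MSR access error:\s+(?P<op>WRMSR|RDMSR)\s+(?:to|from)\s+"
    r"(?P<msr>0x[0-9a-fA-F]+)"
    r"(?:\s+\(tried to write\s+(?P<value>0x[0-9a-fA-F]+)\))?"
)


def msr_errors(log: str) -> list:
    """Return one dict per unchecked MSR access error found in the log."""
    found = []
    for m in MSR_ERR_RE.finditer(log):
        index = int(m["msr"], 16)
        found.append({
            "op": m["op"],
            "msr": index,
            "name": KNOWN_MSRS.get(index, "unknown"),
            # RDMSR errors carry no value, so this stays None for them.
            "value": int(m["value"], 16) if m["value"] else None,
        })
    return found


SAMPLE = (
    "[    1.328014] unchecked MSR access error: WRMSR to 0x48 (tried to write\n"
    "0x0000000000000000) at rIP: 0xffffffff9da6c984 (native_write_msr+0x4/0x20)\n"
)

if __name__ == "__main__":
    for err in msr_errors(SAMPLE):
        print(f"{err['op']} {err['msr']:#x} ({err['name']}) value={err['value']}")
```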


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (8 preceding siblings ...)
  2020-11-27 10:21 ` bugzilla-daemon
@ 2020-12-01  8:39 ` bugzilla-daemon
  2020-12-04 11:57 ` bugzilla-daemon
  2020-12-23 12:31 ` bugzilla-daemon
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-12-01  8:39 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

taz.007@zoho.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |taz.007@zoho.com

--- Comment #6 from taz.007@zoho.com ---
I've got a similar stack trace, but in a completely different context: a small
HTPC box, no KVM involved, Intel CPU, 32-bit.

déc 01 05:01:20 Aspire kernel: watchdog: BUG: soft lockup - CPU#2 stuck for
22s! [sshd:20874]
déc 01 05:01:20 Aspire kernel: Modules linked in: mptcp_diag tcp_diag udp_diag
raw_diag inet_diag rpcsec_gss_krb5 md4 cmac nls_utf8 cifs libdes dns_resolver
fscache fuse hwmon_vid nouveau ath5k snd_hda_codec_hdmi ath mxm_wmi ttm
snd_hda_codec_realtek mac80211 snd_hda_codec_generic drm_kms_helper
ledtrig_audio snd_hda_intel cfg80211 snd_intel_dspcfg mousedev cec input_leds
snd_hda_codec rc_core snd_hda_core syscopyarea rfkill hid_generic sysfillrect
snd_hwdep wmi_bmof libarc4 snd_pcm sysimgblt snd_timer fb_sys_fops coretemp
usbhid uas hid pcspkr i2c_algo_bit usb_storage snd nv_tco soundcore forcedeth
i2c_nforce2 wmi evdev nfsd auth_rpcgss tcp_bbr nfs_acl lockd grace sunrpc sg
drm nfs_ssc agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
ohci_pci ehci_pci ehci_hcd ohci_hcd
déc 01 05:01:20 Aspire kernel: CPU: 2 PID: 20874 Comm: sshd Tainted: G      D  
        5.9.9-arch1-1 #1
déc 01 05:01:20 Aspire kernel: Hardware name: Acer Aspire R3610/FMCP7A-ION-LE,
BIOS P01-A4 11/03/2009
déc 01 05:01:20 Aspire kernel: EIP: queued_spin_lock_slowpath+0x42/0x200
déc 01 05:01:20 Aspire kernel: Code: 8b 01 0f b6 d2 c1 e2 08 30 e4 09 d0 a9 00
01 ff ff 0f 85 21 01 00 00 85 c0 74 15 8b 01 84 c0 74 0f 8d b4 26 00 00 00 00
f3 90 <8b> 01 84 c0 75 f8 b8 01 00 00 00 66 89 01 64 ff 05 c0 1e bd c6 c3
déc 01 05:01:20 Aspire kernel: EAX: 00000101 EBX: df4f1ea4 ECX: c6bf5e68 EDX:
00000000
déc 01 05:01:20 Aspire kernel: ESI: 00000001 EDI: df4f1e58 EBP: df4f1dcc ESP:
df4f1dc8
déc 01 05:01:20 Aspire kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
EFLAGS: 00000202
déc 01 05:01:20 Aspire kernel: CR0: 80050033 CR2: 0048a107 CR3: 218a8000 CR4:
000006d0
déc 01 05:01:20 Aspire kernel: Call Trace:
déc 01 05:01:20 Aspire kernel:  ? _raw_spin_lock+0x2c/0x30
déc 01 05:01:20 Aspire kernel:  __change_page_attr_set_clr+0x45/0x740
déc 01 05:01:20 Aspire kernel:  ? _vm_unmap_aliases.part.0+0x114/0x130
déc 01 05:01:20 Aspire kernel:  change_page_attr_set_clr+0xd0/0x2a0
déc 01 05:01:20 Aspire kernel:  set_memory_ro+0x1b/0x20
déc 01 05:01:20 Aspire kernel:  bpf_prog_select_runtime+0x16c/0x1b0
déc 01 05:01:20 Aspire kernel:  bpf_migrate_filter+0xe2/0x130
déc 01 05:01:20 Aspire kernel:  bpf_prog_create_from_user+0x147/0x190
déc 01 05:01:20 Aspire kernel:  ? hardlockup_detector_perf_cleanup+0x70/0x70
déc 01 05:01:20 Aspire kernel:  do_seccomp+0x22d/0x9a0
déc 01 05:01:20 Aspire kernel:  ? security_task_prctl+0x38/0x90
déc 01 05:01:20 Aspire kernel:  prctl_set_seccomp+0x27/0x40
déc 01 05:01:20 Aspire kernel:  __ia32_sys_prctl+0x87/0x4f0
déc 01 05:01:20 Aspire kernel:  __do_fast_syscall_32+0x40/0x70
déc 01 05:01:20 Aspire kernel:  do_fast_syscall_32+0x29/0x60
déc 01 05:01:20 Aspire kernel:  do_SYSENTER_32+0x15/0x20
déc 01 05:01:20 Aspire kernel:  entry_SYSENTER_32+0x9f/0xf2
déc 01 05:01:20 Aspire kernel: EIP: 0xb7fd0549
déc 01 05:01:20 Aspire kernel: Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01
10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34
cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
déc 01 05:01:20 Aspire kernel: EAX: ffffffda EBX: 00000016 ECX: 00000002 EDX:
004f4d5c
déc 01 05:01:20 Aspire kernel: ESI: 00000000 EDI: 00000001 EBP: b7b6ae1c ESP:
bffff6ec
déc 01 05:01:20 Aspire kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
EFLAGS: 00000292

kernel 5.9.9-arch1-1

Feel free to tell me if it's unrelated, and I'll open a new bug report about
it.


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (9 preceding siblings ...)
  2020-12-01  8:39 ` bugzilla-daemon
@ 2020-12-04 11:57 ` bugzilla-daemon
  2020-12-23 12:31 ` bugzilla-daemon
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-12-04 11:57 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.9.10-arch1-1              |5.9.11-arch2-1


* [Bug 209867] CPU soft lockup/stall with nested KVM and SMP
  2020-10-26 12:13 [Bug 209867] New: CPU soft lockup/stall with nested KVM and SMP bugzilla-daemon
                   ` (10 preceding siblings ...)
  2020-12-04 11:57 ` bugzilla-daemon
@ 2020-12-23 12:31 ` bugzilla-daemon
  11 siblings, 0 replies; 13+ messages in thread
From: bugzilla-daemon @ 2020-12-23 12:31 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209867

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX

--- Comment #7 from Frantisek Sumsal (frantisek@sumsal.cz) ---
So, it looks like the issue was resolved either by kernel 5.9.12+ (currently on
5.9.14) or by the upgrade of the hypervisors to CentOS 8.3
(4.18.0-240.1.1.el8_3). Unfortunately, I have no way to easily check which one
of them is the real fix here.

As for you, Taz, please open a new bug if you still encounter the issue you
mentioned, so it won't get forgotten.

