linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
@ 2021-11-07 11:59 Colin King (gmail)
  2021-11-08  8:12 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Colin King (gmail) @ 2021-11-07 11:59 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner; +Cc: linux-kernel

On a SMP system in a VM, a deadlock callback error can be reproduced 
with 5.15, tested from head at commit 
d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7

Didn't see this issue on 5.13

How to reproduce:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make -j $(nproc)
sudo ./stress-ng --cpu-online 0 -t 15 --pathological

Tested on a 8 thread virtual machine, 4MB of memory.

[ 2239.378724] smpboot: CPU 6 is now offline
[ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
[ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
[ 2239.401652] ------------[ cut here ]------------
[ 2239.401658] DEAD callback error for CPU6
[ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 
_cpu_down+0x20a/0x3a0
[ 2239.401856] Modules linked in: dm_multipath scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua intel_rapl_msr intel_rapl_common kvm_intel 
snd_hda_codec_generic ledtrig_audio snd_hda_intel kvm snd_intel_dspcfg 
snd_intel_sdw_acpi snd_hda_codec rapl joydev input_leds snd_hda_core 
snd_hwdep snd_pcm snd_timer snd serio_raw soundcore qemu_fw_cfg mac_hid 
sch_fq_codel msr virtio_rng ip_tables x_tables autofs4 btrfs 
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 
raid0 multipath linear qxl hid_generic drm_ttm_helper ttm drm_kms_helper 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_net syscopyarea 
aesni_intel sysfillrect sysimgblt fb_sys_fops cec usbhid crypto_simd 
net_failover rc_core i2c_i801 ahci hid cryptd i2c_smbus psmouse libahci 
drm lpc_ich virtio_blk failover
[ 2239.402631] CPU: 3 PID: 1134 Comm: stress-ng Not tainted 5.15.0+ #1
[ 2239.402649] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.14.0-2 04/01/2014
[ 2239.402654] RIP: 0010:_cpu_down+0x20a/0x3a0
[ 2239.402664] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff 3d 
96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 9a 95 e8 26 a2 f9 
ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f b3 05
[ 2239.402744] RSP: 0018:ffffa30e008b7cc0 EFLAGS: 00010282
[ 2239.402755] RAX: 0000000000000000 RBX: ffff8ccb7bda0660 RCX: 
0000000000000000
[ 2239.402760] RDX: 0000000000000001 RSI: ffffffff959b6099 RDI: 
00000000ffffffff
[ 2239.402766] RBP: ffffa30e008b7d00 R08: 0000000000000000 R09: 
ffffa30e008b7ab0
[ 2239.402771] R10: ffffa30e008b7aa8 R11: ffffffff96356908 R12: 
0000000000000000
[ 2239.402776] R13: 0000000000000006 R14: 0000000000000096 R15: 
00000000ffffffea
[ 2239.402783] FS:  00007f5ee8713740(0000) GS:ffff8ccb7bcc0000(0000) 
knlGS:0000000000000000
[ 2239.402791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2239.402796] CR2: 000055e786065c20 CR3: 000000010622e004 CR4: 
0000000000370ee0
[ 2239.402811] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 2239.402816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[ 2239.402821] Call Trace:
[ 2239.402827]  <TASK>
[ 2239.402938]  cpu_device_down+0x34/0x60
[ 2239.403018]  cpu_subsys_offline+0xe/0x10
[ 2239.403086]  device_offline+0xe6/0x110
[ 2239.403095]  online_store+0x53/0xc0
[ 2239.403103]  dev_attr_store+0x17/0x30
[ 2239.403155]  sysfs_kf_write+0x3e/0x50
[ 2239.403193]  kernfs_fop_write_iter+0x137/0x1c0
[ 2239.403207]  new_sync_write+0x117/0x1a0
[ 2239.403264]  vfs_write+0x211/0x2a0
[ 2239.403276]  ksys_write+0x67/0xe0
[ 2239.403290]  __x64_sys_write+0x19/0x20
[ 2239.403303]  do_syscall_64+0x5c/0xc0
[ 2239.403316]  ? exc_page_fault+0x89/0x180
[ 2239.403323]  ? asm_exc_page_fault+0x8/0x30
[ 2239.403349]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2239.403359] RIP: 0033:0x7f5ee8c1b777
[ 2239.403395] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 2239.403425] RSP: 002b:00007ffe09004648 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
[ 2239.403432] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 
00007f5ee8c1b777
[ 2239.403436] RDX: 0000000000000003 RSI: 00007ffe0900467d RDI: 
0000000000000004
[ 2239.403439] RBP: 00007ffe09004680 R08: 0000000000000000 R09: 
00007ffe09004500
[ 2239.403442] R10: 0000000000000000 R11: 0000000000000246 R12: 
00007ffe0900467d
[ 2239.403454] R13: 0000000000000004 R14: 0000000000000006 R15: 
000055e785de5f08
[ 2239.403463]  </TASK>
[ 2239.403466] ---[ end trace e92cc28a4b580b82 ]---

See bug report https://bugzilla.kernel.org/show_bug.cgi?id=214955

Colin


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-07 11:59 DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0 Colin King (gmail)
@ 2021-11-08  8:12 ` Peter Zijlstra
  2021-11-08 22:25   ` Colin King (gmail)
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-11-08  8:12 UTC (permalink / raw)
  To: Colin King (gmail); +Cc: Thomas Gleixner, linux-kernel

On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
> On a SMP system in a VM, a deadlock callback error can be reproduced with
> 5.15, tested from head at commit d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
> 
> Didn't see this issue on 5.13
> 
> How to reproduce:
> 
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make -j $(nproc)
> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
> 
> Tested on a 8 thread virtual machine, 4MB of memory.
> 
> [ 2239.378724] smpboot: CPU 6 is now offline
> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
> [ 2239.401652] ------------[ cut here ]------------
> [ 2239.401658] DEAD callback error for CPU6
> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
> _cpu_down+0x20a/0x3a0

Can you do the same with whatever magic is required to get the
pr_debug() output from cpuhp_down_callbacks() included?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-08  8:12 ` Peter Zijlstra
@ 2021-11-08 22:25   ` Colin King (gmail)
  2021-11-14 10:32     ` Josef Johansson
  0 siblings, 1 reply; 8+ messages in thread
From: Colin King (gmail) @ 2021-11-08 22:25 UTC (permalink / raw)
  To: Peter Zijlstra, Colin King (gmail); +Cc: Thomas Gleixner, linux-kernel

On 08/11/2021 08:12, Peter Zijlstra wrote:
> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>> On a SMP system in a VM, a deadlock callback error can be reproduced with
>> 5.15, tested from head at commit d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
>>
>> Didn't see this issue on 5.13
>>
>> How to reproduce:
>>
>> git clone https://github.com/ColinIanKing/stress-ng
>> cd stress-ng
>> make -j $(nproc)
>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>
>> Tested on a 8 thread virtual machine, 4MB of memory.
>>
>> [ 2239.378724] smpboot: CPU 6 is now offline
>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>> [ 2239.401652] ------------[ cut here ]------------
>> [ 2239.401658] DEAD callback error for CPU6
>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>> _cpu_down+0x20a/0x3a0
> 
> Can you do the same with whatever magic is required to get the
> pr_debug() output from cpuhp_down_callbacks() included?
>
reproducer is actually very easy:

# echo 0 > /sys/devices/system/cpu/cpu7/online


[   75.258042] smpboot: CPU 7 is now offline
[   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
[   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
[   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
[   75.278828] ------------[ cut here ]------------
[   75.278830] DEAD callback error for CPU7
[   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163 
_cpu_down+0x20a/0x3a0
[   75.278865] Modules linked in: dm_multipath scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua intel_rapl_msr intel_rapl_common snd_hda_codec_generic 
ledtrig_audio kvm_intel snd_hda_intel snd_intel_dspcfg 
snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core rapl input_leds 
snd_hwdep joydev snd_pcm snd_timer snd serio_raw qemu_fw_cfg mac_hid 
soundcore sch_fq_codel msr virtio_rng ip_tables x_tables autofs4 btrfs 
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 
raid0 multipath linear qxl drm_ttm_helper ttm drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops hid_generic crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel cec aesni_intel rc_core crypto_simd 
virtio_net net_failover usbhid i2c_i801 ahci psmouse cryptd hid 
i2c_smbus drm lpc_ich libahci virtio_blk failover
[   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
[   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.14.0-2 04/01/2014
[   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
[   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff 3d 
96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26 a2 f9 
ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f b3 05
[   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
[   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX: 
0000000000000000
[   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI: 
00000000ffffffff
[   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09: 
ffffb12600d03a88
[   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12: 
0000000000000000
[   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15: 
00000000ffffffea
[   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000) 
knlGS:0000000000000000
[   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4: 
0000000000370ee0
[   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[   75.278942] Call Trace:
[   75.278943]  <TASK>
[   75.278950]  cpu_device_down+0x34/0x60
[   75.278953]  cpu_subsys_offline+0xe/0x10
[   75.278956]  device_offline+0xe6/0x110
[   75.278957]  online_store+0x53/0xc0
[   75.278959]  dev_attr_store+0x17/0x30
[   75.278962]  sysfs_kf_write+0x3e/0x50
[   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
[   75.278966]  new_sync_write+0x117/0x1a0
[   75.278970]  vfs_write+0x211/0x2a0
[   75.278978]  ksys_write+0x67/0xe0
[   75.278982]  __x64_sys_write+0x19/0x20
[   75.278984]  do_syscall_64+0x5c/0xc0
[   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
[   75.279002]  ? irqentry_exit+0x33/0x40
[   75.279004]  ? exc_page_fault+0x89/0x180
[   75.279005]  ? asm_exc_page_fault+0x8/0x30
[   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.279010] RIP: 0033:0x7f5b00264777
[   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
[   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
00007f5b00264777
[   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI: 
0000000000000001
[   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09: 
00007f5b00365cf0
[   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12: 
0000000000000002
[   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15: 
00007f5b00366960
[   75.279029]  </TASK>

Colin



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-08 22:25   ` Colin King (gmail)
@ 2021-11-14 10:32     ` Josef Johansson
  2021-11-14 20:00       ` Josef Johansson
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Johansson @ 2021-11-14 10:32 UTC (permalink / raw)
  To: Colin King (gmail), Peter Zijlstra; +Cc: Thomas Gleixner, linux-kernel

On 11/8/21 23:25, Colin King (gmail) wrote:
> On 08/11/2021 08:12, Peter Zijlstra wrote:
>> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>>> On a SMP system in a VM, a deadlock callback error can be reproduced
>>> with
>>> 5.15, tested from head at commit
>>> d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
>>>
>>> Didn't see this issue on 5.13
>>>
>>> How to reproduce:
>>>
>>> git clone https://github.com/ColinIanKing/stress-ng
>>> cd stress-ng
>>> make -j $(nproc)
>>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>>
>>> Tested on a 8 thread virtual machine, 4MB of memory.
>>>
>>> [ 2239.378724] smpboot: CPU 6 is now offline
>>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>>> [ 2239.401652] ------------[ cut here ]------------
>>> [ 2239.401658] DEAD callback error for CPU6
>>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>>> _cpu_down+0x20a/0x3a0
>>
>> Can you do the same with whatever magic is required to get the
>> pr_debug() output from cpuhp_down_callbacks() included?
>>
I don't get this sentence at all. Is there some patch I should add?
> reproducer is actually very easy:
>
> # echo 0 > /sys/devices/system/cpu/cpu7/online
>
>
> [   75.258042] smpboot: CPU 7 is now offline
> [   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
> [   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
> [   75.278828] ------------[ cut here ]------------
> [   75.278830] DEAD callback error for CPU7
> [   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163
> _cpu_down+0x20a/0x3a0
> [   75.278865] Modules linked in: dm_multipath scsi_dh_rdac
> scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
> snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel
> snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core
> rapl input_leds snd_hwdep joydev snd_pcm snd_timer snd serio_raw
> qemu_fw_cfg mac_hid soundcore sch_fq_codel msr virtio_rng ip_tables
> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
> hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec
> aesni_intel rc_core crypto_simd virtio_net net_failover usbhid
> i2c_i801 ahci psmouse cryptd hid i2c_smbus drm lpc_ich libahci
> virtio_blk failover
> [   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
> [   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.14.0-2 04/01/2014
> [   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
> [   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff
> 3d 96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26
> a2 f9 ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f
> b3 05
> [   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
> [   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX:
> 0000000000000000
> [   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI:
> 00000000ffffffff
> [   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09:
> ffffb12600d03a88
> [   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12:
> 0000000000000000
> [   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15:
> 00000000ffffffea
> [   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000)
> knlGS:0000000000000000
> [   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4:
> 0000000000370ee0
> [   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [   75.278942] Call Trace:
> [   75.278943]  <TASK>
> [   75.278950]  cpu_device_down+0x34/0x60
> [   75.278953]  cpu_subsys_offline+0xe/0x10
> [   75.278956]  device_offline+0xe6/0x110
> [   75.278957]  online_store+0x53/0xc0
> [   75.278959]  dev_attr_store+0x17/0x30
> [   75.278962]  sysfs_kf_write+0x3e/0x50
> [   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
> [   75.278966]  new_sync_write+0x117/0x1a0
> [   75.278970]  vfs_write+0x211/0x2a0
> [   75.278978]  ksys_write+0x67/0xe0
> [   75.278982]  __x64_sys_write+0x19/0x20
> [   75.278984]  do_syscall_64+0x5c/0xc0
> [   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
> [   75.279002]  ? irqentry_exit+0x33/0x40
> [   75.279004]  ? exc_page_fault+0x89/0x180
> [   75.279005]  ? asm_exc_page_fault+0x8/0x30
> [   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   75.279010] RIP: 0033:0x7f5b00264777
> [   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
> 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
> 74 24
> [   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> [   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
> 00007f5b00264777
> [   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI:
> 0000000000000001
> [   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09:
> 00007f5b00365cf0
> [   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12:
> 0000000000000002
> [   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15:
> 00007f5b00366960
> [   75.279029]  </TASK>
>
> Colin
>
>
>
>

Here is the same BUG but on Xen. I'm producing the error by suspending
the laptop.
For anyone following along, when you are bisecting this on Xen you need
these two commits:
8aa83e6395ce047a506f0b16edca45f36c1ae7f8 ("x86/setup: Call
early_reserve_memory() earlier")
96f5bd03e1be606987644b71899ea56a8d05f825 ("xen/balloon: fix balloon
kthread freezing")

------------[ cut here ]------------
DEAD callback error for CPU1
WARNING: CPU: 2 PID: 21738 at kernel/cpu.c:1163 _cpu_down+0x143/0x230
Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer nf_tables nfnetlink vfat fat intel_rapl_msr think_lmi firmware_attributes_class wm>
CPU: 2 PID: 21738 Comm: systemd-sleep Tainted: G        W         5.15.0-0.fc32.qubes.x86_64 #1
Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
RIP: e030:_cpu_down+0x143/0x230
Code: 00 81 3b 96 00 00 00 7e a7 e9 75 ff ff ff 8b 03 44 39 f8 7d b2 3d 96 00 00 00 74 3f 89 ee 48 c7 c7 8b 4e 5e 82 e8 92 d3 f9 ff <0f> 0b eb 99 e8 c4 06 50 ff>
RSP: e02b:ffffc9004273fd20 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8881406606a0 RCX: ffff8881406a0a08
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8881406a0a00
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000001d R11: ffffffff834917cc R12: 00000000ffffffea
R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000000e4
FS:  0000787200206b80(0000) GS:ffff888140680000(0000) knlGS:0000000000000000
CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000059b42ea71da8 CR3: 0000000005ca0000 CR4: 0000000000050660
Call Trace:
 <TASK>
 freeze_secondary_cpus.cold+0x3d/0x129
 suspend_enter+0xb7/0x380
 suspend_devices_and_enter+0x165/0x240
 enter_state+0x125/0x176
 pm_suspend.cold+0x20/0x6b
 state_store+0x27/0x50
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x15c/0x1f0
 vfs_write+0x1d2/0x270
 ksys_write+0x67/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7872013292f7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3>
RSP: 002b:00007ffefc6d4718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007872013292f7
RDX: 0000000000000004 RSI: 00007ffefc6d4800 RDI: 0000000000000004
RBP: 00007ffefc6d4800 R08: 00005817137c5ca0 R09: 000000000000000d
R10: 00005817137c1eb0 R11: 0000000000000246 R12: 0000000000000004
R13: 00005817137c12d0 R14: 0000000000000004 R15: 00007872013fb700
 </TASK>
---[ end trace a9ab6f056496ed14 ]---
Error taking CPU1 down: -22

I'm actively bisecting this, ~6 iterations left, will propably have some more in ~6h or so.

# git bisect log
# bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap handling on secondary processors
git bisect start 'HEAD'
# good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include vmalloc.h for vzalloc()
git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
# bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag 'platform-drivers-x86-v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
# good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
# good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
# good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert "drm/amd/display: Fix error in dmesg at boot"
git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
# good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
# good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag 'amd-drm-next-5.16-2021-10-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
# bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756

Regards
Josef


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-14 10:32     ` Josef Johansson
@ 2021-11-14 20:00       ` Josef Johansson
  2021-11-14 22:20         ` Daniel Lezcano
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Johansson @ 2021-11-14 20:00 UTC (permalink / raw)
  To: Colin King (gmail), Peter Zijlstra, Daniel Lezcano, Lukasz Luba
  Cc: Thomas Gleixner, linux-kernel

On 11/14/21 11:32, Josef Johansson wrote:
> On 11/8/21 23:25, Colin King (gmail) wrote:
>> On 08/11/2021 08:12, Peter Zijlstra wrote:
>>> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>>>> On a SMP system in a VM, a deadlock callback error can be reproduced
>>>> with
>>>> 5.15, tested from head at commit
>>>> d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
>>>>
>>>> Didn't see this issue on 5.13
>>>>
>>>> How to reproduce:
>>>>
>>>> git clone https://github.com/ColinIanKing/stress-ng
>>>> cd stress-ng
>>>> make -j $(nproc)
>>>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>>>
>>>> Tested on a 8 thread virtual machine, 4MB of memory.
>>>>
>>>> [ 2239.378724] smpboot: CPU 6 is now offline
>>>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>>>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>>>> [ 2239.401652] ------------[ cut here ]------------
>>>> [ 2239.401658] DEAD callback error for CPU6
>>>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>>>> _cpu_down+0x20a/0x3a0
>>> Can you do the same with whatever magic is required to get the
>>> pr_debug() output from cpuhp_down_callbacks() included?
>>>
> I don't get this sentence at all. Is there some patch I should add?
>> reproducer is actually very easy:
>>
>> # echo 0 > /sys/devices/system/cpu/cpu7/online
>>
>>
>> [   75.258042] smpboot: CPU 7 is now offline
>> [   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
>> [   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
>> [   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
>> [   75.278828] ------------[ cut here ]------------
>> [   75.278830] DEAD callback error for CPU7
>> [   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163
>> _cpu_down+0x20a/0x3a0
>> [   75.278865] Modules linked in: dm_multipath scsi_dh_rdac
>> scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
>> snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel
>> snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core
>> rapl input_leds snd_hwdep joydev snd_pcm snd_timer snd serio_raw
>> qemu_fw_cfg mac_hid soundcore sch_fq_codel msr virtio_rng ip_tables
>> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>> raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper ttm
>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>> hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec
>> aesni_intel rc_core crypto_simd virtio_net net_failover usbhid
>> i2c_i801 ahci psmouse cryptd hid i2c_smbus drm lpc_ich libahci
>> virtio_blk failover
>> [   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
>> [   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>> BIOS 1.14.0-2 04/01/2014
>> [   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
>> [   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff
>> 3d 96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26
>> a2 f9 ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f
>> b3 05
>> [   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
>> [   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX:
>> 0000000000000000
>> [   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI:
>> 00000000ffffffff
>> [   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09:
>> ffffb12600d03a88
>> [   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12:
>> 0000000000000000
>> [   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15:
>> 00000000ffffffea
>> [   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000)
>> knlGS:0000000000000000
>> [   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4:
>> 0000000000370ee0
>> [   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [   75.278942] Call Trace:
>> [   75.278943]  <TASK>
>> [   75.278950]  cpu_device_down+0x34/0x60
>> [   75.278953]  cpu_subsys_offline+0xe/0x10
>> [   75.278956]  device_offline+0xe6/0x110
>> [   75.278957]  online_store+0x53/0xc0
>> [   75.278959]  dev_attr_store+0x17/0x30
>> [   75.278962]  sysfs_kf_write+0x3e/0x50
>> [   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
>> [   75.278966]  new_sync_write+0x117/0x1a0
>> [   75.278970]  vfs_write+0x211/0x2a0
>> [   75.278978]  ksys_write+0x67/0xe0
>> [   75.278982]  __x64_sys_write+0x19/0x20
>> [   75.278984]  do_syscall_64+0x5c/0xc0
>> [   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
>> [   75.279002]  ? irqentry_exit+0x33/0x40
>> [   75.279004]  ? exc_page_fault+0x89/0x180
>> [   75.279005]  ? asm_exc_page_fault+0x8/0x30
>> [   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [   75.279010] RIP: 0033:0x7f5b00264777
>> [   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
>> 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
>> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>> 74 24
>> [   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000001
>> [   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
>> 00007f5b00264777
>> [   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI:
>> 0000000000000001
>> [   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09:
>> 00007f5b00365cf0
>> [   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12:
>> 0000000000000002
>> [   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15:
>> 00007f5b00366960
>> [   75.279029]  </TASK>
>>
>> Colin
>>
>>
>>
>>
> Here is the same BUG but on Xen. I'm producing the error by suspending
> the laptop.
> For anyone following along, when you are bisecting this on Xen you need
> these two commits:
> 8aa83e6395ce047a506f0b16edca45f36c1ae7f8 ("x86/setup: Call
> early_reserve_memory() earlier")
> 96f5bd03e1be606987644b71899ea56a8d05f825 ("xen/balloon: fix balloon
> kthread freezing")
>
> ------------[ cut here ]------------
> DEAD callback error for CPU1
> WARNING: CPU: 2 PID: 21738 at kernel/cpu.c:1163 _cpu_down+0x143/0x230
> Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer nf_tables nfnetlink vfat fat intel_rapl_msr think_lmi firmware_attributes_class wm>
> CPU: 2 PID: 21738 Comm: systemd-sleep Tainted: G        W         5.15.0-0.fc32.qubes.x86_64 #1
> Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
> RIP: e030:_cpu_down+0x143/0x230
> Code: 00 81 3b 96 00 00 00 7e a7 e9 75 ff ff ff 8b 03 44 39 f8 7d b2 3d 96 00 00 00 74 3f 89 ee 48 c7 c7 8b 4e 5e 82 e8 92 d3 f9 ff <0f> 0b eb 99 e8 c4 06 50 ff>
> RSP: e02b:ffffc9004273fd20 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffff8881406606a0 RCX: ffff8881406a0a08
> RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8881406a0a00
> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> R10: 000000000000001d R11: ffffffff834917cc R12: 00000000ffffffea
> R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000000e4
> FS:  0000787200206b80(0000) GS:ffff888140680000(0000) knlGS:0000000000000000
> CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000059b42ea71da8 CR3: 0000000005ca0000 CR4: 0000000000050660
> Call Trace:
>  <TASK>
>  freeze_secondary_cpus.cold+0x3d/0x129
>  suspend_enter+0xb7/0x380
>  suspend_devices_and_enter+0x165/0x240
>  enter_state+0x125/0x176
>  pm_suspend.cold+0x20/0x6b
>  state_store+0x27/0x50
>  kernfs_fop_write_iter+0x124/0x1b0
>  new_sync_write+0x15c/0x1f0
>  vfs_write+0x1d2/0x270
>  ksys_write+0x67/0xe0
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7872013292f7
> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3>
> RSP: 002b:00007ffefc6d4718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007872013292f7
> RDX: 0000000000000004 RSI: 00007ffefc6d4800 RDI: 0000000000000004
> RBP: 00007ffefc6d4800 R08: 00005817137c5ca0 R09: 000000000000000d
> R10: 00005817137c1eb0 R11: 0000000000000246 R12: 0000000000000004
> R13: 00005817137c12d0 R14: 0000000000000004 R15: 00007872013fb700
>  </TASK>
> ---[ end trace a9ab6f056496ed14 ]---
> Error taking CPU1 down: -22
>
> I'm actively bisecting this, ~6 iterations left, will propably have some more in ~6h or so.
>
> # git bisect log
> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap handling on secondary processors
> git bisect start 'HEAD'
> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include vmalloc.h for vzalloc()
> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag 'platform-drivers-x86-v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert "drm/amd/display: Fix error in dmesg at boot"
> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag 'amd-drm-next-5.16-2021-10-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>
> Regards
> Josef
>

Here's the bad commit
4570ddda43387e5a130dd85e71a1947b0c11da77 ("powercap/drivers/dtpm:
Encapsulate even more the code")

Added CC: Lukasz Luba <lukasz.luba@arm.com> , Daniel Lezcano
<daniel.lezcano@linaro.org>

Here's the bisect log:
# bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap
handling on secondary processors
git bisect start 'HEAD'
# good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include
vmalloc.h for vzalloc()
git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
# bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag
'platform-drivers-x86-v5.16-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
# good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid
double accounting for pure zerocopy skbs"
git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
# good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag
'trace-v5.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
# good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert
"drm/amd/display: Fix error in dmesg at boot"
git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
# good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag
'afs-next-20211102' of
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
# good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag
'amd-drm-next-5.16-2021-10-29' of
https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
# bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag
'pm-5.16-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
# good: [33fb42636a938be01d951b4cee68127a59a1e7e4] Merge branch
'ucount-fixes-for-v5.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect good 33fb42636a938be01d951b4cee68127a59a1e7e4
# good: [c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb] Merge tag
'acpi-5.16-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb
# good: [b62b306469b36fae7030c0ad4ffa11de0c9b9957] Merge branch 'pm-sleep'
git bisect good b62b306469b36fae7030c0ad4ffa11de0c9b9957
# good: [8e5b4779f6c50d118b3056bbb07c886b9ff8d112] Merge branches
'pm-cpufreq' and 'pm-cpuidle'
git bisect good 8e5b4779f6c50d118b3056bbb07c886b9ff8d112
# good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
inefficiencies in CPUFreq
git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
# good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
inefficiencies in CPUFreq
git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
# bad: [eb82bace893169b319c563b7f813c58a0a5a9f76] powercap/drivers/dtpm:
Scale the power with the load
git bisect bad eb82bace893169b319c563b7f813c58a0a5a9f76
# bad: [7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c] powercap/drivers/dtpm:
Simplify the dtpm table
git bisect bad 7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-14 20:00       ` Josef Johansson
@ 2021-11-14 22:20         ` Daniel Lezcano
  2021-11-15  8:30           ` Josef Johansson
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Lezcano @ 2021-11-14 22:20 UTC (permalink / raw)
  To: Josef Johansson, Colin King (gmail), Peter Zijlstra, Lukasz Luba
  Cc: Thomas Gleixner, linux-kernel

On 14/11/2021 21:00, Josef Johansson wrote:
> On 11/14/21 11:32, Josef Johansson wrote:
>> On 11/8/21 23:25, Colin King (gmail) wrote:
>>> On 08/11/2021 08:12, Peter Zijlstra wrote:
>>>> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>>>>> On a SMP system in a VM, a deadlock callback error can be reproduced
>>>>> with
>>>>> 5.15, tested from head at commit
>>>>> d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7

I think it is fixed

https://lkml.org/lkml/2021/11/8/44

>>>>>
>>>>> Didn't see this issue on 5.13
>>>>>
>>>>> How to reproduce:
>>>>>
>>>>> git clone https://github.com/ColinIanKing/stress-ng
>>>>> cd stress-ng
>>>>> make -j $(nproc)
>>>>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>>>>
>>>>> Tested on a 8 thread virtual machine, 4MB of memory.
>>>>>
>>>>> [ 2239.378724] smpboot: CPU 6 is now offline
>>>>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>>>>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>>>>> [ 2239.401652] ------------[ cut here ]------------
>>>>> [ 2239.401658] DEAD callback error for CPU6
>>>>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>>>>> _cpu_down+0x20a/0x3a0
>>>> Can you do the same with whatever magic is required to get the
>>>> pr_debug() output from cpuhp_down_callbacks() included?
>>>>
>> I don't get this sentence at all. Is there some patch I should add?
>>> reproducer is actually very easy:
>>>
>>> # echo 0 > /sys/devices/system/cpu/cpu7/online
>>>
>>>
>>> [   75.258042] smpboot: CPU 7 is now offline
>>> [   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
>>> [   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
>>> [   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
>>> [   75.278828] ------------[ cut here ]------------
>>> [   75.278830] DEAD callback error for CPU7
>>> [   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163
>>> _cpu_down+0x20a/0x3a0
>>> [   75.278865] Modules linked in: dm_multipath scsi_dh_rdac
>>> scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
>>> snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel
>>> snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core
>>> rapl input_leds snd_hwdep joydev snd_pcm snd_timer snd serio_raw
>>> qemu_fw_cfg mac_hid soundcore sch_fq_codel msr virtio_rng ip_tables
>>> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
>>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>> raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper ttm
>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>> hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec
>>> aesni_intel rc_core crypto_simd virtio_net net_failover usbhid
>>> i2c_i801 ahci psmouse cryptd hid i2c_smbus drm lpc_ich libahci
>>> virtio_blk failover
>>> [   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
>>> [   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>>> BIOS 1.14.0-2 04/01/2014
>>> [   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
>>> [   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff
>>> 3d 96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26
>>> a2 f9 ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f
>>> b3 05
>>> [   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
>>> [   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX:
>>> 0000000000000000
>>> [   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI:
>>> 00000000ffffffff
>>> [   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09:
>>> ffffb12600d03a88
>>> [   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12:
>>> 0000000000000000
>>> [   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15:
>>> 00000000ffffffea
>>> [   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000)
>>> knlGS:0000000000000000
>>> [   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4:
>>> 0000000000370ee0
>>> [   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400
>>> [   75.278942] Call Trace:
>>> [   75.278943]  <TASK>
>>> [   75.278950]  cpu_device_down+0x34/0x60
>>> [   75.278953]  cpu_subsys_offline+0xe/0x10
>>> [   75.278956]  device_offline+0xe6/0x110
>>> [   75.278957]  online_store+0x53/0xc0
>>> [   75.278959]  dev_attr_store+0x17/0x30
>>> [   75.278962]  sysfs_kf_write+0x3e/0x50
>>> [   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
>>> [   75.278966]  new_sync_write+0x117/0x1a0
>>> [   75.278970]  vfs_write+0x211/0x2a0
>>> [   75.278978]  ksys_write+0x67/0xe0
>>> [   75.278982]  __x64_sys_write+0x19/0x20
>>> [   75.278984]  do_syscall_64+0x5c/0xc0
>>> [   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
>>> [   75.279002]  ? irqentry_exit+0x33/0x40
>>> [   75.279004]  ? exc_page_fault+0x89/0x180
>>> [   75.279005]  ? asm_exc_page_fault+0x8/0x30
>>> [   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   75.279010] RIP: 0033:0x7f5b00264777
>>> [   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
>>> 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
>>> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>>> 74 24
>>> [   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000001
>>> [   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
>>> 00007f5b00264777
>>> [   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI:
>>> 0000000000000001
>>> [   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09:
>>> 00007f5b00365cf0
>>> [   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12:
>>> 0000000000000002
>>> [   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15:
>>> 00007f5b00366960
>>> [   75.279029]  </TASK>
>>>
>>> Colin
>>>
>>>
>>>
>>>
>> Here is the same BUG but on Xen. I'm producing the error by suspending
>> the laptop.
>> For anyone following along, when you are bisecting this on Xen you need
>> these two commits:
>> 8aa83e6395ce047a506f0b16edca45f36c1ae7f8 ("x86/setup: Call
>> early_reserve_memory() earlier")
>> 96f5bd03e1be606987644b71899ea56a8d05f825 ("xen/balloon: fix balloon
>> kthread freezing")
>>
>> ------------[ cut here ]------------
>> DEAD callback error for CPU1
>> WARNING: CPU: 2 PID: 21738 at kernel/cpu.c:1163 _cpu_down+0x143/0x230
>> Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer nf_tables nfnetlink vfat fat intel_rapl_msr think_lmi firmware_attributes_class wm>
>> CPU: 2 PID: 21738 Comm: systemd-sleep Tainted: G        W         5.15.0-0.fc32.qubes.x86_64 #1
>> Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
>> RIP: e030:_cpu_down+0x143/0x230
>> Code: 00 81 3b 96 00 00 00 7e a7 e9 75 ff ff ff 8b 03 44 39 f8 7d b2 3d 96 00 00 00 74 3f 89 ee 48 c7 c7 8b 4e 5e 82 e8 92 d3 f9 ff <0f> 0b eb 99 e8 c4 06 50 ff>
>> RSP: e02b:ffffc9004273fd20 EFLAGS: 00010282
>> RAX: 0000000000000000 RBX: ffff8881406606a0 RCX: ffff8881406a0a08
>> RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8881406a0a00
>> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
>> R10: 000000000000001d R11: ffffffff834917cc R12: 00000000ffffffea
>> R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000000e4
>> FS:  0000787200206b80(0000) GS:ffff888140680000(0000) knlGS:0000000000000000
>> CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 000059b42ea71da8 CR3: 0000000005ca0000 CR4: 0000000000050660
>> Call Trace:
>>  <TASK>
>>  freeze_secondary_cpus.cold+0x3d/0x129
>>  suspend_enter+0xb7/0x380
>>  suspend_devices_and_enter+0x165/0x240
>>  enter_state+0x125/0x176
>>  pm_suspend.cold+0x20/0x6b
>>  state_store+0x27/0x50
>>  kernfs_fop_write_iter+0x124/0x1b0
>>  new_sync_write+0x15c/0x1f0
>>  vfs_write+0x1d2/0x270
>>  ksys_write+0x67/0xe0
>>  do_syscall_64+0x3b/0x90
>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> RIP: 0033:0x7872013292f7
>> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3>
>> RSP: 002b:00007ffefc6d4718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007872013292f7
>> RDX: 0000000000000004 RSI: 00007ffefc6d4800 RDI: 0000000000000004
>> RBP: 00007ffefc6d4800 R08: 00005817137c5ca0 R09: 000000000000000d
>> R10: 00005817137c1eb0 R11: 0000000000000246 R12: 0000000000000004
>> R13: 00005817137c12d0 R14: 0000000000000004 R15: 00007872013fb700
>>  </TASK>
>> ---[ end trace a9ab6f056496ed14 ]---
>> Error taking CPU1 down: -22
>>
>> I'm actively bisecting this, ~6 iterations left, will propably have some more in ~6h or so.
>>
>> # git bisect log
>> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap handling on secondary processors
>> git bisect start 'HEAD'
>> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include vmalloc.h for vzalloc()
>> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
>> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag 'platform-drivers-x86-v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
>> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
>> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
>> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
>> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
>> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert "drm/amd/display: Fix error in dmesg at boot"
>> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
>> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
>> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
>> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag 'amd-drm-next-5.16-2021-10-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
>> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>>
>> Regards
>> Josef
>>
> 
> Here's the bad commit
> 4570ddda43387e5a130dd85e71a1947b0c11da77 ("powercap/drivers/dtpm:
> Encapsulate even more the code")
> 
> Added CC: Lukasz Luba <lukasz.luba@arm.com> , Daniel Lezcano
> <daniel.lezcano@linaro.org>
> 
> Here's the bisect log:
> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap
> handling on secondary processors
> git bisect start 'HEAD'
> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include
> vmalloc.h for vzalloc()
> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag
> 'platform-drivers-x86-v5.16-1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid
> double accounting for pure zerocopy skbs"
> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag
> 'trace-v5.16' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert
> "drm/amd/display: Fix error in dmesg at boot"
> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag
> 'afs-next-20211102' of
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag
> 'amd-drm-next-5.16-2021-10-29' of
> https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag
> 'pm-5.16-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
> # good: [33fb42636a938be01d951b4cee68127a59a1e7e4] Merge branch
> 'ucount-fixes-for-v5.16' of
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
> git bisect good 33fb42636a938be01d951b4cee68127a59a1e7e4
> # good: [c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb] Merge tag
> 'acpi-5.16-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> git bisect good c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb
> # good: [b62b306469b36fae7030c0ad4ffa11de0c9b9957] Merge branch 'pm-sleep'
> git bisect good b62b306469b36fae7030c0ad4ffa11de0c9b9957
> # good: [8e5b4779f6c50d118b3056bbb07c886b9ff8d112] Merge branches
> 'pm-cpufreq' and 'pm-cpuidle'
> git bisect good 8e5b4779f6c50d118b3056bbb07c886b9ff8d112
> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
> inefficiencies in CPUFreq
> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
> inefficiencies in CPUFreq
> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
> # bad: [eb82bace893169b319c563b7f813c58a0a5a9f76] powercap/drivers/dtpm:
> Scale the power with the load
> git bisect bad eb82bace893169b319c563b7f813c58a0a5a9f76
> # bad: [7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c] powercap/drivers/dtpm:
> Simplify the dtpm table
> git bisect bad 7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-14 22:20         ` Daniel Lezcano
@ 2021-11-15  8:30           ` Josef Johansson
  2021-11-15 12:02             ` Colin King (gmail)
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Johansson @ 2021-11-15  8:30 UTC (permalink / raw)
  To: Daniel Lezcano, Colin King (gmail), Peter Zijlstra, Lukasz Luba
  Cc: Thomas Gleixner, linux-kernel

On 11/14/21 23:20, Daniel Lezcano wrote:
> On 14/11/2021 21:00, Josef Johansson wrote:
>> On 11/14/21 11:32, Josef Johansson wrote:
>>> On 11/8/21 23:25, Colin King (gmail) wrote:
>>>> On 08/11/2021 08:12, Peter Zijlstra wrote:
>>>>> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>>>>>> On a SMP system in a VM, a deadlock callback error can be reproduced
>>>>>> with
>>>>>> 5.15, tested from head at commit
>>>>>> d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
> I think it is fixed
>
> https://lkml.org/lkml/2021/11/8/44
I tested this out and it works.
You can add Tested-By: josef@oderland.se if you'd like.

>>>>>> Didn't see this issue on 5.13
>>>>>>
>>>>>> How to reproduce:
>>>>>>
>>>>>> git clone https://github.com/ColinIanKing/stress-ng
>>>>>> cd stress-ng
>>>>>> make -j $(nproc)
>>>>>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>>>>>
>>>>>> Tested on a 8 thread virtual machine, 4MB of memory.
>>>>>>
>>>>>> [ 2239.378724] smpboot: CPU 6 is now offline
>>>>>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>>>>>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>>>>>> [ 2239.401652] ------------[ cut here ]------------
>>>>>> [ 2239.401658] DEAD callback error for CPU6
>>>>>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>>>>>> _cpu_down+0x20a/0x3a0
>>>>> Can you do the same with whatever magic is required to get the
>>>>> pr_debug() output from cpuhp_down_callbacks() included?
>>>>>
>>> I don't get this sentence at all. Is there some patch I should add?
>>>> reproducer is actually very easy:
>>>>
>>>> # echo 0 > /sys/devices/system/cpu/cpu7/online
>>>>
>>>>
>>>> [   75.258042] smpboot: CPU 7 is now offline
>>>> [   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
>>>> [   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
>>>> [   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
>>>> [   75.278828] ------------[ cut here ]------------
>>>> [   75.278830] DEAD callback error for CPU7
>>>> [   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163
>>>> _cpu_down+0x20a/0x3a0
>>>> [   75.278865] Modules linked in: dm_multipath scsi_dh_rdac
>>>> scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
>>>> snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel
>>>> snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core
>>>> rapl input_leds snd_hwdep joydev snd_pcm snd_timer snd serio_raw
>>>> qemu_fw_cfg mac_hid soundcore sch_fq_codel msr virtio_rng ip_tables
>>>> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
>>>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>> raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper ttm
>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>> hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec
>>>> aesni_intel rc_core crypto_simd virtio_net net_failover usbhid
>>>> i2c_i801 ahci psmouse cryptd hid i2c_smbus drm lpc_ich libahci
>>>> virtio_blk failover
>>>> [   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
>>>> [   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>>>> BIOS 1.14.0-2 04/01/2014
>>>> [   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
>>>> [   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff
>>>> 3d 96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26
>>>> a2 f9 ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f
>>>> b3 05
>>>> [   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
>>>> [   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX:
>>>> 0000000000000000
>>>> [   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI:
>>>> 00000000ffffffff
>>>> [   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09:
>>>> ffffb12600d03a88
>>>> [   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12:
>>>> 0000000000000000
>>>> [   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15:
>>>> 00000000ffffffea
>>>> [   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000)
>>>> knlGS:0000000000000000
>>>> [   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4:
>>>> 0000000000370ee0
>>>> [   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> [   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> [   75.278942] Call Trace:
>>>> [   75.278943]  <TASK>
>>>> [   75.278950]  cpu_device_down+0x34/0x60
>>>> [   75.278953]  cpu_subsys_offline+0xe/0x10
>>>> [   75.278956]  device_offline+0xe6/0x110
>>>> [   75.278957]  online_store+0x53/0xc0
>>>> [   75.278959]  dev_attr_store+0x17/0x30
>>>> [   75.278962]  sysfs_kf_write+0x3e/0x50
>>>> [   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
>>>> [   75.278966]  new_sync_write+0x117/0x1a0
>>>> [   75.278970]  vfs_write+0x211/0x2a0
>>>> [   75.278978]  ksys_write+0x67/0xe0
>>>> [   75.278982]  __x64_sys_write+0x19/0x20
>>>> [   75.278984]  do_syscall_64+0x5c/0xc0
>>>> [   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
>>>> [   75.279002]  ? irqentry_exit+0x33/0x40
>>>> [   75.279004]  ? exc_page_fault+0x89/0x180
>>>> [   75.279005]  ? asm_exc_page_fault+0x8/0x30
>>>> [   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [   75.279010] RIP: 0033:0x7f5b00264777
>>>> [   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
>>>> 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
>>>> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>>>> 74 24
>>>> [   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX:
>>>> 0000000000000001
>>>> [   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
>>>> 00007f5b00264777
>>>> [   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI:
>>>> 0000000000000001
>>>> [   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09:
>>>> 00007f5b00365cf0
>>>> [   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12:
>>>> 0000000000000002
>>>> [   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15:
>>>> 00007f5b00366960
>>>> [   75.279029]  </TASK>
>>>>
>>>> Colin
>>>>
>>>>
>>>>
>>>>
>>> Here is the same BUG but on Xen. I'm producing the error by suspending
>>> the laptop.
>>> For anyone following along, when you are bisecting this on Xen you need
>>> these two commits:
>>> 8aa83e6395ce047a506f0b16edca45f36c1ae7f8 ("x86/setup: Call
>>> early_reserve_memory() earlier")
>>> 96f5bd03e1be606987644b71899ea56a8d05f825 ("xen/balloon: fix balloon
>>> kthread freezing")
>>>
>>> ------------[ cut here ]------------
>>> DEAD callback error for CPU1
>>> WARNING: CPU: 2 PID: 21738 at kernel/cpu.c:1163 _cpu_down+0x143/0x230
>>> Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer nf_tables nfnetlink vfat fat intel_rapl_msr think_lmi firmware_attributes_class wm>
>>> CPU: 2 PID: 21738 Comm: systemd-sleep Tainted: G        W         5.15.0-0.fc32.qubes.x86_64 #1
>>> Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
>>> RIP: e030:_cpu_down+0x143/0x230
>>> Code: 00 81 3b 96 00 00 00 7e a7 e9 75 ff ff ff 8b 03 44 39 f8 7d b2 3d 96 00 00 00 74 3f 89 ee 48 c7 c7 8b 4e 5e 82 e8 92 d3 f9 ff <0f> 0b eb 99 e8 c4 06 50 ff>
>>> RSP: e02b:ffffc9004273fd20 EFLAGS: 00010282
>>> RAX: 0000000000000000 RBX: ffff8881406606a0 RCX: ffff8881406a0a08
>>> RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8881406a0a00
>>> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 000000000000001d R11: ffffffff834917cc R12: 00000000ffffffea
>>> R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000000e4
>>> FS:  0000787200206b80(0000) GS:ffff888140680000(0000) knlGS:0000000000000000
>>> CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 000059b42ea71da8 CR3: 0000000005ca0000 CR4: 0000000000050660
>>> Call Trace:
>>>  <TASK>
>>>  freeze_secondary_cpus.cold+0x3d/0x129
>>>  suspend_enter+0xb7/0x380
>>>  suspend_devices_and_enter+0x165/0x240
>>>  enter_state+0x125/0x176
>>>  pm_suspend.cold+0x20/0x6b
>>>  state_store+0x27/0x50
>>>  kernfs_fop_write_iter+0x124/0x1b0
>>>  new_sync_write+0x15c/0x1f0
>>>  vfs_write+0x1d2/0x270
>>>  ksys_write+0x67/0xe0
>>>  do_syscall_64+0x3b/0x90
>>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> RIP: 0033:0x7872013292f7
>>> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3>
>>> RSP: 002b:00007ffefc6d4718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007872013292f7
>>> RDX: 0000000000000004 RSI: 00007ffefc6d4800 RDI: 0000000000000004
>>> RBP: 00007ffefc6d4800 R08: 00005817137c5ca0 R09: 000000000000000d
>>> R10: 00005817137c1eb0 R11: 0000000000000246 R12: 0000000000000004
>>> R13: 00005817137c12d0 R14: 0000000000000004 R15: 00007872013fb700
>>>  </TASK>
>>> ---[ end trace a9ab6f056496ed14 ]---
>>> Error taking CPU1 down: -22
>>>
>>> I'm actively bisecting this, ~6 iterations left, will propably have some more in ~6h or so.
>>>
>>> # git bisect log
>>> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap handling on secondary processors
>>> git bisect start 'HEAD'
>>> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include vmalloc.h for vzalloc()
>>> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
>>> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag 'platform-drivers-x86-v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
>>> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
>>> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
>>> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
>>> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>>> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
>>> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert "drm/amd/display: Fix error in dmesg at boot"
>>> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
>>> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
>>> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
>>> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag 'amd-drm-next-5.16-2021-10-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
>>> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
>>> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>>> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>>>
>>> Regards
>>> Josef
>>>
>> Here's the bad commit
>> 4570ddda43387e5a130dd85e71a1947b0c11da77 ("powercap/drivers/dtpm:
>> Encapsulate even more the code")
>>
>> Added CC: Lukasz Luba <lukasz.luba@arm.com> , Daniel Lezcano
>> <daniel.lezcano@linaro.org>
>>
>> Here's the bisect log:
>> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap
>> handling on secondary processors
>> git bisect start 'HEAD'
>> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include
>> vmalloc.h for vzalloc()
>> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
>> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag
>> 'platform-drivers-x86-v5.16-1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
>> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
>> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid
>> double accounting for pure zerocopy skbs"
>> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
>> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag
>> 'trace-v5.16' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
>> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert
>> "drm/amd/display: Fix error in dmesg at boot"
>> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
>> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag
>> 'afs-next-20211102' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
>> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
>> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag
>> 'amd-drm-next-5.16-2021-10-29' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
>> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag
>> 'pm-5.16-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>> # good: [33fb42636a938be01d951b4cee68127a59a1e7e4] Merge branch
>> 'ucount-fixes-for-v5.16' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
>> git bisect good 33fb42636a938be01d951b4cee68127a59a1e7e4
>> # good: [c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb] Merge tag
>> 'acpi-5.16-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>> git bisect good c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb
>> # good: [b62b306469b36fae7030c0ad4ffa11de0c9b9957] Merge branch 'pm-sleep'
>> git bisect good b62b306469b36fae7030c0ad4ffa11de0c9b9957
>> # good: [8e5b4779f6c50d118b3056bbb07c886b9ff8d112] Merge branches
>> 'pm-cpufreq' and 'pm-cpuidle'
>> git bisect good 8e5b4779f6c50d118b3056bbb07c886b9ff8d112
>> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
>> inefficiencies in CPUFreq
>> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
>> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
>> inefficiencies in CPUFreq
>> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
>> # bad: [eb82bace893169b319c563b7f813c58a0a5a9f76] powercap/drivers/dtpm:
>> Scale the power with the load
>> git bisect bad eb82bace893169b319c563b7f813c58a0a5a9f76
>> # bad: [7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c] powercap/drivers/dtpm:
>> Simplify the dtpm table
>> git bisect bad 7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c
>>
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0
  2021-11-15  8:30           ` Josef Johansson
@ 2021-11-15 12:02             ` Colin King (gmail)
  0 siblings, 0 replies; 8+ messages in thread
From: Colin King (gmail) @ 2021-11-15 12:02 UTC (permalink / raw)
  To: Josef Johansson, Daniel Lezcano, Colin King (gmail),
	Peter Zijlstra, Lukasz Luba
  Cc: Thomas Gleixner, linux-kernel

On 15/11/2021 08:30, Josef Johansson wrote:
> On 11/14/21 23:20, Daniel Lezcano wrote:
>> On 14/11/2021 21:00, Josef Johansson wrote:
>>> On 11/14/21 11:32, Josef Johansson wrote:
>>>> On 11/8/21 23:25, Colin King (gmail) wrote:
>>>>> On 08/11/2021 08:12, Peter Zijlstra wrote:
>>>>>> On Sun, Nov 07, 2021 at 11:59:17AM +0000, Colin King (gmail) wrote:
>>>>>>> On a SMP system in a VM, a deadlock callback error can be reproduced
>>>>>>> with
>>>>>>> 5.15, tested from head at commit
>>>>>>> d4439a1189f93d0ac1eaf0197db8e6b3e197d5c7
>> I think it is fixed
>>
>> https://lkml.org/lkml/2021/11/8/44
> I tested this out and it works.
> You can add Tested-By: josef@oderland.se if you'd like.

Tested this against 5.16-rc1, fix works perfectly with stress-ng 
--cpu-online stressor on a 8 thread VM

Tested-By: Colin Ian King <colin.i.king@gmail.com>

> 
>>>>>>> Didn't see this issue on 5.13
>>>>>>>
>>>>>>> How to reproduce:
>>>>>>>
>>>>>>> git clone https://github.com/ColinIanKing/stress-ng
>>>>>>> cd stress-ng
>>>>>>> make -j $(nproc)
>>>>>>> sudo ./stress-ng --cpu-online 0 -t 15 --pathological
>>>>>>>
>>>>>>> Tested on a 8 thread virtual machine, 4MB of memory.
>>>>>>>
>>>>>>> [ 2239.378724] smpboot: CPU 6 is now offline
>>>>>>> [ 2239.379443] smpboot: Booting Node 0 Processor 6 APIC 0x6
>>>>>>> [ 2239.380169] kvm-clock: cpu 6, msr 79201181, secondary cpu clock
>>>>>>> [ 2239.401652] ------------[ cut here ]------------
>>>>>>> [ 2239.401658] DEAD callback error for CPU6
>>>>>>> [ 2239.401721] WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163
>>>>>>> _cpu_down+0x20a/0x3a0
>>>>>> Can you do the same with whatever magic is required to get the
>>>>>> pr_debug() output from cpuhp_down_callbacks() included?
>>>>>>
>>>> I don't get this sentence at all. Is there some patch I should add?
>>>>> reproducer is actually very easy:
>>>>>
>>>>> # echo 0 > /sys/devices/system/cpu/cpu7/online
>>>>>
>>>>>
>>>>> [   75.258042] smpboot: CPU 7 is now offline
>>>>> [   75.258093] CPU DOWN failed (-22) CPU 7 state padata:dead (39)
>>>>> [   75.258193] smpboot: Booting Node 0 Processor 7 APIC 0x7
>>>>> [   75.258380] kvm-clock: cpu 7, msr 1042011c1, secondary cpu clock
>>>>> [   75.278828] ------------[ cut here ]------------
>>>>> [   75.278830] DEAD callback error for CPU7
>>>>> [   75.278856] WARNING: CPU: 1 PID: 1078 at kernel/cpu.c:1163
>>>>> _cpu_down+0x20a/0x3a0
>>>>> [   75.278865] Modules linked in: dm_multipath scsi_dh_rdac
>>>>> scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
>>>>> snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel
>>>>> snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core
>>>>> rapl input_leds snd_hwdep joydev snd_pcm snd_timer snd serio_raw
>>>>> qemu_fw_cfg mac_hid soundcore sch_fq_codel msr virtio_rng ip_tables
>>>>> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
>>>>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>>>> raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper ttm
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>> hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec
>>>>> aesni_intel rc_core crypto_simd virtio_net net_failover usbhid
>>>>> i2c_i801 ahci psmouse cryptd hid i2c_smbus drm lpc_ich libahci
>>>>> virtio_blk failover
>>>>> [   75.278920] CPU: 1 PID: 1078 Comm: bash Not tainted 5.15.0+ #1
>>>>> [   75.278924] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>>>>> BIOS 1.14.0-2 04/01/2014
>>>>> [   75.278925] RIP: 0010:_cpu_down+0x20a/0x3a0
>>>>> [   75.278927] Code: 8b 03 41 39 c6 7f 6f 39 45 c0 0f 8e 34 ff ff ff
>>>>> 3d 96 00 00 00 0f 84 be 00 00 00 44 89 ee 48 c7 c7 f3 38 7a a4 e8 26
>>>>> a2 f9 ff <0f> 0b e9 13 ff ff ff e8 3a bd 52 ff e9 2a ff ff ff f0 48 0f
>>>>> b3 05
>>>>> [   75.278929] RSP: 0018:ffffb12600d03c98 EFLAGS: 00010282
>>>>> [   75.278930] RAX: 0000000000000000 RBX: ffff8e9d7bde0660 RCX:
>>>>> 0000000000000000
>>>>> [   75.278931] RDX: 0000000000000001 RSI: ffffffffa47b6099 RDI:
>>>>> 00000000ffffffff
>>>>> [   75.278932] RBP: ffffb12600d03cd8 R08: 0000000000000000 R09:
>>>>> ffffb12600d03a88
>>>>> [   75.278933] R10: ffffb12600d03a80 R11: ffffffffa5156908 R12:
>>>>> 0000000000000000
>>>>> [   75.278934] R13: 0000000000000007 R14: 0000000000000096 R15:
>>>>> 00000000ffffffea
>>>>> [   75.278935] FS:  00007f5b0014a740(0000) GS:ffff8e9d7bc40000(0000)
>>>>> knlGS:0000000000000000
>>>>> [   75.278937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [   75.278937] CR2: 000055f4a4e58818 CR3: 0000000115042005 CR4:
>>>>> 0000000000370ee0
>>>>> [   75.278940] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [   75.278941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>> 0000000000000400
>>>>> [   75.278942] Call Trace:
>>>>> [   75.278943]  <TASK>
>>>>> [   75.278950]  cpu_device_down+0x34/0x60
>>>>> [   75.278953]  cpu_subsys_offline+0xe/0x10
>>>>> [   75.278956]  device_offline+0xe6/0x110
>>>>> [   75.278957]  online_store+0x53/0xc0
>>>>> [   75.278959]  dev_attr_store+0x17/0x30
>>>>> [   75.278962]  sysfs_kf_write+0x3e/0x50
>>>>> [   75.278964]  kernfs_fop_write_iter+0x137/0x1c0
>>>>> [   75.278966]  new_sync_write+0x117/0x1a0
>>>>> [   75.278970]  vfs_write+0x211/0x2a0
>>>>> [   75.278978]  ksys_write+0x67/0xe0
>>>>> [   75.278982]  __x64_sys_write+0x19/0x20
>>>>> [   75.278984]  do_syscall_64+0x5c/0xc0
>>>>> [   75.279000]  ? irqentry_exit_to_user_mode+0x9/0x20
>>>>> [   75.279002]  ? irqentry_exit+0x33/0x40
>>>>> [   75.279004]  ? exc_page_fault+0x89/0x180
>>>>> [   75.279005]  ? asm_exc_page_fault+0x8/0x30
>>>>> [   75.279008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [   75.279010] RIP: 0033:0x7f5b00264777
>>>>> [   75.279016] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
>>>>> 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
>>>>> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>>>>> 74 24
>>>>> [   75.279018] RSP: 002b:00007ffed5e83b58 EFLAGS: 00000246 ORIG_RAX:
>>>>> 0000000000000001
>>>>> [   75.279020] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
>>>>> 00007f5b00264777
>>>>> [   75.279021] RDX: 0000000000000002 RSI: 000055d8774e0b10 RDI:
>>>>> 0000000000000001
>>>>> [   75.279022] RBP: 000055d8774e0b10 R08: 0000000000000000 R09:
>>>>> 00007f5b00365cf0
>>>>> [   75.279023] R10: 00007f5b00365cc0 R11: 0000000000000246 R12:
>>>>> 0000000000000002
>>>>> [   75.279024] R13: 00007f5b00366760 R14: 00007f5b00367560 R15:
>>>>> 00007f5b00366960
>>>>> [   75.279029]  </TASK>
>>>>>
>>>>> Colin
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Here is the same BUG but on Xen. I'm producing the error by suspending
>>>> the laptop.
>>>> For anyone following along, when you are bisecting this on Xen you need
>>>> these two commits:
>>>> 8aa83e6395ce047a506f0b16edca45f36c1ae7f8 ("x86/setup: Call
>>>> early_reserve_memory() earlier")
>>>> 96f5bd03e1be606987644b71899ea56a8d05f825 ("xen/balloon: fix balloon
>>>> kthread freezing")
>>>>
>>>> ------------[ cut here ]------------
>>>> DEAD callback error for CPU1
>>>> WARNING: CPU: 2 PID: 21738 at kernel/cpu.c:1163 _cpu_down+0x143/0x230
>>>> Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer nf_tables nfnetlink vfat fat intel_rapl_msr think_lmi firmware_attributes_class wm>
>>>> CPU: 2 PID: 21738 Comm: systemd-sleep Tainted: G        W         5.15.0-0.fc32.qubes.x86_64 #1
>>>> Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
>>>> RIP: e030:_cpu_down+0x143/0x230
>>>> Code: 00 81 3b 96 00 00 00 7e a7 e9 75 ff ff ff 8b 03 44 39 f8 7d b2 3d 96 00 00 00 74 3f 89 ee 48 c7 c7 8b 4e 5e 82 e8 92 d3 f9 ff <0f> 0b eb 99 e8 c4 06 50 ff>
>>>> RSP: e02b:ffffc9004273fd20 EFLAGS: 00010282
>>>> RAX: 0000000000000000 RBX: ffff8881406606a0 RCX: ffff8881406a0a08
>>>> RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8881406a0a00
>>>> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
>>>> R10: 000000000000001d R11: ffffffff834917cc R12: 00000000ffffffea
>>>> R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000000e4
>>>> FS:  0000787200206b80(0000) GS:ffff888140680000(0000) knlGS:0000000000000000
>>>> CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 000059b42ea71da8 CR3: 0000000005ca0000 CR4: 0000000000050660
>>>> Call Trace:
>>>>   <TASK>
>>>>   freeze_secondary_cpus.cold+0x3d/0x129
>>>>   suspend_enter+0xb7/0x380
>>>>   suspend_devices_and_enter+0x165/0x240
>>>>   enter_state+0x125/0x176
>>>>   pm_suspend.cold+0x20/0x6b
>>>>   state_store+0x27/0x50
>>>>   kernfs_fop_write_iter+0x124/0x1b0
>>>>   new_sync_write+0x15c/0x1f0
>>>>   vfs_write+0x1d2/0x270
>>>>   ksys_write+0x67/0xe0
>>>>   do_syscall_64+0x3b/0x90
>>>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> RIP: 0033:0x7872013292f7
>>>> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3>
>>>> RSP: 002b:00007ffefc6d4718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007872013292f7
>>>> RDX: 0000000000000004 RSI: 00007ffefc6d4800 RDI: 0000000000000004
>>>> RBP: 00007ffefc6d4800 R08: 00005817137c5ca0 R09: 000000000000000d
>>>> R10: 00005817137c1eb0 R11: 0000000000000246 R12: 0000000000000004
>>>> R13: 00005817137c12d0 R14: 0000000000000004 R15: 00007872013fb700
>>>>   </TASK>
>>>> ---[ end trace a9ab6f056496ed14 ]---
>>>> Error taking CPU1 down: -22
>>>>
>>>> I'm actively bisecting this, ~6 iterations left, will propably have some more in ~6h or so.
>>>>
>>>> # git bisect log
>>>> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap handling on secondary processors
>>>> git bisect start 'HEAD'
>>>> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include vmalloc.h for vzalloc()
>>>> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
>>>> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag 'platform-drivers-x86-v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
>>>> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
>>>> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
>>>> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
>>>> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>>>> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
>>>> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert "drm/amd/display: Fix error in dmesg at boot"
>>>> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
>>>> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
>>>> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
>>>> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag 'amd-drm-next-5.16-2021-10-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
>>>> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
>>>> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>>>> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>>>>
>>>> Regards
>>>> Josef
>>>>
>>> Here's the bad commit
>>> 4570ddda43387e5a130dd85e71a1947b0c11da77 ("powercap/drivers/dtpm:
>>> Encapsulate even more the code")
>>>
>>> Added CC: Lukasz Luba <lukasz.luba@arm.com> , Daniel Lezcano
>>> <daniel.lezcano@linaro.org>
>>>
>>> Here's the bisect log:
>>> # bad: [0594c58161b6e0f3da8efa9c6e3d4ba52b652717] xen/x86: fix PV trap
>>> handling on secondary processors
>>> git bisect start 'HEAD'
>>> # good: [868c250bb4639531ff33b2d879fbef39c1d9ed39] x86/fpu: Include
>>> vmalloc.h for vzalloc()
>>> git bisect good 868c250bb4639531ff33b2d879fbef39c1d9ed39
>>> # bad: [6ab1d4839a486727fdd412bd8bab27417388d381] Merge tag
>>> 'platform-drivers-x86-v5.16-1' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
>>> git bisect bad 6ab1d4839a486727fdd412bd8bab27417388d381
>>> # good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid
>>> double accounting for pure zerocopy skbs"
>>> git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
>>> # good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag
>>> 'trace-v5.16' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>>> git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
>>> # good: [e848c714dbda4d6f0ad17a4d374af0ce9fcd1615] Revert
>>> "drm/amd/display: Fix error in dmesg at boot"
>>> git bisect good e848c714dbda4d6f0ad17a4d374af0ce9fcd1615
>>> # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag
>>> 'afs-next-20211102' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
>>> git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
>>> # good: [d9bd054177fbd2c4762546aec40fc3071bfe4cc0] Merge tag
>>> 'amd-drm-next-5.16-2021-10-29' of
>>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>>> git bisect good d9bd054177fbd2c4762546aec40fc3071bfe4cc0
>>> # bad: [833db72142b93a89211c1e43ca0a1e2e16457756] Merge tag
>>> 'pm-5.16-rc1' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>>> git bisect bad 833db72142b93a89211c1e43ca0a1e2e16457756
>>> # good: [33fb42636a938be01d951b4cee68127a59a1e7e4] Merge branch
>>> 'ucount-fixes-for-v5.16' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
>>> git bisect good 33fb42636a938be01d951b4cee68127a59a1e7e4
>>> # good: [c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb] Merge tag
>>> 'acpi-5.16-rc1' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
>>> git bisect good c0d6586afa3546a3d148cf4b9d9a407b4f79d0bb
>>> # good: [b62b306469b36fae7030c0ad4ffa11de0c9b9957] Merge branch 'pm-sleep'
>>> git bisect good b62b306469b36fae7030c0ad4ffa11de0c9b9957
>>> # good: [8e5b4779f6c50d118b3056bbb07c886b9ff8d112] Merge branches
>>> 'pm-cpufreq' and 'pm-cpuidle'
>>> git bisect good 8e5b4779f6c50d118b3056bbb07c886b9ff8d112
>>> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
>>> inefficiencies in CPUFreq
>>> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
>>> # good: [e458716a92b57f854deb89bb40aa3554c2b6205e] PM: EM: Mark
>>> inefficiencies in CPUFreq
>>> git bisect good e458716a92b57f854deb89bb40aa3554c2b6205e
>>> # bad: [eb82bace893169b319c563b7f813c58a0a5a9f76] powercap/drivers/dtpm:
>>> Scale the power with the load
>>> git bisect bad eb82bace893169b319c563b7f813c58a0a5a9f76
>>> # bad: [7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c] powercap/drivers/dtpm:
>>> Simplify the dtpm table
>>> git bisect bad 7a89d7eacf8e84f2afb94db5ae9d9f9faa93f01c
>>>
>>
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-11-15 12:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-07 11:59 DEAD callback error for CPU, WARNING: CPU: 3 PID: 1134 at kernel/cpu.c:1163 _cpu_down+0x20a/0x3a0 Colin King (gmail)
2021-11-08  8:12 ` Peter Zijlstra
2021-11-08 22:25   ` Colin King (gmail)
2021-11-14 10:32     ` Josef Johansson
2021-11-14 20:00       ` Josef Johansson
2021-11-14 22:20         ` Daniel Lezcano
2021-11-15  8:30           ` Josef Johansson
2021-11-15 12:02             ` Colin King (gmail)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).