* Panic when cpu hot-remove
@ 2015-06-17 10:42 ` 范冬冬
  0 siblings, 0 replies; 15+ messages in thread
From: 范冬冬 @ 2015-06-17 10:42 UTC (permalink / raw)
  To: Joerg Roedel, linux-kernel, jiang.liu, iommu
  Cc: 闫晓峰, 刘长生


Hi maintainer,

We found a problem where a panic happens when a CPU is hot-removed. We also traced the problem according to the call-trace information.
An endless loop occurs because head never becomes equal to tail in the function qi_check_fault().
The code in question is as follows:


do {
        if (qi->desc_status[head] == QI_IN_USE)
                qi->desc_status[head] = QI_ABORT;
        head = (head - 2 + QI_LENGTH) % QI_LENGTH;
} while (head != tail);
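
To make the failure mode concrete, the stepping pattern above can be reproduced with a few lines of user-space C. This is only a sketch of the arithmetic: the even QI_LENGTH value (256 here) and the starting head/tail indices are assumptions chosen for illustration, not values taken from a crash dump. Because head moves in steps of two modulo an even queue length, it only ever visits indices of one parity, so a tail of the opposite parity is never reached and the loop never terminates:

/* qi_parity_sketch.c - standalone illustration of the loop above.
 * Assumptions: QI_LENGTH is even (256 in this sketch); head and tail
 * start values are hypothetical, chosen to have different parity.
 */
#include <stdio.h>

#define QI_LENGTH 256

int main(void)
{
        int head = 5;   /* hypothetical: odd index  */
        int tail = 4;   /* hypothetical: even index */
        int steps = 0;

        do {
                /* the real loop marks qi->desc_status[head] as QI_ABORT here */
                head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                if (++steps > QI_LENGTH) {
                        printf("head never meets tail: parity mismatch\n");
                        return 1;
                }
        } while (head != tail);

        printf("terminated after %d steps\n", steps);
        return 0;
}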



The panic information follows:

[root@localhost ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 120
On-line CPU(s) list: 0-119
Thread(s) per core: 2
Core(s) per socket: 15
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E7-8880 v2 @ 2.50GHz
Stepping: 7
CPU MHz: 2973.535
BogoMIPS: 5008.11
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 38400K
NUMA node0 CPU(s): 0-119
[root@localhost ~]# echo 1 > /sys/firmware/acpi/hotplug/force_remove
[root@localhost ~]# echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
[ 138.217913] intel_pstate CPU 15 exiting
[ 138.249976] kvm: disabling virtualization on CPU15
[ 138.256008] smpboot: CPU 15 is now offline
[ 138.364245] intel_pstate CPU 75 exiting
[ 138.389285] Broke affinity for irq 47
[ 138.394433] kvm: disabling virtualization on CPU75
[ 138.400193] smpboot: CPU 75 is now offline
[ 139.119913] intel_pstate CPU 16 exiting
[ 139.146122] kvm: disabling virtualization on CPU16
[ 139.159401] smpboot: CPU 16 is now offline
[ 139.183872] intel_pstate CPU 76 exiting
[ 139.215591] kvm: disabling virtualization on CPU76
[ 139.221226] smpboot: CPU 76 is now offline
[ 139.971687] intel_pstate CPU 17 exiting
[ 140.003541] kvm: disabling virtualization on CPU17
[ 140.009286] smpboot: CPU 17 is now offline
[ 140.038648] intel_pstate CPU 77 exiting
[ 140.064705] kvm: disabling virtualization on CPU77
[ 140.070292] smpboot: CPU 77 is now offline
[ 140.291735] intel_pstate CPU 18 exiting
[ 140.306457] kvm: disabling virtualization on CPU18
[ 140.314712] smpboot: CPU 18 is now offline
[ 140.343928] intel_pstate CPU 78 exiting
[ 140.369473] kvm: disabling virtualization on CPU78
[ 140.378172] smpboot: CPU 78 is now offline
[ 140.522952] intel_pstate CPU 19 exiting
[ 140.537781] kvm: disabling virtualization on CPU19
[ 140.545436] smpboot: CPU 19 is now offline
[ 140.571167] intel_pstate CPU 79 exiting
[ 140.591320] kvm: disabling virtualization on CPU79
[ 140.597138] smpboot: CPU 79 is now offline
[ 140.735166] intel_pstate CPU 20 exiting
[ 140.750057] kvm: disabling virtualization on CPU20
[ 140.755738] smpboot: CPU 20 is now offline
[ 140.780342] intel_pstate CPU 80 exiting
[ 140.797354] kvm: disabling virtualization on CPU80
[ 140.803083] smpboot: CPU 80 is now offline
[ 140.937355] intel_pstate CPU 21 exiting
[ 140.955338] kvm: disabling virtualization on CPU21
[ 140.962774] smpboot: CPU 21 is now offline
[ 140.985552] intel_pstate CPU 81 exiting
[ 141.002056] kvm: disabling virtualization on CPU81
[ 141.007721] smpboot: CPU 81 is now offline
[ 141.181624] intel_pstate CPU 22 exiting
[ 141.199390] kvm: disabling virtualization on CPU22
[ 141.205059] smpboot: CPU 22 is now offline
[ 141.230659] intel_pstate CPU 82 exiting
[ 141.250371] kvm: disabling virtualization on CPU82
[ 141.256080] smpboot: CPU 82 is now offline
[ 141.405812] intel_pstate CPU 23 exiting
[ 141.420677] kvm: disabling virtualization on CPU23
[ 141.426406] smpboot: CPU 23 is now offline
[ 141.450894] intel_pstate CPU 83 exiting
[ 141.467542] kvm: disabling virtualization on CPU83
[ 141.473283] smpboot: CPU 83 is now offline
[ 141.654099] intel_pstate CPU 24 exiting
[ 141.669299] kvm: disabling virtualization on CPU24
[ 141.674959] smpboot: CPU 24 is now offline
[ 141.701252] intel_pstate CPU 84 exiting
[ 141.723850] kvm: disabling virtualization on CPU84
[ 141.732427] smpboot: CPU 84 is now offline
[ 141.871268] intel_pstate CPU 25 exiting
[ 141.883049] kvm: disabling virtualization on CPU25
[ 141.888690] smpboot: CPU 25 is now offline
[ 141.915392] intel_pstate CPU 85 exiting
[ 141.935412] kvm: disabling virtualization on CPU85
[ 141.941056] smpboot: CPU 85 is now offline
[ 142.102551] intel_pstate CPU 26 exiting
[ 142.120636] kvm: disabling virtualization on CPU26
[ 142.129233] smpboot: CPU 26 is now offline
[ 142.152582] intel_pstate CPU 86 exiting
[ 142.171197] Broke affinity for irq 27
[ 142.176406] kvm: disabling virtualization on CPU86
[ 142.181977] smpboot: CPU 86 is now offline
[ 142.339730] intel_pstate CPU 27 exiting
[ 142.354745] kvm: disabling virtualization on CPU27
[ 142.362048] smpboot: CPU 27 is now offline
[ 142.387910] intel_pstate CPU 87 exiting
[ 142.403435] Broke affinity for irq 16
[ 142.408612] kvm: disabling virtualization on CPU87
[ 142.414266] smpboot: CPU 87 is now offline
[ 142.558938] intel_pstate CPU 28 exiting
[ 142.570570] kvm: disabling virtualization on CPU28
[ 142.577692] smpboot: CPU 28 is now offline
[ 142.600045] intel_pstate CPU 88 exiting
[ 142.615597] Broke affinity for irq 48
[ 142.620738] kvm: disabling virtualization on CPU88
[ 142.626425] smpboot: CPU 88 is now offline
[ 142.765143] intel_pstate CPU 29 exiting
[ 142.780261] kvm: disabling virtualization on CPU29
[ 142.788962] smpboot: CPU 29 is now offline
[ 142.799788] intel_pstate CPU 89 exiting
[ 142.819354] Broke affinity for irq 40
[ 142.824553] kvm: disabling virtualization on CPU89
[ 142.830219] smpboot: CPU 89 is now offline



[ 149.972781] memory is not present
[ 149.976493] acpi ACPI0004:01: Still not present
[ 149.995783] memory is not present

[root@localhost ~]#
[ 197.532857] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 64
[ 197.541032] CPU: 64 PID: 2081 Comm: irqbalance Not tainted 4.1.0-rc1+ #29
[ 197.561245] 0000000000000000 00000000aa448ad2 ffff88046e205a90 ffffffff816a432a
[ 197.569555] 0000000000000000 ffffffff818e9f78 ffff88046e205b10 ffffffff8169f1dc
[ 197.577858] 0000000000000010 ffff88046e205b20 ffff88046e205ac0 00000000aa448ad2
[ 197.586166] Call Trace:
[ 197.588896] <NMI> [<ffffffff816a432a>] dump_stack+0x45/0x57
[ 197.595343] [<ffffffff8169f1dc>] panic+0xd0/0x204
[ 197.600699] [<ffffffff81134540>] ? restart_watchdog_hrtimer+0x60/0x60
[ 197.607991] [<ffffffff811345ff>] watchdog_overflow_callback+0xbf/0xc0
[ 197.615286] [<ffffffff81176bec>] __perf_event_overflow+0x9c/0x250
[ 197.622182] [<ffffffff811777c4>] perf_event_overflow+0x14/0x20
[ 197.628799] [<ffffffff81035952>] intel_pmu_handle_irq+0x1f2/0x480
[ 197.635709] [<ffffffff81319b21>] ? ioremap_page_range+0x281/0x400
[ 197.642617] [<ffffffff811bf84c>] ? vunmap_page_range+0x1bc/0x2e0
[ 197.649427] [<ffffffff811bf981>] ? unmap_kernel_range_noflush+0x11/0x20
[ 197.656915] [<ffffffff813e080a>] ? ghes_copy_tofrom_phys+0x12a/0x210
[ 197.664104] [<ffffffff813e0990>] ? ghes_read_estatus+0xa0/0x190
[ 197.670817] [<ffffffff8102bf0b>] perf_event_nmi_handler+0x2b/0x50
[ 197.677725] [<ffffffff81019130>] nmi_handle+0x90/0x130
[ 197.683562] [<ffffffff810196ba>] default_do_nmi+0x4a/0x140
[ 197.689788] [<ffffffff81019838>] do_nmi+0x88/0xc0
[ 197.695141] [<ffffffff816ade2f>] end_repeat_nmi+0x1e/0x2e
[ 197.701274] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 197.707790] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 197.714308] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 197.720815] <<EOE>> [<ffffffff814533b2>] modify_irte+0xa2/0xf0
[ 197.727541] [<ffffffff814537c1>] intel_ioapic_set_affinity+0x141/0x1e0
[ 197.734933] [<ffffffff81453de0>] set_remapped_irq_affinity+0x20/0x30
[ 197.742123] [<ffffffff810d6dec>] irq_do_set_affinity+0x1c/0x70
[ 197.748738] [<ffffffff810d6fd8>] irq_set_affinity_locked+0xa8/0xe0
[ 197.755732] [<ffffffff810d705a>] __irq_set_affinity+0x4a/0x80
[ 197.762252] [<ffffffff810db1f9>] write_irq_affinity.isra.3+0x119/0x140
[ 197.769643] [<ffffffff810db259>] irq_affinity_proc_write+0x19/0x20
[ 197.776649] [<ffffffff8126525d>] proc_reg_write+0x3d/0x80
[ 197.782777] [<ffffffff811b8e25>] ? do_mmap_pgoff+0x2f5/0x3c0
[ 197.789200] [<ffffffff811fa277>] __vfs_write+0x37/0x110
[ 197.795137] [<ffffffff811fd148>] ? __sb_start_write+0x58/0x110
[ 197.801753] [<ffffffff812a3133>] ? security_file_permission+0x23/0xa0
[ 197.809046] [<ffffffff811fa9a9>] vfs_write+0xa9/0x1b0
[ 197.814796] [<ffffffff8102368c>] ? do_audit_syscall_entry+0x6c/0x70
[ 197.821887] [<ffffffff811fb855>] SyS_write+0x55/0xd0
[ 197.827534] [<ffffffff81066cb0>] ? do_page_fault+0x30/0x80
[ 197.833762] [<ffffffff816abaee>] system_call_fastpath+0x12/0x71
[ 197.840610] Kernel Offset: disabled
[ 197.844505] drm_kms_helper: panic occurred, switching back to text console
[ 197.852238] ------------[ cut here ]------------
[ 197.857401] WARNING: CPU: 64 PID: 0 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
[ 197.867895] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw sb_edac gf128mul iTCO_wdt edac_core iTCO_vendor_support i2c_i801 glue_helper ablk_helper lpc_ich mfd_core pcspkr cryptd ipmi_si ipmi_msghandler shpchp nfsd auth_rpcgss nfs_acl lockd grace uinput sunrpc xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci libahci libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 197.958056] CPU: 64 PID: 0 Comm: swapper/64 Not tainted 4.1.0-rc1+ #29
[ 197.977976] 0000000000000000 4d09e50dfaab2c41 ffff88046e203d58 ffffffff816a432a
[ 197.986281] 0000000000000000 0000000000000000 ffff88046e203d98 ffffffff8107b1fa
[ 197.994585] ffff88046e203d98 0000000000000000 ffff88046d217580 0000000000000040
[ 198.002892] Call Trace:
[ 198.005622] <IRQ> [<ffffffff816a432a>] dump_stack+0x45/0x57
[ 198.012059] [<ffffffff8107b1fa>] warn_slowpath_common+0x8a/0xc0
[ 198.018771] [<ffffffff8107b32a>] warn_slowpath_null+0x1a/0x20
[ 198.025288] [<ffffffff8104e96d>] native_smp_send_reschedule+0x5d/0x60
[ 198.032583] [<ffffffff810ba9b5>] trigger_load_balance+0x145/0x1f0
[ 198.039490] [<ffffffff810a7ccc>] scheduler_tick+0x9c/0xe0
[ 198.045612] [<ffffffff810e6a61>] update_process_times+0x51/0x60
[ 198.052325] [<ffffffff810f6ed5>] tick_sched_handle.isra.18+0x25/0x60
[ 198.059511] [<ffffffff810f6f54>] tick_sched_timer+0x44/0x80
[ 198.065834] [<ffffffff810e7777>] __run_hrtimer+0x77/0x1d0
[ 198.071961] [<ffffffff810f6f10>] ? tick_sched_handle.isra.18+0x60/0x60
[ 198.079351] [<ffffffff810e7b53>] hrtimer_interrupt+0x103/0x230
[ 198.085966] [<ffffffff81051729>] local_apic_timer_interrupt+0x39/0x60
[ 198.093250] [<ffffffff816ae8f5>] smp_apic_timer_interrupt+0x45/0x60
[ 198.100351] [<ffffffff816ac9be>] apic_timer_interrupt+0x6e/0x80
[ 198.107050] <EOI> [<ffffffff810b2fc9>] ? pick_next_entity+0xa9/0x190
[ 198.114358] [<ffffffff810a3dec>] ? finish_task_switch+0x6c/0x1a0
[ 198.121168] [<ffffffff816a727c>] __schedule+0x2cc/0x910
[ 198.127102] [<ffffffff816a78f7>] schedule+0x37/0x90
[ 198.132649] [<ffffffff816a7c2e>] schedule_preempt_disabled+0xe/0x10
[ 198.139747] [<ffffffff810c0c4b>] cpu_startup_entry+0x1bb/0x480
[ 198.146360] [<ffffffff810f44fc>] ? clockevents_register_device+0xec/0x1c0
[ 198.154043] [<ffffffff8104f7b3>] start_secondary+0x173/0x1e0
[ 198.160463] ---[ end trace a332d23455636d1e ]---
[ 198.184372] ------------[ cut here ]------------
[ 198.189537] WARNING: CPU: 64 PID: 2081 at kernel/time/timer.c:1096 del_timer_sync+0x36/0x60()
[ 198.199061] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw sb_edac gf128mul iTCO_wdt edac_core iTCO_vendor_support i2c_i801 glue_helper ablk_helper lpc_ich mfd_core pcspkr cryptd ipmi_si ipmi_msghandler shpchp nfsd auth_rpcgss nfs_acl lockd grace uinput sunrpc xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci libahci libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 198.289205] CPU: 64 PID: 2081 Comm: irqbalance Tainted: G W 4.1.0-rc1+ #29
[ 198.310777] 0000000000000000 00000000aa448ad2 ffff88046e205530 ffffffff816a432a
[ 198.319082] 0000000000000000 0000000000000000 ffff88046e205570 ffffffff8107b1fa
[ 198.327387] ffff88046a992910 ffff88046e2055d0 ffff88046e2055d0 00000000fffe6ab5
[ 198.335691] Call Trace:
[ 198.338421] <NMI> [<ffffffff816a432a>] dump_stack+0x45/0x57
[ 198.344850] [<ffffffff8107b1fa>] warn_slowpath_common+0x8a/0xc0
[ 198.351563] [<ffffffff8107b32a>] warn_slowpath_null+0x1a/0x20
[ 198.358080] [<ffffffff810e5ba6>] del_timer_sync+0x36/0x60
[ 198.364209] [<ffffffff816aa886>] schedule_timeout+0x156/0x280
[ 198.370726] [<ffffffff813193bc>] ? idr_alloc+0x8c/0x100
[ 198.376663] [<ffffffff810e41c0>] ? internal_add_timer+0xb0/0xb0
[ 198.383373] [<ffffffff810e6197>] msleep+0x37/0x50
[ 198.388731] [<ffffffffa01a96ee>] mga_crtc_prepare+0x16e/0x380 [mgag200]
[ 198.396228] [<ffffffffa0166988>] drm_crtc_helper_set_mode+0x318/0x5a0 [drm_kms_helper]
[ 198.405181] [<ffffffffa0167a42>] drm_crtc_helper_set_config+0x892/0xab0 [drm_kms_helper]
[ 198.414342] [<ffffffffa00dc03f>] drm_mode_set_config_internal+0x6f/0x110 [drm]
[ 198.422518] [<ffffffffa0172538>] restore_fbdev_mode+0xc8/0xf0 [drm_kms_helper]
[ 198.430691] [<ffffffffa0172705>] drm_fb_helper_force_kernel_mode+0x75/0xb0 [drm_kms_helper]
[ 198.440125] [<ffffffffa0173409>] drm_fb_helper_panic+0x29/0x30 [drm_kms_helper]
[ 198.448391] [<ffffffff8109be5e>] notifier_call_chain+0x4e/0x80
[ 198.455004] [<ffffffff8109beca>] atomic_notifier_call_chain+0x1a/0x20
[ 198.462298] [<ffffffff8169f209>] panic+0xfd/0x204
[ 198.467651] [<ffffffff81134540>] ? restart_watchdog_hrtimer+0x60/0x60
[ 198.474944] [<ffffffff811345ff>] watchdog_overflow_callback+0xbf/0xc0
[ 198.482237] [<ffffffff81176bec>] __perf_event_overflow+0x9c/0x250
[ 198.489141] [<ffffffff811777c4>] perf_event_overflow+0x14/0x20
[ 198.495755] [<ffffffff81035952>] intel_pmu_handle_irq+0x1f2/0x480
[ 198.502660] [<ffffffff81319b21>] ? ioremap_page_range+0x281/0x400
[ 198.509566] [<ffffffff811bf84c>] ? vunmap_page_range+0x1bc/0x2e0
[ 198.516366] [<ffffffff811bf981>] ? unmap_kernel_range_noflush+0x11/0x20
[ 198.523852] [<ffffffff813e080a>] ? ghes_copy_tofrom_phys+0x12a/0x210
[ 198.531048] [<ffffffff813e0990>] ? ghes_read_estatus+0xa0/0x190
[ 198.537758] [<ffffffff8102bf0b>] perf_event_nmi_handler+0x2b/0x50
[ 198.544663] [<ffffffff81019130>] nmi_handle+0x90/0x130
[ 198.550499] [<ffffffff810196ba>] default_do_nmi+0x4a/0x140
[ 198.556724] [<ffffffff81019838>] do_nmi+0x88/0xc0
[ 198.562077] [<ffffffff816ade2f>] end_repeat_nmi+0x1e/0x2e
[ 198.568208] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 198.574724] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 198.581241] [<ffffffff8144bd87>] ? qi_submit_sync+0x217/0x3f0
[ 198.587755] <<EOE>> [<ffffffff814533b2>] modify_irte+0xa2/0xf0
[ 198.594480] [<ffffffff814537c1>] intel_ioapic_set_affinity+0x141/0x1e0
[ 198.601870] [<ffffffff81453de0>] set_remapped_irq_affinity+0x20/0x30
[ 198.609066] [<ffffffff810d6dec>] irq_do_set_affinity+0x1c/0x70
[ 198.615679] [<ffffffff810d6fd8>] irq_set_affinity_locked+0xa8/0xe0
[ 198.622681] [<ffffffff810d705a>] __irq_set_affinity+0x4a/0x80
[ 198.629198] [<ffffffff810db1f9>] write_irq_affinity.isra.3+0x119/0x140
[ 198.636589] [<ffffffff810db259>] irq_affinity_proc_write+0x19/0x20
[ 198.643590] [<ffffffff8126525d>] proc_reg_write+0x3d/0x80
[ 198.649716] [<ffffffff811b8e25>] ? do_mmap_pgoff+0x2f5/0x3c0
[ 198.656136] [<ffffffff811fa277>] __vfs_write+0x37/0x110
[ 198.662069] [<ffffffff811fd148>] ? __sb_start_write+0x58/0x110
[ 198.668684] [<ffffffff812a3133>] ? security_file_permission+0x23/0xa0
[ 198.675978] [<ffffffff811fa9a9>] vfs_write+0xa9/0x1b0
[ 198.681716] [<ffffffff8102368c>] ? do_audit_syscall_entry+0x6c/0x70
[ 198.688814] [<ffffffff811fb855>] SyS_write+0x55/0xd0
[ 198.694456] [<ffffffff81066cb0>] ? do_page_fault+0x30/0x80
[ 198.700682] [<ffffffff816abaee>] system_call_fastpath+0x12/0x71
[ 198.707392] ---[ end trace a332d23455636d1f ]---


* Re: Panic when cpu hot-remove
  2015-06-17 10:42 ` 范冬冬
@ 2015-06-17 11:52 ` Joerg Roedel
  2015-06-17 14:36     ` Alex Williamson
  -1 siblings, 1 reply; 15+ messages in thread
From: Joerg Roedel @ 2015-06-17 11:52 UTC (permalink / raw)
  To: 范冬冬
  Cc: linux-kernel, jiang.liu, iommu, 闫晓峰,
	刘长生

On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
> Hi maintainer,
> 
> We found a problem where a panic happens when a CPU is hot-removed. We also traced the problem according to the call-trace information.
> An endless loop occurs because head never becomes equal to tail in the function qi_check_fault().
> The code in question is as follows:
> 
> 
> do {
>         if (qi->desc_status[head] == QI_IN_USE)
>                 qi->desc_status[head] = QI_ABORT;
>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> } while (head != tail);

Hmm, this code iterates only over every second QI descriptor, and tail
probably points to a descriptor that is not iterated over.

Jiang, can you please have a look?


	Joerg



* Re: Panic when cpu hot-remove
@ 2015-06-17 14:36     ` Alex Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Williamson @ 2015-06-17 14:36 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: 范冬冬, 刘长生,
	iommu, jiang.liu, linux-kernel, 闫晓峰,
	Roland Dreier

On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
> > Hi maintainer,
> > 
> > We found a problem where a panic happens when a CPU is hot-removed. We also traced the problem according to the call-trace information.
> > An endless loop occurs because head never becomes equal to tail in the function qi_check_fault().
> > The code in question is as follows:
> > 
> > 
> > do {
> >         if (qi->desc_status[head] == QI_IN_USE)
> >                 qi->desc_status[head] = QI_ABORT;
> >         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> > } while (head != tail);
> 
> Hmm, this code iterates only over every second QI descriptor, and tail
> probably points to a descriptor that is not iterated over.
> 
> Jiang, can you please have a look?

I think that part is normal, the way we use the queue is to always
submit a work operation followed by a wait operation so that we can
determine the work operation is complete.  That's done via
qi_submit_sync().  We have had spurious reports of the queue getting
impossibly out of sync though.  I saw one that was somehow linked to the
I/O AT DMA engine.  Roland Dreier saw something similar[1].  I'm not
sure if they're related to this, but maybe worth comparing.  Thanks,

Alex

[1] http://lists.linuxfoundation.org/pipermail/iommu/2015-January/011502.html
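
Alex's submit-work-then-wait pairing can be pictured with a small toy model. Everything below (struct toy_queue, STAT_*, RING) is invented purely for illustration and is not the intel-iommu driver's actual code; only the idea that descriptors are queued as work+wait pairs, advancing the tail by two, is taken from his explanation:

/* toy_qi.c - simplified model of the work+wait descriptor pairing.
 * All names and values here are made up for illustration.
 */
#include <stdio.h>

#define RING 8                          /* toy queue length (even) */
enum { STAT_FREE, STAT_IN_USE, STAT_DONE };

struct toy_queue {
        int status[RING];
        int tail;                       /* next free slot          */
};

/* Queue one work descriptor immediately followed by one wait descriptor. */
static int submit_pair(struct toy_queue *q)
{
        int wait_idx = (q->tail + 1) % RING;

        q->status[q->tail] = STAT_IN_USE;    /* work descriptor  */
        q->status[wait_idx] = STAT_IN_USE;   /* wait descriptor  */
        q->tail = (q->tail + 2) % RING;      /* always step by 2 */
        return wait_idx;
}

int main(void)
{
        struct toy_queue q = { {0}, 0 };
        int wait_idx = submit_pair(&q);

        /* The hardware would process both descriptors and mark the wait
         * descriptor done; fake that here so the poll below terminates. */
        q.status[wait_idx] = STAT_DONE;

        while (q.status[wait_idx] != STAT_DONE)
                ;                            /* a synchronous submit spins here */

        printf("pair completed, tail now at %d\n", q.tail);
        return 0;
}

In this model one can also see why walking back from head in steps of two (as in the loop quoted earlier) normally lands on descriptor-pair boundaries, and why a head/tail pair of mismatched parity means the queue state is already corrupt.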



* Re: Panic when cpu hot-remove
@ 2015-06-18  5:40       ` Jiang Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Jiang Liu @ 2015-06-18  5:40 UTC (permalink / raw)
  To: Alex Williamson, Joerg Roedel
  Cc: 范冬冬, 刘长生,
	iommu, jiang.liu, linux-kernel, 闫晓峰,
	Roland Dreier

On 2015/6/17 22:36, Alex Williamson wrote:
> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>> Hi maintainer,
>>>
>>> We found a problem where a panic happens when a CPU is hot-removed. We also traced the problem according to the call-trace information.
>>> An endless loop occurs because head never becomes equal to tail in the function qi_check_fault().
>>> The code in question is as follows:
>>>
>>>
>>> do {
>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>                 qi->desc_status[head] = QI_ABORT;
>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>> } while (head != tail);
>>
>> Hmm, this code iterates only over every second QI descriptor, and tail
>> probably points to a descriptor that is not iterated over.
>>
>> Jiang, can you please have a look?
> 
> I think that part is normal, the way we use the queue is to always
> submit a work operation followed by a wait operation so that we can
> determine the work operation is complete.  That's done via
> qi_submit_sync().  We have had spurious reports of the queue getting
> impossibly out of sync though.  I saw one that was somehow linked to the
> I/O AT DMA engine.  Roland Dreier saw something similar[1].  I'm not
> sure if they're related to this, but maybe worth comparing.  Thanks,
Thanks, Alex and Joerg!

Hi Dongdong,
	Could you please help to give some instructions about how to
reproduce this issue? I will try to reproduce it if possible.
Thanks!
Gerry

> 
> Alex
> 
> [1] http://lists.linuxfoundation.org/pipermail/iommu/2015-January/011502.html

* Re: Panic when cpu hot-remove
@ 2015-06-18  7:54           ` fandongdong
  0 siblings, 0 replies; 15+ messages in thread
From: fandongdong @ 2015-06-18  7:54 UTC (permalink / raw)
  To: Jiang Liu, Alex Williamson, Joerg Roedel
  Cc: 刘长生,
	iommu, jiang.liu, linux-kernel, 闫晓峰,
	Roland Dreier



On 2015/6/18 15:27, fandongdong wrote:
>
>
> On 2015/6/18 13:40, Jiang Liu wrote:
>> On 2015/6/17 22:36, Alex Williamson wrote:
>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>> Hi maintainer,
>>>>>
>>>>> We found a problem where a panic happens when a CPU is hot-removed.
>>>>> We also traced the problem according to the call-trace information.
>>>>> An endless loop occurs because head never becomes equal to tail
>>>>> in the function qi_check_fault().
>>>>> The code in question is as follows:
>>>>>
>>>>>
>>>>> do {
>>>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>>>                 qi->desc_status[head] = QI_ABORT;
>>>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>> } while (head != tail);
>>>> Hmm, this code iterates only over every second QI descriptor, and tail
>>>> probably points to a descriptor that is not iterated over.
>>>>
>>>> Jiang, can you please have a look?
>>> I think that part is normal, the way we use the queue is to always
>>> submit a work operation followed by a wait operation so that we can
>>> determine the work operation is complete.  That's done via
>>> qi_submit_sync().  We have had spurious reports of the queue getting
>>> impossibly out of sync though.  I saw one that was somehow linked to 
>>> the
>>> I/O AT DMA engine.  Roland Dreier saw something similar[1]. I'm not
>>> sure if they're related to this, but maybe worth comparing. Thanks,
>> Thanks, Alex and Joerg!
>>
>> Hi Dongdong,
>>     Could you please help to give some instructions about how to
>> reproduce this issue? I will try to reproduce it if possible.
>> Thanks!
>> Gerry
> Hi Gerry,
>
> We're running kernel 4.1.0 on a 4-socket system and  we want to 
> offline socket 1.
> Steps as follows:
>
> echo 1 > /sys/firmware/acpi/hotplug/force_remove
> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
>
> Thanks!
> Dongdong
>>> Alex
>>>
>>> [1] 
>>> http://lists.linuxfoundation.org/pipermail/iommu/2015-January/011502.html
>>>

* Re: Panic when cpu hot-remove
@ 2015-06-25  8:11             ` Jiang Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Jiang Liu @ 2015-06-25  8:11 UTC (permalink / raw)
  To: fandongdong, Alex Williamson, Joerg Roedel
  Cc: 刘长生,
	iommu, jiang.liu, linux-kernel, 闫晓峰,
	Roland Dreier

On 2015/6/18 15:54, fandongdong wrote:
> 
> 
> On 2015/6/18 15:27, fandongdong wrote:
>>
>>
>> On 2015/6/18 13:40, Jiang Liu wrote:
>>> On 2015/6/17 22:36, Alex Williamson wrote:
>>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>>> Hi maintainer,
>>>>>>
>>>>>> We found a problem where a panic happens when a CPU is hot-removed.
>>>>>> We also traced the problem according to the call-trace information.
>>>>>> An endless loop occurs because head never becomes equal to tail
>>>>>> in the function qi_check_fault().
>>>>>> The code in question is as follows:
>>>>>>
>>>>>>
>>>>>> do {
>>>>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>>>>                 qi->desc_status[head] = QI_ABORT;
>>>>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>>> } while (head != tail);
>>>>> Hmm, this code iterates only over every second QI descriptor, and tail
>>>>> probably points to a descriptor that is not iterated over.
>>>>>
>>>>> Jiang, can you please have a look?
>>>> I think that part is normal, the way we use the queue is to always
>>>> submit a work operation followed by a wait operation so that we can
>>>> determine the work operation is complete.  That's done via
>>>> qi_submit_sync().  We have had spurious reports of the queue getting
>>>> impossibly out of sync though.  I saw one that was somehow linked to
>>>> the
>>>> I/O AT DMA engine.  Roland Dreier saw something similar[1]. I'm not
>>>> sure if they're related to this, but maybe worth comparing. Thanks,
>>> Thanks, Alex and Joerg!
>>>
>>> Hi Dongdong,
>>>     Could you please help to give some instructions about how to
>>> reproduce this issue? I will try to reproduce it if possible.
>>> Thanks!
>>> Gerry
>> Hi Gerry,
>>
>> We're running kernel 4.1.0 on a 4-socket system and  we want to
>> offline socket 1.
>> Steps as follows:
>>
>> echo 1 > /sys/firmware/acpi/hotplug/force_remove
>> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
Hi Dongdong,
	I failed to reproduce this issue on my side. Could you please help
confirm the following?
1) Is this issue reproducible on your side?
2) Does this issue happen if you disable the irqbalance service on your
   system?
3) Has the corresponding PCI host bridge been removed before removing
   the socket?

From the log message, we only noticed log messages for CPU and memory,
but not messages for PCI (IOMMU) devices. And this log message
	"[ 149.976493] acpi ACPI0004:01: Still not present"
implies that the socket has been powered off during the ejection.
So the story may be that you powered off the socket while the host
bridge on the socket is still in use.
Thanks!
Gerry
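
(A sketch of the ordering question 3 above suggests, for whoever retries this. The two echo commands are the ones already posted in this thread; the other paths are assumptions about a typical 4.1 system and may not match other machines, and this is an outline to experiment with, not a verified fix.)

# 1. See what lives under the container about to be ejected
#    (CPU devices, memory devices, and possibly a PCI host bridge).
ls /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/

# 2. Check which VT-d (DMAR) units are currently registered; on a 4.1
#    kernel with IOMMU sysfs support they appear under this class.
ls /sys/class/iommu/

# 3. If step 1 lists a host bridge (for example a PNP0A08:xx child),
#    take it and the PCI devices behind it offline first, e.g. via
#    /sys/bus/pci/devices/<bdf>/remove (placeholders; adjust to what
#    step 1 actually shows).

# 4. Only then eject the whole socket as before.
echo 1 > /sys/firmware/acpi/hotplug/force_remove
echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject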



* Re: Panic when cpu hot-remove
       [not found]             ` <558BB7B8.7000402-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2015-06-25 10:46               ` fandongdong
  0 siblings, 0 replies; 15+ messages in thread
From: fandongdong @ 2015-06-25 10:46 UTC (permalink / raw)
  To: Jiang Liu, Alex Williamson, Joerg Roedel
  Cc: 刘长生,
	iommu, jiang.liu, linux-kernel, 闫晓峰,
	Roland Dreier



On 2015/6/25 16:11, Jiang Liu wrote:
> On 2015/6/18 15:54, fandongdong wrote:
>>
>> On 2015/6/18 15:27, fandongdong wrote:
>>>
>>> On 2015/6/18 13:40, Jiang Liu wrote:
>>>> On 2015/6/17 22:36, Alex Williamson wrote:
>>>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>>>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>>>> Hi maintainer,
>>>>>>>
>>>>>>> We found a problem where a panic happens when a CPU is hot-removed.
>>>>>>> We also traced the problem according to the call-trace information.
>>>>>>> An endless loop occurs because head never becomes equal to tail
>>>>>>> in the function qi_check_fault().
>>>>>>> The code in question is as follows:
>>>>>>>
>>>>>>>
>>>>>>> do {
>>>>>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>>>>>                 qi->desc_status[head] = QI_ABORT;
>>>>>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>>>> } while (head != tail);
>>>>>> Hmm, this code iterates only over every second QI descriptor, and
>>>>>> tail probably points to a descriptor that is not iterated over.
>>>>>>
>>>>>> Jiang, can you please have a look?
>>>>> I think that part is normal, the way we use the queue is to always
>>>>> submit a work operation followed by a wait operation so that we can
>>>>> determine the work operation is complete.  That's done via
>>>>> qi_submit_sync().  We have had spurious reports of the queue getting
>>>>> impossibly out of sync though.  I saw one that was somehow linked to
>>>>> the
>>>>> I/O AT DMA engine.  Roland Dreier saw something similar[1]. I'm not
>>>>> sure if they're related to this, but maybe worth comparing. Thanks,
>>>> Thanks, Alex and Joerg!
>>>>
>>>> Hi Dongdong,
>>>>      Could you please help to give some instructions about how to
>>>> reproduce this issue? I will try to reproduce it if possible.
>>>> Thanks!
>>>> Gerry
>>> Hi Gerry,
>>>
>>> We're running kernel 4.1.0 on a 4-socket system and  we want to
>>> offline socket 1.
>>> Steps as follows:
>>>
>>> echo 1 > /sys/firmware/acpi/hotplug/force_remove
>>> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
> Hi Dongdong,
> 	I failed to reproduce this issue on my side. Can someone please help
> to confirm:
> 1) Is this issue reproducible on your side?
Yes.
> 2) Does this issue happen if you disable the irqbalance service on your
>     system?
Yes.
> 3) Has the corresponding PCI host bridge been removed before removing
>     the socket?
No, we will try to remove it before removing the socket later.
Thanks for your help, Gerry.
>
> From the log message, we only noticed log messages for CPU and memory,
> but not messages for PCI (IOMMU) devices. And this log message
> 	"[ 149.976493] acpi ACPI0004:01: Still not present"
> implies that the socket has been powered off during the ejection.
> So the story may be that you powered off the socket while the host
> bridge on the socket is still in use.
> Thanks!
> Gerry
>
> .
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Panic when cpu hot-remove
       [not found]             ` <558BB7B8.7000402-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2015-06-26  9:35               ` fandongdong
  0 siblings, 0 replies; 15+ messages in thread
From: fandongdong @ 2015-06-26  9:35 UTC (permalink / raw)
  To: Jiang Liu, Alex Williamson, Joerg Roedeljoro
  Cc: Roland Dreier, 闫晓峰,
	jiang.liu-ral2JQCrhuEAvxtiuMwx3w, linux-kernel,
	刘长生,
	iommu


[-- Attachment #1.1: Type: text/plain, Size: 3083 bytes --]



On 2015/6/25 16:11, Jiang Liu wrote:
> On 2015/6/18 15:54, fandongdong wrote:
>>
>> On 2015/6/18 15:27, fandongdong wrote:
>>>
>>> On 2015/6/18 13:40, Jiang Liu wrote:
>>>> On 2015/6/17 22:36, Alex Williamson wrote:
>>>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedeljoro wrote:
>>>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>>>> Hi maintainer,
>>>>>>>
>>>>>>> We found a problem that a panic happen when cpu was hot-removed.
>>>>>>> We also trace the problem according to the calltrace information.
>>>>>>> An endless loop happen because value head is not equal to value
>>>>>>> tail forever in the function qi_check_fault( ).
>>>>>>> The location code is as follows:
>>>>>>>
>>>>>>>
>>>>>>> do {
>>>>>>>           if (qi->desc_status[head] == QI_IN_USE)
>>>>>>>           qi->desc_status[head] = QI_ABORT;
>>>>>>>           head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>>>>       } while (head != tail);
>>>>>> Hmm, this code iterates only over every second QI descriptor, and
>>>>>> tail probably points to a descriptor that is not iterated over.
>>>>>>
>>>>>> Jiang, can you please have a look?
>>>>> I think that part is normal, the way we use the queue is to always
>>>>> submit a work operation followed by a wait operation so that we can
>>>>> determine the work operation is complete.  That's done via
>>>>> qi_submit_sync().  We have had spurious reports of the queue getting
>>>>> impossibly out of sync though.  I saw one that was somehow linked to
>>>>> the
>>>>> I/O AT DMA engine.  Roland Dreier saw something similar[1]. I'm not
>>>>> sure if they're related to this, but maybe worth comparing. Thanks,
>>>> Thanks, Alex and Joerg!
>>>>
>>>> Hi Dongdong,
>>>>      Could you please help to give some instructions about how to
>>>> reproduce this issue? I will try to reproduce it if possible.
>>>> Thanks!
>>>> Gerry
>>> Hi Gerry,
>>>
>>> We're running kernel 4.1.0 on a 4-socket system and  we want to
>>> offline socket 1.
>>> Steps as follows:
>>>
>>> echo 1 > /sys/firmware/acpi/hotplug/force_remove
>>> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
> Hi Dongdong,
> 	I failed to reproduce this issue on my side. Can someone please help
> to confirm:
> 1) Is this issue reproducible on your side?
> 2) Does this issue happen if you disable the irqbalance service on your
>     system?
> 3) Has the corresponding PCI host bridge been removed before removing
>     the socket?
>
> From the log message, we only noticed log messages for CPU and memory,
> but not messages for PCI (IOMMU) devices. And this log message
> 	"[ 149.976493] acpi ACPI0004:01: Still not present"
> implies that the socket has been powered off during the ejection.
> So the story may be that you powered off the socket while the host
> bridge on the socket is still in use.
> Thanks!
> Gerry
Hi Gerry,
	Thanks for your suggestion!
	The issue didn't happen after removing the corresponding PCI host
bridge.
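
For anyone else who hits this, the working order is essentially "tear down
the PCI side of the socket first, then eject the socket". A rough sketch of
the steps, where PNP0A08:01 only stands in for whichever host bridge ACPI
device actually sits on the socket being removed (look up the real name
under /sys/bus/acpi/devices first; it may differ on your system):

# Eject the host bridge(s) on the socket first, so the IOMMU/interrupt
# remapping units behind it are torn down while the socket is still
# powered.  PNP0A08:01 is only a placeholder.
echo 1 > /sys/bus/acpi/devices/PNP0A08:01/eject

# Only afterwards eject the socket container itself.
echo 1 > /sys/firmware/acpi/hotplug/force_remove
echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
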
  Thanks!
  Dongdong
>
> .
>





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Panic when cpu hot-remove
  2015-06-25  8:11             ` Jiang Liu
                               ` (2 preceding siblings ...)
  (?)
@ 2015-11-09 20:21             ` Guenter Roeck
  -1 siblings, 0 replies; 15+ messages in thread
From: Guenter Roeck @ 2015-11-09 20:21 UTC (permalink / raw)
  To: Jiang Liu; +Cc: fandongdong, Alex Williamson, Joerg Roedeljoro, linux-kernel

Gerry,

On Thu, Jun 25, 2015 at 04:11:36PM +0800, Jiang Liu wrote:
> On 2015/6/18 15:54, fandongdong wrote:
> > 
> > 
> > On 2015/6/18 15:27, fandongdong wrote:
> >>
> >>
> >> On 2015/6/18 13:40, Jiang Liu wrote:
> >>> On 2015/6/17 22:36, Alex Williamson wrote:
> >>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedeljoro wrote:
> >>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
> >>>>>> Hi maintainer,
> >>>>>>
> >>>>>> We found a problem that a panic happen when cpu was hot-removed.
> >>>>>> We also trace the problem according to the calltrace information.
> >>>>>> An endless loop happen because value head is not equal to value
> >>>>>> tail forever in the function qi_check_fault( ).
> >>>>>> The location code is as follows:
> >>>>>>
> >>>>>>
> >>>>>> do {
> >>>>>>          if (qi->desc_status[head] == QI_IN_USE)
> >>>>>>          qi->desc_status[head] = QI_ABORT;
> >>>>>>          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> >>>>>>      } while (head != tail);
> >>>>> Hmm, this code iterates only over every second QI descriptor, and
> >>>>> tail probably points to a descriptor that is not iterated over.
> >>>>>
> >>>>> Jiang, can you please have a look?
> >>>> I think that part is normal, the way we use the queue is to always
> >>>> submit a work operation followed by a wait operation so that we can
> >>>> determine the work operation is complete.  That's done via
> >>>> qi_submit_sync().  We have had spurious reports of the queue getting
> >>>> impossibly out of sync though.  I saw one that was somehow linked to
> >>>> the
> >>>> I/O AT DMA engine.  Roland Dreier saw something similar[1]. I'm not
> >>>> sure if they're related to this, but maybe worth comparing. Thanks,
> >>> Thanks, Alex and Joerg!
> >>>
> >>> Hi Dongdong,
> >>>     Could you please help to give some instructions about how to
> >>> reproduce this issue? I will try to reproduce it if possible.
> >>> Thanks!
> >>> Gerry
> >> Hi Gerry,
> >>
> >> We're running kernel 4.1.0 on a 4-socket system and  we want to
> >> offline socket 1.
> >> Steps as follows:
> >>
> >> echo 1 > /sys/firmware/acpi/hotplug/force_remove
> >> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
> Hi Dongdong,
> 	I failed to reproduce this issue on my side. Can someone please help
> to confirm:
> 1) Is this issue reproducible on your side?
> 2) Does this issue happen if you disable the irqbalance service on your
>    system?
> 3) Has the corresponding PCI host bridge been removed before removing
>    the socket?
> 
> From the log message, we only noticed log messages for CPU and memory,
> but not messages for PCI (IOMMU) devices. And this log message
> 	"[ 149.976493] acpi ACPI0004:01: Still not present"
> implies that the socket has been powered off during the ejection.
> So the story may be that you powered off the socket while the host
> bridge on the socket is still in use.
> Thanks!
> Gerry
> 

Was this problem ever resolved?

We are seeing the same (or a similar) problem randomly with our hardware.
No CPU hotplug is involved.

Any idea what I can do (or how I can help) to track down the problem?

Thanks,
Guenter

---
Sample traceback:

[  485.547997] Uhhuh. NMI received for unknown reason 29 on CPU 0.
[  485.633519] Do you have a strange power saving mode enabled?
[  485.715262] Kernel panic - not syncing: NMI: Not continuing
[  485.795750] CPU: 0 PID: 25109 Comm: cty Tainted: P        W  O 4.1.12-juniper-00687-g3de457e-dirty #1
[  485.932825] Hardware name: Juniper Networks, Inc. 0576/HSW RCB PTX, BIOS NGRE_v0.44 04/07/2015
[  486.057327]  0000000000000029 ffff88085f605df8 ffffffff80a9e179 0000000000000000
[  486.164220]  ffffffff80e53b4a ffff88085f605e78 ffffffff80a99b6f ffff88085f605e18
[  486.271116]  ffffffff00000008 ffff88085f605e88 ffff88085f605e28 ffffffff81019a00
[  486.378012] Call Trace:
[  486.413225]  <NMI>  [<ffffffff80a9e179>] dump_stack+0x4f/0x7b
[  486.496228]  [<ffffffff80a99b6f>] panic+0xbb/0x1e9
[  486.565393]  [<ffffffff802070ac>] unknown_nmi_error+0x9c/0xa0
[  486.648394]  [<ffffffff8020724c>] default_do_nmi+0x19c/0x1c0
[  486.730138]  [<ffffffff80207356>] do_nmi+0xe6/0x160
[  486.800564]  [<ffffffff80aa859b>] end_repeat_nmi+0x1a/0x1e
[  486.879793]  [<ffffffff8072a896>] ? qi_submit_sync+0x186/0x3f0
[  486.964051]  [<ffffffff8072a896>] ? qi_submit_sync+0x186/0x3f0
[  487.048307]  [<ffffffff8072a896>] ? qi_submit_sync+0x186/0x3f0
[  487.132564]  <<EOE>>  [<ffffffff80731823>] modify_irte+0x93/0xd0
[  487.219342]  [<ffffffff80731bd3>] intel_ioapic_set_affinity+0x113/0x1a0
[  487.314918]  [<ffffffff80732130>] set_remapped_irq_affinity+0x20/0x30
[  487.407979]  [<ffffffff802c5fec>] irq_do_set_affinity+0x1c/0x50
[  487.493494]  [<ffffffff802c607d>] setup_affinity+0x5d/0x80
[  487.572725]  [<ffffffff802c68b4>] __setup_irq+0x2c4/0x580
[  487.650695]  [<ffffffff8070ce80>] ? serial8250_modem_status+0xd0/0xd0
[  487.743755]  [<ffffffff802c6cf4>] request_threaded_irq+0xf4/0x1b0
[  487.831786]  [<ffffffff8070febf>] univ8250_setup_irq+0x24f/0x290
[  487.918560]  [<ffffffff80710c27>] serial8250_do_startup+0x117/0x5f0
[  488.009108]  [<ffffffff80711125>] serial8250_startup+0x25/0x30
[  488.093365]  [<ffffffff8070b779>] uart_startup.part.16+0x89/0x1f0
[  488.181397]  [<ffffffff8070c475>] uart_open+0x115/0x160
[  488.256852]  [<ffffffff806e9537>] ? check_tty_count+0x57/0xc0
[  488.339854]  [<ffffffff806ed95c>] tty_open+0xcc/0x610
[  488.412793]  [<ffffffff8073dc92>] ? kobj_lookup+0x112/0x170
[  488.493283]  [<ffffffff803b7e6f>] chrdev_open+0x9f/0x1d0
[  488.569992]  [<ffffffff803b1297>] do_dentry_open+0x217/0x340
[  488.651735]  [<ffffffff803b7dd0>] ? cdev_put+0x30/0x30
[  488.725934]  [<ffffffff803b2577>] vfs_open+0x57/0x60
[  488.797616]  [<ffffffff803bffbb>] do_last+0x3fb/0xee0
[  488.870557]  [<ffffffff803c2620>] path_openat+0x80/0x640
[  488.947270]  [<ffffffff803c3eda>] do_filp_open+0x3a/0x90
[  489.023984]  [<ffffffff80aa6098>] ? _raw_spin_unlock+0x18/0x40
[  489.108240]  [<ffffffff803d0ba7>] ? __alloc_fd+0xa7/0x130
[  489.186213]  [<ffffffff803b2909>] do_sys_open+0x129/0x220
[  489.264184]  [<ffffffff80402a4b>] compat_SyS_open+0x1b/0x20
[  489.344670]  [<ffffffff80aa8d65>] ia32_do_call+0x13/0x13

---
Similar traceback, but during PCIe hotplug:

Call Trace:
 <NMI>  [<ffffffff80a9218a>] dump_stack+0x4f/0x7b
 [<ffffffff80a8df39>] panic+0xbb/0x1df
 [<ffffffff8020728c>] unknown_nmi_error+0x9c/0xa0
 [<ffffffff8020742c>] default_do_nmi+0x19c/0x1c0
 [<ffffffff80207536>] do_nmi+0xe6/0x160
 [<ffffffff80a9b31b>] end_repeat_nmi+0x1a/0x1e
 [<ffffffff80723dc6>] ? qi_submit_sync+0x186/0x3f0
 [<ffffffff80723dc6>] ? qi_submit_sync+0x186/0x3f0
 [<ffffffff80723dc6>] ? qi_submit_sync+0x186/0x3f0
 <<EOE>>  [<ffffffff8072a325>] free_irte+0xe5/0x130
 [<ffffffff8072ba0f>] free_remapped_irq+0x2f/0x40
 [<ffffffff8023af33>] arch_teardown_hwirq+0x23/0x70
 [<ffffffff802c32d8>] irq_free_hwirqs+0x38/0x60
 [<ffffffff8023e0e3>] native_teardown_msi_irq+0x13/0x20
 [<ffffffff8020777f>] arch_teardown_msi_irq+0xf/0x20
 [<ffffffff8069e08f>] default_teardown_msi_irqs+0x5f/0xa0
 [<ffffffff8020775f>] arch_teardown_msi_irqs+0xf/0x20
 [<ffffffff8069e159>] free_msi_irqs+0x89/0x1a0
 [<ffffffff8069f165>] pci_disable_msi+0x45/0x50
 [<ffffffff80696d05>] cleanup_service_irqs+0x25/0x40
 [<ffffffff8069749e>] pcie_port_device_remove+0x2e/0x40
 [<ffffffff8069760e>] pcie_portdrv_remove+0xe/0x10

---
Similar, but at another location in qi_submit_sync:

Call Trace:
 <NMI>  [<ffffffff80a9218a>] dump_stack+0x4f/0x7b
 [<ffffffff80a8df39>] panic+0xbb/0x1df
 [<ffffffff8020728c>] unknown_nmi_error+0x9c/0xa0
 [<ffffffff8020742c>] default_do_nmi+0x19c/0x1c0
 [<ffffffff80207536>] do_nmi+0xe6/0x160
 [<ffffffff80a9b31b>] end_repeat_nmi+0x1a/0x1e
 [<ffffffff80a98c58>] ? _raw_spin_lock+0x38/0x40
 [<ffffffff80a98c58>] ? _raw_spin_lock+0x38/0x40
 [<ffffffff80a98c58>] ? _raw_spin_lock+0x38/0x40
 <<EOE>>  [<ffffffff80723e9d>] qi_submit_sync+0x25d/0x3f0
 [<ffffffff8072a325>] free_irte+0xe5/0x130
 [<ffffffff8072ba0f>] free_remapped_irq+0x2f/0x40
 [<ffffffff8023af33>] arch_teardown_hwirq+0x23/0x70
 [<ffffffff802c32d8>] irq_free_hwirqs+0x38/0x60
 [<ffffffff8023e0e3>] native_teardown_msi_irq+0x13/0x20
 [<ffffffff8020777f>] arch_teardown_msi_irq+0xf/0x20
 [<ffffffff8069e08f>] default_teardown_msi_irqs+0x5f/0xa0
 [<ffffffff8020775f>] arch_teardown_msi_irqs+0xf/0x20
 [<ffffffff8069e159>] free_msi_irqs+0x89/0x1a0
 [<ffffffff8069f165>] pci_disable_msi+0x45/0x50
 [<ffffffff80696d05>] cleanup_service_irqs+0x25/0x40
 [<ffffffff8069749e>] pcie_port_device_remove+0x2e/0x40
 [<ffffffff8069760e>] pcie_portdrv_remove+0xe/0x10
 [<ffffffff806896ed>] pci_device_remove+0x3d/0xc0

The NMIs during PCIe hotplug seem to be more likely (possibly because our
testing generates a large number of PCIe hotplug events).

---
CPU information:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Genuine Intel(R) CPU @ 1.80GHz
stepping	: 1
microcode	: 0x14

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-11-09 20:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-17 10:42 Panic when cpu hot-remove 范冬冬
2015-06-17 10:42 ` 范冬冬
2015-06-17 11:52 ` Joerg Roedeljoro
2015-06-17 14:36   ` Alex Williamson
2015-06-17 14:36     ` Alex Williamson
2015-06-18  5:40     ` Jiang Liu
2015-06-18  5:40       ` Jiang Liu
     [not found]       ` <558272E3.4000504@inspur.com>
2015-06-18  7:54         ` fandongdong
2015-06-18  7:54           ` fandongdong
2015-06-25  8:11           ` Jiang Liu
2015-06-25  8:11             ` Jiang Liu
2015-06-25 10:46             ` fandongdong
2015-06-25 10:46               ` fandongdong
     [not found]             ` <558BB7B8.7000402-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2015-06-26  9:35               ` fandongdong
2015-11-09 20:21             ` Guenter Roeck
