From: Rick Warner <rick@microway.com>
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Thomas Gleixner <tglx@linutronix.d>
Subject: stall/hang on 4.15 kernel with some Xeon skylake CPUs and extended APIC
Date: Tue, 1 May 2018 12:37:29 -0400
Message-ID: <831e8a53-05d1-edfb-6287-fecfba22b8bd@microway.com>
[-- Attachment #1: Type: text/plain, Size: 4178 bytes --]
Hi All,
I've discovered that some new Supermicro Skylake systems hang/stall
while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
the BIOS. The issue occurs only with specific CPUs and follows those
CPUs when they are moved to another system.
We had four quad-socket systems with Xeon 6134 CPUs; 2 of the 4
exhibited this behavior. We replaced 2 CPUs at that time and the
behavior was eliminated. Those systems were then shipped to our customer
(we are an HPC system integrator).
Now we have 5 single-socket systems with Xeon 5122 CPUs, and 2 of the 5
are hanging. If we swap the CPUs from the hanging systems into working
systems, the behavior follows the CPU.
I've done a git bisect between 4.14 and 4.15 and found that this commit
triggers the issue:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
Some of the commits right before it also seemed to trigger this warning:
[ 5.062563] Debug warning: early ioremap leak of 1 areas detected.
please boot with early_ioremap_debug and report the dmesg.
I have a dmesg log, captured with early_ioremap_debug enabled, from the
commit immediately before the one linked above; I can send it if desired.
The latest git still has the issue.
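For readers unfamiliar with bisection, here is a minimal sketch of the
idea behind the git bisect run described above. It is not how git itself
is implemented: it assumes a linear history in which every commit after
the first bad one is also bad, and the commit names and the is_bad test
are hypothetical placeholders (in practice the test is "build the kernel
at this commit, boot it, and see whether it hangs").

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


# Hypothetical history: placeholder names, not real kernel commits.
history = [f"c{i}" for i in range(16)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 11)
print(culprit)
```

Each step halves the remaining range, so even the thousands of commits
between two kernel releases need only a dozen or so test boots.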
I've attached a dmesg log captured via serial console from a system
exhibiting this problem. Here is an excerpt from it where the problems
start:
ACPI: Added _OSI(Module Device)
ACPI: Added _OSI(Processor Device)
ACPI: Added _OSI(3.0 _SCP Extensions)
ACPI: Added _OSI(Processor Aggregator Device)
ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
INFO: rcu_sched self-detected stall on CPU
34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7497
INFO: rcu_sched detected stalls on CPUs/tasks:
34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7498
(t=15002 jiffies g=-294 c=-295 q=391)
(detected by 0, t=15002 jiffies, g=-294, c=-295, q=391)
NMI backtrace for cpu 34
CPU: 34 PID: 1 Comm: swapper/0 Not tainted 4.15.7-gentoo-r1-netuno-x86_64 #4
Hardware name: Supermicro SYS-2049U-TR4/X11QPH+, BIOS 2.0c 02/23/2018
Call Trace:
<IRQ>
dump_stack+0x5d/0x79
nmi_cpu_backtrace+0x94/0xae
? irq_force_complete_move+0x6f/0x6f
nmi_trigger_cpumask_backtrace+0x56/0xd3
rcu_dump_cpu_stacks+0x96/0xc0
rcu_check_callbacks+0x285/0x697
update_process_times+0x28/0x4a
tick_handle_periodic+0x20/0x5f
smp_apic_timer_interrupt+0x93/0xf9
apic_timer_interrupt+0x7d/0x90
</IRQ>
RIP: 0010:smp_call_function_many+0x1f1/0x204
RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
? slub_cpu_dead+0xa0/0xa0
? slub_cpu_dead+0xa0/0xa0
? __mmu_notifier_mm_destroy+0x32/0x32
on_each_cpu_mask+0x23/0x53
? slub_cpu_dead+0xa0/0xa0
on_each_cpu_cond+0x7c/0x8b
__kmem_cache_shrink+0x3c/0x237
? acpi_ps_delete_parse_tree+0x2d/0x59
? set_debug_rodata+0x11/0x11
? acpi_os_purge_cache+0xa/0xd
acpi_os_purge_cache+0xa/0xd
acpi_purge_cached_objects+0x29/0x38
acpi_initialize_objects+0x46/0x4f
? acpi_sleep_init+0xd6/0xd6
acpi_init+0xb6/0x324
? scan_for_dmi_ipmi+0x15/0xec
? acpi_sleep_init+0xd6/0xd6
do_one_initcall+0x89/0x128
? set_debug_rodata+0x11/0x11
? set_debug_rodata+0x11/0x11
kernel_init_freeable+0x112/0x18e
? rest_init+0xaa/0xaa
kernel_init+0xa/0xf0
ret_from_fork+0x35/0x40
The NMI dump info repeats periodically after that, but the boot never
progresses further.
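When comparing several captured logs, it can help to pull the stalled
CPU number and counters out of the RCU stall reports automatically. The
following is a rough sketch only: the regular expression is an assumption
based on the two stall lines in the excerpt above, not a definition of
the kernel's message format, and the function name is made up for
illustration.

```python
import re

# Assumed layout, taken from the excerpt above:
#   "34-....: (14997 ticks this GP) idle=... softirq=18/18 fqs=7497"
# DOTALL lets the match span the mail client's line wrap before "fqs=".
STALL_RE = re.compile(
    r"(?P<cpu>\d+)-\S*:\s+\((?P<ticks>\d+) ticks this GP\)"
    r".*?fqs=(?P<fqs>\d+)",
    re.DOTALL,
)


def parse_stalls(log: str):
    """Return a list of (cpu, ticks, fqs) tuples found in the log text."""
    return [
        (int(m["cpu"]), int(m["ticks"]), int(m["fqs"]))
        for m in STALL_RE.finditer(log)
    ]


sample = """\
INFO: rcu_sched self-detected stall on CPU
34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7497
INFO: rcu_sched detected stalls on CPUs/tasks:
34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7498
"""

for cpu, ticks, fqs in parse_stalls(sample):
    print(f"CPU {cpu}: {ticks} ticks in this grace period, fqs={fqs}")
```

Run against the excerpt, both reports point at CPU 34, the same CPU
named in the NMI backtrace.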
If any other information is needed, please let me know. I've already
reported this issue to Supermicro, and they believe it is an issue with
the kernel as opposed to an issue specific to their systems. I don't
have any other brands of Xeon Skylake systems with extended APIC support
that I can try this with.
Thanks,
Rick
Richard Warner
Chief Technology Officer
Microway, Inc
[-- Attachment #2: dmesg-hang-with-extended-APIC-enabled.txt.gz --]
[-- Type: application/gzip, Size: 7055 bytes --]
Thread overview: 8+ messages
2018-05-01 16:37 Rick Warner [this message]
2018-05-15 16:07 ` [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC Rick Warner
2018-05-15 20:19 ` Thomas Gleixner
2018-05-16 14:50 ` Thomas Gleixner
2018-05-16 23:02 ` Rick Warner
2018-05-17 12:36 ` Thomas Gleixner
2018-05-17 15:59 ` Rick Warner
2018-05-17 19:03 ` [tip:x86/urgent] x86/apic/x2apic: Initialize cluster ID properly tip-bot for Thomas Gleixner