From: Thomas Gleixner <tglx@linutronix.de>
To: Rick Warner <rick@microway.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
Date: Tue, 15 May 2018 22:19:16 +0200 (CEST)
Message-ID: <alpine.DEB.2.21.1805152208590.1605@nanos.tec.linutronix.de>
In-Reply-To: <b346f6d2-dd4e-d19b-099e-633045e88f4b@microway.com>

On Tue, 15 May 2018, Rick Warner wrote:
> > I've discovered that some new Supermicro skylake systems will hang/stall
> > while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
> > the BIOS. The issue happens on specific CPUs only and follows the CPUs.
> > 
> > We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
> > exhibiting this behavior.  We replaced 2 CPUs at that time and the
> > behavior was eliminated. Those systems were then shipped to our customer
> > (we are an HPC system integrator).
> > 
> > Now, we have 5 single socket systems with 5122 CPUs.  2 out of the 5 are
> > hanging.  If we swap the CPUs from the hanging systems with working
> > systems, the behavior follows the CPU.

That's weird.

> > I've done a git bisect between 4.14 and 4.15 and found this commit is
> > triggering the issue:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
> >

Interesting.

> > I've attached a dmesg log captured via serial console from a system
> > exhibiting this problem.  Here is an excerpt from it where the problems
> > start:

> > NMI backtrace for cpu 34

> > RIP: 0010:smp_call_function_many+0x1f1/0x204

So this waits for the IPI to be handled on some other CPU(s).
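
To illustrate that wait, here is a minimal user-space sketch of the
pattern. It is not the kernel's code: struct call_desc, struct fake_cpu,
CSD_FLAG_LOCK as used here, and call_many_and_wait() are hypothetical
names, and the "remote CPUs" are simulated by a tick counter instead of
real interrupts. The sender marks one descriptor per target as pending,
then polls until every target has cleared its bit. A target that never
handles the request leaves the sender spinning, which is the situation the
backtrace shows.

```c
#define CSD_FLAG_LOCK 0x01u   /* illustrative "request pending" bit */

/* One call descriptor per target: the sender sets the bit, the target clears it. */
struct call_desc {
    unsigned int flags;
};

/* Simulated target CPU: handles the request on tick handle_at, or never if negative. */
struct fake_cpu {
    struct call_desc desc;
    long handle_at;
};

/* Advance one simulated CPU; once its handler has run, it releases the sender. */
static void fake_cpu_tick(struct fake_cpu *cpu, long now)
{
    if (cpu->handle_at >= 0 && now >= cpu->handle_at)
        cpu->desc.flags &= ~CSD_FLAG_LOCK;
}

/*
 * Sender side: mark every descriptor pending, then poll until all targets
 * have cleared their bit. Returns the tick on which the last target
 * completed, or -1 if max_ticks elapsed first (the stall case). The bound
 * exists only so the sketch terminates; it is an assumption of this model.
 */
static long call_many_and_wait(struct fake_cpu *cpus, int ncpus, long max_ticks)
{
    for (int i = 0; i < ncpus; i++)
        cpus[i].desc.flags |= CSD_FLAG_LOCK;

    for (long now = 0; now < max_ticks; now++) {
        int pending = 0;

        for (int i = 0; i < ncpus; i++) {
            fake_cpu_tick(&cpus[i], now);
            if (cpus[i].desc.flags & CSD_FLAG_LOCK)
                pending++;
        }
        if (!pending)
            return now;
    }
    return -1;
}
```

With two targets whose handle_at values are 3 and 7, the call returns 7.
If one target has a negative handle_at (the request is never handled), the
call returns -1 after max_ticks, which models the sender spinning until
the stall detector fires.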

> > RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
> > RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
> > RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
> > RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
> > R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
> > R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
> >   ? slub_cpu_dead+0xa0/0xa0
> >   ? slub_cpu_dead+0xa0/0xa0
> >   ? __mmu_notifier_mm_destroy+0x32/0x32
> >   on_each_cpu_mask+0x23/0x53
> >   ? slub_cpu_dead+0xa0/0xa0
> >   on_each_cpu_cond+0x7c/0x8b
> >   __kmem_cache_shrink+0x3c/0x237
> >   ? acpi_ps_delete_parse_tree+0x2d/0x59
> >   ? set_debug_rodata+0x11/0x11
> >   ? acpi_os_purge_cache+0xa/0xd
> >   acpi_os_purge_cache+0xa/0xd
> >   acpi_purge_cached_objects+0x29/0x38
> >   acpi_initialize_objects+0x46/0x4f
> >   ? acpi_sleep_init+0xd6/0xd6
> >   acpi_init+0xb6/0x324
> >   ? scan_for_dmi_ipmi+0x15/0xec
> >   ? acpi_sleep_init+0xd6/0xd6
> >   do_one_initcall+0x89/0x128
> >   ? set_debug_rodata+0x11/0x11
> >   ? set_debug_rodata+0x11/0x11
> >   kernel_init_freeable+0x112/0x18e
> >   ? rest_init+0xaa/0xaa
> >   kernel_init+0xa/0xf0
> >   ret_from_fork+0x35/0x40

> > If any other information is needed, please let me know.  I've reported
> > this issue to Supermicro already and they believe it is an issue with
> > the kernel opposed to an issue specific to their systems.  I don't have
> > any other brand Xeon skylake systems with extended APIC support that I
> > can try this with.

I can't spot an immediate failure in that commit, but I'll look tomorrow at
instrumenting this with tracepoints which can be dumped from the stall
detector.

Thanks,

	tglx

Thread overview: 8+ messages
2018-05-01 16:37 stall/hang on 4.15 kernel with some Xeon skylake CPUs and extended APIC Rick Warner
2018-05-15 16:07 ` [bisected] rcu_sched detected stalls - 4.15 or newer " Rick Warner
2018-05-15 20:19   ` Thomas Gleixner [this message]
2018-05-16 14:50     ` Thomas Gleixner
2018-05-16 23:02       ` Rick Warner
2018-05-17 12:36         ` Thomas Gleixner
2018-05-17 15:59           ` Rick Warner
2018-05-17 19:03           ` [tip:x86/urgent] x86/apic/x2apic: Initialize cluster ID properly tip-bot for Thomas Gleixner
