linux-kernel.vger.kernel.org archive mirror
From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com,
	Thomas Gleixner <tglx@linutronix.de>, Dave Jones <dsj@fb.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH v2]  x86,mm: print likely CPU at segfault time
Date: Thu, 4 Aug 2022 22:17:04 +0200
Message-ID: <YuwpQEYCwTl+m6j5@gmail.com>
In-Reply-To: <20220804155450.08c5b87e@imladris.surriel.com>


* Rik van Riel <riel@surriel.com> wrote:

> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
> 
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
> 
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
> 
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
>  arch/x86/mm/fault.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address, struct task_struct *tsk)
>  {
>  	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> +	/* This is a racy snapshot, but it's better than nothing. */
> +	int cpu = READ_ONCE(raw_smp_processor_id());
>  
>  	if (!unhandled_signal(tsk, SIGSEGV))
>  		return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  
>  	print_vma_addr(KERN_CONT " in ", regs->ip);
>  
> +	/*
> +	 * Dump the likely CPU where the fatal segfault happened.
> +	 * This can help identify faulty hardware.
> +	 */
> +	printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> +	       topology_core_id(cpu), topology_physical_package_id(cpu));

LGTM, applying this to tip:x86/mm unless someone objects.

I've added the tidbit to the changelog that this only gets printed if 
show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is enabled - 
which is off by default. So your patch expands upon a default-off debug 
printout in essence - where utility maximization is OK.

Thanks,

	Ingo
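
For anyone collecting these messages across a fleet, here is a minimal
user-space sketch for pulling the three numbers back out of a log line.
It assumes the suffix looks exactly like the sample line in the changelog
(" on CPU N (core N, socket N)"). The names parse_segv_cpu() and
struct segv_cpu are hypothetical and not part of the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the values printed by the patch. */
struct segv_cpu {
	int cpu;
	int core;
	int socket;
};

/*
 * Hypothetical helper: extract CPU, core and socket from one kernel log
 * line carrying the " on CPU %d (core %d, socket %d)" suffix.
 * Returns 0 on success, -1 if the line has no parsable suffix.
 */
static int parse_segv_cpu(const char *line, struct segv_cpu *out)
{
	static const char marker[] = " on CPU ";
	const char *p = NULL;
	const char *next = strstr(line, marker);

	/* The suffix is appended last, so use the final occurrence. */
	while (next) {
		p = next;
		next = strstr(next + 1, marker);
	}
	if (!p)
		return -1;

	/* All three fields must convert for the line to count. */
	if (sscanf(p, " on CPU %d (core %d, socket %d)",
		   &out->cpu, &out->core, &out->socket) != 3)
		return -1;

	return 0;
}
```

Counting the (socket, core) pairs such a helper returns per host is one
way to spot a core that shows up far more often than its neighbours. As
the changelog says, the reported CPU is only a likely one, so a single
hit proves little.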

Thread overview: 8+ messages
2022-08-04 19:54 [PATCH v2] x86,mm: print likely CPU at segfault time Rik van Riel
2022-08-04 20:17 ` Ingo Molnar [this message]
2022-08-05  3:01   ` Rik van Riel
2022-08-05  7:59     ` Ingo Molnar
2022-08-05 12:54       ` Rik van Riel
2022-08-05 10:08 ` Borislav Petkov
2022-08-05 12:53   ` Rik van Riel
2022-08-05 13:25     ` Borislav Petkov