Re: [PATCH v2] x86,mm: print likely CPU at segfault time

From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com,
	Thomas Gleixner <tglx@linutronix.de>, Dave Jones <dsj@fb.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH v2]  x86,mm: print likely CPU at segfault time
Date: Thu, 4 Aug 2022 22:17:04 +0200
Message-ID: <YuwpQEYCwTl+m6j5@gmail.com>
In-Reply-To: <20220804155450.08c5b87e@imladris.surriel.com>

* Rik van Riel <riel@surriel.com> wrote:

> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
> 
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
> 
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
> 
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
>  arch/x86/mm/fault.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address, struct task_struct *tsk)
>  {
>  	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> +	/* This is a racy snapshot, but it's better than nothing. */
> +	int cpu = READ_ONCE(raw_smp_processor_id());
>  
>  	if (!unhandled_signal(tsk, SIGSEGV))
>  		return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  
>  	print_vma_addr(KERN_CONT " in ", regs->ip);
>  
> +	/*
> +	 * Dump the likely CPU where the fatal segfault happened.
> +	 * This can help identify faulty hardware.
> +	 */
> +	printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> +	       topology_core_id(cpu), topology_physical_package_id(cpu));

LGTM, applying this to tip:x86/mm unless someone objects.

I've added a note to the changelog that this only gets printed if
show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is enabled -
which is off by default. So in essence your patch extends a default-off
debug printout, where maximizing utility is fine.
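
For anyone who wants to tally these per CPU on the fleet side, the suffix
from the example line in the changelog can be pulled out with a small
userspace parser. This is only a rough sketch, assuming the exact
" on CPU %d (core %d, socket %d)" format from the patch; struct segv_loc and
parse_segfault_cpu() are hypothetical names for illustration and are not part
of the patch or of any kernel interface:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three values the patch prints. */
struct segv_loc {
	int cpu;
	int core;
	int socket;
};

/*
 * Hypothetical helper: extract CPU, core and socket from one log line.
 * Assumes the suffix format used by the patch. Returns 0 on success and
 * -1 if the line does not carry a well-formed suffix.
 */
static int parse_segfault_cpu(const char *line, struct segv_loc *out)
{
	const char *p = NULL;
	const char *q = line;

	/*
	 * Use the last occurrence of the marker, since the executable
	 * name printed earlier in the line could contain the same text.
	 */
	while ((q = strstr(q, " on CPU ")) != NULL) {
		p = q;
		q++;
	}
	if (!p)
		return -1;

	if (sscanf(p, " on CPU %d (core %d, socket %d)",
		   &out->cpu, &out->core, &out->socket) != 3)
		return -1;

	return 0;
}
```

Feeding it each matching line from the kernel log and counting hits per
(socket, core) pair should be enough to spot a core that shows up far more
often than its neighbours. The usual caveat from the changelog applies: the
CPU number is a racy snapshot, so a single hit proves nothing.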

Thanks,

	Ingo