linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Rik van Riel <riel@surriel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com,
	Thomas Gleixner <tglx@linutronix.de>, Dave Jones <dsj@fb.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH v2]  x86,mm: print likely CPU at segfault time
Date: Fri, 05 Aug 2022 08:53:46 -0400	[thread overview]
Message-ID: <2239a6e4f5e9d12ef7a55da6dba716df681201ff.camel@surriel.com> (raw)
In-Reply-To: <YuzsJfHi+qV6Z16E@zn.tnic>

[-- Attachment #1: Type: text/plain, Size: 2401 bytes --]

On Fri, 2022-08-05 at 12:08 +0200, Borislav Petkov wrote:
> On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> > Add a printk() to show_signal_msg() to print the CPU, core, and
> > socket
> > at segfault time. This is not perfect, since the task might get
> > rescheduled
> > on another CPU between when the fault hit, and when the message is
> > printed,
> > but in practice this has been good enough to help us identify
> > several bad
> > CPU cores.
> > 
> > segfault[1349]: segfault at 0 ip 000000000040113a sp
> > 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0,
> > socket 0)
> 
> And what happens when someone is looking at this, the CPU information
> is
> wrong because we got rescheduled but...
> 
> ... someone is missing this important tidbit here that the CPU info
> above is unreliable?
> 
> Someone is sent on a wild goose chase.
> 
We have been using this patch for the past year, and it
does not appear to be an issue in practice.

When a faulty CPU core causes tasks to segfault, we
typically see >95% of the errors printed for the CPU
core that is bad, and only a few on other CPUs.

Those other CPU cores tend to be inside the same physical
chip too, so they get replaced at the same time.

CPU failures tend to be oddly specific, often leading to
dozens (or hundreds) of segfaults in bash or python for 
some particular script, while the main workload on the 
system continues running without issues.

Having a small percentage of the segfaults show up on
cores other than the broken one does not cause issues with
detection or diagnosis.

> Can't you read out the CPU number before interrupts are enabled and
> hand
> it down for printing?
> 
We could, but then we would be reading the CPU number
on every page fault, just in case it's a segfault.

That does not seem like a worthwhile tradeoff, given
how much of a hot path page faults are, and how rare
segfaults are.

Furthermore, we still have the possibilities that a thread
on one CPU crashes because another, faulty CPU scribbled
over its data, or that a thread is preempted and migrated
while still in userspace in-between deciding on a bad address
and accessing it.

I'm willing to be convinced otherwise, but staying out of
the hot path for something so rare seems like a typical
thing to do?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2022-08-05 12:54 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-04 19:54 [PATCH v2] x86,mm: print likely CPU at segfault time Rik van Riel
2022-08-04 20:17 ` Ingo Molnar
2022-08-05  3:01   ` Rik van Riel
2022-08-05  7:59     ` Ingo Molnar
2022-08-05 12:54       ` Rik van Riel
2022-08-05 10:08 ` Borislav Petkov
2022-08-05 12:53   ` Rik van Riel [this message]
2022-08-05 13:25     ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2239a6e4f5e9d12ef7a55da6dba716df681201ff.camel@surriel.com \
    --to=riel@surriel.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@intel.com \
    --cc=dsj@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).