On Fri, 2022-08-05 at 12:08 +0200, Borislav Petkov wrote:
> On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> > Add a printk() to show_signal_msg() to print the CPU, core, and socket
> > at segfault time. This is not perfect, since the task might get
> > rescheduled on another CPU between when the fault hit, and when the
> > message is printed, but in practice this has been good enough to help
> > us identify several bad CPU cores.
> >
> > segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> And what happens when someone is looking at this, the CPU information is
> wrong because we got rescheduled but...
>
> ... someone is missing this important tidbit here that the CPU info
> above is unreliable?
>
> Someone is sent on a wild goose chase.
>
We have been using this patch for the past year, and it does not appear
to be an issue in practice.

When a faulty CPU core causes tasks to segfault, we typically see >95%
of the errors printed for the CPU core that is bad, and only a few on
other CPUs. Those other CPU cores tend to be inside the same physical
chip too, so they get replaced at the same time.

CPU failures tend to be oddly specific, often leading to dozens (or
hundreds) of segfaults in bash or python for some particular script,
while the main workload on the system continues running without issues.

Having a small percentage of the segfaults show up on cores other than
the broken one does not cause issues with detection or diagnosis.

> Can't you read out the CPU number before interrupts are enabled and
> hand it down for printing?
>
We could, but then we would be reading the CPU number on every page
fault, just in case it turns out to be a segfault. That does not seem
like a worthwhile tradeoff, given how much of a hot path page faults
are, and how rare segfaults are.

Furthermore, we still have the possibility that a thread on one CPU
crashes because another, faulty CPU scribbled over its data, or that a
thread is preempted and migrated while still in userspace, in between
deciding on a bad address and accessing it.

I'm willing to be convinced otherwise, but staying out of the hot path
for something so rare seems like a reasonable thing to do?

--
All Rights Reversed.
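(For reference, the change under discussion amounts to roughly the
following addition at the end of show_signal_msg(). This is a sketch
rather than the exact patch; it uses the existing
raw_smp_processor_id(), topology_core_id() and
topology_physical_package_id() helpers to produce the
"on CPU 0 (core 0, socket 0)" part of the message shown above.)

	/*
	 * Read the CPU at print time; the task may already have been
	 * migrated by the time we get here, which is the caveat
	 * discussed in this thread.
	 */
	int cpu = raw_smp_processor_id();

	printk(KERN_CONT " on CPU %d (core %d, socket %d)",
	       cpu, topology_core_id(cpu),
	       topology_physical_package_id(cpu));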