On Fri, 2022-08-05 at 12:08 +0200, Borislav Petkov wrote:
> On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> > Add a printk() to show_signal_msg() to print the CPU, core, and socket
> > at segfault time. This is not perfect, since the task might get
> > rescheduled on another CPU between when the fault hit, and when the
> > message is printed, but in practice this has been good enough to help
> > us identify several bad CPU cores.
> >
> > segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> And what happens when someone is looking at this, the CPU information is
> wrong because we got rescheduled but...
>
> ... someone is missing this important tidbit here that the CPU info
> above is unreliable?
>
> Someone is sent on a wild goose chase.
>
We have been using this patch for the past year, and it does not appear
to be an issue in practice.

When a faulty CPU core causes tasks to segfault, we typically see >95%
of the errors printed for the CPU core that is bad, and only a few on
other CPUs. Those other CPU cores tend to be inside the same physical
chip too, so they get replaced at the same time.

CPU failures tend to be oddly specific, often leading to dozens (or
hundreds) of segfaults in bash or python for some particular script,
while the main workload on the system continues running without issues.

Having a small percentage of the segfaults show up on cores other than
the broken one does not cause issues with detection or diagnosis.

> Can't you read out the CPU number before interrupts are enabled and
> hand it down for printing?
>
We could, but then we would be reading the CPU number on every page
fault, just in case it turns out to be a segfault. That does not seem
like a worthwhile tradeoff, given how much of a hot path page faults
are, and how rare segfaults are.

Furthermore, we still have the possibility that a thread on one CPU
crashes because another, faulty CPU scribbled over its data, or that a
thread is preempted and migrated while still in userspace, in between
deciding on a bad address and accessing it.

I'm willing to be convinced otherwise, but staying out of the hot path
for something so rare seems like a reasonable thing to do?

--
All Rights Reversed.
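(For reference, the change under discussion amounts to roughly the
following addition at the end of show_signal_msg(). This is a sketch
rather than the exact patch; it uses the existing
raw_smp_processor_id(), topology_core_id() and
topology_physical_package_id() helpers to produce the
"on CPU 0 (core 0, socket 0)" part of the message shown above.)

	/*
	 * Read the CPU at print time; the task may already have been
	 * migrated by the time we get here, which is the caveat
	 * discussed in this thread.
	 */
	int cpu = raw_smp_processor_id();

	printk(KERN_CONT " on CPU %d (core %d, socket %d)",
	       cpu, topology_core_id(cpu),
	       topology_physical_package_id(cpu));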