From: Tony Luck <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "Naik, Avadhut" <avadnaik@amd.com>,
	"Mehta, Sohil" <sohil.mehta@intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"yazen.ghannam@amd.com" <yazen.ghannam@amd.com>,
	Avadhut Naik <avadhut.naik@amd.com>
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records
Date: Thu, 29 Feb 2024 09:47:24 -0800
Message-ID: <ZeDDLLQWPyyZve_s@agluck-desk3>
In-Reply-To: <20240229083951.GAZeBC1yS3MPonWwKv@fat_crate.local>

On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> > Somewhat confused here. Weren't we also exploring ways to avoid
> > duplicate records from being added to the genpool? Has something
> > changed in that regard?
> 
> You can always send patches proposing how *you* think this duplicate
> elimination should look like and we can talk. :)
> 
> I don't think anyone would mind it if done properly but first you'd need
> a real-life use case. As in, do we log sooo many duplicates such that
> we'd want to dedup?

There are definitely cases where dedup will not help. If a row fails in a
DIMM there will be a flood of correctable errors with different addresses
(depending on the number of channels in the interleave schema for a system
this may be dozens or hundreds of distinct addresses).

Same for other failures in structures like column and rank.
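
To make the argument above concrete, here is a rough user-space sketch
(Python, not kernel code) of a fixed-capacity record pool with optional
dedup. Everything in it is an assumption made for illustration: the
names Record, Pool, stuck_address and row_failure are hypothetical, the
bank/status/addr values are arbitrary placeholders, the dedup key
(bank, status, addr) is just one plausible choice, and the pool is never
drained, so it models only a burst that arrives faster than records are
consumed. The capacity of 32 is the figure quoted later in this message.

```python
from dataclasses import dataclass

POOL_CAPACITY = 32  # figure quoted in this thread; used here as a plain constant


@dataclass(frozen=True)
class Record:
    """Hypothetical, heavily simplified stand-in for an error record."""
    bank: int
    status: int
    addr: int


class Pool:
    """Fixed-capacity pool; optionally skips records identical to a stored one."""

    def __init__(self, capacity, dedup):
        self.capacity = capacity
        self.dedup = dedup
        self.records = []
        self.seen = set()
        self.duplicates = 0  # records skipped because an identical one is stored
        self.dropped = 0     # records lost because the pool was full

    def add(self, rec):
        if self.dedup and rec in self.seen:
            self.duplicates += 1
            return
        if len(self.records) >= self.capacity:
            self.dropped += 1
            return
        self.records.append(rec)
        self.seen.add(rec)


def stuck_address(repeats):
    """One failing location reported over and over (dedup-friendly)."""
    for _ in range(repeats):
        yield Record(bank=7, status=0x1, addr=0x10000000)


def row_failure(n_addrs, repeats):
    """Many distinct addresses, each reported `repeats` times (placeholder stride)."""
    for _ in range(repeats):
        for i in range(n_addrs):
            yield Record(bank=7, status=0x1, addr=0x10000000 + i * 0x40)


def run(events, dedup, capacity=POOL_CAPACITY):
    pool = Pool(capacity, dedup)
    for rec in events:
        pool.add(rec)
    return pool


if __name__ == "__main__":
    for name, make in (("stuck address", lambda: stuck_address(200)),
                       ("row failure", lambda: row_failure(100, 2))):
        for dedup in (False, True):
            p = run(make(), dedup)
            print(f"{name:13s} dedup={dedup!s:5s} stored={len(p.records):2d} "
                  f"duplicates={p.duplicates:3d} dropped={p.dropped:3d}")
```

In this toy model dedup collapses the repeated single address down to
one stored record, but in the row-failure pattern the pool still fills
with distinct addresses and later records are dropped either way.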

As to "real-life" use cases. A search on Lore for "MCE records pool
full!" only finds threads about modifications to this code. So the
general population of Linux developers isn't seeing this.

A search in my internal e-mail box, though, kicks up a dozen or so distinct
hits from internal validation teams in just the past year. But those
folks are super-dedicated to finding corner cases. Just this morning I
got a triumphant e-mail from someone who reproduced an issue "after 1.6
million error injections".

I'd bet that large cloud providers with systems numbered in the hundreds
of thousands see the MCE pool exhausted from time to time.

Open question is "How many error records do you need to diagnose the
cause of various floods of errors?"

I'm going to say that the current 32 (see earlier e-mail today to
Sohil https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ )
isn't enough for big systems. It may be hard to distinguish between
the various bulk fail modes with just that many error logs.

-Tony
