From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "Joshi, Mukul" <Mukul.Joshi@amd.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"x86@kernel.org" <x86@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"mingo@redhat.com" <mingo@redhat.com>,
"mchehab@kernel.org" <mchehab@kernel.org>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS
Date: Mon, 27 Sep 2021 18:37:05 +0000
Message-ID: <YVIPUa6vaax7DVyE@yaz-ubuntu>
In-Reply-To: <YU8GGSrQSbAZPz4z@zn.tnic>
On Sat, Sep 25, 2021 at 01:20:57PM +0200, Borislav Petkov wrote:
> On Fri, Sep 24, 2021 at 07:46:10PM +0000, Yazen Ghannam wrote:
> > I agree with you in general. But this device isn't really a GPU. And
> > users of this device seem to want to count *every* error, at least for
> > now.
>
> Aha, so something accelerator-y where they do general purpose computation.
>
> So what's the big picture here: they count all the errors and when they
> reach a certain amount, they decide to replace the GPUs just in case?
>
> Or wait until they become uncorrectable? But then it doesn't matter
> because we will handle it properly by excluding the VRAM range from
> further use.
>
> Or do they wanna see *when* they had the correctable errors so that they
> can restart the computation, just in case.
>
> Dunno, it would be a lot helpful if we had some RAS strategy for those
> things...
>
I completely agree. The system integrators have their own policies for error
tracking, part replacement, etc. I expect they'll propose kernel changes if
they want any. Though I think general strategies will become apparent once
these sorts of devices are in wider use.
Thanks,
Yazen