From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>,
"Ghannam, Yazen" <Yazen.Ghannam@amd.com>
Cc: x86-ml <x86@kernel.org>,
"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
lkml <linux-kernel@vger.kernel.org>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
Borislav Petkov <bp@alien8.de>,
Alex Deucher <alexdeucher@gmail.com>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Mon, 13 Sep 2021 01:31:31 +0000 [thread overview]
Message-ID: <DM4PR12MB5263161FCF60F73D66EADFDEEED99@DM4PR12MB5263.namprd12.prod.outlook.com> (raw)
In-Reply-To: <DM4PR12MB5263785A21F34B24A0C7FC89EEEB9@DM4PR12MB5263.namprd12.prod.outlook.com>
[AMD Official Use Only]
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Joshi,
> Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: x86-ml <x86@kernel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; lkml <linux-kernel@vger.kernel.org>;
> amd-gfx@lists.freedesktop.org; Borislav Petkov <bp@alien8.de>; Alex Deucher
> <alexdeucher@gmail.com>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> [AMD Official Use Only]
>
>
>
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <Mukul.Joshi@amd.com>
> > Cc: Borislav Petkov <bp@alien8.de>; Alex Deucher
> > <alexdeucher@gmail.com>; x86-ml <x86@kernel.org>; Kasiviswanathan,
> > Harish <Harish.Kasiviswanathan@amd.com>; lkml
> > <linux-kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > Sorry picking this up after sometime. I thought I had replied to this email.
> > > Yes it is the same deferred interrupt which calls
> > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
>
> Sorry I missed this email completely. Just saw it so responding now.
>
> At the moment, we don't have a requirement to mark a page "bad" if there is a
> high correctable error counts.
> Our previous GPU ASICs which support RAS, also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
>
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the kernel at
> runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed which don't fall
> > under any exisiting things. In this case, "special functionality"
> > could be that the GPU memory needs to be handled differently than CPU
> memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic
> > (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same
> > priority is okay, right?
> >
> With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC
> instead of creating a new priority as the code in the GPU driver is doing error
> detection and handling the uncorrectable errors.
> Not sure if that aligns with the definition of EDAC device in the kernel.
>
> What do you think?
>
> Regards,
> Mukul
>
After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE priority as we are dealing
only with uncorrectable errors.
I will be sending out a v2 patch with changes to use the MCE_PRIO_UC and drop the MCE_PRIO_ACCEL and see what the feedback is.
Thanks,
Mukul
> > Thanks,
> > Yazen
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
> edesktop.org%2Fmailman%2Flistinfo%2Famd-
> gfx&data=04%7C01%7Cmukul.joshi%40amd.com%7C7d32897fddef448ab0
> aa08d952ecf41f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376
> 31999953383488%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJ
> QIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=YWZz9
> OYTMOhBl4183kV5ZYj01yw0xwNj%2BjTdXejFKH8%3D&reserved=0
next prev parent reply other threads:[~2021-09-13 1:31 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20210512013058.6827-1-mukul.joshi@amd.com>
2021-05-12 9:36 ` [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Borislav Petkov
2021-05-12 19:00 ` Joshi, Mukul
2021-05-12 21:05 ` Borislav Petkov
2021-05-13 3:20 ` Joshi, Mukul
2021-05-13 9:53 ` Borislav Petkov
2021-05-13 14:17 ` Alex Deucher
2021-05-13 14:30 ` Borislav Petkov
2021-05-13 14:32 ` Alex Deucher
2021-05-13 14:57 ` Borislav Petkov
2021-05-13 15:02 ` Alex Deucher
2021-05-13 23:14 ` Joshi, Mukul
2021-05-14 7:03 ` Borislav Petkov
2021-05-27 19:54 ` Joshi, Mukul
2021-06-03 21:13 ` Yazen Ghannam
2021-07-29 23:59 ` Joshi, Mukul
2021-09-13 1:31 ` Joshi, Mukul [this message]
2021-05-13 23:10 ` Joshi, Mukul
2021-05-14 7:05 ` Borislav Petkov
2021-05-14 13:06 ` Joshi, Mukul
2021-05-14 14:38 ` Borislav Petkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DM4PR12MB5263161FCF60F73D66EADFDEEED99@DM4PR12MB5263.namprd12.prod.outlook.com \
--to=mukul.joshi@amd.com \
--cc=Harish.Kasiviswanathan@amd.com \
--cc=Yazen.Ghannam@amd.com \
--cc=alexdeucher@gmail.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=bp@alien8.de \
--cc=linux-kernel@vger.kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).