From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>,
	"Ghannam, Yazen" <Yazen.Ghannam@amd.com>
Cc: x86-ml <x86@kernel.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	lkml <linux-kernel@vger.kernel.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	Borislav Petkov <bp@alien8.de>,
	Alex Deucher <alexdeucher@gmail.com>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Mon, 13 Sep 2021 01:31:31 +0000
Message-ID: <DM4PR12MB5263161FCF60F73D66EADFDEEED99@DM4PR12MB5263.namprd12.prod.outlook.com>
In-Reply-To: <DM4PR12MB5263785A21F34B24A0C7FC89EEEB9@DM4PR12MB5263.namprd12.prod.outlook.com>

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Joshi,
> Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: x86-ml <x86@kernel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; lkml <linux-kernel@vger.kernel.org>;
> amd-gfx@lists.freedesktop.org; Borislav Petkov <bp@alien8.de>; Alex Deucher
> <alexdeucher@gmail.com>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> 
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <Mukul.Joshi@amd.com>
> > Cc: Borislav Petkov <bp@alien8.de>; Alex Deucher
> > <alexdeucher@gmail.com>; x86-ml <x86@kernel.org>; Kasiviswanathan,
> > Harish <Harish.Kasiviswanathan@amd.com>; lkml
> > <linux-kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > > Sorry for picking this up after some time. I thought I had replied to
> > > > this email.
> > > > Yes, it is the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
> 
> Sorry I missed this email completely. Just saw it so responding now.
> 
> At the moment, we don't have a requirement to mark a page "bad" if it has a
> high correctable error count.
> Our previous GPU ASICs which support RAS also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
> 
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the
> > kernel at runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed that doesn't fall
> > under any of the existing priorities. In this case, "special
> > functionality" could be that the GPU memory needs to be handled
> > differently than CPU memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers
> > with the same priority are okay, right?
> >
> With respect to the MCE priority, I was thinking of using MCE_PRIO_EDAC
> instead of creating a new priority, as the code in the GPU driver is doing
> error detection and handling the uncorrectable errors.
> I'm not sure if that aligns with the definition of an EDAC device in the
> kernel.
> 
> What do you think?
> 
> Regards,
> Mukul
> 

After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE
priority, as we are dealing only with uncorrectable errors.
I will send out a v2 patch that uses MCE_PRIO_UC and drops MCE_PRIO_ACCEL,
and see what the feedback is.
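
To illustrate why the priority choice matters, here is a minimal userspace
sketch of a priority-ordered notifier chain. It is not the driver code and
not the kernel's implementation: every demo_* name, the numeric priority
values, and the handler behavior are hypothetical stand-ins. The sketch
assumes that handlers run from highest to lowest priority, that handlers
with equal priority run in registration order, and that a handler can stop
the chain by returning a "stop" value.

```c
#include <stddef.h>

/* Hypothetical priorities; the ordering here is for illustration only. */
enum demo_prio { DEMO_PRIO_LOWEST, DEMO_PRIO_EDAC, DEMO_PRIO_UC, DEMO_PRIO_CEC };

#define DEMO_DONE 0   /* handler finished, keep walking the chain */
#define DEMO_STOP 1   /* handler consumed the error, stop the chain */

/* Hypothetical error record: a flag plus a log of which handlers ran. */
struct demo_err {
    int uncorrectable;
    int seen[4];
    int nseen;
};

struct demo_nb {
    int (*call)(struct demo_nb *nb, void *data);
    int priority;
    struct demo_nb *next;
};

static struct demo_nb *demo_chain;

/* Insert nb so the list stays sorted by descending priority.
 * A new block is placed after existing blocks of equal priority. */
static void demo_register(struct demo_nb *nb)
{
    struct demo_nb **p = &demo_chain;

    while (*p && (*p)->priority >= nb->priority)
        p = &(*p)->next;
    nb->next = *p;
    *p = nb;
}

/* Walk the chain in priority order; return how many handlers were called. */
static int demo_call_chain(void *data)
{
    int called = 0;

    for (struct demo_nb *nb = demo_chain; nb; nb = nb->next) {
        called++;
        if (nb->call(nb, data) == DEMO_STOP)
            break;
    }
    return called;
}

/* Stand-in for a correctable-error collector: consumes correctable errors. */
static int demo_cec_handler(struct demo_nb *nb, void *data)
{
    struct demo_err *e = data;

    (void)nb;
    e->seen[e->nseen++] = DEMO_PRIO_CEC;
    return e->uncorrectable ? DEMO_DONE : DEMO_STOP;
}

/* Stand-in for a GPU bad-page handler that only acts on uncorrectable errors. */
static int demo_gpu_uc_handler(struct demo_nb *nb, void *data)
{
    struct demo_err *e = data;

    (void)nb;
    e->seen[e->nseen++] = DEMO_PRIO_UC;
    return DEMO_DONE;
}

/* Stand-in for an EDAC-style decoder that runs later in the chain. */
static int demo_edac_handler(struct demo_nb *nb, void *data)
{
    struct demo_err *e = data;

    (void)nb;
    e->seen[e->nseen++] = DEMO_PRIO_EDAC;
    return DEMO_DONE;
}
```

To try it, a caller would wrap each handler in a struct demo_nb with its
priority, pass each one to demo_register(), and then call demo_call_chain()
with a struct demo_err. In this model an uncorrectable error reaches the
UC-priority handler before the EDAC-priority one, while a correctable error
is consumed earlier and never reaches either.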

Thanks,
Mukul

> > Thanks,
> > Yazen
