From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Thu, 13 May 2021 23:10:34 +0000	[thread overview]
Message-ID: <DM4PR12MB5263A719B11C6DF8EF9F3A4BEE519@DM4PR12MB5263.namprd12.prod.outlook.com> (raw)
In-Reply-To: <YJz3CMBFFIDBzVwX@zn.tnic>

> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>;
> lkml <linux-kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
> 
> See how smca_get_long_name() is exported and export that function the same
> way.
> 

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c, and that file doesn't
get compiled when CONFIG_X86_MCE_AMD is not defined.

The amdgpu driver, however, has no dependency on CONFIG_X86_MCE_AMD, so it
must still build when that option is off.

So here is one option that we can try:
1. Export smca_get_bank_type().
2. I wrap all of my code in the GPU driver in #ifdef CONFIG_X86_MCE_AMD.

Will that work for you?
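
To make the two steps concrete, here is a rough userspace sketch of the
pattern. It is not the actual patch: every name ending in _sketch is
hypothetical, the bank lookup is mocked, and the real smca_get_bank_type()
has its own signature in the MCE code. It only shows the caller being
compiled in when CONFIG_X86_MCE_AMD is set and falling back to no-op stubs
otherwise.

```c
#include <stdbool.h>

/*
 * Userspace sketch only. All *_sketch names are hypothetical. In the
 * kernel, CONFIG_X86_MCE_AMD is set by Kconfig, not by hand.
 */
enum bank_type_sketch { BANK_OTHER_SKETCH, BANK_UMC_V2_SKETCH };

#ifdef CONFIG_X86_MCE_AMD
/* Step 1: stand-in for the exported bank type lookup (mocked here). */
static enum bank_type_sketch get_bank_type_sketch(unsigned int bank)
{
	return bank == 0 ? BANK_UMC_V2_SKETCH : BANK_OTHER_SKETCH;
}

/* Step 2: driver code that needs the lookup exists only with the option. */
static bool handle_bad_page_sketch(unsigned int bank)
{
	return get_bank_type_sketch(bank) == BANK_UMC_V2_SKETCH;
}

static bool register_bad_page_handler_sketch(void)
{
	return true;	/* handler is available and registered */
}
#else
/* Without the option there is no lookup to call, so both are no-ops. */
static bool handle_bad_page_sketch(unsigned int bank)
{
	(void)bank;
	return false;
}

static bool register_bad_page_handler_sketch(void)
{
	return false;	/* nothing to register */
}
#endif
```

Callers use register_bad_page_handler_sketch() and
handle_bad_page_sketch() unconditionally; only the definitions sit behind
the #ifdef, so the rest of the driver builds either way.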

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.


> 
> > You can think of the GPU device as an EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
> 
> An EDAC "device", as you call it, is not interested in handling UEs. If anything, it
> counts them.
> 
> > It is a deferred interrupt that generates an MCE.
> 
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
> 
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
> 
> What page is that? Normal DRAM page or a page in some special GPU memory?
> 
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
> 
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
> 
> Looking at
> 
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
> 
> that's some VRAM thing so I'm guessing special memory on the GPU.
> 
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
> 
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
> 
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
> 
> > Hope it clears what "bad page retirement" is achieving.
> 
> It is getting there.
> 
> Thx.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
