linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: "Christian König" <christian.koenig@amd.com>
Cc: "Lazar, Lijo" <lijo.lazar@amd.com>, Stefan Roese <sr@denx.de>,
	Tom Seewald <tseewald@gmail.com>,
	regressions@lists.linux.dev, David Airlie <airlied@linux.ie>,
	linux-pci@vger.kernel.org, Xinhui Pan <Xinhui.Pan@amd.com>,
	amd-gfx@lists.freedesktop.org,
	Kai-Heng Feng <kai.heng.feng@canonical.com>,
	Daniel Vetter <daniel@ffwll.ch>,
	Alex Deucher <alexander.deucher@amd.com>
Subject: Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
Date: Thu, 25 Aug 2022 12:48:45 -0500	[thread overview]
Message-ID: <20220825174845.GA2857385@bhelgaas> (raw)
In-Reply-To: <0444020d-e7e6-2fe9-e94e-413c8d3bab38@amd.com>

On Thu, Aug 25, 2022 at 10:18:28AM +0200, Christian König wrote:
> Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
> > On 8/25/2022 1:04 PM, Christian König wrote:
> > > Am 25.08.22 um 08:40 schrieb Stefan Roese:
> > > > On 24.08.22 16:45, Tom Seewald wrote:
> > > > > On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo
> > > > > <lijo.lazar@amd.com> wrote:
> > > > > > Unfortunately, I don't have any NV platforms to test. Attached is an
> > > > > > 'untested-patch' based on your trace logs.
> > > > > ...
> > > > 
> > > > I did not follow this thread in depth, but FWICT the bug is solved now
> > > > with this patch. So is it correct, that the now fully enabled AER
> > > > support in the PCI subsystem in v6.0 helped detecting a bug in the AMD
> > > > GPU driver?
> > > 
> > > It looks like it, but I'm not 100% sure about the rational behind it.
> > > 
> > > Lijo can you explain more on this?
> > 
> > From the trace, during gmc hw_init it takes this route -
> > 
> > gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
> > amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP
> > flush)
> > 
> > HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET
> > (0x80000 - PAGE_SIZE)
> > 
> > WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
> > KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
> > 
> > However, the remapping is not yet done at this point. It's done at a
> > later point during common block initialization. Access to the unmapped
> > offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported request
> > and reported through AER.
> 
> That's interesting behavior. So far AER always indicated some kind of
> transmission error.
> 
> When that happens as well on unmapped areas of the MMIO BAR then we need to
> keep that in mind.

AER can log many different kinds of errors, some related to hardware
issues and some related to software.

PCI writes are normally posted and get no response, so AER is the main
way to find out about writes to unimplemented addresses.

Reads do get a response, of course, and reads to unimplemented
addresses cause errors that most hardware turns into a ~0 data return
(in addition to reporting via AER if enabled).

Bjorn

  reply	other threads:[~2022-08-25 17:48 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <bug-216373-41252@https.bugzilla.kernel.org/>
2022-08-18 20:38 ` [Bug 216373] New: Uncorrected errors reported for AMD GPU Bjorn Helgaas
2022-08-19  7:05   ` Christian König
2022-08-19  8:33     ` Lazar, Lijo
2022-08-19 11:04       ` Bjorn Helgaas
2022-08-19 17:13   ` Bjorn Helgaas
2022-08-19 19:07     ` Bjorn Helgaas
2022-08-20  7:52       ` Lazar, Lijo
2022-08-23 17:04         ` Tom Seewald
2022-08-24  5:10           ` Lazar, Lijo
2022-08-24 14:45             ` Tom Seewald
2022-08-25  6:40               ` Stefan Roese
2022-08-25  7:34                 ` Christian König
2022-08-25  7:54                   ` Lazar, Lijo
2022-08-25  8:18                     ` Christian König
2022-08-25 17:48                       ` Bjorn Helgaas [this message]
2022-08-26  7:10                         ` Christian König
2022-08-25 15:05             ` Felix Kuehling
2022-08-23 16:01   ` [Bug 216373] New: Uncorrected errors reported for AMD GPU #forregzbot Thorsten Leemhuis

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220825174845.GA2857385@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Xinhui.Pan@amd.com \
    --cc=airlied@linux.ie \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=kai.heng.feng@canonical.com \
    --cc=lijo.lazar@amd.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=regressions@lists.linux.dev \
    --cc=sr@denx.de \
    --cc=tseewald@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).