> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions
> across the boundary.  The RAS event on the single failing machine
> happens on the out of boundary PEEK.

Well, when you access outside of an allocated buffer I would expect that we never get as far as even touching the hardware, because the kernel should block the access with an -EPERM or -EFAULT.

So it sounds like I'm not understanding something correctly here.

Apart from that, I completely agree that we need to sort out any other RAS event first to make sure that the system is not simply failing randomly.

Regards,
Christian.

On 15.04.20 at 11:49, Kim, Jonathan wrote:
>
> [AMD Public Use]
>
> Hi Christian,
>
> That could potentially be it.  With additional testing, 2 of 3 Vega20
> machines never hit an error over BAR access with the PTRACE test.  3 of 3
> machines (from the same pool) always hit an error with CWSR.
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions
> across the boundary.  The RAS event on the single failing machine
> happens on the out of boundary PEEK.
>
> Felix mentioned we don’t hit errors over general HDP access but that
> may not be true.  The posted sys logs from an Arcturus failure (which
> wasn’t tested by me) show that someone launched the rocm bandwidth test,
> hit a VM fault and a RAS event ensued during evictions (I can point to
> the internal ticket or log snippet offline if interested).  Whether the
> RAS event is BAR access triggered or the result of HW instability is
> beyond me since I don’t have access to the machine.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian
> *Sent:* Wednesday, April 15, 2020 4:11 AM
> *To:* Kim, Jonathan; Kuehling, Felix; Deucher, Alexander
> *Cc:* Russell, Kent; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Hi Jon,
>
> > Also cwsr tests fail on Vega20 with or without the revert with the
> > same RAS error.
>
> That sounds like the system/setup has a more general problem.
>
> Could it be that we are seeing RAS errors because there really is some
> hardware failure, but with the MM path we don't trigger a RAS interrupt?
>
> Thanks,
> Christian.
>
> On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing
> that’s different is we’re not invalidating HDP and doing a copy to
> host anymore in amdgpu_device_vram_access since the function is
> still called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with the
> same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan; Koenig, Christian; Deucher, Alexander
> *Cc:* Russell, Kent; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
> in amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. Revert is a usual practice when
> there is a serious regression that isn't fully understood or
> root-caused. As far as I can tell, the problem has been reproduced
> on multiple systems, different GPUs, and clearly regressed to
> Christian's commit. I think that justifies reverting it for now.
>
> I agree with Christian that a general HDP memory access problem
> causing RAS errors would potentially cause problems in other tests
> as well.
> For example, common operations like GART table updates,
> GPUVM page table updates and PCIe peer2peer accesses in ROCm
> applications use HDP. But we're not seeing obvious problems from
> those. So we need to understand what's special about this test. I
> asked questions to that effect on our other email thread.
>
> Regards,
>   Felix
>
> On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> I think it’s premature to push this revert.
>
> With more testing, I’m getting failures from different tests
> or sometimes none at all on my machine.
>
> Kent, let’s continue the discussion on the original thread.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian
> *Sent:* Tuesday, April 14, 2020 10:47 AM
> *To:* Deucher, Alexander
> *Cc:* Russell, Kent; amd-gfx@lists.freedesktop.org; Kuehling, Felix;
> Kim, Jonathan
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> That's exactly my concern as well.
>
> This looks a bit like the test creates erroneous data somehow,
> but there doesn't seem to be a RAS check in the MM data path.
>
> And now that we use the BAR path it goes up in flames.
>
> I just don't see how we can create erroneous data in a test case?
>
> Christian.
>
> On 14.04.2020 at 16:35, "Deucher, Alexander" wrote:
>
> [AMD Public Use]
>
> If this causes an issue, any access to vram via the BAR
> could cause an issue.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:* amd-gfx on behalf of Russell, Kent
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian; amd-gfx@lists.freedesktop.org
> *Cc:* Kuehling, Felix; Kim, Jonathan
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we get
> the dmesg output below, and then the kernel ends up
> hanging. I don't know enough about the test itself to know
> why this is occurring, but Jon Kim and Felix were
> discussing it on a separate thread when the issue was
> first reported, so they can hopefully provide some
> additional information.
>
>  Kent
>
> > -----Original Message-----
> > From: Christian König
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> > amdgpu_device_vram_access v2"
> >
> > On 13.04.20 at 20:20, Kent Russell wrote:
> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > Arcturus
> > >
> > > dmesg output at hang time:
> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > amdgpu 0000:67:00.0: GPU reset begin!
> > > Evicting PASID 0x8000 queues
> > > Started evicting pasid 0x8000
> > > qcm fence wait loop timeout expired
> > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > > Failed to evict process queues
> > > Failed to suspend process 0x8000
> > > Finished evicting pasid 0x8000
> > > Started restoring pasid 0x8000
> > > Finished restoring pasid 0x8000
> > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -5
> >
> > Do you have more information on what's going wrong here, since this is
> > a really important patch for KFD debugging?
> >
> > >
> > > Signed-off-by: Kent Russell
> > >
> > Reviewed-by: Christian König
> > >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> > >  1 file changed, 26 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index cf5d6e585634..a3f997f84020 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> > >          uint32_t hi = ~0;
> > >          uint64_t last;
> > >
> > > -
> > > -#ifdef CONFIG_64BIT
> > > -        last = min(pos + size, adev->gmc.visible_vram_size);
> > > -        if (last > pos) {
> > > -                void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > -                size_t count = last - pos;
> > > -
> > > -                if (write) {
> > > -                        memcpy_toio(addr, buf, count);
> > > -                        mb();
> > > -                        amdgpu_asic_flush_hdp(adev, NULL);
> > > -                } else {
> > > -                        amdgpu_asic_invalidate_hdp(adev, NULL);
> > > -                        mb();
> > > -                        memcpy_fromio(buf, addr, count);
> > > -                }
> > > -
> > > -                if (count == size)
> > > -                        return;
> > > -
> > > -                pos += count;
> > > -                buf += count / 4;
> > > -                size -= count;
> > > -        }
> > > -#endif
> > > -
> > >          spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > >          for (last = pos + size; pos < last; pos += 4) {
> > >                  uint32_t tmp = pos >> 31;
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0
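
For readers following the hunk being reverted: it split each request at adev->gmc.visible_vram_size, served the part below that limit through the BAR with memcpy_toio/memcpy_fromio, and left the remainder to the indexed MM register loop. The arithmetic can be sketched in plain userspace C as below. This is only an illustration under stated assumptions, not the driver code: struct vram_split and split_vram_access are hypothetical names, and nothing here touches hardware or HDP.

```c
#include <stdint.h>

/* Hypothetical result type: how many bytes each access path would handle. */
struct vram_split {
    uint64_t bar_bytes;  /* bytes below the CPU-visible limit (BAR path) */
    uint64_t mmio_bytes; /* remaining bytes (indexed MM register path) */
};

/*
 * Mirrors the arithmetic of the reverted hunk:
 *   last  = min(pos + size, visible_vram_size);
 *   count = last - pos;   (only when last > pos)
 * Assumes pos + size does not overflow 64 bits.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, 0 };
    uint64_t end = pos + size;
    uint64_t last = end < visible_vram_size ? end : visible_vram_size;

    if (last > pos)
        s.bar_bytes = last - pos; /* portion reachable through the BAR */

    s.mmio_bytes = size - s.bar_bytes; /* left for the MM_INDEX/MM_DATA loop */
    return s;
}
```

As a usage example, a 16-byte request (4 DWORDs) that starts 8 bytes below the visible limit yields 8 bytes for the BAR path and 8 bytes for the MM path, while a request entirely above the limit yields no BAR bytes at all.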