[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
@ 2020-04-13 18:20 Kent Russell
  2020-04-14 13:51 ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Kent Russell @ 2020-04-13 18:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell

This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
The original patch causes a RAS event and subsequent kernel hard-hang
when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
Arcturus

dmesg output at hang time:
[drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
amdgpu 0000:67:00.0: GPU reset begin!
Evicting PASID 0x8000 queues
Started evicting pasid 0x8000
qcm fence wait loop timeout expired
The cp might be in an unrecoverable state due to an unsuccessful queues preemption
Failed to evict process queues
Failed to suspend process 0x8000
Finished evicting pasid 0x8000
Started restoring pasid 0x8000
Finished restoring pasid 0x8000
[drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu: [powerplay] Failed to send message 0x7, response 0x0
amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5

Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
 1 file changed, 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cf5d6e585634..a3f997f84020 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
 	uint32_t hi = ~0;
 	uint64_t last;

-
-#ifdef CONFIG_64BIT
-	last = min(pos + size, adev->gmc.visible_vram_size);
-	if (last > pos) {
-		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
-		size_t count = last - pos;
-
-		if (write) {
-			memcpy_toio(addr, buf, count);
-			mb();
-			amdgpu_asic_flush_hdp(adev, NULL);
-		} else {
-			amdgpu_asic_invalidate_hdp(adev, NULL);
-			mb();
-			memcpy_fromio(buf, addr, count);
-		}
-
-		if (count == size)
-			return;
-
-		pos += count;
-		buf += count / 4;
-		size -= count;
-	}
-#endif
-
 	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
 	for (last = pos + size; pos < last; pos += 4) {
 		uint32_t tmp = pos >> 31;
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-13 18:20 [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2" Kent Russell
@ 2020-04-14 13:51 ` Christian König
  2020-04-14 14:19   ` Russell, Kent
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2020-04-14 13:51 UTC (permalink / raw)
  To: Kent Russell, amd-gfx

Am 13.04.20 um 20:20 schrieb Kent Russell:
> This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> The original patch causes a RAS event and subsequent kernel hard-hang
> when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> Arcturus
>
> dmesg output at hang time:
> [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> amdgpu 0000:67:00.0: GPU reset begin!
> Evicting PASID 0x8000 queues
> Started evicting pasid 0x8000
> qcm fence wait loop timeout expired
> The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> Failed to evict process queues
> Failed to suspend process 0x8000
> Finished evicting pasid 0x8000
> Started restoring pasid 0x8000
> Finished restoring pasid 0x8000
> [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> amdgpu: [powerplay] Failed to set soft min gfxclk !
> amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5

Do you have more information on what's going wrong here since this is a 
really important patch for KFD debugging.

>
> Signed-off-by: Kent Russell <kent.russell@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
>   1 file changed, 26 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cf5d6e585634..a3f997f84020 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>   	uint32_t hi = ~0;
>   	uint64_t last;
>   
> -
> -#ifdef CONFIG_64BIT
> -	last = min(pos + size, adev->gmc.visible_vram_size);
> -	if (last > pos) {
> -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> -		size_t count = last - pos;
> -
> -		if (write) {
> -			memcpy_toio(addr, buf, count);
> -			mb();
> -			amdgpu_asic_flush_hdp(adev, NULL);
> -		} else {
> -			amdgpu_asic_invalidate_hdp(adev, NULL);
> -			mb();
> -			memcpy_fromio(buf, addr, count);
> -		}
> -
> -		if (count == size)
> -			return;
> -
> -		pos += count;
> -		buf += count / 4;
> -		size -= count;
> -	}
> -#endif
> -
>   	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>   	for (last = pos + size; pos < last; pos += 4) {
>   		uint32_t tmp = pos >> 31;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 13:51 ` Christian König
@ 2020-04-14 14:19   ` Russell, Kent
  2020-04-14 14:35     ` Deucher, Alexander
  0 siblings, 1 reply; 14+ messages in thread
From: Russell, Kent @ 2020-04-14 14:19 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx; +Cc: Kuehling, Felix, Kim, Jonathan

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
> 
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
> 
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
> 
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> 
> Reviewed-by: Christian König <christian.koenig@amd.com>
> 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >   	uint32_t hi = ~0;
> >   	uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > -	if (last > pos) {
> > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -		size_t count = last - pos;
> > -
> > -		if (write) {
> > -			memcpy_toio(addr, buf, count);
> > -			mb();
> > -			amdgpu_asic_flush_hdp(adev, NULL);
> > -		} else {
> > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > -			mb();
> > -			memcpy_fromio(buf, addr, count);
> > -		}
> > -
> > -		if (count == size)
> > -			return;
> > -
> > -		pos += count;
> > -		buf += count / 4;
> > -		size -= count;
> > -	}
> > -#endif
> > -
> >   	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >   	for (last = pos + size; pos < last; pos += 4) {
> >   		uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 14:19   ` Russell, Kent
@ 2020-04-14 14:35     ` Deucher, Alexander
  2020-04-14 14:46       ` Koenig, Christian
  0 siblings, 1 reply; 14+ messages in thread
From: Deucher, Alexander @ 2020-04-14 14:35 UTC (permalink / raw)
  To: Russell, Kent, Koenig, Christian, amd-gfx; +Cc: Kuehling, Felix, Kim, Jonathan


[-- Attachment #1.1: Type: text/plain, Size: 4999 bytes --]

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 8737 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 14:35     ` Deucher, Alexander
@ 2020-04-14 14:46       ` Koenig, Christian
  2020-04-14 14:51         ` Kim, Jonathan
  0 siblings, 1 reply; 14+ messages in thread
From: Koenig, Christian @ 2020-04-14 14:46 UTC (permalink / raw)
  To: Deucher, Alexander; +Cc: Kuehling, Felix, Kim, Jonathan, Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 25724 bytes --]

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 44561 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 14:46       ` Koenig, Christian
@ 2020-04-14 14:51         ` Kim, Jonathan
  2020-04-14 18:31           ` Felix Kuehling
  0 siblings, 1 reply; 14+ messages in thread
From: Kim, Jonathan @ 2020-04-14 14:51 UTC (permalink / raw)
  To: Koenig, Christian, Deucher, Alexander
  Cc: Kuehling, Felix, Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 28840 bytes --]

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests or sometimes none at all on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 52482 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 14:51         ` Kim, Jonathan
@ 2020-04-14 18:31           ` Felix Kuehling
  2020-04-14 20:30             ` Kim, Jonathan
  0 siblings, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2020-04-14 18:31 UTC (permalink / raw)
  To: Kim, Jonathan, Koenig, Christian, Deucher, Alexander
  Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 34525 bytes --]

I wouldn't call it premature. Revert is a usual practice when there is a
serious regression that isn't fully understood or root-caused. As far as
I can tell, the problem has been reproduced on multiple systems,
different GPUs, and clearly regressed to Christian's commit. I think
that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing
RAS errors would potentially cause problems in other tests as well. For
example common operations like GART table updates, and GPUVM page table
updates and PCIe peer2peer accesses in ROCm applications use HDP. But
we're not seeing obvious problems from those. So we need to understand
what's special about this test. I asked questions to that effect on our
other email thread.

Regards,
  Felix

Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>  
>
> I think it’s premature to push this revert.
>
>  
>
> With more testing, I’m getting failures from different tests or
> sometimes none at all on my machine.
>
>  
>
> Kent, let’s continue the discussion on the original thread.
>
>  
>
> Thanks,
>
>  
>
> Jon
>
>  
>
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* Tuesday, April 14, 2020 10:47 AM
> *To:* Deucher, Alexander <Alexander.Deucher@amd.com>
> *Cc:* Russell, Kent <Kent.Russell@amd.com>;
> amd-gfx@lists.freedesktop.org; Kuehling, Felix
> <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
>  
>
> That's exactly my concern as well.
>
>  
>
> This looks a bit like the test creates erroneous data somehow, but
> there doesn't seems to be a RAS check in the MM data path.
>
>  
>
> And now that we use the BAR path it goes up in flames.
>
>  
>
> I just don't see how we can create erroneous data in a test case?
>
>  
>
> Christian.
>
>  
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>     [AMD Public Use]
>
>      
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>      
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>     <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>     Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>     *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>     <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>     <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>      
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell@amd.com
>     <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > Am 13.04.20 um 20:20 schrieb Kent Russell:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel
>     hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on
>     VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful
>     > > queues preemption Failed to evict process queues Failed to suspend
>     > > process 0x8000 Finished evicting pasid 0x8000 Started
>     restoring pasid
>     > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state
>     may lost
>     > > due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable
>     all smu
>     > features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all
>     smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
>     of IP
>     > > block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here since
>     this is a really
>     > important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell@amd.com
>     <mailto:kent.russell@amd.com>>
>     >
>     > Reviewed-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     >
>     > > ---
>     > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>     ----------------------
>     > >   1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>     > amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>  
>
>  
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>     [AMD Public Use]
>
>      
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>      
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>     <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>     Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>     *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>     <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>     <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>      
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell@amd.com
>     <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > Am 13.04.20 um 20:20 schrieb Kent Russell:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel
>     hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on
>     VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful
>     > > queues preemption Failed to evict process queues Failed to suspend
>     > > process 0x8000 Finished evicting pasid 0x8000 Started
>     restoring pasid
>     > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state
>     may lost
>     > > due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable
>     all smu
>     > features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all
>     smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
>     of IP
>     > > block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here since
>     this is a really
>     > important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell@amd.com
>     <mailto:kent.russell@amd.com>>
>     >
>     > Reviewed-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     >
>     > > ---
>     > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>     ----------------------
>     > >   1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>     > amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>  
>
>  
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>     [AMD Public Use]
>
>      
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>      
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>     <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>     Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>     *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>     <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>     <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>      
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell@amd.com
>     <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > Am 13.04.20 um 20:20 schrieb Kent Russell:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel
>     hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on
>     VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful
>     > > queues preemption Failed to evict process queues Failed to suspend
>     > > process 0x8000 Finished evicting pasid 0x8000 Started
>     restoring pasid
>     > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state
>     may lost
>     > > due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable
>     all smu
>     > features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all
>     smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
>     of IP
>     > > block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here since
>     this is a really
>     > important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell@amd.com
>     <mailto:kent.russell@amd.com>>
>     >
>     > Reviewed-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     >
>     > > ---
>     > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>     ----------------------
>     > >   1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>     > amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>  
>
>  
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>     [AMD Public Use]
>
>      
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>      
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>     <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>     Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>     *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>     <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>     <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>      
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell@amd.com
>     <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > Am 13.04.20 um 20:20 schrieb Kent Russell:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel
>     hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on
>     VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful
>     > > queues preemption Failed to evict process queues Failed to suspend
>     > > process 0x8000 Finished evicting pasid 0x8000 Started
>     restoring pasid
>     > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state
>     may lost
>     > > due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable
>     all smu
>     > features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all
>     smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
>     of IP
>     > > block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here since
>     this is a really
>     > important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell@amd.com
>     <mailto:kent.russell@amd.com>>
>     >
>     > Reviewed-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     >
>     > > ---
>     > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>     ----------------------
>     > >   1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>     > amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>  
>
>  
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
> [AMD Public Use]
>
>  
>
> If this causes an issue, any access to vram via the BAR could cause an
> issue.
>
>  
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
> <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell,
> Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian <Christian.Koenig@amd.com
> <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
> <mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org
> <mailto:amd-gfx@lists.freedesktop.org>>
> *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
> <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com
> <mailto:Jonathan.Kim@amd.com>>
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
>  
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we get the dmesg
> output below, and then the kernel ends up hanging. I don't know enough
> about the test itself to know why this is occurring, but Jon Kim and
> Felix were discussing it on a separate thread when the issue was first
> reported, so they can hopefully provide some additional information.
>
>  Kent
>
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken@gmail.com
> <mailto:ckoenig.leichtzumerken@gmail.com>>
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent <Kent.Russell@amd.com
> <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
> <mailto:amd-gfx@lists.freedesktop.org>
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> > amdgpu_device_vram_access v2"
> >
> > Am 13.04.20 um 20:20 schrieb Kent Russell:
> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > Arcturus
> > >
> > > dmesg output at hang time:
> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > amdgpu 0000:67:00.0: GPU reset begin!
> > > Evicting PASID 0x8000 queues
> > > Started evicting pasid 0x8000
> > > qcm fence wait loop timeout expired
> > > The cp might be in an unrecoverable state due to an unsuccessful
> > > queues preemption Failed to evict process queues Failed to suspend
> > > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > > due to RAS ERREVENT_ATHUB_INTERRUPT
> > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> > features!
> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu
> features!
> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > > block <powerplay> failed -5
> >
> > Do you have more information on what's going wrong here since this
> is a really
> > important patch for KFD debugging.
> >
> > >
> > > Signed-off-by: Kent Russell <kent.russell@amd.com
> <mailto:kent.russell@amd.com>>
> >
> > Reviewed-by: Christian König <christian.koenig@amd.com
> <mailto:christian.koenig@amd.com>>
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
> ----------------------
> > >   1 file changed, 26 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index cf5d6e585634..a3f997f84020 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> > amdgpu_device *adev, loff_t pos,
> > >      uint32_t hi = ~0;
> > >      uint64_t last;
> > >
> > > -
> > > -#ifdef CONFIG_64BIT
> > > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > > -   if (last > pos) {
> > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > -           size_t count = last - pos;
> > > -
> > > -           if (write) {
> > > -                   memcpy_toio(addr, buf, count);
> > > -                   mb();
> > > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > > -           } else {
> > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > > -                   mb();
> > > -                   memcpy_fromio(buf, addr, count);
> > > -           }
> > > -
> > > -           if (count == size)
> > > -                   return;
> > > -
> > > -           pos += count;
> > > -           buf += count / 4;
> > > -           size -= count;
> > > -   }
> > > -#endif
> > > -
> > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > >      for (last = pos + size; pos < last; pos += 4) {
> > >              uint32_t tmp = pos >> 31;
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>

[-- Attachment #1.2: Type: text/html, Size: 81514 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 18:31           ` Felix Kuehling
@ 2020-04-14 20:30             ` Kim, Jonathan
  2020-04-15  8:11               ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Kim, Jonathan @ 2020-04-14 20:30 UTC (permalink / raw)
  To: Kuehling, Felix, Koenig, Christian, Deucher, Alexander
  Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 30677 bytes --]

[AMD Official Use Only - Internal Distribution Only]

If we're passing the test on the revert, then the only thing that's different is we're not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

Thanks,

Jon

From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"


I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example common operations like GART table updates, and GPUVM page table updates and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix
Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests or sometimes none at all on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 55414 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-14 20:30             ` Kim, Jonathan
@ 2020-04-15  8:11               ` Christian König
  2020-04-15  9:49                 ` Kim, Jonathan
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2020-04-15  8:11 UTC (permalink / raw)
  To: Kim, Jonathan, Kuehling, Felix, Deucher, Alexander; +Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 38013 bytes --]

Hi Jon,

> Also cwsr tests fail on Vega20 with or without the revert with the 
> same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some 
hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

Am 14.04.20 um 22:30 schrieb Kim, Jonathan:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing that’s 
> different is we’re not invalidating HDP and doing a copy to host 
> anymore in amdgpu_device_vram_access since the function is still 
> called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with the 
> same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling@amd.com>
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan <Jonathan.Kim@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> *Cc:* Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. Revert is a usual practice when there is 
> a serious regression that isn't fully understood or root-caused. As 
> far as I can tell, the problem has been reproduced on multiple 
> systems, different GPUs, and clearly regressed to Christian's commit. 
> I think that justifies reverting it for now.
>
> I agree with Christian that a general HDP memory access problem 
> causing RAS errors would potentially cause problems in other tests as 
> well. For example common operations like GART table updates, and GPUVM 
> page table updates and PCIe peer2peer accesses in ROCm applications 
> use HDP. But we're not seeing obvious problems from those. So we need 
> to understand what's special about this test. I asked questions to 
> that effect on our other email thread.
>
> Regards,
>   Felix
>
> Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     I think it’s premature to push this revert.
>
>     With more testing, I’m getting failures from different tests or
>     sometimes none at all on my machine.
>
>     Kent, let’s continue the discussion on the original thread.
>
>     Thanks,
>
>     Jon
>
>     *From:* Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>
>     *Sent:* Tuesday, April 14, 2020 10:47 AM
>     *To:* Deucher, Alexander <Alexander.Deucher@amd.com>
>     <mailto:Alexander.Deucher@amd.com>
>     *Cc:* Russell, Kent <Kent.Russell@amd.com>
>     <mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix
>     <Felix.Kuehling@amd.com> <mailto:Felix.Kuehling@amd.com>; Kim,
>     Jonathan <Jonathan.Kim@amd.com> <mailto:Jonathan.Kim@amd.com>
>     *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     That's exactly my concern as well.
>
>     This looks a bit like the test creates erroneous data somehow, but
>     there doesn't seems to be a RAS check in the MM data path.
>
>     And now that we use the BAR path it goes up in flames.
>
>     I just don't see how we can create erroneous data in a test case?
>
>     Christian.
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>         <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>         Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         <amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>>
>         *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>         <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>         <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell@amd.com
>         <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > Am 13.04.20 um 20:20 schrieb Kent Russell:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent
>         kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram
>         on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an
>         unsuccessful
>         > > queues preemption Failed to evict process queues Failed to
>         suspend
>         > > process 0x8000 Finished evicting pasid 0x8000 Started
>         restoring pasid
>         > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>         state may lost
>         > > due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>         disable all smu
>         > features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable
>         all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>         suspend of IP
>         > > block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here
>         since this is a really
>         > important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell@amd.com
>         <mailto:kent.russell@amd.com>>
>         >
>         > Reviewed-by: Christian König <christian.koenig@amd.com
>         <mailto:christian.koenig@amd.com>>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>         ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>         > amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr =
>         adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > - memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > - amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > - memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>         <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>         Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         <amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>>
>         *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>         <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>         <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell@amd.com
>         <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > Am 13.04.20 um 20:20 schrieb Kent Russell:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent
>         kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram
>         on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an
>         unsuccessful
>         > > queues preemption Failed to evict process queues Failed to
>         suspend
>         > > process 0x8000 Finished evicting pasid 0x8000 Started
>         restoring pasid
>         > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>         state may lost
>         > > due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>         disable all smu
>         > features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable
>         all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>         suspend of IP
>         > > block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here
>         since this is a really
>         > important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell@amd.com
>         <mailto:kent.russell@amd.com>>
>         >
>         > Reviewed-by: Christian König <christian.koenig@amd.com
>         <mailto:christian.koenig@amd.com>>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>         ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>         > amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr =
>         adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > - memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > - amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > - memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>         <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>         Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         <amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>>
>         *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>         <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>         <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell@amd.com
>         <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > Am 13.04.20 um 20:20 schrieb Kent Russell:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent
>         kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram
>         on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an
>         unsuccessful
>         > > queues preemption Failed to evict process queues Failed to
>         suspend
>         > > process 0x8000 Finished evicting pasid 0x8000 Started
>         restoring pasid
>         > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>         state may lost
>         > > due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>         disable all smu
>         > features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable
>         all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>         suspend of IP
>         > > block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here
>         since this is a really
>         > important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell@amd.com
>         <mailto:kent.russell@amd.com>>
>         >
>         > Reviewed-by: Christian König <christian.koenig@amd.com
>         <mailto:christian.koenig@amd.com>>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>         ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>         > amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr =
>         adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > - memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > - amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > - memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>         <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>         Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         <amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>>
>         *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>         <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>         <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell@amd.com
>         <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > Am 13.04.20 um 20:20 schrieb Kent Russell:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent
>         kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram
>         on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an
>         unsuccessful
>         > > queues preemption Failed to evict process queues Failed to
>         suspend
>         > > process 0x8000 Finished evicting pasid 0x8000 Started
>         restoring pasid
>         > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>         state may lost
>         > > due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>         disable all smu
>         > features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable
>         all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>         suspend of IP
>         > > block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here
>         since this is a really
>         > important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell@amd.com
>         <mailto:kent.russell@amd.com>>
>         >
>         > Reviewed-by: Christian König <christian.koenig@amd.com
>         <mailto:christian.koenig@amd.com>>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>         ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>         > amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr =
>         adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > - memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > - amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > - memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>     [AMD Public Use]
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:*amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>     <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>     Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>     *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>     <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>     <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell@amd.com
>     <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > Am 13.04.20 um 20:20 schrieb Kent Russell:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel
>     hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on
>     VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful
>     > > queues preemption Failed to evict process queues Failed to suspend
>     > > process 0x8000 Finished evicting pasid 0x8000 Started
>     restoring pasid
>     > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state
>     may lost
>     > > due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable
>     all smu
>     > features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all
>     smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
>     of IP
>     > > block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here since
>     this is a really
>     > important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell@amd.com
>     <mailto:kent.russell@amd.com>>
>     >
>     > Reviewed-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     >
>     > > ---
>     > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>     ----------------------
>     > >   1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>     > amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > - amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>


[-- Attachment #1.2: Type: text/html, Size: 87667 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-15  8:11               ` Christian König
@ 2020-04-15  9:49                 ` Kim, Jonathan
  2020-04-15 10:58                   ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Kim, Jonathan @ 2020-04-15  9:49 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, Deucher, Alexander
  Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 32671 bytes --]

[AMD Public Use]

Hi Christian,

That could potentially be it.  With additional testing, 2 of 3 Vega20 machines never hit error over BAR access with the PTRACE test.  3 of 3 machines (from the same pool) always hit error with CWSR.
To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk allocated mapped memory and 2 DWORDS outside that boundary (it's only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out of boundary PEEK.

Felix mentioned we don't hit errors over general HDP access but that may not true.  An Arcturus failure sys logs posted (which wasn't tested by me) shows someone launched rocm bandwidth test, hit a VM fault and a RAS event ensued during evictions (I can point the internal ticket or log snippet offline if interested).  Whether the RAS event is BAR access triggered or the result of HW instability is beyond me since I don't have access to the machine.

Thanks,

Jon

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

Hi Jon,

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

Am 14.04.20 um 22:30 schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]

If we're passing the test on the revert, then the only thing that's different is we're not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

Thanks,

Jon

From: Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"


I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example common operations like GART table updates, and GPUVM page table updates and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix
Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests or sometimes none at all on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0


[-- Attachment #1.2: Type: text/html, Size: 59184 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-15  9:49                 ` Kim, Jonathan
@ 2020-04-15 10:58                   ` Christian König
  2020-04-15 15:02                     ` Kuehling, Felix
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2020-04-15 10:58 UTC (permalink / raw)
  To: Kim, Jonathan, Kuehling, Felix, Deucher, Alexander; +Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 43912 bytes --]

> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk 
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only 
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions 
> across the boundary.  The RAS event on the single failing machine 
> happens on the out of boundary PEEK.
>

Well when you access outside of an allocated buffer I would expect that 
we never get as far as even touching the hardware because the kernel 
should block the access with an -EPERM or -EFAULT. So sounds like I'm 
not understanding something correctly here.

Apart from that I completely agree that we need to sort out any other 
RAS event first to make sure that the system is simply not failing randomly.

Regards,
Christian.

Am 15.04.20 um 11:49 schrieb Kim, Jonathan:
>
> [AMD Public Use]
>
> Hi Christian,
>
> That could potentially be it.  With additional testing, 2 of 3 Vega20 
> machines never hit error over BAR access with the PTRACE test.  3 of 3 
> machines (from the same pool) always hit error with CWSR.
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk 
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only 
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions 
> across the boundary.  The RAS event on the single failing machine 
> happens on the out of boundary PEEK.
>
> Felix mentioned we don’t hit errors over general HDP access but that 
> may not true.  An Arcturus failure sys logs posted (which wasn’t 
> tested by me) shows someone launched rocm bandwidth test, hit a VM 
> fault and a RAS event ensued during evictions (I can point the 
> internal ticket or log snippet offline if interested).  Whether the 
> RAS event is BAR access triggered or the result of HW instability is 
> beyond me since I don’t have access to the machine.
>
> Thanks,
>
> Jon
>
> *From:*Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* Wednesday, April 15, 2020 4:11 AM
> *To:* Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix 
> <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> *Cc:* Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
> Hi Jon,
>
>     Also cwsr tests fail on Vega20 with or without the revert with the
>     same RAS error.
>
>
> That sounds like the system/setup has a more general problem.
>
> Could it be that we are seeing RAS errors because there really is some 
> hardware failure, but with the MM path we don't trigger a RAS interrupt?
>
> Thanks,
> Christian.
>
> Am 14.04.20 um 22:30 schrieb Kim, Jonathan:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     If we’re passing the test on the revert, then the only thing
>     that’s different is we’re not invalidating HDP and doing a copy to
>     host anymore in amdgpu_device_vram_access since the function is
>     still called in ttm access_memory with BAR.
>
>     Also cwsr tests fail on Vega20 with or without the revert with the
>     same RAS error.
>
>     Thanks,
>
>     Jon
>
>     *From:* Kuehling, Felix <Felix.Kuehling@amd.com>
>     <mailto:Felix.Kuehling@amd.com>
>     *Sent:* Tuesday, April 14, 2020 2:32 PM
>     *To:* Kim, Jonathan <Jonathan.Kim@amd.com>
>     <mailto:Jonathan.Kim@amd.com>; Koenig, Christian
>     <Christian.Koenig@amd.com> <mailto:Christian.Koenig@amd.com>;
>     Deucher, Alexander <Alexander.Deucher@amd.com>
>     <mailto:Alexander.Deucher@amd.com>
>     *Cc:* Russell, Kent <Kent.Russell@amd.com>
>     <mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     I wouldn't call it premature. Revert is a usual practice when
>     there is a serious regression that isn't fully understood or
>     root-caused. As far as I can tell, the problem has been reproduced
>     on multiple systems, different GPUs, and clearly regressed to
>     Christian's commit. I think that justifies reverting it for now.
>
>     I agree with Christian that a general HDP memory access problem
>     causing RAS errors would potentially cause problems in other tests
>     as well. For example common operations like GART table updates,
>     and GPUVM page table updates and PCIe peer2peer accesses in ROCm
>     applications use HDP. But we're not seeing obvious problems from
>     those. So we need to understand what's special about this test. I
>     asked questions to that effect on our other email thread.
>
>     Regards,
>       Felix
>
>     Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         I think it’s premature to push this revert.
>
>         With more testing, I’m getting failures from different tests
>         or sometimes none at all on my machine.
>
>         Kent, let’s continue the discussion on the original thread.
>
>         Thanks,
>
>         Jon
>
>         *From:* Koenig, Christian <Christian.Koenig@amd.com>
>         <mailto:Christian.Koenig@amd.com>
>         *Sent:* Tuesday, April 14, 2020 10:47 AM
>         *To:* Deucher, Alexander <Alexander.Deucher@amd.com>
>         <mailto:Alexander.Deucher@amd.com>
>         *Cc:* Russell, Kent <Kent.Russell@amd.com>
>         <mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix
>         <Felix.Kuehling@amd.com> <mailto:Felix.Kuehling@amd.com>; Kim,
>         Jonathan <Jonathan.Kim@amd.com> <mailto:Jonathan.Kim@amd.com>
>         *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         That's exactly my concern as well.
>
>         This looks a bit like the test creates erroneous data somehow,
>         but there doesn't seems to be a RAS check in the MM data path.
>
>         And now that we use the BAR path it goes up in flames.
>
>         I just don't see how we can create erroneous data in a test case?
>
>         Christian.
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>             <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig@amd.com
>             <mailto:Christian.Koenig@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             <amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>             <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>             <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com
>             <mailto:ckoenig.leichtzumerken@gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell@amd.com
>             <mailto:kent.russell@amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig@amd.com
>             <mailto:christian.koenig@amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > - return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>             <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig@amd.com
>             <mailto:Christian.Koenig@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             <amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>             <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>             <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com
>             <mailto:ckoenig.leichtzumerken@gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell@amd.com
>             <mailto:kent.russell@amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig@amd.com
>             <mailto:christian.koenig@amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>             <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig@amd.com
>             <mailto:Christian.Koenig@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             <amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>             <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>             <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com
>             <mailto:ckoenig.leichtzumerken@gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell@amd.com
>             <mailto:kent.russell@amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig@amd.com
>             <mailto:christian.koenig@amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>             <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig@amd.com
>             <mailto:Christian.Koenig@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             <amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>             <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>             <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com
>             <mailto:ckoenig.leichtzumerken@gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell@amd.com
>             <mailto:kent.russell@amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig@amd.com
>             <mailto:christian.koenig@amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>         <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of
>         Russell, Kent <Kent.Russell@amd.com <mailto:Kent.Russell@amd.com>>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         <amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>>
>         *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>         <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>         <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell@amd.com
>         <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > Am 13.04.20 um 20:20 schrieb Kent Russell:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent
>         kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram
>         on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an
>         unsuccessful
>         > > queues preemption Failed to evict process queues Failed to
>         suspend
>         > > process 0x8000 Finished evicting pasid 0x8000 Started
>         restoring pasid
>         > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>         state may lost
>         > > due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>         disable all smu
>         > features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable
>         all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>         suspend of IP
>         > > block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here
>         since this is a really
>         > important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell@amd.com
>         <mailto:kent.russell@amd.com>>
>         >
>         > Reviewed-by: Christian König <christian.koenig@amd.com
>         <mailto:christian.koenig@amd.com>>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>         ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>         > amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr =
>         adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > -                   memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > - amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > -                   memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>


[-- Attachment #1.2: Type: text/html, Size: 96054 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-15 10:58                   ` Christian König
@ 2020-04-15 15:02                     ` Kuehling, Felix
  2020-04-16 16:08                       ` Kim, Jonathan
  0 siblings, 1 reply; 14+ messages in thread
From: Kuehling, Felix @ 2020-04-15 15:02 UTC (permalink / raw)
  To: Koenig, Christian, Kim, Jonathan, Deucher, Alexander
  Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 35246 bytes --]

[AMD Official Use Only - Internal Distribution Only]

The test does not access outside of the allocated memory. But it deliberately crosses a boundary where memory can be allocated non-contiguously. This is meant to catch problems where the access function doesn't handle non-contiguous VRAM allocations correctly. However, the way that VRAM allocation has been optimized, I expect that most allocations are contiguous nowadays. However, the more interesting aspect of the test is, that it performs misaligned memory accesses. The MMIO method of accessing VRAM explicitly handles misaligned accesses and breaks them down into dword aligned accesses with proper masking and shifting.

Could the unaligned nature of the memory access have something to do with hitting RAS errors? That's something unique to this test that we wouldn't see on a normal page table update or memory eviction.

Regards,
  Felix

________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 6:58 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"


To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk allocated mapped memory and 2 DWORDS outside that boundary (it’s only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out of boundary PEEK.

Well when you access outside of an allocated buffer I would expect that we never get as far as even touching the hardware because the kernel should block the access with an -EPERM or -EFAULT. So sounds like I'm not understanding something correctly here.

Apart from that I completely agree that we need to sort out any other RAS event first to make sure that the system is simply not failing randomly.

Regards,
Christian.

Am 15.04.20 um 11:49 schrieb Kim, Jonathan:

[AMD Public Use]



Hi Christian,



That could potentially be it.  With additional testing, 2 of 3 Vega20 machines never hit error over BAR access with the PTRACE test.  3 of 3 machines (from the same pool) always hit error with CWSR.

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk allocated mapped memory and 2 DWORDS outside that boundary (it’s only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out of boundary PEEK.



Felix mentioned we don’t hit errors over general HDP access but that may not true.  An Arcturus failure sys logs posted (which wasn’t tested by me) shows someone launched rocm bandwidth test, hit a VM fault and a RAS event ensued during evictions (I can point the internal ticket or log snippet offline if interested).  Whether the RAS event is BAR access triggered or the result of HW instability is beyond me since I don’t have access to the machine.



Thanks,



Jon



From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



Hi Jon,


Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

Am 14.04.20 um 22:30 schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]



If we’re passing the test on the revert, then the only thing that’s different is we’re not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.



Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.



Thanks,



Jon



From: Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example common operations like GART table updates, and GPUVM page table updates and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix

Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]



I think it’s premature to push this revert.



With more testing, I’m getting failures from different tests or sometimes none at all on my machine.



Kent, let’s continue the discussion on the original thread.



Thanks,



Jon



From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



That's exactly my concern as well.



This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.



And now that we use the BAR path it goes up in flames.



I just don't see how we can create erroneous data in a test case?



Christian.



Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0




[-- Attachment #1.2: Type: text/html, Size: 61394 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-15 15:02                     ` Kuehling, Felix
@ 2020-04-16 16:08                       ` Kim, Jonathan
  2020-04-17  7:46                         ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Kim, Jonathan @ 2020-04-16 16:08 UTC (permalink / raw)
  To: Kuehling, Felix, Koenig, Christian, Deucher, Alexander
  Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 36415 bytes --]

[AMD Official Use Only - Internal Distribution Only]

Hi Felix,

You're probably right.

Passing Vega20 system:
[   56.683273] amdgpu: [vram dbg] addr         3e7ffff8, val         deadbeef
[   56.683349] amdgpu: [vram dbg] addr         3efed000, val         cafebabe <- potential misalign access

Failing Vega20 system:
[Apr16 12:00] amdgpu: [vram dbg] addr         be7ffff8, val         deadbeef
[  +0.000082] amdgpu: [vram dbg] addr         befed000, val         ffffffff <- potential misalign access

Thanks,

Jon

From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Wednesday, April 15, 2020 11:02 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"


[AMD Official Use Only - Internal Distribution Only]

The test does not access outside of the allocated memory. But it deliberately crosses a boundary where memory can be allocated non-contiguously. This is meant to catch problems where the access function doesn't handle non-contiguous VRAM allocations correctly. However, the way that VRAM allocation has been optimized, I expect that most allocations are contiguous nowadays. However, the more interesting aspect of the test is, that it performs misaligned memory accesses. The MMIO method of accessing VRAM explicitly handles misaligned accesses and breaks them down into dword aligned accesses with proper masking and shifting.

Could the unaligned nature of the memory access have something to do with hitting RAS errors? That's something unique to this test that we wouldn't see on a normal page table update or memory eviction.

Regards,
  Felix

________________________________
From: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>
Sent: Wednesday, April 15, 2020 6:58 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>; Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Deucher, Alexander <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>
Cc: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"


To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk allocated mapped memory and 2 DWORDS outside that boundary (it's only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out of boundary PEEK.

Well when you access outside of an allocated buffer I would expect that we never get as far as even touching the hardware because the kernel should block the access with an -EPERM or -EFAULT. So sounds like I'm not understanding something correctly here.

Apart from that I completely agree that we need to sort out any other RAS event first to make sure that the system is simply not failing randomly.

Regards,
Christian.

Am 15.04.20 um 11:49 schrieb Kim, Jonathan:

[AMD Public Use]



Hi Christian,



That could potentially be it.  With additional testing, 2 of 3 Vega20 machines never hit error over BAR access with the PTRACE test.  3 of 3 machines (from the same pool) always hit error with CWSR.

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk allocated mapped memory and 2 DWORDS outside that boundary (it's only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out of boundary PEEK.



Felix mentioned we don't hit errors over general HDP access but that may not true.  An Arcturus failure sys logs posted (which wasn't tested by me) shows someone launched rocm bandwidth test, hit a VM fault and a RAS event ensued during evictions (I can point the internal ticket or log snippet offline if interested).  Whether the RAS event is BAR access triggered or the result of HW instability is beyond me since I don't have access to the machine.



Thanks,



Jon



From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



Hi Jon,

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

Am 14.04.20 um 22:30 schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]



If we're passing the test on the revert, then the only thing that's different is we're not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.



Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.



Thanks,



Jon



From: Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example common operations like GART table updates, and GPUVM page table updates and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix

Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:

[AMD Official Use Only - Internal Distribution Only]



I think it's premature to push this revert.



With more testing, I'm getting failures from different tests or sometimes none at all on my machine.



Kent, let's continue the discussion on the original thread.



Thanks,



Jon



From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com><mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix <Felix.Kuehling@amd.com><mailto:Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com><mailto:Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



That's exactly my concern as well.



This looks a bit like the test creates erroneous data somehow, but there doesn't seems to be a RAS check in the MM data path.



And now that we use the BAR path it goes up in flames.



I just don't see how we can create erroneous data in a test case?



Christian.



Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0





Am 14.04.2020 16:35 schrieb "Deucher, Alexander" <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>:

[AMD Public Use]



If this causes an issue, any access to vram via the BAR could cause an issue.



Alex

________________________________

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan <Jonathan.Kim@amd.com<mailto:Jonathan.Kim@amd.com>>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"



[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
> amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0




[-- Attachment #1.2: Type: text/html, Size: 69110 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
  2020-04-16 16:08                       ` Kim, Jonathan
@ 2020-04-17  7:46                         ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2020-04-17  7:46 UTC (permalink / raw)
  To: Kim, Jonathan, Kuehling, Felix, Deucher, Alexander; +Cc: Russell, Kent, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 50869 bytes --]

Why do you think that 3efed000 and befed000 are misaligned addresses?

And see amdgpu_ttm_access_memory(), misaligned accesses are always 
routed to the MM path.

Regards,
Christian.

Am 16.04.20 um 18:08 schrieb Kim, Jonathan:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Felix,
>
> You’re probably right.
>
> Passing Vega20 system:
>
> [   56.683273] amdgpu: [vram dbg] addr         3e7ffff8, val         
> deadbeef
>
> [   56.683349] amdgpu: [vram dbg] addr         3efed000, val         
> cafebabe <- potential misalign access
>
> Failing Vega20 system:
>
> [Apr16 12:00] amdgpu: [vram dbg] addr         be7ffff8, val         
> deadbeef
>
> [  +0.000082] amdgpu: [vram dbg] addr         befed000, val         
> ffffffff <- potential misalign access
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling@amd.com>
> *Sent:* Wednesday, April 15, 2020 11:02 AM
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Kim, Jonathan 
> <Jonathan.Kim@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> *Cc:* Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> The test does not access outside of the allocated memory. But it 
> deliberately crosses a boundary where memory can be allocated 
> non-contiguously. This is meant to catch problems where the access 
> function doesn't handle non-contiguous VRAM allocations correctly. 
> However, the way that VRAM allocation has been optimized, I expect 
> that most allocations are contiguous nowadays. However, the more 
> interesting aspect of the test is, that it performs misaligned memory 
> accesses. The MMIO method of accessing VRAM explicitly handles 
> misaligned accesses and breaks them down into dword aligned accesses 
> with proper masking and shifting.
>
> Could the unaligned nature of the memory access have something to do 
> with hitting RAS errors? That's something unique to this test that we 
> wouldn't see on a normal page table update or memory eviction.
>
> Regards,
>   Felix
>
> ------------------------------------------------------------------------
>
> *From:*Koenig, Christian <Christian.Koenig@amd.com 
> <mailto:Christian.Koenig@amd.com>>
> *Sent:* Wednesday, April 15, 2020 6:58 AM
> *To:* Kim, Jonathan <Jonathan.Kim@amd.com 
> <mailto:Jonathan.Kim@amd.com>>; Kuehling, Felix 
> <Felix.Kuehling@amd.com <mailto:Felix.Kuehling@amd.com>>; Deucher, 
> Alexander <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>
> *Cc:* Russell, Kent <Kent.Russell@amd.com 
> <mailto:Kent.Russell@amd.com>>; amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org>>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
>     To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
>     allocated mapped memory and 2 DWORDS outside that boundary (it’s
>     only about 4MB to the boundary). Then we POKE to swap the DWORD
>     positions across the boundary.  The RAS event on the single
>     failing machine happens on the out of boundary PEEK.
>
>
> Well when you access outside of an allocated buffer I would expect 
> that we never get as far as even touching the hardware because the 
> kernel should block the access with an -EPERM or -EFAULT. So sounds 
> like I'm not understanding something correctly here.
>
> Apart from that I completely agree that we need to sort out any other 
> RAS event first to make sure that the system is simply not failing 
> randomly.
>
> Regards,
> Christian.
>
> Am 15.04.20 um 11:49 schrieb Kim, Jonathan:
>
>     [AMD Public Use]
>
>     Hi Christian,
>
>     That could potentially be it.  With additional testing, 2 of 3
>     Vega20 machines never hit error over BAR access with the PTRACE
>     test.  3 of 3 machines (from the same pool) always hit error with
>     CWSR.
>
>     To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
>     allocated mapped memory and 2 DWORDS outside that boundary (it’s
>     only about 4MB to the boundary). Then we POKE to swap the DWORD
>     positions across the boundary.  The RAS event on the single
>     failing machine happens on the out of boundary PEEK.
>
>     Felix mentioned we don’t hit errors over general HDP access but
>     that may not true.  An Arcturus failure sys logs posted (which
>     wasn’t tested by me) shows someone launched rocm bandwidth test,
>     hit a VM fault and a RAS event ensued during evictions (I can
>     point the internal ticket or log snippet offline if interested). 
>     Whether the RAS event is BAR access triggered or the result of HW
>     instability is beyond me since I don’t have access to the machine.
>
>     Thanks,
>
>     Jon
>
>     *From:* Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>
>     *Sent:* Wednesday, April 15, 2020 4:11 AM
>     *To:* Kim, Jonathan <Jonathan.Kim@amd.com>
>     <mailto:Jonathan.Kim@amd.com>; Kuehling, Felix
>     <Felix.Kuehling@amd.com> <mailto:Felix.Kuehling@amd.com>; Deucher,
>     Alexander <Alexander.Deucher@amd.com>
>     <mailto:Alexander.Deucher@amd.com>
>     *Cc:* Russell, Kent <Kent.Russell@amd.com>
>     <mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     Hi Jon,
>
>         Also cwsr tests fail on Vega20 with or without the revert with
>         the same RAS error.
>
>
>     That sounds like the system/setup has a more general problem.
>
>     Could it be that we are seeing RAS errors because there really is
>     some hardware failure, but with the MM path we don't trigger a RAS
>     interrupt?
>
>     Thanks,
>     Christian.
>
>     Am 14.04.20 um 22:30 schrieb Kim, Jonathan:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         If we’re passing the test on the revert, then the only thing
>         that’s different is we’re not invalidating HDP and doing a
>         copy to host anymore in amdgpu_device_vram_access since the
>         function is still called in ttm access_memory with BAR.
>
>         Also cwsr tests fail on Vega20 with or without the revert with
>         the same RAS error.
>
>         Thanks,
>
>         Jon
>
>         *From:* Kuehling, Felix <Felix.Kuehling@amd.com>
>         <mailto:Felix.Kuehling@amd.com>
>         *Sent:* Tuesday, April 14, 2020 2:32 PM
>         *To:* Kim, Jonathan <Jonathan.Kim@amd.com>
>         <mailto:Jonathan.Kim@amd.com>; Koenig, Christian
>         <Christian.Koenig@amd.com> <mailto:Christian.Koenig@amd.com>;
>         Deucher, Alexander <Alexander.Deucher@amd.com>
>         <mailto:Alexander.Deucher@amd.com>
>         *Cc:* Russell, Kent <Kent.Russell@amd.com>
>         <mailto:Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         I wouldn't call it premature. Revert is a usual practice when
>         there is a serious regression that isn't fully understood or
>         root-caused. As far as I can tell, the problem has been
>         reproduced on multiple systems, different GPUs, and clearly
>         regressed to Christian's commit. I think that justifies
>         reverting it for now.
>
>         I agree with Christian that a general HDP memory access
>         problem causing RAS errors would potentially cause problems in
>         other tests as well. For example common operations like GART
>         table updates, and GPUVM page table updates and PCIe peer2peer
>         accesses in ROCm applications use HDP. But we're not seeing
>         obvious problems from those. So we need to understand what's
>         special about this test. I asked questions to that effect on
>         our other email thread.
>
>         Regards,
>           Felix
>
>         Am 2020-04-14 um 10:51 a.m. schrieb Kim, Jonathan:
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             I think it’s premature to push this revert.
>
>             With more testing, I’m getting failures from different
>             tests or sometimes none at all on my machine.
>
>             Kent, let’s continue the discussion on the original thread.
>
>             Thanks,
>
>             Jon
>
>             *From:* Koenig, Christian <Christian.Koenig@amd.com>
>             <mailto:Christian.Koenig@amd.com>
>             *Sent:* Tuesday, April 14, 2020 10:47 AM
>             *To:* Deucher, Alexander <Alexander.Deucher@amd.com>
>             <mailto:Alexander.Deucher@amd.com>
>             *Cc:* Russell, Kent <Kent.Russell@amd.com>
>             <mailto:Kent.Russell@amd.com>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>; Kuehling, Felix
>             <Felix.Kuehling@amd.com> <mailto:Felix.Kuehling@amd.com>;
>             Kim, Jonathan <Jonathan.Kim@amd.com>
>             <mailto:Jonathan.Kim@amd.com>
>             *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             That's exactly my concern as well.
>
>             This looks a bit like the test creates erroneous data
>             somehow, but there doesn't seems to be a RAS check in the
>             MM data path.
>
>             And now that we use the BAR path it goes up in flames.
>
>             I just don't see how we can create erroneous data in a
>             test case?
>
>             Christian.
>
>             Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>             <Alexander.Deucher@amd.com
>             <mailto:Alexander.Deucher@amd.com>>:
>
>                 [AMD Public Use]
>
>                 If this causes an issue, any access to vram via the
>                 BAR could cause an issue.
>
>                 Alex
>
>                 ------------------------------------------------------------------------
>
>                 *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>                 <mailto:amd-gfx-bounces@lists.freedesktop.org>> on
>                 behalf of Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>
>                 *Sent:* Tuesday, April 14, 2020 10:19 AM
>                 *To:* Koenig, Christian <Christian.Koenig@amd.com
>                 <mailto:Christian.Koenig@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 <amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>>
>                 *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>                 <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>                 <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>                 *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in amdgpu_device_vram_access v2"
>
>                 [AMD Official Use Only - Internal Distribution Only]
>
>                 On VG20 or MI100, as soon as we run the subtest, we
>                 get the dmesg output below, and then the kernel ends
>                 up hanging. I don't know enough about the test itself
>                 to know why this is occurring, but Jon Kim and Felix
>                 were discussing it on a separate thread when the issue
>                 was first reported, so they can hopefully provide some
>                 additional information.
>
>                  Kent
>
>                 > -----Original Message-----
>                 > From: Christian König
>                 <ckoenig.leichtzumerken@gmail.com
>                 <mailto:ckoenig.leichtzumerken@gmail.com>>
>                 > Sent: Tuesday, April 14, 2020 9:52 AM
>                 > To: Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in
>                 > amdgpu_device_vram_access v2"
>                 >
>                 > Am 13.04.20 um 20:20 schrieb Kent Russell:
>                 > > This reverts commit
>                 c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>                 > > The original patch causes a RAS event and
>                 subsequent kernel hard-hang
>                 > > when running the
>                 KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>                 > > Arcturus
>                 > >
>                 > > dmesg output at hang time:
>                 > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT
>                 detected!
>                 > > amdgpu 0000:67:00.0: GPU reset begin!
>                 > > Evicting PASID 0x8000 queues
>                 > > Started evicting pasid 0x8000
>                 > > qcm fence wait loop timeout expired
>                 > > The cp might be in an unrecoverable state due to
>                 an unsuccessful
>                 > > queues preemption Failed to evict process queues
>                 Failed to suspend
>                 > > process 0x8000 Finished evicting pasid 0x8000
>                 Started restoring pasid
>                 > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD
>                 VCPU state may lost
>                 > > due to RAS ERREVENT_ATHUB_INTERRUPT
>                 > > amdgpu: [powerplay] Failed to send message 0x26,
>                 response 0x0
>                 > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>                 > > amdgpu: [powerplay] Failed to upload DPM Bootup
>                 Levels!
>                 > > amdgpu: [powerplay] Failed to send message 0x7,
>                 response 0x0
>                 > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed
>                 to disable all smu
>                 > features!
>                 > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>                 disable all smu features!
>                 > > amdgpu: [powerplay] [PowerOffAsic] Failed to
>                 disable DPM!
>                 > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
>                 *ERROR* suspend of IP
>                 > > block <powerplay> failed -5
>                 >
>                 > Do you have more information on what's going wrong
>                 here since this is a really
>                 > important patch for KFD debugging.
>                 >
>                 > >
>                 > > Signed-off-by: Kent Russell <kent.russell@amd.com
>                 <mailto:kent.russell@amd.com>>
>                 >
>                 > Reviewed-by: Christian König
>                 <christian.koenig@amd.com
>                 <mailto:christian.koenig@amd.com>>
>                 >
>                 > > ---
>                 > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>                 ----------------------
>                 > >   1 file changed, 26 deletions(-)
>                 > >
>                 > > diff --git
>                 a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > index cf5d6e585634..a3f997f84020 100644
>                 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > @@ -254,32 +254,6 @@ void
>                 amdgpu_device_vram_access(struct
>                 > amdgpu_device *adev, loff_t pos,
>                 > >      uint32_t hi = ~0;
>                 > >      uint64_t last;
>                 > >
>                 > > -
>                 > > -#ifdef CONFIG_64BIT
>                 > > -   last = min(pos + size,
>                 adev->gmc.visible_vram_size);
>                 > > -   if (last > pos) {
>                 > > -           void __iomem *addr =
>                 adev->mman.aper_base_kaddr + pos;
>                 > > -           size_t count = last - pos;
>                 > > -
>                 > > -           if (write) {
>                 > > - memcpy_toio(addr, buf, count);
>                 > > -                   mb();
>                 > > - amdgpu_asic_flush_hdp(adev, NULL);
>                 > > -           } else {
>                 > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>                 > > -                   mb();
>                 > > - memcpy_fromio(buf, addr, count);
>                 > > -           }
>                 > > -
>                 > > -           if (count == size)
>                 > > -                   return;
>                 > > -
>                 > > -           pos += count;
>                 > > -           buf += count / 4;
>                 > > -           size -= count;
>                 > > -   }
>                 > > -#endif
>                 > > -
>                 > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>                 > >      for (last = pos + size; pos < last; pos += 4) {
>                 > > uint32_t tmp = pos >> 31;
>                 _______________________________________________
>                 amd-gfx mailing list
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>             Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>             <Alexander.Deucher@amd.com
>             <mailto:Alexander.Deucher@amd.com>>:
>
>                 [AMD Public Use]
>
>                 If this causes an issue, any access to vram via the
>                 BAR could cause an issue.
>
>                 Alex
>
>                 ------------------------------------------------------------------------
>
>                 *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>                 <mailto:amd-gfx-bounces@lists.freedesktop.org>> on
>                 behalf of Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>
>                 *Sent:* Tuesday, April 14, 2020 10:19 AM
>                 *To:* Koenig, Christian <Christian.Koenig@amd.com
>                 <mailto:Christian.Koenig@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 <amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>>
>                 *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>                 <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>                 <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>                 *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in amdgpu_device_vram_access v2"
>
>                 [AMD Official Use Only - Internal Distribution Only]
>
>                 On VG20 or MI100, as soon as we run the subtest, we
>                 get the dmesg output below, and then the kernel ends
>                 up hanging. I don't know enough about the test itself
>                 to know why this is occurring, but Jon Kim and Felix
>                 were discussing it on a separate thread when the issue
>                 was first reported, so they can hopefully provide some
>                 additional information.
>
>                  Kent
>
>                 > -----Original Message-----
>                 > From: Christian König
>                 <ckoenig.leichtzumerken@gmail.com
>                 <mailto:ckoenig.leichtzumerken@gmail.com>>
>                 > Sent: Tuesday, April 14, 2020 9:52 AM
>                 > To: Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in
>                 > amdgpu_device_vram_access v2"
>                 >
>                 > Am 13.04.20 um 20:20 schrieb Kent Russell:
>                 > > This reverts commit
>                 c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>                 > > The original patch causes a RAS event and
>                 subsequent kernel hard-hang
>                 > > when running the
>                 KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>                 > > Arcturus
>                 > >
>                 > > dmesg output at hang time:
>                 > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT
>                 detected!
>                 > > amdgpu 0000:67:00.0: GPU reset begin!
>                 > > Evicting PASID 0x8000 queues
>                 > > Started evicting pasid 0x8000
>                 > > qcm fence wait loop timeout expired
>                 > > The cp might be in an unrecoverable state due to
>                 an unsuccessful
>                 > > queues preemption Failed to evict process queues
>                 Failed to suspend
>                 > > process 0x8000 Finished evicting pasid 0x8000
>                 Started restoring pasid
>                 > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD
>                 VCPU state may lost
>                 > > due to RAS ERREVENT_ATHUB_INTERRUPT
>                 > > amdgpu: [powerplay] Failed to send message 0x26,
>                 response 0x0
>                 > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>                 > > amdgpu: [powerplay] Failed to upload DPM Bootup
>                 Levels!
>                 > > amdgpu: [powerplay] Failed to send message 0x7,
>                 response 0x0
>                 > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed
>                 to disable all smu
>                 > features!
>                 > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>                 disable all smu features!
>                 > > amdgpu: [powerplay] [PowerOffAsic] Failed to
>                 disable DPM!
>                 > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
>                 *ERROR* suspend of IP
>                 > > block <powerplay> failed -5
>                 >
>                 > Do you have more information on what's going wrong
>                 here since this is a really
>                 > important patch for KFD debugging.
>                 >
>                 > >
>                 > > Signed-off-by: Kent Russell <kent.russell@amd.com
>                 <mailto:kent.russell@amd.com>>
>                 >
>                 > Reviewed-by: Christian König
>                 <christian.koenig@amd.com
>                 <mailto:christian.koenig@amd.com>>
>                 >
>                 > > ---
>                 > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>                 ----------------------
>                 > >   1 file changed, 26 deletions(-)
>                 > >
>                 > > diff --git
>                 a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > index cf5d6e585634..a3f997f84020 100644
>                 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > @@ -254,32 +254,6 @@ void
>                 amdgpu_device_vram_access(struct
>                 > amdgpu_device *adev, loff_t pos,
>                 > >      uint32_t hi = ~0;
>                 > >      uint64_t last;
>                 > >
>                 > > -
>                 > > -#ifdef CONFIG_64BIT
>                 > > -   last = min(pos + size,
>                 adev->gmc.visible_vram_size);
>                 > > -   if (last > pos) {
>                 > > -           void __iomem *addr =
>                 adev->mman.aper_base_kaddr + pos;
>                 > > -           size_t count = last - pos;
>                 > > -
>                 > > -           if (write) {
>                 > > - memcpy_toio(addr, buf, count);
>                 > > - mb();
>                 > > - amdgpu_asic_flush_hdp(adev, NULL);
>                 > > -           } else {
>                 > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>                 > > - mb();
>                 > > - memcpy_fromio(buf, addr, count);
>                 > > -           }
>                 > > -
>                 > > -           if (count == size)
>                 > > - return;
>                 > > -
>                 > > -           pos += count;
>                 > > -           buf += count / 4;
>                 > > -           size -= count;
>                 > > -   }
>                 > > -#endif
>                 > > -
>                 > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>                 > >      for (last = pos + size; pos < last; pos += 4) {
>                 > > uint32_t tmp = pos >> 31;
>                 _______________________________________________
>                 amd-gfx mailing list
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>             Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>             <Alexander.Deucher@amd.com
>             <mailto:Alexander.Deucher@amd.com>>:
>
>                 [AMD Public Use]
>
>                 If this causes an issue, any access to vram via the
>                 BAR could cause an issue.
>
>                 Alex
>
>                 ------------------------------------------------------------------------
>
>                 *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>                 <mailto:amd-gfx-bounces@lists.freedesktop.org>> on
>                 behalf of Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>
>                 *Sent:* Tuesday, April 14, 2020 10:19 AM
>                 *To:* Koenig, Christian <Christian.Koenig@amd.com
>                 <mailto:Christian.Koenig@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 <amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>>
>                 *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>                 <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>                 <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>                 *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in amdgpu_device_vram_access v2"
>
>                 [AMD Official Use Only - Internal Distribution Only]
>
>                 On VG20 or MI100, as soon as we run the subtest, we
>                 get the dmesg output below, and then the kernel ends
>                 up hanging. I don't know enough about the test itself
>                 to know why this is occurring, but Jon Kim and Felix
>                 were discussing it on a separate thread when the issue
>                 was first reported, so they can hopefully provide some
>                 additional information.
>
>                  Kent
>
>                 > -----Original Message-----
>                 > From: Christian König
>                 <ckoenig.leichtzumerken@gmail.com
>                 <mailto:ckoenig.leichtzumerken@gmail.com>>
>                 > Sent: Tuesday, April 14, 2020 9:52 AM
>                 > To: Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in
>                 > amdgpu_device_vram_access v2"
>                 >
>                 > Am 13.04.20 um 20:20 schrieb Kent Russell:
>                 > > This reverts commit
>                 c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>                 > > The original patch causes a RAS event and
>                 subsequent kernel hard-hang
>                 > > when running the
>                 KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>                 > > Arcturus
>                 > >
>                 > > dmesg output at hang time:
>                 > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT
>                 detected!
>                 > > amdgpu 0000:67:00.0: GPU reset begin!
>                 > > Evicting PASID 0x8000 queues
>                 > > Started evicting pasid 0x8000
>                 > > qcm fence wait loop timeout expired
>                 > > The cp might be in an unrecoverable state due to
>                 an unsuccessful
>                 > > queues preemption Failed to evict process queues
>                 Failed to suspend
>                 > > process 0x8000 Finished evicting pasid 0x8000
>                 Started restoring pasid
>                 > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD
>                 VCPU state may lost
>                 > > due to RAS ERREVENT_ATHUB_INTERRUPT
>                 > > amdgpu: [powerplay] Failed to send message 0x26,
>                 response 0x0
>                 > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>                 > > amdgpu: [powerplay] Failed to upload DPM Bootup
>                 Levels!
>                 > > amdgpu: [powerplay] Failed to send message 0x7,
>                 response 0x0
>                 > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed
>                 to disable all smu
>                 > features!
>                 > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>                 disable all smu features!
>                 > > amdgpu: [powerplay] [PowerOffAsic] Failed to
>                 disable DPM!
>                 > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
>                 *ERROR* suspend of IP
>                 > > block <powerplay> failed -5
>                 >
>                 > Do you have more information on what's going wrong
>                 here since this is a really
>                 > important patch for KFD debugging.
>                 >
>                 > >
>                 > > Signed-off-by: Kent Russell <kent.russell@amd.com
>                 <mailto:kent.russell@amd.com>>
>                 >
>                 > Reviewed-by: Christian König
>                 <christian.koenig@amd.com
>                 <mailto:christian.koenig@amd.com>>
>                 >
>                 > > ---
>                 > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>                 ----------------------
>                 > >   1 file changed, 26 deletions(-)
>                 > >
>                 > > diff --git
>                 a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > index cf5d6e585634..a3f997f84020 100644
>                 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > @@ -254,32 +254,6 @@ void
>                 amdgpu_device_vram_access(struct
>                 > amdgpu_device *adev, loff_t pos,
>                 > >      uint32_t hi = ~0;
>                 > >      uint64_t last;
>                 > >
>                 > > -
>                 > > -#ifdef CONFIG_64BIT
>                 > > -   last = min(pos + size,
>                 adev->gmc.visible_vram_size);
>                 > > -   if (last > pos) {
>                 > > -           void __iomem *addr =
>                 adev->mman.aper_base_kaddr + pos;
>                 > > -           size_t count = last - pos;
>                 > > -
>                 > > -           if (write) {
>                 > > - memcpy_toio(addr, buf, count);
>                 > > - mb();
>                 > > - amdgpu_asic_flush_hdp(adev, NULL);
>                 > > -           } else {
>                 > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>                 > > - mb();
>                 > > - memcpy_fromio(buf, addr, count);
>                 > > -           }
>                 > > -
>                 > > -           if (count == size)
>                 > > - return;
>                 > > -
>                 > > -           pos += count;
>                 > > -           buf += count / 4;
>                 > > -           size -= count;
>                 > > -   }
>                 > > -#endif
>                 > > -
>                 > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>                 > >      for (last = pos + size; pos < last; pos += 4) {
>                 > >              uint32_t tmp = pos >> 31;
>                 _______________________________________________
>                 amd-gfx mailing list
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>             Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>             <Alexander.Deucher@amd.com
>             <mailto:Alexander.Deucher@amd.com>>:
>
>                 [AMD Public Use]
>
>                 If this causes an issue, any access to vram via the
>                 BAR could cause an issue.
>
>                 Alex
>
>                 ------------------------------------------------------------------------
>
>                 *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>                 <mailto:amd-gfx-bounces@lists.freedesktop.org>> on
>                 behalf of Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>
>                 *Sent:* Tuesday, April 14, 2020 10:19 AM
>                 *To:* Koenig, Christian <Christian.Koenig@amd.com
>                 <mailto:Christian.Koenig@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 <amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>>
>                 *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>                 <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>                 <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>                 *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in amdgpu_device_vram_access v2"
>
>                 [AMD Official Use Only - Internal Distribution Only]
>
>                 On VG20 or MI100, as soon as we run the subtest, we
>                 get the dmesg output below, and then the kernel ends
>                 up hanging. I don't know enough about the test itself
>                 to know why this is occurring, but Jon Kim and Felix
>                 were discussing it on a separate thread when the issue
>                 was first reported, so they can hopefully provide some
>                 additional information.
>
>                  Kent
>
>                 > -----Original Message-----
>                 > From: Christian König
>                 <ckoenig.leichtzumerken@gmail.com
>                 <mailto:ckoenig.leichtzumerken@gmail.com>>
>                 > Sent: Tuesday, April 14, 2020 9:52 AM
>                 > To: Russell, Kent <Kent.Russell@amd.com
>                 <mailto:Kent.Russell@amd.com>>;
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR
>                 if possible in
>                 > amdgpu_device_vram_access v2"
>                 >
>                 > Am 13.04.20 um 20:20 schrieb Kent Russell:
>                 > > This reverts commit
>                 c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>                 > > The original patch causes a RAS event and
>                 subsequent kernel hard-hang
>                 > > when running the
>                 KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>                 > > Arcturus
>                 > >
>                 > > dmesg output at hang time:
>                 > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT
>                 detected!
>                 > > amdgpu 0000:67:00.0: GPU reset begin!
>                 > > Evicting PASID 0x8000 queues
>                 > > Started evicting pasid 0x8000
>                 > > qcm fence wait loop timeout expired
>                 > > The cp might be in an unrecoverable state due to
>                 an unsuccessful
>                 > > queues preemption Failed to evict process queues
>                 Failed to suspend
>                 > > process 0x8000 Finished evicting pasid 0x8000
>                 Started restoring pasid
>                 > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD
>                 VCPU state may lost
>                 > > due to RAS ERREVENT_ATHUB_INTERRUPT
>                 > > amdgpu: [powerplay] Failed to send message 0x26,
>                 response 0x0
>                 > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>                 > > amdgpu: [powerplay] Failed to upload DPM Bootup
>                 Levels!
>                 > > amdgpu: [powerplay] Failed to send message 0x7,
>                 response 0x0
>                 > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed
>                 to disable all smu
>                 > features!
>                 > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>                 disable all smu features!
>                 > > amdgpu: [powerplay] [PowerOffAsic] Failed to
>                 disable DPM!
>                 > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
>                 *ERROR* suspend of IP
>                 > > block <powerplay> failed -5
>                 >
>                 > Do you have more information on what's going wrong
>                 here since this is a really
>                 > important patch for KFD debugging.
>                 >
>                 > >
>                 > > Signed-off-by: Kent Russell <kent.russell@amd.com
>                 <mailto:kent.russell@amd.com>>
>                 >
>                 > Reviewed-by: Christian König
>                 <christian.koenig@amd.com
>                 <mailto:christian.koenig@amd.com>>
>                 >
>                 > > ---
>                 > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>                 ----------------------
>                 > >   1 file changed, 26 deletions(-)
>                 > >
>                 > > diff --git
>                 a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > index cf5d6e585634..a3f997f84020 100644
>                 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                 > > @@ -254,32 +254,6 @@ void
>                 amdgpu_device_vram_access(struct
>                 > amdgpu_device *adev, loff_t pos,
>                 > >      uint32_t hi = ~0;
>                 > >      uint64_t last;
>                 > >
>                 > > -
>                 > > -#ifdef CONFIG_64BIT
>                 > > -   last = min(pos + size,
>                 adev->gmc.visible_vram_size);
>                 > > -   if (last > pos) {
>                 > > -           void __iomem *addr =
>                 adev->mman.aper_base_kaddr + pos;
>                 > > -           size_t count = last - pos;
>                 > > -
>                 > > -           if (write) {
>                 > > - memcpy_toio(addr, buf, count);
>                 > > - mb();
>                 > > - amdgpu_asic_flush_hdp(adev, NULL);
>                 > > -           } else {
>                 > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>                 > > - mb();
>                 > > - memcpy_fromio(buf, addr, count);
>                 > > -           }
>                 > > -
>                 > > -           if (count == size)
>                 > > - return;
>                 > > -
>                 > > -           pos += count;
>                 > > -           buf += count / 4;
>                 > > -           size -= count;
>                 > > -   }
>                 > > -#endif
>                 > > -
>                 > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>                 > >      for (last = pos + size; pos < last; pos += 4) {
>                 > >              uint32_t tmp = pos >> 31;
>                 _______________________________________________
>                 amd-gfx mailing list
>                 amd-gfx@lists.freedesktop.org
>                 <mailto:amd-gfx@lists.freedesktop.org>
>                 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>
>             Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>             <Alexander.Deucher@amd.com
>             <mailto:Alexander.Deucher@amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org
>             <mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig@amd.com
>             <mailto:Christian.Koenig@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             <amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling@amd.com
>             <mailto:Felix.Kuehling@amd.com>>; Kim, Jonathan
>             <Jonathan.Kim@amd.com <mailto:Jonathan.Kim@amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com
>             <mailto:ckoenig.leichtzumerken@gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell@amd.com
>             <mailto:Kent.Russell@amd.com>>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell@amd.com
>             <mailto:kent.russell@amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig@amd.com
>             <mailto:christian.koenig@amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0
>


[-- Attachment #1.2: Type: text/html, Size: 124908 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-04-17  7:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-13 18:20 [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2" Kent Russell
2020-04-14 13:51 ` Christian König
2020-04-14 14:19   ` Russell, Kent
2020-04-14 14:35     ` Deucher, Alexander
2020-04-14 14:46       ` Koenig, Christian
2020-04-14 14:51         ` Kim, Jonathan
2020-04-14 18:31           ` Felix Kuehling
2020-04-14 20:30             ` Kim, Jonathan
2020-04-15  8:11               ` Christian König
2020-04-15  9:49                 ` Kim, Jonathan
2020-04-15 10:58                   ` Christian König
2020-04-15 15:02                     ` Kuehling, Felix
2020-04-16 16:08                       ` Kim, Jonathan
2020-04-17  7:46                         ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.