* Re: [PATCH] drm/radeon: avoid page fault during gpu reset
@ 2020-01-25 19:01 Koenig, Christian
  2020-01-28 13:15 ` Andreas Messer
  0 siblings, 1 reply; 4+ messages in thread
From: Koenig, Christian @ 2020-01-25 19:01 UTC (permalink / raw)
  To: Andreas Messer; +Cc: Deucher, Alexander, Zhou, David(ChunMing), amd-gfx



On 25.01.2020 at 19:47, Andreas Messer <andi@bastelmap.de> wrote:
When backing up a ring, validate pointer to avoid page fault.

When the driver attempts to handle a GPU lockup, a page fault might occur
during the call to radeon_ring_backup() since (*ring->next_rptr_cpu_addr) could
have invalid content:

  [ 3790.348267] radeon 0000:01:00.0: ring 0 stalled for more than 10150msec
  [ 3790.348276] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000699e4 last fence id 0x00000000000699f9 on ring 0)
  [ 3791.504484] BUG: unable to handle page fault for address: ffffba5602800ffc
  [ 3791.504485] #PF: supervisor read access in kernel mode
  [ 3791.504486] #PF: error_code(0x0000) - not-present page
  [ 3791.504487] PGD 851d3b067 P4D 851d3b067 PUD 0
  [ 3791.504488] Oops: 0000 [#1] SMP PTI
  [ 3791.504490] CPU: 5 PID: 268 Comm: kworker/5:1H Tainted: G            E     5.4.8-amesser #3
  [ 3791.504491] Hardware name: Gigabyte Technology Co., Ltd. X170-WS ECC/X170-WS ECC-CF, BIOS F2 06/20/2016
  [ 3791.504507] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
  [ 3791.504520] RIP: 0010:radeon_ring_backup+0xb9/0x130 [radeon]

It seems that my HD7750 enters such a state during thermal shutdown. Here
are the kernel messages with an added debug print and the fix applied:

  [ 2930.783094] radeon 0000:01:00.0: ring 3 stalled for more than 10280msec
  [ 2930.783104] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000011194b last fence id 0x000000000011196a on ring 3)
  [ 2931.936653] radeon 0000:01:00.0: Bad ptr 0xffffffff [   -1] for backup
  [ 2931.937704] radeon 0000:01:00.0: GPU softreset: 0x00000BFD
  [ 2931.937705] radeon 0000:01:00.0:   GRBM_STATUS               = 0xFFFFFFFF
  [ 2931.937707] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFFFFFFFF

NAK, this has been suggested multiple times now and is essentially the wrong approach.

The problem is that the value is invalid because the hardware is no longer functional. Returning here without backing up the ring just papers over the real problem.

This is only the first occurrence, and you would need to fix a couple of hundred register accesses (both inside and outside of the driver) to make that work reliably.

The only advice I can give you is to replace the hardware. From experience, those symptoms mean that your GPU will die rather soon.

Regards,
Christian.



Signed-off-by: Andreas Messer <andi@bastelmap.de>
---
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 37093cea24c5..bf55a682442a 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -309,6 +309,12 @@ unsigned radeon_ring_backup(struct radeon_device *rdev, struct radeon_ring *ring
                 return 0;
         }

+       /* ptr could be invalid after thermal shutdown */
+       if (ptr >= (ring->ring_size / 4)) {
+               mutex_unlock(&rdev->ring_lock);
+               return 0;
+       }
+
         size = ring->wptr + (ring->ring_size / 4);
         size -= ptr;
         size &= ring->ptr_mask;



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] drm/radeon: avoid page fault during gpu reset
  2020-01-25 19:01 [PATCH] drm/radeon: avoid page fault during gpu reset Koenig, Christian
@ 2020-01-28 13:15 ` Andreas Messer
  2020-01-28 16:22   ` Christian König
  0 siblings, 1 reply; 4+ messages in thread
From: Andreas Messer @ 2020-01-28 13:15 UTC (permalink / raw)
  To: Koenig, Christian; +Cc: amd-gfx



On Sat, Jan 25, 2020 at 07:01:36PM +0000, Koenig, Christian wrote:
> 
> 
> On 25.01.2020 at 19:47, Andreas Messer <andi@bastelmap.de> wrote:
> When backing up a ring, validate pointer to avoid page fault.
> [ cut description / kernel messages ] 
> 
> NAK, that was suggested multiple times now and is essentially the wrong
> approach.
>
> The problem is that the value is invalid because the hardware is not
> functional any more. Returning here without backing up the ring just
> papers over the real problem.
> 
> This is only the first occurrence, and you would need to fix a
> couple of hundred register accesses (both inside and outside of the
> driver) to make that work reliably.

Sure, it won't fix the hardware. But since the page fault is the most prominent
part of the kernel log, people will keep suggesting it. With that change,
the kernel messages are full of ring and ATOM BIOS timeouts, which might make
users more likely to suspect a hardware issue in the first place. Anyway:

> The only advice I can give you is to replace the hardware. From
> experience those symptoms mean that your GPU will die rather soon.

I think my hardware is fine. I have monitored the GPU temperature and fan PWM
for a while and found that the PWM is driven at only ~60% even though the GPU
already gets quite hot during gameplay. When I force the PWM to ~80%, no
crash occurs anymore. I suppose it is not the GPU that is crashing but
the VRMs, which are not getting enough airflow.

I have compared the BIOS fan tables of my card with those from the BIOSes of
other cards (downloaded from the web) with the same GPU type and a similar
design. Although they differ in cooler construction and fan, all of them
except one model have exactly the same fan regulation points, with PWMHigh
at 80% for 90°C. The one model with different settings has 100% for this
temperature and a generally much saner-looking regulation curve.

I suppose most of the vendors just copied some reference design;
maybe the vendor's Windows driver adjusts the curve to a better one,
I don't know.

I think I'll add some sysfs attributes or a module parameter to adjust
the curve to my needs.
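
To illustrate what adjusting the curve could mean, here is a minimal
user-space sketch that assumes the fan PWM is interpolated linearly between
regulation points. Only the 90°C / 80% point (and the 100% alternative) comes
from the comparison above; the lower points, all model_* names and the
interpolation scheme itself are assumptions made for illustration, not values
read from any BIOS table or code from the radeon driver.

```c
#include <assert.h>
#include <stddef.h>

/* One regulation point of a hypothetical fan curve. */
struct model_fan_point {
	int temp_c;       /* temperature in degrees Celsius */
	int pwm_percent;  /* fan duty cycle in percent */
};

/*
 * Hypothetical curves. Only the last point of each (90 C -> 80% and
 * 90 C -> 100%) is taken from the message; the lower points are made up.
 */
static const struct model_fan_point model_curve_stock[] = {
	{ 40, 20 }, { 65, 40 }, { 90, 80 },
};
static const struct model_fan_point model_curve_tuned[] = {
	{ 40, 20 }, { 65, 50 }, { 90, 100 },
};

/*
 * Piecewise-linear interpolation over points sorted by ascending
 * temperature. Below the first point and above the last point the
 * PWM value is clamped to that point's value.
 */
static int model_fan_pwm(const struct model_fan_point *pts, size_t n,
			 int temp_c)
{
	size_t i;

	assert(n > 0);

	if (temp_c <= pts[0].temp_c)
		return pts[0].pwm_percent;

	for (i = 1; i < n; i++) {
		if (temp_c <= pts[i].temp_c) {
			int dt = pts[i].temp_c - pts[i - 1].temp_c;
			int dp = pts[i].pwm_percent - pts[i - 1].pwm_percent;

			/* dt > 0 here because temp_c > pts[i - 1].temp_c */
			return pts[i - 1].pwm_percent +
			       dp * (temp_c - pts[i - 1].temp_c) / dt;
		}
	}

	return pts[n - 1].pwm_percent;
}
```

With these made-up tables, the stock curve never exceeds 80% no matter how
hot the card gets, while the tuned curve reaches 100% at 90°C.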

> [ Patch cut out ]

cheers,
Andreas




* Re: [PATCH] drm/radeon: avoid page fault during gpu reset
  2020-01-28 13:15 ` Andreas Messer
@ 2020-01-28 16:22   ` Christian König
  0 siblings, 0 replies; 4+ messages in thread
From: Christian König @ 2020-01-28 16:22 UTC (permalink / raw)
  To: Andreas Messer, Koenig, Christian; +Cc: amd-gfx



On 28.01.20 at 14:15, Andreas Messer wrote:
> On Sat, Jan 25, 2020 at 07:01:36PM +0000, Koenig, Christian wrote:
>>
>> On 25.01.2020 at 19:47, Andreas Messer <andi@bastelmap.de> wrote:
>> When backing up a ring, validate pointer to avoid page fault.
>> [ cut description / kernel messages ]
>>
>> NAK, that was suggested multiple times now and is essentially the wrong
>> approach.
>>
>> The problem is that the value is invalid because the hardware is not
>> functional any more. Returning here without backing up the ring just
>> papers over the real problem.
>>
>> This is only the first occurrence, and you would need to fix a
>> couple of hundred register accesses (both inside and outside of the
>> driver) to make that work reliably.
> Sure, it won't fix the hardware. But since the page fault is the most prominent
> part of the kernel log, people will keep suggesting it. With that change,
> the kernel messages are full of ring and ATOM BIOS timeouts, which might make
> users more likely to suspect a hardware issue in the first place.

That is correct, but the problem is that we currently have 2209 places
where we read a register and usually expect the value to be in a
valid range.

If you really want to avoid all crashes, you would need to audit and fix
every occurrence where, for example, the register value is used as an index
into an array or similar.

And the radeon code is only the beginning; the whole PCIe subsystem
would need a similar audit. That is a huge amount of work that we are
not willing to do.
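
As a rough illustration of what such an audit would have to add at each
of those places, here is a minimal user-space sketch of validating a
register value before it is used as an array index. The all-ones value
matches the GRBM_STATUS = 0xFFFFFFFF readings in the log above; the helper
and every model_* name are hypothetical and do not exist in the radeon
driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value seen for GRBM_STATUS in the log once the device stopped responding. */
#define MODEL_REG_ALL_ONES 0xFFFFFFFFu

enum model_reg_status {
	MODEL_REG_OK,            /* value is a usable index */
	MODEL_REG_DEVICE_GONE,   /* all ones: device probably not responding */
	MODEL_REG_OUT_OF_RANGE,  /* value does not fit the table */
};

/*
 * Hypothetical helper: check a value read from a register before it is
 * used to index a table of table_len entries. *index_out is written only
 * when the value is usable.
 */
static enum model_reg_status model_reg_to_index(uint32_t reg_value,
						size_t table_len,
						size_t *index_out)
{
	assert(index_out != NULL);

	if (reg_value == MODEL_REG_ALL_ONES)
		return MODEL_REG_DEVICE_GONE;

	if (reg_value >= table_len)
		return MODEL_REG_OUT_OF_RANGE;

	*index_out = reg_value;
	return MODEL_REG_OK;
}
```

Every caller would then need its own error path for the two failure cases,
which is where most of the audit effort would go.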

>   Anyway:
>
>> The only advice I can give you is to replace the hardware. From
>> experience those symptoms mean that your GPU will die rather soon.
> I think my hardware is fine. I have monitored the GPU temperature and fan PWM
> for a while and found that the PWM is driven at only ~60% even though the GPU
> already gets quite hot during gameplay. When I force the PWM to ~80%, no
> crash occurs anymore. I suppose it is not the GPU that is crashing but
> the VRMs, which are not getting enough airflow.
>
> I have compared the BIOS fan tables of my card with those from the BIOSes of
> other cards (downloaded from the web) with the same GPU type and a similar
> design. Although they differ in cooler construction and fan, all of them
> except one model have exactly the same fan regulation points, with PWMHigh
> at 80% for 90°C. The one model with different settings has 100% for this
> temperature and a generally much saner-looking regulation curve.
>
> I suppose most of the vendors just copied some reference design;
> maybe the vendor's Windows driver adjusts the curve to a better one,
> I don't know.
>
> I think I'll add some sysfs attributes or a module parameter to adjust
> the curve to my needs.

The issue is that this is most likely not a temperature problem at all.
If you have a temperature problem, the ASIC usually just hangs in a
shader or so, but the BIF is still fully functional (e.g. you can probe
PCI IDs etc.).

This looks more like the ESD protection kicking in for some reason.
In other words, what you have here is a cold or broken solder joint on the
SMD components that loses contact because the material
expands when it warms up.

That is a serious hardware fault and a really good indicator that you
should replace the faulty component ASAP.

Regards,
Christian.

>
>> [ Patch cut out ]
> cheers,
> Andreas
>
>
>

* [PATCH] drm/radeon: avoid page fault during gpu reset
@ 2020-01-25 18:47 Andreas Messer
  0 siblings, 0 replies; 4+ messages in thread
From: Andreas Messer @ 2020-01-25 18:47 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Zhou; +Cc: amd-gfx

When backing up a ring, validate pointer to avoid page fault.

When the driver attempts to handle a GPU lockup, a page fault might occur
during the call to radeon_ring_backup() since (*ring->next_rptr_cpu_addr) could
have invalid content:

  [ 3790.348267] radeon 0000:01:00.0: ring 0 stalled for more than 10150msec
  [ 3790.348276] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000699e4 last fence id 0x00000000000699f9 on ring 0)
  [ 3791.504484] BUG: unable to handle page fault for address: ffffba5602800ffc
  [ 3791.504485] #PF: supervisor read access in kernel mode
  [ 3791.504486] #PF: error_code(0x0000) - not-present page
  [ 3791.504487] PGD 851d3b067 P4D 851d3b067 PUD 0 
  [ 3791.504488] Oops: 0000 [#1] SMP PTI
  [ 3791.504490] CPU: 5 PID: 268 Comm: kworker/5:1H Tainted: G            E     5.4.8-amesser #3
  [ 3791.504491] Hardware name: Gigabyte Technology Co., Ltd. X170-WS ECC/X170-WS ECC-CF, BIOS F2 06/20/2016
  [ 3791.504507] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
  [ 3791.504520] RIP: 0010:radeon_ring_backup+0xb9/0x130 [radeon]

It seems that my HD7750 enters such a state during thermal shutdown. Here
are the kernel messages with an added debug print and the fix applied:

  [ 2930.783094] radeon 0000:01:00.0: ring 3 stalled for more than 10280msec
  [ 2930.783104] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000011194b last fence id 0x000000000011196a on ring 3)
  [ 2931.936653] radeon 0000:01:00.0: Bad ptr 0xffffffff [   -1] for backup
  [ 2931.937704] radeon 0000:01:00.0: GPU softreset: 0x00000BFD
  [ 2931.937705] radeon 0000:01:00.0:   GRBM_STATUS               = 0xFFFFFFFF
  [ 2931.937707] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFFFFFFFF

Signed-off-by: Andreas Messer <andi@bastelmap.de>
---
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 37093cea24c5..bf55a682442a 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -309,6 +309,12 @@ unsigned radeon_ring_backup(struct radeon_device *rdev, struct radeon_ring *ring
 		return 0;
 	}
 
+	/* ptr could be invalid after thermal shutdown */
+	if (ptr >= (ring->ring_size / 4)) {
+		mutex_unlock(&rdev->ring_lock);
+		return 0;
+	}
+
 	size = ring->wptr + (ring->ring_size / 4);
 	size -= ptr;
 	size &= ring->ptr_mask;
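
To make the failure mode easier to follow, here is a simplified user-space
model of the size computation in the hunk above together with the proposed
range check. It is a sketch under assumptions: struct model_ring and the
model_* functions are stand-ins invented for illustration, not the real
struct radeon_ring or driver code, and the model assumes the backup copy
starts by reading the ring at index ptr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the ring state; not the real struct radeon_ring. */
struct model_ring {
	uint32_t ring_size;  /* ring size in bytes, a power of two */
	uint32_t ptr_mask;   /* (ring_size / 4) - 1 */
	uint32_t wptr;       /* write pointer, in dwords */
};

/* True if a read pointer fetched from the device can index the ring. */
static bool model_rptr_valid(const struct model_ring *ring, uint32_t ptr)
{
	return ptr < ring->ring_size / 4;
}

/*
 * Number of dwords between ptr and wptr, using the same wrap-around
 * arithmetic as the hunk above. Returns 0 for an out-of-range ptr,
 * mirroring the early return that the patch adds.
 */
static uint32_t model_backup_size(const struct model_ring *ring, uint32_t ptr)
{
	uint32_t size;

	/* The mask arithmetic only works for power-of-two ring sizes. */
	assert((ring->ring_size & (ring->ring_size - 1)) == 0);

	if (!model_rptr_valid(ring, ptr))
		return 0;

	size = ring->wptr + (ring->ring_size / 4);
	size -= ptr;
	size &= ring->ptr_mask;
	return size;
}

/* Byte offset from the ring base that an unchecked ring[ptr] would touch. */
static uint64_t model_first_access_offset(uint32_t ptr)
{
	return (uint64_t)ptr * sizeof(uint32_t);
}
```

For ptr = 0xffffffff, as in the "Bad ptr" debug line, the unchecked offset
is 0x3fffffffc bytes past the ring base. The faulting address in the first
log ends in ffc, which would be consistent with such an access from a
page-aligned base, although the log alone does not prove it.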
