All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amd: Fail the suspend if resources can't be evicted
@ 2022-10-26 19:03 Mario Limonciello
  2022-10-27  6:30 ` Christian König
  2022-10-27 14:56 ` Christian König
  0 siblings, 2 replies; 6+ messages in thread
From: Mario Limonciello @ 2022-10-26 19:03 UTC (permalink / raw)
  To: amd-gfx; +Cc: post, Mario Limonciello

If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on proceeding to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6f958603c8cc2..ae10acede495e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
  * at suspend time.
  *
  */
-static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
+static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
 {
+	int ret;
+
 	/* No need to evict vram on APUs for suspend to ram or s2idle */
 	if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
-		return;
+		return 0;
 
-	if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
+	ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
+	if (ret)
 		DRM_WARN("evicting device resources failed\n");
-
+	return ret;
 }
 
 /*
@@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 	if (!adev->in_s0ix)
 		amdgpu_amdkfd_suspend(adev, adev->in_runpm);
 
-	amdgpu_device_evict_resources(adev);
+	r = amdgpu_device_evict_resources(adev);
+	if (r)
+		return r;
 
 	amdgpu_fence_driver_hw_fini(adev);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amd: Fail the suspend if resources can't be evicted
  2022-10-26 19:03 [PATCH] drm/amd: Fail the suspend if resources can't be evicted Mario Limonciello
@ 2022-10-27  6:30 ` Christian König
  2022-10-27 11:32   ` Mario Limonciello
  2022-10-27 13:39   ` Alex Deucher
  2022-10-27 14:56 ` Christian König
  1 sibling, 2 replies; 6+ messages in thread
From: Christian König @ 2022-10-27  6:30 UTC (permalink / raw)
  To: Mario Limonciello, amd-gfx, Alex Deucher; +Cc: post

Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> If a system does not have swap and memory is under 100% usage,
> amdgpu will fail to evict resources.  Currently the suspend
> carries on proceeding to reset the GPU:
>
> ```
> [drm] evicting device resources failed
> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> [drm] free PSP TMR buffer
> [TTM] Failed allocating page table
> [drm] evicting device resources failed
> amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> ```
>
> At this point if the suspend actually succeeded I think that amdgpu
> would have recovered because the GPU would have power cut off and
> restored.  However the kernel fails to continue the suspend from the
> memory pressure and amdgpu fails to run the "resume" from the aborted
> suspend.
>
> ```
> ACPI: PM: Preparing to enter system sleep state S3
> SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
>    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
>    node 0: slabs: 22, objs: 1122, free: 0
> ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
>
> [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> ```
>
> To avoid this series of unfortunate events, fail amdgpu's suspend
> when the memory eviction fails.  This will let the system gracefully
> recover and the user can try suspend again when the memory pressure
> is relieved.

Yeah, I've been thinking about that handling for a while now as well.

Failing to suspend when we are OOM is certainly the right thing to do 
from a technical perspective.

But it also means that when users close their laptop it can happen that 
it keeps running and draining the battery.

On the other hand when you don't have swap configured it's your fault 
and not the drivers.

It's a trade off and I'm not sure what's better. Alex any comment here?

Thanks,
Christian.

>
> Reported-by: post@davidak.de
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6f958603c8cc2..ae10acede495e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>    * at suspend time.
>    *
>    */
> -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>   {
> +	int ret;
> +
>   	/* No need to evict vram on APUs for suspend to ram or s2idle */
>   	if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> -		return;
> +		return 0;
>   
> -	if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> +	ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> +	if (ret)
>   		DRM_WARN("evicting device resources failed\n");
> -
> +	return ret;
>   }
>   
>   /*
> @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (!adev->in_s0ix)
>   		amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>   
> -	amdgpu_device_evict_resources(adev);
> +	r = amdgpu_device_evict_resources(adev);
> +	if (r)
> +		return r;
>   
>   	amdgpu_fence_driver_hw_fini(adev);
>   


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amd: Fail the suspend if resources can't be evicted
  2022-10-27  6:30 ` Christian König
@ 2022-10-27 11:32   ` Mario Limonciello
  2022-10-27 13:39   ` Alex Deucher
  1 sibling, 0 replies; 6+ messages in thread
From: Mario Limonciello @ 2022-10-27 11:32 UTC (permalink / raw)
  To: Christian König, amd-gfx, Alex Deucher; +Cc: post

On 10/27/22 01:30, Christian König wrote:
> Am 26.10.22 um 21:03 schrieb Mario Limonciello:
>> If a system does not have swap and memory is under 100% usage,
>> amdgpu will fail to evict resources.  Currently the suspend
>> carries on proceeding to reset the GPU:
>>
>> ```
>> [drm] evicting device resources failed
>> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP 
>> block <vcn_v3_0> failed -12
>> [drm] free PSP TMR buffer
>> [TTM] Failed allocating page table
>> [drm] evicting device resources failed
>> amdgpu 0000:03:00.0: amdgpu: MODE1 reset
>> amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
>> amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
>> ```
>>
>> At this point if the suspend actually succeeded I think that amdgpu
>> would have recovered because the GPU would have power cut off and
>> restored.  However the kernel fails to continue the suspend from the
>> memory pressure and amdgpu fails to run the "resume" from the aborted
>> suspend.
>>
>> ```
>> ACPI: PM: Preparing to enter system sleep state S3
>> SLUB: Unable to allocate memory on node -1, 
>> gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
>>    cache: Acpi-State, object size: 80, buffer size: 80, default order: 
>> 0, min order: 0
>>    node 0: slabs: 22, objs: 1122, free: 0
>> ACPI Error: AE_NO_MEMORY, Could not update object reference count 
>> (20210730/utdelete-651)
>>
>> [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
>> [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
>> [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block 
>> <psp> failed -62
>> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
>> PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
>> amdgpu 0000:03:00.0: PM: failed to resume async: error -62
>> ```
>>
>> To avoid this series of unfortunate events, fail amdgpu's suspend
>> when the memory eviction fails.  This will let the system gracefully
>> recover and the user can try suspend again when the memory pressure
>> is relieved.
> 
> Yeah, I've been thinking about that handling for a while now as well.
> 
> Failing to suspend when we are OOM is certainly the right thing to do 
> from a technical perspective.
> 
> But it also means that when users close their laptop it can happen that 
> it keeps running and draining the battery.
> 
> On the other hand when you don't have swap configured it's your fault 
> and not the drivers. >
> It's a trade off and I'm not sure what's better. Alex any comment here?

There are userspace means to handle this (such as systemd-oomd).  If we 
actually fail the suspend and can signal an out of memory error code all 
the way back up then it can work with the oomd processor to make some 
room and try again.

> 
> Thanks,
> Christian.
> 
>>
>> Reported-by: post@davidak.de
>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
>>   1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 6f958603c8cc2..ae10acede495e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct 
>> amdgpu_device *adev)
>>    * at suspend time.
>>    *
>>    */
>> -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
>> +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>>   {
>> +    int ret;
>> +
>>       /* No need to evict vram on APUs for suspend to ram or s2idle */
>>       if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
>> -        return;
>> +        return 0;
>> -    if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
>> +    ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
>> +    if (ret)
>>           DRM_WARN("evicting device resources failed\n");
>> -
>> +    return ret;
>>   }
>>   /*
>> @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device 
>> *dev, bool fbcon)
>>       if (!adev->in_s0ix)
>>           amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>> -    amdgpu_device_evict_resources(adev);
>> +    r = amdgpu_device_evict_resources(adev);
>> +    if (r)
>> +        return r;
>>       amdgpu_fence_driver_hw_fini(adev);
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amd: Fail the suspend if resources can't be evicted
  2022-10-27  6:30 ` Christian König
  2022-10-27 11:32   ` Mario Limonciello
@ 2022-10-27 13:39   ` Alex Deucher
  2022-10-27 13:39     ` Alex Deucher
  1 sibling, 1 reply; 6+ messages in thread
From: Alex Deucher @ 2022-10-27 13:39 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, post, Mario Limonciello, amd-gfx

On Thu, Oct 27, 2022 at 2:31 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> > If a system does not have swap and memory is under 100% usage,
> > amdgpu will fail to evict resources.  Currently the suspend
> > carries on proceeding to reset the GPU:
> >
> > ```
> > [drm] evicting device resources failed
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> > [drm] free PSP TMR buffer
> > [TTM] Failed allocating page table
> > [drm] evicting device resources failed
> > amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> > amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> > amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> > ```
> >
> > At this point if the suspend actually succeeded I think that amdgpu
> > would have recovered because the GPU would have power cut off and
> > restored.  However the kernel fails to continue the suspend from the
> > memory pressure and amdgpu fails to run the "resume" from the aborted
> > suspend.
> >
> > ```
> > ACPI: PM: Preparing to enter system sleep state S3
> > SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
> >    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
> >    node 0: slabs: 22, objs: 1122, free: 0
> > ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
> >
> > [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> > [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> > [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> > PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> > amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> > ```
> >
> > To avoid this series of unfortunate events, fail amdgpu's suspend
> > when the memory eviction fails.  This will let the system gracefully
> > recover and the user can try suspend again when the memory pressure
> > is relieved.
>
> Yeah, I've been thinking about that handling for a while now as well.
>
> Failing to suspend when we are OOM is certainly the right thing to do
> from a technical perspective.
>
> But it also means that when users close their laptop it can happen that
> it keeps running and draining the battery.
>
> On the other hand when you don't have swap configured it's your fault
> and not the drivers.
>
> It's a trade off and I'm not sure what's better. Alex any comment here?

Probably better handled in userspace as Mario noted.  Also, at least
the whole driver won't fall apart due to important buffers getting
lost.

Alex


>
> Thanks,
> Christian.
>
> >
> > Reported-by: post@davidak.de
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
> >   1 file changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 6f958603c8cc2..ae10acede495e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> >    * at suspend time.
> >    *
> >    */
> > -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> > +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
> >   {
> > +     int ret;
> > +
> >       /* No need to evict vram on APUs for suspend to ram or s2idle */
> >       if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> > -             return;
> > +             return 0;
> >
> > -     if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> > +     ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> > +     if (ret)
> >               DRM_WARN("evicting device resources failed\n");
> > -
> > +     return ret;
> >   }
> >
> >   /*
> > @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> >       if (!adev->in_s0ix)
> >               amdgpu_amdkfd_suspend(adev, adev->in_runpm);
> >
> > -     amdgpu_device_evict_resources(adev);
> > +     r = amdgpu_device_evict_resources(adev);
> > +     if (r)
> > +             return r;
> >
> >       amdgpu_fence_driver_hw_fini(adev);
> >
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amd: Fail the suspend if resources can't be evicted
  2022-10-27 13:39   ` Alex Deucher
@ 2022-10-27 13:39     ` Alex Deucher
  0 siblings, 0 replies; 6+ messages in thread
From: Alex Deucher @ 2022-10-27 13:39 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, post, Mario Limonciello, amd-gfx

Patch is:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

On Thu, Oct 27, 2022 at 9:39 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Thu, Oct 27, 2022 at 2:31 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> > > If a system does not have swap and memory is under 100% usage,
> > > amdgpu will fail to evict resources.  Currently the suspend
> > > carries on proceeding to reset the GPU:
> > >
> > > ```
> > > [drm] evicting device resources failed
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> > > [drm] free PSP TMR buffer
> > > [TTM] Failed allocating page table
> > > [drm] evicting device resources failed
> > > amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> > > amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> > > amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> > > ```
> > >
> > > At this point if the suspend actually succeeded I think that amdgpu
> > > would have recovered because the GPU would have power cut off and
> > > restored.  However the kernel fails to continue the suspend from the
> > > memory pressure and amdgpu fails to run the "resume" from the aborted
> > > suspend.
> > >
> > > ```
> > > ACPI: PM: Preparing to enter system sleep state S3
> > > SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
> > >    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
> > >    node 0: slabs: 22, objs: 1122, free: 0
> > > ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
> > >
> > > [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> > > [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> > > [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> > > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> > > PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> > > amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> > > ```
> > >
> > > To avoid this series of unfortunate events, fail amdgpu's suspend
> > > when the memory eviction fails.  This will let the system gracefully
> > > recover and the user can try suspend again when the memory pressure
> > > is relieved.
> >
> > Yeah, I've been thinking about that handling for a while now as well.
> >
> > Failing to suspend when we are OOM is certainly the right thing to do
> > from a technical perspective.
> >
> > But it also means that when users close their laptop it can happen that
> > it keeps running and draining the battery.
> >
> > On the other hand when you don't have swap configured it's your fault
> > and not the drivers.
> >
> > It's a trade off and I'm not sure what's better. Alex any comment here?
>
> Probably better handled in userspace as Mario noted.  Also, at least
> the whole driver won't fall apart due to important buffers getting
> lost.
>
> Alex
>
>
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Reported-by: post@davidak.de
> > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> > > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
> > >   1 file changed, 10 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 6f958603c8cc2..ae10acede495e 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> > >    * at suspend time.
> > >    *
> > >    */
> > > -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> > > +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
> > >   {
> > > +     int ret;
> > > +
> > >       /* No need to evict vram on APUs for suspend to ram or s2idle */
> > >       if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> > > -             return;
> > > +             return 0;
> > >
> > > -     if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> > > +     ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> > > +     if (ret)
> > >               DRM_WARN("evicting device resources failed\n");
> > > -
> > > +     return ret;
> > >   }
> > >
> > >   /*
> > > @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> > >       if (!adev->in_s0ix)
> > >               amdgpu_amdkfd_suspend(adev, adev->in_runpm);
> > >
> > > -     amdgpu_device_evict_resources(adev);
> > > +     r = amdgpu_device_evict_resources(adev);
> > > +     if (r)
> > > +             return r;
> > >
> > >       amdgpu_fence_driver_hw_fini(adev);
> > >
> >

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amd: Fail the suspend if resources can't be evicted
  2022-10-26 19:03 [PATCH] drm/amd: Fail the suspend if resources can't be evicted Mario Limonciello
  2022-10-27  6:30 ` Christian König
@ 2022-10-27 14:56 ` Christian König
  1 sibling, 0 replies; 6+ messages in thread
From: Christian König @ 2022-10-27 14:56 UTC (permalink / raw)
  To: Mario Limonciello, amd-gfx; +Cc: post

Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> If a system does not have swap and memory is under 100% usage,
> amdgpu will fail to evict resources.  Currently the suspend
> carries on proceeding to reset the GPU:
>
> ```
> [drm] evicting device resources failed
> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> [drm] free PSP TMR buffer
> [TTM] Failed allocating page table
> [drm] evicting device resources failed
> amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> ```
>
> At this point if the suspend actually succeeded I think that amdgpu
> would have recovered because the GPU would have power cut off and
> restored.  However the kernel fails to continue the suspend from the
> memory pressure and amdgpu fails to run the "resume" from the aborted
> suspend.
>
> ```
> ACPI: PM: Preparing to enter system sleep state S3
> SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
>    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
>    node 0: slabs: 22, objs: 1122, free: 0
> ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
>
> [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> ```
>
> To avoid this series of unfortunate events, fail amdgpu's suspend
> when the memory eviction fails.  This will let the system gracefully
> recover and the user can try suspend again when the memory pressure
> is relieved.
>
> Reported-by: post@davidak.de
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6f958603c8cc2..ae10acede495e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>    * at suspend time.
>    *
>    */
> -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>   {
> +	int ret;
> +
>   	/* No need to evict vram on APUs for suspend to ram or s2idle */
>   	if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> -		return;
> +		return 0;
>   
> -	if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> +	ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> +	if (ret)
>   		DRM_WARN("evicting device resources failed\n");
> -
> +	return ret;
>   }
>   
>   /*
> @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (!adev->in_s0ix)
>   		amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>   
> -	amdgpu_device_evict_resources(adev);
> +	r = amdgpu_device_evict_resources(adev);
> +	if (r)
> +		return r;
>   
>   	amdgpu_fence_driver_hw_fini(adev);
>   


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-10-27 14:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-26 19:03 [PATCH] drm/amd: Fail the suspend if resources can't be evicted Mario Limonciello
2022-10-27  6:30 ` Christian König
2022-10-27 11:32   ` Mario Limonciello
2022-10-27 13:39   ` Alex Deucher
2022-10-27 13:39     ` Alex Deucher
2022-10-27 14:56 ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.