[PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash


* [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash
@ 2021-12-15  8:46 Leslie Shi
  2021-12-15 10:59 ` Christian König
  0 siblings, 1 reply; 4+ messages in thread
From: Leslie Shi @ 2021-12-15  8:46 UTC (permalink / raw)
  To: andrey.grodzovsky, christian.koenig, xinhui.pan,
	alexander.deucher, amd-gfx
  Cc: yuliang.shi, guchun.chen

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns an error during driver modprobe, the
driver enters the error handling path immediately and calls amdgpu_device_unmap_mmio, which
releases the mapped VRAM. However, the subsequent release callback still accesses the unmapped
memory, e.g. vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini, so a kernel crash occurs.

[How]
Call drm_dev_unplug() before amdgpu_driver_unload_kms to prevent this crash.
A GPU initialization failure is tolerable, but it should never result in a kernel crash.

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 651c7abfde03..7bf6aecdbb92 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -268,6 +268,8 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
 		/* balance pm_runtime_get_sync in amdgpu_driver_unload_kms */
 		if (adev->rmmio && adev->runpm)
 			pm_runtime_put_noidle(dev->dev);
+
+		drm_dev_unplug(dev);
 		amdgpu_driver_unload_kms(dev);
 	}

-- 
2.25.1


* Re: [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash
  2021-12-15  8:46 [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash Leslie Shi
@ 2021-12-15 10:59 ` Christian König
  2021-12-15 13:28   ` Chen, Guchun
  0 siblings, 1 reply; 4+ messages in thread
From: Christian König @ 2021-12-15 10:59 UTC (permalink / raw)
  To: Leslie Shi, andrey.grodzovsky, xinhui.pan, alexander.deucher, amd-gfx
  Cc: guchun.chen

Am 15.12.21 um 09:46 schrieb Leslie Shi:
> [Why]
> In amdgpu_driver_load_kms, when amdgpu_device_init returns an error during driver modprobe, the
> driver enters the error handling path immediately and calls amdgpu_device_unmap_mmio, which
> releases the mapped VRAM. However, the subsequent release callback still accesses the unmapped
> memory, e.g. vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini, so a kernel crash occurs.

Mhm, interesting workaround but I'm not sure that's the right thing to do.

Question is why are we unmapping the MMIO space on driver load failure 
so early in the first place? I mean don't we need to clean up a bit?

If that's really the way to go then we should at least add a comment 
explaining why it's done that way.

Regards,
Christian.

>
> [How]
> Call drm_dev_unplug() before amdgpu_driver_unload_kms to prevent this crash.
> A GPU initialization failure is tolerable, but it should never result in a kernel crash.
>
> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 651c7abfde03..7bf6aecdbb92 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -268,6 +268,8 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
>   		/* balance pm_runtime_get_sync in amdgpu_driver_unload_kms */
>   		if (adev->rmmio && adev->runpm)
>   			pm_runtime_put_noidle(dev->dev);
> +
> +		drm_dev_unplug(dev);
>   		amdgpu_driver_unload_kms(dev);
>   	}
>   



* RE: [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash
  2021-12-15 10:59 ` Christian König
@ 2021-12-15 13:28   ` Chen, Guchun
  2021-12-15 15:19     ` Andrey Grodzovsky
  0 siblings, 1 reply; 4+ messages in thread
From: Chen, Guchun @ 2021-12-15 13:28 UTC (permalink / raw)
  To: Koenig, Christian, Shi, Leslie, Grodzovsky, Andrey, Pan, Xinhui,
	Deucher, Alexander, amd-gfx

[Public]

Hi Christian,

Your question is a really good one. The change that unmaps MMIO at such an early phase comes from Andrey's patch "drm/amdgpu: Unmap all MMIO mappings". That patch is from half a year ago, and everything looked fine until this case.

Regards,
Guchun



* Re: [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash
  2021-12-15 13:28   ` Chen, Guchun
@ 2021-12-15 15:19     ` Andrey Grodzovsky
  0 siblings, 0 replies; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-12-15 15:19 UTC (permalink / raw)
  To: Chen, Guchun, Koenig, Christian, Shi, Leslie, Pan, Xinhui,
	Deucher, Alexander, amd-gfx

I think we should not call amdgpu_device_unmap_mmio unless the device
is unplugged (as in amdgpu_pci_remove), because the point of this
function is to prevent accesses to the MMIO range the device was
occupying before removal.
There is no point in preventing MMIO accesses when init has failed and
we want to do an orderly HW shutdown... So probably we should just change to

if (drm_dev_is_unplugged(dev))

     amdgpu_device_unmap_mmio(adev);
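
The gating described in the prose above (unmap only when the device is
really unplugged) can be sketched as a small userspace model. This is
only an illustration under stated assumptions: every type and function
below is a hypothetical stand-in, not the amdgpu or DRM kernel API, and
the model assumes the later sw_fini-style callback is guarded by a
drm_dev_enter()-style check, so it skips its memory access once the
device is marked unplugged.

```c
/* Userspace model only: all names here are hypothetical stand-ins,
 * not the amdgpu or DRM kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
	bool unplugged;   /* models what drm_dev_is_unplugged() would report */
	bool mmio_mapped; /* models whether the MMIO/VRAM mappings still exist */
};

enum fini_result { FINI_SKIPPED, FINI_OK, FINI_FAULT };

/* Models drm_dev_unplug(): mark the device as gone. */
static void model_unplug(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->unplugged = true;
}

/* Models the hardware teardown step. With only_if_unplugged == false the
 * mappings are always dropped; with true they are dropped only when the
 * device has been removed, as suggested in the thread. */
static void model_fini_hw(struct model_dev *dev, bool only_if_unplugged)
{
	assert(dev != NULL);
	if (!only_if_unplugged || dev->unplugged)
		dev->mmio_mapped = false;
}

/* Models a later sw_fini-style callback that touches mapped memory.
 * Assumption: the access is guarded by a drm_dev_enter()-style check,
 * so it is skipped when the device is unplugged. FINI_FAULT marks the
 * case where a real driver would dereference an unmapped address. */
static enum fini_result model_sw_fini(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->unplugged)
		return FINI_SKIPPED;
	return dev->mmio_mapped ? FINI_OK : FINI_FAULT;
}
```

In this model, an init failure followed by an unconditional unmap ends
in FINI_FAULT, which corresponds to the reported crash. Marking the
device unplugged first (the posted patch) turns that into FINI_SKIPPED,
while gating the unmap on the unplugged state keeps the mappings alive
so the callback can finish an orderly shutdown (FINI_OK).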

Andrey


