All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
@ 2021-12-17  2:26 Leslie Shi
  2021-12-17  5:52 ` Chen, Guchun
  2021-12-17  8:49 ` Christian König
  0 siblings, 2 replies; 5+ messages in thread
From: Leslie Shi @ 2021-12-17  2:26 UTC (permalink / raw)
  To: andrey.grodzovsky, christian.koenig, xinhui.pan,
	alexander.deucher, amd-gfx
  Cc: yuliang.shi, guchun.chen

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it
will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well
to release mapped VRAM. However, in the following release callback, driver stills visits the
unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
 	amdgpu_gart_dummy_page_fini(adev);
 
-	amdgpu_device_unmap_mmio(adev);
+	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+		amdgpu_device_unmap_mmio(adev);
+
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
  2021-12-17  2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
@ 2021-12-17  5:52 ` Chen, Guchun
  2021-12-17  8:49 ` Christian König
  1 sibling, 0 replies; 5+ messages in thread
From: Chen, Guchun @ 2021-12-17  5:52 UTC (permalink / raw)
  To: Shi, Leslie, Grodzovsky, Andrey, Koenig, Christian, Pan, Xinhui,
	Deucher, Alexander, amd-gfx

[Public]

Reviewed-by: Guchun Chen <guchun.chen@amd.com>

Regards,
Guchun

-----Original Message-----
From: Shi, Leslie <Yuliang.Shi@amd.com> 
Sent: Friday, December 17, 2021 10:26 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun <Guchun.Chen@amd.com>; Shi, Leslie <Yuliang.Shi@amd.com>
Subject: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the following release callback, driver stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
 	amdgpu_gart_dummy_page_fini(adev);
 
-	amdgpu_device_unmap_mmio(adev);
+	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+		amdgpu_device_unmap_mmio(adev);
+
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
--
2.25.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
  2021-12-17  2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
  2021-12-17  5:52 ` Chen, Guchun
@ 2021-12-17  8:49 ` Christian König
  2021-12-17 15:02   ` Andrey Grodzovsky
  1 sibling, 1 reply; 5+ messages in thread
From: Christian König @ 2021-12-17  8:49 UTC (permalink / raw)
  To: Leslie Shi, andrey.grodzovsky, xinhui.pan, alexander.deucher, amd-gfx
  Cc: guchun.chen

Am 17.12.21 um 03:26 schrieb Leslie Shi:
> [Why]
> In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it
> will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well
> to release mapped VRAM. However, in the following release callback, driver stills visits the
> unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
>
> [How]
> call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
> vcn_v3_0_sw_fini() when GPU initialization failure.
>
> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>

Looks sane to me, but Andrey should probably nood as well.

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f31caec669e7..f6a47b927cfd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   
>   	amdgpu_gart_dummy_page_fini(adev);
>   
> -	amdgpu_device_unmap_mmio(adev);
> +	if (drm_dev_is_unplugged(adev_to_drm(adev)))
> +		amdgpu_device_unmap_mmio(adev);
> +
>   }
>   
>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
  2021-12-17  8:49 ` Christian König
@ 2021-12-17 15:02   ` Andrey Grodzovsky
       [not found]     ` <DM6PR12MB39302C239EB5DBBF5EF30F7B97789@DM6PR12MB3930.namprd12.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Andrey Grodzovsky @ 2021-12-17 15:02 UTC (permalink / raw)
  To: Christian König, Leslie Shi, xinhui.pan, alexander.deucher, amd-gfx
  Cc: guchun.chen

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

On 2021-12-17 3:49 a.m., Christian König wrote:
> Am 17.12.21 um 03:26 schrieb Leslie Shi:
>> [Why]
>> In amdgpu_driver_load_kms, when amdgpu_device_init returns error 
>> during driver modprobe, it
>> will start the error handle path immediately and call into 
>> amdgpu_device_unmap_mmio as well
>> to release mapped VRAM. However, in the following release callback, 
>> driver stills visits the
>> unmapped memory like vcn.inst[i].fw_shared_cpu_addr in 
>> vcn_v3_0_sw_fini. So a kernel crash occurs.
>>
>> [How]
>> call amdgpu_device_unmap_mmio() if device is unplugged to prevent 
>> invalid memory address in
>> vcn_v3_0_sw_fini() when GPU initialization failure.
>>
>> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
>
> Looks sane to me, but Andrey should probably nood as well.
>
> Acked-by: Christian König <christian.koenig@amd.com>
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f31caec669e7..f6a47b927cfd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device 
>> *adev)
>>         amdgpu_gart_dummy_page_fini(adev);
>>   -    amdgpu_device_unmap_mmio(adev);
>> +    if (drm_dev_is_unplugged(adev_to_drm(adev)))
>> +        amdgpu_device_unmap_mmio(adev);
>> +
>>   }
>>     void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
       [not found]         ` <DM6PR12MB39306C1F1EF550E0AB0872D6977E9@DM6PR12MB3930.namprd12.prod.outlook.com>
@ 2021-12-23 15:42           ` Andrey Grodzovsky
  0 siblings, 0 replies; 5+ messages in thread
From: Andrey Grodzovsky @ 2021-12-23 15:42 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx list, Shi, Leslie

Adding back public list and Leslie specifically.

Lijo is right and it's not MTTR release only, also all the unmaps in 
amdgpu_device_unmap_mmio
since this patch makes call to amdgpu_device_unmap_mmio conditioned on 
device unplugged
we need then to still take care of unmapping even when device is NOT 
unplugged - for this i suggest
to look at 'drm/amdgpu: Unmap all MMIO mappings' and just conditionally 
call all the deleted unmaps
in the patch where the condition is 'if (drm_dev_enter(dev))'.

Andrey

On 2021-12-23 8:47 a.m., Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> Actually, I was asking specifically about this -
>
> gmc_v9_0_sw_init -> amdgpu_bo_init->    adev->gmc.vram_mtrr = arch_phys_wc_add(adev->gmc.aper_base,
>                                  adev->gmc.aper_size);
>
>
> As per this patch, if the driver load failed due to some error which happens during hw_init(), this action is not undone. I was thinking why it's not considered important to skip this on driver failure to load. For ex: when some MTRR register is reserved for this mapping.
>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, December 17, 2021 9:05 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>
> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
>
> We do unmap on unload, this function is a collection of all MMIO unmapps we already do on unload, it's just does them early on in case of device hot removal. Before pci_driver.remove callback (amdgpu_pci_remove) finish execution.
>
> Andrey
>
> On 2021-12-17 10:23 a.m., Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> As a side note,  even if it's a failed driver load, why it is not important to undo the mappings created during driver load? I'm wondering what is the impact on a system like MI200 A+A.
>>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Andrey Grodzovsky
>> Sent: Friday, December 17, 2021 8:32 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Shi, Leslie
>> <Yuliang.Shi@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher,
>> Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Guchun <Guchun.Chen@amd.com>
>> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if
>> device is unplugged to prevent crash in GPU initialization failure
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>> Andrey
>>
>> On 2021-12-17 3:49 a.m., Christian König wrote:
>>> Am 17.12.21 um 03:26 schrieb Leslie Shi:
>>>> [Why]
>>>> In amdgpu_driver_load_kms, when amdgpu_device_init returns error
>>>> during driver modprobe, it will start the error handle path
>>>> immediately and call into amdgpu_device_unmap_mmio as well to
>>>> release mapped VRAM. However, in the following release callback,
>>>> driver stills visits the unmapped memory like
>>>> vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
>>>>
>>>> [How]
>>>> call amdgpu_device_unmap_mmio() if device is unplugged to prevent
>>>> invalid memory address in
>>>> vcn_v3_0_sw_fini() when GPU initialization failure.
>>>>
>>>> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
>>> Looks sane to me, but Andrey should probably nood as well.
>>>
>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f31caec669e7..f6a47b927cfd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct
>>>> amdgpu_device
>>>> *adev)
>>>>           amdgpu_gart_dummy_page_fini(adev);
>>>>     -    amdgpu_device_unmap_mmio(adev);
>>>> +    if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>> +        amdgpu_device_unmap_mmio(adev);
>>>> +
>>>>     }
>>>>       void amdgpu_device_fini_sw(struct amdgpu_device *adev)

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-12-23 15:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-17  2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
2021-12-17  5:52 ` Chen, Guchun
2021-12-17  8:49 ` Christian König
2021-12-17 15:02   ` Andrey Grodzovsky
     [not found]     ` <DM6PR12MB39302C239EB5DBBF5EF30F7B97789@DM6PR12MB3930.namprd12.prod.outlook.com>
     [not found]       ` <be2b13c1-09fd-1cb9-4157-872ac49dc03b@amd.com>
     [not found]         ` <DM6PR12MB39306C1F1EF550E0AB0872D6977E9@DM6PR12MB3930.namprd12.prod.outlook.com>
2021-12-23 15:42           ` Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.