* [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
@ 2021-12-17 2:26 Leslie Shi
2021-12-17 5:52 ` Chen, Guchun
2021-12-17 8:49 ` Christian König
0 siblings, 2 replies; 5+ messages in thread
From: Leslie Shi @ 2021-12-17 2:26 UTC (permalink / raw)
To: andrey.grodzovsky, christian.koenig, xinhui.pan,
alexander.deucher, amd-gfx
Cc: yuliang.shi, guchun.chen
[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it
will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well
to release mapped VRAM. However, in the following release callback, driver stills visits the
unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.
Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
amdgpu_gart_dummy_page_fini(adev);
- amdgpu_device_unmap_mmio(adev);
+ if (drm_dev_is_unplugged(adev_to_drm(adev)))
+ amdgpu_device_unmap_mmio(adev);
+
}
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
--
2.25.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* RE: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
2021-12-17 2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
@ 2021-12-17 5:52 ` Chen, Guchun
2021-12-17 8:49 ` Christian König
1 sibling, 0 replies; 5+ messages in thread
From: Chen, Guchun @ 2021-12-17 5:52 UTC (permalink / raw)
To: Shi, Leslie, Grodzovsky, Andrey, Koenig, Christian, Pan, Xinhui,
Deucher, Alexander, amd-gfx
[Public]
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Regards,
Guchun
-----Original Message-----
From: Shi, Leslie <Yuliang.Shi@amd.com>
Sent: Friday, December 17, 2021 10:26 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun <Guchun.Chen@amd.com>; Shi, Leslie <Yuliang.Shi@amd.com>
Subject: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the following release callback, driver stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.
Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
amdgpu_gart_dummy_page_fini(adev);
- amdgpu_device_unmap_mmio(adev);
+ if (drm_dev_is_unplugged(adev_to_drm(adev)))
+ amdgpu_device_unmap_mmio(adev);
+
}
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
--
2.25.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
2021-12-17 2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
2021-12-17 5:52 ` Chen, Guchun
@ 2021-12-17 8:49 ` Christian König
2021-12-17 15:02 ` Andrey Grodzovsky
1 sibling, 1 reply; 5+ messages in thread
From: Christian König @ 2021-12-17 8:49 UTC (permalink / raw)
To: Leslie Shi, andrey.grodzovsky, xinhui.pan, alexander.deucher, amd-gfx
Cc: guchun.chen
Am 17.12.21 um 03:26 schrieb Leslie Shi:
> [Why]
> In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it
> will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well
> to release mapped VRAM. However, in the following release callback, driver stills visits the
> unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
>
> [How]
> call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
> vcn_v3_0_sw_fini() when GPU initialization failure.
>
> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Looks sane to me, but Andrey should probably nood as well.
Acked-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f31caec669e7..f6a47b927cfd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>
> amdgpu_gart_dummy_page_fini(adev);
>
> - amdgpu_device_unmap_mmio(adev);
> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
> + amdgpu_device_unmap_mmio(adev);
> +
> }
>
> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
2021-12-17 8:49 ` Christian König
@ 2021-12-17 15:02 ` Andrey Grodzovsky
[not found] ` <DM6PR12MB39302C239EB5DBBF5EF30F7B97789@DM6PR12MB3930.namprd12.prod.outlook.com>
0 siblings, 1 reply; 5+ messages in thread
From: Andrey Grodzovsky @ 2021-12-17 15:02 UTC (permalink / raw)
To: Christian König, Leslie Shi, xinhui.pan, alexander.deucher, amd-gfx
Cc: guchun.chen
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Andrey
On 2021-12-17 3:49 a.m., Christian König wrote:
> Am 17.12.21 um 03:26 schrieb Leslie Shi:
>> [Why]
>> In amdgpu_driver_load_kms, when amdgpu_device_init returns error
>> during driver modprobe, it
>> will start the error handle path immediately and call into
>> amdgpu_device_unmap_mmio as well
>> to release mapped VRAM. However, in the following release callback,
>> driver stills visits the
>> unmapped memory like vcn.inst[i].fw_shared_cpu_addr in
>> vcn_v3_0_sw_fini. So a kernel crash occurs.
>>
>> [How]
>> call amdgpu_device_unmap_mmio() if device is unplugged to prevent
>> invalid memory address in
>> vcn_v3_0_sw_fini() when GPU initialization failure.
>>
>> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
>
> Looks sane to me, but Andrey should probably nood as well.
>
> Acked-by: Christian König <christian.koenig@amd.com>
>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f31caec669e7..f6a47b927cfd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device
>> *adev)
>> amdgpu_gart_dummy_page_fini(adev);
>> - amdgpu_device_unmap_mmio(adev);
>> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
>> + amdgpu_device_unmap_mmio(adev);
>> +
>> }
>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
[not found] ` <DM6PR12MB39306C1F1EF550E0AB0872D6977E9@DM6PR12MB3930.namprd12.prod.outlook.com>
@ 2021-12-23 15:42 ` Andrey Grodzovsky
0 siblings, 0 replies; 5+ messages in thread
From: Andrey Grodzovsky @ 2021-12-23 15:42 UTC (permalink / raw)
To: Lazar, Lijo, amd-gfx list, Shi, Leslie
Adding back public list and Leslie specifically.
Lijo is right and it's not MTTR release only, also all the unmaps in
amdgpu_device_unmap_mmio
since this patch makes call to amdgpu_device_unmap_mmio conditioned on
device unplugged
we need then to still take care of unmapping even when device is NOT
unplugged - for this i suggest
to look at 'drm/amdgpu: Unmap all MMIO mappings' and just conditionally
call all the deleted unmaps
in the patch where the condition is 'if (drm_dev_enter(dev))'.
Andrey
On 2021-12-23 8:47 a.m., Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> Actually, I was asking specifically about this -
>
> gmc_v9_0_sw_init -> amdgpu_bo_init-> adev->gmc.vram_mtrr = arch_phys_wc_add(adev->gmc.aper_base,
> adev->gmc.aper_size);
>
>
> As per this patch, if the driver load failed due to some error which happens during hw_init(), this action is not undone. I was thinking why it's not considered important to skip this on driver failure to load. For ex: when some MTRR register is reserved for this mapping.
>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, December 17, 2021 9:05 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>
> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
>
> We do unmap on unload, this function is a collection of all MMIO unmapps we already do on unload, it's just does them early on in case of device hot removal. Before pci_driver.remove callback (amdgpu_pci_remove) finish execution.
>
> Andrey
>
> On 2021-12-17 10:23 a.m., Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> As a side note, even if it's a failed driver load, why it is not important to undo the mappings created during driver load? I'm wondering what is the impact on a system like MI200 A+A.
>>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Andrey Grodzovsky
>> Sent: Friday, December 17, 2021 8:32 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Shi, Leslie
>> <Yuliang.Shi@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher,
>> Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Chen, Guchun <Guchun.Chen@amd.com>
>> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if
>> device is unplugged to prevent crash in GPU initialization failure
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>> Andrey
>>
>> On 2021-12-17 3:49 a.m., Christian König wrote:
>>> Am 17.12.21 um 03:26 schrieb Leslie Shi:
>>>> [Why]
>>>> In amdgpu_driver_load_kms, when amdgpu_device_init returns error
>>>> during driver modprobe, it will start the error handle path
>>>> immediately and call into amdgpu_device_unmap_mmio as well to
>>>> release mapped VRAM. However, in the following release callback,
>>>> driver stills visits the unmapped memory like
>>>> vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
>>>>
>>>> [How]
>>>> call amdgpu_device_unmap_mmio() if device is unplugged to prevent
>>>> invalid memory address in
>>>> vcn_v3_0_sw_fini() when GPU initialization failure.
>>>>
>>>> Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
>>> Looks sane to me, but Andrey should probably nood as well.
>>>
>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f31caec669e7..f6a47b927cfd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct
>>>> amdgpu_device
>>>> *adev)
>>>> amdgpu_gart_dummy_page_fini(adev);
>>>> - amdgpu_device_unmap_mmio(adev);
>>>> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>> + amdgpu_device_unmap_mmio(adev);
>>>> +
>>>> }
>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-12-23 15:43 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-17 2:26 [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure Leslie Shi
2021-12-17 5:52 ` Chen, Guchun
2021-12-17 8:49 ` Christian König
2021-12-17 15:02 ` Andrey Grodzovsky
[not found] ` <DM6PR12MB39302C239EB5DBBF5EF30F7B97789@DM6PR12MB3930.namprd12.prod.outlook.com>
[not found] ` <be2b13c1-09fd-1cb9-4157-872ac49dc03b@amd.com>
[not found] ` <DM6PR12MB39306C1F1EF550E0AB0872D6977E9@DM6PR12MB3930.namprd12.prod.outlook.com>
2021-12-23 15:42 ` Andrey Grodzovsky
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.