All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: add post reset IP callback
@ 2024-03-28  4:40 Lang Yu
  2024-04-02 13:38 ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Lang Yu @ 2024-03-28  4:40 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Christian Koenig, Lang Yu

There are use cases which need full GPU functionality
(e.g., VM update, TLB inavildate) when doing a GPU reset.

Especially, the mes/umsch self tests which help validate
the hw state after hw init like ring/ib tests.

Add a post reset IP callback to handle such use cases
which will be executed after GPU reset succeeds.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
 drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 12dc71a6b5db..feeab9397aab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct list_head *device_list_handle)
 	return ret;
 }
 
+static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev)
+{
+	uint32_t i;
+	int r;
+
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].status.valid ||
+		    !adev->ip_blocks[i].version->funcs->post_reset)
+			continue;
+
+		r = adev->ip_blocks[i].version->funcs->post_reset(adev);
+		if (r) {
+			DRM_ERROR("post reset of IP block <%s> failed %d\n",
+				  adev->ip_blocks[i].version->funcs->name, r);
+			return r;
+		}
+	}
+
+	return r;
+}
+
 /**
  * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  *
@@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		amdgpu_put_xgmi_hive(hive);
 	}
 
+	if (!r && !job_signaled)
+		r = amdgpu_device_ip_post_reset(adev);
+
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 
diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
index b0a6256e89f4..33ce30a8e3ab 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
  * @pre_soft_reset: pre soft reset the IP block
  * @soft_reset: soft reset the IP block
  * @post_soft_reset: post soft reset the IP block
+ * @post_reset: handles IP specific post reset stuff(e.g., self test)
  * @set_clockgating_state: enable/disable cg for the IP block
  * @set_powergating_state: enable/disable pg for the IP block
  * @get_clockgating_state: get current clockgating status
@@ -316,11 +317,13 @@ struct amd_ip_funcs {
 	int (*pre_soft_reset)(void *handle);
 	int (*soft_reset)(void *handle);
 	int (*post_soft_reset)(void *handle);
+	int (*post_reset)(void *handle);
 	int (*set_clockgating_state)(void *handle,
 				     enum amd_clockgating_state state);
 	int (*set_powergating_state)(void *handle,
 				     enum amd_powergating_state state);
 	void (*get_clockgating_state)(void *handle, u64 *flags);
+
 };
 
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: add post reset IP callback
  2024-03-28  4:40 [PATCH] drm/amdgpu: add post reset IP callback Lang Yu
@ 2024-04-02 13:38 ` Christian König
  2024-04-03  6:51   ` Yu, Lang
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-04-02 13:38 UTC (permalink / raw)
  To: Lang Yu, amd-gfx; +Cc: Alex Deucher, Christian Koenig, Sharma, Shashank

Am 28.03.24 um 05:40 schrieb Lang Yu:
> There are use cases which need full GPU functionality
> (e.g., VM update, TLB inavildate) when doing a GPU reset.
>
> Especially, the mes/umsch self tests which help validate
> the hw state after hw init like ring/ib tests.

I noted that before but just to repeat it once more: We can't do any MES 
or UMSCH validation while doing a GPU reset!

The ring and IB tests use some pre-allocated memory we put aside for the 
task during driver load and so can execute during GPU reset as well.

But for the MES/UMSCH we need a full blown environment with VM and 
submission infrastructure and setting that up isn't possible here.

Adding Shashank as well, but I think we should probably just completely 
remove those from the kernel.

Regards,
Christian.

>
> Add a post reset IP callback to handle such use cases
> which will be executed after GPU reset succeeds.
>
> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>   2 files changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 12dc71a6b5db..feeab9397aab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct list_head *device_list_handle)
>   	return ret;
>   }
>   
> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev)
> +{
> +	uint32_t i;
> +	int r;
> +
> +	for (i = 0; i < adev->num_ip_blocks; i++) {
> +		if (!adev->ip_blocks[i].status.valid ||
> +		    !adev->ip_blocks[i].version->funcs->post_reset)
> +			continue;
> +
> +		r = adev->ip_blocks[i].version->funcs->post_reset(adev);
> +		if (r) {
> +			DRM_ERROR("post reset of IP block <%s> failed %d\n",
> +				  adev->ip_blocks[i].version->funcs->name, r);
> +			return r;
> +		}
> +	}
> +
> +	return r;
> +}
> +
>   /**
>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>    *
> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		amdgpu_put_xgmi_hive(hive);
>   	}
>   
> +	if (!r && !job_signaled)
> +		r = amdgpu_device_ip_post_reset(adev);
> +
>   	if (r)
>   		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>   
> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
> index b0a6256e89f4..33ce30a8e3ab 100644
> --- a/drivers/gpu/drm/amd/include/amd_shared.h
> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>    * @pre_soft_reset: pre soft reset the IP block
>    * @soft_reset: soft reset the IP block
>    * @post_soft_reset: post soft reset the IP block
> + * @post_reset: handles IP specific post reset stuff(e.g., self test)
>    * @set_clockgating_state: enable/disable cg for the IP block
>    * @set_powergating_state: enable/disable pg for the IP block
>    * @get_clockgating_state: get current clockgating status
> @@ -316,11 +317,13 @@ struct amd_ip_funcs {
>   	int (*pre_soft_reset)(void *handle);
>   	int (*soft_reset)(void *handle);
>   	int (*post_soft_reset)(void *handle);
> +	int (*post_reset)(void *handle);
>   	int (*set_clockgating_state)(void *handle,
>   				     enum amd_clockgating_state state);
>   	int (*set_powergating_state)(void *handle,
>   				     enum amd_powergating_state state);
>   	void (*get_clockgating_state)(void *handle, u64 *flags);
> +
>   };
>   
>   


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] drm/amdgpu: add post reset IP callback
  2024-04-02 13:38 ` Christian König
@ 2024-04-03  6:51   ` Yu, Lang
  2024-04-03  7:18     ` Sharma, Shashank
  2024-04-03  9:40     ` Christian König
  0 siblings, 2 replies; 7+ messages in thread
From: Yu, Lang @ 2024-04-03  6:51 UTC (permalink / raw)
  To: Christian König, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Sharma, Shashank

[AMD Official Use Only - General]

>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken@gmail.com>
>Sent: Tuesday, April 2, 2024 9:38 PM
>To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Sharma, Shashank
><Shashank.Sharma@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>
>Am 28.03.24 um 05:40 schrieb Lang Yu:
>> There are use cases which need full GPU functionality (e.g., VM
>> update, TLB inavildate) when doing a GPU reset.
>>
>> Especially, the mes/umsch self tests which help validate the hw state
>> after hw init like ring/ib tests.
>
>I noted that before but just to repeat it once more: We can't do any MES or
>UMSCH validation while doing a GPU reset!

Yes, we can just easily disable it if it doesn't work well.
But it doesn't take too much effort to make it work.
It can expose issues as soon as possible and is useful for debugging purpose.

>The ring and IB tests use some pre-allocated memory we put aside for the
>task during driver load and so can execute during GPU reset as well.

If user space can create a VM and allocate memory during GPU reset,
it makes no sense to prevent kernel space from doing that.

>But for the MES/UMSCH we need a full blown environment with VM and
>submission infrastructure and setting that up isn't possible here.

At least for UMSCH test, it only uses VM mapping functionality
(if we can create a VM with cpu update mode, that's enough),
it doesn't use other submission functionality.
It is actually a compute context.


Regards,
Lang

>Adding Shashank as well, but I think we should probably just completely
>remove those from the kernel.
>
>Regards,
>Christian.
>
>>
>> Add a post reset IP callback to handle such use cases which will be
>> executed after GPU reset succeeds.
>>
>> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>++++++++++++++++++++++
>>   drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 12dc71a6b5db..feeab9397aab 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct
>list_head *device_list_handle)
>>      return ret;
>>   }
>>
>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>> +    uint32_t i;
>> +    int r;
>> +
>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>> +            if (!adev->ip_blocks[i].status.valid ||
>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>> +                    continue;
>> +
>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>> +            if (r) {
>> +                    DRM_ERROR("post reset of IP block <%s>
>failed %d\n",
>> +                              adev->ip_blocks[i].version->funcs->name, r);
>> +                    return r;
>> +            }
>> +    }
>> +
>> +    return r;
>> +}
>> +
>>   /**
>>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>    *
>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>>              amdgpu_put_xgmi_hive(hive);
>>      }
>>
>> +    if (!r && !job_signaled)
>> +            r = amdgpu_device_ip_post_reset(adev);
>> +
>>      if (r)
>>              dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>
>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>> b/drivers/gpu/drm/amd/include/amd_shared.h
>> index b0a6256e89f4..33ce30a8e3ab 100644
>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>    * @pre_soft_reset: pre soft reset the IP block
>>    * @soft_reset: soft reset the IP block
>>    * @post_soft_reset: post soft reset the IP block
>> + * @post_reset: handles IP specific post reset stuff(e.g., self test)
>>    * @set_clockgating_state: enable/disable cg for the IP block
>>    * @set_powergating_state: enable/disable pg for the IP block
>>    * @get_clockgating_state: get current clockgating status @@ -316,11
>> +317,13 @@ struct amd_ip_funcs {
>>      int (*pre_soft_reset)(void *handle);
>>      int (*soft_reset)(void *handle);
>>      int (*post_soft_reset)(void *handle);
>> +    int (*post_reset)(void *handle);
>>      int (*set_clockgating_state)(void *handle,
>>                                   enum amd_clockgating_state state);
>>      int (*set_powergating_state)(void *handle,
>>                                   enum amd_powergating_state state);
>>      void (*get_clockgating_state)(void *handle, u64 *flags);
>> +
>>   };
>>
>>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: add post reset IP callback
  2024-04-03  6:51   ` Yu, Lang
@ 2024-04-03  7:18     ` Sharma, Shashank
  2024-04-03  7:31       ` Yu, Lang
  2024-04-03  9:40     ` Christian König
  1 sibling, 1 reply; 7+ messages in thread
From: Sharma, Shashank @ 2024-04-03  7:18 UTC (permalink / raw)
  To: Yu, Lang, Christian König, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian

Hey Lang,

On 03/04/2024 08:51, Yu, Lang wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Tuesday, April 2, 2024 9:38 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Sharma, Shashank
>> <Shashank.Sharma@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>
>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>> There are use cases which need full GPU functionality (e.g., VM
>>> update, TLB inavildate) when doing a GPU reset.
>>>
>>> Especially, the mes/umsch self tests which help validate the hw state
>>> after hw init like ring/ib tests.
>> I noted that before but just to repeat it once more: We can't do any MES or
>> UMSCH validation while doing a GPU reset!
> Yes, we can just easily disable it if it doesn't work well.
> But it doesn't take too much effort to make it work.
> It can expose issues as soon as possible and is useful for debugging purpose.
IMO, its not that useful for debugging as well. In case of a problem, It 
will just timeout waiting for MES packet write and we will still have to 
find out the actual problem which caused MES to go into bad state in the 
last GPU reset.
>
>> The ring and IB tests use some pre-allocated memory we put aside for the
>> task during driver load and so can execute during GPU reset as well.
> If user space can create a VM and allocate memory during GPU reset,
> it makes no sense to prevent kernel space from doing that.

I think the objection here is mostly about why to do it at all, when it 
is not helpful. It would be just a maintenance overhead.

- Shashank

>> But for the MES/UMSCH we need a full blown environment with VM and
>> submission infrastructure and setting that up isn't possible here.
> At least for UMSCH test, it only uses VM mapping functionality
> (if we can create a VM with cpu update mode, that's enough),
> it doesn't use other submission functionality.
> It is actually a compute context.
>
>
> Regards,
> Lang
>
>> Adding Shashank as well, but I think we should probably just completely
>> remove those from the kernel.
>>
>> Regards,
>> Christian.
>>
>>> Add a post reset IP callback to handle such use cases which will be
>>> executed after GPU reset succeeds.
>>>
>>> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>> ++++++++++++++++++++++
>>>    drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>    2 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 12dc71a6b5db..feeab9397aab 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct
>> list_head *device_list_handle)
>>>       return ret;
>>>    }
>>>
>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>> +    uint32_t i;
>>> +    int r;
>>> +
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +            if (!adev->ip_blocks[i].status.valid ||
>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>> +                    continue;
>>> +
>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>> +            if (r) {
>>> +                    DRM_ERROR("post reset of IP block <%s>
>> failed %d\n",
>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>> +                    return r;
>>> +            }
>>> +    }
>>> +
>>> +    return r;
>>> +}
>>> +
>>>    /**
>>>     * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>     *
>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>> amdgpu_device *adev,
>>>               amdgpu_put_xgmi_hive(hive);
>>>       }
>>>
>>> +    if (!r && !job_signaled)
>>> +            r = amdgpu_device_ip_post_reset(adev);
>>> +
>>>       if (r)
>>>               dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>
>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>     * @pre_soft_reset: pre soft reset the IP block
>>>     * @soft_reset: soft reset the IP block
>>>     * @post_soft_reset: post soft reset the IP block
>>> + * @post_reset: handles IP specific post reset stuff(e.g., self test)
>>>     * @set_clockgating_state: enable/disable cg for the IP block
>>>     * @set_powergating_state: enable/disable pg for the IP block
>>>     * @get_clockgating_state: get current clockgating status @@ -316,11
>>> +317,13 @@ struct amd_ip_funcs {
>>>       int (*pre_soft_reset)(void *handle);
>>>       int (*soft_reset)(void *handle);
>>>       int (*post_soft_reset)(void *handle);
>>> +    int (*post_reset)(void *handle);
>>>       int (*set_clockgating_state)(void *handle,
>>>                                    enum amd_clockgating_state state);
>>>       int (*set_powergating_state)(void *handle,
>>>                                    enum amd_powergating_state state);
>>>       void (*get_clockgating_state)(void *handle, u64 *flags);
>>> +
>>>    };
>>>
>>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] drm/amdgpu: add post reset IP callback
  2024-04-03  7:18     ` Sharma, Shashank
@ 2024-04-03  7:31       ` Yu, Lang
  2024-04-03  8:44         ` Sharma, Shashank
  0 siblings, 1 reply; 7+ messages in thread
From: Yu, Lang @ 2024-04-03  7:31 UTC (permalink / raw)
  To: Sharma, Shashank, Christian König, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian

[AMD Official Use Only - General]

>-----Original Message-----
>From: Sharma, Shashank <Shashank.Sharma@amd.com>
>Sent: Wednesday, April 3, 2024 3:19 PM
>To: Yu, Lang <Lang.Yu@amd.com>; Christian König
><ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>
>Hey Lang,
>
>On 03/04/2024 08:51, Yu, Lang wrote:
>> [AMD Official Use Only - General]
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>> Sent: Tuesday, April 2, 2024 9:38 PM
>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; Sharma, Shashank
>>> <Shashank.Sharma@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>>
>>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>>> There are use cases which need full GPU functionality (e.g., VM
>>>> update, TLB inavildate) when doing a GPU reset.
>>>>
>>>> Especially, the mes/umsch self tests which help validate the hw
>>>> state after hw init like ring/ib tests.
>>> I noted that before but just to repeat it once more: We can't do any
>>> MES or UMSCH validation while doing a GPU reset!
>> Yes, we can just easily disable it if it doesn't work well.
>> But it doesn't take too much effort to make it work.
>> It can expose issues as soon as possible and is useful for debugging
>purpose.
>IMO, its not that useful for debugging as well. In case of a problem, It will just
>timeout waiting for MES packet write and we will still have to find out the
>actual problem which caused MES to go into bad state in the last GPU reset.
>>
>>> The ring and IB tests use some pre-allocated memory we put aside for
>>> the task during driver load and so can execute during GPU reset as well.
>> If user space can create a VM and allocate memory during GPU reset, it
>> makes no sense to prevent kernel space from doing that.
>
>I think the objection here is mostly about why to do it at all, when it is not
>helpful. It would be just a maintenance overhead.

If you think it is not helpful,  why doing ring/ib tests?
I don't think such sanity test is not helpful.

I only talk UMSCH test(different with MES test) here,
I don't think it has a maintenance overhead.

Regards,
Lang

>- Shashank
>
>>> But for the MES/UMSCH we need a full blown environment with VM and
>>> submission infrastructure and setting that up isn't possible here.
>> At least for UMSCH test, it only uses VM mapping functionality (if we
>> can create a VM with cpu update mode, that's enough), it doesn't use
>> other submission functionality.
>> It is actually a compute context.
>>
>>
>> Regards,
>> Lang
>>
>>> Adding Shashank as well, but I think we should probably just
>>> completely remove those from the kernel.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Add a post reset IP callback to handle such use cases which will be
>>>> executed after GPU reset succeeds.
>>>>
>>>> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>>> ++++++++++++++++++++++
>>>>    drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>>    2 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 12dc71a6b5db..feeab9397aab 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5556,6 +5556,27 @@ static int
>amdgpu_device_health_check(struct
>>> list_head *device_list_handle)
>>>>       return ret;
>>>>    }
>>>>
>>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>>> +    uint32_t i;
>>>> +    int r;
>>>> +
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +            if (!adev->ip_blocks[i].status.valid ||
>>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>>> +                    continue;
>>>> +
>>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>>> +            if (r) {
>>>> +                    DRM_ERROR("post reset of IP block <%s>
>>> failed %d\n",
>>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>>> +                    return r;
>>>> +            }
>>>> +    }
>>>> +
>>>> +    return r;
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>>     *
>>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>>>               amdgpu_put_xgmi_hive(hive);
>>>>       }
>>>>
>>>> +    if (!r && !job_signaled)
>>>> +            r = amdgpu_device_ip_post_reset(adev);
>>>> +
>>>>       if (r)
>>>>               dev_info(adev->dev, "GPU reset end with ret = %d\n",
>>>> r);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>>     * @pre_soft_reset: pre soft reset the IP block
>>>>     * @soft_reset: soft reset the IP block
>>>>     * @post_soft_reset: post soft reset the IP block
>>>> + * @post_reset: handles IP specific post reset stuff(e.g., self
>>>> + test)
>>>>     * @set_clockgating_state: enable/disable cg for the IP block
>>>>     * @set_powergating_state: enable/disable pg for the IP block
>>>>     * @get_clockgating_state: get current clockgating status @@
>>>> -316,11
>>>> +317,13 @@ struct amd_ip_funcs {
>>>>       int (*pre_soft_reset)(void *handle);
>>>>       int (*soft_reset)(void *handle);
>>>>       int (*post_soft_reset)(void *handle);
>>>> +    int (*post_reset)(void *handle);
>>>>       int (*set_clockgating_state)(void *handle,
>>>>                                    enum amd_clockgating_state state);
>>>>       int (*set_powergating_state)(void *handle,
>>>>                                    enum amd_powergating_state state);
>>>>       void (*get_clockgating_state)(void *handle, u64 *flags);
>>>> +
>>>>    };
>>>>
>>>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: add post reset IP callback
  2024-04-03  7:31       ` Yu, Lang
@ 2024-04-03  8:44         ` Sharma, Shashank
  0 siblings, 0 replies; 7+ messages in thread
From: Sharma, Shashank @ 2024-04-03  8:44 UTC (permalink / raw)
  To: Yu, Lang, Christian König, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian


On 03/04/2024 09:31, Yu, Lang wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma@amd.com>
>> Sent: Wednesday, April 3, 2024 3:19 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; Christian König
>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>
>> Hey Lang,
>>
>> On 03/04/2024 08:51, Yu, Lang wrote:
>>> [AMD Official Use Only - General]
>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>> Sent: Tuesday, April 2, 2024 9:38 PM
>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>>>> <Christian.Koenig@amd.com>; Sharma, Shashank
>>>> <Shashank.Sharma@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>>>
>>>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>>>> There are use cases which need full GPU functionality (e.g., VM
>>>>> update, TLB inavildate) when doing a GPU reset.
>>>>>
>>>>> Especially, the mes/umsch self tests which help validate the hw
>>>>> state after hw init like ring/ib tests.
>>>> I noted that before but just to repeat it once more: We can't do any
>>>> MES or UMSCH validation while doing a GPU reset!
>>> Yes, we can just easily disable it if it doesn't work well.
>>> But it doesn't take too much effort to make it work.
>>> It can expose issues as soon as possible and is useful for debugging
>> purpose.
>> IMO, its not that useful for debugging as well. In case of a problem, It will just
>> timeout waiting for MES packet write and we will still have to find out the
>> actual problem which caused MES to go into bad state in the last GPU reset.
>>>> The ring and IB tests use some pre-allocated memory we put aside for
>>>> the task during driver load and so can execute during GPU reset as well.
>>> If user space can create a VM and allocate memory during GPU reset, it
>>> makes no sense to prevent kernel space from doing that.
>> I think the objection here is mostly about why to do it at all, when it is not
>> helpful. It would be just a maintenance overhead.
> If you think it is not helpful,  why doing ring/ib tests?
That's because they add value during the bootup, when we know that the 
HW is being initialized in a controlled/systematic way. But in case of a 
GPU reset, it could go anyway and could be neither of those.
> I don't think such sanity test is not helpful.
>
> I only talk UMSCH test(different with MES test) here,

Your comment above about ring tests talks about both MES/UMSCH self 
tests, so it attracts attention on both.

If its only about UMSCH tests, I will let someone else to comment on that.

- Shashank

> I don't think it has a maintenance overhead.
>
> Regards,
> Lang
>
>> - Shashank
>>
>>>> But for the MES/UMSCH we need a full blown environment with VM and
>>>> submission infrastructure and setting that up isn't possible here.
>>> At least for UMSCH test, it only uses VM mapping functionality (if we
>>> can create a VM with cpu update mode, that's enough), it doesn't use
>>> other submission functionality.
>>> It is actually a compute context.
>>>
>>>
>>> Regards,
>>> Lang
>>>
>>>> Adding Shashank as well, but I think we should probably just
>>>> completely remove those from the kernel.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Add a post reset IP callback to handle such use cases which will be
>>>>> executed after GPU reset succeeds.
>>>>>
>>>>> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>>>> ++++++++++++++++++++++
>>>>>     drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>>>     2 files changed, 27 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 12dc71a6b5db..feeab9397aab 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5556,6 +5556,27 @@ static int
>> amdgpu_device_health_check(struct
>>>> list_head *device_list_handle)
>>>>>        return ret;
>>>>>     }
>>>>>
>>>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>>>> +    uint32_t i;
>>>>> +    int r;
>>>>> +
>>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>>> +            if (!adev->ip_blocks[i].status.valid ||
>>>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>>>> +                    continue;
>>>>> +
>>>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>>>> +            if (r) {
>>>>> +                    DRM_ERROR("post reset of IP block <%s>
>>>> failed %d\n",
>>>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>>>> +                    return r;
>>>>> +            }
>>>>> +    }
>>>>> +
>>>>> +    return r;
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>>>      *
>>>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>>>                amdgpu_put_xgmi_hive(hive);
>>>>>        }
>>>>>
>>>>> +    if (!r && !job_signaled)
>>>>> +            r = amdgpu_device_ip_post_reset(adev);
>>>>> +
>>>>>        if (r)
>>>>>                dev_info(adev->dev, "GPU reset end with ret = %d\n",
>>>>> r);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>>>      * @pre_soft_reset: pre soft reset the IP block
>>>>>      * @soft_reset: soft reset the IP block
>>>>>      * @post_soft_reset: post soft reset the IP block
>>>>> + * @post_reset: handles IP specific post reset stuff(e.g., self
>>>>> + test)
>>>>>      * @set_clockgating_state: enable/disable cg for the IP block
>>>>>      * @set_powergating_state: enable/disable pg for the IP block
>>>>>      * @get_clockgating_state: get current clockgating status @@
>>>>> -316,11
>>>>> +317,13 @@ struct amd_ip_funcs {
>>>>>        int (*pre_soft_reset)(void *handle);
>>>>>        int (*soft_reset)(void *handle);
>>>>>        int (*post_soft_reset)(void *handle);
>>>>> +    int (*post_reset)(void *handle);
>>>>>        int (*set_clockgating_state)(void *handle,
>>>>>                                     enum amd_clockgating_state state);
>>>>>        int (*set_powergating_state)(void *handle,
>>>>>                                     enum amd_powergating_state state);
>>>>>        void (*get_clockgating_state)(void *handle, u64 *flags);
>>>>> +
>>>>>     };
>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: add post reset IP callback
  2024-04-03  6:51   ` Yu, Lang
  2024-04-03  7:18     ` Sharma, Shashank
@ 2024-04-03  9:40     ` Christian König
  1 sibling, 0 replies; 7+ messages in thread
From: Christian König @ 2024-04-03  9:40 UTC (permalink / raw)
  To: Yu, Lang, amd-gfx; +Cc: Deucher, Alexander, Koenig, Christian, Sharma, Shashank

Am 03.04.24 um 08:51 schrieb Yu, Lang:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Tuesday, April 2, 2024 9:38 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Sharma, Shashank
>> <Shashank.Sharma@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>
>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>> There are use cases which need full GPU functionality (e.g., VM
>>> update, TLB inavildate) when doing a GPU reset.
>>>
>>> Especially, the mes/umsch self tests which help validate the hw state
>>> after hw init like ring/ib tests.
>> I noted that before but just to repeat it once more: We can't do any MES or
>> UMSCH validation while doing a GPU reset!
> Yes, we can just easily disable it if it doesn't work well.
> But it doesn't take too much effort to make it work.

No, that is a completely false assumption. This is a fundamental problem.

Neither the MES test nor the UMSCH test will ever work correctly under a 
GPU reset.

> It can expose issues as soon as possible and is useful for debugging purpose.
>
>> The ring and IB tests use some pre-allocated memory we put aside for the
>> task during driver load and so can execute during GPU reset as well.
> If user space can create a VM and allocate memory during GPU reset,
> it makes no sense to prevent kernel space from doing that.

Yes it does. The GPU reset must re-start the hardware as soon as 
possible because memory management might wait on it.

You can create a VM and allocate memory, but this must be independent of 
the GPU reset.

And when you do this it is not valuable any more since we can't say that 
the GPU reset hasn't worked later on.

>
>> But for the MES/UMSCH we need a full blown environment with VM and
>> submission infrastructure and setting that up isn't possible here.
> At least for UMSCH test, it only uses VM mapping functionality
> (if we can create a VM with cpu update mode, that's enough),
> it doesn't use other submission functionality.
> It is actually a compute context.

Nope, that doesn't even remotely work correctly.

Regards,
Christian.

>
>
> Regards,
> Lang
>
>> Adding Shashank as well, but I think we should probably just completely
>> remove those from the kernel.
>>
>> Regards,
>> Christian.
>>
>>> Add a post reset IP callback to handle such use cases which will be
>>> executed after GPU reset succeeds.
>>>
>>> Signed-off-by: Lang Yu <Lang.Yu@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>> ++++++++++++++++++++++
>>>    drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>    2 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 12dc71a6b5db..feeab9397aab 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct
>> list_head *device_list_handle)
>>>       return ret;
>>>    }
>>>
>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>> +    uint32_t i;
>>> +    int r;
>>> +
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +            if (!adev->ip_blocks[i].status.valid ||
>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>> +                    continue;
>>> +
>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>> +            if (r) {
>>> +                    DRM_ERROR("post reset of IP block <%s>
>> failed %d\n",
>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>> +                    return r;
>>> +            }
>>> +    }
>>> +
>>> +    return r;
>>> +}
>>> +
>>>    /**
>>>     * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>     *
>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>> amdgpu_device *adev,
>>>               amdgpu_put_xgmi_hive(hive);
>>>       }
>>>
>>> +    if (!r && !job_signaled)
>>> +            r = amdgpu_device_ip_post_reset(adev);
>>> +
>>>       if (r)
>>>               dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>
>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>     * @pre_soft_reset: pre soft reset the IP block
>>>     * @soft_reset: soft reset the IP block
>>>     * @post_soft_reset: post soft reset the IP block
>>> + * @post_reset: handles IP specific post reset stuff(e.g., self test)
>>>     * @set_clockgating_state: enable/disable cg for the IP block
>>>     * @set_powergating_state: enable/disable pg for the IP block
>>>     * @get_clockgating_state: get current clockgating status @@ -316,11
>>> +317,13 @@ struct amd_ip_funcs {
>>>       int (*pre_soft_reset)(void *handle);
>>>       int (*soft_reset)(void *handle);
>>>       int (*post_soft_reset)(void *handle);
>>> +    int (*post_reset)(void *handle);
>>>       int (*set_clockgating_state)(void *handle,
>>>                                    enum amd_clockgating_state state);
>>>       int (*set_powergating_state)(void *handle,
>>>                                    enum amd_powergating_state state);
>>>       void (*get_clockgating_state)(void *handle, u64 *flags);
>>> +
>>>    };
>>>
>>>


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-04-03  9:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-28  4:40 [PATCH] drm/amdgpu: add post reset IP callback Lang Yu
2024-04-02 13:38 ` Christian König
2024-04-03  6:51   ` Yu, Lang
2024-04-03  7:18     ` Sharma, Shashank
2024-04-03  7:31       ` Yu, Lang
2024-04-03  8:44         ` Sharma, Shashank
2024-04-03  9:40     ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.