* [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
@ 2022-01-21 20:34 Sharma, Shashank
  2022-01-22  6:42 ` Lazar, Lijo
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-21 20:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

 From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Date: Fri, 21 Jan 2022 14:19:42 +0530
Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

This patch adds a GPU reset handler for the Navi ASIC family, which
typically dumps some of the registers and sends a trace event.

V2: Accommodated call to work function to send uevent

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
  1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
b/drivers/gpu/drm/amd/amdgpu/nv.c
index 01efda4398e5..ada35d4c5245 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
  	}
  }

+static void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+	int r = 0, i;
+
+	/* original raven doesn't have full asic reset */
+	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+		!(adev->apu_flags & AMD_APU_IS_RAVEN2))
+		return;
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].status.valid)
+			continue;
+		if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+			continue;
+		r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+		if (r)
+			DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+					adev->ip_blocks[i].version->funcs->name, r);
+	}
+
+	/* Schedule work to send uevent */
+	if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
+		DRM_ERROR("failed to add GPU reset work\n");
+
+	dump_stack();
+}
+
  static int nv_asic_reset(struct amdgpu_device *adev)
  {
  	int ret = 0;

+	amdgpu_reset_dumps(adev);
  	switch (nv_asic_reset_method(adev)) {
  	case AMD_RESET_METHOD_PCI:
  		dev_info(adev->dev, "PCI reset\n");
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank
@ 2022-01-22  6:42 ` Lazar, Lijo
  2022-02-04 16:38   ` Sharma, Shashank
  2022-01-24  7:18 ` Christian König
  2022-01-24 16:32 ` Andrey Grodzovsky
  2 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-01-22  6:42 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König



On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
> [snip]
> +    /* Schedule work to send uevent */
> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
> +        DRM_ERROR("failed to add GPU reset work\n");
> +
> +    dump_stack();
> +}
> +
>   static int nv_asic_reset(struct amdgpu_device *adev)
>   {
>       int ret = 0;
> 
> +    amdgpu_reset_dumps(adev);

Had a comment on this before. Now there are different reasons (or even
no reason, like a precautionary reset) to perform a reset. A user would
be interested in a trace only if the reason is valid.

To clarify why work shouldn't be scheduled on every reset, check here -

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188


Thanks,
Lijo

>       switch (nv_asic_reset_method(adev)) {
>       case AMD_RESET_METHOD_PCI:
>           dev_info(adev->dev, "PCI reset\n");


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank
  2022-01-22  6:42 ` Lazar, Lijo
@ 2022-01-24  7:18 ` Christian König
  2022-01-24 16:50   ` Sharma, Shashank
  2022-01-24 16:32 ` Andrey Grodzovsky
  2 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2022-01-24  7:18 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath



Am 21.01.22 um 21:34 schrieb Sharma, Shashank:
> [snip]
> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
> +{
> +    int r = 0, i;

Please don't initialize variables if it isn't absolutely necessary.

> +
> +    /* original raven doesn't have full asic reset */
> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
> +        return;
> +    for (i = 0; i < adev->num_ip_blocks; i++) {
> +        if (!adev->ip_blocks[i].status.valid)
> +            continue;
> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
> +            continue;
> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
> +
> +        if (r)
> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
> +                    adev->ip_blocks[i].version->funcs->name, r);
> +    }
> +
> +    /* Schedule work to send uevent */
> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
> +        DRM_ERROR("failed to add GPU reset work\n");
> +
> +    dump_stack();
> +}
> +

I'm pretty sure that should be inside common code and not here.

In other words that is absolutely not ASIC specific at all.

Regards,
Christian.

>  static int nv_asic_reset(struct amdgpu_device *adev)
>  {
>      int ret = 0;
>
> +    amdgpu_reset_dumps(adev);
>      switch (nv_asic_reset_method(adev)) {
>      case AMD_RESET_METHOD_PCI:
>          dev_info(adev->dev, "PCI reset\n");



* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank
  2022-01-22  6:42 ` Lazar, Lijo
  2022-01-24  7:18 ` Christian König
@ 2022-01-24 16:32 ` Andrey Grodzovsky
  2022-01-24 16:38   ` Sharma, Shashank
  2 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2022-01-24 16:32 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

You probably can add the STB dump we worked on a while ago to your info 
dump - a reminder
on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html

Andrey

On 2022-01-21 15:34, Sharma, Shashank wrote:
> [snip]


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-24 16:32 ` Andrey Grodzovsky
@ 2022-01-24 16:38   ` Sharma, Shashank
  2022-01-24 17:08     ` Andrey Grodzovsky
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-24 16:38 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

Hey Andrey,
That seems like a good idea. May I know if there is a trigger for the STB
dump, or is it just infrastructure one can use when they feel the need to
dump info? Also, how reliable is the STB infra during a reset?

Regards
Shashank
On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote:
> You probably can add the STB dump we worked on a while ago to your info 
> dump - a reminder
> on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html
> 
> Andrey
> 
> On 2022-01-21 15:34, Sharma, Shashank wrote:
>> [snip]


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-24  7:18 ` Christian König
@ 2022-01-24 16:50   ` Sharma, Shashank
  0 siblings, 0 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-24 16:50 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath



On 1/24/2022 8:18 AM, Christian König wrote:
> 
> 
> Am 21.01.22 um 21:34 schrieb Sharma, Shashank:
>> [snip]
>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>> +{
>> +    int r = 0, i;
> 
> Please don't initialize variables if it isn't absolutely necessary.
> 

Agree, @Amar please check this out.

>> [snip]
> 
> I'm pretty sure that should be inside common code and not here.
> 
> In other words that is absolutely not ASIC specific at all.

Yeah, I agree. I think the initial problem for Amar was that some
offsets were only valid for particular ASICs, but nothing stops us from
making this a generic function followed by an ASIC-specific one.

We might have to move the asic.fptr to a more generic, higher-level fptr.

- Shashank

> 
> Regards,
> Christian.


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-24 16:38   ` Sharma, Shashank
@ 2022-01-24 17:08     ` Andrey Grodzovsky
  2022-01-24 17:11       ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Andrey Grodzovsky @ 2022-01-24 17:08 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

It's just an infrastructure you use when you need it.
I don't think I ever tested it during reset, but we deliberately made it
very self-reliant: you simply iterate a FIFO of the dump through the PMI3
register interface and dump out the content. It is currently supposed to
work for the NV family.

In case you encounter issues during reset, let me know and I will do my
best to resolve them.

Andrey

On 2022-01-24 11:38, Sharma, Shashank wrote:
> Hey Andrey,
> That seems like a good idea. May I know if there is a trigger for the STB
> dump, or is it just infrastructure one can use when they feel the need to
> dump info? Also, how reliable is the STB infra during a reset?
>
> Regards
> Shashank
> On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote:
>> You probably can add the STB dump we worked on a while ago to your 
>> info dump - a reminder
>> on the feature is here 
>> https://www.spinics.net/lists/amd-gfx/msg70751.html
>>
>> Andrey
>>
>> [snip]


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-24 17:08     ` Andrey Grodzovsky
@ 2022-01-24 17:11       ` Sharma, Shashank
  0 siblings, 0 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-24 17:11 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König



On 1/24/2022 6:08 PM, Andrey Grodzovsky wrote:
> It's just an infrastructure you use when you need it.
> I don't think I ever tested it during reset, but we deliberately made it
> very self-reliant: you simply iterate a FIFO of the dump through the PMI3
> register interface and dump out the content. It is currently supposed to
> work for the NV family.
>

Got it, thanks for the suggestion. Let me check the feasibility of
plugging STB into our existing design and use case.

- Shashank


> In case you encounter issues during reset let me know and I will do my 
> best to resolve them.
> 
> Andrey
> 
> On 2022-01-24 11:38, Sharma, Shashank wrote:
>> Hey Andrey,
>> That seems like a good idea. May I know if there is a trigger for the STB
>> dump, or is it just infrastructure one can use when they feel the need to
>> dump info? Also, how reliable is the STB infra during a reset?
>>
>> Regards
>> Shashank
>> On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote:
>>> You probably can add the STB dump we worked on a while ago to your 
>>> info dump - a reminder
>>> on the feature is here 
>>> https://www.spinics.net/lists/amd-gfx/msg70751.html
>>>
>>> Andrey
>>>
>>> [snip]


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-22  6:42 ` Lazar, Lijo
@ 2022-02-04 16:38   ` Sharma, Shashank
  2022-02-04 16:50     ` Lazar, Lijo
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 16:38 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

Hey Lijo,
I somehow missed responding to this comment, please find my reply inline:

Regards
Shashank

On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
> 
> 
> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>> [snip]
>>   static int nv_asic_reset(struct amdgpu_device *adev)
>>   {
>>       int ret = 0;
>>
>> +    amdgpu_reset_dumps(adev);
> 
> Had a comment on this before. Now there are different reasons (or even 
> no reason like a precautionary reset) to perform reset. A user would be 
> interested in a trace only if the reason is valid.
> 
> To clarify on why a work shouldn't be scheduled on every reset, check 
> here -
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188 
In the example you pointed to, they have a criterion to decide what is a
valid reset in their context, on the kernel side itself. So they can
decide whether they want to do something about it or not.

But in our case, we want to send the trace event to userspace with some
register values on every reset, and it is actually up to the profiling
app to interpret it (along with what it wants to call a GPU reset). So I
don't think this is causing considerable overhead.

- Shashank
> 
> 
> 
> Thanks,
> Lijo
> 
>>       switch (nv_asic_reset_method(adev)) {
>>       case AMD_RESET_METHOD_PCI:
>>           dev_info(adev->dev, "PCI reset\n");


* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 16:38   ` Sharma, Shashank
@ 2022-02-04 16:50     ` Lazar, Lijo
  2022-02-04 16:59       ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 16:50 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[AMD Official Use Only]

To explain more -
	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case, which process is going to receive the trace events?

Most likely the use case would be related to GPU recovery. Triggering a trace on every reset doesn't look like a good idea.

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:09 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

Hey Lijo,
I somehow missed to respond on this comment, pls find inline:

Regards
Shashank

On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
> 
> 
> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>  From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 
>> 2001
>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> This patch adds a GPU reset handler for Navi ASIC family, which 
>> typically dumps some of the registersand sends a trace event.
>>
>> V2: Accomodated call to work function to send uevent
>>
>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>   1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 
>> 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device 
>> *adev)
>>       }
>>   }
>>
>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) {
>> +    int r = 0, i;
>> +
>> +    /* original raven doesn't have full asic reset */
>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>> +        return;
>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>> +        if (!adev->ip_blocks[i].status.valid)
>> +            continue;
>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>> +            continue;
>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>> +
>> +        if (r)
>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>> +                    adev->ip_blocks[i].version->funcs->name, r);
>> +    }
>> +
>> +    /* Schedule work to send uevent */
>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>> +        DRM_ERROR("failed to add GPU reset work\n");
>> +
>> +    dump_stack();
>> +}
>> +
>>   static int nv_asic_reset(struct amdgpu_device *adev)
>>   {
>>       int ret = 0;
>>
>> +    amdgpu_reset_dumps(adev);
> 
> Had a comment on this before. Now there are different reasons (or even 
> no reason like a precautionary reset) to perform reset. A user would 
> be interested in a trace only if the reason is valid.
> 
> To clarify on why a work shouldn't be scheduled on every reset, check 
> here -
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
In the example you pointed to, they have a criterion to decide what counts as a valid reset in their context, on the kernel side itself. So they can decide whether they want to do something about it or not.

But in our case, we want to send the trace_event to userspace with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.

- Shashank
> 
> 
> 
> Thanks,
> Lijo
> 
>>       switch (nv_asic_reset_method(adev)) {
>>       case AMD_RESET_METHOD_PCI:
>>           dev_info(adev->dev, "PCI reset\n");

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 16:50     ` Lazar, Lijo
@ 2022-02-04 16:59       ` Sharma, Shashank
  2022-02-04 17:02         ` Lazar, Lijo
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 16:59 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian



On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
> 
> To explain more -
> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
> 
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observe carefully, we are just providing an infrastructure; the
application's intention is unknown to us. In my opinion it's not a good
idea to apply a filter in the kernel based on our interpretation of
that intention.

For example, if an app just wants to count how many resets happen due
to S3/S4 transitions, this infra might become useless. It would be a
better idea for the app to learn to ignore the scenarios it is not
interested in.

This could eventually just be a difference in design philosophy, maybe :)

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:09 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> Hey Lijo,
> I somehow missed responding to this comment; please find my reply inline:
> 
> Regards
> Shashank
> 
> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>
>>
>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>   From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00
>>> 2001
>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>
>>> This patch adds a GPU reset handler for Navi ASIC family, which
>>> typically dumps some of the registers and sends a trace event.
>>>
>>> V2: Accommodated call to work function to send uevent
>>>
>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>    1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>> 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>> *adev)
>>>        }
>>>    }
>>>
>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>> +{
>>> +    int r = 0, i;
>>> +
>>> +    /* original raven doesn't have full asic reset */
>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>> +        return;
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +        if (!adev->ip_blocks[i].status.valid)
>>> +            continue;
>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>> +            continue;
>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>> +
>>> +        if (r)
>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>> +    }
>>> +
>>> +    /* Schedule work to send uevent */
>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>> +
>>> +    dump_stack();
>>> +}
>>> +
>>>    static int nv_asic_reset(struct amdgpu_device *adev)
>>>    {
>>>        int ret = 0;
>>>
>>> +    amdgpu_reset_dumps(adev);
>>
>> Had a comment on this before. Now there are different reasons (or even
>> no reason like a precautionary reset) to perform reset. A user would
>> be interested in a trace only if the reason is valid.
>>
>> To clarify on why a work shouldn't be scheduled on every reset, check
>> here -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
> In the example you pointed to, they have a criterion to decide what counts as a valid reset in their context, on the kernel side itself. So they can decide whether they want to do something about it or not.
> 
> But in our case, we want to send the trace_event to userspace with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.
> 
> - Shashank
>>
>>
>>
>> Thanks,
>> Lijo
>>
>>>        switch (nv_asic_reset_method(adev)) {
>>>        case AMD_RESET_METHOD_PCI:
>>>            dev_info(adev->dev, "PCI reset\n");


* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 16:59       ` Sharma, Shashank
@ 2022-02-04 17:02         ` Lazar, Lijo
  2022-02-04 17:07           ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 17:02 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[Public]

The problem is that the app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis can an app filter this out?

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:29 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
> 
> To explain more -
> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
> 
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it's not a good idea to apply a filter in the kernel based on our interpretation of that intention.

For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infra might become useless. It would be a better idea for the app to learn to ignore the scenarios it is not interested in.

This could eventually just be a difference in design philosophy, maybe :)

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:09 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> Hey Lijo,
> I somehow missed responding to this comment; please find my reply inline:
> 
> Regards
> Shashank
> 
> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>
>>
>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>   From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00
>>> 2001
>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>
>>> This patch adds a GPU reset handler for Navi ASIC family, which 
>>> typically dumps some of the registers and sends a trace event.
>>>
>>> V2: Accommodated call to work function to send uevent
>>>
>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>    1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>> 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>> *adev)
>>>        }
>>>    }
>>>
>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>> +{
>>> +    int r = 0, i;
>>> +
>>> +    /* original raven doesn't have full asic reset */
>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>> +        return;
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +        if (!adev->ip_blocks[i].status.valid)
>>> +            continue;
>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>> +            continue;
>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>> +
>>> +        if (r)
>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>> +    }
>>> +
>>> +    /* Schedule work to send uevent */
>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>> +
>>> +    dump_stack();
>>> +}
>>> +
>>>    static int nv_asic_reset(struct amdgpu_device *adev)
>>>    {
>>>        int ret = 0;
>>>
>>> +    amdgpu_reset_dumps(adev);
>>
>> Had a comment on this before. Now there are different reasons (or 
>> even no reason like a precautionary reset) to perform reset. A user 
>> would be interested in a trace only if the reason is valid.
>>
>> To clarify on why a work shouldn't be scheduled on every reset, check 
>> here -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
> In the example you pointed to, they have a criterion to decide what counts as a valid reset in their context, on the kernel side itself. So they can decide whether they want to do something about it or not.
> 
> But in our case, we want to send the trace_event to userspace with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.
> 
> - Shashank
>>
>>
>>
>> Thanks,
>> Lijo
>>
>>>        switch (nv_asic_reset_method(adev)) {
>>>        case AMD_RESET_METHOD_PCI:
>>>            dev_info(adev->dev, "PCI reset\n");


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:02         ` Lazar, Lijo
@ 2022-02-04 17:07           ` Sharma, Shashank
  2022-02-04 17:11             ` Lazar, Lijo
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 17:07 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian



On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
> [Public]
> 
> The problem is that the app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis can an app filter this out?
>

Again, that is a contextual analysis capability which needs to be
embedded in the reader app. Even if we filter out the S3/S4 resets in
the kernel, the situation remains the same, doesn't it?

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:29 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> 
> 
> On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> To explain more -
>> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>>
>> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>>
> 
> If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it's not a good idea to apply a filter in the kernel based on our interpretation of that intention.
> 
> For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infra might become useless. It would be a better idea for the app to learn to ignore the scenarios it is not interested in.
> 
> This could eventually just be a difference in design philosophy, maybe :)
> 
> - Shashank
> 
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma@amd.com>
>> Sent: Friday, February 4, 2022 10:09 PM
>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram,
>> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> Hey Lijo,
>> I somehow missed responding to this comment; please find my reply inline:
>>
>> Regards
>> Shashank
>>
>> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>>    From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00
>>>> 2001
>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>>
>>>> This patch adds a GPU reset handler for Navi ASIC family, which
>>>> typically dumps some of the registers and sends a trace event.
>>>>
>>>> V2: Accommodated call to work function to send uevent
>>>>
>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>>     1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>>> 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>>> *adev)
>>>>         }
>>>>     }
>>>>
>>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +    int r = 0, i;
>>>> +
>>>> +    /* original raven doesn't have full asic reset */
>>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> +        return;
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +        if (!adev->ip_blocks[i].status.valid)
>>>> +            continue;
>>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>>> +            continue;
>>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>>> +
>>>> +        if (r)
>>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>>> +    }
>>>> +
>>>> +    /* Schedule work to send uevent */
>>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>>> +
>>>> +    dump_stack();
>>>> +}
>>>> +
>>>>     static int nv_asic_reset(struct amdgpu_device *adev)
>>>>     {
>>>>         int ret = 0;
>>>>
>>>> +    amdgpu_reset_dumps(adev);
>>>
>>> Had a comment on this before. Now there are different reasons (or
>>> even no reason like a precautionary reset) to perform reset. A user
>>> would be interested in a trace only if the reason is valid.
>>>
>>> To clarify on why a work shouldn't be scheduled on every reset, check
>>> here -
>>>
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
>> In the example you pointed to, they have a criterion to decide what counts as a valid reset in their context, on the kernel side itself. So they can decide whether they want to do something about it or not.
>>
>> But in our case, we want to send the trace_event to userspace with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.
>>
>> - Shashank
>>>
>>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>         switch (nv_asic_reset_method(adev)) {
>>>>         case AMD_RESET_METHOD_PCI:
>>>>             dev_info(adev->dev, "PCI reset\n");


* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:07           ` Sharma, Shashank
@ 2022-02-04 17:11             ` Lazar, Lijo
  2022-02-04 17:16               ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 17:11 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[AMD Official Use Only]

No; otherwise, the driver resets only for GPU recovery purposes. S3/S4 is not meant for recovery. It's just a precautionary reset to make sure that everything works fine on resume.

BTW, since this already provides a set of values, it would be useful to add one more field for the reset reason - RAS error recovery, GPU hang recovery, or something else.

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:37 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
> [Public]
> 
> The problem is that the app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis can an app filter this out?
>

Again, that is a contextual analysis capability which needs to be embedded in the reader app. Even if we filter out the S3/S4 resets in the kernel, the situation remains the same, doesn't it?

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:29 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> 
> 
> On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> To explain more -
>> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>>
>> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>>
> 
> If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it's not a good idea to apply a filter in the kernel based on our interpretation of that intention.
> 
> For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infra might become useless. It would be a better idea for the app to learn to ignore the scenarios it is not interested in.
> 
> This could eventually just be a difference in design philosophy, maybe :)
> 
> - Shashank
> 
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma@amd.com>
>> Sent: Friday, February 4, 2022 10:09 PM
>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
>> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> Hey Lijo,
>> I somehow missed responding to this comment; please find my reply inline:
>>
>> Regards
>> Shashank
>>
>> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>>    From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 
>>>> 00:00:00
>>>> 2001
>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>>
>>>> This patch adds a GPU reset handler for Navi ASIC family, which 
>>>> typically dumps some of the registers and sends a trace event.
>>>>
>>>> V2: Accommodated call to work function to send uevent
>>>>
>>>> Signed-off-by: Somalapuram Amaranath 
>>>> <Amaranath.Somalapuram@amd.com>
>>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>>     1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>>> 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>>> *adev)
>>>>         }
>>>>     }
>>>>
>>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +    int r = 0, i;
>>>> +
>>>> +    /* original raven doesn't have full asic reset */
>>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> +        return;
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +        if (!adev->ip_blocks[i].status.valid)
>>>> +            continue;
>>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>>> +            continue;
>>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>>> +
>>>> +        if (r)
>>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>>> +    }
>>>> +
>>>> +    /* Schedule work to send uevent */
>>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>>> +
>>>> +    dump_stack();
>>>> +}
>>>> +
>>>>     static int nv_asic_reset(struct amdgpu_device *adev)
>>>>     {
>>>>         int ret = 0;
>>>>
>>>> +    amdgpu_reset_dumps(adev);
>>>
>>> Had a comment on this before. Now there are different reasons (or 
>>> even no reason like a precautionary reset) to perform reset. A user 
>>> would be interested in a trace only if the reason is valid.
>>>
>>> To clarify on why a work shouldn't be scheduled on every reset, 
>>> check here -
>>>
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
>> In the example you pointed to, they have a criterion to decide what counts as a valid reset in their context, on the kernel side itself. So they can decide whether they want to do something about it or not.
>>
>> But in our case, we want to send the trace_event to userspace with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.
>>
>> - Shashank
>>>
>>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>         switch (nv_asic_reset_method(adev)) {
>>>>         case AMD_RESET_METHOD_PCI:
>>>>             dev_info(adev->dev, "PCI reset\n");


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:11             ` Lazar, Lijo
@ 2022-02-04 17:16               ` Sharma, Shashank
  2022-02-04 17:20                 ` Lazar, Lijo
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 17:16 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian



On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
> BTW, since this already provides a set of values, it would be useful to add one more field for the reset reason - RAS error recovery, GPU hang recovery, or something else.

Adding this additional parameter, instead of blocking something in the
kernel, seems like a better idea. The app can filter out and read what
it is interested in.

- Shashank


* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:16               ` Sharma, Shashank
@ 2022-02-04 17:20                 ` Lazar, Lijo
  2022-02-04 17:22                   ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 17:20 UTC (permalink / raw)
  To: Sharma, Shashank, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[AMD Official Use Only]

One more thing:
	In the suspend-reset case, won't this schedule a work item on suspend? I don't know if that is a good idea; ideally we would like to clean up all work items before going to suspend.

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:47 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
> BTW, since this already provides a set of values, it would be useful to add one more field for the reset reason - RAS error recovery, GPU hang recovery, or something else.

Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter out and read what it is interested in.

- Shashank


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:20                 ` Lazar, Lijo
@ 2022-02-04 17:22                   ` Sharma, Shashank
  2022-02-04 18:41                     ` Deucher, Alexander
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 17:22 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian



On 2/4/2022 6:20 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
> 
> One more thing:
> 	In the suspend-reset case, won't this schedule a work item on suspend? I don't know if that is a good idea; ideally we would like to clean up all work items before going to suspend.
> 
> Thanks,
> Lijo

Again, this opens scope for discussion. What if there is a GPU error
during suspend-reset, which is a very probable case?

- Shashank

> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:47 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> 
> 
> On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
>> BTW, since this already provides a set of values, it would be useful to add one more field for the reset reason - RAS error recovery, GPU hang recovery, or something else.
> 
> Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter out and read what it is interested in.
> 
> - Shashank


* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:22                   ` Sharma, Shashank
@ 2022-02-04 18:41                     ` Deucher, Alexander
  2022-02-04 18:44                       ` Deucher, Alexander
  0 siblings, 1 reply; 20+ messages in thread
From: Deucher, Alexander @ 2022-02-04 18:41 UTC (permalink / raw)
  To: Sharma, Shashank, Lazar, Lijo, amd-gfx
  Cc: Somalapuram, Amaranath, Koenig, Christian


[Public]

In the suspend and hibernate cases, we don't care. In most cases the power rail will be cut once the system enters suspend, so it doesn't really matter. That's why we call the asic reset callback directly rather than going through the whole recovery process. The device is already quiescent at this point; we just want to make sure the device is in a known state when we come out of suspend (in case suspend overall fails).

Alex


________________________________
From: Sharma, Shashank <Shashank.Sharma@amd.com>
Sent: Friday, February 4, 2022 12:22 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 6:20 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> One more thing:
>        In the suspend-reset case, won't this schedule a work item on suspend? I don't know if that is a good idea; ideally we would like to clean up all work items before going to suspend.
>
> Thanks,
> Lijo

Again, this opens scope for discussion. What if there is a GPU error
during suspend-reset, which is a very probable case?

- Shashank

>
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:47 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>
>
>
> On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
>> BTW, since this is already providing a set of values, it would be useful to provide one more field for the reset reason - RAS error recovery, GPU hang recovery or something else.
>
> Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter and read what it is interested in.
>
> - Shashank



* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 18:41                     ` Deucher, Alexander
@ 2022-02-04 18:44                       ` Deucher, Alexander
  2022-02-05  7:00                         ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Deucher, Alexander @ 2022-02-04 18:44 UTC (permalink / raw)
  To: Sharma, Shashank, Lazar, Lijo, amd-gfx
  Cc: Somalapuram, Amaranath, Koenig, Christian


[Public]

Seems like this functionality should be moved up into the callers.  Maybe add new IP callbacks (dump_reset_registers()) so that each IP can specify what registers are relevant for reset debugging, and then we can walk the IP list and call the callback before we call the asic_reset callbacks.

Alex

________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Deucher, Alexander <Alexander.Deucher@amd.com>
Sent: Friday, February 4, 2022 1:41 PM
To: Sharma, Shashank <Shashank.Sharma@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler


[Public]

In the suspend and hibernate cases, we don't care.  In most cases the power rail will be cut once the system enters suspend so it doesn't really matter.  That's why we call the asic reset callback directly rather than going through the whole recovery process.  The device is already quiescent at this point; we just want to make sure the device is in a known state when we come out of suspend (in case suspend overall fails).

Alex


________________________________
From: Sharma, Shashank <Shashank.Sharma@amd.com>
Sent: Friday, February 4, 2022 12:22 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 6:20 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> One more thing
>        In the suspend-reset case, won't this cause a work item to be scheduled on suspend? I don't know if that is a good idea; ideally we would like to clean up all work items before going to suspend.
>
> Thanks,
> Lijo

Again, this opens scope for discussion. What if there is a GPU error
during suspend-reset, which is a very probable case?

- Shashank

>
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:47 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>
>
>
> On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
>> BTW, since this is already providing a set of values, it would be useful to provide one more field for the reset reason - RAS error recovery, GPU hang recovery or something else.
>
> Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter and read what it is interested in.
>
> - Shashank



* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 18:44                       ` Deucher, Alexander
@ 2022-02-05  7:00                         ` Sharma, Shashank
  0 siblings, 0 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-05  7:00 UTC (permalink / raw)
  To: Deucher, Alexander, Lazar, Lijo, amd-gfx
  Cc: Somalapuram, Amaranath, Koenig, Christian

Hey Alex,
Agreed, we are moving it up; Christian also had the same feedback.

- Shashank

On 2/4/2022 7:44 PM, Deucher, Alexander wrote:
> [Public]
> 
> 
> Seems like this functionality should be moved up into the callers.  
> Maybe add new IP callbacks (dump_reset_registers()) so that each IP can 
> specify what registers are relevant for reset debugging and then we 
> can walk the IP list and call the callback before we call the asic_reset 
> callbacks.
> 
> Alex
> 
> ------------------------------------------------------------------------
> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of 
> Deucher, Alexander <Alexander.Deucher@amd.com>
> *Sent:* Friday, February 4, 2022 1:41 PM
> *To:* Sharma, Shashank <Shashank.Sharma@amd.com>; Lazar, Lijo 
> <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org 
> <amd-gfx@lists.freedesktop.org>
> *Cc:* Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, 
> Christian <Christian.Koenig@amd.com>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> [Public]
> 
> 
> In the suspend and hibernate cases, we don't care.  In most cases the 
> power rail will be cut once the system enters suspend so it doesn't 
> really matter.  That's why we call the asic reset callback directly 
> rather than going through the whole recovery process. The device is 
> already quiescent at this point; we just want to make sure the device is 
> in a known state when we come out of suspend (in case suspend overall 
> fails).
> 
> Alex
> 
> 
> ------------------------------------------------------------------------
> *From:* Sharma, Shashank <Shashank.Sharma@amd.com>
> *Sent:* Friday, February 4, 2022 12:22 PM
> *To:* Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org 
> <amd-gfx@lists.freedesktop.org>
> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> 
> On 2/4/2022 6:20 PM, Lazar, Lijo wrote:
>> [AMD Official Use Only]
>> 
>> One more thing
>>        In the suspend-reset case, won't this cause a work item to be scheduled on suspend? I don't know if that is a good idea; ideally we would like to clean up all work items before going to suspend.
>> 
>> Thanks,
>> Lijo
> 
> Again, this opens scope for discussion. What if there is a GPU error
> during suspend-reset, which is a very probable case?
> 
> - Shashank
> 
>> 
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma@amd.com>
>> Sent: Friday, February 4, 2022 10:47 PM
>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>> 
>> 
>> 
>> On 2/4/2022 6:11 PM, Lazar, Lijo wrote:
>>> BTW, since this is already providing a set of values, it would be useful to provide one more field for the reset reason - RAS error recovery, GPU hang recovery or something else.
>> 
>> Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter and read what it is interested in.
>> 
>> - Shashank


end of thread, other threads:[~2022-02-05  7:00 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank
2022-01-22  6:42 ` Lazar, Lijo
2022-02-04 16:38   ` Sharma, Shashank
2022-02-04 16:50     ` Lazar, Lijo
2022-02-04 16:59       ` Sharma, Shashank
2022-02-04 17:02         ` Lazar, Lijo
2022-02-04 17:07           ` Sharma, Shashank
2022-02-04 17:11             ` Lazar, Lijo
2022-02-04 17:16               ` Sharma, Shashank
2022-02-04 17:20                 ` Lazar, Lijo
2022-02-04 17:22                   ` Sharma, Shashank
2022-02-04 18:41                     ` Deucher, Alexander
2022-02-04 18:44                       ` Deucher, Alexander
2022-02-05  7:00                         ` Sharma, Shashank
2022-01-24  7:18 ` Christian König
2022-01-24 16:50   ` Sharma, Shashank
2022-01-24 16:32 ` Andrey Grodzovsky
2022-01-24 16:38   ` Sharma, Shashank
2022-01-24 17:08     ` Andrey Grodzovsky
2022-01-24 17:11       ` Sharma, Shashank
