RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

From: "Lazar, Lijo" <Lijo.Lazar@amd.com>
To: "Sharma, Shashank" <Shashank.Sharma@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Somalapuram, Amaranath" <Amaranath.Somalapuram@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Subject: RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Date: Fri, 4 Feb 2022 17:02:04 +0000	[thread overview]
Message-ID: <BYAPR12MB4614325D69EACA459547E4F597299@BYAPR12MB4614.namprd12.prod.outlook.com> (raw)
In-Reply-To: <8bada9e4-d6d5-e11a-45fe-aea7615f3356@amd.com>

[Public]

The problem is app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis an app can filter this out?

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:29 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
> 
> To explain more -
> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
> 
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observer carefully, we are just providing an infrastructure, the application's intention is unknown to us. In my opinion it's rather not a good idea to apply a filter in kernel, with our interpretation of intention.

For example if an app just wants to count how many resets are happening due to S3/S4 transition, this infra might become useless. It would rather be a better idea for the app to learn and ignore these scenarios which it is not interested in.

This could eventually be just difference in design philosophy maybe :)

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:09 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> Hey Lijo,
> I somehow missed to respond on this comment, pls find inline:
> 
> Regards
> Shashank
> 
> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>
>>
>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>   From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00
>>> 2001
>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>
>>> This patch adds a GPU reset handler for Navi ASIC family, which 
>>> typically dumps some of the registersand sends a trace event.
>>>
>>> V2: Accomodated call to work function to send uevent
>>>
>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>    1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>> 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>> *adev)
>>>        }
>>>    }
>>>
>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) {
>>> +    int r = 0, i;
>>> +
>>> +    /* original raven doesn't have full asic reset */
>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>> +        return;
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +        if (!adev->ip_blocks[i].status.valid)
>>> +            continue;
>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>> +            continue;
>>> +        r =
>>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>> +
>>> +        if (r)
>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
>>> +%d\n",
>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>> +    }
>>> +
>>> +    /* Schedule work to send uevent */
>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>> +
>>> +    dump_stack();
>>> +}
>>> +
>>>    static int nv_asic_reset(struct amdgpu_device *adev)
>>>    {
>>>        int ret = 0;
>>>
>>> +    amdgpu_reset_dumps(adev);
>>
>> Had a comment on this before. Now there are different reasons (or 
>> even no reason like a precautionary reset) to perform reset. A user 
>> would be interested in a trace only if the reason is valid.
>>
>> To clarify on why a work shouldn't be scheduled on every reset, check 
>> here -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/am
>> d
>> gpu/amdgpu_drv.c#L2188
> In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not.
> 
> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead.
> 
> - Shashank
>>
>>
>>
>> Thanks,
>> Lijo
>>
>>>        switch (nv_asic_reset_method(adev)) {
>>>        case AMD_RESET_METHOD_PCI:
>>>            dev_info(adev->dev, "PCI reset\n");