From: "Lazar, Lijo" <Lijo.Lazar@amd.com>
To: "Sharma, Shashank" <Shashank.Sharma@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Somalapuram, Amaranath" <Amaranath.Somalapuram@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Subject: RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Date: Fri, 4 Feb 2022 17:02:04 +0000
Message-ID: <BYAPR12MB4614325D69EACA459547E4F597299@BYAPR12MB4614.namprd12.prod.outlook.com>
In-Reply-To: <8bada9e4-d6d5-e11a-45fe-aea7615f3356@amd.com>

[Public]

The problem is that the app doesn't know why the reset happened. It just receives a bunch of registers to read. On what basis can an app filter this out?
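
To make the concern concrete, here is a minimal user-space sketch. The `reset_reason` enum, the `reset_event` struct and `count_resets()` are hypothetical names used only for illustration; the patch under discussion defines none of them and only dumps registers. The sketch assumes the event carries a reason field, which is the piece of information an app would need before it could count or ignore resets caused by S3/S4.

```c
#include <stddef.h>

/* Hypothetical reset reasons; not defined by the patch under review. */
enum reset_reason {
	RESET_REASON_UNKNOWN = 0,
	RESET_REASON_SUSPEND,	/* unconditional reset on S3/S4 */
	RESET_REASON_RECOVERY,	/* reset triggered by GPU recovery */
};

/* Hypothetical event layout as a profiling app might see it. */
struct reset_event {
	enum reset_reason reason;
	unsigned int num_regs;	/* number of dumped registers */
};

/* App-side filter: count the events that match one reason. */
static size_t count_resets(const struct reset_event *ev, size_t n,
			   enum reset_reason wanted)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (ev[i].reason == wanted)
			count++;
	return count;
}
```

Without something like the `reason` field, every event looks the same to the consumer, and a filter of this kind has nothing to key on.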

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com> 
Sent: Friday, February 4, 2022 10:29 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler



On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
> 
> To explain more -
> 	It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
> 
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion, it is not a good idea to apply a filter in the kernel based on our own interpretation of that intention.

For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infra might become useless. It would be better for the app to learn about and ignore the scenarios it is not interested in.

This may ultimately just be a difference in design philosophy :)

- Shashank

> Thanks,
> Lijo
> 
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@amd.com>
> Sent: Friday, February 4, 2022 10:09 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
> 
> Hey Lijo,
> I somehow missed responding to this comment; please find my reply inline:
> 
> Regards
> Shashank
> 
> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>
>>
>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>   From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>
>>> This patch adds a GPU reset handler for the Navi ASIC family, which
>>> typically dumps some of the registers and sends a trace event.
>>>
>>> V2: Accommodated call to work function to send uevent
>>>
>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>    1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> index 01efda4398e5..ada35d4c5245 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>>        }
>>>    }
>>>
>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>> +{
>>> +    int r = 0, i;
>>> +
>>> +    /* original raven doesn't have full asic reset */
>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>> +        return;
>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +        if (!adev->ip_blocks[i].status.valid)
>>> +            continue;
>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>> +            continue;
>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>> +
>>> +        if (r)
>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>> +                    adev->ip_blocks[i].version->funcs->name, r);
>>> +    }
>>> +
>>> +    /* Schedule work to send uevent */
>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>> +
>>> +    dump_stack();
>>> +}
>>> +
>>>    static int nv_asic_reset(struct amdgpu_device *adev)
>>>    {
>>>        int ret = 0;
>>>
>>> +    amdgpu_reset_dumps(adev);
>>
>> Had a comment on this before. Now there are different reasons (or 
>> even no reason like a precautionary reset) to perform reset. A user 
>> would be interested in a trace only if the reason is valid.
>>
>> To clarify on why a work shouldn't be scheduled on every reset, check 
>> here -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
> In the example you pointed to, they have criteria, on the kernel side itself, for deciding what is a valid reset in their context. So they can decide whether or not they want to do something about it.
> 
> But in our case, we want to send the trace_event to the user with some register values on every reset, and it is actually up to the profiling app to interpret it (including what it wants to call a GPU reset). So I don't think this causes considerable overhead.
> 
> - Shashank
>>
>>
>>
>> Thanks,
>> Lijo
>>
>>>        switch (nv_asic_reset_method(adev)) {
>>>        case AMD_RESET_METHOD_PCI:
>>>            dev_info(adev->dev, "PCI reset\n");
