* [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
@ 2022-01-21 20:34 Sharma, Shashank
  2022-01-22  6:42 ` Lazar, Lijo
  ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-21 20:34 UTC (permalink / raw)
To: amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Date: Fri, 21 Jan 2022 14:19:42 +0530
Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

This patch adds a GPU reset handler for the Navi ASIC family, which
typically dumps some of the registers and sends a trace event.

V2: Accommodated call to work function to send uevent

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 01efda4398e5..ada35d4c5245 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
 	}
 }
 
+static void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+	int r = 0, i;
+
+	/* original raven doesn't have full asic reset */
+	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+	    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+		return;
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].status.valid)
+			continue;
+		if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+			continue;
+		r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+		if (r)
+			DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+				  adev->ip_blocks[i].version->funcs->name, r);
+	}
+
+	/* Schedule work to send uevent */
+	if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
+		DRM_ERROR("failed to add GPU reset work\n");
+
+	dump_stack();
+}
+
 static int nv_asic_reset(struct amdgpu_device *adev)
 {
 	int ret = 0;
 
+	amdgpu_reset_dumps(adev);
 	switch (nv_asic_reset_method(adev)) {
 	case AMD_RESET_METHOD_PCI:
 		dev_info(adev->dev, "PCI reset\n");
-- 
2.32.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank @ 2022-01-22 6:42 ` Lazar, Lijo 2022-02-04 16:38 ` Sharma, Shashank 2022-01-24 7:18 ` Christian König 2022-01-24 16:32 ` Andrey Grodzovsky 2 siblings, 1 reply; 20+ messages in thread From: Lazar, Lijo @ 2022-01-22 6:42 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König On 1/22/2022 2:04 AM, Sharma, Shashank wrote: > From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 > From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Date: Fri, 21 Jan 2022 14:19:42 +0530 > Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > This patch adds a GPU reset handler for Navi ASIC family, which > typically dumps some of the registersand sends a trace event. > > V2: Accomodated call to work function to send uevent > > Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c > b/drivers/gpu/drm/amd/amdgpu/nv.c > index 01efda4398e5..ada35d4c5245 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nv.c > +++ b/drivers/gpu/drm/amd/amdgpu/nv.c > @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) > } > } > > +static void amdgpu_reset_dumps(struct amdgpu_device *adev) > +{ > + int r = 0, i; > + > + /* original raven doesn't have full asic reset */ > + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && > + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) > + return; > + for (i = 0; i < adev->num_ip_blocks; i++) { > + if (!adev->ip_blocks[i].status.valid) > + continue; > + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) > + continue; > + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); > + > + if (r) > + 
DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
> +			adev->ip_blocks[i].version->funcs->name, r);
> +	}
> +
> +	/* Schedule work to send uevent */
> +	if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
> +		DRM_ERROR("failed to add GPU reset work\n");
> +
> +	dump_stack();
> +}
> +
>  static int nv_asic_reset(struct amdgpu_device *adev)
>  {
>  	int ret = 0;
>
> +	amdgpu_reset_dumps(adev);

Had a comment on this before. Now there are different reasons (or even
no reason, like a precautionary reset) to perform a reset. A user would
be interested in a trace only if the reason is valid.

To clarify why a work item shouldn't be scheduled on every reset, check
here -

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188

Thanks,
Lijo

> 	switch (nv_asic_reset_method(adev)) {
> 	case AMD_RESET_METHOD_PCI:
> 		dev_info(adev->dev, "PCI reset\n");

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-22 6:42 ` Lazar, Lijo @ 2022-02-04 16:38 ` Sharma, Shashank 2022-02-04 16:50 ` Lazar, Lijo 0 siblings, 1 reply; 20+ messages in thread From: Sharma, Shashank @ 2022-02-04 16:38 UTC (permalink / raw) To: Lazar, Lijo, amd-gfx Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König Hey Lijo, I somehow missed to respond on this comment, pls find inline: Regards Shashank On 1/22/2022 7:42 AM, Lazar, Lijo wrote: > > > On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 >> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Date: Fri, 21 Jan 2022 14:19:42 +0530 >> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> This patch adds a GPU reset handler for Navi ASIC family, which >> typically dumps some of the registersand sends a trace event. >> >> V2: Accomodated call to work function to send uevent >> >> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >> b/drivers/gpu/drm/amd/amdgpu/nv.c >> index 01efda4398e5..ada35d4c5245 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) >> } >> } >> >> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) >> +{ >> + int r = 0, i; >> + >> + /* original raven doesn't have full asic reset */ >> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >> + return; >> + for (i = 0; i < adev->num_ip_blocks; i++) { >> + if (!adev->ip_blocks[i].status.valid) >> + continue; >> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >> + continue; >> + r = 
adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >> + >> + if (r) >> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", >> + adev->ip_blocks[i].version->funcs->name, r); >> + } >> + >> + /* Schedule work to send uevent */ >> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >> + DRM_ERROR("failed to add GPU reset work\n"); >> + >> + dump_stack(); >> +} >> + >> static int nv_asic_reset(struct amdgpu_device *adev) >> { >> int ret = 0; >> >> + amdgpu_reset_dumps(adev); > > Had a comment on this before. Now there are different reasons (or even > no reason like a precautionary reset) to perform reset. A user would be > interested in a trace only if the reason is valid. > > To clarify on why a work shouldn't be scheduled on every reset, check > here - > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188 In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. - Shashank > > > > Thanks, > Lijo > >> switch (nv_asic_reset_method(adev)) { >> case AMD_RESET_METHOD_PCI: >> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 16:38 ` Sharma, Shashank @ 2022-02-04 16:50 ` Lazar, Lijo 2022-02-04 16:59 ` Sharma, Shashank 0 siblings, 1 reply; 20+ messages in thread From: Lazar, Lijo @ 2022-02-04 16:50 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian [AMD Official Use Only] To explain more - It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events? Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea. Thanks, Lijo -----Original Message----- From: Sharma, Shashank <Shashank.Sharma@amd.com> Sent: Friday, February 4, 2022 10:09 PM To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Hey Lijo, I somehow missed to respond on this comment, pls find inline: Regards Shashank On 1/22/2022 7:42 AM, Lazar, Lijo wrote: > > > On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 >> 2001 >> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Date: Fri, 21 Jan 2022 14:19:42 +0530 >> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> This patch adds a GPU reset handler for Navi ASIC family, which >> typically dumps some of the registersand sends a trace event. 
>> >> V2: Accomodated call to work function to send uevent >> >> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 >> 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device >> *adev) >> } >> } >> >> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) { >> + int r = 0, i; >> + >> + /* original raven doesn't have full asic reset */ >> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >> + return; >> + for (i = 0; i < adev->num_ip_blocks; i++) { >> + if (!adev->ip_blocks[i].status.valid) >> + continue; >> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >> + continue; >> + r = >> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >> + >> + if (r) >> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed >> +%d\n", >> + adev->ip_blocks[i].version->funcs->name, r); >> + } >> + >> + /* Schedule work to send uevent */ >> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >> + DRM_ERROR("failed to add GPU reset work\n"); >> + >> + dump_stack(); >> +} >> + >> static int nv_asic_reset(struct amdgpu_device *adev) >> { >> int ret = 0; >> >> + amdgpu_reset_dumps(adev); > > Had a comment on this before. Now there are different reasons (or even > no reason like a precautionary reset) to perform reset. A user would > be interested in a trace only if the reason is valid. 
> > To clarify on why a work shouldn't be scheduled on every reset, check > here - > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amd > gpu/amdgpu_drv.c#L2188 In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. - Shashank > > > > Thanks, > Lijo > >> switch (nv_asic_reset_method(adev)) { >> case AMD_RESET_METHOD_PCI: >> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 16:50         ` Lazar, Lijo
@ 2022-02-04 16:59           ` Sharma, Shashank
  2022-02-04 17:02             ` Lazar, Lijo
  0 siblings, 1 reply; 20+ messages in thread
From: Sharma, Shashank @ 2022-02-04 16:59 UTC (permalink / raw)
To: Lazar, Lijo, amd-gfx
Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> To explain more -
> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observe carefully, we are just providing an infrastructure; the
application's intention is unknown to us. In my opinion it's rather not a
good idea to apply a filter in the kernel based on our interpretation of
that intention.

For example, if an app just wants to count how many resets are happening
due to S3/S4 transitions, this infra might become useless. It would rather
be a better idea for the app to learn and ignore the scenarios it is not
interested in.
This could eventually be just difference in design philosophy maybe :) - Shashank > Thanks, > Lijo > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:09 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > Hey Lijo, > I somehow missed to respond on this comment, pls find inline: > > Regards > Shashank > > On 1/22/2022 7:42 AM, Lazar, Lijo wrote: >> >> >> On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 >>> 2001 >>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Date: Fri, 21 Jan 2022 14:19:42 +0530 >>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >>> >>> This patch adds a GPU reset handler for Navi ASIC family, which >>> typically dumps some of the registersand sends a trace event. 
>>> >>> V2: Accomodated call to work function to send uevent >>> >>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >>> 1 file changed, 28 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 >>> 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device >>> *adev) >>> } >>> } >>> >>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) { >>> + int r = 0, i; >>> + >>> + /* original raven doesn't have full asic reset */ >>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>> + return; >>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>> + if (!adev->ip_blocks[i].status.valid) >>> + continue; >>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>> + continue; >>> + r = >>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>> + >>> + if (r) >>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed >>> +%d\n", >>> + adev->ip_blocks[i].version->funcs->name, r); >>> + } >>> + >>> + /* Schedule work to send uevent */ >>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >>> + DRM_ERROR("failed to add GPU reset work\n"); >>> + >>> + dump_stack(); >>> +} >>> + >>> static int nv_asic_reset(struct amdgpu_device *adev) >>> { >>> int ret = 0; >>> >>> + amdgpu_reset_dumps(adev); >> >> Had a comment on this before. Now there are different reasons (or even >> no reason like a precautionary reset) to perform reset. A user would >> be interested in a trace only if the reason is valid. 
>> >> To clarify on why a work shouldn't be scheduled on every reset, check >> here - >> >> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amd >> gpu/amdgpu_drv.c#L2188 > In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. > > But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. > > - Shashank >> >> >> >> Thanks, >> Lijo >> >>> switch (nv_asic_reset_method(adev)) { >>> case AMD_RESET_METHOD_PCI: >>> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 16:59           ` Sharma, Shashank
@ 2022-02-04 17:02             ` Lazar, Lijo
  2022-02-04 17:07               ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 17:02 UTC (permalink / raw)
To: Sharma, Shashank, amd-gfx
Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[Public]

The problem is that the app doesn't know why the reset happened. It just
receives a bunch of registers to be read. On what basis can an app filter
this out?

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com>
Sent: Friday, February 4, 2022 10:29 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> To explain more -
> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>

If you observe carefully, we are just providing an infrastructure; the
application's intention is unknown to us. In my opinion it's rather not a
good idea to apply a filter in the kernel based on our interpretation of
that intention.

For example, if an app just wants to count how many resets are happening
due to S3/S4 transitions, this infra might become useless. It would rather
be a better idea for the app to learn and ignore the scenarios it is not
interested in.
This could eventually be just difference in design philosophy maybe :) - Shashank > Thanks, > Lijo > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:09 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, > Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > Hey Lijo, > I somehow missed to respond on this comment, pls find inline: > > Regards > Shashank > > On 1/22/2022 7:42 AM, Lazar, Lijo wrote: >> >> >> On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 >>> 2001 >>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Date: Fri, 21 Jan 2022 14:19:42 +0530 >>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >>> >>> This patch adds a GPU reset handler for Navi ASIC family, which >>> typically dumps some of the registersand sends a trace event. 
>>> >>> V2: Accomodated call to work function to send uevent >>> >>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >>> 1 file changed, 28 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 >>> 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device >>> *adev) >>> } >>> } >>> >>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) { >>> + int r = 0, i; >>> + >>> + /* original raven doesn't have full asic reset */ >>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>> + return; >>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>> + if (!adev->ip_blocks[i].status.valid) >>> + continue; >>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>> + continue; >>> + r = >>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>> + >>> + if (r) >>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed >>> +%d\n", >>> + adev->ip_blocks[i].version->funcs->name, r); >>> + } >>> + >>> + /* Schedule work to send uevent */ >>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >>> + DRM_ERROR("failed to add GPU reset work\n"); >>> + >>> + dump_stack(); >>> +} >>> + >>> static int nv_asic_reset(struct amdgpu_device *adev) >>> { >>> int ret = 0; >>> >>> + amdgpu_reset_dumps(adev); >> >> Had a comment on this before. Now there are different reasons (or >> even no reason like a precautionary reset) to perform reset. A user >> would be interested in a trace only if the reason is valid. 
>> >> To clarify on why a work shouldn't be scheduled on every reset, check >> here - >> >> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/am >> d >> gpu/amdgpu_drv.c#L2188 > In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. > > But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. > > - Shashank >> >> >> >> Thanks, >> Lijo >> >>> switch (nv_asic_reset_method(adev)) { >>> case AMD_RESET_METHOD_PCI: >>> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 17:02 ` Lazar, Lijo @ 2022-02-04 17:07 ` Sharma, Shashank 2022-02-04 17:11 ` Lazar, Lijo 0 siblings, 1 reply; 20+ messages in thread From: Sharma, Shashank @ 2022-02-04 17:07 UTC (permalink / raw) To: Lazar, Lijo, amd-gfx Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian On 2/4/2022 6:02 PM, Lazar, Lijo wrote: > [Public] > > The problem is app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis an app can filter this out? > Again, that is contextual analysis capability, which needs to be embedded in the reader app. Even if we filter out the S3/S4 resets in the kernel, the situation remains the same, isn't it ? - Shashank > Thanks, > Lijo > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:29 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > > On 2/4/2022 5:50 PM, Lazar, Lijo wrote: >> [AMD Official Use Only] >> >> To explain more - >> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events? >> >> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea. >> > > If you observer carefully, we are just providing an infrastructure, the application's intention is unknown to us. In my opinion it's rather not a good idea to apply a filter in kernel, with our interpretation of intention. > > For example if an app just wants to count how many resets are happening due to S3/S4 transition, this infra might become useless. 
It would rather be a better idea for the app to learn and ignore these scenarios which it is not interested in. > > This could eventually be just difference in design philosophy maybe :) > > - Shashank > >> Thanks, >> Lijo >> >> -----Original Message----- >> From: Sharma, Shashank <Shashank.Sharma@amd.com> >> Sent: Friday, February 4, 2022 10:09 PM >> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, >> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com> >> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> Hey Lijo, >> I somehow missed to respond on this comment, pls find inline: >> >> Regards >> Shashank >> >> On 1/22/2022 7:42 AM, Lazar, Lijo wrote: >>> >>> >>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 >>>> 2001 >>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>>> Date: Fri, 21 Jan 2022 14:19:42 +0530 >>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >>>> >>>> This patch adds a GPU reset handler for Navi ASIC family, which >>>> typically dumps some of the registersand sends a trace event. 
>>>> >>>> V2: Accomodated call to work function to send uevent >>>> >>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >>>> 1 file changed, 28 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 >>>> 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device >>>> *adev) >>>> } >>>> } >>>> >>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) { >>>> + int r = 0, i; >>>> + >>>> + /* original raven doesn't have full asic reset */ >>>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>>> + return; >>>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>>> + if (!adev->ip_blocks[i].status.valid) >>>> + continue; >>>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>>> + continue; >>>> + r = >>>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>>> + >>>> + if (r) >>>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed >>>> +%d\n", >>>> + adev->ip_blocks[i].version->funcs->name, r); >>>> + } >>>> + >>>> + /* Schedule work to send uevent */ >>>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >>>> + DRM_ERROR("failed to add GPU reset work\n"); >>>> + >>>> + dump_stack(); >>>> +} >>>> + >>>> static int nv_asic_reset(struct amdgpu_device *adev) >>>> { >>>> int ret = 0; >>>> >>>> + amdgpu_reset_dumps(adev); >>> >>> Had a comment on this before. Now there are different reasons (or >>> even no reason like a precautionary reset) to perform reset. A user >>> would be interested in a trace only if the reason is valid. 
>>> >>> To clarify on why a work shouldn't be scheduled on every reset, check >>> here - >>> >>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/am >>> d >>> gpu/amdgpu_drv.c#L2188 >> In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. >> >> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. >> >> - Shashank >>> >>> >>> >>> Thanks, >>> Lijo >>> >>>> switch (nv_asic_reset_method(adev)) { >>>> case AMD_RESET_METHOD_PCI: >>>> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-02-04 17:07               ` Sharma, Shashank
@ 2022-02-04 17:11                 ` Lazar, Lijo
  2022-02-04 17:16                   ` Sharma, Shashank
  0 siblings, 1 reply; 20+ messages in thread
From: Lazar, Lijo @ 2022-02-04 17:11 UTC (permalink / raw)
To: Sharma, Shashank, amd-gfx
Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian

[AMD Official Use Only]

No; otherwise, the driver resets only for GPU recovery purposes. S3/S4 is
not meant for recovery. It's just a precautionary reset to make sure that
everything works fine on resume.

BTW, since this is already providing a set of values, it would be useful
to provide one more field as the reset reason - RAS error recovery, GPU
hang recovery or something else.

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@amd.com>
Sent: Friday, February 4, 2022 10:37 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
> [Public]
>
> The problem is app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis an app can filter this out?
>

Again, that is contextual analysis capability, which needs to be embedded
in the reader app. Even if we filter out the S3/S4 resets in the kernel,
the situation remains the same, isn't it?
- Shashank > Thanks, > Lijo > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:29 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, > Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > > On 2/4/2022 5:50 PM, Lazar, Lijo wrote: >> [AMD Official Use Only] >> >> To explain more - >> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events? >> >> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea. >> > > If you observer carefully, we are just providing an infrastructure, the application's intention is unknown to us. In my opinion it's rather not a good idea to apply a filter in kernel, with our interpretation of intention. > > For example if an app just wants to count how many resets are happening due to S3/S4 transition, this infra might become useless. It would rather be a better idea for the app to learn and ignore these scenarios which it is not interested in. 
> > This could eventually be just difference in design philosophy maybe :) > > - Shashank > >> Thanks, >> Lijo >> >> -----Original Message----- >> From: Sharma, Shashank <Shashank.Sharma@amd.com> >> Sent: Friday, February 4, 2022 10:09 PM >> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, >> Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com> >> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> Hey Lijo, >> I somehow missed to respond on this comment, pls find inline: >> >> Regards >> Shashank >> >> On 1/22/2022 7:42 AM, Lazar, Lijo wrote: >>> >>> >>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote: >>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 >>>> 00:00:00 >>>> 2001 >>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>>> Date: Fri, 21 Jan 2022 14:19:42 +0530 >>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >>>> >>>> This patch adds a GPU reset handler for Navi ASIC family, which >>>> typically dumps some of the registersand sends a trace event. 
>>>> >>>> V2: Accomodated call to work function to send uevent >>>> >>>> Signed-off-by: Somalapuram Amaranath >>>> <Amaranath.Somalapuram@amd.com> >>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >>>> 1 file changed, 28 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245 >>>> 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device >>>> *adev) >>>> } >>>> } >>>> >>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) { >>>> + int r = 0, i; >>>> + >>>> + /* original raven doesn't have full asic reset */ >>>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>>> + return; >>>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>>> + if (!adev->ip_blocks[i].status.valid) >>>> + continue; >>>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>>> + continue; >>>> + r = >>>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>>> + >>>> + if (r) >>>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed >>>> +%d\n", >>>> + adev->ip_blocks[i].version->funcs->name, r); >>>> + } >>>> + >>>> + /* Schedule work to send uevent */ >>>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >>>> + DRM_ERROR("failed to add GPU reset work\n"); >>>> + >>>> + dump_stack(); >>>> +} >>>> + >>>> static int nv_asic_reset(struct amdgpu_device *adev) >>>> { >>>> int ret = 0; >>>> >>>> + amdgpu_reset_dumps(adev); >>> >>> Had a comment on this before. Now there are different reasons (or >>> even no reason like a precautionary reset) to perform reset. A user >>> would be interested in a trace only if the reason is valid. 
>>> >>> To clarify on why a work shouldn't be scheduled on every reset, >>> check here - >>> >>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/a >>> m >>> d >>> gpu/amdgpu_drv.c#L2188 >> In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not. >> >> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead. >> >> - Shashank >>> >>> >>> >>> Thanks, >>> Lijo >>> >>>> switch (nv_asic_reset_method(adev)) { >>>> case AMD_RESET_METHOD_PCI: >>>> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 17:11 ` Lazar, Lijo @ 2022-02-04 17:16 ` Sharma, Shashank 2022-02-04 17:20 ` Lazar, Lijo 0 siblings, 1 reply; 20+ messages in thread From: Sharma, Shashank @ 2022-02-04 17:16 UTC (permalink / raw) To: Lazar, Lijo, amd-gfx Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian On 2/4/2022 6:11 PM, Lazar, Lijo wrote: > BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. Adding this additional parameter, instead of blocking something in the kernel, seems like a better idea. The app can filter out and read what it is interested in. - Shashank ^ permalink raw reply [flat|nested] 20+ messages in thread
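[Editorial note: a minimal C sketch of the userspace-filtering idea discussed above. The reason enum, event layout, and function names are hypothetical illustrations, not the actual amdgpu trace-event ABI, which was still under discussion in this thread.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reset reasons a future trace event could carry;
 * these names are illustrative, not real amdgpu identifiers. */
enum reset_reason {
	RESET_REASON_GPU_HANG,
	RESET_REASON_RAS_RECOVERY,
	RESET_REASON_SUSPEND,	/* precautionary S3/S4 reset */
};

struct reset_event {
	enum reset_reason reason;
	unsigned int reg_count;	/* number of dumped registers attached */
};

/* The profiling app does the filtering instead of the kernel deciding
 * for it: this example app only cares about real recovery resets. */
static bool app_wants(const struct reset_event *ev)
{
	return ev->reason != RESET_REASON_SUSPEND;
}
```

With a reason field in the event, the kernel can keep emitting a trace on every reset while each listener applies its own policy.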
* RE: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 17:16 ` Sharma, Shashank @ 2022-02-04 17:20 ` Lazar, Lijo 2022-02-04 17:22 ` Sharma, Shashank 0 siblings, 1 reply; 20+ messages in thread From: Lazar, Lijo @ 2022-02-04 17:20 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian [AMD Official Use Only] One more thing In suspend-reset case, won't this cause to schedule a work item on suspend? I don't know if that is a good idea, ideally we would like to clean up all work items before going to suspend. Thanks, Lijo -----Original Message----- From: Sharma, Shashank <Shashank.Sharma@amd.com> Sent: Friday, February 4, 2022 10:47 PM To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler On 2/4/2022 6:11 PM, Lazar, Lijo wrote: > BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. Adding this additional parameter instead of blocking something in kernel, seems like a better idea. The app can filter out and read what it is interested into. - Shashank ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 17:20 ` Lazar, Lijo @ 2022-02-04 17:22 ` Sharma, Shashank 2022-02-04 18:41 ` Deucher, Alexander 0 siblings, 1 reply; 20+ messages in thread From: Sharma, Shashank @ 2022-02-04 17:22 UTC (permalink / raw) To: Lazar, Lijo, amd-gfx Cc: Deucher, Alexander, Somalapuram, Amaranath, Koenig, Christian On 2/4/2022 6:20 PM, Lazar, Lijo wrote: > [AMD Official Use Only] > > One more thing > In suspend-reset case, won't this cause to schedule a work item on suspend? I don't know if that is a good idea, ideally we would like to clean up all work items before going to suspend. > > Thanks, > Lijo Again, this opens scope for discussion. What if there is a GPU error during suspend-reset, which is very probable case. - Shashank > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:47 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > > On 2/4/2022 6:11 PM, Lazar, Lijo wrote: >> BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. > > Adding this additional parameter instead of blocking something in kernel, seems like a better idea. The app can filter out and read what it is interested into. > > - Shashank ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 17:22 ` Sharma, Shashank @ 2022-02-04 18:41 ` Deucher, Alexander 2022-02-04 18:44 ` Deucher, Alexander 0 siblings, 1 reply; 20+ messages in thread From: Deucher, Alexander @ 2022-02-04 18:41 UTC (permalink / raw) To: Sharma, Shashank, Lazar, Lijo, amd-gfx Cc: Somalapuram, Amaranath, Koenig, Christian [-- Attachment #1: Type: text/plain, Size: 2193 bytes --] [Public] In the suspend and hibernate cases, we don't care. In most cases the power rail will be cut once the system enters suspend so it doesn't really matter. That's why we call the asic reset callback directly rather than going through the whole recovery process. The device is already quiescent at this point we just want to make sure the device is in a known state when we come out of suspend (in case suspend overall fails). Alex ________________________________ From: Sharma, Shashank <Shashank.Sharma@amd.com> Sent: Friday, February 4, 2022 12:22 PM To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler On 2/4/2022 6:20 PM, Lazar, Lijo wrote: > [AMD Official Use Only] > > One more thing > In suspend-reset case, won't this cause to schedule a work item on suspend? I don't know if that is a good idea, ideally we would like to clean up all work items before going to suspend. > > Thanks, > Lijo Again, this opens scope for discussion. What if there is a GPU error during suspend-reset, which is very probable case. 
- Shashank > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:47 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > > On 2/4/2022 6:11 PM, Lazar, Lijo wrote: >> BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. > > Adding this additional parameter instead of blocking something in kernel, seems like a better idea. The app can filter out and read what it is interested into. > > - Shashank [-- Attachment #2: Type: text/html, Size: 3795 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 18:41 ` Deucher, Alexander @ 2022-02-04 18:44 ` Deucher, Alexander 2022-02-05 7:00 ` Sharma, Shashank 0 siblings, 1 reply; 20+ messages in thread From: Deucher, Alexander @ 2022-02-04 18:44 UTC (permalink / raw) To: Sharma, Shashank, Lazar, Lijo, amd-gfx Cc: Somalapuram, Amaranath, Koenig, Christian [-- Attachment #1: Type: text/plain, Size: 3036 bytes --] [Public] Seems like this functionality should be moved up into the callers. Maybe add new IP callbacks (dump_reset_registers()) so that each IP can specify what registers are relevant for a reset debugging and then we can walk the IP list and call the callback before we call the asic_reset callbacks. Alex ________________________________ From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Deucher, Alexander <Alexander.Deucher@amd.com> Sent: Friday, February 4, 2022 1:41 PM To: Sharma, Shashank <Shashank.Sharma@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org> Cc: Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler [Public] [Public] In the suspend and hibernate cases, we don't care. In most cases the power rail will be cut once the system enters suspend so it doesn't really matter. That's why we call the asic reset callback directly rather than going through the whole recovery process. The device is already quiescent at this point we just want to make sure the device is in a known state when we come out of suspend (in case suspend overall fails). 
Alex ________________________________ From: Sharma, Shashank <Shashank.Sharma@amd.com> Sent: Friday, February 4, 2022 12:22 PM To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler On 2/4/2022 6:20 PM, Lazar, Lijo wrote: > [AMD Official Use Only] > > One more thing > In suspend-reset case, won't this cause to schedule a work item on suspend? I don't know if that is a good idea, ideally we would like to clean up all work items before going to suspend. > > Thanks, > Lijo Again, this opens scope for discussion. What if there is a GPU error during suspend-reset, which is very probable case. - Shashank > > -----Original Message----- > From: Sharma, Shashank <Shashank.Sharma@amd.com> > Sent: Friday, February 4, 2022 10:47 PM > To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > > On 2/4/2022 6:11 PM, Lazar, Lijo wrote: >> BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. > > Adding this additional parameter instead of blocking something in kernel, seems like a better idea. The app can filter out and read what it is interested into. > > - Shashank [-- Attachment #2: Type: text/html, Size: 5719 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
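[Editorial note: the IP-callback walk Alex proposes follows the same pattern the patch already uses. Below is a compile-and-run sketch with mocked structures — the struct layout is simplified from the real `amdgpu_ip_block`, and `dump_reset_registers` is the proposed, not yet existing, callback name.]

```c
#include <assert.h>
#include <stdio.h>

/* Simplified mock of an amdgpu IP block; the real driver hangs the
 * callback off ip_blocks[i].version->funcs. */
struct ip_funcs {
	const char *name;
	int (*dump_reset_registers)(void *adev);	/* optional */
};

struct ip_block {
	int valid;
	const struct ip_funcs *funcs;
};

static int gfx_dump(void *adev)
{
	(void)adev;	/* a real hook would read GFX registers here */
	return 0;
}

/* Walk the IP list before calling the asic_reset callback, skipping
 * invalid blocks and blocks without the hook; returns the number of
 * blocks that dumped successfully. */
static int reset_dumps(const struct ip_block *blocks, int n, void *adev)
{
	int i, dumped = 0;

	for (i = 0; i < n; i++) {
		int r;

		if (!blocks[i].valid || !blocks[i].funcs ||
		    !blocks[i].funcs->dump_reset_registers)
			continue;
		r = blocks[i].funcs->dump_reset_registers(adev);
		if (r)
			fprintf(stderr, "dump of <%s> failed %d\n",
				blocks[i].funcs->name, r);
		else
			dumped++;
	}
	return dumped;
}
```

The advantage of this split is that each IP block owns the list of registers relevant to its own reset debugging, while the walk itself lives once in common code.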
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-02-04 18:44 ` Deucher, Alexander @ 2022-02-05 7:00 ` Sharma, Shashank 0 siblings, 0 replies; 20+ messages in thread From: Sharma, Shashank @ 2022-02-05 7:00 UTC (permalink / raw) To: Deucher, Alexander, Lazar, Lijo, amd-gfx Cc: Somalapuram, Amaranath, Koenig, Christian Hey Alex, Agree, we are moving it above, Christian also had the same feedback. - Shashank On 2/4/2022 7:44 PM, Deucher, Alexander wrote: > [Public] > > > Seems like this functionality should be moved up into the callers. > Maybe add new IP callbacks (dump_reset_registers()) so that each IP can > specify what registers are relevant for a reset debugging and then we > can walk the IP list and call the callback before we call the asic_reset > callbacks. > > Alex > > ------------------------------------------------------------------------ > *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of > Deucher, Alexander <Alexander.Deucher@amd.com> > *Sent:* Friday, February 4, 2022 1:41 PM > *To:* Sharma, Shashank <Shashank.Sharma@amd.com>; Lazar, Lijo > <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > <amd-gfx@lists.freedesktop.org> > *Cc:* Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, > Christian <Christian.Koenig@amd.com> > *Subject:* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > [Public] > > > [Public] > > > In the suspend and hibernate cases, we don't care. In most cases the > power rail will be cut once the system enters suspend so it doesn't > really matter. That's why we call the asic reset callback directly > rather than going through the whole recovery process. The device is > already quiescent at this point we just want to make sure the device is > in a known state when we come out of suspend (in case suspend overall > fails). 
> > Alex > > > ------------------------------------------------------------------------ > *From:* Sharma, Shashank <Shashank.Sharma@amd.com> > *Sent:* Friday, February 4, 2022 12:22 PM > *To:* Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org > <amd-gfx@lists.freedesktop.org> > *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, > Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com> > *Subject:* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > > On 2/4/2022 6:20 PM, Lazar, Lijo wrote: >> [AMD Official Use Only] >> >> One more thing >> In suspend-reset case, won't this cause to schedule a work item on suspend? I don't know if that is a good idea, ideally we would like to clean up all work items before going to suspend. >> >> Thanks, >> Lijo > > Again, this opens scope for discussion. What if there is a GPU error > during suspend-reset, which is very probable case. > > - Shashank > >> >> -----Original Message----- >> From: Sharma, Shashank <Shashank.Sharma@amd.com> >> Sent: Friday, February 4, 2022 10:47 PM >> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> >> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> >> >> On 2/4/2022 6:11 PM, Lazar, Lijo wrote: >>> BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else. >> >> Adding this additional parameter instead of blocking something in kernel, seems like a better idea. The app can filter out and read what it is interested into. >> >> - Shashank ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank 2022-01-22 6:42 ` Lazar, Lijo @ 2022-01-24 7:18 ` Christian König 2022-01-24 16:50 ` Sharma, Shashank 2022-01-24 16:32 ` Andrey Grodzovsky 2 siblings, 1 reply; 20+ messages in thread From: Christian König @ 2022-01-24 7:18 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath Am 21.01.22 um 21:34 schrieb Sharma, Shashank: > From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 > From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Date: Fri, 21 Jan 2022 14:19:42 +0530 > Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > This patch adds a GPU reset handler for Navi ASIC family, which > typically dumps some of the registersand sends a trace event. > > V2: Accomodated call to work function to send uevent > > Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c > b/drivers/gpu/drm/amd/amdgpu/nv.c > index 01efda4398e5..ada35d4c5245 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nv.c > +++ b/drivers/gpu/drm/amd/amdgpu/nv.c > @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) > } > } > > +static void amdgpu_reset_dumps(struct amdgpu_device *adev) > +{ > + int r = 0, i; Please don't initialize variables if it isn't absolutely necessary. 
> + > + /* original raven doesn't have full asic reset */ > + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && > + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) > + return; > + for (i = 0; i < adev->num_ip_blocks; i++) { > + if (!adev->ip_blocks[i].status.valid) > + continue; > + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) > + continue; > + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); > + > + if (r) > + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", > + adev->ip_blocks[i].version->funcs->name, r); > + } > + > + /* Schedule work to send uevent */ > + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) > + DRM_ERROR("failed to add GPU reset work\n"); > + > + dump_stack(); > +} > + I'm pretty sure that should be inside common code and not here. In other words that is absolutely not ASIC specific at all. Regards, Christian. > static int nv_asic_reset(struct amdgpu_device *adev) > { > int ret = 0; > > + amdgpu_reset_dumps(adev); > switch (nv_asic_reset_method(adev)) { > case AMD_RESET_METHOD_PCI: > dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-24 7:18 ` Christian König @ 2022-01-24 16:50 ` Sharma, Shashank 0 siblings, 0 replies; 20+ messages in thread From: Sharma, Shashank @ 2022-01-24 16:50 UTC (permalink / raw) To: Christian König, amd-gfx; +Cc: Deucher, Alexander, Somalapuram Amaranath On 1/24/2022 8:18 AM, Christian König wrote: > > > Am 21.01.22 um 21:34 schrieb Sharma, Shashank: >> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 >> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Date: Fri, 21 Jan 2022 14:19:42 +0530 >> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> This patch adds a GPU reset handler for Navi ASIC family, which >> typically dumps some of the registersand sends a trace event. >> >> V2: Accomodated call to work function to send uevent >> >> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >> b/drivers/gpu/drm/amd/amdgpu/nv.c >> index 01efda4398e5..ada35d4c5245 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) >> } >> } >> >> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) >> +{ >> + int r = 0, i; > > Please don't initialize variables if it isn't absolutely necessary. > Agree, @Amar please check this out. 
>> + >> + /* original raven doesn't have full asic reset */ >> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >> + return; >> + for (i = 0; i < adev->num_ip_blocks; i++) { >> + if (!adev->ip_blocks[i].status.valid) >> + continue; >> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >> + continue; >> + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >> + >> + if (r) >> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", >> + adev->ip_blocks[i].version->funcs->name, r); >> + } >> + >> + /* Schedule work to send uevent */ >> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >> + DRM_ERROR("failed to add GPU reset work\n"); >> + >> + dump_stack(); >> +} >> + > > I'm pretty sure that should be inside common code and not here. > > In other words that is absolutely not ASIC specific at all. Yeah, I agree. I think the initial problem for Amar was that some offsets were only valid for particular ASICs, but nothing is stopping us to make this in a generic function followed by a asic special function. We might have to move the asic.fptr to a more generic higher level fptr. - Shashank > > Regards, > Christian. > >> static int nv_asic_reset(struct amdgpu_device *adev) >> { >> int ret = 0; >> >> + amdgpu_reset_dumps(adev); >> switch (nv_asic_reset_method(adev)) { >> case AMD_RESET_METHOD_PCI: >> dev_info(adev->dev, "PCI reset\n"); > ^ permalink raw reply [flat|nested] 20+ messages in thread
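[Editorial note: one way to read the "generic function followed by an ASIC-specific function" idea above is a common entry point that delegates through a per-ASIC function pointer. This is a hedged sketch with mocked types — the actual callback placement in the driver was still being decided in this thread.]

```c
#include <assert.h>
#include <stddef.h>

/* Mock of a per-ASIC function table; in the driver this would sit in a
 * generic, higher-level funcs table rather than in nv.c. */
struct asic_funcs {
	int (*reset_reg_dumps)(void *adev);	/* ASIC-specific offsets */
};

struct device {
	const struct asic_funcs *asic;
	int dump_calls;	/* test instrumentation only */
};

static int navi_reset_reg_dumps(void *adev)
{
	/* a real implementation would dump Navi-specific registers */
	((struct device *)adev)->dump_calls++;
	return 0;
}

/* Common-code entry point: every ASIC goes through here, and only the
 * register offsets remain ASIC-specific behind the pointer. */
static int common_reset_dumps(struct device *dev)
{
	if (!dev->asic || !dev->asic->reset_reg_dumps)
		return 0;	/* this ASIC has nothing to dump */
	return dev->asic->reset_reg_dumps(dev);
}
```

This keeps the not-at-all-ASIC-specific logic (the walk, the uevent scheduling) in common code, exactly as the review asks, while each ASIC only supplies its valid offsets.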
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank 2022-01-22 6:42 ` Lazar, Lijo 2022-01-24 7:18 ` Christian König @ 2022-01-24 16:32 ` Andrey Grodzovsky 2022-01-24 16:38 ` Sharma, Shashank 2 siblings, 1 reply; 20+ messages in thread From: Andrey Grodzovsky @ 2022-01-24 16:32 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König You probably can add the STB dump we worked on a while ago to your info dump - a reminder on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html Andrey On 2022-01-21 15:34, Sharma, Shashank wrote: > From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 > From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Date: Fri, 21 Jan 2022 14:19:42 +0530 > Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler > > This patch adds a GPU reset handler for Navi ASIC family, which > typically dumps some of the registersand sends a trace event. 
> > V2: Accomodated call to work function to send uevent > > Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c > b/drivers/gpu/drm/amd/amdgpu/nv.c > index 01efda4398e5..ada35d4c5245 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nv.c > +++ b/drivers/gpu/drm/amd/amdgpu/nv.c > @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) > } > } > > +static void amdgpu_reset_dumps(struct amdgpu_device *adev) > +{ > + int r = 0, i; > + > + /* original raven doesn't have full asic reset */ > + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && > + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) > + return; > + for (i = 0; i < adev->num_ip_blocks; i++) { > + if (!adev->ip_blocks[i].status.valid) > + continue; > + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) > + continue; > + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); > + > + if (r) > + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", > + adev->ip_blocks[i].version->funcs->name, r); > + } > + > + /* Schedule work to send uevent */ > + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) > + DRM_ERROR("failed to add GPU reset work\n"); > + > + dump_stack(); > +} > + > static int nv_asic_reset(struct amdgpu_device *adev) > { > int ret = 0; > > + amdgpu_reset_dumps(adev); > switch (nv_asic_reset_method(adev)) { > case AMD_RESET_METHOD_PCI: > dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-24 16:32 ` Andrey Grodzovsky @ 2022-01-24 16:38 ` Sharma, Shashank 2022-01-24 17:08 ` Andrey Grodzovsky 0 siblings, 1 reply; 20+ messages in thread From: Sharma, Shashank @ 2022-01-24 16:38 UTC (permalink / raw) To: Andrey Grodzovsky, amd-gfx Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König Hey Andrey, That seems like a good idea, may I know if there is a trigger for STB dump ? or is it just the infrastructure which one can use when they feel a need to dump info ? Also, how reliable is the STB infra during a reset ? Regards Shashank On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote: > You probably can add the STB dump we worked on a while ago to your info > dump - a reminder > on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html > > Andrey > > On 2022-01-21 15:34, Sharma, Shashank wrote: >> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 >> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Date: Fri, 21 Jan 2022 14:19:42 +0530 >> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >> >> This patch adds a GPU reset handler for Navi ASIC family, which >> typically dumps some of the registersand sends a trace event. 
>> >> V2: Accomodated call to work function to send uevent >> >> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >> b/drivers/gpu/drm/amd/amdgpu/nv.c >> index 01efda4398e5..ada35d4c5245 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) >> } >> } >> >> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) >> +{ >> + int r = 0, i; >> + >> + /* original raven doesn't have full asic reset */ >> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >> + return; >> + for (i = 0; i < adev->num_ip_blocks; i++) { >> + if (!adev->ip_blocks[i].status.valid) >> + continue; >> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >> + continue; >> + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >> + >> + if (r) >> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", >> + adev->ip_blocks[i].version->funcs->name, r); >> + } >> + >> + /* Schedule work to send uevent */ >> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >> + DRM_ERROR("failed to add GPU reset work\n"); >> + >> + dump_stack(); >> +} >> + >> static int nv_asic_reset(struct amdgpu_device *adev) >> { >> int ret = 0; >> >> + amdgpu_reset_dumps(adev); >> switch (nv_asic_reset_method(adev)) { >> case AMD_RESET_METHOD_PCI: >> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler 2022-01-24 16:38 ` Sharma, Shashank @ 2022-01-24 17:08 ` Andrey Grodzovsky 2022-01-24 17:11 ` Sharma, Shashank 0 siblings, 1 reply; 20+ messages in thread From: Andrey Grodzovsky @ 2022-01-24 17:08 UTC (permalink / raw) To: Sharma, Shashank, amd-gfx Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König It's just an infrastructure you use when you need it. I don't think I ever tested it during reset, but we deliberately made it very self-reliant: you simply iterate a FIFO of the dump through the PMI3 register interface and dump out the content. It is currently supposed to work for the NV family. In case you encounter issues during reset let me know and I will do my best to resolve them. Andrey On 2022-01-24 11:38, Sharma, Shashank wrote: > Hey Andrey, > That seems like a good idea, may I know if there is a trigger for STB > dump ? or is it just the infrastructure which one can use when they > feel a need to dump info ? Also, how reliable is the STB infra during > a reset ? > > Regards > Shashank > On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote: >> You probably can add the STB dump we worked on a while ago to your >> info dump - a reminder >> on the feature is here >> https://www.spinics.net/lists/amd-gfx/msg70751.html >> >> Andrey >> >> On 2022-01-21 15:34, Sharma, Shashank wrote: >>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 >>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Date: Fri, 21 Jan 2022 14:19:42 +0530 >>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler >>> >>> This patch adds a GPU reset handler for Navi ASIC family, which >>> typically dumps some of the registersand sends a trace event. 
>>> >>> V2: Accomodated call to work function to send uevent >>> >>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> >>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++ >>> 1 file changed, 28 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>> b/drivers/gpu/drm/amd/amdgpu/nv.c >>> index 01efda4398e5..ada35d4c5245 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev) >>> } >>> } >>> >>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) >>> +{ >>> + int r = 0, i; >>> + >>> + /* original raven doesn't have full asic reset */ >>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>> + return; >>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>> + if (!adev->ip_blocks[i].status.valid) >>> + continue; >>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>> + continue; >>> + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>> + >>> + if (r) >>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", >>> + adev->ip_blocks[i].version->funcs->name, r); >>> + } >>> + >>> + /* Schedule work to send uevent */ >>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work)) >>> + DRM_ERROR("failed to add GPU reset work\n"); >>> + >>> + dump_stack(); >>> +} >>> + >>> static int nv_asic_reset(struct amdgpu_device *adev) >>> { >>> int ret = 0; >>> >>> + amdgpu_reset_dumps(adev); >>> switch (nv_asic_reset_method(adev)) { >>> case AMD_RESET_METHOD_PCI: >>> dev_info(adev->dev, "PCI reset\n"); ^ permalink raw reply [flat|nested] 20+ messages in thread
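[Editorial note: the STB dump Andrey describes is essentially draining a FIFO through a register interface. The sketch below mocks the FIFO as a backing array to show only the access pattern; the real implementation reads firmware registers, is Navi-specific, and these names are hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mocked FIFO: real STB code would read a hardware data register
 * repeatedly; here we just walk a backing array. */
struct stb_fifo {
	const uint32_t *data;
	size_t size;
	size_t pos;
};

static int stb_read_entry(struct stb_fifo *f, uint32_t *out)
{
	if (f->pos >= f->size)
		return -1;	/* FIFO drained */
	*out = f->data[f->pos++];
	return 0;
}

/* Drain the whole FIFO into a caller buffer, as a reset handler might
 * do before handing the dump to userspace. Returns entries read. */
static size_t stb_drain(struct stb_fifo *f, uint32_t *buf, size_t max)
{
	size_t n = 0;

	while (n < max && stb_read_entry(f, &buf[n]) == 0)
		n++;
	return n;
}
```

Because the reader is self-reliant in this way, it plugs into a reset path without depending on the rest of the recovery machinery being healthy.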
* Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
  2022-01-24 17:08 ` Andrey Grodzovsky
@ 2022-01-24 17:11 ` Sharma, Shashank
  0 siblings, 0 replies; 20+ messages in thread
From: Sharma, Shashank @ 2022-01-24 17:11 UTC (permalink / raw)
To: Andrey Grodzovsky, amd-gfx
Cc: Deucher, Alexander, Somalapuram Amaranath, Christian König

On 1/24/2022 6:08 PM, Andrey Grodzovsky wrote:
> It's just an infrastructure you use when you need it.
> I never tested it during reset, I think, but we deliberately made it very
> self-reliant: you simply iterate a FIFO of the dump through the PMI3
> registers interface and dump out the content. It is currently supposed to
> work for the NV family.
>
Got it, thanks for the suggestion. Let me check the feasibility of
plugging STB into our existing design and use case.

- Shashank

> In case you encounter issues during reset, let me know and I will do my
> best to resolve them.
>
> Andrey
>
> On 2022-01-24 11:38, Sharma, Shashank wrote:
>> Hey Andrey,
>> That seems like a good idea; may I know if there is a trigger for the STB
>> dump, or is it just infrastructure one can use when they feel a need to
>> dump info? Also, how reliable is the STB infra during a reset?
>>
>> Regards
>> Shashank
>> On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote:
>>> You probably can add the STB dump we worked on a while ago to your
>>> info dump - a reminder on the feature is here:
>>> https://www.spinics.net/lists/amd-gfx/msg70751.html
>>>
>>> Andrey
>>>
>>> On 2022-01-21 15:34, Sharma, Shashank wrote:
>>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>>
>>>> This patch adds a GPU reset handler for the Navi ASIC family, which
>>>> typically dumps some of the registers and sends a trace event.
>>>>
>>>> V2: Accommodated call to work function to send uevent
>>>>
>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>>  1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> index 01efda4398e5..ada35d4c5245 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>>>      }
>>>>  }
>>>>
>>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +    int r = 0, i;
>>>> +
>>>> +    /* original raven doesn't have full asic reset */
>>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> +        return;
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +        if (!adev->ip_blocks[i].status.valid)
>>>> +            continue;
>>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>>> +            continue;
>>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>>> +        if (r)
>>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>>> +                      adev->ip_blocks[i].version->funcs->name, r);
>>>> +    }
>>>> +
>>>> +    /* Schedule work to send uevent */
>>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>>> +
>>>> +    dump_stack();
>>>> +}
>>>> +
>>>>  static int nv_asic_reset(struct amdgpu_device *adev)
>>>>  {
>>>>      int ret = 0;
>>>>
>>>> +    amdgpu_reset_dumps(adev);
>>>>      switch (nv_asic_reset_method(adev)) {
>>>>      case AMD_RESET_METHOD_PCI:
>>>>          dev_info(adev->dev, "PCI reset\n");
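[Editor's note: the patch queues `adev->gpu_reset_work`, but the work handler that actually sends the uevent lands in an earlier patch of the series (per the V2 note) and is not shown here. The user-space sketch below only illustrates the general flavor of building an environment-string payload for such a notification; the key names are hypothetical, and in the kernel the resulting array would be handed to `kobject_uevent_env()`.]

```c
#include <stdio.h>
#include <string.h>

/* Build the env strings a reset-notification uevent might carry.
 * Key names (RESET, PID) are illustrative, not the ones the series uses.
 * Returns the number of strings written, or -1 if env is too small. */
int build_reset_uevent_env(char env[][32], int max,
                           const char *reason, unsigned int pid)
{
    int n = 0;

    if (max < 2)
        return -1;
    snprintf(env[n++], sizeof(env[0]), "RESET=%s", reason);
    snprintf(env[n++], sizeof(env[0]), "PID=%u", pid);
    return n;
}
```

Deferring this to a work item, as the patch does, matters because the reset path can run in contexts where sending a uevent directly would be unsafe.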
end of thread, other threads:[~2022-02-05  7:00 UTC | newest]

Thread overview: 20+ messages
2022-01-21 20:34 [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler Sharma, Shashank
2022-01-22  6:42 ` Lazar, Lijo
2022-02-04 16:38 ` Sharma, Shashank
2022-02-04 16:50 ` Lazar, Lijo
2022-02-04 16:59 ` Sharma, Shashank
2022-02-04 17:02 ` Lazar, Lijo
2022-02-04 17:07 ` Sharma, Shashank
2022-02-04 17:11 ` Lazar, Lijo
2022-02-04 17:16 ` Sharma, Shashank
2022-02-04 17:20 ` Lazar, Lijo
2022-02-04 17:22 ` Sharma, Shashank
2022-02-04 18:41 ` Deucher, Alexander
2022-02-04 18:44 ` Deucher, Alexander
2022-02-05  7:00 ` Sharma, Shashank
2022-01-24  7:18 ` Christian König
2022-01-24 16:50 ` Sharma, Shashank
2022-01-24 16:32 ` Andrey Grodzovsky
2022-01-24 16:38 ` Sharma, Shashank
2022-01-24 17:08 ` Andrey Grodzovsky
2022-01-24 17:11 ` Sharma, Shashank