From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: Zoy.Bai@amd.com, lijo.lazar@amd.com
Subject: Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
Date: Wed, 18 May 2022 10:24:45 -0400
Message-ID: <ce60a983-9906-e33f-a2cc-6fedb958a124@amd.com>
In-Reply-To: <1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com>


On 2022-05-18 02:07, Christian König wrote:
> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>> Problem:
>> During a hive reset caused by a command timing out on a ring,
>> extra resets are triggered by KFD, which is
>> unable to access registers on the resetting ASIC.
>>
>> Fix: Rework GPU reset to actively stop any pending reset
>> works while another is in progress.
>>
>> v2: Switch from the generic list used in v1[1] to explicit
>> stopping of each reset request from each reset source,
>> per request submitter.
>
> Looks mostly good to me.
>
> Apart from the naming nitpick on patch #1, the only thing I couldn't
> figure out offhand is why you are using a delayed work everywhere
> instead of just a work item.
>
> That needs a bit more explanation of what's happening here.
>
> Christian.


Check the APIs for cancelling work vs. delayed work:

For work_struct the only public API is a blocking cancel:
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214

For delayed_work we have both blocking and non-blocking public APIs:

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

I would rather not try to convince the core kernel people to expose
another interface just for our sake right now. In my past experience, API
changes in core code have slim chances of being accepted and cost a lot of
time in back-and-forth arguments.
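
To make the distinction above concrete, here is a minimal userspace sketch of
what a non-blocking cancel buys us. It is only a toy model: it does not call
any kernel API, and the names toy_reset_work, toy_cancel_pending and
toy_run_reset are hypothetical, made up for this illustration. As far as I
understand, the real calls would be cancel_work_sync() for work_struct and
cancel_delayed_work() / cancel_delayed_work_sync() for delayed_work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one queued reset request.
 * This is NOT the kernel's struct work_struct or struct delayed_work. */
struct toy_reset_work {
	const char *source;  /* hypothetical label for the reset source */
	bool pending;        /* queued but not yet started */
};

/* Models a non-blocking cancel: drop the request if it is still pending
 * and report whether anything was dropped. It never waits for a handler. */
static bool toy_cancel_pending(struct toy_reset_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Models the handler of one reset request clearing every other pending
 * request before it proceeds. Returns how many requests were dropped. */
static int toy_run_reset(struct toy_reset_work *self,
			 struct toy_reset_work *all, size_t n)
{
	int dropped = 0;

	assert(self != NULL && all != NULL);
	self->pending = false;  /* this request is now executing */

	for (size_t i = 0; i < n; i++) {
		if (&all[i] != self && toy_cancel_pending(&all[i]))
			dropped++;
	}
	return dropped;
}
```

With three requests, where the ring timeout and KFD ones are pending and the
debugfs one is not, running the ring timeout handler drops exactly one
request (the KFD one) and never has to wait on anything.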

"If the mountain will not come to Muhammad, then Muhammad must go to the 
mountain" ;)

Andrey

>
>>
>> [1] - 
>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>
>> Andrey Grodzovsky (7):
>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>    drm/amdgpu: Switch to delayed work from work_struct.
>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>      queue.
>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>      amdgpu_device_gpu_recover
>>    drm/amdgpu: Stop any pending reset if another in progress.
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>
>
