All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: Zoy.Bai@amd.com, lijo.lazar@amd.com
Subject: Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
Date: Thu, 19 May 2022 09:41:32 -0400	[thread overview]
Message-ID: <ac3feb03-9231-fa52-6183-43f85d0cff78@amd.com> (raw)
In-Reply-To: <3abdd2a3-2cb8-fc8f-fd5d-a8a7cd5fc828@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4170 bytes --]


On 2022-05-19 03:58, Christian König wrote:
>
>
> Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>>
>>
>> On 2022-05-18 02:07, Christian König wrote:
>>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> During hive reset caused by command timing out on a ring
>>>> extra resets are generated by triggered by KFD which is
>>>> unable to accesses registers on the resetting ASIC.
>>>>
>>>> Fix: Rework GPU reset to actively stop any pending reset
>>>> works while another in progress.
>>>>
>>>> v2: Switch from generic list as was in v1[1] to eplicit
>>>> stopping of each reset request from each reset source
>>>> per each request submitter.
>>>
>>> Looks mostly good to me.
>>>
>>> Apart from the naming nit pick on patch #1 the only thing I couldn't 
>>> of hand figure out is why you are using a delayed work everywhere 
>>> instead of a just a work item.
>>>
>>> That needs a bit further explanation what's happening here.
>>>
>>> Christian.
>>
>>
>> Check APIs for cancelling work vs. delayed work -
>>
>> For work_struct the only public API is this - 
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 
>> - blocking cancel.
>>
>> For delayed_work we have both blocking and non blocking public APIs -
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> I prefer not to go now into convincing core kernel people of exposing 
>> another interface for our own sake - from my past experience API 
>> changes in core code has slim chances and a lot of time spent on back 
>> and forth arguments.
>>
>> "If the mountain will not come to Muhammad, then Muhammad must go to 
>> the mountain" ;)*
>> *
>>
>
> Ah, good point. The cancel_work() function was removed a few years ago:
>
> commit 6417250d3f894e66a68ba1cd93676143f2376a6f
> Author: Stephen Hemminger <stephen@networkplumber.org>
> Date:   Tue Mar 6 19:34:42 2018 -0800
>
>     workqueue: remove unused cancel_work()
>
>     Found this by accident.
>     There are no usages of bare cancel_work() in current kernel source.
>
>     Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>     Signed-off-by: Tejun Heo <tj@kernel.org>
>
>
> Maybe just revert that patch, export the function and use it. I think 
> there is plenty of justification for this.
>
> Thanks,
> Christian.


Ok - i will send them a patch - let's see what they say.

Andrey


>
>> **
>>
>> Andrey
>>
>>>
>>>>
>>>> [1] - 
>>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>>
>>>> Andrey Grodzovsky (7):
>>>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>>>    drm/amdgpu: Switch to delayed work from work_struct.
>>>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>>>      queue.
>>>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>>>      amdgpu_device_gpu_recover
>>>>    drm/amdgpu: Stop any pending reset if another in progress.
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 
>>>> +++++++++++-----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 12100 bytes --]

      reply	other threads:[~2022-05-19 13:41 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
2022-05-18  6:02   ` Christian König
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
2022-05-18  6:03   ` Christian König
2022-05-17 19:20 ` [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
2022-05-17 20:56   ` Felix Kuehling
2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
2022-05-18 14:24   ` Andrey Grodzovsky
2022-05-19  7:58     ` Christian König
2022-05-19 13:41       ` Andrey Grodzovsky [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ac3feb03-9231-fa52-6183-43f85d0cff78@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Zoy.Bai@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=lijo.lazar@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.