Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: Zoy.Bai@amd.com, lijo.lazar@amd.com
Subject: Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
Date: Wed, 18 May 2022 10:24:45 -0400
Message-ID: <ce60a983-9906-e33f-a2cc-6fedb958a124@amd.com>
In-Reply-To: <1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com>

On 2022-05-18 02:07, Christian König wrote:
> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>> Problem:
>> During a hive reset caused by a command timing out on a ring,
>> extra resets are triggered by KFD, which is
>> unable to access registers on the resetting ASIC.
>>
>> Fix: Rework GPU reset to actively stop any pending reset
>> work while another reset is in progress.
>>
>> v2: Switch from the generic list used in v1[1] to explicit
>> stopping of each reset request from each reset source,
>> per request submitter.
>
> Looks mostly good to me.
>
> Apart from the naming nitpick on patch #1, the only thing I couldn't
> figure out offhand is why you are using a delayed work everywhere
> instead of just a work item.
>
> That needs a bit more explanation of what's happening here.
>
> Christian.

Check the APIs for cancelling work vs. delayed work:

For work_struct the only public API is this one, a blocking cancel:
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214

For delayed_work we have both blocking and non-blocking public APIs,
cancel_delayed_work_sync() and cancel_delayed_work():

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

I would rather not try to convince the core kernel people to expose
another interface for our own sake right now. In my experience, API changes
in core code have slim chances and cost a lot of time in back-and-forth
arguments.
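
To make the reasoning above concrete, here is a toy userspace model of why
a non-blocking cancel is what the reset path wants. This is only a sketch:
it is NOT kernel code and does not use the workqueue API. Every name in it
(toy_work, toy_queue, toy_cancel, toy_reset, simulate_hive_reset and the
SRC_* sources) is hypothetical. toy_cancel() stands in for a non-blocking
cancel in the spirit of cancel_delayed_work(): it drops a request that has
not started yet and never waits.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; NOT the kernel workqueue API. All names are hypothetical. */
struct toy_work {
	bool pending;	/* queued, handler not started yet */
	int *resets;	/* counter bumped by the reset handler */
};

/* Hypothetical reset sources, loosely following the cover letter. */
enum { SRC_JOB_TIMEOUT, SRC_KFD, SRC_RAS, SRC_COUNT };

/* Queue a reset request; returns false if it was already pending. */
static bool toy_queue(struct toy_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/*
 * Non-blocking cancel: drops a pending request and never waits.
 * Returns true if a pending request was removed.
 */
static bool toy_cancel(struct toy_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Reset handler: optionally drops the other sources' pending requests first. */
static void toy_reset(struct toy_work *all, size_t self, bool stop_pending)
{
	if (stop_pending) {
		for (size_t i = 0; i < SRC_COUNT; i++) {
			if (i != self)
				toy_cancel(&all[i]);
		}
	}
	(*all[self].resets)++;
}

/*
 * All sources request a reset, then the queue drains in order.
 * Returns how many resets were actually performed.
 */
static int simulate_hive_reset(bool stop_pending)
{
	int resets = 0;
	struct toy_work works[SRC_COUNT];

	for (size_t i = 0; i < SRC_COUNT; i++) {
		works[i].pending = false;
		works[i].resets = &resets;
		toy_queue(&works[i]);
	}

	for (size_t i = 0; i < SRC_COUNT; i++) {
		if (!works[i].pending)
			continue;
		works[i].pending = false;	/* handler starts running */
		toy_reset(works, i, stop_pending);
	}
	return resets;
}
```

Calling simulate_hive_reset(false) models the behaviour described in the
problem statement, where every queued source ends up performing its own
reset. simulate_hive_reset(true) models the intent of the fix, where the
first handler to run drops the other pending requests without waiting.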

"If the mountain will not come to Muhammad, then Muhammad must go to the 
mountain" ;)

Andrey

>
>>
>> [1] - 
>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>
>> Andrey Grodzovsky (7):
>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>    drm/amdgpu: Switch to delayed work from work_struct.
>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>      queue.
>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>      amdgpu_device_gpu_recover
>>    drm/amdgpu: Stop any pending reset if another in progress.
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>
>
