Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky: > > > On 2022-05-18 02:07, Christian König wrote: >> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky: >>> Problem: >>> During hive reset caused by command timing out on a ring >>> extra resets are generated by triggered by KFD which is >>> unable to accesses registers on the resetting ASIC. >>> >>> Fix: Rework GPU reset to actively stop any pending reset >>> works while another in progress. >>> >>> v2: Switch from generic list as was in v1[1] to eplicit >>> stopping of each reset request from each reset source >>> per each request submitter. >> >> Looks mostly good to me. >> >> Apart from the naming nit pick on patch #1 the only thing I couldn't >> of hand figure out is why you are using a delayed work everywhere >> instead of a just a work item. >> >> That needs a bit further explanation what's happening here. >> >> Christian. > > > Check APIs for cancelling work vs. delayed work - > > For work_struct the only public API is this - > https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 > - blocking cancel. > > For delayed_work we have both blocking and non blocking public APIs - > > https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295 > > https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295 > > I prefer not to go now into convincing core kernel people of exposing > another interface for our own sake - from my past experience API > changes in core code has slim chances and a lot of time spent on back > and forth arguments. > > "If the mountain will not come to Muhammad, then Muhammad must go to > the mountain" ;)* > * > Ah, good point. The cancel_work() function was removed a few years ago: commit 6417250d3f894e66a68ba1cd93676143f2376a6f Author: Stephen Hemminger Date:   Tue Mar 6 19:34:42 2018 -0800     workqueue: remove unused cancel_work()     Found this by accident.     There are no usages of bare cancel_work() in current kernel source.     Signed-off-by: Stephen Hemminger     Signed-off-by: Tejun Heo Maybe just revert that patch, export the function and use it. I think there is plenty of justification for this. Thanks, Christian. > ** > > Andrey > >> >>> >>> [1] - >>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/ >>> >>> Andrey Grodzovsky (7): >>>    drm/amdgpu: Cache result of last reset at reset domain level. >>>    drm/amdgpu: Switch to delayed work from work_struct. >>>    drm/admgpu: Serialize RAS recovery work directly into reset domain >>>      queue. >>>    drm/amdgpu: Add delayed work for GPU reset from debugfs >>>    drm/amdgpu: Add delayed work for GPU reset from kfd. >>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to >>>      amdgpu_device_gpu_recover >>>    drm/amdgpu: Stop any pending reset if another in progress. >>> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 + >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 >>> +++++++++++----------- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++-- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 + >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +- >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +- >>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +-- >>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +-- >>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +-- >>>   14 files changed, 87 insertions(+), 54 deletions(-) >>> >>