From: "Liu, Shaoyun" <Shaoyun.Liu@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Liu, Monk" <Monk.Liu@amd.com>,
	"Chen, Horace" <Horace.Chen@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Subject: RE: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu
Date: Mon, 20 Dec 2021 17:06:16 +0000	[thread overview]
Message-ID: <CH0PR12MB5372A4EAE67D6F2C0B06F5DCF47B9@CH0PR12MB5372.namprd12.prod.outlook.com> (raw)
In-Reply-To: <0a30778e-28b8-7d02-01e9-9db690227222@gmail.com>

[AMD Official Use Only]


Hi Andrey,

I actually have some concerns about this change.
1. On SRIOV configurations the reset notification comes from the host, and the driver already triggers a work queue to handle the reset (check xgpu_*_mailbox_flr_work). Is it a good idea to trigger another work queue from inside that work queue? Can we just use the new one you added? (A rough sketch of that idea follows below.)
2. For KFD, ROCm uses user queues for submission, so it does not go through the drm scheduler and hence there is no job timeout. Can we handle that with your new change?
3. For an XGMI hive there is only a hive-wide reset of all devices on the bare-metal config, but on the SRIOV config the VF will support VF FLR, which means the host might only need to reset a specific device instead of triggering a whole-hive reset. So we might still need a reset_domain for each individual device within the hive on SRIOV configurations.

Anyway, I think this change needs to be verified on an SRIOV configuration with XGMI while some ROCm user app is running.
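
To illustrate what concern 1 is asking for, here is a minimal, purely hypothetical sketch of the SRIOV FLR path reusing the new ordered reset workqueue instead of kicking a second, nested work queue. None of the names below (flr_sketch_dev, flr_sketch_mailbox_notify, reset_wq) are the real amdgpu symbols such as xgpu_*_mailbox_flr_work; they are invented for illustration only.

/*
 * Hypothetical sketch for concern 1: the host's FLR notification queues the
 * recovery on the single ordered reset wq introduced by this series instead
 * of running it from a separate, nested work queue.  All names are invented
 * for illustration; this is not the actual amdgpu SRIOV mailbox code.
 */
#include <linux/workqueue.h>

struct flr_sketch_dev {
	struct workqueue_struct *reset_wq;	/* the new ordered reset wq */
	struct work_struct flr_reset_work;	/* runs the actual recovery */
};

static void flr_sketch_do_reset(struct work_struct *work)
{
	/* The actual VF recovery would run here, one reset at a time. */
}

static void flr_sketch_init(struct flr_sketch_dev *dev,
			    struct workqueue_struct *ordered_reset_wq)
{
	dev->reset_wq = ordered_reset_wq;
	INIT_WORK(&dev->flr_reset_work, flr_sketch_do_reset);
}

/* Mailbox/interrupt handler: the host signalled a VF FLR. */
static void flr_sketch_mailbox_notify(struct flr_sketch_dev *dev)
{
	/* Serialized against TDR/sysfs/RAS triggered resets by the ordered wq. */
	queue_work(dev->reset_wq, &dev->flr_reset_work);
}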

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Monday, December 20, 2021 2:25 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: daniel@ffwll.ch; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
> This patchset is based on earlier work by Boris [1] that allowed
> having an ordered workqueue at the driver level which the different
> schedulers use to queue their timeout work. On top of that I also
> serialized any GPU reset we trigger from within amdgpu code to go
> through the same ordered wq, which somewhat simplifies our GPU reset
> code since we no longer need to protect against concurrent GPU reset
> triggers such as TDR on one hand and the sysfs or RAS triggers on
> the other.
>
> As advised by Christian and Daniel I defined a reset_domain struct
> such that all the entities that go through reset together will be
> serialized against one another.
>
> A TDR triggered by multiple entities within the same domain for the
> same reason will not run multiple times, as the first such reset will
> cancel all the pending ones. This applies only to TDR timers and not
> to resets triggered from RAS or sysfs; those will still happen after
> the in-flight resets finish.
>
> [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>
> P.S. Going through drm-misc-next and not amd-staging-drm-next as Boris' work hasn't landed there yet.
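
To make the quoted design concrete, here is a minimal, hypothetical sketch of the reset_domain idea: every device that must be reset together points at the same domain, and the domain's single ordered workqueue serializes recovery regardless of whether it was queued by a TDR, sysfs or RAS trigger. All struct and function names below (sketch_reset_domain, sketch_device, sketch_queue_recovery) are invented for illustration and are not the actual amdgpu or drm scheduler symbols.

/*
 * Hypothetical sketch of a shared reset domain.  Devices that have to be
 * reset together (e.g. an XGMI hive on bare metal) share one domain; the
 * ordered workqueue then guarantees only one recovery runs at a time.
 * On SRIOV, per-device domains would allow a lone VF FLR instead of a
 * whole-hive reset.  None of these names are the real amdgpu symbols.
 */
#include <linux/slab.h>
#include <linux/kref.h>
#include <linux/workqueue.h>

struct sketch_reset_domain {
	struct workqueue_struct *wq;	/* ordered => one reset at a time */
	struct kref refcount;		/* shared by every device in the domain */
};

struct sketch_device {
	struct sketch_reset_domain *domain;
	struct work_struct recover_work;	/* the actual recovery routine */
};

static void sketch_do_recovery(struct work_struct *work)
{
	/* Device/hive recovery would go here, serialized by the ordered wq. */
}

static struct sketch_reset_domain *sketch_reset_domain_create(void)
{
	struct sketch_reset_domain *d = kzalloc(sizeof(*d), GFP_KERNEL);

	if (!d)
		return NULL;
	d->wq = alloc_ordered_workqueue("sketch-reset", 0);
	if (!d->wq) {
		kfree(d);
		return NULL;
	}
	kref_init(&d->refcount);
	return d;
}

static void sketch_device_init(struct sketch_device *dev,
			       struct sketch_reset_domain *domain)
{
	dev->domain = domain;	/* shared by all devices reset together */
	INIT_WORK(&dev->recover_work, sketch_do_recovery);
}

/*
 * Every trigger (TDR, sysfs, RAS) funnels through here.  queue_work()
 * returns false if this device's recovery is already pending; the series
 * described above goes further and cancels other pending TDRs once the
 * first reset runs, which this sketch does not model.
 */
static bool sketch_queue_recovery(struct sketch_device *dev)
{
	return queue_work(dev->domain->wq, &dev->recover_work);
}

Under such a scheme, per-device domains on SRIOV (concern 3 above) would roughly amount to giving each VF its own domain instead of sharing the hive-wide one.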

Patches #1, #5 and #6 are Reviewed-by: Christian König <christian.koenig@amd.com>

Some minor comments on the rest, but in general this absolutely looks like the way we want to go.

Regards,
Christian.

>
> Andrey Grodzovsky (6):
>    drm/amdgpu: Init GPU reset single threaded wq
>    drm/amdgpu: Move scheduler init to after XGMI is ready
>    drm/amdgpu: Fix crash on modprobe
>    drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>    drm/amdgpu: Drop hive->in_reset
>    drm/amdgpu: Drop concurrent GPU reset protection for device
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
>   7 files changed, 132 insertions(+), 136 deletions(-)
>

Thread overview: 46+ messages

2021-12-17 22:27 [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 1/6] drm/amdgpu: Init GPU reset single threaded wq Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 2/6] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-20  7:16   ` Christian König
2021-12-20 21:51     ` Andrey Grodzovsky
2021-12-21  7:05       ` Christian König
2021-12-17 22:27 ` [RFC 3/6] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-20  7:17   ` Christian König
2021-12-20 19:22     ` Andrey Grodzovsky
2021-12-21  7:02       ` Christian König
2021-12-21 16:03         ` Andrey Grodzovsky
2021-12-22  7:50           ` Christian König
2021-12-17 22:27 ` [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-20  7:20   ` Christian König
2021-12-20 22:17     ` Andrey Grodzovsky
2021-12-21  7:59       ` Christian König
2021-12-21 16:10         ` Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 5/6] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 6/6] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-20  7:25 ` [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Christian König
2021-12-20  9:43   ` Daniel Vetter
2021-12-20 17:06   ` Liu, Shaoyun [this message]
2021-12-20 19:11     ` Andrey Grodzovsky
