From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> To: "Liu, Shaoyun" <Shaoyun.Liu@amd.com>, "Christian König" <ckoenig.leichtzumerken@gmail.com>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>, "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org> Cc: "Liu, Monk" <Monk.Liu@amd.com>, "Chen, Horace" <Horace.Chen@amd.com>, "Koenig, Christian" <Christian.Koenig@amd.com> Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Date: Mon, 20 Dec 2021 14:11:49 -0500 [thread overview] Message-ID: <7e1846e1-a392-fa48-875f-0f6581468a76@amd.com> (raw) In-Reply-To: <CH0PR12MB5372A4EAE67D6F2C0B06F5DCF47B9@CH0PR12MB5372.namprd12.prod.outlook.com> On 2021-12-20 12:06 p.m., Liu, Shaoyun wrote: > [AMD Official Use Only] > > > Hi , Andrey > I actually has some concerns about this change . > 1. on SRIOV configuration , the reset notify coming from host , and driver already trigger a work queue to handle the reset (check xgpu_*_mailbox_flr_work) , is it a good idea to trigger another work queue inside the work queue ? Can we just use the new one you added ? Shouldn't be a problem, i will change. In fact it's a great idea because then it looks like we can totally drop 'adev->in_gpu_reset' since we don't need to lock again concurrent resets anymore > 2. For KFD, the rocm use the user queue for the submission and it won't call the drm scheduler and hence no job timeout. Can we handle that with your new change ? I think that not a problem - a lot of places use direct submissions and not scheduler, in case they need to synchronize against concurrent GPU resets they lock adev->reset_sem. Nothing changes in this sense. > > 3 . For XGMI hive, there is only hive reset for all devices on bare-metal config , but for SRIOV config , the VF will support VF FLR, which means host might only need to reset specific device instead trigger whole hive reset . So we might still need reset_domain for individual device within the hive for SRIOV configuration. This is something future right ? I don't see it in the code - in this case we will have to account for this as part of the generic design for this kind of single device reset within XGMI hive. It should require only a minor addition to current design in creating 2 parallel reset domains - one for hive and one per device. > > Anyway I think this change need to be verified on sriov configuration on XGMI with some rocm use app is running . I do have XGMI setup where I still test XGMI resets. It has ROCm stack there - can you please login there and tell me if I have what needed there to do the tests you advise ? I am not very familiar with ROCm tools as i usually test using libdrm. (Ping me on teams for the device ip and user name) Andrey > > Regards > Shaoyun.liu > > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König > Sent: Monday, December 20, 2021 2:25 AM > To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org > Cc: daniel@ffwll.ch; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com> > Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu > > Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: >> This patchset is based on earlier work by Boris[1] that allowed to >> have an ordered workqueue at the driver level that will be used by the >> different schedulers to queue their timeout work. On top of that I >> also serialized any GPU reset we trigger from within amdgpu code to >> also go through the same ordered wq and in this way simplify somewhat >> our GPU reset code so we don't need to protect from concurrency by >> multiple GPU reset triggeres such as TDR on one hand and sysfs trigger or RAS trigger on the other hand. >> >> As advised by Christian and Daniel I defined a reset_domain struct >> such that all the entities that go through reset together will be >> serialized one against another. >> >> TDR triggered by multiple entities within the same domain due to the >> same reason will not be triggered as the first such reset will cancel >> all the pending resets. This is relevant only to TDR timers and not to >> triggered resets coming from RAS or SYSFS, those will still happen after the in flight resets finishes. >> >> [1] >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatc >> hwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210629073510.276439 >> 1-3-boris.brezillon%40collabora.com%2F&data=04%7C01%7CShaoyun.Liu% >> 40amd.com%7C1d2b07ad556b4da5d58808d9c389decf%7C3dd8961fe4884e608e11a82 >> d994e183d%7C0%7C0%7C637755819206627827%7CUnknown%7CTWFpbGZsb3d8eyJWIjo >> iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000& >> ;sdata=8C8UbdPmM%2FH6sdTYDP5lZfRfBdQ%2B%2FN7m6s%2FREW8%2BsoM%3D&re >> served=0 >> >> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work hasn't landed yet there. > Patches #1 and #5, #6 are Reviewed-by: Christian König <christian.koenig@amd.com> > > Some minor comments on the rest, but in general absolutely looks like the way we want to go. > > Regards, > Christian. > >> Andrey Grodzovsky (6): >> drm/amdgpu: Init GPU reset single threaded wq >> drm/amdgpu: Move scheduler init to after XGMI is ready >> drm/amdgpu: Fix crash on modprobe >> drm/amdgpu: Serialize non TDR gpu recovery with TDRs >> drm/amdgpu: Drop hive->in_reset >> drm/amdgpu: Drop concurrent GPU reset protection for device >> >> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 9 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++---------- >> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 36 +--- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 10 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +- >> 7 files changed, 132 insertions(+), 136 deletions(-) >>
WARNING: multiple messages have this Message-ID (diff)
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> To: "Liu, Shaoyun" <Shaoyun.Liu@amd.com>, "Christian König" <ckoenig.leichtzumerken@gmail.com>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>, "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org> Cc: "Liu, Monk" <Monk.Liu@amd.com>, "Chen, Horace" <Horace.Chen@amd.com>, "Koenig, Christian" <Christian.Koenig@amd.com>, "daniel@ffwll.ch" <daniel@ffwll.ch> Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Date: Mon, 20 Dec 2021 14:11:49 -0500 [thread overview] Message-ID: <7e1846e1-a392-fa48-875f-0f6581468a76@amd.com> (raw) In-Reply-To: <CH0PR12MB5372A4EAE67D6F2C0B06F5DCF47B9@CH0PR12MB5372.namprd12.prod.outlook.com> On 2021-12-20 12:06 p.m., Liu, Shaoyun wrote: > [AMD Official Use Only] > > > Hi , Andrey > I actually has some concerns about this change . > 1. on SRIOV configuration , the reset notify coming from host , and driver already trigger a work queue to handle the reset (check xgpu_*_mailbox_flr_work) , is it a good idea to trigger another work queue inside the work queue ? Can we just use the new one you added ? Shouldn't be a problem, i will change. In fact it's a great idea because then it looks like we can totally drop 'adev->in_gpu_reset' since we don't need to lock again concurrent resets anymore > 2. For KFD, the rocm use the user queue for the submission and it won't call the drm scheduler and hence no job timeout. Can we handle that with your new change ? I think that not a problem - a lot of places use direct submissions and not scheduler, in case they need to synchronize against concurrent GPU resets they lock adev->reset_sem. Nothing changes in this sense. > > 3 . For XGMI hive, there is only hive reset for all devices on bare-metal config , but for SRIOV config , the VF will support VF FLR, which means host might only need to reset specific device instead trigger whole hive reset . So we might still need reset_domain for individual device within the hive for SRIOV configuration. This is something future right ? I don't see it in the code - in this case we will have to account for this as part of the generic design for this kind of single device reset within XGMI hive. It should require only a minor addition to current design in creating 2 parallel reset domains - one for hive and one per device. > > Anyway I think this change need to be verified on sriov configuration on XGMI with some rocm use app is running . I do have XGMI setup where I still test XGMI resets. It has ROCm stack there - can you please login there and tell me if I have what needed there to do the tests you advise ? I am not very familiar with ROCm tools as i usually test using libdrm. (Ping me on teams for the device ip and user name) Andrey > > Regards > Shaoyun.liu > > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König > Sent: Monday, December 20, 2021 2:25 AM > To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org > Cc: daniel@ffwll.ch; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com> > Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu > > Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: >> This patchset is based on earlier work by Boris[1] that allowed to >> have an ordered workqueue at the driver level that will be used by the >> different schedulers to queue their timeout work. On top of that I >> also serialized any GPU reset we trigger from within amdgpu code to >> also go through the same ordered wq and in this way simplify somewhat >> our GPU reset code so we don't need to protect from concurrency by >> multiple GPU reset triggeres such as TDR on one hand and sysfs trigger or RAS trigger on the other hand. >> >> As advised by Christian and Daniel I defined a reset_domain struct >> such that all the entities that go through reset together will be >> serialized one against another. >> >> TDR triggered by multiple entities within the same domain due to the >> same reason will not be triggered as the first such reset will cancel >> all the pending resets. This is relevant only to TDR timers and not to >> triggered resets coming from RAS or SYSFS, those will still happen after the in flight resets finishes. >> >> [1] >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatc >> hwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210629073510.276439 >> 1-3-boris.brezillon%40collabora.com%2F&data=04%7C01%7CShaoyun.Liu% >> 40amd.com%7C1d2b07ad556b4da5d58808d9c389decf%7C3dd8961fe4884e608e11a82 >> d994e183d%7C0%7C0%7C637755819206627827%7CUnknown%7CTWFpbGZsb3d8eyJWIjo >> iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000& >> ;sdata=8C8UbdPmM%2FH6sdTYDP5lZfRfBdQ%2B%2FN7m6s%2FREW8%2BsoM%3D&re >> served=0 >> >> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work hasn't landed yet there. > Patches #1 and #5, #6 are Reviewed-by: Christian König <christian.koenig@amd.com> > > Some minor comments on the rest, but in general absolutely looks like the way we want to go. > > Regards, > Christian. > >> Andrey Grodzovsky (6): >> drm/amdgpu: Init GPU reset single threaded wq >> drm/amdgpu: Move scheduler init to after XGMI is ready >> drm/amdgpu: Fix crash on modprobe >> drm/amdgpu: Serialize non TDR gpu recovery with TDRs >> drm/amdgpu: Drop hive->in_reset >> drm/amdgpu: Drop concurrent GPU reset protection for device >> >> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 9 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++---------- >> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 36 +--- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 10 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +- >> 7 files changed, 132 insertions(+), 136 deletions(-) >>
next prev parent reply other threads:[~2021-12-20 19:11 UTC|newest] Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-12-17 22:27 [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-17 22:27 ` [RFC 1/6] drm/amdgpu: Init GPU reset single threaded wq Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-17 22:27 ` [RFC 2/6] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-20 7:16 ` Christian König 2021-12-20 7:16 ` Christian König 2021-12-20 21:51 ` Andrey Grodzovsky 2021-12-20 21:51 ` Andrey Grodzovsky 2021-12-21 7:05 ` Christian König 2021-12-21 7:05 ` Christian König 2021-12-17 22:27 ` [RFC 3/6] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-20 7:17 ` Christian König 2021-12-20 7:17 ` Christian König 2021-12-20 19:22 ` Andrey Grodzovsky 2021-12-20 19:22 ` Andrey Grodzovsky 2021-12-21 7:02 ` Christian König 2021-12-21 7:02 ` Christian König 2021-12-21 16:03 ` Andrey Grodzovsky 2021-12-21 16:03 ` Andrey Grodzovsky 2021-12-22 7:50 ` Christian König 2021-12-22 7:50 ` Christian König 2021-12-17 22:27 ` [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-20 7:20 ` Christian König 2021-12-20 7:20 ` Christian König 2021-12-20 22:17 ` Andrey Grodzovsky 2021-12-20 22:17 ` Andrey Grodzovsky 2021-12-21 7:59 ` Christian König 2021-12-21 7:59 ` Christian König 2021-12-21 16:10 ` Andrey Grodzovsky 2021-12-21 16:10 ` Andrey Grodzovsky 2021-12-17 22:27 ` [RFC 5/6] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-17 22:27 ` [RFC 6/6] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky 2021-12-17 22:27 ` Andrey Grodzovsky 2021-12-20 7:25 ` [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Christian König 2021-12-20 7:25 ` Christian König 2021-12-20 9:43 ` Daniel Vetter 2021-12-20 9:43 ` Daniel Vetter 2021-12-20 17:06 ` Liu, Shaoyun 2021-12-20 17:06 ` Liu, Shaoyun 2021-12-20 19:11 ` Andrey Grodzovsky [this message] 2021-12-20 19:11 ` Andrey Grodzovsky
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=7e1846e1-a392-fa48-875f-0f6581468a76@amd.com \ --to=andrey.grodzovsky@amd.com \ --cc=Christian.Koenig@amd.com \ --cc=Horace.Chen@amd.com \ --cc=Monk.Liu@amd.com \ --cc=Shaoyun.Liu@amd.com \ --cc=amd-gfx@lists.freedesktop.org \ --cc=ckoenig.leichtzumerken@gmail.com \ --cc=dri-devel@lists.freedesktop.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.