On 2021-04-08 4:32 a.m., Christian König wrote:
>
>
> On 08.04.21 at 10:22, Christian König wrote:
>> Hi Andrey,
>>
>> On 07.04.21 at 21:44, Andrey Grodzovsky wrote:
>>>
>>> On 2021-04-07 6:28 a.m., Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> On 06.04.21 at 23:22, Andrey Grodzovsky wrote:
>>>>>
>>>>> Hey Christian, Denis, see below -
>>>>>
>>>>> On 2021-04-06 6:34 a.m., Christian König wrote:
>>>>>> Hi Andrey,
>>>>>>
>>>>>> Well, good question. My job is to watch over the implementation
>>>>>> and design, and while I always help I can adjust anybody's schedule.
>>>>>>
>>>>>> Is the patch to print a warning when the hardware is accessed
>>>>>> without holding the locks merged yet? If not, then that would
>>>>>> probably be a good starting point.
>>>>>
>>>>> It's merged into amd-staging-drm-next, and since I work on
>>>>> drm-misc-next I will cherry-pick it into there.
>>>>>
>>>> Ok, good to know, I haven't tracked that one further.
>>>>
>>>>>> Then we would need to unify this with the SRCU to make sure that
>>>>>> we both hold the reset lock and block the hotplug code from
>>>>>> reusing the MMIO space.
>>>>>
>>>>> In my understanding there is a significant difference between
>>>>> handling of GPU reset and unplug - while the GPU reset use case
>>>>> requires any HW-accessing code to block, wait for the reset to
>>>>> finish and then proceed, hot-unplug is permanent and hence there is
>>>>> no need to wait and proceed, but rather to abort at once.
>>>>>
>>>> Yes, absolutely correct.
>>>>
>>>>> This is why I think that in any place where we already check for
>>>>> device reset we should also add a check for hot-unplug, but the
>>>>> handling would be different in that for hot-unplug we would abort
>>>>> instead of keep waiting.
>>>>
>>>> Yes, that's the rough picture in my head as well.
>>>>
>>>> Essentially Daniel's patch of having an
>>>> amdgpu_device_hwaccess_begin()/_end() was the right approach. You
>>>> just can't do it in the top-level IOCTL handler, but rather need it
>>>> somewhere between front end and back end.
>>>
>>> Can you point me to which patch that was? I can't find it.
>>>
>> What I meant was the approach in patch #3 in this series, where he
>> replaced the down_read/up_read with
>> amdgpu_read_lock()/amdgpu_read_unlock().
>>
>> I would just not call it amdgpu_read_lock()/amdgpu_read_unlock(), but
>> something more descriptive.
>>
>> Regards,
>> Christian.
>>
>>>>> Similar to handling device reset, for unplug we obviously also need
>>>>> to stop and block any MMIO accesses once the device is unplugged
>>>>> and, as Daniel Vetter mentioned, we have to do it before finishing
>>>>> pci_remove (early device fini) and not later (when the last device
>>>>> reference is dropped from user space), in order to prevent reuse of
>>>>> MMIO space we still access by other hot-plugging devices. As in the
>>>>> device reset case we need to cancel all delayed works, stop the drm
>>>>> scheduler and complete all unfinished fences (both HW and scheduler
>>>>> fences). While you stated a strong objection to force signalling
>>>>> scheduler fences from GPU reset, quote:
>>>>>
>>>>> "you can't signal the dma_fence waiting. Waiting for a dma_fence
>>>>> also means you wait for the GPU reset to finish. When we would
>>>>> signal the dma_fence during the GPU reset then we would run into
>>>>> memory corruption because the hardware jobs running after the GPU
>>>>> reset would access memory which is already freed."
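As a rough sketch of the kind of begin/end guard being discussed, something like the following could sit between front end and back end: abort when the device is unplugged, block while a reset is in flight. drm_dev_enter()/drm_dev_exit() and adev->reset_sem are the real primitives mentioned in the thread; the amdgpu_hwaccess_begin()/_end() names are only placeholders borrowed from the discussion, not existing functions.

#include <drm/drm_drv.h>        /* drm_dev_enter(), drm_dev_exit() */
#include "amdgpu.h"             /* struct amdgpu_device, adev_to_drm() */

/* Placeholder helpers: abort on hot-unplug, wait on GPU reset. */
static int amdgpu_hwaccess_begin(struct amdgpu_device *adev, int *idx)
{
        /* Hot-unplug is permanent: bail out at once. */
        if (!drm_dev_enter(adev_to_drm(adev), idx))
                return -ENODEV;

        /* GPU reset is transient: block until the reset writer is done. */
        down_read(&adev->reset_sem);
        return 0;
}

static void amdgpu_hwaccess_end(struct amdgpu_device *adev, int idx)
{
        up_read(&adev->reset_sem);
        drm_dev_exit(idx);
}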
>>>>> To my understanding this is a key difference with hot-unplug: the
>>>>> device is gone, all those concerns are irrelevant and hence we can
>>>>> actually force signal scheduler fences (setting an error on them
>>>>> before) to force completion of any waiting clients such as possibly
>>>>> IOCTLs or async page flips etc.
>>>>
>>>> Yes, absolutely correct. That's what I also mentioned to Daniel.
>>>> When we are able to nuke the device and any memory access it might
>>>> do, we can also signal the fences.
>>>>
>>>>> Beyond blocking all delayed works and scheduler threads we also
>>>>> need to guarantee that no IOCTL can access MMIO post device unplug,
>>>>> OR that in-flight IOCTLs are done before we finish pci_remove
>>>>> (amdgpu_pci_remove for us). For this I suggest we do something like
>>>>> what we worked on with Takashi Iwai, the ALSA maintainer, recently
>>>>> when he helped implementing PCI BAR move support for snd_hda_intel.
>>>>> Take a look at
>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
>>>>> and
>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>>>> We also had the same issue there: how to prevent MMIO accesses
>>>>> while the BARs are migrating. What was done there is that a
>>>>> refcount was added to count all IOCTLs in flight; for any in-flight
>>>>> IOCTL the BAR migration handler would block until the refcount
>>>>> drops to 0 before it would proceed, and any later IOCTL stops and
>>>>> waits if the device is in the migration state. We don't even need
>>>>> the wait part - there is nothing to wait for, we just return with
>>>>> -ENODEV for this case.
>>>>
>>>> This is essentially what the DRM SRCU is doing as well.
>>>>
>>>> For the hotplug case we could do this in the top level since we can
>>>> signal the fence and don't need to block memory management.
>>>
>>> To make SRCU 'wait for' all IOCTLs in flight we would need to wrap
>>> every IOCTL (practically - just the drm_ioctl function) with
>>> drm_dev_enter/drm_dev_exit - can we do it?
>>>
>
> Sorry, I totally missed this question.
>
> Yes, exactly that. As discussed, for the hotplug case we can do this.

Thinking more about it - assuming we are treating synchronize_srcu as a
'wait for completion' of any in-flight {drm_dev_enter, drm_dev_exit}
scope, some of those scopes might do dma_fence_wait inside. Since we
haven't force signaled the fences yet, we will end up in a deadlock. We
have to signal all the various fences before doing the 'wait for'. But
we can't signal the fences before setting 'dev->unplugged = true' to
reject further CS and other stuff which might create more fences we
were supposed to force signal and would now miss. Effectively, setting
'dev->unplugged = true' and doing synchronize_srcu in one call, like
drm_dev_unplug does, without signalling all the fences in the device in
between these two steps, looks like a possible deadlock to me - what do
you think?

Andrey

>
>>>
>>>> But I'm not sure, maybe we should handle it the same way as reset,
>>>> or maybe we should have it at the top level.
>>>
>>> If by top level you mean checking for device unplugged and bailing
>>> out at the entry to the IOCTL, or right at the start of any
>>> work_item/timer function we have, then it seems to me that's better
>>> and more clear. Once we have flushed all of them in flight there is
>>> no reason for them to execute any more when the device is unplugged.
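To make the ordering problem concrete, this is roughly what drm_dev_unplug() does today (simplified; drm_unplug_srcu is a static srcu_struct in drivers/gpu/drm/drm_drv.c, so any split like this would have to live there), together with a purely hypothetical split that would leave room for force-signalling the fences in between. The two new function names are invented for illustration only.

/* Today (simplified): the flag flip and the SRCU wait happen back to
 * back, so a reader blocked on an unsignalled dma_fence inside a
 * drm_dev_enter()/drm_dev_exit() scope can deadlock synchronize_srcu(). */
void drm_dev_unplug(struct drm_device *dev)
{
        dev->unplugged = true;                  /* new drm_dev_enter() calls fail */
        synchronize_srcu(&drm_unplug_srcu);     /* wait for all enter/exit scopes */
        drm_dev_unregister(dev);
}

/* Hypothetical split (names invented for illustration): */
void drm_dev_set_unplugged(struct drm_device *dev)
{
        dev->unplugged = true;                  /* step 1: reject new CS/IOCTLs */
}

void drm_dev_synchronize_unplug(struct drm_device *dev)
{
        synchronize_srcu(&drm_unplug_srcu);     /* step 3: nothing left to block on */
}

/* Step 2, done by the driver between the two calls: set an error on and
 * force-signal all HW and scheduler fences. */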
>>>
>
> Well, I'm open to both approaches. I just think having
> drm_dev_enter/exit on each work item would be more defensive in case
> we forgot to cancel/sync one.
>
> Christian.
>
>>> Andrey
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> The above approach should allow us to wait for all the IOCTLs in
>>>>> flight, together with stopping scheduler threads and cancelling and
>>>>> flushing all in-flight work items and timers. I think it should
>>>>> give us a full solution for the hot-unplug case of preventing any
>>>>> MMIO accesses post device pci_remove.
>>>>>
>>>>> Let me know what you think, guys.
>>>>>
>>>>> Andrey
>>>>>
>>>>>> And then testing, testing, testing to see if we have missed
>>>>>> something.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> On 05.04.21 at 19:58, Andrey Grodzovsky wrote:
>>>>>>> Denis, Christian, are there any updates in the plan on how to
>>>>>>> move on with this? As you know I need very similar code for my
>>>>>>> upstreaming of device hot-unplug. My latest solution
>>>>>>> (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html)
>>>>>>> was not acceptable because of the low-level guards at the
>>>>>>> register accessor level, which were hurting performance.
>>>>>>> Basically I need a way to prevent any MMIO write accesses from
>>>>>>> the kernel driver after the device is removed (UMD accesses are
>>>>>>> taken care of by page faulting a dummy page). We are now using
>>>>>>> the hot-unplug code for the Freemont program and so upstreaming
>>>>>>> became more of a priority than before. This MMIO access issue is
>>>>>>> currently my main blocker for upstreaming. Is there any way I can
>>>>>>> assist in pushing this on?
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 2021-03-18 5:51 a.m., Christian König wrote:
>>>>>>>> On 18.03.21 at 10:30, Li, Dennis wrote:
>>>>>>>>> >>> The GPU reset doesn't complete the fences we wait for. It
>>>>>>>>> only completes the hardware fences as part of the reset.
>>>>>>>>>
>>>>>>>>> >>> So waiting for a fence while holding the reset lock is
>>>>>>>>> illegal and needs to be avoided.
>>>>>>>>>
>>>>>>>>> I understood your concern. It is more complex for DRM GFX,
>>>>>>>>> therefore I am abandoning adding lock protection for DRM ioctls
>>>>>>>>> for now. Maybe we can try to add all kernel dma_fence waits to
>>>>>>>>> a list, and signal them all in the recovery thread. Do you have
>>>>>>>>> the same concern for compute cases?
>>>>>>>>
>>>>>>>> Yes, compute (KFD) is even harder to handle.
>>>>>>>>
>>>>>>>> See, you can't signal the dma_fence waiting. Waiting for a
>>>>>>>> dma_fence also means you wait for the GPU reset to finish.
>>>>>>>>
>>>>>>>> When we would signal the dma_fence during the GPU reset then we
>>>>>>>> would run into memory corruption because the hardware jobs
>>>>>>>> running after the GPU reset would access memory which is already
>>>>>>>> freed.
>>>>>>>>
>>>>>>>>> >>> Lockdep also complains about this when it is used
>>>>>>>>> correctly. The only reason it doesn't complain here is because
>>>>>>>>> you use an atomic+wait_event instead of a locking primitive.
>>>>>>>>>
>>>>>>>>> Agree. This approach escapes the monitoring of lockdep. Its
>>>>>>>>> goal is to block other threads when the GPU recovery thread
>>>>>>>>> starts. But I couldn't find a better method to solve this
>>>>>>>>> problem. Do you have some suggestion?
>>>>>>>>
>>>>>>>> Well, completely abandon those changes here.
>>>>>>>>
>>>>>>>> What we need to do is to identify where hardware access happens
>>>>>>>> and then insert taking the read side of the GPU reset lock
>>>>>>>> there, so that we don't wait for a dma_fence or allocate memory,
>>>>>>>> but still protect the hardware from concurrent access and reset.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> Dennis Li
>>>>>>>>>
>>>>>>>>> *From:* Koenig, Christian
>>>>>>>>> *Sent:* Thursday, March 18, 2021 4:59 PM
>>>>>>>>> *To:* Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher,
>>>>>>>>> Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>>>> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> Exactly that's what you don't seem to understand.
>>>>>>>>>
>>>>>>>>> The GPU reset doesn't complete the fences we wait for. It only
>>>>>>>>> completes the hardware fences as part of the reset.
>>>>>>>>>
>>>>>>>>> So waiting for a fence while holding the reset lock is illegal
>>>>>>>>> and needs to be avoided.
>>>>>>>>>
>>>>>>>>> Lockdep also complains about this when it is used correctly.
>>>>>>>>> The only reason it doesn't complain here is because you use an
>>>>>>>>> atomic+wait_event instead of a locking primitive.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> *From:* Li, Dennis
>>>>>>>>> *Sent:* Thursday, 18 March 2021 09:28
>>>>>>>>> *To:* Koenig, Christian; amd-gfx@lists.freedesktop.org;
>>>>>>>>> Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>>>> *Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> >>> Those two steps need to be exchanged or otherwise it is
>>>>>>>>> possible that new delayed work items etc are started before the
>>>>>>>>> lock is taken.
>>>>>>>>>
>>>>>>>>> What about adding a check for adev->in_gpu_reset in the work
>>>>>>>>> item? If we exchange the two steps, it may introduce a
>>>>>>>>> deadlock. For example, a user thread holds the read lock and
>>>>>>>>> waits for a fence; if the recovery thread tries to take the
>>>>>>>>> write lock first and only then completes the fences, the
>>>>>>>>> recovery thread will be blocked forever.
>>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>> Dennis Li
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian
>>>>>>>>> Sent: Thursday, March 18, 2021 3:54 PM
>>>>>>>>> To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher,
>>>>>>>>> Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>>>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> On 18.03.21 at 08:23, Dennis Li wrote:
>>>>>>>>> > We have defined two variables, in_gpu_reset and reset_sem, in
>>>>>>>>> the adev object. The atomic variable in_gpu_reset is used to
>>>>>>>>> prevent the recovery thread from re-entering and to make lower
>>>>>>>>> functions return earlier when recovery starts, but it can't
>>>>>>>>> block the recovery thread when it accesses hardware. The r/w
>>>>>>>>> semaphore reset_sem is used to solve these synchronization
>>>>>>>>> issues between the recovery thread and other threads.
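As a side note on the rule Christian states above (take the read side of the reset lock only around the actual hardware access, never around a fence wait or an allocation), a minimal sketch using the reset_sem just described could look like this; the function name and the bare RREG32() read are purely illustrative.

#include "amdgpu.h"     /* struct amdgpu_device, RREG32(), adev->reset_sem */

/* Illustrative only: guard just the MMIO access with the read side of
 * the reset lock; any dma_fence_wait() or memory allocation has to
 * stay outside the locked region. */
static u32 amdgpu_read_reg_guarded(struct amdgpu_device *adev, u32 reg)
{
        u32 val;

        down_read(&adev->reset_sem);
        val = RREG32(reg);      /* plain register read, nothing that can block on a fence */
        up_read(&adev->reset_sem);

        return val;
}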
>>>>>>>>> >
>>>>>>>>> > The original solution locked register access in the lower
>>>>>>>>> functions, which introduces the following issues:
>>>>>>>>> >
>>>>>>>>> > 1) Many lower functions are used in both the recovery thread
>>>>>>>>> and other threads. Firstly, we must harvest all these
>>>>>>>>> functions, and it is easy to miss some. Secondly, these
>>>>>>>>> functions need to select which lock (read lock or write lock)
>>>>>>>>> to take according to the thread they are running in. If the
>>>>>>>>> thread context isn't considered, the added lock will easily
>>>>>>>>> introduce a deadlock. Besides that, most of the time developers
>>>>>>>>> easily forget to add locks for new functions.
>>>>>>>>> >
>>>>>>>>> > 2) Performance drop. The lower functions are called more
>>>>>>>>> frequently.
>>>>>>>>> >
>>>>>>>>> > 3) It easily introduces false positive lockdep complaints,
>>>>>>>>> because the write lock covers a big range in the recovery
>>>>>>>>> thread, while low level functions holding the read lock may be
>>>>>>>>> protected by other locks in other threads.
>>>>>>>>> >
>>>>>>>>> > Therefore the new solution tries to add lock protection at
>>>>>>>>> the kfd ioctls. Its goal is that no threads except for the
>>>>>>>>> recovery thread or its children (for xgmi) access hardware
>>>>>>>>> while doing GPU reset and resume. So refine the recovery thread
>>>>>>>>> as follows:
>>>>>>>>> >
>>>>>>>>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>>>>>>>>> >     1) if it fails, it means the system already has a
>>>>>>>>> recovery thread running, and the current thread exits directly;
>>>>>>>>> >     2) if it succeeds, enter the recovery thread;
>>>>>>>>> >
>>>>>>>>> > Step 1: cancel all delayed works, stop the drm scheduler,
>>>>>>>>> complete all unreceived fences and so on. This tries to stop or
>>>>>>>>> pause the other threads.
>>>>>>>>> >
>>>>>>>>> > Step 2: call down_write(&adev->reset_sem) to take the write
>>>>>>>>> lock, which will block the recovery thread until the other
>>>>>>>>> threads release their read locks.
>>>>>>>>>
>>>>>>>>> Those two steps need to be exchanged or otherwise it is
>>>>>>>>> possible that new delayed work items etc are started before the
>>>>>>>>> lock is taken.
>>>>>>>>>
>>>>>>>>> Just to make it clear: until this is fixed the whole patch set
>>>>>>>>> is a NAK.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> >
>>>>>>>>> > Step 3: normally only the recovery thread is left running and
>>>>>>>>> accessing hardware; it is safe to do the gpu reset now.
>>>>>>>>> >
>>>>>>>>> > Step 4: do the post gpu reset work, such as calling all IPs'
>>>>>>>>> resume functions;
>>>>>>>>> >
>>>>>>>>> > Step 5: atomically set adev->in_gpu_reset back to 0, wake up
>>>>>>>>> the other threads and release the write lock. The recovery
>>>>>>>>> thread exits normally.
>>>>>>>>> >
>>>>>>>>> > Other threads call amdgpu_read_lock to synchronize with the
>>>>>>>>> recovery thread. If a thread finds that in_gpu_reset is 1, it
>>>>>>>>> should release the read lock if it holds one, and then block
>>>>>>>>> itself waiting for the recovery-finished event. If the thread
>>>>>>>>> successfully takes the read lock and in_gpu_reset is 0, it
>>>>>>>>> continues. It will exit normally or be stopped by the recovery
>>>>>>>>> thread in step 1.
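Condensing Steps 0-5 into code makes the ordering objection easier to see. This is only a sketch of the described flow, not the actual amdgpu_device_gpu_recover() implementation; the ring loop and the scheduler/fence calls loosely follow the existing recovery code, everything else is simplified.

#include <drm/gpu_scheduler.h>  /* drm_sched_stop(), drm_sched_start() */
#include "amdgpu.h"             /* struct amdgpu_device, amdgpu_asic_reset() */

/* Sketch of the proposed recovery flow (Steps 0-5 above), simplified. */
static int amdgpu_gpu_recover_sketch(struct amdgpu_device *adev)
{
        int i;

        /* Step 0: only one recovery thread at a time. */
        if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                return 0;

        /* Step 1: stop the schedulers and force-complete the hardware
         * fences (delayed works would be cancelled here as well). */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (!ring || !ring->sched.thread)
                        continue;
                drm_sched_stop(&ring->sched, NULL);
                amdgpu_fence_driver_force_completion(ring);
        }

        /* Step 2: wait until every reader has dropped adev->reset_sem.
         * Christian's objection: steps 1 and 2 must be swapped, or new
         * delayed work items can still be queued after step 1 but
         * before the write lock is held. */
        down_write(&adev->reset_sem);

        /* Steps 3 and 4: reset the ASIC and resume the IP blocks (the
         * real reset/resume sequence is much more involved). */
        amdgpu_asic_reset(adev);

        /* Step 5: restart the schedulers and let other threads back in. */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (!ring || !ring->sched.thread)
                        continue;
                drm_sched_start(&ring->sched, true);
        }
        atomic_set(&adev->in_gpu_reset, 0);
        up_write(&adev->reset_sem);
        return 0;
}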
>>>>>>>>> >
>>>>>>>>> > Dennis Li (4):
>>>>>>>>> >   drm/amdgpu: remove reset lock from low level functions
>>>>>>>>> >   drm/amdgpu: refine the GPU recovery sequence
>>>>>>>>> >   drm/amdgpu: instead of using down/up_read directly
>>>>>>>>> >   drm/amdkfd: add reset lock protection for kfd entry functions
>>>>>>>>> >
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h          |   6 +
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c  |  14 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 173 +++++++++++++-----
>>>>>>>>> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c   |   8 -
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |   4 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |   9 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c        |   5 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c        |   5 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c     | 172 ++++++++++++++++-
>>>>>>>>> >  drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   3 +-
>>>>>>>>> >  drivers/gpu/drm/amd/amdkfd/kfd_process.c     |   4 +
>>>>>>>>> >  .../amd/amdkfd/kfd_process_queue_manager.c   |  17 ++
>>>>>>>>> >  12 files changed, 345 insertions(+), 75 deletions(-)
>>>>>>>>> >
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx