Hi Andrey,

Am 07.04.21 um 21:44 schrieb Andrey Grodzovsky:
>
> On 2021-04-07 6:28 a.m., Christian König wrote:
>> Hi Andrey,
>>
>> Am 06.04.21 um 23:22 schrieb Andrey Grodzovsky:
>>>
>>> Hey Christian, Dennis, see below -
>>>
>>> On 2021-04-06 6:34 a.m., Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> well good question. My job is to watch over the implementation and design, and while I always help I can't adjust anybody's schedule.
>>>>
>>>> Is the patch to print a warning when the hardware is accessed without holding the locks merged yet? If not then that would probably be a good starting point.
>>>
>>> It's merged into amd-staging-drm-next and since I work on drm-misc-next I will cherry-pick it into there.
>>>
>> Ok, good to know, I haven't tracked that one further.
>>
>>>> Then we would need to unify this with the SRCU to make sure that we both hold the reset lock as well as block the hotplug code from reusing the MMIO space.
>>>
>>> In my understanding there is a significant difference between the handling of GPU reset and unplug - while the GPU reset use case requires any HW-accessing code to block and wait for the reset to finish and then proceed, hot-unplug is permanent and hence there is no need to wait and proceed, but rather to abort at once.
>>>
>> Yes, absolutely correct.
>>
>>> This is why I think that in any place where we already check for device reset we should also add a check for hot-unplug, but the handling would be different in that for hot-unplug we would abort instead of keep waiting.
>>>
>> Yes, that's the rough picture in my head as well.
>>
>> Essentially Daniel's patch of having an amdgpu_device_hwaccess_begin()/_end() was the right approach. You just can't do it in the top level IOCTL handler, but rather need it somewhere between frontend and backend.
>
> Can you point me to which patch that was? I can't find it.
>

What I mean was the approach in patch #3 in this series where he replaced the down_read/up_read with amdgpu_read_lock()/amdgpu_read_unlock(). I would just not call it amdgpu_read_lock()/amdgpu_read_unlock(), but something more descriptive. [A sketch of such a begin/end helper follows below.]

Regards,
Christian.

>
>>> Similar to handling device reset, for unplug we obviously also need to stop and block any MMIO accesses once the device is unplugged and, as Daniel Vetter mentioned, we have to do it before finishing pci_remove (early device fini) and not later (when the last device reference is dropped from user space) in order to prevent the MMIO space we still access from being reused by other hot-plugged devices. As in the device reset case we need to cancel all delayed works, stop the drm scheduler and complete all unfinished fences (both HW and scheduler fences). While you stated a strong objection to force signalling scheduler fences from GPU reset, quote:
>>>
>>> "you can't signal the dma_fence waiting. Waiting for a dma_fence also means you wait for the GPU reset to finish. When we would signal the dma_fence during the GPU reset then we would run into memory corruption because the hardware jobs running after the GPU reset would access memory which is already freed."
>>>
>>> To my understanding this is a key difference with hot-unplug: the device is gone, all those concerns are irrelevant and hence we can actually force signal the scheduler fences (setting an error on them beforehand) to force completion of any waiting clients such as possibly IOCTLs or async page flips etc.
>>>
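[A minimal sketch of the combined guard being discussed here - wait on a GPU reset in progress, but abort once the device is hot-unplugged - written as the kind of "more descriptive" begin/end helper mentioned above. The names amdgpu_hwaccess_begin()/_end() are hypothetical, and adev->reset_sem is assumed to exist as in the patch series under discussion; this is an illustration of the idea, not the actual amdgpu code.]

#include <linux/rwsem.h>
#include <drm/drm_drv.h>        /* drm_dev_enter()/drm_dev_exit() */
#include "amdgpu.h"             /* struct amdgpu_device, adev_to_drm() */

/* Hypothetical helper: returns 0 when hardware access may proceed,
 * -ENODEV once the device has been hot-unplugged. */
static int amdgpu_hwaccess_begin(struct amdgpu_device *adev, int *idx)
{
        /* Hot-unplug is permanent: abort instead of waiting. */
        if (!drm_dev_enter(adev_to_drm(adev), idx))
                return -ENODEV;

        /* GPU reset is temporary: block until a reset in progress is done. */
        down_read(&adev->reset_sem);
        return 0;
}

static void amdgpu_hwaccess_end(struct amdgpu_device *adev, int idx)
{
        up_read(&adev->reset_sem);
        drm_dev_exit(idx);
}

[Code somewhere between frontend and backend would bracket its hardware access with these two calls and bail out on -ENODEV.]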
>>
>> Yes, absolutely correct. That's what I also mentioned to Daniel. When we are able to nuke the device and any memory access it might do, we can also signal the fences.
>>
>>> Beyond blocking all delayed works and scheduler threads we also need to guarantee that no IOCTL can access MMIO post device unplug, or that in-flight IOCTLs are done before we finish pci_remove (amdgpu_pci_remove for us).
>>> For this I suggest we do something like what we worked on with Takashi Iwai, the ALSA maintainer, recently when he helped implement PCI BAR move support for snd_hda_intel. Take a look at
>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
>>> and
>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>> We had the same issue there: how to prevent MMIO accesses while the BARs are migrating. What was done there is that a refcount was added to count all IOCTLs in flight; for any in-flight IOCTL the BAR migration handler blocks until the refcount drops to 0 before it proceeds, and any later IOCTL stops and waits if the device is in the migration state. We don't even need the wait part - there is nothing to wait for, we just return -ENODEV in this case.
>>>
>> This is essentially what the DRM SRCU is doing as well.
>>
>> For the hotplug case we could do this at the top level since we can signal the fence and don't need to block memory management.
>
> To make the SRCU 'wait for' all IOCTLs in flight we would need to wrap every IOCTL (practically just the drm_ioctl function) with drm_dev_enter/drm_dev_exit - can we do it?
>
>> But I'm not sure, maybe we should handle it the same way as reset or maybe we should have it at the top level.
>
> If by top level you mean checking for the device being unplugged and bailing out at the entry to an IOCTL, or right at the start of any work_item/timer function we have, then that seems better and clearer to me. Once we have flushed all of those in flight there is no reason for them to execute any more when the device is unplugged. [A sketch of this pattern follows below.]
>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> The above approach should allow us to wait for all the IOCTLs in flight; together with stopping the scheduler threads and cancelling and flushing all in-flight work items and timers, I think it should give us a full solution for the hot-unplug case of preventing any MMIO accesses post device pci_remove.
>>>
>>> Let me know what you think, guys.
>>>
>>> Andrey
>>>
>>>> And then testing, testing, testing to see if we have missed something.
>>>>
>>>> Christian.
>>>>
>>>> Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Dennis, Christian, are there any updates on the plan for how to move on with this? As you know I need very similar code for my upstreaming of device hot-unplug. My latest solution (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) was not acceptable because of the low level guards at the register accessor level, which were hurting performance. Basically I need a way to prevent any MMIO write accesses from the kernel driver after the device is removed (UMD accesses are taken care of by page-faulting a dummy page). We are now using the hot-unplug code for the Freemont program, so upstreaming became more of a priority than before. This MMIO access issue is currently my main blocker for upstreaming.
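[A minimal sketch of the "bail out at the entry" pattern discussed above for work items, using the DRM unplug SRCU via drm_dev_enter()/drm_dev_exit(). The handler name and the example_work member are made up for illustration; this is not the actual amdgpu code.]

#include <linux/workqueue.h>
#include <drm/drm_drv.h>        /* drm_dev_enter()/drm_dev_exit()/drm_dev_unplug() */
#include "amdgpu.h"             /* struct amdgpu_device, adev_to_drm() */

/* Hypothetical work handler guarded right at its entry point;
 * 'example_work' is an assumed member, for illustration only. */
static void amdgpu_example_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, example_work);
        int idx;

        /* Return at once if the device was already unplugged.  The
         * pci_remove path calls drm_dev_unplug(), which waits (via SRCU)
         * for every section like this one to finish, so no MMIO access
         * can happen after the BARs are released and possibly reused. */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;

        /* ... MMIO access goes here ... */

        drm_dev_exit(idx);
}

[Wrapping drm_ioctl() the same way, as asked above, would cover the IOCTL side in one place; an IOCTL arriving after the unplug would then fail with -ENODEV instead of touching MMIO.]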
>>>>> Is there any way I can assist in pushing this on?
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-03-18 5:51 a.m., Christian König wrote:
>>>>>> Am 18.03.21 um 10:30 schrieb Li, Dennis:
>>>>>>>
>>>>>>> >>> The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset.
>>>>>>>
>>>>>>> >>> So waiting for a fence while holding the reset lock is illegal and needs to be avoided.
>>>>>>>
>>>>>>> I understood your concern. It is more complex for DRM GFX, therefore I am abandoning adding lock protection for the DRM ioctls for now. Maybe we can try to add all kernel dma_fence waits to a list and signal them all in the recovery thread. Do you have the same concern for the compute cases?
>>>>>>>
>>>>>> Yes, compute (KFD) is even harder to handle.
>>>>>>
>>>>>> See, you can't signal the dma_fence waiting. Waiting for a dma_fence also means you wait for the GPU reset to finish.
>>>>>>
>>>>>> When we would signal the dma_fence during the GPU reset then we would run into memory corruption because the hardware jobs running after the GPU reset would access memory which is already freed.
>>>>>>
>>>>>>> >>> Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.
>>>>>>>
>>>>>>> Agree. This approach escapes the monitoring of lockdep. Its goal is to block other threads when the GPU recovery thread starts. But I couldn't find a better method to solve this problem. Do you have a suggestion?
>>>>>>>
>>>>>> Well, completely abandon those changes here.
>>>>>>
>>>>>> What we need to do is to identify where hardware access happens and then insert taking the read side of the GPU reset lock so that we don't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Dennis Li
>>>>>>>
>>>>>>> *From:* Koenig, Christian
>>>>>>> *Sent:* Thursday, March 18, 2021 4:59 PM
>>>>>>> *To:* Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>>>>>>
>>>>>>> Exactly that's what you don't seem to understand.
>>>>>>>
>>>>>>> The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset.
>>>>>>>
>>>>>>> So waiting for a fence while holding the reset lock is illegal and needs to be avoided.
>>>>>>>
>>>>>>> Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>> *Von:* Li, Dennis
>>>>>>> *Gesendet:* Donnerstag, 18. März 2021 09:28
>>>>>>> *An:* Koenig, Christian; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>> *Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>>>>>>
>>>>>>> >>> Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.
>>>>>>>
>>>>>>> What about adding a check for adev->in_gpu_reset in the work item? If we exchange the two steps, it may introduce a deadlock. For example, if a user thread holds the read lock and waits for a fence while the recovery thread tries to take the write lock and then complete the fences, the recovery thread will always be blocked.
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Dennis Li
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian
>>>>>>> Sent: Thursday, March 18, 2021 3:54 PM
>>>>>>> To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
>>>>>>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>>>>>>
>>>>>>> Am 18.03.21 um 08:23 schrieb Dennis Li:
>>>>>>> > We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to avoid recovery thread re-entry and to make lower functions return earlier when recovery starts, but it cannot block other threads while the recovery thread accesses the hardware. The r/w semaphore reset_sem is used to solve these synchronization issues between the recovery thread and other threads.
>>>>>>> >
>>>>>>> > The original solution locked register access in the lower functions, which introduces the following issues:
>>>>>>> >
>>>>>>> > 1) Many lower functions are used by both the recovery thread and other threads. Firstly we must identify all of these functions, and it is easy to miss some. Secondly these functions need to select which lock (read lock or write lock) to take according to the thread context they run in; if the thread context isn't considered, the added lock will easily introduce a deadlock. Besides that, most of the time developers simply forget to add locks for new functions.
>>>>>>> >
>>>>>>> > 2) Performance drop. The lower functions are called very frequently.
>>>>>>> >
>>>>>>> > 3) It easily introduces false positive lockdep complaints, because the write lock covers a big range in the recovery thread, while the low level functions holding the read lock may be protected by other locks in other threads.
>>>>>>> >
>>>>>>> > Therefore the new solution tries to add lock protection for the kfd ioctls instead. Its goal is that no threads except for the recovery thread or its children (for xgmi) access the hardware while doing GPU reset and resume. So the recovery thread is refined as follows:
>>>>>>> >
>>>>>>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>>>>>>> >     1) If it fails, it means the system already has a recovery thread running, and the current thread exits directly;
>>>>>>> >     2) if it succeeds, enter the recovery thread;
>>>>>>> >
>>>>>>> > Step 1: cancel all delayed works, stop the drm scheduler, complete all unreceived fences and so on. This tries to stop or pause the other threads.
>>>>>>> >
>>>>>>> > Step 2: call down_write(&adev->reset_sem) to hold the write lock, which will block the recovery thread until other threads release their read locks.
>>>>>>>
>>>>>>> Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc. are started before the lock is taken.
>>>>>>>
>>>>>>> Just to make it clear: until this is fixed the whole patch set is a NAK.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> >
>>>>>>> > Step 3: normally, only the recovery thread is running and accessing the hardware, so it is safe to do the GPU reset now.
>>>>>>> >
>>>>>>> > Step 4: do the post GPU reset work, such as calling all IPs' resume functions;
>>>>>>> >
>>>>>>> > Step 5: atomically set adev->in_gpu_reset to 0, wake up the other threads and release the write lock. The recovery thread exits normally.
>>>>>>> >
>>>>>>> > Other threads call amdgpu_read_lock to synchronize with the recovery thread. If a thread finds that in_gpu_reset is 1, it should release the read lock if it holds one and then block itself to wait for the recovery-finished event. If the thread successfully holds the read lock and in_gpu_reset is 0, it continues. It will exit normally or be stopped by the recovery thread in step 1. [A sketch of this reader-side helper follows at the end of this message.]
>>>>>>> >
>>>>>>> > Dennis Li (4):
>>>>>>> >    drm/amdgpu: remove reset lock from low level functions
>>>>>>> >    drm/amdgpu: refine the GPU recovery sequence
>>>>>>> >    drm/amdgpu: instead of using down/up_read directly
>>>>>>> >    drm/amdkfd: add reset lock protection for kfd entry functions
>>>>>>> >
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h          |   6 +
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c  |  14 +-
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 173 +++++++++++++-----
>>>>>>> >   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c   |   8 -
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |   4 +-
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |   9 +-
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c        |   5 +-
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c        |   5 +-
>>>>>>> >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c     | 172 ++++++++++++++++-
>>>>>>> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   3 +-
>>>>>>> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c     |   4 +
>>>>>>> >   .../amd/amdkfd/kfd_process_queue_manager.c   |  17 ++
>>>>>>> >   12 files changed, 345 insertions(+), 75 deletions(-)
>>>>>>> >
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
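[A rough sketch of the reader-side helper described in the quoted cover letter above: take the read lock, and if a recovery is in progress drop it and wait for the recovery-finished event before retrying. The wait queue name recovery_done_wq is an assumption made for illustration, so this shows the described semantics rather than the actual patch.]

#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/wait.h>
#include "amdgpu.h"     /* struct amdgpu_device with reset_sem/in_gpu_reset as in the series */

/* Sketch of the amdgpu_read_lock() semantics from the cover letter. */
static int amdgpu_read_lock_sketch(struct amdgpu_device *adev)
{
        for (;;) {
                down_read(&adev->reset_sem);
                if (!atomic_read(&adev->in_gpu_reset))
                        return 0;       /* no recovery running, caller may touch the HW */

                /* A recovery is running: drop the read lock so the recovery
                 * thread can take the write lock, then wait (on a hypothetical
                 * recovery_done_wq wait queue) until it is finished. */
                up_read(&adev->reset_sem);
                wait_event(adev->recovery_done_wq,
                           !atomic_read(&adev->in_gpu_reset));
        }
}

[The recovery thread would then clear in_gpu_reset and wake_up(&adev->recovery_done_wq) in step 5. Note that this sketch deliberately leaves open the ordering question raised in the review above, i.e. whether the write lock has to be taken before the delayed works are cancelled.]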