From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: DRI Development <dri-devel@lists.freedesktop.org>
Cc: "Intel Graphics Development" <intel-gfx@lists.freedesktop.org>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-rdma@vger.kernel.org, amd-gfx@lists.freedesktop.org,
	"Chris Wilson" <chris@chris-wilson.co.uk>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Daniel Vetter" <daniel.vetter@intel.com>
Subject: [PATCH 20/65] drm/scheduler: use dma-fence annotations in tdr work
Date: Fri, 23 Oct 2020 14:21:31 +0200
Message-ID: <20201023122216.2373294-20-daniel.vetter@ffwll.ch>
In-Reply-To: <20201023122216.2373294-1-daniel.vetter@ffwll.ch>

In the face of unprivileged userspace being able to submit bogus gpu
workloads the kernel needs gpu timeout and reset (tdr) to guarantee
that dma_fences actually complete. Annotate this worker to make sure
we don't have any accidental locking inversions or other problems
lurking.

Originally this was part of the overall scheduler annotation patch.
But amdgpu has some glorious inversions here:

- grabs console_lock
- does a full modeset, which grabs all kinds of locks
  (drm_modeset_lock, dma_resv_lock) which can deadlock with
  dma_fence_wait held inside them.
- almost minor at that point, but the modeset code also allocates
  memory

These all look like they'll be very hard to fix properly, the hardware
seems to require a full display reset with any gpu recovery.

Hence split out as a separate patch.

Since amdgpu isn't the only hardware driver that needs to reset the
display (at least gen2/3 on intel have the same problem) we need a
generic solution for this. There are two tricks we could steal from
drm/i915 and lift to dma-fence:

- The big whack, aka force-complete all fences. i915 does this for all
  pending jobs if the reset is somehow stuck.
  Trouble is we'd need to do this for all fences in the entire system,
  and just the book-keeping for that will be fun. Plus lots of drivers
  use fences for all kinds of internal stuff like memory management, so
  unconditionally resetting all of them doesn't work.

  I'm also hoping that with these fence annotations we could enlist
  lockdep in finding the last offenders causing deadlocks, and we
  could remove this get-out-of-jail trick.

- The more feasible approach (across drivers at least as part of the
  dma_fence contract) is what drm/i915 does for gen2/3: When we need
  to reset the display we wake up all dma_fence_wait_interruptible
  calls, or well at least the equivalent of those in i915 internally.

  Relying on ioctl restart we force all other threads to release their
  locks, which means the tdr thread is guaranteed to be able to get
  them. I think we could implement this at the dma_fence level,
  including proper lockdep annotations.

  dma_fence_begin_tdr():
  - must be nested within a dma_fence_begin/end_signalling section
  - will wake up all interruptible (but not the non-interruptible)
    dma_fence_wait() calls and force them to complete with a
    -ERESTARTSYS errno code. All new interruptible calls to
    dma_fence_wait() will immediately fail with the same error code.

  dma_fence_end_tdr():
  - this will convert dma_fence_wait() calls back to normal.

  Of course interrupting dma_fence_wait is only ok if the caller
  specified that, which means we need to split the annotations into
  interruptible and non-interruptible versions. If we then make sure
  that we only use interruptible dma_fence_wait() calls while holding
  drm_modeset_lock we can grab them in tdr code, and allow display
  resets. Doing the same for dma_resv_lock might be a lot harder, so
  buffer updates must be avoided.

  What's worse, we're not going to be able to make the dma_fence_wait
  calls in mmu-notifiers interruptible, that doesn't work. So
  allocating memory still won't be allowed, even in tdr sections.
  Plus obviously we can use this trick only in tdr, it is rather
  intrusive.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f69abc4e70d3..ae0d5ceca49a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -281,9 +281,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	bool fence_cookie;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -315,6 +318,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+
+	dma_fence_end_signalling(fence_cookie);
 }
 
 /**
-- 
2.28.0
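
The dma_fence_begin_tdr()/dma_fence_end_tdr() semantics proposed in the
commit message are not implemented by this patch. As a rough illustration
only, they can be modelled in plain userspace C as below. Everything
prefixed model_/MODEL_ is hypothetical and made up for this sketch, and
the model is single-threaded: instead of really sleeping and being woken,
model_fence_wait() just reports what a waiter would observe at that
moment. It is an assumption-laden sketch of the described contract, not
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal (512 in include/linux/errno.h) and is not
 * exported to userspace headers, so the model carries its own copy. */
#define MODEL_ERESTARTSYS 512
/* Hypothetical return value: the waiter would simply keep sleeping. */
#define MODEL_WOULD_BLOCK 1

/* Hypothetical system-wide state; an assumption of this sketch. */
struct model_domain {
	bool in_signalling;	/* inside a begin/end_signalling section */
	bool tdr_active;	/* between begin_tdr() and end_tdr() */
};

struct model_fence {
	bool signalled;
};

/* Loosely mirrors the cookie convention of dma_fence_begin_signalling():
 * the cookie records whether we were already inside a section. */
static inline bool model_begin_signalling(struct model_domain *d)
{
	bool cookie = d->in_signalling;

	d->in_signalling = true;
	return cookie;
}

static inline void model_end_signalling(struct model_domain *d, bool cookie)
{
	d->in_signalling = cookie;
}

/* "must be nested within a dma_fence_begin/end_signalling section" */
static inline void model_begin_tdr(struct model_domain *d)
{
	assert(d->in_signalling);
	d->tdr_active = true;
}

/* "this will convert dma_fence_wait() calls back to normal" */
static inline void model_end_tdr(struct model_domain *d)
{
	d->tdr_active = false;
}

/* What a waiter would see: 0 once the fence has signalled, -ERESTARTSYS
 * for interruptible waits while tdr is active, otherwise keep waiting. */
static inline long model_fence_wait(const struct model_domain *d,
				    const struct model_fence *f, bool intr)
{
	if (f->signalled)
		return 0;
	if (intr && d->tdr_active)
		return -MODEL_ERESTARTSYS;
	return MODEL_WOULD_BLOCK;
}
```

In this model an interruptible wait on an unsignalled fence returns
-MODEL_ERESTARTSYS only while the tdr section is open, whereas a
non-interruptible wait keeps blocking throughout. That is why the commit
message insists that only interruptible waits may happen under
drm_modeset_lock.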