From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> To: <dri-devel@lists.freedesktop.org>, <amd-gfx@lists.freedesktop.org> Cc: jingwen.chen2@amd.com, Christian.Koenig@amd.com, monk.liu@amd.com, yiqing.yao@amd.com Subject: [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer' Date: Mon, 20 Jun 2022 18:03:01 -0400 [thread overview] Message-ID: <20220620220302.86389-5-andrey.grodzovsky@amd.com> (raw) In-Reply-To: <20220620220302.86389-1-andrey.grodzovsky@amd.com> Problem: This patch caused negative refcount as described in [1] because for that case parent fence did not signal by the time of drm_sched_stop and hence kept in pending list the assumption was they will not signal and so fence was put to account for the s_fence->parent refcount but for amdgpu which has embedded HW fence (always same parent fence) drm_sched_fence_release_scheduled was always called and would still drop the count for parent fence once more. For jobs that never signaled this imbalance was masked by refcount bug in amdgpu_fence_driver_clear_job_fences that would not drop refcount on the fences that were removed from fence drive fences array (against prevois insertion into the array in get in amdgpu_fence_emit). Fix: Revert this patch and by setting s_job->s_fence->parent to NULL as before prevent the extra refcount drop in amdgpu when drm_sched_fence_release_scheduled is called on job release. Also - align behaviour in drm_sched_resubmit_jobs_ext with that of drm_sched_main when submitting jobs - take a refcount for the new parent fence pointer and drop refcount for original kref_init for new HW fence creation (or fake new HW fence in amdgpu - see next patch). [1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3 Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Tested-by: Yiqing Yao <yiqing.yao@amd.com> --- drivers/gpu/drm/scheduler/sched_main.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index b81fceb0b8a2..b38394f5694f 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -419,6 +419,11 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) if (s_job->s_fence->parent && dma_fence_remove_callback(s_job->s_fence->parent, &s_job->cb)) { + /* Revert drm/sched: Keep s_fence->parent pointer, no + * need anymore for amdgpu and creates only troubles + */ + dma_fence_put(s_job->s_fence->parent); + s_job->s_fence->parent = NULL; atomic_dec(&sched->hw_rq_count); } else { /* @@ -548,7 +553,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) if (found_guilty && s_job->s_fence->scheduled.context == guilty_context) dma_fence_set_error(&s_fence->finished, -ECANCELED); - dma_fence_put(s_job->s_fence->parent); fence = sched->ops->run_job(s_job); i++; @@ -558,7 +562,11 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) s_job->s_fence->parent = NULL; } else { - s_job->s_fence->parent = fence; + + s_job->s_fence->parent = dma_fence_get(fence); + + /* Drop for orignal kref_init */ + dma_fence_put(fence); } } } @@ -952,6 +960,9 @@ static int drm_sched_main(void *param) if (!IS_ERR_OR_NULL(fence)) { s_fence->parent = dma_fence_get(fence); + /* Drop for original kref_init of the fence */ + dma_fence_put(fence); + r = dma_fence_add_callback(fence, &sched_job->cb, drm_sched_job_done_cb); if (r == -ENOENT) @@ -959,7 +970,6 @@ static int drm_sched_main(void *param) else if (r) DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r); - dma_fence_put(fence); } else { if (IS_ERR(fence)) dma_fence_set_error(&s_fence->finished, PTR_ERR(fence)); -- 2.25.1
WARNING: multiple messages have this Message-ID (diff)
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> To: <dri-devel@lists.freedesktop.org>, <amd-gfx@lists.freedesktop.org> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>, jingwen.chen2@amd.com, Christian.Koenig@amd.com, monk.liu@amd.com, yiqing.yao@amd.com Subject: [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer' Date: Mon, 20 Jun 2022 18:03:01 -0400 [thread overview] Message-ID: <20220620220302.86389-5-andrey.grodzovsky@amd.com> (raw) In-Reply-To: <20220620220302.86389-1-andrey.grodzovsky@amd.com> Problem: This patch caused negative refcount as described in [1] because for that case parent fence did not signal by the time of drm_sched_stop and hence kept in pending list the assumption was they will not signal and so fence was put to account for the s_fence->parent refcount but for amdgpu which has embedded HW fence (always same parent fence) drm_sched_fence_release_scheduled was always called and would still drop the count for parent fence once more. For jobs that never signaled this imbalance was masked by refcount bug in amdgpu_fence_driver_clear_job_fences that would not drop refcount on the fences that were removed from fence drive fences array (against prevois insertion into the array in get in amdgpu_fence_emit). Fix: Revert this patch and by setting s_job->s_fence->parent to NULL as before prevent the extra refcount drop in amdgpu when drm_sched_fence_release_scheduled is called on job release. Also - align behaviour in drm_sched_resubmit_jobs_ext with that of drm_sched_main when submitting jobs - take a refcount for the new parent fence pointer and drop refcount for original kref_init for new HW fence creation (or fake new HW fence in amdgpu - see next patch). [1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3 Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Tested-by: Yiqing Yao <yiqing.yao@amd.com> --- drivers/gpu/drm/scheduler/sched_main.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index b81fceb0b8a2..b38394f5694f 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -419,6 +419,11 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) if (s_job->s_fence->parent && dma_fence_remove_callback(s_job->s_fence->parent, &s_job->cb)) { + /* Revert drm/sched: Keep s_fence->parent pointer, no + * need anymore for amdgpu and creates only troubles + */ + dma_fence_put(s_job->s_fence->parent); + s_job->s_fence->parent = NULL; atomic_dec(&sched->hw_rq_count); } else { /* @@ -548,7 +553,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) if (found_guilty && s_job->s_fence->scheduled.context == guilty_context) dma_fence_set_error(&s_fence->finished, -ECANCELED); - dma_fence_put(s_job->s_fence->parent); fence = sched->ops->run_job(s_job); i++; @@ -558,7 +562,11 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) s_job->s_fence->parent = NULL; } else { - s_job->s_fence->parent = fence; + + s_job->s_fence->parent = dma_fence_get(fence); + + /* Drop for orignal kref_init */ + dma_fence_put(fence); } } } @@ -952,6 +960,9 @@ static int drm_sched_main(void *param) if (!IS_ERR_OR_NULL(fence)) { s_fence->parent = dma_fence_get(fence); + /* Drop for original kref_init of the fence */ + dma_fence_put(fence); + r = dma_fence_add_callback(fence, &sched_job->cb, drm_sched_job_done_cb); if (r == -ENOENT) @@ -959,7 +970,6 @@ static int drm_sched_main(void *param) else if (r) DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r); - dma_fence_put(fence); } else { if (IS_ERR(fence)) dma_fence_set_error(&s_fence->finished, PTR_ERR(fence)); -- 2.25.1
next prev parent reply other threads:[~2022-06-20 22:03 UTC|newest] Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-06-20 22:02 [PATCH 0/5] Rework amdgpu HW fence refocunt and update scheduler parent fence refcount Andrey Grodzovsky 2022-06-20 22:02 ` Andrey Grodzovsky 2022-06-20 22:02 ` [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence Andrey Grodzovsky 2022-06-20 22:02 ` Andrey Grodzovsky 2022-06-21 7:19 ` Christian König 2022-06-21 19:34 ` Andrey Grodzovsky 2022-06-22 9:00 ` Christian König 2022-06-22 15:01 ` Andrey Grodzovsky 2022-06-22 15:04 ` Christian König 2022-06-23 21:18 ` Andrey Grodzovsky 2022-06-24 6:00 ` Christian König 2022-06-20 22:02 ` [PATCH 2/5] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences Andrey Grodzovsky 2022-06-20 22:02 ` Andrey Grodzovsky 2022-06-21 7:21 ` Christian König 2022-06-20 22:03 ` [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset Andrey Grodzovsky 2022-06-20 22:03 ` Andrey Grodzovsky 2022-06-21 7:25 ` Christian König 2022-06-21 19:45 ` Andrey Grodzovsky 2022-06-22 1:47 ` VURDIGERENATARAJ, CHANDAN 2022-06-22 2:41 ` Andrey Grodzovsky 2022-06-22 17:31 ` Andrey Grodzovsky 2022-06-23 5:57 ` Christian König 2022-06-20 22:03 ` Andrey Grodzovsky [this message] 2022-06-20 22:03 ` [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer' Andrey Grodzovsky 2022-06-21 7:26 ` Christian König 2022-06-20 22:03 ` [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change Andrey Grodzovsky 2022-06-20 22:03 ` Andrey Grodzovsky 2022-06-21 7:28 ` Christian König 2022-06-21 20:00 ` Andrey Grodzovsky 2022-06-22 7:17 ` Christian König 2022-06-22 17:19 ` Andrey Grodzovsky 2022-06-23 5:52 ` Christian König 2022-06-23 14:51 ` Andrey Grodzovsky 2022-06-23 14:54 ` Christian König
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20220620220302.86389-5-andrey.grodzovsky@amd.com \ --to=andrey.grodzovsky@amd.com \ --cc=Christian.Koenig@amd.com \ --cc=amd-gfx@lists.freedesktop.org \ --cc=dri-devel@lists.freedesktop.org \ --cc=jingwen.chen2@amd.com \ --cc=monk.liu@amd.com \ --cc=yiqing.yao@amd.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.