* Fixes for scheduler hang when killing a process
@ 2022-10-14  8:46 Christian König
  2022-10-14  8:46 ` [PATCH 01/13] drm/scheduler: fix fence ref counting Christian König
                   ` (14 more replies)
  0 siblings, 15 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx

Hi guys,

I've rebased these patches on top of amd-staging-drm-next since the
amdgpu changes are quite substantial.

Please review and comment,
Christian.




* [PATCH 01/13] drm/scheduler: fix fence ref counting
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-25  3:23   ` Luna Nova
  2022-10-14  8:46 ` [PATCH 02/13] drm/scheduler: add drm_sched_job_add_resv_dependencies Christian König
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

We leaked dependency fences when processes were being killed.

In addition to that, grab a reference to the last scheduled fence.
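
The second hunk follows the usual dma_fence callback convention: hold a
reference for as long as the callback may still run and drop it from the
callback itself. A generic sketch of that pattern (illustrative names
only, not the actual scheduler code):

    /* Callback side: drop the reference taken before registering it. */
    static void example_cb(struct dma_fence *f, struct dma_fence_cb *cb)
    {
            dma_fence_put(f);
    }

    static void example_register(struct dma_fence *fence,
                                 struct dma_fence_cb *cb)
    {
            dma_fence_get(fence);
            if (dma_fence_add_callback(fence, cb, example_cb) == -ENOENT)
                    /* Already signaled, callback was not installed. */
                    example_cb(fence, cb);
    }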

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 43d337d8b153..243ff384cde8 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -207,6 +207,7 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 						 finish_cb);
 
+	dma_fence_put(f);
 	init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
 	irq_work_queue(&job->work);
 }
@@ -234,8 +235,10 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
 		struct drm_sched_fence *s_fence = job->s_fence;
 
 		/* Wait for all dependencies to avoid data corruptions */
-		while ((f = drm_sched_job_dependency(job, entity)))
+		while ((f = drm_sched_job_dependency(job, entity))) {
 			dma_fence_wait(f, false);
+			dma_fence_put(f);
+		}
 
 		drm_sched_fence_scheduled(s_fence);
 		dma_fence_set_error(&s_fence->finished, -ESRCH);
@@ -250,6 +253,7 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
 			continue;
 		}
 
+		dma_fence_get(entity->last_scheduled);
 		r = dma_fence_add_callback(entity->last_scheduled,
 					   &job->finish_cb,
 					   drm_sched_entity_kill_jobs_cb);
-- 
2.25.1



* [PATCH 02/13] drm/scheduler: add drm_sched_job_add_resv_dependencies
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
  2022-10-14  8:46 ` [PATCH 01/13] drm/scheduler: fix fence ref counting Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-14  8:46 ` [PATCH 03/13] drm/amdgpu: use drm_sched_job_add_resv_dependencies for moves Christian König
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Add a new function to add job dependencies from a reservation object
(dma_resv), filtered by dma_resv usage.
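
For example, a driver that needs to wait for all bookkeeping fences of a
buffer before running a job could use it roughly like this (sketch only;
"obj" and "job" are placeholders, the job is assumed to embed a
struct drm_sched_job as ->base, locking/error handling simplified):

    dma_resv_lock(obj->resv, NULL);
    r = drm_sched_job_add_resv_dependencies(&job->base, obj->resv,
                                            DMA_RESV_USAGE_BOOKKEEP);
    dma_resv_unlock(obj->resv);
    if (r)
            return r;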

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 49 ++++++++++++++++++--------
 include/drm/gpu_scheduler.h            |  5 +++
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e0ab14e0fb6b..6e2cd0f906b2 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -685,32 +685,28 @@ int drm_sched_job_add_dependency(struct drm_sched_job *job,
 EXPORT_SYMBOL(drm_sched_job_add_dependency);
 
 /**
- * drm_sched_job_add_implicit_dependencies - adds implicit dependencies as job
- *   dependencies
+ * drm_sched_job_add_resv_dependencies - add all fences from the resv to the job
  * @job: scheduler job to add the dependencies to
- * @obj: the gem object to add new dependencies from.
- * @write: whether the job might write the object (so we need to depend on
- * shared fences in the reservation object).
+ * @resv: the dma_resv object to get the fences from
+ * @usage: the dma_resv_usage to use to filter the fences
  *
- * This should be called after drm_gem_lock_reservations() on your array of
- * GEM objects used in the job but before updating the reservations with your
- * own fences.
+ * This adds all fences matching the given usage from @resv to @job.
+ * Must be called with the @resv lock held.
  *
  * Returns:
  * 0 on success, or an error on failing to expand the array.
  */
-int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
-					    struct drm_gem_object *obj,
-					    bool write)
+int drm_sched_job_add_resv_dependencies(struct drm_sched_job *job,
+					struct dma_resv *resv,
+					enum dma_resv_usage usage)
 {
 	struct dma_resv_iter cursor;
 	struct dma_fence *fence;
 	int ret;
 
-	dma_resv_assert_held(obj->resv);
+	dma_resv_assert_held(resv);
 
-	dma_resv_for_each_fence(&cursor, obj->resv, dma_resv_usage_rw(write),
-				fence) {
+	dma_resv_for_each_fence(&cursor, resv, usage, fence) {
 		/* Make sure to grab an additional ref on the added fence */
 		dma_fence_get(fence);
 		ret = drm_sched_job_add_dependency(job, fence);
@@ -721,8 +717,31 @@ int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
 	}
 	return 0;
 }
-EXPORT_SYMBOL(drm_sched_job_add_implicit_dependencies);
+EXPORT_SYMBOL(drm_sched_job_add_resv_dependencies);
 
+/**
+ * drm_sched_job_add_implicit_dependencies - adds implicit dependencies as job
+ *   dependencies
+ * @job: scheduler job to add the dependencies to
+ * @obj: the gem object to add new dependencies from.
+ * @write: whether the job might write the object (so we need to depend on
+ * shared fences in the reservation object).
+ *
+ * This should be called after drm_gem_lock_reservations() on your array of
+ * GEM objects used in the job but before updating the reservations with your
+ * own fences.
+ *
+ * Returns:
+ * 0 on success, or an error on failing to expand the array.
+ */
+int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
+					    struct drm_gem_object *obj,
+					    bool write)
+{
+	return drm_sched_job_add_resv_dependencies(job, obj->resv,
+						   dma_resv_usage_rw(write));
+}
+EXPORT_SYMBOL(drm_sched_job_add_implicit_dependencies);
 
 /**
  * drm_sched_job_cleanup - clean up scheduler job resources
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f01d14b231ed..2a6e261ea01e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -41,6 +41,8 @@
  */
 #define DRM_SCHED_FENCE_DONT_PIPELINE	DMA_FENCE_FLAG_USER_BITS
 
+enum dma_resv_usage;
+struct dma_resv;
 struct drm_gem_object;
 
 struct drm_gpu_scheduler;
@@ -483,6 +485,9 @@ int drm_sched_job_init(struct drm_sched_job *job,
 void drm_sched_job_arm(struct drm_sched_job *job);
 int drm_sched_job_add_dependency(struct drm_sched_job *job,
 				 struct dma_fence *fence);
+int drm_sched_job_add_resv_dependencies(struct drm_sched_job *job,
+					struct dma_resv *resv,
+					enum dma_resv_usage usage);
 int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
 					    struct drm_gem_object *obj,
 					    bool write);
-- 
2.25.1



* [PATCH 03/13] drm/amdgpu: use drm_sched_job_add_resv_dependencies for moves
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
  2022-10-14  8:46 ` [PATCH 01/13] drm/scheduler: fix fence ref counting Christian König
  2022-10-14  8:46 ` [PATCH 02/13] drm/scheduler: add drm_sched_job_add_resv_dependencies Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-14  8:46 ` [PATCH 04/13] drm/amdgpu: drop the fence argument from amdgpu_vmid_grab Christian König
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Use the new common scheduler functions to figure out what to wait for.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 794062ab57fc..f898e870d157 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1984,17 +1984,11 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev,
 							adev->gart.bo);
 		(*job)->vm_needs_flush = true;
 	}
-	if (resv) {
-		r = amdgpu_sync_resv(adev, &(*job)->sync, resv,
-				     AMDGPU_SYNC_ALWAYS,
-				     AMDGPU_FENCE_OWNER_UNDEFINED);
-		if (r) {
-			DRM_ERROR("sync failed (%d).\n", r);
-			amdgpu_job_free(*job);
-			return r;
-		}
-	}
-	return 0;
+	if (!resv)
+		return 0;
+
+	return drm_sched_job_add_resv_dependencies(&(*job)->base, resv,
+						   DMA_RESV_USAGE_BOOKKEEP);
 }
 
 int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
-- 
2.25.1



* [PATCH 04/13] drm/amdgpu: drop the fence argument from amdgpu_vmid_grab
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (2 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 03/13] drm/amdgpu: use drm_sched_job_add_resv_dependencies for moves Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-14  8:46 ` [PATCH 05/13] drm/amdgpu: drop amdgpu_sync " Christian König
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

The fence passed in here is always the scheduler fence of the job
anyway, so just pass the job.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 20 ++++++++------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h |  3 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  4 +---
 3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index 03d115d2b5ed..b76294d4275b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -244,7 +244,6 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
  * @sync: sync object where we add dependencies
- * @fence: fence protecting ID from reuse
  * @job: job who wants to use the VMID
  * @id: resulting VMID
  *
@@ -253,7 +252,6 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
 static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
 				     struct amdgpu_ring *ring,
 				     struct amdgpu_sync *sync,
-				     struct dma_fence *fence,
 				     struct amdgpu_job *job,
 				     struct amdgpu_vmid **id)
 {
@@ -290,7 +288,7 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
 	/* Good we can use this VMID. Remember this submission as
 	* user of the VMID.
 	*/
-	r = amdgpu_sync_fence(&(*id)->active, fence);
+	r = amdgpu_sync_fence(&(*id)->active, &job->base.s_fence->finished);
 	if (r)
 		return r;
 
@@ -305,7 +303,6 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
  * @sync: sync object where we add dependencies
- * @fence: fence protecting ID from reuse
  * @job: job who wants to use the VMID
  * @id: resulting VMID
  *
@@ -314,7 +311,6 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
 static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
 				 struct amdgpu_ring *ring,
 				 struct amdgpu_sync *sync,
-				 struct dma_fence *fence,
 				 struct amdgpu_job *job,
 				 struct amdgpu_vmid **id)
 {
@@ -352,7 +348,8 @@ static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
 		/* Good, we can use this VMID. Remember this submission as
 		 * user of the VMID.
 		 */
-		r = amdgpu_sync_fence(&(*id)->active, fence);
+		r = amdgpu_sync_fence(&(*id)->active,
+				      &job->base.s_fence->finished);
 		if (r)
 			return r;
 
@@ -371,14 +368,12 @@ static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
  * @sync: sync object where we add dependencies
- * @fence: fence protecting ID from reuse
  * @job: job who wants to use the VMID
  *
  * Allocate an id for the vm, adding fences to the sync obj as necessary.
  */
 int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
-		     struct amdgpu_sync *sync, struct dma_fence *fence,
-		     struct amdgpu_job *job)
+		     struct amdgpu_sync *sync, struct amdgpu_job *job)
 {
 	struct amdgpu_device *adev = ring->adev;
 	unsigned vmhub = ring->funcs->vmhub;
@@ -393,11 +388,11 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
 		goto error;
 
 	if (vm->reserved_vmid[vmhub]) {
-		r = amdgpu_vmid_grab_reserved(vm, ring, sync, fence, job, &id);
+		r = amdgpu_vmid_grab_reserved(vm, ring, sync, job, &id);
 		if (r || !id)
 			goto error;
 	} else {
-		r = amdgpu_vmid_grab_used(vm, ring, sync, fence, job, &id);
+		r = amdgpu_vmid_grab_used(vm, ring, sync, job, &id);
 		if (r)
 			goto error;
 
@@ -406,7 +401,8 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
 			id = idle;
 
 			/* Remember this submission as user of the VMID */
-			r = amdgpu_sync_fence(&id->active, fence);
+			r = amdgpu_sync_fence(&id->active,
+					      &job->base.s_fence->finished);
 			if (r)
 				goto error;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
index 06c8a0034fa5..1b1e7d04655c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
@@ -84,8 +84,7 @@ void amdgpu_vmid_free_reserved(struct amdgpu_device *adev,
 			       struct amdgpu_vm *vm,
 			       unsigned vmhub);
 int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
-		     struct amdgpu_sync *sync, struct dma_fence *fence,
-		     struct amdgpu_job *job);
+		     struct amdgpu_sync *sync, struct amdgpu_job *job);
 void amdgpu_vmid_reset(struct amdgpu_device *adev, unsigned vmhub,
 		       unsigned vmid);
 void amdgpu_vmid_reset_all(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index cd968e781077..13b752687b30 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -255,9 +255,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 	}
 
 	while (fence == NULL && vm && !job->vmid) {
-		r = amdgpu_vmid_grab(vm, ring, &job->sync,
-				     &job->base.s_fence->finished,
-				     job);
+		r = amdgpu_vmid_grab(vm, ring, &job->sync, job);
 		if (r)
 			DRM_ERROR("Error getting VM ID (%d)\n", r);
 
-- 
2.25.1



* [PATCH 05/13] drm/amdgpu: drop amdgpu_sync from amdgpu_vmid_grab
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (3 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 04/13] drm/amdgpu: drop the fence argument from amdgpu_vmid_grab Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-23  1:25   ` Luben Tuikov
  2022-10-14  8:46 ` [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization Christian König
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Instead, return the fence to wait for directly. This avoids a memory
allocation to store the fence.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 42 +++++++++++++------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++----
 3 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index b76294d4275b..2a9a2593dc18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -170,26 +170,27 @@ bool amdgpu_vmid_had_gpu_reset(struct amdgpu_device *adev,
  *
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
- * @sync: sync object where we add dependencies
  * @idle: resulting idle VMID
+ * @fence: fence to wait for if no id could be grabbed
  *
  * Try to find an idle VMID, if none is idle add a fence to wait to the sync
  * object. Returns -ENOMEM when we are out of memory.
  */
 static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
 				 struct amdgpu_ring *ring,
-				 struct amdgpu_sync *sync,
-				 struct amdgpu_vmid **idle)
+				 struct amdgpu_vmid **idle,
+				 struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
 	unsigned vmhub = ring->funcs->vmhub;
 	struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub];
 	struct dma_fence **fences;
 	unsigned i;
-	int r;
 
-	if (!dma_fence_is_signaled(ring->vmid_wait))
-		return amdgpu_sync_fence(sync, ring->vmid_wait);
+	if (!dma_fence_is_signaled(ring->vmid_wait)) {
+		*fence = dma_fence_get(ring->vmid_wait);
+		return 0;
+	}
 
 	fences = kmalloc_array(id_mgr->num_ids, sizeof(void *), GFP_KERNEL);
 	if (!fences)
@@ -228,10 +229,10 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
 			return -ENOMEM;
 		}
 
-		r = amdgpu_sync_fence(sync, &array->base);
+		*fence = dma_fence_get(&array->base);
 		dma_fence_put(ring->vmid_wait);
 		ring->vmid_wait = &array->base;
-		return r;
+		return 0;
 	}
 	kfree(fences);
 
@@ -243,17 +244,17 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
  *
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
- * @sync: sync object where we add dependencies
  * @job: job who wants to use the VMID
  * @id: resulting VMID
+ * @fence: fence to wait for if no id could be grabbed
  *
  * Try to assign a reserved VMID.
  */
 static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
 				     struct amdgpu_ring *ring,
-				     struct amdgpu_sync *sync,
 				     struct amdgpu_job *job,
-				     struct amdgpu_vmid **id)
+				     struct amdgpu_vmid **id,
+				     struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
 	unsigned vmhub = ring->funcs->vmhub;
@@ -280,7 +281,8 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
 		tmp = amdgpu_sync_peek_fence(&(*id)->active, ring);
 		if (tmp) {
 			*id = NULL;
-			return amdgpu_sync_fence(sync, tmp);
+			*fence = dma_fence_get(tmp);
+			return 0;
 		}
 		needs_flush = true;
 	}
@@ -302,17 +304,17 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
  *
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
- * @sync: sync object where we add dependencies
  * @job: job who wants to use the VMID
  * @id: resulting VMID
+ * @fence: fence to wait for if no id could be grabbed
  *
  * Try to reuse a VMID for this submission.
  */
 static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
 				 struct amdgpu_ring *ring,
-				 struct amdgpu_sync *sync,
 				 struct amdgpu_job *job,
-				 struct amdgpu_vmid **id)
+				 struct amdgpu_vmid **id,
+				 struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
 	unsigned vmhub = ring->funcs->vmhub;
@@ -367,13 +369,13 @@ static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
  *
  * @vm: vm to allocate id for
  * @ring: ring we want to submit job to
- * @sync: sync object where we add dependencies
  * @job: job who wants to use the VMID
+ * @fence: fence to wait for if no id could be grabbed
  *
  * Allocate an id for the vm, adding fences to the sync obj as necessary.
  */
 int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
-		     struct amdgpu_sync *sync, struct amdgpu_job *job)
+		     struct amdgpu_job *job, struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
 	unsigned vmhub = ring->funcs->vmhub;
@@ -383,16 +385,16 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
 	int r = 0;
 
 	mutex_lock(&id_mgr->lock);
-	r = amdgpu_vmid_grab_idle(vm, ring, sync, &idle);
+	r = amdgpu_vmid_grab_idle(vm, ring, &idle, fence);
 	if (r || !idle)
 		goto error;
 
 	if (vm->reserved_vmid[vmhub]) {
-		r = amdgpu_vmid_grab_reserved(vm, ring, sync, job, &id);
+		r = amdgpu_vmid_grab_reserved(vm, ring, job, &id, fence);
 		if (r || !id)
 			goto error;
 	} else {
-		r = amdgpu_vmid_grab_used(vm, ring, sync, job, &id);
+		r = amdgpu_vmid_grab_used(vm, ring, job, &id, fence);
 		if (r)
 			goto error;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
index 1b1e7d04655c..57efe61dceed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
@@ -84,7 +84,7 @@ void amdgpu_vmid_free_reserved(struct amdgpu_device *adev,
 			       struct amdgpu_vm *vm,
 			       unsigned vmhub);
 int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
-		     struct amdgpu_sync *sync, struct amdgpu_job *job);
+		     struct amdgpu_job *job, struct dma_fence **fence);
 void amdgpu_vmid_reset(struct amdgpu_device *adev, unsigned vmhub,
 		       unsigned vmid);
 void amdgpu_vmid_reset_all(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 13b752687b30..e187dc0ab898 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -238,12 +238,12 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
 	return 0;
 }
 
-static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
-					       struct drm_sched_entity *s_entity)
+static struct dma_fence *
+amdgpu_job_dependency(struct drm_sched_job *sched_job,
+		      struct drm_sched_entity *s_entity)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
 	struct amdgpu_job *job = to_amdgpu_job(sched_job);
-	struct amdgpu_vm *vm = job->vm;
 	struct dma_fence *fence;
 	int r;
 
@@ -254,12 +254,10 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 			DRM_ERROR("Error adding fence (%d)\n", r);
 	}
 
-	while (fence == NULL && vm && !job->vmid) {
-		r = amdgpu_vmid_grab(vm, ring, &job->sync, job);
+	while (fence == NULL && job->vm && !job->vmid) {
+		r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
 		if (r)
 			DRM_ERROR("Error getting VM ID (%d)\n", r);
-
-		fence = amdgpu_sync_get_fence(&job->sync);
 	}
 
 	if (!fence && job->gang_submit)
-- 
2.25.1



* [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (4 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 05/13] drm/amdgpu: drop amdgpu_sync " Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-23  1:50   ` Luben Tuikov
  2022-10-14  8:46 ` [PATCH 07/13] drm/amdgpu: move explicit sync check into the CS Christian König
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Init the DRM scheduler base class while allocating the job.

This makes the whole handling much cleaner.
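
With this a typical kernel-internal submission boils down to roughly the
following (sketch based on the call sites converted below, error
handling and IB setup trimmed):

    r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
                                 AMDGPU_FENCE_OWNER_UNDEFINED,
                                 num_dw * 4, AMDGPU_IB_POOL_DELAYED, &job);
    if (r)
            return r;

    /* ... fill job->ibs[0] ... */

    fence = amdgpu_job_submit(job); /* arms and pushes, returns finished fence */
    dma_fence_put(fence);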

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  8 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     | 44 ++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     | 14 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c    |  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     | 56 +++++++------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c     |  9 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c     | 13 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c     | 22 +++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 61 +++++++++++----------
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c      | 12 ++--
 drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c       |  8 +--
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c       | 14 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c    | 17 ++----
 14 files changed, 137 insertions(+), 150 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 0561812aa0a4..046d466b4ee4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -673,7 +673,7 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
 		goto err;
 	}
 
-	ret = amdgpu_job_alloc(adev, 1, &job, NULL);
+	ret = amdgpu_job_alloc(adev, NULL, NULL, NULL, 1, &job);
 	if (ret)
 		goto err;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 1bbd39b3b0fc..aa6f6c428dbc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -291,12 +291,8 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
 		return -EINVAL;
 
 	for (i = 0; i < p->gang_size; ++i) {
-		ret = amdgpu_job_alloc(p->adev, num_ibs[i], &p->jobs[i], vm);
-		if (ret)
-			goto free_all_kdata;
-
-		ret = drm_sched_job_init(&p->jobs[i]->base, p->entities[i],
-					 &fpriv->vm);
+		ret = amdgpu_job_alloc(p->adev, vm, p->entities[i], vm,
+				       num_ibs[i], &p->jobs[i]);
 		if (ret)
 			goto free_all_kdata;
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e187dc0ab898..5c69461ab3e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -88,8 +88,9 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
-int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-		     struct amdgpu_job **job, struct amdgpu_vm *vm)
+int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
+		     struct drm_sched_entity *entity, void *owner,
+		     unsigned num_ibs, struct amdgpu_job **job)
 {
 	if (num_ibs == 0)
 		return -EINVAL;
@@ -110,23 +111,30 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
 	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
 	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
 
-	return 0;
+	if (!entity)
+		return 0;
+
+	return drm_sched_job_init(&(*job)->base, entity, owner);
 }
 
-int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
-		enum amdgpu_ib_pool_type pool_type,
-		struct amdgpu_job **job)
+int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
+			     struct drm_sched_entity *entity, void *owner,
+			     unsigned size, enum amdgpu_ib_pool_type pool_type,
+			     struct amdgpu_job **job)
 {
 	int r;
 
-	r = amdgpu_job_alloc(adev, 1, job, NULL);
+	r = amdgpu_job_alloc(adev, NULL, entity, owner, 1, job);
 	if (r)
 		return r;
 
 	(*job)->num_ibs = 1;
 	r = amdgpu_ib_get(adev, NULL, size, pool_type, &(*job)->ibs[0]);
-	if (r)
+	if (r) {
+		if (entity)
+			drm_sched_job_cleanup(&(*job)->base);
 		kfree(*job);
+	}
 
 	return r;
 }
@@ -190,6 +198,9 @@ void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
 
 void amdgpu_job_free(struct amdgpu_job *job)
 {
+	if (job->base.entity)
+		drm_sched_job_cleanup(&job->base);
+
 	amdgpu_job_free_resources(job);
 	amdgpu_sync_free(&job->sync);
 	amdgpu_sync_free(&job->sched_sync);
@@ -202,25 +213,16 @@ void amdgpu_job_free(struct amdgpu_job *job)
 		dma_fence_put(&job->hw_fence);
 }
 
-int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
-		      void *owner, struct dma_fence **f)
+struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
 {
-	int r;
-
-	if (!f)
-		return -EINVAL;
-
-	r = drm_sched_job_init(&job->base, entity, owner);
-	if (r)
-		return r;
+	struct dma_fence *f;
 
 	drm_sched_job_arm(&job->base);
-
-	*f = dma_fence_get(&job->base.s_fence->finished);
+	f = dma_fence_get(&job->base.s_fence->finished);
 	amdgpu_job_free_resources(job);
 	drm_sched_entity_push_job(&job->base);
 
-	return 0;
+	return f;
 }
 
 int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index ab7b150e5d50..f099210c386a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -78,18 +78,20 @@ static inline struct amdgpu_ring *amdgpu_job_ring(struct amdgpu_job *job)
 	return to_amdgpu_ring(job->base.entity->rq->sched);
 }
 
-int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-		     struct amdgpu_job **job, struct amdgpu_vm *vm);
-int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
-		enum amdgpu_ib_pool_type pool, struct amdgpu_job **job);
+int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
+		     struct drm_sched_entity *entity, void *owner,
+		     unsigned num_ibs, struct amdgpu_job **job);
+int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
+			     struct drm_sched_entity *entity, void *owner,
+			     unsigned size, enum amdgpu_ib_pool_type pool_type,
+			     struct amdgpu_job **job);
 void amdgpu_job_set_resources(struct amdgpu_job *job, struct amdgpu_bo *gds,
 			      struct amdgpu_bo *gws, struct amdgpu_bo *oa);
 void amdgpu_job_free_resources(struct amdgpu_job *job);
 void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
 				struct amdgpu_job *leader);
 void amdgpu_job_free(struct amdgpu_job *job);
-int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
-		      void *owner, struct dma_fence **f);
+struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job);
 int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
 			     struct dma_fence **fence);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
index 518eb0e40d32..de182bfcf85f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
@@ -150,14 +150,15 @@ static int amdgpu_jpeg_dec_set_reg(struct amdgpu_ring *ring, uint32_t handle,
 	const unsigned ib_size_dw = 16;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
+				     AMDGPU_IB_POOL_DIRECT, &job);
 	if (r)
 		return r;
 
 	ib = &job->ibs[0];
 
-	ib->ptr[0] = PACKETJ(adev->jpeg.internal.jpeg_pitch, 0, 0, PACKETJ_TYPE0);
+	ib->ptr[0] = PACKETJ(adev->jpeg.internal.jpeg_pitch, 0, 0,
+			     PACKETJ_TYPE0);
 	ib->ptr[1] = 0xDEADBEEF;
 	for (i = 2; i < 16; i += 2) {
 		ib->ptr[i] = PACKETJ(0, 0, 0, PACKETJ_TYPE6);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index f898e870d157..30edb05e0d25 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -189,7 +189,6 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
 	struct amdgpu_device *adev = ring->adev;
 	unsigned offset, num_pages, num_dw, num_bytes;
 	uint64_t src_addr, dst_addr;
-	struct dma_fence *fence;
 	struct amdgpu_job *job;
 	void *cpu_addr;
 	uint64_t flags;
@@ -229,7 +228,9 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
 	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
 	num_bytes = num_pages * 8 * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
 
-	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4 + num_bytes,
+	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     num_dw * 4 + num_bytes,
 				     AMDGPU_IB_POOL_DELAYED, &job);
 	if (r)
 		return r;
@@ -269,18 +270,8 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
 		}
 	}
 
-	r = amdgpu_job_submit(job, &adev->mman.entity,
-			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
-	if (r)
-		goto error_free;
-
-	dma_fence_put(fence);
-
-	return r;
-
-error_free:
-	amdgpu_job_free(job);
-	return r;
+	dma_fence_put(amdgpu_job_submit(job));
+	return 0;
 }
 
 /**
@@ -1432,7 +1423,8 @@ static void amdgpu_ttm_vram_mm_access(struct amdgpu_device *adev, loff_t pos,
 }
 
 static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object *bo,
-					unsigned long offset, void *buf, int len, int write)
+					unsigned long offset, void *buf,
+					int len, int write)
 {
 	struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
@@ -1456,26 +1448,27 @@ static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object *bo,
 		memcpy(adev->mman.sdma_access_ptr, buf, len);
 
 	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
-	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4, AMDGPU_IB_POOL_DELAYED, &job);
+	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     num_dw * 4, AMDGPU_IB_POOL_DELAYED,
+				     &job);
 	if (r)
 		goto out;
 
 	amdgpu_res_first(abo->tbo.resource, offset, len, &src_mm);
-	src_addr = amdgpu_ttm_domain_start(adev, bo->resource->mem_type) + src_mm.start;
+	src_addr = amdgpu_ttm_domain_start(adev, bo->resource->mem_type) +
+		src_mm.start;
 	dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
 	if (write)
 		swap(src_addr, dst_addr);
 
-	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, PAGE_SIZE, false);
+	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr,
+				PAGE_SIZE, false);
 
 	amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);
 	WARN_ON(job->ibs[0].length_dw > num_dw);
 
-	r = amdgpu_job_submit(job, &adev->mman.entity, AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
-	if (r) {
-		amdgpu_job_free(job);
-		goto out;
-	}
+	fence = amdgpu_job_submit(job);
 
 	if (!dma_fence_wait_timeout(fence, false, adev->sdma_timeout))
 		r = -ETIMEDOUT;
@@ -1974,7 +1967,9 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev,
 		AMDGPU_IB_POOL_DELAYED;
 	int r;
 
-	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4, pool, job);
+	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     num_dw * 4, pool, job);
 	if (r)
 		return r;
 
@@ -2033,8 +2028,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
 	if (direct_submit)
 		r = amdgpu_job_submit_direct(job, ring, fence);
 	else
-		r = amdgpu_job_submit(job, &adev->mman.entity,
-				      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
+		*fence = amdgpu_job_submit(job);
 	if (r)
 		goto error_free;
 
@@ -2079,16 +2073,8 @@ static int amdgpu_ttm_fill_mem(struct amdgpu_ring *ring, uint32_t src_data,
 
 	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
 	WARN_ON(job->ibs[0].length_dw > num_dw);
-	r = amdgpu_job_submit(job, &adev->mman.entity,
-			      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
-	if (r)
-		goto error_free;
-
+	*fence = amdgpu_job_submit(job);
 	return 0;
-
-error_free:
-	amdgpu_job_free(job);
-	return r;
 }
 
 int amdgpu_fill_buffer(struct amdgpu_bo *bo,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 6eac649499d3..8baddf79635b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1132,7 +1132,9 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 	unsigned offset_idx = 0;
 	unsigned offset[3] = { UVD_BASE_SI, 0, 0 };
 
-	r = amdgpu_job_alloc_with_ib(adev, 64, direct ? AMDGPU_IB_POOL_DIRECT :
+	r = amdgpu_job_alloc_with_ib(ring->adev, &adev->uvd.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     64, direct ? AMDGPU_IB_POOL_DIRECT :
 				     AMDGPU_IB_POOL_DELAYED, &job);
 	if (r)
 		return r;
@@ -1181,10 +1183,7 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 		if (r)
 			goto err_free;
 
-		r = amdgpu_job_submit(job, &adev->uvd.entity,
-				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
-		if (r)
-			goto err_free;
+		f = amdgpu_job_submit(job);
 	}
 
 	amdgpu_bo_reserve(bo, true);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 02cb3a12dd76..b239e874f2d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -450,8 +450,10 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
 	uint64_t addr;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-				     AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, &ring->adev->vce.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
+				     &job);
 	if (r)
 		return r;
 
@@ -538,7 +540,9 @@ static int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
 	struct dma_fence *f = NULL;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
+	r = amdgpu_job_alloc_with_ib(ring->adev, &ring->adev->vce.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     ib_size_dw * 4,
 				     direct ? AMDGPU_IB_POOL_DIRECT :
 				     AMDGPU_IB_POOL_DELAYED, &job);
 	if (r)
@@ -570,8 +574,7 @@ static int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
 	if (direct)
 		r = amdgpu_job_submit_direct(job, ring, &f);
 	else
-		r = amdgpu_job_submit(job, &ring->adev->vce.entity,
-				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
+		f = amdgpu_job_submit(job);
 	if (r)
 		goto err;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 0b52af415b28..965b7755cb88 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -600,15 +600,16 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring,
 				   struct amdgpu_ib *ib_msg,
 				   struct dma_fence **fence)
 {
+	uint64_t addr = AMDGPU_GPU_PAGE_ALIGN(ib_msg->gpu_addr);
 	struct amdgpu_device *adev = ring->adev;
 	struct dma_fence *f = NULL;
 	struct amdgpu_job *job;
 	struct amdgpu_ib *ib;
-	uint64_t addr = AMDGPU_GPU_PAGE_ALIGN(ib_msg->gpu_addr);
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(adev, 64,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
+				     64, AMDGPU_IB_POOL_DIRECT,
+				     &job);
 	if (r)
 		goto err;
 
@@ -787,8 +788,9 @@ static int amdgpu_vcn_dec_sw_send_msg(struct amdgpu_ring *ring,
 	if (sq)
 		ib_size_dw += 8;
 
-	r = amdgpu_job_alloc_with_ib(adev, ib_size_dw * 4,
-				AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
+				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
+				     &job);
 	if (r)
 		goto err;
 
@@ -916,8 +918,9 @@ static int amdgpu_vcn_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
 	if (sq)
 		ib_size_dw += 8;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
+				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
+				     &job);
 	if (r)
 		return r;
 
@@ -982,8 +985,9 @@ static int amdgpu_vcn_enc_get_destroy_msg(struct amdgpu_ring *ring, uint32_t han
 	if (sq)
 		ib_size_dw += 8;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
+				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
+				     &job);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 69e105fa41f6..126364882d09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -47,6 +47,32 @@ static int amdgpu_vm_sdma_map_table(struct amdgpu_bo_vm *table)
 	return r;
 }
 
+/* Allocate a new job for @count PTE updates */
+static int amdgpu_vm_sdma_alloc_job(struct amdgpu_vm_update_params *p,
+				    unsigned int count)
+{
+	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
+		: AMDGPU_IB_POOL_DELAYED;
+	struct drm_sched_entity *entity = p->immediate ? &p->vm->immediate
+		: &p->vm->delayed;
+	unsigned int ndw;
+	int r;
+
+	/* estimate how many dw we need */
+	ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
+	if (p->pages_addr)
+		ndw += count * 2;
+	ndw = min(ndw, AMDGPU_VM_SDMA_MAX_NUM_DW);
+
+	r = amdgpu_job_alloc_with_ib(p->adev, entity, AMDGPU_FENCE_OWNER_VM,
+				     ndw * 4, pool, &p->job);
+	if (r)
+		return r;
+
+	p->num_dw_left = ndw;
+	return 0;
+}
+
 /**
  * amdgpu_vm_sdma_prepare - prepare SDMA command submission
  *
@@ -61,17 +87,12 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 				  struct dma_resv *resv,
 				  enum amdgpu_sync_mode sync_mode)
 {
-	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
-		: AMDGPU_IB_POOL_DELAYED;
-	unsigned int ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
 	int r;
 
-	r = amdgpu_job_alloc_with_ib(p->adev, ndw * 4, pool, &p->job);
+	r = amdgpu_vm_sdma_alloc_job(p, 0);
 	if (r)
 		return r;
 
-	p->num_dw_left = ndw;
-
 	if (!resv)
 		return 0;
 
@@ -91,20 +112,16 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 				 struct dma_fence **fence)
 {
 	struct amdgpu_ib *ib = p->job->ibs;
-	struct drm_sched_entity *entity;
 	struct amdgpu_ring *ring;
 	struct dma_fence *f;
-	int r;
 
-	entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
-	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
+	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
+			    sched);
 
 	WARN_ON(ib->length_dw == 0);
 	amdgpu_ring_pad_ib(ring, ib);
 	WARN_ON(ib->length_dw > p->num_dw_left);
-	r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
-	if (r)
-		goto error;
+	f = amdgpu_job_submit(p->job);
 
 	if (p->unlocked) {
 		struct dma_fence *tmp = dma_fence_get(f);
@@ -127,10 +144,6 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	}
 	dma_fence_put(f);
 	return 0;
-
-error:
-	amdgpu_job_free(p->job);
-	return r;
 }
 
 /**
@@ -210,8 +223,6 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
 				 uint64_t flags)
 {
 	struct amdgpu_bo *bo = &vmbo->bo;
-	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
-		: AMDGPU_IB_POOL_DELAYED;
 	struct dma_resv_iter cursor;
 	unsigned int i, ndw, nptes;
 	struct dma_fence *fence;
@@ -238,19 +249,9 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
 			if (r)
 				return r;
 
-			/* estimate how many dw we need */
-			ndw = 32;
-			if (p->pages_addr)
-				ndw += count * 2;
-			ndw = max(ndw, AMDGPU_VM_SDMA_MIN_NUM_DW);
-			ndw = min(ndw, AMDGPU_VM_SDMA_MAX_NUM_DW);
-
-			r = amdgpu_job_alloc_with_ib(p->adev, ndw * 4, pool,
-						     &p->job);
+			r = amdgpu_vm_sdma_alloc_job(p, count);
 			if (r)
 				return r;
-
-			p->num_dw_left = ndw;
 		}
 
 		if (!p->pages_addr) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index f513e2c2e964..657e53708248 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -371,7 +371,9 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 	 * translation. Avoid this by doing the invalidation from the SDMA
 	 * itself.
 	 */
-	r = amdgpu_job_alloc_with_ib(adev, 16 * 4, AMDGPU_IB_POOL_IMMEDIATE,
+	r = amdgpu_job_alloc_with_ib(ring->adev, &adev->mman.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     16 * 4, AMDGPU_IB_POOL_IMMEDIATE,
 				     &job);
 	if (r)
 		goto error_alloc;
@@ -380,10 +382,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 	job->vm_needs_flush = true;
 	job->ibs->ptr[job->ibs->length_dw++] = ring->funcs->nop;
 	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
-	r = amdgpu_job_submit(job, &adev->mman.entity,
-			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
-	if (r)
-		goto error_submit;
+	fence = amdgpu_job_submit(job);
 
 	mutex_unlock(&adev->mman.gtt_window_lock);
 
@@ -392,9 +391,6 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 
 	return;
 
-error_submit:
-	amdgpu_job_free(job);
-
 error_alloc:
 	mutex_unlock(&adev->mman.gtt_window_lock);
 	DRM_ERROR("Error flushing GPU TLB using the SDMA (%d)!\n", r);
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
index 375c440957dc..5fe872f4bea7 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
@@ -216,8 +216,8 @@ static int uvd_v6_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
 	uint64_t addr;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
+				     AMDGPU_IB_POOL_DIRECT, &job);
 	if (r)
 		return r;
 
@@ -280,8 +280,8 @@ static int uvd_v6_0_enc_get_destroy_msg(struct amdgpu_ring *ring,
 	uint64_t addr;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
+				     AMDGPU_IB_POOL_DIRECT, &job);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index e668b3baa8c6..1c0210de8455 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -213,7 +213,8 @@ static int uvd_v7_0_enc_ring_test_ring(struct amdgpu_ring *ring)
  *
  * Open up a stream for HW test
  */
-static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
+static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring,
+				       uint32_t handle,
 				       struct amdgpu_bo *bo,
 				       struct dma_fence **fence)
 {
@@ -224,8 +225,8 @@ static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
 	uint64_t addr;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
+				     AMDGPU_IB_POOL_DIRECT, &job);
 	if (r)
 		return r;
 
@@ -276,7 +277,8 @@ static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
  *
  * Close up a stream for HW test or if userspace failed to do so
  */
-static int uvd_v7_0_enc_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
+static int uvd_v7_0_enc_get_destroy_msg(struct amdgpu_ring *ring,
+					uint32_t handle,
 					struct amdgpu_bo *bo,
 					struct dma_fence **fence)
 {
@@ -287,8 +289,8 @@ static int uvd_v7_0_enc_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handl
 	uint64_t addr;
 	int i, r;
 
-	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
-					AMDGPU_IB_POOL_DIRECT, &job);
+	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
+				     AMDGPU_IB_POOL_DIRECT, &job);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index dd56d7678876..cd2909d0ad26 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -65,8 +65,11 @@ svm_migrate_gart_map(struct amdgpu_ring *ring, uint64_t npages,
 	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
 	num_bytes = npages * 8;
 
-	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4 + num_bytes,
-				     AMDGPU_IB_POOL_DELAYED, &job);
+	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
+				     AMDGPU_FENCE_OWNER_UNDEFINED,
+				     num_dw * 4 + num_bytes,
+				     AMDGPU_IB_POOL_DELAYED,
+				     &job);
 	if (r)
 		return r;
 
@@ -89,18 +92,10 @@ svm_migrate_gart_map(struct amdgpu_ring *ring, uint64_t npages,
 	cpu_addr = &job->ibs[0].ptr[num_dw];
 
 	amdgpu_gart_map(adev, 0, npages, addr, pte_flags, cpu_addr);
-	r = amdgpu_job_submit(job, &adev->mman.entity,
-			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
-	if (r)
-		goto error_free;
-
+	fence = amdgpu_job_submit(job);
 	dma_fence_put(fence);
 
 	return r;
-
-error_free:
-	amdgpu_job_free(job);
-	return r;
 }
 
 /**
-- 
2.25.1



* [PATCH 07/13] drm/amdgpu: move explicit sync check into the CS
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (5 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-14  8:46 ` [PATCH 08/13] drm/amdgpu: use scheduler dependencies for VM updates Christian König
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

This moves the memory allocation needed for tracking the explicit sync
fence out of the critical scheduler code path and into command
submission.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 13 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  2 +-
 4 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index aa6f6c428dbc..d45b86bcf7fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -449,8 +449,19 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
 	}
 
 	r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
-	dma_fence_put(fence);
+	if (r)
+		goto error;
+
+	/*
+	 * When we have an explicit dependency it might be necessary to insert a
+	 * pipeline sync to make sure that all caches etc are flushed and the
+	 * next job actually sees the results from the previous one.
+	 */
+	if (fence->context == p->gang_leader->base.entity->fence_context)
+		r = amdgpu_sync_fence(&p->gang_leader->explicit_sync, fence);
 
+error:
+	dma_fence_put(fence);
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 258cffe3c06a..774c77bb8f4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -182,7 +182,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
 
 	need_ctx_switch = ring->current_ctx != fence_ctx;
 	if (ring->funcs->emit_pipeline_sync && job &&
-	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync)) ||
+	    ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
 	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
 	     amdgpu_vm_need_pipeline_sync(ring, job))) {
 		need_pipe_sync = true;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 5c69461ab3e0..ba98d65835b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -107,7 +107,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	(*job)->vm = vm;
 
 	amdgpu_sync_create(&(*job)->sync);
-	amdgpu_sync_create(&(*job)->sched_sync);
+	amdgpu_sync_create(&(*job)->explicit_sync);
 	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
 	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
 
@@ -175,7 +175,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
 	drm_sched_job_cleanup(s_job);
 
 	amdgpu_sync_free(&job->sync);
-	amdgpu_sync_free(&job->sched_sync);
+	amdgpu_sync_free(&job->explicit_sync);
 
 	dma_fence_put(&job->hw_fence);
 }
@@ -203,7 +203,7 @@ void amdgpu_job_free(struct amdgpu_job *job)
 
 	amdgpu_job_free_resources(job);
 	amdgpu_sync_free(&job->sync);
-	amdgpu_sync_free(&job->sched_sync);
+	amdgpu_sync_free(&job->explicit_sync);
 	if (job->gang_submit != &job->base.s_fence->scheduled)
 		dma_fence_put(job->gang_submit);
 
@@ -250,12 +250,6 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
 	int r;
 
 	fence = amdgpu_sync_get_fence(&job->sync);
-	if (fence && drm_sched_dependency_optimized(fence, s_entity)) {
-		r = amdgpu_sync_fence(&job->sched_sync, fence);
-		if (r)
-			DRM_ERROR("Error adding fence (%d)\n", r);
-	}
-
 	while (fence == NULL && job->vm && !job->vmid) {
 		r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
 		if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index f099210c386a..9c10b9bd0084 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -48,7 +48,7 @@ struct amdgpu_job {
 	struct drm_sched_job    base;
 	struct amdgpu_vm	*vm;
 	struct amdgpu_sync	sync;
-	struct amdgpu_sync	sched_sync;
+	struct amdgpu_sync	explicit_sync;
 	struct dma_fence	hw_fence;
 	struct dma_fence	*gang_submit;
 	uint32_t		preamble_status;
-- 
2.25.1



* [PATCH 08/13] drm/amdgpu: use scheduler dependencies for VM updates
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (6 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 07/13] drm/amdgpu: move explicit sync check into the CS Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-24  5:50   ` Luben Tuikov
  2022-10-14  8:46 ` [PATCH 09/13] drm/amdgpu: use scheduler dependencies for UVD msgs Christian König
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Instead of putting them into the job sync object.
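
The resulting pattern in amdgpu_vm_sdma_prepare() condenses to roughly
the following (error handling trimmed); the temporary amdgpu_sync is now
only used to collect and de-duplicate the resv fences before they are
handed to the scheduler as dependencies:

    struct amdgpu_sync sync;

    amdgpu_sync_create(&sync);
    r = amdgpu_sync_resv(p->adev, &sync, resv, sync_mode, p->vm);
    if (!r)
            r = amdgpu_sync_push_to_job(&sync, p->job);
    amdgpu_sync_free(&sync);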

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c    | 56 +++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h    |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 10 +++-
 3 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index 090e66a1b284..bac7976975bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -259,6 +259,14 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
 	return 0;
 }
 
+/* Free the entry back to the slab */
+static void amdgpu_sync_entry_free(struct amdgpu_sync_entry *e)
+{
+	hash_del(&e->node);
+	dma_fence_put(e->fence);
+	kmem_cache_free(amdgpu_sync_slab, e);
+}
+
 /**
  * amdgpu_sync_peek_fence - get the next fence not signaled yet
  *
@@ -280,9 +288,7 @@ struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync,
 		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
 
 		if (dma_fence_is_signaled(f)) {
-			hash_del(&e->node);
-			dma_fence_put(f);
-			kmem_cache_free(amdgpu_sync_slab, e);
+			amdgpu_sync_entry_free(e);
 			continue;
 		}
 		if (ring && s_fence) {
@@ -355,15 +361,42 @@ int amdgpu_sync_clone(struct amdgpu_sync *source, struct amdgpu_sync *clone)
 			if (r)
 				return r;
 		} else {
-			hash_del(&e->node);
-			dma_fence_put(f);
-			kmem_cache_free(amdgpu_sync_slab, e);
+			amdgpu_sync_entry_free(e);
 		}
 	}
 
 	return 0;
 }
 
+/**
+ * amdgpu_sync_push_to_job - push fences into job
+ * @sync: sync object to get the fences from
+ * @job: job to push the fences into
+ *
+ * Add all unsignaled fences from sync to job.
+ */
+int amdgpu_sync_push_to_job(struct amdgpu_sync *sync, struct amdgpu_job *job)
+{
+	struct amdgpu_sync_entry *e;
+	struct hlist_node *tmp;
+	struct dma_fence *f;
+	int i, r;
+
+	hash_for_each_safe(sync->fences, i, tmp, e, node) {
+		f = e->fence;
+		if (dma_fence_is_signaled(f)) {
+			amdgpu_sync_entry_free(e);
+			continue;
+		}
+
+		dma_fence_get(f);
+		r = drm_sched_job_add_dependency(&job->base, f);
+		if (r)
+			return r;
+	}
+	return 0;
+}
+
 int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
 {
 	struct amdgpu_sync_entry *e;
@@ -375,9 +408,7 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
 		if (r)
 			return r;
 
-		hash_del(&e->node);
-		dma_fence_put(e->fence);
-		kmem_cache_free(amdgpu_sync_slab, e);
+		amdgpu_sync_entry_free(e);
 	}
 
 	return 0;
@@ -396,11 +427,8 @@ void amdgpu_sync_free(struct amdgpu_sync *sync)
 	struct hlist_node *tmp;
 	unsigned int i;
 
-	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		hash_del(&e->node);
-		dma_fence_put(e->fence);
-		kmem_cache_free(amdgpu_sync_slab, e);
-	}
+	hash_for_each_safe(sync->fences, i, tmp, e, node)
+		amdgpu_sync_entry_free(e);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
index 2d5c613cda10..cf1e9e858efd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
@@ -30,6 +30,7 @@ struct dma_fence;
 struct dma_resv;
 struct amdgpu_device;
 struct amdgpu_ring;
+struct amdgpu_job;
 
 enum amdgpu_sync_mode {
 	AMDGPU_SYNC_ALWAYS,
@@ -54,6 +55,7 @@ struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync,
 				     struct amdgpu_ring *ring);
 struct dma_fence *amdgpu_sync_get_fence(struct amdgpu_sync *sync);
 int amdgpu_sync_clone(struct amdgpu_sync *source, struct amdgpu_sync *clone);
+int amdgpu_sync_push_to_job(struct amdgpu_sync *sync, struct amdgpu_job *job);
 int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr);
 void amdgpu_sync_free(struct amdgpu_sync *sync);
 int amdgpu_sync_init(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 126364882d09..59cf64216fbb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -87,6 +87,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 				  struct dma_resv *resv,
 				  enum amdgpu_sync_mode sync_mode)
 {
+	struct amdgpu_sync sync;
 	int r;
 
 	r = amdgpu_vm_sdma_alloc_job(p, 0);
@@ -96,7 +97,12 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, p->vm);
+	amdgpu_sync_create(&sync);
+	r = amdgpu_sync_resv(p->adev, &sync, resv, sync_mode, p->vm);
+	if (!r)
+		r = amdgpu_sync_push_to_job(&sync, p->job);
+	amdgpu_sync_free(&sync);
+	return r;
 }
 
 /**
@@ -232,7 +238,7 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
 	/* Wait for PD/PT moves to be completed */
 	dma_resv_iter_begin(&cursor, bo->tbo.base.resv, DMA_RESV_USAGE_KERNEL);
 	dma_resv_for_each_fence_unlocked(&cursor, fence) {
-		r = amdgpu_sync_fence(&p->job->sync, fence);
+		r = drm_sched_job_add_dependency(&p->job->base, fence);
 		if (r) {
 			dma_resv_iter_end(&cursor);
 			return r;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread
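
Condensed, the pattern this patch introduces is: collect fences into a
stack-local amdgpu_sync first, then hand only the still unsignaled ones to
the scheduler as job dependencies. A minimal sketch, reusing the helpers
from the diff above and omitting error unwinding; adev, resv, sync_mode,
owner and job stand in for the caller's context:

	struct amdgpu_sync sync;
	int r;

	amdgpu_sync_create(&sync);

	/* Gather fences from the reservation object according to sync_mode */
	r = amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner);

	/* Signaled fences are dropped, unsignaled ones become scheduler
	 * dependencies of the job with their own reference */
	if (!r)
		r = amdgpu_sync_push_to_job(&sync, job);

	/* Drop the references still held by the temporary sync object */
	amdgpu_sync_free(&sync);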

* [PATCH 09/13] drm/amdgpu: use scheduler dependencies for UVD msgs
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (7 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 08/13] drm/amdgpu: use scheduler dependencies for VM updates Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-24  5:53   ` Luben Tuikov
  2022-10-14  8:46 ` [PATCH 10/13] drm/amdgpu: use scheduler dependencies for CS Christian König
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Instead of putting these fences into the job sync object, add them as scheduler dependencies.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 8baddf79635b..e00bb654e24b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1177,9 +1177,9 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 		if (r)
 			goto err_free;
 	} else {
-		r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.base.resv,
-				     AMDGPU_SYNC_ALWAYS,
-				     AMDGPU_FENCE_OWNER_UNDEFINED);
+		r = drm_sched_job_add_resv_dependencies(&job->base,
+							bo->tbo.base.resv,
+							DMA_RESV_USAGE_KERNEL);
 		if (r)
 			goto err_free;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 10/13] drm/amdgpu: use scheduler dependencies for CS
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (8 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 09/13] drm/amdgpu: use scheduler dependencies for UVD msgs Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-24  5:55   ` Luben Tuikov
  2022-12-21 15:34   ` Mike Lothian
  2022-10-14  8:46 ` [PATCH 11/13] drm/scheduler: remove drm_sched_dependency_optimized Christian König
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Entirely remove the sync object from the job.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
 4 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index d45b86bcf7fa..0528c2b1db6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
 			dma_fence_put(old);
 		}
 
-		r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
+		r = amdgpu_sync_fence(&p->sync, fence);
 		dma_fence_put(fence);
 		if (r)
 			return r;
@@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
 		return r;
 	}
 
-	r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
+	r = amdgpu_sync_fence(&p->sync, fence);
 	if (r)
 		goto error;
 
@@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 	if (r)
 		return r;
 
-	r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
+	r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
 	if (r)
 		return r;
 
@@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 		if (r)
 			return r;
 
-		r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
+		r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
 		if (r)
 			return r;
 	}
@@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 		if (r)
 			return r;
 
-		r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
+		r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
 		if (r)
 			return r;
 	}
@@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 	if (r)
 		return r;
 
-	r = amdgpu_sync_fence(&job->sync, vm->last_update);
+	r = amdgpu_sync_fence(&p->sync, vm->last_update);
 	if (r)
 		return r;
 
@@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 {
 	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-	struct amdgpu_job *leader = p->gang_leader;
 	struct amdgpu_bo_list_entry *e;
 	unsigned int i;
 	int r;
@@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 
 		sync_mode = amdgpu_bo_explicit_sync(bo) ?
 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
-		r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
+		r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
 				     &fpriv->vm);
 		if (r)
 			return r;
 	}
 
-	for (i = 0; i < p->gang_size - 1; ++i) {
-		r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
+	for (i = 0; i < p->gang_size; ++i) {
+		r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
 		if (r)
 			return r;
 	}
@@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 		struct dma_fence *fence;
 
 		fence = &p->jobs[i]->base.s_fence->scheduled;
-		r = amdgpu_sync_fence(&leader->sync, fence);
+		r = drm_sched_job_add_dependency(&leader->base, fence);
 		if (r)
 			goto error_cleanup;
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
index cbaa19b2b8a3..207e801c24ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
@@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
 
 	unsigned			num_post_deps;
 	struct amdgpu_cs_post_dep	*post_deps;
+
+	struct amdgpu_sync		sync;
 };
 
 int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index ba98d65835b4..b8494c3b3b8a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	(*job)->base.sched = &adev->rings[0]->sched;
 	(*job)->vm = vm;
 
-	amdgpu_sync_create(&(*job)->sync);
 	amdgpu_sync_create(&(*job)->explicit_sync);
 	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
 	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
@@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
 
 	drm_sched_job_cleanup(s_job);
 
-	amdgpu_sync_free(&job->sync);
 	amdgpu_sync_free(&job->explicit_sync);
-
 	dma_fence_put(&job->hw_fence);
 }
 
@@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
 		drm_sched_job_cleanup(&job->base);
 
 	amdgpu_job_free_resources(job);
-	amdgpu_sync_free(&job->sync);
 	amdgpu_sync_free(&job->explicit_sync);
 	if (job->gang_submit != &job->base.s_fence->scheduled)
 		dma_fence_put(job->gang_submit);
@@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
 	struct amdgpu_job *job = to_amdgpu_job(sched_job);
-	struct dma_fence *fence;
+	struct dma_fence *fence = NULL;
 	int r;
 
-	fence = amdgpu_sync_get_fence(&job->sync);
 	while (fence == NULL && job->vm && !job->vmid) {
 		r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
 		if (r)
@@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	job = to_amdgpu_job(sched_job);
 	finished = &job->base.s_fence->finished;
 
-	BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
-
 	trace_amdgpu_sched_run_job(job);
 
 	/* Skip job if VRAM is lost and never resubmit gangs */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index 9c10b9bd0084..6558839fda03 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
 struct amdgpu_job {
 	struct drm_sched_job    base;
 	struct amdgpu_vm	*vm;
-	struct amdgpu_sync	sync;
 	struct amdgpu_sync	explicit_sync;
 	struct dma_fence	hw_fence;
 	struct dma_fence	*gang_submit;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 11/13] drm/scheduler: remove drm_sched_dependency_optimized
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (9 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 10/13] drm/amdgpu: use scheduler dependencies for CS Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-14  8:46 ` [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini Christian König
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

Not used any more.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 26 --------------------------
 include/drm/gpu_scheduler.h            |  2 --
 2 files changed, 28 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6e2cd0f906b2..ae4aac6eb26e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -197,32 +197,6 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
 	drm_sched_job_done(s_job);
 }
 
-/**
- * drm_sched_dependency_optimized
- *
- * @fence: the dependency fence
- * @entity: the entity which depends on the above fence
- *
- * Returns true if the dependency can be optimized and false otherwise
- */
-bool drm_sched_dependency_optimized(struct dma_fence* fence,
-				    struct drm_sched_entity *entity)
-{
-	struct drm_gpu_scheduler *sched = entity->rq->sched;
-	struct drm_sched_fence *s_fence;
-
-	if (!fence || dma_fence_is_signaled(fence))
-		return false;
-	if (fence->context == entity->fence_context)
-		return true;
-	s_fence = to_drm_sched_fence(fence);
-	if (s_fence && s_fence->sched == sched)
-		return true;
-
-	return false;
-}
-EXPORT_SYMBOL(drm_sched_dependency_optimized);
-
 /**
  * drm_sched_start_timeout - start timeout for reset worker
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 2a6e261ea01e..b1698819ad38 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -506,8 +506,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
 void drm_sched_reset_karma(struct drm_sched_job *bad);
 void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type);
-bool drm_sched_dependency_optimized(struct dma_fence* fence,
-				    struct drm_sched_entity *entity);
 void drm_sched_fault(struct drm_gpu_scheduler *sched);
 void drm_sched_job_kickout(struct drm_sched_job *s_job);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (10 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 11/13] drm/scheduler: remove drm_sched_dependency_optimized Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-11-17  2:36   ` Dmitry Osipenko
  2022-12-26 16:01   ` [12/13] " Jonathan Marek
  2022-10-14  8:46 ` [PATCH 13/13] drm/scheduler: rename dependency callback into prepare_job Christian König
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

This was buggy: when we had to wait for entities which were killed as
well, we would just deadlock.

Instead, move all the dependency handling into the callbacks so that it
all happens asynchronously.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 197 +++++++++++------------
 1 file changed, 92 insertions(+), 105 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 243ff384cde8..54ac37cd5017 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -139,6 +139,74 @@ bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
 	return true;
 }
 
+static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk)
+{
+	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
+
+	drm_sched_fence_scheduled(job->s_fence);
+	drm_sched_fence_finished(job->s_fence);
+	WARN_ON(job->s_fence->parent);
+	job->sched->ops->free_job(job);
+}
+
+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+	int r;
+
+	dma_fence_put(f);
+
+	/* Wait for all dependencies to avoid data corruptions */
+	while (!xa_empty(&job->dependencies)) {
+		f = xa_erase(&job->dependencies, job->last_dependency++);
+		r = dma_fence_add_callback(f, &job->finish_cb,
+					   drm_sched_entity_kill_jobs_cb);
+		if (!r)
+			return;
+
+		dma_fence_put(f);
+	}
+
+	init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
+	irq_work_queue(&job->work);
+}
+
+/* Remove the entity from the scheduler and kill all pending jobs */
+static void drm_sched_entity_kill(struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *job;
+	struct dma_fence *prev;
+
+	if (!entity->rq)
+		return;
+
+	spin_lock(&entity->rq_lock);
+	entity->stopped = true;
+	drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
+
+	/* Make sure this entity is not used by the scheduler at the moment */
+	wait_for_completion(&entity->entity_idle);
+
+	prev = dma_fence_get(entity->last_scheduled);
+	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
+		struct drm_sched_fence *s_fence = job->s_fence;
+
+		dma_fence_set_error(&s_fence->finished, -ESRCH);
+
+		dma_fence_get(&s_fence->finished);
+		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
+					   drm_sched_entity_kill_jobs_cb))
+			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
+
+		prev = &s_fence->finished;
+	}
+	dma_fence_put(prev);
+}
+
 /**
  * drm_sched_entity_flush - Flush a context entity
  *
@@ -179,91 +247,13 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	/* For killed process disable any more IBs enqueue right now */
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
-	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
-		spin_lock(&entity->rq_lock);
-		entity->stopped = true;
-		drm_sched_rq_remove_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
-	}
+	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
+		drm_sched_entity_kill(entity);
 
 	return ret;
 }
 EXPORT_SYMBOL(drm_sched_entity_flush);
 
-static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk)
-{
-	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_finished(job->s_fence);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
-}
-
-
-/* Signal the scheduler finished fence when the entity in question is killed. */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
-					  struct dma_fence_cb *cb)
-{
-	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
-						 finish_cb);
-
-	dma_fence_put(f);
-	init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
-	irq_work_queue(&job->work);
-}
-
-static struct dma_fence *
-drm_sched_job_dependency(struct drm_sched_job *job,
-			 struct drm_sched_entity *entity)
-{
-	if (!xa_empty(&job->dependencies))
-		return xa_erase(&job->dependencies, job->last_dependency++);
-
-	if (job->sched->ops->dependency)
-		return job->sched->ops->dependency(job, entity);
-
-	return NULL;
-}
-
-static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
-{
-	struct drm_sched_job *job;
-	struct dma_fence *f;
-	int r;
-
-	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
-		struct drm_sched_fence *s_fence = job->s_fence;
-
-		/* Wait for all dependencies to avoid data corruptions */
-		while ((f = drm_sched_job_dependency(job, entity))) {
-			dma_fence_wait(f, false);
-			dma_fence_put(f);
-		}
-
-		drm_sched_fence_scheduled(s_fence);
-		dma_fence_set_error(&s_fence->finished, -ESRCH);
-
-		/*
-		 * When pipe is hanged by older entity, new entity might
-		 * not even have chance to submit it's first job to HW
-		 * and so entity->last_scheduled will remain NULL
-		 */
-		if (!entity->last_scheduled) {
-			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
-			continue;
-		}
-
-		dma_fence_get(entity->last_scheduled);
-		r = dma_fence_add_callback(entity->last_scheduled,
-					   &job->finish_cb,
-					   drm_sched_entity_kill_jobs_cb);
-		if (r == -ENOENT)
-			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
-		else if (r)
-			DRM_ERROR("fence add callback failed (%d)\n", r);
-	}
-}
-
 /**
  * drm_sched_entity_fini - Destroy a context entity
  *
@@ -277,33 +267,17 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
  */
 void drm_sched_entity_fini(struct drm_sched_entity *entity)
 {
-	struct drm_gpu_scheduler *sched = NULL;
-
-	if (entity->rq) {
-		sched = entity->rq->sched;
-		drm_sched_rq_remove_entity(entity->rq, entity);
-	}
-
-	/* Consumption of existing IBs wasn't completed. Forcefully
-	 * remove them here.
+	/*
+	 * If consumption of existing IBs wasn't completed. Forcefully remove
+	 * them here. Also makes sure that the scheduler won't touch this entity
+	 * any more.
 	 */
-	if (spsc_queue_count(&entity->job_queue)) {
-		if (sched) {
-			/*
-			 * Wait for thread to idle to make sure it isn't processing
-			 * this entity.
-			 */
-			wait_for_completion(&entity->entity_idle);
-
-		}
-		if (entity->dependency) {
-			dma_fence_remove_callback(entity->dependency,
-						  &entity->cb);
-			dma_fence_put(entity->dependency);
-			entity->dependency = NULL;
-		}
+	drm_sched_entity_kill(entity);
 
-		drm_sched_entity_kill_jobs(entity);
+	if (entity->dependency) {
+		dma_fence_remove_callback(entity->dependency, &entity->cb);
+		dma_fence_put(entity->dependency);
+		entity->dependency = NULL;
 	}
 
 	dma_fence_put(entity->last_scheduled);
@@ -416,6 +390,19 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 	return false;
 }
 
+static struct dma_fence *
+drm_sched_job_dependency(struct drm_sched_job *job,
+			 struct drm_sched_entity *entity)
+{
+	if (!xa_empty(&job->dependencies))
+		return xa_erase(&job->dependencies, job->last_dependency++);
+
+	if (job->sched->ops->dependency)
+		return job->sched->ops->dependency(job, entity);
+
+	return NULL;
+}
+
 struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *sched_job;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread
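
The essence of the rework above, boiled down: instead of blocking in
dma_fence_wait() for every dependency of a killed job, the finish callback
re-arms itself on the next unsignaled dependency and only queues the final
cleanup once nothing is left. A stripped-down sketch of that pattern;
my_job, pop_next_dependency() and finish_job() are placeholders, not
symbols from the series:

static void drain_next_dep(struct dma_fence *f, struct dma_fence_cb *cb)
{
	struct my_job *job = container_of(cb, struct my_job, cb);

	dma_fence_put(f);

	while ((f = pop_next_dependency(job))) {
		/* 0 means the callback is armed; return here and run
		 * again asynchronously once this fence signals */
		if (!dma_fence_add_callback(f, &job->cb, drain_next_dep))
			return;

		/* Fence already signaled, drop it and try the next one */
		dma_fence_put(f);
	}

	finish_job(job);	/* no dependencies left */
}

Since dma_fence_put(NULL) is a no-op, the same function can also be called
directly to kick the chain off, which is what drm_sched_entity_kill() does
with drm_sched_entity_kill_jobs_cb() in the patch above.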

* [PATCH 13/13] drm/scheduler: rename dependency callback into prepare_job
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (11 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini Christian König
@ 2022-10-14  8:46 ` Christian König
  2022-10-23  1:35 ` Fixes for scheduler hang when killing a process Luben Tuikov
  2022-10-24  7:00 ` Luben Tuikov
  14 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-14  8:46 UTC (permalink / raw)
  To: luben.tuikov, dri-devel, amd-gfx; +Cc: Christian König

This now matches much better what the callback is actually doing.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  4 ++--
 drivers/gpu/drm/scheduler/sched_entity.c |  4 ++--
 include/drm/gpu_scheduler.h              | 13 ++++++-------
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index b8494c3b3b8a..1c4a13693653 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -237,7 +237,7 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
 }
 
 static struct dma_fence *
-amdgpu_job_dependency(struct drm_sched_job *sched_job,
+amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
 		      struct drm_sched_entity *s_entity)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
@@ -326,7 +326,7 @@ void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
 }
 
 const struct drm_sched_backend_ops amdgpu_sched_ops = {
-	.dependency = amdgpu_job_dependency,
+	.prepare_job = amdgpu_job_prepare_job,
 	.run_job = amdgpu_job_run,
 	.timedout_job = amdgpu_job_timedout,
 	.free_job = amdgpu_job_free_cb
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 54ac37cd5017..cf68cd1559ec 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -397,8 +397,8 @@ drm_sched_job_dependency(struct drm_sched_job *job,
 	if (!xa_empty(&job->dependencies))
 		return xa_erase(&job->dependencies, job->last_dependency++);
 
-	if (job->sched->ops->dependency)
-		return job->sched->ops->dependency(job, entity);
+	if (job->sched->ops->prepare_job)
+		return job->sched->ops->prepare_job(job, entity);
 
 	return NULL;
 }
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index b1698819ad38..163385e15597 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -306,7 +306,7 @@ struct drm_sched_job {
 	 */
 	union {
 		struct dma_fence_cb		finish_cb;
-		struct irq_work 		work;
+		struct irq_work			work;
 	};
 
 	uint64_t			id;
@@ -347,18 +347,17 @@ enum drm_gpu_sched_stat {
  */
 struct drm_sched_backend_ops {
 	/**
-	 * @dependency:
+	 * @prepare_job:
 	 *
 	 * Called when the scheduler is considering scheduling this job next, to
 	 * get another struct dma_fence for this job to block on.  Once it
 	 * returns NULL, run_job() may be called.
 	 *
-	 * If a driver exclusively uses drm_sched_job_add_dependency() and
-	 * drm_sched_job_add_implicit_dependencies() this can be ommitted and
-	 * left as NULL.
+	 * Can be NULL if no additional preparation to the dependencies are
+	 * necessary. Skipped when jobs are killed instead of run.
 	 */
-	struct dma_fence *(*dependency)(struct drm_sched_job *sched_job,
-					struct drm_sched_entity *s_entity);
+	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
+					 struct drm_sched_entity *s_entity);
 
 	/**
          * @run_job: Called to execute the job once all of the dependencies
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread
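
For drivers that still need the hook after this rename, the ops table now
wires up .prepare_job where .dependency used to be. A minimal sketch;
my_prepare_job and my_sched_ops are placeholder names and the remaining
callbacks are elided:

static struct dma_fence *my_prepare_job(struct drm_sched_job *sched_job,
					struct drm_sched_entity *s_entity)
{
	/* Return a fence the scheduler must still wait on, or NULL once
	 * the job is ready to run; skipped entirely for killed jobs. */
	return NULL;
}

static const struct drm_sched_backend_ops my_sched_ops = {
	.prepare_job = my_prepare_job,
	/* .run_job, .timedout_job and .free_job as before */
};

Drivers that only use drm_sched_job_add_dependency() and friends can keep
the callback NULL entirely, as the updated documentation above notes.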

* Re: [PATCH 05/13] drm/amdgpu: drop amdgpu_sync from amdgpu_vmid_grab
  2022-10-14  8:46 ` [PATCH 05/13] drm/amdgpu: drop amdgpu_sync " Christian König
@ 2022-10-23  1:25   ` Luben Tuikov
  2022-10-24 10:54     ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Luben Tuikov @ 2022-10-23  1:25 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx; +Cc: Christian König

On 2022-10-14 04:46, Christian König wrote:
> Instead return the fence directly. Avoids memory allocation to store the
> fence.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 42 +++++++++++++------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++----
>  3 files changed, 28 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> index b76294d4275b..2a9a2593dc18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -170,26 +170,27 @@ bool amdgpu_vmid_had_gpu_reset(struct amdgpu_device *adev,
>   *
>   * @vm: vm to allocate id for
>   * @ring: ring we want to submit job to
> - * @sync: sync object where we add dependencies
>   * @idle: resulting idle VMID
> + * @fence: fence to wait for if no id could be grabbed
>   *
>   * Try to find an idle VMID, if none is idle add a fence to wait to the sync
>   * object. Returns -ENOMEM when we are out of memory.
>   */
>  static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  				 struct amdgpu_ring *ring,
> -				 struct amdgpu_sync *sync,
> -				 struct amdgpu_vmid **idle)
> +				 struct amdgpu_vmid **idle,
> +				 struct dma_fence **fence)
>  {
>  	struct amdgpu_device *adev = ring->adev;
>  	unsigned vmhub = ring->funcs->vmhub;
>  	struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub];
>  	struct dma_fence **fences;
>  	unsigned i;
> -	int r;
>  
> -	if (!dma_fence_is_signaled(ring->vmid_wait))
> -		return amdgpu_sync_fence(sync, ring->vmid_wait);
> +	if (!dma_fence_is_signaled(ring->vmid_wait)) {
> +		*fence = dma_fence_get(ring->vmid_wait);
> +		return 0;
> +	}
>  
>  	fences = kmalloc_array(id_mgr->num_ids, sizeof(void *), GFP_KERNEL);
>  	if (!fences)
> @@ -228,10 +229,10 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  			return -ENOMEM;
>  		}
>  
> -		r = amdgpu_sync_fence(sync, &array->base);
> +		*fence = dma_fence_get(&array->base);
>  		dma_fence_put(ring->vmid_wait);
>  		ring->vmid_wait = &array->base;
> -		return r;
> +		return 0;
>  	}
>  	kfree(fences);
>  
> @@ -243,17 +244,17 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>   *
>   * @vm: vm to allocate id for
>   * @ring: ring we want to submit job to
> - * @sync: sync object where we add dependencies
>   * @job: job who wants to use the VMID
>   * @id: resulting VMID
> + * @fence: fence to wait for if no id could be grabbed
>   *
>   * Try to assign a reserved VMID.
>   */
>  static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
>  				     struct amdgpu_ring *ring,
> -				     struct amdgpu_sync *sync,
>  				     struct amdgpu_job *job,
> -				     struct amdgpu_vmid **id)
> +				     struct amdgpu_vmid **id,
> +				     struct dma_fence **fence)
>  {
>  	struct amdgpu_device *adev = ring->adev;
>  	unsigned vmhub = ring->funcs->vmhub;
> @@ -280,7 +281,8 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
>  		tmp = amdgpu_sync_peek_fence(&(*id)->active, ring);
>  		if (tmp) {
>  			*id = NULL;
> -			return amdgpu_sync_fence(sync, tmp);
> +			*fence = dma_fence_get(tmp);
> +			return 0;
>  		}
>  		needs_flush = true;
>  	}
> @@ -302,17 +304,17 @@ static int amdgpu_vmid_grab_reserved(struct amdgpu_vm *vm,
>   *
>   * @vm: vm to allocate id for
>   * @ring: ring we want to submit job to
> - * @sync: sync object where we add dependencies
>   * @job: job who wants to use the VMID
>   * @id: resulting VMID
> + * @fence: fence to wait for if no id could be grabbed
>   *
>   * Try to reuse a VMID for this submission.
>   */
>  static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
>  				 struct amdgpu_ring *ring,
> -				 struct amdgpu_sync *sync,
>  				 struct amdgpu_job *job,
> -				 struct amdgpu_vmid **id)
> +				 struct amdgpu_vmid **id,
> +				 struct dma_fence **fence)
>  {
>  	struct amdgpu_device *adev = ring->adev;
>  	unsigned vmhub = ring->funcs->vmhub;
> @@ -367,13 +369,13 @@ static int amdgpu_vmid_grab_used(struct amdgpu_vm *vm,
>   *
>   * @vm: vm to allocate id for
>   * @ring: ring we want to submit job to
> - * @sync: sync object where we add dependencies
>   * @job: job who wants to use the VMID
> + * @fence: fence to wait for if no id could be grabbed
>   *
>   * Allocate an id for the vm, adding fences to the sync obj as necessary.
>   */
>  int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
> -		     struct amdgpu_sync *sync, struct amdgpu_job *job)
> +		     struct amdgpu_job *job, struct dma_fence **fence)
>  {
>  	struct amdgpu_device *adev = ring->adev;
>  	unsigned vmhub = ring->funcs->vmhub;
> @@ -383,16 +385,16 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
>  	int r = 0;
>  
>  	mutex_lock(&id_mgr->lock);
> -	r = amdgpu_vmid_grab_idle(vm, ring, sync, &idle);
> +	r = amdgpu_vmid_grab_idle(vm, ring, &idle, fence);
>  	if (r || !idle)
>  		goto error;
>  
>  	if (vm->reserved_vmid[vmhub]) {
> -		r = amdgpu_vmid_grab_reserved(vm, ring, sync, job, &id);
> +		r = amdgpu_vmid_grab_reserved(vm, ring, job, &id, fence);
>  		if (r || !id)
>  			goto error;
>  	} else {
> -		r = amdgpu_vmid_grab_used(vm, ring, sync, job, &id);
> +		r = amdgpu_vmid_grab_used(vm, ring, job, &id, fence);
>  		if (r)
>  			goto error;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
> index 1b1e7d04655c..57efe61dceed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h
> @@ -84,7 +84,7 @@ void amdgpu_vmid_free_reserved(struct amdgpu_device *adev,
>  			       struct amdgpu_vm *vm,
>  			       unsigned vmhub);
>  int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
> -		     struct amdgpu_sync *sync, struct amdgpu_job *job);
> +		     struct amdgpu_job *job, struct dma_fence **fence);
>  void amdgpu_vmid_reset(struct amdgpu_device *adev, unsigned vmhub,
>  		       unsigned vmid);
>  void amdgpu_vmid_reset_all(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 13b752687b30..e187dc0ab898 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -238,12 +238,12 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>  	return 0;
>  }
>  
> -static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> -					       struct drm_sched_entity *s_entity)
> +static struct dma_fence *
> +amdgpu_job_dependency(struct drm_sched_job *sched_job,
> +		      struct drm_sched_entity *s_entity)
>  {
>  	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
>  	struct amdgpu_job *job = to_amdgpu_job(sched_job);
> -	struct amdgpu_vm *vm = job->vm;
>  	struct dma_fence *fence;
>  	int r;
>  
> @@ -254,12 +254,10 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>  			DRM_ERROR("Error adding fence (%d)\n", r);
>  	}
>  
> -	while (fence == NULL && vm && !job->vmid) {
> -		r = amdgpu_vmid_grab(vm, ring, &job->sync, job);
> +	while (fence == NULL && job->vm && !job->vmid) {

In preliminary application of the patch series, checkpatch.pl complains about the comparison to NULL
and wants !fence instead:

	while (!fence && job->vm && !job->vmid) {

I can see that it had been like this before... and I know it's out of the scope of this series,
but we should fix this at some point in time.

Regards,
Luben


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Fixes for scheduler hang when killing a process
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (12 preceding siblings ...)
  2022-10-14  8:46 ` [PATCH 13/13] drm/scheduler: rename dependency callback into prepare_job Christian König
@ 2022-10-23  1:35 ` Luben Tuikov
  2022-10-24  7:00 ` Luben Tuikov
  14 siblings, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-23  1:35 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx

On 2022-10-14 04:46, Christian König wrote:
> Hi guys,
> 
> rebased those patches on top of amd-staging-drm-next since the
> amdgpu changes are quite substencial.
> 
> Please review and comment,
> Christian.

The amd-staging-drm-next branch on which this is rebased includes your 2-patch series
titled "drm_sched: add DRM_SCHED_FENCE_DONT_PIPELINE flag" and "drm_amdgpu: use DRM_SCHED_FENCE_DONT_PIPELINE for VM updates".

Without those two earlier patches present in the branch, this series doesn't apply.

Regards,
Luben


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization
  2022-10-14  8:46 ` [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization Christian König
@ 2022-10-23  1:50   ` Luben Tuikov
  0 siblings, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-23  1:50 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx; +Cc: Christian König

On 2022-10-14 04:46, Christian König wrote:
> Init the DRM scheduler base class while allocating the job.
> 
> This makes the whole handling much cleaner.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  8 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     | 44 ++++++++-------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     | 14 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c    |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     | 56 +++++++------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c     |  9 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c     | 13 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c     | 22 +++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 61 +++++++++++----------
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c      | 12 ++--
>  drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c       |  8 +--
>  drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c       | 14 +++--
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c    | 17 ++----
>  14 files changed, 137 insertions(+), 150 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 0561812aa0a4..046d466b4ee4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -673,7 +673,7 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>  		goto err;
>  	}
>  
> -	ret = amdgpu_job_alloc(adev, 1, &job, NULL);
> +	ret = amdgpu_job_alloc(adev, NULL, NULL, NULL, 1, &job);
>  	if (ret)
>  		goto err;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 1bbd39b3b0fc..aa6f6c428dbc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -291,12 +291,8 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
>  		return -EINVAL;
>  
>  	for (i = 0; i < p->gang_size; ++i) {
> -		ret = amdgpu_job_alloc(p->adev, num_ibs[i], &p->jobs[i], vm);
> -		if (ret)
> -			goto free_all_kdata;
> -
> -		ret = drm_sched_job_init(&p->jobs[i]->base, p->entities[i],
> -					 &fpriv->vm);
> +		ret = amdgpu_job_alloc(p->adev, vm, p->entities[i], vm,
> +				       num_ibs[i], &p->jobs[i]);
>  		if (ret)
>  			goto free_all_kdata;
>  	}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e187dc0ab898..5c69461ab3e0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -88,8 +88,9 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
> -int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> +int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> +		     struct drm_sched_entity *entity, void *owner,
> +		     unsigned num_ibs, struct amdgpu_job **job)

Checkpatch.pl complains about "unsigned" and wants "unsigned int" here.

>  {
>  	if (num_ibs == 0)
>  		return -EINVAL;
> @@ -110,23 +111,30 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>  	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>  	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>  
> -	return 0;
> +	if (!entity)
> +		return 0;
> +
> +	return drm_sched_job_init(&(*job)->base, entity, owner);
>  }
>  
> -int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> -		enum amdgpu_ib_pool_type pool_type,
> -		struct amdgpu_job **job)
> +int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> +			     struct drm_sched_entity *entity, void *owner,
> +			     unsigned size, enum amdgpu_ib_pool_type pool_type,

Here also, "unsigned int".

> +			     struct amdgpu_job **job)
>  {
>  	int r;
>  
> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> +	r = amdgpu_job_alloc(adev, NULL, entity, owner, 1, job);
>  	if (r)
>  		return r;
>  
>  	(*job)->num_ibs = 1;
>  	r = amdgpu_ib_get(adev, NULL, size, pool_type, &(*job)->ibs[0]);
> -	if (r)
> +	if (r) {
> +		if (entity)
> +			drm_sched_job_cleanup(&(*job)->base);
>  		kfree(*job);
> +	}
>  
>  	return r;
>  }
> @@ -190,6 +198,9 @@ void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
>  
>  void amdgpu_job_free(struct amdgpu_job *job)
>  {
> +	if (job->base.entity)
> +		drm_sched_job_cleanup(&job->base);
> +
>  	amdgpu_job_free_resources(job);
>  	amdgpu_sync_free(&job->sync);
>  	amdgpu_sync_free(&job->sched_sync);
> @@ -202,25 +213,16 @@ void amdgpu_job_free(struct amdgpu_job *job)
>  		dma_fence_put(&job->hw_fence);
>  }
>  
> -int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
> -		      void *owner, struct dma_fence **f)
> +struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
>  {
> -	int r;
> -
> -	if (!f)
> -		return -EINVAL;
> -
> -	r = drm_sched_job_init(&job->base, entity, owner);
> -	if (r)
> -		return r;
> +	struct dma_fence *f;
>  
>  	drm_sched_job_arm(&job->base);
> -
> -	*f = dma_fence_get(&job->base.s_fence->finished);
> +	f = dma_fence_get(&job->base.s_fence->finished);
>  	amdgpu_job_free_resources(job);
>  	drm_sched_entity_push_job(&job->base);
>  
> -	return 0;
> +	return f;
>  }
>  
>  int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index ab7b150e5d50..f099210c386a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -78,18 +78,20 @@ static inline struct amdgpu_ring *amdgpu_job_ring(struct amdgpu_job *job)
>  	return to_amdgpu_ring(job->base.entity->rq->sched);
>  }
>  
> -int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> -int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> -		enum amdgpu_ib_pool_type pool, struct amdgpu_job **job);
> +int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> +		     struct drm_sched_entity *entity, void *owner,
> +		     unsigned num_ibs, struct amdgpu_job **job);
> +int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> +			     struct drm_sched_entity *entity, void *owner,
> +			     unsigned size, enum amdgpu_ib_pool_type pool_type,
> +			     struct amdgpu_job **job);

"unsigned int" correspondingly in the chunk above.

>  void amdgpu_job_set_resources(struct amdgpu_job *job, struct amdgpu_bo *gds,
>  			      struct amdgpu_bo *gws, struct amdgpu_bo *oa);
>  void amdgpu_job_free_resources(struct amdgpu_job *job);
>  void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
>  				struct amdgpu_job *leader);
>  void amdgpu_job_free(struct amdgpu_job *job);
> -int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
> -		      void *owner, struct dma_fence **f);
> +struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job);
>  int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>  			     struct dma_fence **fence);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
> index 518eb0e40d32..de182bfcf85f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
> @@ -150,14 +150,15 @@ static int amdgpu_jpeg_dec_set_reg(struct amdgpu_ring *ring, uint32_t handle,
>  	const unsigned ib_size_dw = 16;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
> +				     AMDGPU_IB_POOL_DIRECT, &job);
>  	if (r)
>  		return r;
>  
>  	ib = &job->ibs[0];
>  
> -	ib->ptr[0] = PACKETJ(adev->jpeg.internal.jpeg_pitch, 0, 0, PACKETJ_TYPE0);
> +	ib->ptr[0] = PACKETJ(adev->jpeg.internal.jpeg_pitch, 0, 0,
> +			     PACKETJ_TYPE0);
>  	ib->ptr[1] = 0xDEADBEEF;
>  	for (i = 2; i < 16; i += 2) {
>  		ib->ptr[i] = PACKETJ(0, 0, 0, PACKETJ_TYPE6);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index f898e870d157..30edb05e0d25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -189,7 +189,6 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
>  	struct amdgpu_device *adev = ring->adev;
>  	unsigned offset, num_pages, num_dw, num_bytes;
>  	uint64_t src_addr, dst_addr;
> -	struct dma_fence *fence;
>  	struct amdgpu_job *job;
>  	void *cpu_addr;
>  	uint64_t flags;
> @@ -229,7 +228,9 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
>  	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
>  	num_bytes = num_pages * 8 * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>  
> -	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4 + num_bytes,
> +	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     num_dw * 4 + num_bytes,
>  				     AMDGPU_IB_POOL_DELAYED, &job);
>  	if (r)
>  		return r;
> @@ -269,18 +270,8 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
>  		}
>  	}
>  
> -	r = amdgpu_job_submit(job, &adev->mman.entity,
> -			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
> -	if (r)
> -		goto error_free;
> -
> -	dma_fence_put(fence);
> -
> -	return r;
> -
> -error_free:
> -	amdgpu_job_free(job);
> -	return r;
> +	dma_fence_put(amdgpu_job_submit(job));
> +	return 0;
>  }
>  
>  /**
> @@ -1432,7 +1423,8 @@ static void amdgpu_ttm_vram_mm_access(struct amdgpu_device *adev, loff_t pos,
>  }
>  
>  static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object *bo,
> -					unsigned long offset, void *buf, int len, int write)
> +					unsigned long offset, void *buf,
> +					int len, int write)
>  {
>  	struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
>  	struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
> @@ -1456,26 +1448,27 @@ static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object *bo,
>  		memcpy(adev->mman.sdma_access_ptr, buf, len);
>  
>  	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
> -	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4, AMDGPU_IB_POOL_DELAYED, &job);
> +	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     num_dw * 4, AMDGPU_IB_POOL_DELAYED,
> +				     &job);
>  	if (r)
>  		goto out;
>  
>  	amdgpu_res_first(abo->tbo.resource, offset, len, &src_mm);
> -	src_addr = amdgpu_ttm_domain_start(adev, bo->resource->mem_type) + src_mm.start;
> +	src_addr = amdgpu_ttm_domain_start(adev, bo->resource->mem_type) +
> +		src_mm.start;
>  	dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
>  	if (write)
>  		swap(src_addr, dst_addr);
>  
> -	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, PAGE_SIZE, false);
> +	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr,
> +				PAGE_SIZE, false);
>  
>  	amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);
>  	WARN_ON(job->ibs[0].length_dw > num_dw);
>  
> -	r = amdgpu_job_submit(job, &adev->mman.entity, AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
> -	if (r) {
> -		amdgpu_job_free(job);
> -		goto out;
> -	}
> +	fence = amdgpu_job_submit(job);
>  
>  	if (!dma_fence_wait_timeout(fence, false, adev->sdma_timeout))
>  		r = -ETIMEDOUT;
> @@ -1974,7 +1967,9 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev,
>  		AMDGPU_IB_POOL_DELAYED;
>  	int r;
>  
> -	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4, pool, job);
> +	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     num_dw * 4, pool, job);
>  	if (r)
>  		return r;
>  
> @@ -2033,8 +2028,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>  	if (direct_submit)
>  		r = amdgpu_job_submit_direct(job, ring, fence);
>  	else
> -		r = amdgpu_job_submit(job, &adev->mman.entity,
> -				      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
> +		*fence = amdgpu_job_submit(job);
>  	if (r)
>  		goto error_free;
>  
> @@ -2079,16 +2073,8 @@ static int amdgpu_ttm_fill_mem(struct amdgpu_ring *ring, uint32_t src_data,
>  
>  	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>  	WARN_ON(job->ibs[0].length_dw > num_dw);
> -	r = amdgpu_job_submit(job, &adev->mman.entity,
> -			      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
> -	if (r)
> -		goto error_free;
> -
> +	*fence = amdgpu_job_submit(job);
>  	return 0;
> -
> -error_free:
> -	amdgpu_job_free(job);
> -	return r;
>  }
>  
>  int amdgpu_fill_buffer(struct amdgpu_bo *bo,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 6eac649499d3..8baddf79635b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -1132,7 +1132,9 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
>  	unsigned offset_idx = 0;
>  	unsigned offset[3] = { UVD_BASE_SI, 0, 0 };
>  
> -	r = amdgpu_job_alloc_with_ib(adev, 64, direct ? AMDGPU_IB_POOL_DIRECT :
> +	r = amdgpu_job_alloc_with_ib(ring->adev, &adev->uvd.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     64, direct ? AMDGPU_IB_POOL_DIRECT :
>  				     AMDGPU_IB_POOL_DELAYED, &job);
>  	if (r)
>  		return r;
> @@ -1181,10 +1183,7 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
>  		if (r)
>  			goto err_free;
>  
> -		r = amdgpu_job_submit(job, &adev->uvd.entity,
> -				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
> -		if (r)
> -			goto err_free;
> +		f = amdgpu_job_submit(job);
>  	}
>  
>  	amdgpu_bo_reserve(bo, true);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index 02cb3a12dd76..b239e874f2d5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -450,8 +450,10 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
>  	uint64_t addr;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -				     AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, &ring->adev->vce.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
> +				     &job);
>  	if (r)
>  		return r;
>  
> @@ -538,7 +540,9 @@ static int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
>  	struct dma_fence *f = NULL;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> +	r = amdgpu_job_alloc_with_ib(ring->adev, &ring->adev->vce.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     ib_size_dw * 4,
>  				     direct ? AMDGPU_IB_POOL_DIRECT :
>  				     AMDGPU_IB_POOL_DELAYED, &job);
>  	if (r)
> @@ -570,8 +574,7 @@ static int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
>  	if (direct)
>  		r = amdgpu_job_submit_direct(job, ring, &f);
>  	else
> -		r = amdgpu_job_submit(job, &ring->adev->vce.entity,
> -				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
> +		f = amdgpu_job_submit(job);
>  	if (r)
>  		goto err;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 0b52af415b28..965b7755cb88 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -600,15 +600,16 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring,
>  				   struct amdgpu_ib *ib_msg,
>  				   struct dma_fence **fence)
>  {
> +	uint64_t addr = AMDGPU_GPU_PAGE_ALIGN(ib_msg->gpu_addr);

"u64" : checkpatch.pl.

>  	struct amdgpu_device *adev = ring->adev;
>  	struct dma_fence *f = NULL;
>  	struct amdgpu_job *job;
>  	struct amdgpu_ib *ib;
> -	uint64_t addr = AMDGPU_GPU_PAGE_ALIGN(ib_msg->gpu_addr);
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(adev, 64,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
> +				     64, AMDGPU_IB_POOL_DIRECT,
> +				     &job);
>  	if (r)
>  		goto err;
>  
> @@ -787,8 +788,9 @@ static int amdgpu_vcn_dec_sw_send_msg(struct amdgpu_ring *ring,
>  	if (sq)
>  		ib_size_dw += 8;
>  
> -	r = amdgpu_job_alloc_with_ib(adev, ib_size_dw * 4,
> -				AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
> +				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
> +				     &job);
>  	if (r)
>  		goto err;
>  
> @@ -916,8 +918,9 @@ static int amdgpu_vcn_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
>  	if (sq)
>  		ib_size_dw += 8;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
> +				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
> +				     &job);
>  	if (r)
>  		return r;
>  
> @@ -982,8 +985,9 @@ static int amdgpu_vcn_enc_get_destroy_msg(struct amdgpu_ring *ring, uint32_t han
>  	if (sq)
>  		ib_size_dw += 8;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL,
> +				     ib_size_dw * 4, AMDGPU_IB_POOL_DIRECT,
> +				     &job);
>  	if (r)
>  		return r;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 69e105fa41f6..126364882d09 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -47,6 +47,32 @@ static int amdgpu_vm_sdma_map_table(struct amdgpu_bo_vm *table)
>  	return r;
>  }
>  
> +/* Allocate a new job for @count PTE updates */
> +static int amdgpu_vm_sdma_alloc_job(struct amdgpu_vm_update_params *p,
> +				    unsigned int count)
> +{
> +	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
> +		: AMDGPU_IB_POOL_DELAYED;
> +	struct drm_sched_entity *entity = p->immediate ? &p->vm->immediate
> +		: &p->vm->delayed;
> +	unsigned int ndw;
> +	int r;
> +
> +	/* estimate how many dw we need */
> +	ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
> +	if (p->pages_addr)
> +		ndw += count * 2;
> +	ndw = min(ndw, AMDGPU_VM_SDMA_MAX_NUM_DW);
> +
> +	r = amdgpu_job_alloc_with_ib(p->adev, entity, AMDGPU_FENCE_OWNER_VM,
> +				     ndw * 4, pool, &p->job);
> +	if (r)
> +		return r;
> +
> +	p->num_dw_left = ndw;
> +	return 0;
> +}
> +
>  /**
>   * amdgpu_vm_sdma_prepare - prepare SDMA command submission
>   *
> @@ -61,17 +87,12 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>  				  struct dma_resv *resv,
>  				  enum amdgpu_sync_mode sync_mode)
>  {
> -	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
> -		: AMDGPU_IB_POOL_DELAYED;
> -	unsigned int ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
>  	int r;
>  
> -	r = amdgpu_job_alloc_with_ib(p->adev, ndw * 4, pool, &p->job);
> +	r = amdgpu_vm_sdma_alloc_job(p, 0);
>  	if (r)
>  		return r;
>  
> -	p->num_dw_left = ndw;
> -
>  	if (!resv)
>  		return 0;
>  
> @@ -91,20 +112,16 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  				 struct dma_fence **fence)
>  {
>  	struct amdgpu_ib *ib = p->job->ibs;
> -	struct drm_sched_entity *entity;
>  	struct amdgpu_ring *ring;
>  	struct dma_fence *f;
> -	int r;
>  
> -	entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
> -	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> +	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
> +			    sched);
>  
>  	WARN_ON(ib->length_dw == 0);
>  	amdgpu_ring_pad_ib(ring, ib);
>  	WARN_ON(ib->length_dw > p->num_dw_left);
> -	r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
> -	if (r)
> -		goto error;
> +	f = amdgpu_job_submit(p->job);
>  
>  	if (p->unlocked) {
>  		struct dma_fence *tmp = dma_fence_get(f);
> @@ -127,10 +144,6 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	}
>  	dma_fence_put(f);
>  	return 0;
> -
> -error:
> -	amdgpu_job_free(p->job);
> -	return r;
>  }
>  
>  /**
> @@ -210,8 +223,6 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
>  				 uint64_t flags)
>  {
>  	struct amdgpu_bo *bo = &vmbo->bo;
> -	enum amdgpu_ib_pool_type pool = p->immediate ? AMDGPU_IB_POOL_IMMEDIATE
> -		: AMDGPU_IB_POOL_DELAYED;
>  	struct dma_resv_iter cursor;
>  	unsigned int i, ndw, nptes;
>  	struct dma_fence *fence;
> @@ -238,19 +249,9 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
>  			if (r)
>  				return r;
>  
> -			/* estimate how many dw we need */
> -			ndw = 32;
> -			if (p->pages_addr)
> -				ndw += count * 2;
> -			ndw = max(ndw, AMDGPU_VM_SDMA_MIN_NUM_DW);
> -			ndw = min(ndw, AMDGPU_VM_SDMA_MAX_NUM_DW);
> -
> -			r = amdgpu_job_alloc_with_ib(p->adev, ndw * 4, pool,
> -						     &p->job);
> +			r = amdgpu_vm_sdma_alloc_job(p, count);
>  			if (r)
>  				return r;
> -
> -			p->num_dw_left = ndw;
>  		}
>  
>  		if (!p->pages_addr) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index f513e2c2e964..657e53708248 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -371,7 +371,9 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>  	 * translation. Avoid this by doing the invalidation from the SDMA
>  	 * itself.
>  	 */
> -	r = amdgpu_job_alloc_with_ib(adev, 16 * 4, AMDGPU_IB_POOL_IMMEDIATE,
> +	r = amdgpu_job_alloc_with_ib(ring->adev, &adev->mman.entity,
> +				     AMDGPU_FENCE_OWNER_UNDEFINED,
> +				     16 * 4, AMDGPU_IB_POOL_IMMEDIATE,
>  				     &job);
>  	if (r)
>  		goto error_alloc;
> @@ -380,10 +382,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>  	job->vm_needs_flush = true;
>  	job->ibs->ptr[job->ibs->length_dw++] = ring->funcs->nop;
>  	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
> -	r = amdgpu_job_submit(job, &adev->mman.entity,
> -			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
> -	if (r)
> -		goto error_submit;
> +	fence = amdgpu_job_submit(job);
>  
>  	mutex_unlock(&adev->mman.gtt_window_lock);
>  
> @@ -392,9 +391,6 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>  
>  	return;
>  
> -error_submit:
> -	amdgpu_job_free(job);
> -
>  error_alloc:
>  	mutex_unlock(&adev->mman.gtt_window_lock);
>  	DRM_ERROR("Error flushing GPU TLB using the SDMA (%d)!\n", r);
> diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
> index 375c440957dc..5fe872f4bea7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
> @@ -216,8 +216,8 @@ static int uvd_v6_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
>  	uint64_t addr;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
> +				     AMDGPU_IB_POOL_DIRECT, &job);
>  	if (r)
>  		return r;
>  
> @@ -280,8 +280,8 @@ static int uvd_v6_0_enc_get_destroy_msg(struct amdgpu_ring *ring,
>  	uint64_t addr;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
> +				     AMDGPU_IB_POOL_DIRECT, &job);
>  	if (r)
>  		return r;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
> index e668b3baa8c6..1c0210de8455 100644
> --- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
> @@ -213,7 +213,8 @@ static int uvd_v7_0_enc_ring_test_ring(struct amdgpu_ring *ring)
>   *
>   * Open up a stream for HW test
>   */
> -static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
> +static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring,
> +				       uint32_t handle,

"u32" : checkpatch.pl.

>  				       struct amdgpu_bo *bo,
>  				       struct dma_fence **fence)
>  {
> @@ -224,8 +225,8 @@ static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
>  	uint64_t addr;
>  	int i, r;
>  
> -	r = amdgpu_job_alloc_with_ib(ring->adev, ib_size_dw * 4,
> -					AMDGPU_IB_POOL_DIRECT, &job);
> +	r = amdgpu_job_alloc_with_ib(ring->adev, NULL, NULL, ib_size_dw * 4,
> +				     AMDGPU_IB_POOL_DIRECT, &job);
>  	if (r)
>  		return r;
>  
> @@ -276,7 +277,8 @@ static int uvd_v7_0_enc_get_create_msg(struct amdgpu_ring *ring, uint32_t handle
>   *
>   * Close up a stream for HW test or if userspace failed to do so
>   */
> -static int uvd_v7_0_enc_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
> +static int uvd_v7_0_enc_get_destroy_msg(struct amdgpu_ring *ring,
> +					uint32_t handle,

"u32" : checkpatch.pl.

Regards,
Luben


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/13] drm/amdgpu: use scheduler depenencies for VM updates
  2022-10-14  8:46 ` [PATCH 08/13] drm/amdgpu: use scheduler depenencies for VM updates Christian König
@ 2022-10-24  5:50   ` Luben Tuikov
  0 siblings, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-24  5:50 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx; +Cc: Christian König

"dependencies" in the title.

The rest looks good. I like pulling that sync entry free code into its own function.

Regards,
Luben

On 2022-10-14 04:46, Christian König wrote:
> Instead of putting that into the job sync object.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c    | 56 +++++++++++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h    |  2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 10 +++-
>  3 files changed, 52 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> index 090e66a1b284..bac7976975bd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -259,6 +259,14 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
>  	return 0;
>  }
>  
> +/* Free the entry back to the slab */
> +static void amdgpu_sync_entry_free(struct amdgpu_sync_entry *e)
> +{
> +	hash_del(&e->node);
> +	dma_fence_put(e->fence);
> +	kmem_cache_free(amdgpu_sync_slab, e);
> +}
> +
>  /**
>   * amdgpu_sync_peek_fence - get the next fence not signaled yet
>   *
> @@ -280,9 +288,7 @@ struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync,
>  		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
>  
>  		if (dma_fence_is_signaled(f)) {
> -			hash_del(&e->node);
> -			dma_fence_put(f);
> -			kmem_cache_free(amdgpu_sync_slab, e);
> +			amdgpu_sync_entry_free(e);
>  			continue;
>  		}
>  		if (ring && s_fence) {
> @@ -355,15 +361,42 @@ int amdgpu_sync_clone(struct amdgpu_sync *source, struct amdgpu_sync *clone)
>  			if (r)
>  				return r;
>  		} else {
> -			hash_del(&e->node);
> -			dma_fence_put(f);
> -			kmem_cache_free(amdgpu_sync_slab, e);
> +			amdgpu_sync_entry_free(e);
>  		}
>  	}
>  
>  	return 0;
>  }
>  
> +/**
> + * amdgpu_sync_push_to_job - push fences into job
> + * @sync: sync object to get the fences from
> + * @job: job to push the fences into
> + *
> + * Add all unsignaled fences from sync to job.
> + */
> +int amdgpu_sync_push_to_job(struct amdgpu_sync *sync, struct amdgpu_job *job)
> +{
> +	struct amdgpu_sync_entry *e;
> +	struct hlist_node *tmp;
> +	struct dma_fence *f;
> +	int i, r;
> +
> +	hash_for_each_safe(sync->fences, i, tmp, e, node) {
> +		f = e->fence;
> +		if (dma_fence_is_signaled(f)) {
> +			amdgpu_sync_entry_free(e);
> +			continue;
> +		}
> +
> +		dma_fence_get(f);
> +		r = drm_sched_job_add_dependency(&job->base, f);
> +		if (r)
> +			return r;
> +	}
> +	return 0;
> +}
> +
>  int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
>  {
>  	struct amdgpu_sync_entry *e;
> @@ -375,9 +408,7 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
>  		if (r)
>  			return r;
>  
> -		hash_del(&e->node);
> -		dma_fence_put(e->fence);
> -		kmem_cache_free(amdgpu_sync_slab, e);
> +		amdgpu_sync_entry_free(e);
>  	}
>  
>  	return 0;
> @@ -396,11 +427,8 @@ void amdgpu_sync_free(struct amdgpu_sync *sync)
>  	struct hlist_node *tmp;
>  	unsigned int i;
>  
> -	hash_for_each_safe(sync->fences, i, tmp, e, node) {
> -		hash_del(&e->node);
> -		dma_fence_put(e->fence);
> -		kmem_cache_free(amdgpu_sync_slab, e);
> -	}
> +	hash_for_each_safe(sync->fences, i, tmp, e, node)
> +		amdgpu_sync_entry_free(e);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
> index 2d5c613cda10..cf1e9e858efd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
> @@ -30,6 +30,7 @@ struct dma_fence;
>  struct dma_resv;
>  struct amdgpu_device;
>  struct amdgpu_ring;
> +struct amdgpu_job;
>  
>  enum amdgpu_sync_mode {
>  	AMDGPU_SYNC_ALWAYS,
> @@ -54,6 +55,7 @@ struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync,
>  				     struct amdgpu_ring *ring);
>  struct dma_fence *amdgpu_sync_get_fence(struct amdgpu_sync *sync);
>  int amdgpu_sync_clone(struct amdgpu_sync *source, struct amdgpu_sync *clone);
> +int amdgpu_sync_push_to_job(struct amdgpu_sync *sync, struct amdgpu_job *job);
>  int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr);
>  void amdgpu_sync_free(struct amdgpu_sync *sync);
>  int amdgpu_sync_init(void);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 126364882d09..59cf64216fbb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -87,6 +87,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>  				  struct dma_resv *resv,
>  				  enum amdgpu_sync_mode sync_mode)
>  {
> +	struct amdgpu_sync sync;
>  	int r;
>  
>  	r = amdgpu_vm_sdma_alloc_job(p, 0);
> @@ -96,7 +97,12 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>  	if (!resv)
>  		return 0;
>  
> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, p->vm);
> +	amdgpu_sync_create(&sync);
> +	r = amdgpu_sync_resv(p->adev, &sync, resv, sync_mode, p->vm);
> +	if (!r)
> +		r = amdgpu_sync_push_to_job(&sync, p->job);
> +	amdgpu_sync_free(&sync);
> +	return r;
>  }
>  
>  /**
> @@ -232,7 +238,7 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
>  	/* Wait for PD/PT moves to be completed */
>  	dma_resv_iter_begin(&cursor, bo->tbo.base.resv, DMA_RESV_USAGE_KERNEL);
>  	dma_resv_for_each_fence_unlocked(&cursor, fence) {
> -		r = amdgpu_sync_fence(&p->job->sync, fence);
> +		r = drm_sched_job_add_dependency(&p->job->base, fence);
>  		if (r) {
>  			dma_resv_iter_end(&cursor);
>  			return r;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/13] drm/amdgpu: use scheduler depenencies for UVD msgs
  2022-10-14  8:46 ` [PATCH 09/13] drm/amdgpu: use scheduler depenencies for UVD msgs Christian König
@ 2022-10-24  5:53   ` Luben Tuikov
  0 siblings, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-24  5:53 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx; +Cc: Christian König

Title: "dependencies".

Regards,
Luben

On 2022-10-14 04:46, Christian König wrote:
> Instead of putting that into the job sync object.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 8baddf79635b..e00bb654e24b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -1177,9 +1177,9 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
>  		if (r)
>  			goto err_free;
>  	} else {
> -		r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.base.resv,
> -				     AMDGPU_SYNC_ALWAYS,
> -				     AMDGPU_FENCE_OWNER_UNDEFINED);
> +		r = drm_sched_job_add_resv_dependencies(&job->base,
> +							bo->tbo.base.resv,
> +							DMA_RESV_USAGE_KERNEL);
>  		if (r)
>  			goto err_free;
>  


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS
  2022-10-14  8:46 ` [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS Christian König
@ 2022-10-24  5:55   ` Luben Tuikov
  2022-12-21 15:34   ` Mike Lothian
  1 sibling, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-24  5:55 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx; +Cc: Christian König

Title: "dependencies"

Regards,
Luben

On 2022-10-14 04:46, Christian König wrote:
> Entirely remove the sync obj in the job.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
>  4 files changed, 13 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index d45b86bcf7fa..0528c2b1db6e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
>  			dma_fence_put(old);
>  		}
>  
> -		r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> +		r = amdgpu_sync_fence(&p->sync, fence);
>  		dma_fence_put(fence);
>  		if (r)
>  			return r;
> @@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
>  		return r;
>  	}
>  
> -	r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> +	r = amdgpu_sync_fence(&p->sync, fence);
>  	if (r)
>  		goto error;
>  
> @@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  	if (r)
>  		return r;
>  
> -	r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
> +	r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
>  	if (r)
>  		return r;
>  
> @@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  		if (r)
>  			return r;
>  
> -		r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> +		r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>  		if (r)
>  			return r;
>  	}
> @@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  		if (r)
>  			return r;
>  
> -		r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> +		r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>  		if (r)
>  			return r;
>  	}
> @@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  	if (r)
>  		return r;
>  
> -	r = amdgpu_sync_fence(&job->sync, vm->last_update);
> +	r = amdgpu_sync_fence(&p->sync, vm->last_update);
>  	if (r)
>  		return r;
>  
> @@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>  {
>  	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> -	struct amdgpu_job *leader = p->gang_leader;
>  	struct amdgpu_bo_list_entry *e;
>  	unsigned int i;
>  	int r;
> @@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>  
>  		sync_mode = amdgpu_bo_explicit_sync(bo) ?
>  			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> -		r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
> +		r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
>  				     &fpriv->vm);
>  		if (r)
>  			return r;
>  	}
>  
> -	for (i = 0; i < p->gang_size - 1; ++i) {
> -		r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
> +	for (i = 0; i < p->gang_size; ++i) {
> +		r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
>  		if (r)
>  			return r;
>  	}
> @@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>  		struct dma_fence *fence;
>  
>  		fence = &p->jobs[i]->base.s_fence->scheduled;
> -		r = amdgpu_sync_fence(&leader->sync, fence);
> +		r = drm_sched_job_add_dependency(&leader->base, fence);
>  		if (r)
>  			goto error_cleanup;
>  	}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> index cbaa19b2b8a3..207e801c24ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> @@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
>  
>  	unsigned			num_post_deps;
>  	struct amdgpu_cs_post_dep	*post_deps;
> +
> +	struct amdgpu_sync		sync;
>  };
>  
>  int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ba98d65835b4..b8494c3b3b8a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	(*job)->base.sched = &adev->rings[0]->sched;
>  	(*job)->vm = vm;
>  
> -	amdgpu_sync_create(&(*job)->sync);
>  	amdgpu_sync_create(&(*job)->explicit_sync);
>  	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>  	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> @@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>  
>  	drm_sched_job_cleanup(s_job);
>  
> -	amdgpu_sync_free(&job->sync);
>  	amdgpu_sync_free(&job->explicit_sync);
> -
>  	dma_fence_put(&job->hw_fence);
>  }
>  
> @@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>  		drm_sched_job_cleanup(&job->base);
>  
>  	amdgpu_job_free_resources(job);
> -	amdgpu_sync_free(&job->sync);
>  	amdgpu_sync_free(&job->explicit_sync);
>  	if (job->gang_submit != &job->base.s_fence->scheduled)
>  		dma_fence_put(job->gang_submit);
> @@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
>  {
>  	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
>  	struct amdgpu_job *job = to_amdgpu_job(sched_job);
> -	struct dma_fence *fence;
> +	struct dma_fence *fence = NULL;
>  	int r;
>  
> -	fence = amdgpu_sync_get_fence(&job->sync);
>  	while (fence == NULL && job->vm && !job->vmid) {
>  		r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
>  		if (r)
> @@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>  	job = to_amdgpu_job(sched_job);
>  	finished = &job->base.s_fence->finished;
>  
> -	BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
> -
>  	trace_amdgpu_sched_run_job(job);
>  
>  	/* Skip job if VRAM is lost and never resubmit gangs */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 9c10b9bd0084..6558839fda03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
>  struct amdgpu_job {
>  	struct drm_sched_job    base;
>  	struct amdgpu_vm	*vm;
> -	struct amdgpu_sync	sync;
>  	struct amdgpu_sync	explicit_sync;
>  	struct dma_fence	hw_fence;
>  	struct dma_fence	*gang_submit;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Fixes for scheduler hang when killing a process
  2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
                   ` (13 preceding siblings ...)
  2022-10-23  1:35 ` Fixes for scheduler hang when killing a process Luben Tuikov
@ 2022-10-24  7:00 ` Luben Tuikov
  14 siblings, 0 replies; 47+ messages in thread
From: Luben Tuikov @ 2022-10-24  7:00 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx

On 2022-10-14 04:46, Christian König wrote:
> Hi guys,
> 
> rebased those patches on top of amd-staging-drm-next since the
> amdgpu changes are quite substencial.
> 
> Please review and comment,
> Christian.

Hi Christian,

The changes are good and I think asynchronous is the way to go. I'm also running with those changes live.

The series is:

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/13] drm/amdgpu: drop amdgpu_sync from amdgpu_vmid_grab
  2022-10-23  1:25   ` Luben Tuikov
@ 2022-10-24 10:54     ` Christian König
  0 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-24 10:54 UTC (permalink / raw)
  To: Luben Tuikov, Christian König, dri-devel, amd-gfx

Am 23.10.22 um 03:25 schrieb Luben Tuikov:
> On 2022-10-14 04:46, Christian König wrote:
>> [SNIP]
>> @@ -254,12 +254,10 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>>   			DRM_ERROR("Error adding fence (%d)\n", r);
>>   	}
>>   
>> -	while (fence == NULL && vm && !job->vmid) {
>> -		r = amdgpu_vmid_grab(vm, ring, &job->sync, job);
>> +	while (fence == NULL && job->vm && !job->vmid) {
> In preliminary application of the patch series, checkpatch.pl complains about comparison to NULL,
> and wants !fence, instead:
>
> 	while (!fence && job->vm && !job->vmid) {
>
> I can see that it had been like this before... and I know it's out of the scope of this series,
> but we should fix this at some point in time.

Thanks for pointing that out. I try to fix it whenever I encounter 
something like this, but sometimes just forget to double check.

Thanks,
Christian.

>
> Regards,
> Luben
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/13] drm/scheduler: fix fence ref counting
  2022-10-14  8:46 ` [PATCH 01/13] drm/scheduler: fix fence ref counting Christian König
@ 2022-10-25  3:23   ` Luna Nova
  2022-10-25 11:35     ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Luna Nova @ 2022-10-25  3:23 UTC (permalink / raw)
  To: amd-gfx, christian.koenig

This patch doesn't apply cleanly on 6.0; it seems amd-staging-drm-next is missing 9b04369b060fd4885f728b7a4ab4851ffb1abb64 ("drm/scheduler: Don't kill jobs in interrupt context")

On Fri, 14 Oct 2022, at 01:46, Christian König wrote:
> We leaked dependency fences when processes were beeing killed.
> 
> Additional to that grab a reference to the last scheduled fence.
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 43d337d8b153..243ff384cde8 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -207,6 +207,7 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> finish_cb);
>  
> + dma_fence_put(f);
The conflict is here; with that commit this line is INIT_WORK instead.
> init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
> irq_work_queue(&job->work);
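
For anyone else carrying this on 6.0, a hedged sketch of how the resolved
callback looks on a tree that already has that commit (assuming it switched
the job to a regular work item and renamed the handler to
drm_sched_entity_kill_jobs_work; names are from memory, please double-check
against your tree):

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
					  struct dma_fence_cb *cb)
{
	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
						 finish_cb);

	/* Drop the reference taken before dma_fence_add_callback() */
	dma_fence_put(f);
	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
	schedule_work(&job->work);
}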


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/13] drm/scheduler: fix fence ref counting
  2022-10-25  3:23   ` Luna Nova
@ 2022-10-25 11:35     ` Christian König
  0 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2022-10-25 11:35 UTC (permalink / raw)
  To: Luna Nova, amd-gfx

Am 25.10.22 um 05:23 schrieb Luna Nova:
> This patch doesn't apply nicely on 6.0, seems amd-staging-drm-next is missing 9b04369b060fd4885f728b7a4ab4851ffb1abb64 drm/scheduler: Don't kill jobs in interrupt context

Yeah, so what? This is a rebase on top of amd-staging-drm-next. I've 
just pushed the original patch to drm-misc-fixes.

Regards,
Christian.

>
> On Fri, 14 Oct 2022, at 01:46, Christian König wrote:
>> We leaked dependency fences when processes were beeing killed.
>>
>> Additional to that grab a reference to the last scheduled fence.
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 43d337d8b153..243ff384cde8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -207,6 +207,7 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>> struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>> finish_cb);
>>   
>> + dma_fence_put(f);
> Conflict here, is INIT_WORK with that commit.
>> init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
>> irq_work_queue(&job->work);


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-10-14  8:46 ` [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini Christian König
@ 2022-11-17  2:36   ` Dmitry Osipenko
  2022-11-17  9:53     ` Christian König
  2022-12-26 16:01   ` [12/13] " Jonathan Marek
  1 sibling, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17  2:36 UTC (permalink / raw)
  To: Christian König, luben.tuikov, dri-devel, amd-gfx
  Cc: Christian König

Hi,

On 10/14/22 11:46, Christian König wrote:
> +/* Remove the entity from the scheduler and kill all pending jobs */
> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *job;
> +	struct dma_fence *prev;
> +
> +	if (!entity->rq)
> +		return;
> +
> +	spin_lock(&entity->rq_lock);
> +	entity->stopped = true;
> +	drm_sched_rq_remove_entity(entity->rq, entity);
> +	spin_unlock(&entity->rq_lock);
> +
> +	/* Make sure this entity is not used by the scheduler at the moment */
> +	wait_for_completion(&entity->entity_idle);

I'm always hitting a lockup here using the Panfrost driver when terminating
Xorg. Reverting this patch helps. Any ideas how to fix it?

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17  2:36   ` Dmitry Osipenko
@ 2022-11-17  9:53     ` Christian König
  2022-11-17 12:47       ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-11-17  9:53 UTC (permalink / raw)
  To: Dmitry Osipenko, Christian König, luben.tuikov, dri-devel, amd-gfx

Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
> Hi,
>
> On 10/14/22 11:46, Christian König wrote:
>> +/* Remove the entity from the scheduler and kill all pending jobs */
>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>> +{
>> +	struct drm_sched_job *job;
>> +	struct dma_fence *prev;
>> +
>> +	if (!entity->rq)
>> +		return;
>> +
>> +	spin_lock(&entity->rq_lock);
>> +	entity->stopped = true;
>> +	drm_sched_rq_remove_entity(entity->rq, entity);
>> +	spin_unlock(&entity->rq_lock);
>> +
>> +	/* Make sure this entity is not used by the scheduler at the moment */
>> +	wait_for_completion(&entity->entity_idle);
> I'm always hitting lockup here using Panfrost driver on terminating
> Xorg. Revering this patch helps. Any ideas how to fix it?
>

Well is the entity idle or are there some unsubmitted jobs left?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17  9:53     ` Christian König
@ 2022-11-17 12:47       ` Dmitry Osipenko
  2022-11-17 12:55         ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17 12:47 UTC (permalink / raw)
  To: Christian König, Christian König, luben.tuikov,
	dri-devel, amd-gfx

On 11/17/22 12:53, Christian König wrote:
> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>> Hi,
>>
>> On 10/14/22 11:46, Christian König wrote:
>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>> +{
>>> +    struct drm_sched_job *job;
>>> +    struct dma_fence *prev;
>>> +
>>> +    if (!entity->rq)
>>> +        return;
>>> +
>>> +    spin_lock(&entity->rq_lock);
>>> +    entity->stopped = true;
>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>> +    spin_unlock(&entity->rq_lock);
>>> +
>>> +    /* Make sure this entity is not used by the scheduler at the
>>> moment */
>>> +    wait_for_completion(&entity->entity_idle);
>> I'm always hitting lockup here using Panfrost driver on terminating
>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>
> 
> Well is the entity idle or are there some unsubmitted jobs left?

Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.

I see that there are 5-6 incomplete (in-flight) jobs when
panfrost_job_close() is invoked.

There are 1-2 jobs that are constantly re-scheduled and finish every few
seconds after the lockup happens.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 12:47       ` Dmitry Osipenko
@ 2022-11-17 12:55         ` Christian König
  2022-11-17 12:59           ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-11-17 12:55 UTC (permalink / raw)
  To: Dmitry Osipenko, Christian König, luben.tuikov, dri-devel, amd-gfx

Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
> On 11/17/22 12:53, Christian König wrote:
>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>> Hi,
>>>
>>> On 10/14/22 11:46, Christian König wrote:
>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>> +{
>>>> +    struct drm_sched_job *job;
>>>> +    struct dma_fence *prev;
>>>> +
>>>> +    if (!entity->rq)
>>>> +        return;
>>>> +
>>>> +    spin_lock(&entity->rq_lock);
>>>> +    entity->stopped = true;
>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>> +    spin_unlock(&entity->rq_lock);
>>>> +
>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>> moment */
>>>> +    wait_for_completion(&entity->entity_idle);
>>> I'm always hitting lockup here using Panfrost driver on terminating
>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>
>> Well is the entity idle or are there some unsubmitted jobs left?
> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>
> I see that there are 5-6 incomplete (in-flight) jobs when
> panfrost_job_close() is invoked.
>
> There are 1-2 jobs that are constantly scheduled and finished once in a
> few seconds after the lockup happens.

Well what drm_sched_entity_kill() is supposed to do is to prevent 
pushing queued up stuff to the hw when the process which queued it is 
killed. Is the process really killed or is that just some incorrect 
handling?

In other words, I see two possibilities here: either we have a bug in the 
scheduler, or panfrost isn't using it correctly.

Does panfrost call drm_sched_entity_flush() before it calls 
drm_sched_entity_fini()? (I don't have the driver source at hand at the 
moment).

Regards,
Christian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 12:55         ` Christian König
@ 2022-11-17 12:59           ` Dmitry Osipenko
  2022-11-17 13:00             ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17 12:59 UTC (permalink / raw)
  To: Christian König, Christian König, luben.tuikov,
	dri-devel, amd-gfx

On 11/17/22 15:55, Christian König wrote:
> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>> On 11/17/22 12:53, Christian König wrote:
>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>> Hi,
>>>>
>>>> On 10/14/22 11:46, Christian König wrote:
>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>> +{
>>>>> +    struct drm_sched_job *job;
>>>>> +    struct dma_fence *prev;
>>>>> +
>>>>> +    if (!entity->rq)
>>>>> +        return;
>>>>> +
>>>>> +    spin_lock(&entity->rq_lock);
>>>>> +    entity->stopped = true;
>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>> +    spin_unlock(&entity->rq_lock);
>>>>> +
>>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>>> moment */
>>>>> +    wait_for_completion(&entity->entity_idle);
>>>> I'm always hitting lockup here using Panfrost driver on terminating
>>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>>
>>> Well is the entity idle or are there some unsubmitted jobs left?
>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>
>> I see that there are 5-6 incomplete (in-flight) jobs when
>> panfrost_job_close() is invoked.
>>
>> There are 1-2 jobs that are constantly scheduled and finished once in a
>> few seconds after the lockup happens.
> 
> Well what drm_sched_entity_kill() is supposed to do is to prevent
> pushing queued up stuff to the hw when the process which queued it is
> killed. Is the process really killed or is that just some incorrect
> handling?

It's actually 5-6 incomplete jobs of Xorg that are hanging when the Xorg
process is closed.

The two re-scheduled jobs are from sddm, so it's only the Xorg context
that hangs.

> In other words I see two possibilities here, either we have a bug in the
> scheduler or panfrost isn't using it correctly.
> 
> Does panfrost calls drm_sched_entity_flush() before it calls
> drm_sched_entity_fini()? (I don't have the driver source at hand at the
> moment).

Panfrost doesn't use drm_sched_entity_flush(), nor drm_sched_entity_flush().

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 12:59           ` Dmitry Osipenko
@ 2022-11-17 13:00             ` Dmitry Osipenko
  2022-11-17 13:11               ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17 13:00 UTC (permalink / raw)
  To: Christian König, Christian König, luben.tuikov,
	dri-devel, amd-gfx

On 11/17/22 15:59, Dmitry Osipenko wrote:
> On 11/17/22 15:55, Christian König wrote:
>> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>>> On 11/17/22 12:53, Christian König wrote:
>>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>>> Hi,
>>>>>
>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>> +{
>>>>>> +    struct drm_sched_job *job;
>>>>>> +    struct dma_fence *prev;
>>>>>> +
>>>>>> +    if (!entity->rq)
>>>>>> +        return;
>>>>>> +
>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>> +    entity->stopped = true;
>>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>> +
>>>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>>>> moment */
>>>>>> +    wait_for_completion(&entity->entity_idle);
>>>>> I'm always hitting lockup here using Panfrost driver on terminating
>>>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>>>
>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>
>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>> panfrost_job_close() is invoked.
>>>
>>> There are 1-2 jobs that are constantly scheduled and finished once in a
>>> few seconds after the lockup happens.
>>
>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>> pushing queued up stuff to the hw when the process which queued it is
>> killed. Is the process really killed or is that just some incorrect
>> handling?
> 
> It's actually 5-6 incomplete jobs of Xorg that are hanging when Xorg
> process is closed.
> 
> The two re-scheduled jobs are from sddm, so it's only the Xorg context
> that hangs.
> 
>> In other words I see two possibilities here, either we have a bug in the
>> scheduler or panfrost isn't using it correctly.
>>
>> Does panfrost calls drm_sched_entity_flush() before it calls
>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>> moment).
> 
> Panfrost doesn't use drm_sched_entity_flush(), nor drm_sched_entity_flush().

*nor drm_sched_entity_fini()

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 13:00             ` Dmitry Osipenko
@ 2022-11-17 13:11               ` Christian König
  2022-11-17 14:41                 ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-11-17 13:11 UTC (permalink / raw)
  To: Dmitry Osipenko, Christian König, luben.tuikov, dri-devel, amd-gfx

Am 17.11.22 um 14:00 schrieb Dmitry Osipenko:
> On 11/17/22 15:59, Dmitry Osipenko wrote:
>> On 11/17/22 15:55, Christian König wrote:
>>> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>>>> On 11/17/22 12:53, Christian König wrote:
>>>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>>>> Hi,
>>>>>>
>>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>>> +{
>>>>>>> +    struct drm_sched_job *job;
>>>>>>> +    struct dma_fence *prev;
>>>>>>> +
>>>>>>> +    if (!entity->rq)
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>> +    entity->stopped = true;
>>>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>> +
>>>>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>>>>> moment */
>>>>>>> +    wait_for_completion(&entity->entity_idle);
>>>>>> I'm always hitting lockup here using Panfrost driver on terminating
>>>>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>>>>
>>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>>
>>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>>> panfrost_job_close() is invoked.
>>>>
>>>> There are 1-2 jobs that are constantly scheduled and finished once in a
>>>> few seconds after the lockup happens.
>>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>>> pushing queued up stuff to the hw when the process which queued it is
>>> killed. Is the process really killed or is that just some incorrect
>>> handling?
>> It's actually 5-6 incomplete jobs of Xorg that are hanging when Xorg
>> process is closed.
>>
>> The two re-scheduled jobs are from sddm, so it's only the Xorg context
>> that hangs.
>>
>>> In other words I see two possibilities here, either we have a bug in the
>>> scheduler or panfrost isn't using it correctly.
>>>
>>> Does panfrost calls drm_sched_entity_flush() before it calls
>>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>>> moment).
>> Panfrost doesn't use drm_sched_entity_flush(), nor drm_sched_entity_flush().
> *nor drm_sched_entity_fini()

Well that would mean that this is *really* buggy! How do you then end up 
in drm_sched_entity_kill()? From drm_sched_entity_destroy()?

drm_sched_entity_flush() should be called from the flush callback from 
the file_operations structure of panfrost. See amdgpu_flush() and 
amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all 
entities of the process/file descriptor to be flushed out.

drm_sched_entity_fini() must be called before you free the memory of the 
entity structure, or otherwise we would run into a use-after-free.
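
To illustrate the pattern, a minimal sketch (not panfrost's actual code; the
example_* names, the entity array and NUM_ENTITIES are made-up placeholders,
only the DRM and scheduler calls are real):

static int example_drm_flush(struct file *file, fl_owner_t id)
{
	struct drm_file *file_priv = file->private_data;
	struct example_file_priv *priv = file_priv->driver_priv;
	long timeout = msecs_to_jiffies(1000);
	unsigned int i;

	/* Wait for all queued up work of this fd to be flushed out */
	for (i = 0; i < NUM_ENTITIES; i++)
		timeout = drm_sched_entity_flush(&priv->entity[i], timeout);

	return timeout >= 0 ? 0 : timeout;
}

That goes into the flush callback of the file_operations, and then
drm_sched_entity_fini() (or drm_sched_entity_destroy(), which flushes and
finishes) is called from postclose before the memory embedding the entity
is freed.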

Regards,
Christian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 13:11               ` Christian König
@ 2022-11-17 14:41                 ` Dmitry Osipenko
  2022-11-17 15:09                   ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17 14:41 UTC (permalink / raw)
  To: Christian König, Christian König, luben.tuikov,
	dri-devel, amd-gfx

On 11/17/22 16:11, Christian König wrote:
> Am 17.11.22 um 14:00 schrieb Dmitry Osipenko:
>> On 11/17/22 15:59, Dmitry Osipenko wrote:
>>> On 11/17/22 15:55, Christian König wrote:
>>>> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>>>>> On 11/17/22 12:53, Christian König wrote:
>>>>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>>>> +/* Remove the entity from the scheduler and kill all pending
>>>>>>>> jobs */
>>>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>>>> +{
>>>>>>>> +    struct drm_sched_job *job;
>>>>>>>> +    struct dma_fence *prev;
>>>>>>>> +
>>>>>>>> +    if (!entity->rq)
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>>> +    entity->stopped = true;
>>>>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>>> +
>>>>>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>>>>>> moment */
>>>>>>>> +    wait_for_completion(&entity->entity_idle);
>>>>>>> I'm always hitting lockup here using Panfrost driver on terminating
>>>>>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>>>>>
>>>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>>>
>>>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>>>> panfrost_job_close() is invoked.
>>>>>
>>>>> There are 1-2 jobs that are constantly scheduled and finished once
>>>>> in a
>>>>> few seconds after the lockup happens.
>>>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>>>> pushing queued up stuff to the hw when the process which queued it is
>>>> killed. Is the process really killed or is that just some incorrect
>>>> handling?
>>> It's actually 5-6 incomplete jobs of Xorg that are hanging when Xorg
>>> process is closed.
>>>
>>> The two re-scheduled jobs are from sddm, so it's only the Xorg context
>>> that hangs.
>>>
>>>> In other words I see two possibilities here, either we have a bug in
>>>> the
>>>> scheduler or panfrost isn't using it correctly.
>>>>
>>>> Does panfrost calls drm_sched_entity_flush() before it calls
>>>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>>>> moment).
>>> Panfrost doesn't use drm_sched_entity_flush(), nor
>>> drm_sched_entity_flush().
>> *nor drm_sched_entity_fini()
> 
> Well that would mean that this is *really* buggy! How do you then end up
> in drm_sched_entity_kill()? From drm_sched_entity_destroy()?

Yes, from drm_sched_entity_destroy().

> drm_sched_entity_flush() should be called from the flush callback from
> the file_operations structure of panfrost. See amdgpu_flush() and
> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> entities of the process/file descriptor to be flushed out.
> 
> drm_sched_entity_fini() must be called before you free the memory the
> entity structure or otherwise we would run into an use after free.

Right, drm_sched_entity_destroy() invokes these two functions and
Panfrost uses drm_sched_entity_destroy().
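
For reference, drm_sched_entity_destroy() is basically just the two calls
back to back (quoting roughly from memory of sched_entity.c):

void drm_sched_entity_destroy(struct drm_sched_entity *entity)
{
	drm_sched_entity_flush(entity, MAX_WAIT_SCHED_ENTITY_Q_EMPTY);
	drm_sched_entity_fini(entity);
}

So the flush does run at close time, just from postclose rather than from
the file_operations flush callback.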

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 14:41                 ` Dmitry Osipenko
@ 2022-11-17 15:09                   ` Christian König
  2022-11-17 15:11                     ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: Christian König @ 2022-11-17 15:09 UTC (permalink / raw)
  To: Dmitry Osipenko, Christian König, luben.tuikov, dri-devel, amd-gfx

Am 17.11.22 um 15:41 schrieb Dmitry Osipenko:
> [SNIP]
>> drm_sched_entity_flush() should be called from the flush callback from
>> the file_operations structure of panfrost. See amdgpu_flush() and
>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
>> entities of the process/file descriptor to be flushed out.
>>
>> drm_sched_entity_fini() must be called before you free the memory the
>> entity structure or otherwise we would run into an use after free.
> Right, drm_sched_entity_destroy() invokes these two functions and
> Panfrost uses drm_sched_entity_destroy().

Then I have no idea what's going wrong here.

The scheduler should trivially finish with the entity and call 
complete(&entity->entity_idle) in its main loop. No idea why this 
doesn't happen. Can you investigate?
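
For reference, the part of drm_sched_main() I mean looks roughly like this
(paraphrased from memory, the exact code on your branch may differ):

	entity = drm_sched_select_entity(sched);
	if (!entity)
		continue;

	sched_job = drm_sched_entity_pop_job(entity);
	if (!sched_job) {
		complete(&entity->entity_idle);
		continue;
	}

	/* hand the job over to the hw ring */
	fence = sched->ops->run_job(sched_job);
	complete(&entity->entity_idle);

So entity_idle should be signalled every time the main loop is done with
the entity.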

Regards,
Christian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 15:09                   ` Christian König
@ 2022-11-17 15:11                     ` Dmitry Osipenko
  2022-12-28 16:27                       ` Rob Clark
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2022-11-17 15:11 UTC (permalink / raw)
  To: Christian König, Christian König, luben.tuikov,
	dri-devel, amd-gfx

On 11/17/22 18:09, Christian König wrote:
> Am 17.11.22 um 15:41 schrieb Dmitry Osipenko:
>> [SNIP]
>>> drm_sched_entity_flush() should be called from the flush callback from
>>> the file_operations structure of panfrost. See amdgpu_flush() and
>>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
>>> entities of the process/file descriptor to be flushed out.
>>>
>>> drm_sched_entity_fini() must be called before you free the memory the
>>> entity structure or otherwise we would run into an use after free.
>> Right, drm_sched_entity_destroy() invokes these two functions and
>> Panfrost uses drm_sched_entity_destroy().
> 
> Than I have no idea what's going wrong here.
> 
> The scheduler should trivially finish with the entity and call
> complete(&entity->entity_idle) in it's main loop. No idea why this
> doesn't happen. Can you investigate?

I'll take a closer look. I hoped you might have a quick idea of what's wrong :)

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS
  2022-10-14  8:46 ` [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS Christian König
  2022-10-24  5:55   ` Luben Tuikov
@ 2022-12-21 15:34   ` Mike Lothian
  2022-12-21 15:47     ` Mike Lothian
  2022-12-21 15:52     ` Luben Tuikov
  1 sibling, 2 replies; 47+ messages in thread
From: Mike Lothian @ 2022-12-21 15:34 UTC (permalink / raw)
  To: Christian König
  Cc: luben.tuikov, amd-gfx, dri-devel, Christian König

On Fri, 14 Oct 2022 at 09:47, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Entirely remove the sync obj in the job.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
>  4 files changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index d45b86bcf7fa..0528c2b1db6e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
>                         dma_fence_put(old);
>                 }
>
> -               r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> +               r = amdgpu_sync_fence(&p->sync, fence);
>                 dma_fence_put(fence);
>                 if (r)
>                         return r;
> @@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
>                 return r;
>         }
>
> -       r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> +       r = amdgpu_sync_fence(&p->sync, fence);
>         if (r)
>                 goto error;
>
> @@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>         if (r)
>                 return r;
>
> -       r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
> +       r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
>         if (r)
>                 return r;
>
> @@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>                 if (r)
>                         return r;
>
> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>                 if (r)
>                         return r;
>         }
> @@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>                 if (r)
>                         return r;
>
> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>                 if (r)
>                         return r;
>         }
> @@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>         if (r)
>                 return r;
>
> -       r = amdgpu_sync_fence(&job->sync, vm->last_update);
> +       r = amdgpu_sync_fence(&p->sync, vm->last_update);
>         if (r)
>                 return r;
>
> @@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>  {
>         struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> -       struct amdgpu_job *leader = p->gang_leader;
>         struct amdgpu_bo_list_entry *e;
>         unsigned int i;
>         int r;
> @@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>
>                 sync_mode = amdgpu_bo_explicit_sync(bo) ?
>                         AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> -               r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
> +               r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
>                                      &fpriv->vm);
>                 if (r)
>                         return r;
>         }
>
> -       for (i = 0; i < p->gang_size - 1; ++i) {
> -               r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
> +       for (i = 0; i < p->gang_size; ++i) {
> +               r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
>                 if (r)
>                         return r;
>         }
> @@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>                 struct dma_fence *fence;
>
>                 fence = &p->jobs[i]->base.s_fence->scheduled;
> -               r = amdgpu_sync_fence(&leader->sync, fence);
> +               r = drm_sched_job_add_dependency(&leader->base, fence);
>                 if (r)
>                         goto error_cleanup;
>         }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> index cbaa19b2b8a3..207e801c24ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> @@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
>
>         unsigned                        num_post_deps;
>         struct amdgpu_cs_post_dep       *post_deps;
> +
> +       struct amdgpu_sync              sync;
>  };
>
>  int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ba98d65835b4..b8494c3b3b8a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>         (*job)->base.sched = &adev->rings[0]->sched;
>         (*job)->vm = vm;
>
> -       amdgpu_sync_create(&(*job)->sync);
>         amdgpu_sync_create(&(*job)->explicit_sync);
>         (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>         (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> @@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>
>         drm_sched_job_cleanup(s_job);
>
> -       amdgpu_sync_free(&job->sync);
>         amdgpu_sync_free(&job->explicit_sync);
> -
>         dma_fence_put(&job->hw_fence);
>  }
>
> @@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>                 drm_sched_job_cleanup(&job->base);
>
>         amdgpu_job_free_resources(job);
> -       amdgpu_sync_free(&job->sync);
>         amdgpu_sync_free(&job->explicit_sync);
>         if (job->gang_submit != &job->base.s_fence->scheduled)
>                 dma_fence_put(job->gang_submit);
> @@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
>  {
>         struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
>         struct amdgpu_job *job = to_amdgpu_job(sched_job);
> -       struct dma_fence *fence;
> +       struct dma_fence *fence = NULL;
>         int r;
>
> -       fence = amdgpu_sync_get_fence(&job->sync);
>         while (fence == NULL && job->vm && !job->vmid) {
>                 r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
>                 if (r)
> @@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>         job = to_amdgpu_job(sched_job);
>         finished = &job->base.s_fence->finished;
>
> -       BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
> -
>         trace_amdgpu_sched_run_job(job);
>
>         /* Skip job if VRAM is lost and never resubmit gangs */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 9c10b9bd0084..6558839fda03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
>  struct amdgpu_job {
>         struct drm_sched_job    base;
>         struct amdgpu_vm        *vm;
> -       struct amdgpu_sync      sync;
>         struct amdgpu_sync      explicit_sync;
>         struct dma_fence        hw_fence;
>         struct dma_fence        *gang_submit;
> --
> 2.25.1
>

Hi, I've been testing the Mesh shader benchmark in GravityMark, and I've
bisected my laptop freezing up and rebooting to this commit:

1728baa7e4e60054bf13dd9b1212d133cbd53b3f is the first bad commit
commit 1728baa7e4e60054bf13dd9b1212d133cbd53b3f
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Sep 29 14:04:01 2022 +0200

   drm/amdgpu: use scheduler dependencies for CS

   Entirely remove the sync obj in the job.

   Signed-off-by: Christian König <christian.koenig@amd.com>
   Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
   Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-11-christian.koenig@amd.com

drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
4 files changed, 13 insertions(+), 20 deletions(-)

This is on a PRIME system with a 6800M, running the latest Mesa

I tried reverting this patch, however it didn't revert cleanly, and my
attempt doesn't work; it only partially freezes up the system

Would you like me to open a bug for this on
https://gitlab.freedesktop.org/drm/amd/-/issues ?

Cheers

Mike

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS
  2022-12-21 15:34   ` Mike Lothian
@ 2022-12-21 15:47     ` Mike Lothian
  2022-12-21 15:52     ` Luben Tuikov
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Lothian @ 2022-12-21 15:47 UTC (permalink / raw)
  To: Christian König
  Cc: luben.tuikov, amd-gfx, dri-devel, Christian König

https://gitlab.freedesktop.org/drm/amd/-/issues/2309

On Wed, 21 Dec 2022 at 15:34, Mike Lothian <mike@fireburn.co.uk> wrote:
>
> On Fri, 14 Oct 2022 at 09:47, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Entirely remove the sync obj in the job.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
> >  4 files changed, 13 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index d45b86bcf7fa..0528c2b1db6e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
> >                         dma_fence_put(old);
> >                 }
> >
> > -               r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> > +               r = amdgpu_sync_fence(&p->sync, fence);
> >                 dma_fence_put(fence);
> >                 if (r)
> >                         return r;
> > @@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
> >                 return r;
> >         }
> >
> > -       r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> > +       r = amdgpu_sync_fence(&p->sync, fence);
> >         if (r)
> >                 goto error;
> >
> > @@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >         if (r)
> >                 return r;
> >
> > -       r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
> > +       r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
> >         if (r)
> >                 return r;
> >
> > @@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >                 if (r)
> >                         return r;
> >
> > -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> > +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
> >                 if (r)
> >                         return r;
> >         }
> > @@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >                 if (r)
> >                         return r;
> >
> > -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> > +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
> >                 if (r)
> >                         return r;
> >         }
> > @@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >         if (r)
> >                 return r;
> >
> > -       r = amdgpu_sync_fence(&job->sync, vm->last_update);
> > +       r = amdgpu_sync_fence(&p->sync, vm->last_update);
> >         if (r)
> >                 return r;
> >
> > @@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >  {
> >         struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> > -       struct amdgpu_job *leader = p->gang_leader;
> >         struct amdgpu_bo_list_entry *e;
> >         unsigned int i;
> >         int r;
> > @@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >
> >                 sync_mode = amdgpu_bo_explicit_sync(bo) ?
> >                         AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> > -               r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
> > +               r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
> >                                      &fpriv->vm);
> >                 if (r)
> >                         return r;
> >         }
> >
> > -       for (i = 0; i < p->gang_size - 1; ++i) {
> > -               r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
> > +       for (i = 0; i < p->gang_size; ++i) {
> > +               r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
> >                 if (r)
> >                         return r;
> >         }
> > @@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> >                 struct dma_fence *fence;
> >
> >                 fence = &p->jobs[i]->base.s_fence->scheduled;
> > -               r = amdgpu_sync_fence(&leader->sync, fence);
> > +               r = drm_sched_job_add_dependency(&leader->base, fence);
> >                 if (r)
> >                         goto error_cleanup;
> >         }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > index cbaa19b2b8a3..207e801c24ed 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > @@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
> >
> >         unsigned                        num_post_deps;
> >         struct amdgpu_cs_post_dep       *post_deps;
> > +
> > +       struct amdgpu_sync              sync;
> >  };
> >
> >  int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index ba98d65835b4..b8494c3b3b8a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> >         (*job)->base.sched = &adev->rings[0]->sched;
> >         (*job)->vm = vm;
> >
> > -       amdgpu_sync_create(&(*job)->sync);
> >         amdgpu_sync_create(&(*job)->explicit_sync);
> >         (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> >         (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> > @@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> >
> >         drm_sched_job_cleanup(s_job);
> >
> > -       amdgpu_sync_free(&job->sync);
> >         amdgpu_sync_free(&job->explicit_sync);
> > -
> >         dma_fence_put(&job->hw_fence);
> >  }
> >
> > @@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
> >                 drm_sched_job_cleanup(&job->base);
> >
> >         amdgpu_job_free_resources(job);
> > -       amdgpu_sync_free(&job->sync);
> >         amdgpu_sync_free(&job->explicit_sync);
> >         if (job->gang_submit != &job->base.s_fence->scheduled)
> >                 dma_fence_put(job->gang_submit);
> > @@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
> >  {
> >         struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
> >         struct amdgpu_job *job = to_amdgpu_job(sched_job);
> > -       struct dma_fence *fence;
> > +       struct dma_fence *fence = NULL;
> >         int r;
> >
> > -       fence = amdgpu_sync_get_fence(&job->sync);
> >         while (fence == NULL && job->vm && !job->vmid) {
> >                 r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
> >                 if (r)
> > @@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> >         job = to_amdgpu_job(sched_job);
> >         finished = &job->base.s_fence->finished;
> >
> > -       BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
> > -
> >         trace_amdgpu_sched_run_job(job);
> >
> >         /* Skip job if VRAM is lost and never resubmit gangs */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > index 9c10b9bd0084..6558839fda03 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > @@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
> >  struct amdgpu_job {
> >         struct drm_sched_job    base;
> >         struct amdgpu_vm        *vm;
> > -       struct amdgpu_sync      sync;
> >         struct amdgpu_sync      explicit_sync;
> >         struct dma_fence        hw_fence;
> >         struct dma_fence        *gang_submit;
> > --
> > 2.25.1
> >
>
> Hi, I've been testing the Mesh shader benchmark in GravityMark and
> I've bisected my laptop freezing up and rebooting, to this commit
>
> 1728baa7e4e60054bf13dd9b1212d133cbd53b3f is the first bad commit
> commit 1728baa7e4e60054bf13dd9b1212d133cbd53b3f
> Author: Christian König <christian.koenig@amd.com>
> Date:   Thu Sep 29 14:04:01 2022 +0200
>
>    drm/amdgpu: use scheduler dependencies for CS
>
>    Entirely remove the sync obj in the job.
>
>    Signed-off-by: Christian König <christian.koenig@amd.com>
>    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>    Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-11-christian.koenig@amd.com
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
> 4 files changed, 13 insertions(+), 20 deletions(-)
>
> This is on a prime system 6800M with the latest mesa
>
> I tried reverting this patch however it didn't revert cleanly, and my
> attempt doesn't work and only partially freezes up the system
>
> Would you like me to open a bug for this on
> https://gitlab.freedesktop.org/drm/amd/-/issues ?
>
> Cheers
>
> Mike

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS
  2022-12-21 15:34   ` Mike Lothian
  2022-12-21 15:47     ` Mike Lothian
@ 2022-12-21 15:52     ` Luben Tuikov
  2022-12-21 15:55       ` Mike Lothian
  1 sibling, 1 reply; 47+ messages in thread
From: Luben Tuikov @ 2022-12-21 15:52 UTC (permalink / raw)
  To: Mike Lothian, Christian König
  Cc: amd-gfx, dri-devel, Christian König

On 2022-12-21 10:34, Mike Lothian wrote:
> On Fri, 14 Oct 2022 at 09:47, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>>
>> Entirely remove the sync obj in the job.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
>>  4 files changed, 13 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index d45b86bcf7fa..0528c2b1db6e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
>>                         dma_fence_put(old);
>>                 }
>>
>> -               r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
>> +               r = amdgpu_sync_fence(&p->sync, fence);
>>                 dma_fence_put(fence);
>>                 if (r)
>>                         return r;
>> @@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
>>                 return r;
>>         }
>>
>> -       r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
>> +       r = amdgpu_sync_fence(&p->sync, fence);
>>         if (r)
>>                 goto error;
>>
>> @@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>         if (r)
>>                 return r;
>>
>> -       r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
>> +       r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
>>         if (r)
>>                 return r;
>>
>> @@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>                 if (r)
>>                         return r;
>>
>> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
>> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>>                 if (r)
>>                         return r;
>>         }
>> @@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>                 if (r)
>>                         return r;
>>
>> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
>> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
>>                 if (r)
>>                         return r;
>>         }
>> @@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>         if (r)
>>                 return r;
>>
>> -       r = amdgpu_sync_fence(&job->sync, vm->last_update);
>> +       r = amdgpu_sync_fence(&p->sync, vm->last_update);
>>         if (r)
>>                 return r;
>>
>> @@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>  {
>>         struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
>> -       struct amdgpu_job *leader = p->gang_leader;
>>         struct amdgpu_bo_list_entry *e;
>>         unsigned int i;
>>         int r;
>> @@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>
>>                 sync_mode = amdgpu_bo_explicit_sync(bo) ?
>>                         AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
>> -               r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
>> +               r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
>>                                      &fpriv->vm);
>>                 if (r)
>>                         return r;
>>         }
>>
>> -       for (i = 0; i < p->gang_size - 1; ++i) {
>> -               r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
>> +       for (i = 0; i < p->gang_size; ++i) {
>> +               r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
>>                 if (r)
>>                         return r;
>>         }
>> @@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>>                 struct dma_fence *fence;
>>
>>                 fence = &p->jobs[i]->base.s_fence->scheduled;
>> -               r = amdgpu_sync_fence(&leader->sync, fence);
>> +               r = drm_sched_job_add_dependency(&leader->base, fence);
>>                 if (r)
>>                         goto error_cleanup;
>>         }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
>> index cbaa19b2b8a3..207e801c24ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
>> @@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
>>
>>         unsigned                        num_post_deps;
>>         struct amdgpu_cs_post_dep       *post_deps;
>> +
>> +       struct amdgpu_sync              sync;
>>  };
>>
>>  int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index ba98d65835b4..b8494c3b3b8a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>         (*job)->base.sched = &adev->rings[0]->sched;
>>         (*job)->vm = vm;
>>
>> -       amdgpu_sync_create(&(*job)->sync);
>>         amdgpu_sync_create(&(*job)->explicit_sync);
>>         (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>         (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>> @@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>
>>         drm_sched_job_cleanup(s_job);
>>
>> -       amdgpu_sync_free(&job->sync);
>>         amdgpu_sync_free(&job->explicit_sync);
>> -
>>         dma_fence_put(&job->hw_fence);
>>  }
>>
>> @@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>                 drm_sched_job_cleanup(&job->base);
>>
>>         amdgpu_job_free_resources(job);
>> -       amdgpu_sync_free(&job->sync);
>>         amdgpu_sync_free(&job->explicit_sync);
>>         if (job->gang_submit != &job->base.s_fence->scheduled)
>>                 dma_fence_put(job->gang_submit);
>> @@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
>>  {
>>         struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
>>         struct amdgpu_job *job = to_amdgpu_job(sched_job);
>> -       struct dma_fence *fence;
>> +       struct dma_fence *fence = NULL;
>>         int r;
>>
>> -       fence = amdgpu_sync_get_fence(&job->sync);
>>         while (fence == NULL && job->vm && !job->vmid) {
>>                 r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
>>                 if (r)
>> @@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>         job = to_amdgpu_job(sched_job);
>>         finished = &job->base.s_fence->finished;
>>
>> -       BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
>> -
>>         trace_amdgpu_sched_run_job(job);
>>
>>         /* Skip job if VRAM is lost and never resubmit gangs */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> index 9c10b9bd0084..6558839fda03 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> @@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
>>  struct amdgpu_job {
>>         struct drm_sched_job    base;
>>         struct amdgpu_vm        *vm;
>> -       struct amdgpu_sync      sync;
>>         struct amdgpu_sync      explicit_sync;
>>         struct dma_fence        hw_fence;
>>         struct dma_fence        *gang_submit;
>> --
>> 2.25.1
>>
> 
> Hi, I've been testing the Mesh shader benchmark in GravityMark and
> I've bisected my laptop freezing up and rebooting, to this commit
> 
> 1728baa7e4e60054bf13dd9b1212d133cbd53b3f is the first bad commit
> commit 1728baa7e4e60054bf13dd9b1212d133cbd53b3f
> Author: Christian König <christian.koenig@amd.com>
> Date:   Thu Sep 29 14:04:01 2022 +0200
> 
>    drm/amdgpu: use scheduler dependencies for CS
> 
>    Entirely remove the sync obj in the job.
> 
>    Signed-off-by: Christian König <christian.koenig@amd.com>
>    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>    Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-11-christian.koenig@amd.com
> 
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
> 4 files changed, 13 insertions(+), 20 deletions(-)
> 
> This is on a prime system 6800M with the latest mesa
> 
> I tried reverting this patch however it didn't revert cleanly, and my
> attempt doesn't work and only partially freezes up the system
> 
> Would you like me to open a bug for this on
> https://gitlab.freedesktop.org/drm/amd/-/issues ?
> 

Hi Mike,

Could you try this patch:

https://lore.kernel.org/all/20221219104718.21677-1-christian.koenig@amd.com/

Regards,
Luben



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS
  2022-12-21 15:52     ` Luben Tuikov
@ 2022-12-21 15:55       ` Mike Lothian
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Lothian @ 2022-12-21 15:55 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: Christian König, amd-gfx, dri-devel, Christian König

On Wed, 21 Dec 2022 at 15:52, Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On 2022-12-21 10:34, Mike Lothian wrote:
> > On Fri, 14 Oct 2022 at 09:47, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >>
> >> Entirely remove the sync obj in the job.
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
> >>  4 files changed, 13 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> index d45b86bcf7fa..0528c2b1db6e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> @@ -426,7 +426,7 @@ static int amdgpu_cs_p2_dependencies(struct amdgpu_cs_parser *p,
> >>                         dma_fence_put(old);
> >>                 }
> >>
> >> -               r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> >> +               r = amdgpu_sync_fence(&p->sync, fence);
> >>                 dma_fence_put(fence);
> >>                 if (r)
> >>                         return r;
> >> @@ -448,7 +448,7 @@ static int amdgpu_syncobj_lookup_and_add(struct amdgpu_cs_parser *p,
> >>                 return r;
> >>         }
> >>
> >> -       r = amdgpu_sync_fence(&p->gang_leader->sync, fence);
> >> +       r = amdgpu_sync_fence(&p->sync, fence);
> >>         if (r)
> >>                 goto error;
> >>
> >> @@ -1108,7 +1108,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >>         if (r)
> >>                 return r;
> >>
> >> -       r = amdgpu_sync_fence(&job->sync, fpriv->prt_va->last_pt_update);
> >> +       r = amdgpu_sync_fence(&p->sync, fpriv->prt_va->last_pt_update);
> >>         if (r)
> >>                 return r;
> >>
> >> @@ -1119,7 +1119,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >>                 if (r)
> >>                         return r;
> >>
> >> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> >> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
> >>                 if (r)
> >>                         return r;
> >>         }
> >> @@ -1138,7 +1138,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >>                 if (r)
> >>                         return r;
> >>
> >> -               r = amdgpu_sync_fence(&job->sync, bo_va->last_pt_update);
> >> +               r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
> >>                 if (r)
> >>                         return r;
> >>         }
> >> @@ -1151,7 +1151,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >>         if (r)
> >>                 return r;
> >>
> >> -       r = amdgpu_sync_fence(&job->sync, vm->last_update);
> >> +       r = amdgpu_sync_fence(&p->sync, vm->last_update);
> >>         if (r)
> >>                 return r;
> >>
> >> @@ -1183,7 +1183,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
> >>  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >>  {
> >>         struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> >> -       struct amdgpu_job *leader = p->gang_leader;
> >>         struct amdgpu_bo_list_entry *e;
> >>         unsigned int i;
> >>         int r;
> >> @@ -1195,14 +1194,14 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >>
> >>                 sync_mode = amdgpu_bo_explicit_sync(bo) ?
> >>                         AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> >> -               r = amdgpu_sync_resv(p->adev, &leader->sync, resv, sync_mode,
> >> +               r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
> >>                                      &fpriv->vm);
> >>                 if (r)
> >>                         return r;
> >>         }
> >>
> >> -       for (i = 0; i < p->gang_size - 1; ++i) {
> >> -               r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]->sync);
> >> +       for (i = 0; i < p->gang_size; ++i) {
> >> +               r = amdgpu_sync_push_to_job(&p->sync, p->jobs[i]);
> >>                 if (r)
> >>                         return r;
> >>         }
> >> @@ -1248,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> >>                 struct dma_fence *fence;
> >>
> >>                 fence = &p->jobs[i]->base.s_fence->scheduled;
> >> -               r = amdgpu_sync_fence(&leader->sync, fence);
> >> +               r = drm_sched_job_add_dependency(&leader->base, fence);
> >>                 if (r)
> >>                         goto error_cleanup;
> >>         }
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> >> index cbaa19b2b8a3..207e801c24ed 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> >> @@ -75,6 +75,8 @@ struct amdgpu_cs_parser {
> >>
> >>         unsigned                        num_post_deps;
> >>         struct amdgpu_cs_post_dep       *post_deps;
> >> +
> >> +       struct amdgpu_sync              sync;
> >>  };
> >>
> >>  int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> index ba98d65835b4..b8494c3b3b8a 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> @@ -106,7 +106,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> >>         (*job)->base.sched = &adev->rings[0]->sched;
> >>         (*job)->vm = vm;
> >>
> >> -       amdgpu_sync_create(&(*job)->sync);
> >>         amdgpu_sync_create(&(*job)->explicit_sync);
> >>         (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> >>         (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> >> @@ -174,9 +173,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> >>
> >>         drm_sched_job_cleanup(s_job);
> >>
> >> -       amdgpu_sync_free(&job->sync);
> >>         amdgpu_sync_free(&job->explicit_sync);
> >> -
> >>         dma_fence_put(&job->hw_fence);
> >>  }
> >>
> >> @@ -202,7 +199,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
> >>                 drm_sched_job_cleanup(&job->base);
> >>
> >>         amdgpu_job_free_resources(job);
> >> -       amdgpu_sync_free(&job->sync);
> >>         amdgpu_sync_free(&job->explicit_sync);
> >>         if (job->gang_submit != &job->base.s_fence->scheduled)
> >>                 dma_fence_put(job->gang_submit);
> >> @@ -246,10 +242,9 @@ amdgpu_job_dependency(struct drm_sched_job *sched_job,
> >>  {
> >>         struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
> >>         struct amdgpu_job *job = to_amdgpu_job(sched_job);
> >> -       struct dma_fence *fence;
> >> +       struct dma_fence *fence = NULL;
> >>         int r;
> >>
> >> -       fence = amdgpu_sync_get_fence(&job->sync);
> >>         while (fence == NULL && job->vm && !job->vmid) {
> >>                 r = amdgpu_vmid_grab(job->vm, ring, job, &fence);
> >>                 if (r)
> >> @@ -273,8 +268,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> >>         job = to_amdgpu_job(sched_job);
> >>         finished = &job->base.s_fence->finished;
> >>
> >> -       BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
> >> -
> >>         trace_amdgpu_sched_run_job(job);
> >>
> >>         /* Skip job if VRAM is lost and never resubmit gangs */
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >> index 9c10b9bd0084..6558839fda03 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >> @@ -47,7 +47,6 @@ enum amdgpu_ib_pool_type;
> >>  struct amdgpu_job {
> >>         struct drm_sched_job    base;
> >>         struct amdgpu_vm        *vm;
> >> -       struct amdgpu_sync      sync;
> >>         struct amdgpu_sync      explicit_sync;
> >>         struct dma_fence        hw_fence;
> >>         struct dma_fence        *gang_submit;
> >> --
> >> 2.25.1
> >>
> >
> > Hi, I've been testing the Mesh shader benchmark in GravityMark and
> > I've bisected my laptop freezing up and rebooting, to this commit
> >
> > 1728baa7e4e60054bf13dd9b1212d133cbd53b3f is the first bad commit
> > commit 1728baa7e4e60054bf13dd9b1212d133cbd53b3f
> > Author: Christian König <christian.koenig@amd.com>
> > Date:   Thu Sep 29 14:04:01 2022 +0200
> >
> >    drm/amdgpu: use scheduler dependencies for CS
> >
> >    Entirely remove the sync obj in the job.
> >
> >    Signed-off-by: Christian König <christian.koenig@amd.com>
> >    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> >    Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-11-christian.koenig@amd.com
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++++++++++-----------
> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |  2 ++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  9 +--------
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 -
> > 4 files changed, 13 insertions(+), 20 deletions(-)
> >
> > This is on a prime system 6800M with the latest mesa
> >
> > I tried reverting this patch however it didn't revert cleanly, and my
> > attempt doesn't work and only partially freezes up the system
> >
> > Would you like me to open a bug for this on
> > https://gitlab.freedesktop.org/drm/amd/-/issues ?
> >
>
> Hi Mike,
>
> Could you try this patch:
>
> https://lore.kernel.org/all/20221219104718.21677-1-christian.koenig@amd.com/
>
> Regards,
> Luben
>
>

I still see the same issue with this patch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [12/13] drm/scheduler: rework entity flush, kill and fini
  2022-10-14  8:46 ` [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini Christian König
  2022-11-17  2:36   ` Dmitry Osipenko
@ 2022-12-26 16:01   ` Jonathan Marek
  1 sibling, 0 replies; 47+ messages in thread
From: Jonathan Marek @ 2022-12-26 16:01 UTC (permalink / raw)
  To: christian.koenig; +Cc: Rob Clark, dri-devel

This patch broke drm/msm in 6.2-rc1 for me. drm_sched_entity_destroy()
never returns when exiting a process from gdb if it has a drm/msm fd
open (if the fd is closed normally then the problem doesn't occur).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-11-17 15:11                     ` Dmitry Osipenko
@ 2022-12-28 16:27                       ` Rob Clark
  2022-12-28 16:52                         ` Rob Clark
  0 siblings, 1 reply; 47+ messages in thread
From: Rob Clark @ 2022-12-28 16:27 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Jonathan, Christian König, amd-gfx, luben.tuikov, dri-devel,
	Christian König

On Thu, Nov 17, 2022 at 7:12 AM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 11/17/22 18:09, Christian König wrote:
> > Am 17.11.22 um 15:41 schrieb Dmitry Osipenko:
> >> [SNIP]
> >>> drm_sched_entity_flush() should be called from the flush callback from
> >>> the file_operations structure of panfrost. See amdgpu_flush() and
> >>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> >>> entities of the process/file descriptor to be flushed out.
> >>>
> >>> drm_sched_entity_fini() must be called before you free the memory the
> >>> entity structure or otherwise we would run into an use after free.
> >> Right, drm_sched_entity_destroy() invokes these two functions and
> >> Panfrost uses drm_sched_entity_destroy().
> >
> > Then I have no idea what's going wrong here.
> >
> > The scheduler should trivially finish with the entity and call
> > complete(&entity->entity_idle) in its main loop. No idea why this
> > doesn't happen. Can you investigate?
>
> I'll take a closer look. Hoped you may have a quick idea of what's wrong :)
>

As Jonathan mentioned, the same thing is happening on msm.  I can
reproduce this by adding an assert in mesa (in this case, triggered
after 100 draws) and running an app under gdb.  After the assert is
hit, if I try to exit mesa, it hangs.

The problem is that we somehow call drm_sched_entity_kill() twice.
The first call completes, but the entity_idle completion is never
signalled again, so the second call hangs forever.
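
In completion terms the pattern is roughly this (a minimal sketch only,
not the actual drm_sched code; the toy_* names are made up):

#include <linux/completion.h>

/*
 * Toy model of the hang: the scheduler thread signals the idle
 * completion exactly once, the first kill consumes it, and nothing
 * re-arms it, so a second kill blocks in wait_for_completion() forever.
 * init_completion(&e->idle) is assumed to have run at entity setup.
 */
struct toy_entity {
	struct completion idle;	/* stands in for entity->entity_idle */
};

static void toy_sched_mark_idle(struct toy_entity *e)
{
	complete(&e->idle);	/* called once by the scheduler thread */
}

static void toy_entity_kill(struct toy_entity *e)
{
	/*
	 * First call: the completion is already signalled, returns at once.
	 * Second call: nobody calls complete() again, so this never returns.
	 */
	wait_for_completion(&e->idle);
}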

BR,
-R

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-12-28 16:27                       ` Rob Clark
@ 2022-12-28 16:52                         ` Rob Clark
  2023-01-01 18:29                           ` youling257
  0 siblings, 1 reply; 47+ messages in thread
From: Rob Clark @ 2022-12-28 16:52 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Jonathan, Christian König, amd-gfx, luben.tuikov, dri-devel,
	Christian König

On Wed, Dec 28, 2022 at 8:27 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 7:12 AM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
> >
> > On 11/17/22 18:09, Christian König wrote:
> > > Am 17.11.22 um 15:41 schrieb Dmitry Osipenko:
> > >> [SNIP]
> > >>> drm_sched_entity_flush() should be called from the flush callback from
> > >>> the file_operations structure of panfrost. See amdgpu_flush() and
> > >>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> > >>> entities of the process/file descriptor to be flushed out.
> > >>>
> > >>> drm_sched_entity_fini() must be called before you free the memory the
> > >>> entity structure or otherwise we would run into an use after free.
> > >> Right, drm_sched_entity_destroy() invokes these two functions and
> > >> Panfrost uses drm_sched_entity_destroy().
> > >
> > > Then I have no idea what's going wrong here.
> > >
> > > The scheduler should trivially finish with the entity and call
> > > complete(&entity->entity_idle) in its main loop. No idea why this
> > > doesn't happen. Can you investigate?
> >
> > I'll take a closer look. Hoped you may have a quick idea of what's wrong :)
> >
>
> As Jonathan mentioned, the same thing is happening on msm.  I can
> reproduce this by adding an assert in mesa (in this case, triggered
> after 100 draws) and running an app under gdb.  After the assert is
> hit, if I try to exit mesa, it hangs.
>
> The problem is that we somehow call drm_sched_entity_kill() twice.
> The first time completes, but now the entity_idle completion is no
> longer done, so the second call hangs forever.

Maybe we should:

------
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index fe09e5be79bd..3d7c671d05e3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -222,7 +226,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 {
        struct drm_gpu_scheduler *sched;
-       struct task_struct *last_user;
        long ret = timeout;

        if (!entity->rq)
@@ -244,12 +247,6 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
                                    drm_sched_entity_is_idle(entity));
        }

-       /* For killed process disable any more IBs enqueue right now */
-       last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
-       if ((!last_user || last_user == current->group_leader) &&
-           (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
-               drm_sched_entity_kill(entity);
-
        return ret;
 }
 EXPORT_SYMBOL(drm_sched_entity_flush);
----

Maybe there is a better fix, but special handling for SIGKILL seems
dubious to me (vs just relying on the drm device fd close path).  I
wonder if that code path was tested at all?
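
For reference, leaning purely on the fd close path would look roughly
like this in a driver. This is only a sketch under assumptions: the
my_* names are hypothetical and this is not the actual msm code; only
drm_sched_entity_destroy() and the .postclose hook are real APIs here.

#include <linux/slab.h>

#include <drm/drm_device.h>
#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical per-file context holding the scheduler entity. */
struct my_file_ctx {
	struct drm_sched_entity entity;
};

static void my_postclose(struct drm_device *dev, struct drm_file *file)
{
	struct my_file_ctx *ctx = file->driver_priv;

	/* Flushes outstanding jobs, then finishes the entity. */
	drm_sched_entity_destroy(&ctx->entity);
	kfree(ctx);
}

static const struct drm_driver my_driver = {
	/* ...other fields and callbacks elided... */
	.postclose = my_postclose,
};

A driver wired like this no longer depends on drm_sched_entity_flush()
noticing SIGKILL at all.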

BR,
-R

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2022-12-28 16:52                         ` Rob Clark
@ 2023-01-01 18:29                           ` youling257
  2023-01-02  9:24                             ` Dmitry Osipenko
  0 siblings, 1 reply; 47+ messages in thread
From: youling257 @ 2023-01-01 18:29 UTC (permalink / raw)
  To: robdclark
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, luben.tuikov,
	amd-gfx, dmitry.osipenko, christian.koenig

Linux 6.2-rc1 has a memory leak on amdgpu; git bisect identifies "drm/scheduler: rework entity flush, kill and fini" as the first bad commit.
git bisect start
# status: waiting for both good and bad commits
# good: [eb7081409f94a9a8608593d0fb63a1aa3d6f95d8] Linux 6.1-rc6
git bisect good eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
# status: waiting for bad commit, 1 good commit known
# bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag 'amd-drm-next-6.2-2022-12-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
# good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag 'amd-drm-next-6.2-2022-11-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
# bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag 'amd-drm-next-6.2-2022-11-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
# bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
# good: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-: make global attrib_cb actually global
git bisect good 78a43c7e3b2ff5aed1809f93b4f87a418355789e
# bad: [611fc22c9e5e13276c819a7f7a7d19b794bbed1a] drm/arm/hdlcd: remove calls to drm_mode_config_cleanup()
git bisect bad 611fc22c9e5e13276c819a7f7a7d19b794bbed1a
# bad: [a8d9621b9fc67957b3de334cc1b5f47570fb90a0] drm/ingenic: Don't set struct drm_driver.output_poll_changed
git bisect bad a8d9621b9fc67957b3de334cc1b5f47570fb90a0
# good: [2cf9886e281678ae9ee57e24a656749071d543bb] drm/scheduler: remove drm_sched_dependency_optimized
git bisect good 2cf9886e281678ae9ee57e24a656749071d543bb
# bad: [8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d] Merge drm/drm-next into drm-misc-next
git bisect bad 8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d
# bad: [47078311b8efebdefd5b3b2f87e2b02b14f49c66] drm/ingenic: Fix missing platform_driver_unregister() call in ingenic_drm_init()
git bisect bad 47078311b8efebdefd5b3b2f87e2b02b14f49c66
# bad: [a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a] drm/scheduler: rename dependency callback into prepare_job
git bisect bad a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a
# bad: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework entity flush, kill and fini
git bisect bad 2fdb8a8f07c2f1353770a324fd19b8114e4329ac
# first bad commit: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework entity flush, kill and fini

@Rob Clark, I tested your patch and it fixed my problem.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2023-01-01 18:29                           ` youling257
@ 2023-01-02  9:24                             ` Dmitry Osipenko
  2023-01-02 14:17                                 ` youling 257
  0 siblings, 1 reply; 47+ messages in thread
From: Dmitry Osipenko @ 2023-01-02  9:24 UTC (permalink / raw)
  To: youling257, robdclark
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, luben.tuikov,
	amd-gfx, christian.koenig

On 1/1/23 21:29, youling257 wrote:
> Linux 6.2-rc1 has memory leak on amdgpu, git bisect bad commit is "drm/scheduler: rework entity flush, kill and fini".
> git bisect start
> # status: waiting for both good and bad commits
> # good: [eb7081409f94a9a8608593d0fb63a1aa3d6f95d8] Linux 6.1-rc6
> git bisect good eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
> # status: waiting for bad commit, 1 good commit known
> # bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag 'amd-drm-next-6.2-2022-12-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
> # good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag 'amd-drm-next-6.2-2022-11-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
> # bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag 'amd-drm-next-6.2-2022-11-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
> # bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
> # good: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-: make global attrib_cb actually global
> git bisect good 78a43c7e3b2ff5aed1809f93b4f87a418355789e
> # bad: [611fc22c9e5e13276c819a7f7a7d19b794bbed1a] drm/arm/hdlcd: remove calls to drm_mode_config_cleanup()
> git bisect bad 611fc22c9e5e13276c819a7f7a7d19b794bbed1a
> # bad: [a8d9621b9fc67957b3de334cc1b5f47570fb90a0] drm/ingenic: Don't set struct drm_driver.output_poll_changed
> git bisect bad a8d9621b9fc67957b3de334cc1b5f47570fb90a0
> # good: [2cf9886e281678ae9ee57e24a656749071d543bb] drm/scheduler: remove drm_sched_dependency_optimized
> git bisect good 2cf9886e281678ae9ee57e24a656749071d543bb
> # bad: [8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d] Merge drm/drm-next into drm-misc-next
> git bisect bad 8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d
> # bad: [47078311b8efebdefd5b3b2f87e2b02b14f49c66] drm/ingenic: Fix missing platform_driver_unregister() call in ingenic_drm_init()
> git bisect bad 47078311b8efebdefd5b3b2f87e2b02b14f49c66
> # bad: [a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a] drm/scheduler: rename dependency callback into prepare_job
> git bisect bad a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a
> # bad: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework entity flush, kill and fini
> git bisect bad 2fdb8a8f07c2f1353770a324fd19b8114e4329ac
> # first bad commit: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework entity flush, kill and fini
> 
> @Rob Clark, i test your patch fixed my problem.

linux-next has already carried the fix for a couple of weeks. It will
land in 6.2-rc once the drm-fixes branch is synced with 6.2.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2023-01-02  9:24                             ` Dmitry Osipenko
@ 2023-01-02 14:17                                 ` youling 257
  0 siblings, 0 replies; 47+ messages in thread
From: youling 257 @ 2023-01-02 14:17 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, luben.tuikov,
	amd-gfx, christian.koenig

which patch?

2023-01-02 17:24 GMT+08:00, Dmitry Osipenko <dmitry.osipenko@collabora.com>:
> On 1/1/23 21:29, youling257 wrote:
>> Linux 6.2-rc1 has memory leak on amdgpu, git bisect bad commit is
>> "drm/scheduler: rework entity flush, kill and fini".
>> git bisect start
>> # status: waiting for both good and bad commits
>> # good: [eb7081409f94a9a8608593d0fb63a1aa3d6f95d8] Linux 6.1-rc6
>> git bisect good eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
>> # status: waiting for bad commit, 1 good commit known
>> # bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag
>> 'amd-drm-next-6.2-2022-12-07' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
>> # good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag
>> 'amd-drm-next-6.2-2022-11-04' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
>> # bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag
>> 'amd-drm-next-6.2-2022-11-18' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
>> # bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag
>> 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc
>> into drm-next
>> git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
>> # good: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-:
>> make global attrib_cb actually global
>> git bisect good 78a43c7e3b2ff5aed1809f93b4f87a418355789e
>> # bad: [611fc22c9e5e13276c819a7f7a7d19b794bbed1a] drm/arm/hdlcd: remove
>> calls to drm_mode_config_cleanup()
>> git bisect bad 611fc22c9e5e13276c819a7f7a7d19b794bbed1a
>> # bad: [a8d9621b9fc67957b3de334cc1b5f47570fb90a0] drm/ingenic: Don't set
>> struct drm_driver.output_poll_changed
>> git bisect bad a8d9621b9fc67957b3de334cc1b5f47570fb90a0
>> # good: [2cf9886e281678ae9ee57e24a656749071d543bb] drm/scheduler: remove
>> drm_sched_dependency_optimized
>> git bisect good 2cf9886e281678ae9ee57e24a656749071d543bb
>> # bad: [8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d] Merge drm/drm-next into
>> drm-misc-next
>> git bisect bad 8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d
>> # bad: [47078311b8efebdefd5b3b2f87e2b02b14f49c66] drm/ingenic: Fix missing
>> platform_driver_unregister() call in ingenic_drm_init()
>> git bisect bad 47078311b8efebdefd5b3b2f87e2b02b14f49c66
>> # bad: [a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a] drm/scheduler: rename
>> dependency callback into prepare_job
>> git bisect bad a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a
>> # bad: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework
>> entity flush, kill and fini
>> git bisect bad 2fdb8a8f07c2f1353770a324fd19b8114e4329ac
>> # first bad commit: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac]
>> drm/scheduler: rework entity flush, kill and fini
>>
>> @Rob Clark, i test your patch fixed my problem.
>
> The linux-next already carried the fix for a couple weeks. It will land
> to 6.2-rc once drm-fixes branch will be synced with the 6.2.
>
> --
> Best regards,
> Dmitry
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
@ 2023-01-02 14:17                                 ` youling 257
  0 siblings, 0 replies; 47+ messages in thread
From: youling 257 @ 2023-01-02 14:17 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, robdclark,
	luben.tuikov, amd-gfx, christian.koenig

which patch?

2023-01-02 17:24 GMT+08:00, Dmitry Osipenko <dmitry.osipenko@collabora.com>:
> On 1/1/23 21:29, youling257 wrote:
>> Linux 6.2-rc1 has memory leak on amdgpu, git bisect bad commit is
>> "drm/scheduler: rework entity flush, kill and fini".
>> git bisect start
>> # status: waiting for both good and bad commits
>> # good: [eb7081409f94a9a8608593d0fb63a1aa3d6f95d8] Linux 6.1-rc6
>> git bisect good eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
>> # status: waiting for bad commit, 1 good commit known
>> # bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag
>> 'amd-drm-next-6.2-2022-12-07' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
>> # good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag
>> 'amd-drm-next-6.2-2022-11-04' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
>> # bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag
>> 'amd-drm-next-6.2-2022-11-18' of
>> https://gitlab.freedesktop.org/agd5f/linux into drm-next
>> git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
>> # bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag
>> 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc
>> into drm-next
>> git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
>> # good: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-:
>> make global attrib_cb actually global
>> git bisect good 78a43c7e3b2ff5aed1809f93b4f87a418355789e
>> # bad: [611fc22c9e5e13276c819a7f7a7d19b794bbed1a] drm/arm/hdlcd: remove
>> calls to drm_mode_config_cleanup()
>> git bisect bad 611fc22c9e5e13276c819a7f7a7d19b794bbed1a
>> # bad: [a8d9621b9fc67957b3de334cc1b5f47570fb90a0] drm/ingenic: Don't set
>> struct drm_driver.output_poll_changed
>> git bisect bad a8d9621b9fc67957b3de334cc1b5f47570fb90a0
>> # good: [2cf9886e281678ae9ee57e24a656749071d543bb] drm/scheduler: remove
>> drm_sched_dependency_optimized
>> git bisect good 2cf9886e281678ae9ee57e24a656749071d543bb
>> # bad: [8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d] Merge drm/drm-next into
>> drm-misc-next
>> git bisect bad 8e4e4c2f53ffcb0ef746dc3b87ce1a57c5c94c7d
>> # bad: [47078311b8efebdefd5b3b2f87e2b02b14f49c66] drm/ingenic: Fix missing
>> platform_driver_unregister() call in ingenic_drm_init()
>> git bisect bad 47078311b8efebdefd5b3b2f87e2b02b14f49c66
>> # bad: [a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a] drm/scheduler: rename
>> dependency callback into prepare_job
>> git bisect bad a82f30b04c6aaefe62cbbfd297e1bb23435b6b3a
>> # bad: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac] drm/scheduler: rework
>> entity flush, kill and fini
>> git bisect bad 2fdb8a8f07c2f1353770a324fd19b8114e4329ac
>> # first bad commit: [2fdb8a8f07c2f1353770a324fd19b8114e4329ac]
>> drm/scheduler: rework entity flush, kill and fini
>>
>> @Rob Clark, i test your patch fixed my problem.
>
> The linux-next already carried the fix for a couple weeks. It will land
> to 6.2-rc once drm-fixes branch will be synced with the 6.2.
>
> --
> Best regards,
> Dmitry
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
  2023-01-02 14:17                                 ` youling 257
@ 2023-01-02 15:08                                   ` Dmitry Osipenko
  -1 siblings, 0 replies; 47+ messages in thread
From: Dmitry Osipenko @ 2023-01-02 15:08 UTC (permalink / raw)
  To: youling 257
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, luben.tuikov,
	amd-gfx, christian.koenig

On 1/2/23 17:17, youling 257 wrote:
> which patch?

https://patchwork.freedesktop.org/patch/512652/

I applied it to next-fixes

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
@ 2023-01-02 15:08                                   ` Dmitry Osipenko
  0 siblings, 0 replies; 47+ messages in thread
From: Dmitry Osipenko @ 2023-01-02 15:08 UTC (permalink / raw)
  To: youling 257
  Cc: jonathan, ckoenig.leichtzumerken, dri-devel, robdclark,
	luben.tuikov, amd-gfx, christian.koenig

On 1/2/23 17:17, youling 257 wrote:
> which patch?

https://patchwork.freedesktop.org/patch/512652/

I applied it to next-fixes

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-01-03  8:32 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-14  8:46 Fixes for scheduler hang when killing a process Christian König
2022-10-14  8:46 ` [PATCH 01/13] drm/scheduler: fix fence ref counting Christian König
2022-10-25  3:23   ` Luna Nova
2022-10-25 11:35     ` Christian König
2022-10-14  8:46 ` [PATCH 02/13] drm/scheduler: add drm_sched_job_add_resv_dependencies Christian König
2022-10-14  8:46 ` [PATCH 03/13] drm/amdgpu: use drm_sched_job_add_resv_dependencies for moves Christian König
2022-10-14  8:46 ` [PATCH 04/13] drm/amdgpu: drop the fence argument from amdgpu_vmid_grab Christian König
2022-10-14  8:46 ` [PATCH 05/13] drm/amdgpu: drop amdgpu_sync " Christian König
2022-10-23  1:25   ` Luben Tuikov
2022-10-24 10:54     ` Christian König
2022-10-14  8:46 ` [PATCH 06/13] drm/amdgpu: cleanup scheduler job initialization Christian König
2022-10-23  1:50   ` Luben Tuikov
2022-10-14  8:46 ` [PATCH 07/13] drm/amdgpu: move explicit sync check into the CS Christian König
2022-10-14  8:46 ` [PATCH 08/13] drm/amdgpu: use scheduler depenencies for VM updates Christian König
2022-10-24  5:50   ` Luben Tuikov
2022-10-14  8:46 ` [PATCH 09/13] drm/amdgpu: use scheduler depenencies for UVD msgs Christian König
2022-10-24  5:53   ` Luben Tuikov
2022-10-14  8:46 ` [PATCH 10/13] drm/amdgpu: use scheduler depenencies for CS Christian König
2022-10-24  5:55   ` Luben Tuikov
2022-12-21 15:34   ` Mike Lothian
2022-12-21 15:47     ` Mike Lothian
2022-12-21 15:52     ` Luben Tuikov
2022-12-21 15:55       ` Mike Lothian
2022-10-14  8:46 ` [PATCH 11/13] drm/scheduler: remove drm_sched_dependency_optimized Christian König
2022-10-14  8:46 ` [PATCH 12/13] drm/scheduler: rework entity flush, kill and fini Christian König
2022-11-17  2:36   ` Dmitry Osipenko
2022-11-17  9:53     ` Christian König
2022-11-17 12:47       ` Dmitry Osipenko
2022-11-17 12:55         ` Christian König
2022-11-17 12:59           ` Dmitry Osipenko
2022-11-17 13:00             ` Dmitry Osipenko
2022-11-17 13:11               ` Christian König
2022-11-17 14:41                 ` Dmitry Osipenko
2022-11-17 15:09                   ` Christian König
2022-11-17 15:11                     ` Dmitry Osipenko
2022-12-28 16:27                       ` Rob Clark
2022-12-28 16:52                         ` Rob Clark
2023-01-01 18:29                           ` youling257
2023-01-02  9:24                             ` Dmitry Osipenko
2023-01-02 14:17                               ` youling 257
2023-01-02 14:17                                 ` youling 257
2023-01-02 15:08                                 ` Dmitry Osipenko
2023-01-02 15:08                                   ` Dmitry Osipenko
2022-12-26 16:01   ` [12/13] " Jonathan Marek
2022-10-14  8:46 ` [PATCH 13/13] drm/scheduler: rename dependency callback into prepare_job Christian König
2022-10-23  1:35 ` Fixes for scheduler hang when killing a process Luben Tuikov
2022-10-24  7:00 ` Luben Tuikov
