* [PATCH 0/5] Patch series to implement guilty ctx/entity for SRIOV TDR
@ 2017-05-01  7:22 Monk Liu
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Sometimes user space submits a bad command stream to the kernel, and with the
current scheme the gpu-scheduler always resubmits all un-signaled jobs to the
hw ring after a GPU reset, so such a bad submission will trigger GPU hangs
indefinitely.

This patch series implements a scheme called guilty context, which avoids
resubmitting malicious jobs and invalidates the context behind them. That way
regular applications can continue to run, and other VFs suffer smaller GPU
time reductions.

The guilty charge is simple: if a job hangs more times than the threshold
allows, we consider it guilty, invalidate the context behind it, and pop out
all jobs in its entities on each scheduler. The next IOCTL on this ctx handle
will get an -ENODEV error, so the UMD knows this context was released by the
driver due to its malicious command submission.

Monk Liu (5):
  drm/amdgpu:keep ctx alive till all job finished
  drm/amdgpu:some modifications in amdgpu_ctx
  drm/amdgpu:Impl guilty ctx feature for sriov TDR
  drm/amdgpu:change sriov_gpu_reset interface
  drm/amdgpu:sriov TDR only recover hang ring

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 26 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 39 ++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 43 ++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 30 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  2 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  3 +
 13 files changed, 209 insertions(+), 47 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-05-01  7:22   ` Monk Liu
       [not found]     ` <1493623371-32614-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-05-01  7:22   ` [PATCH 2/5] drm/amdgpu:some modifications in amdgpu_ctx Monk Liu
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

For the TDR guilty context feature we need to access ctx/s_entity field
members through the sched job pointer, so the ctx must be kept alive until
all jobs from it have signaled.

Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
 6 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e330009..8e031d6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -760,10 +760,12 @@ struct amdgpu_ib {
 	uint32_t			flags;
 };
 
+struct amdgpu_ctx;
+
 extern const struct amd_sched_backend_ops amdgpu_sched_ops;
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-		     struct amdgpu_job **job, struct amdgpu_vm *vm);
+		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
 int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
 			     struct amdgpu_job **job);
 
@@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
 
 struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
 int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
+struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
 
 uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
 			      struct fence *fence);
@@ -1129,6 +1132,7 @@ struct amdgpu_job {
 	struct amdgpu_sync	sync;
 	struct amdgpu_ib	*ibs;
 	struct fence		*fence; /* the hw fence */
+	struct amdgpu_ctx *ctx;
 	uint32_t		preamble_status;
 	uint32_t		num_ibs;
 	void			*owner;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 699f5fe..267fb65 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
 		}
 	}
 
-	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
+	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
 	if (ret)
 		goto free_all_kdata;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index b4bbbb3..81438af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -25,6 +25,13 @@
 #include <drm/drmP.h>
 #include "amdgpu.h"
 
+struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
+{
+	if (ctx)
+		kref_get(&ctx->refcount);
+	return ctx;
+}
+
 static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
 {
 	unsigned i, j;
@@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
 					  rq, amdgpu_sched_jobs);
 		if (r)
 			goto failed;
+
+		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 690ef3d..208da11 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-		     struct amdgpu_job **job, struct amdgpu_vm *vm)
+		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
 {
 	size_t size = sizeof(struct amdgpu_job);
 
@@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
 	(*job)->vm = vm;
 	(*job)->ibs = (void *)&(*job)[1];
 	(*job)->num_ibs = num_ibs;
+	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
 
 	amdgpu_sync_create(&(*job)->sync);
 
@@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
 {
 	int r;
 
-	r = amdgpu_job_alloc(adev, 1, job, NULL);
+	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
 	if (r)
 		return r;
 
@@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
 static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
 {
 	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
+	struct amdgpu_ctx *ctx = job->ctx;
+
+	if (ctx)
+		amdgpu_ctx_put(ctx);
 
 	fence_put(job->fence);
 	amdgpu_sync_free(&job->sync);
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 6f4e31f..9100ca8 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
 	if (!amd_sched_entity_is_initialized(sched, entity))
 		return;
 
-	/**
-	 * The client will not queue more IBs during this fini, consume existing
-	 * queued IBs
-	*/
-	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
-
 	amd_sched_rq_remove_entity(rq, entity);
 	kfifo_free(&entity->job_queue);
 }
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index 8cb41d3..ccbbcb0 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -49,6 +49,7 @@ struct amd_sched_entity {
 
 	struct fence			*dependency;
 	struct fence_cb			cb;
+	bool *ptr_guilty;
 };
 
 /**
-- 
2.7.4


* [PATCH 2/5] drm/amdgpu:some modifications in amdgpu_ctx
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-05-01  7:22   ` [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished Monk Liu
@ 2017-05-01  7:22   ` Monk Liu
  2017-05-01  7:22   ` [PATCH 3/5] drm/amdgpu:Impl guilty ctx feature for sriov TDR Monk Liu
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1. introduce a member field "guilty" in amdgpu_ctx
2. free the ctx if it is found guilty in amdgpu_ctx_get
3. change the interface of amdgpu_ctx_get:
   return -ENODEV if a live ctx is detected guilty
   return -EINVAL if the ctx handle is invalid
   the amdgpu_ctx* is returned through the @out parameter

This way the UMD can differentiate between a guilty ctx and a wrong ctx
handle.

Change-Id: Ib9cd3230e982b72ceb3b7b2fb14e48c32f63493f
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 24 +++++++++++-------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 30 ++++++++++++++++++++++++------
 3 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 8e031d6..6312cc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -793,6 +793,7 @@ struct amdgpu_ctx {
 	struct fence            **fences;
 	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
 	bool preamble_presented;
+	bool guilty; /* if this context is considered guilty so will be removed  */
 };
 
 struct amdgpu_ctx_mgr {
@@ -802,7 +803,7 @@ struct amdgpu_ctx_mgr {
 	struct idr		ctx_handles;
 };
 
-struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
+int amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id, struct amdgpu_ctx **out);
 int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
 struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 267fb65..baa90dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -154,11 +154,9 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
 	if (!chunk_array)
 		return -ENOMEM;
 
-	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
-	if (!p->ctx) {
-		ret = -EINVAL;
+	ret = amdgpu_ctx_get(fpriv, cs->in.ctx_id, &p->ctx);
+	if (ret)
 		goto free_chunk;
-	}
 
 	/* get chunks */
 	chunk_array_user = (uint64_t __user *)(uintptr_t)(cs->in.chunks);
@@ -1026,9 +1024,9 @@ static int amdgpu_cs_dependencies(struct amdgpu_device *adev,
 			if (r)
 				return r;
 
-			ctx = amdgpu_ctx_get(fpriv, deps[j].ctx_id);
-			if (ctx == NULL)
-				return -EINVAL;
+			r = amdgpu_ctx_get(fpriv, deps[j].ctx_id, &ctx);
+			if (r)
+				return r;
 
 			fence = amdgpu_ctx_get_fence(ctx, ring,
 						     deps[j].handle);
@@ -1164,9 +1162,9 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
 	if (r)
 		return r;
 
-	ctx = amdgpu_ctx_get(filp->driver_priv, wait->in.ctx_id);
-	if (ctx == NULL)
-		return -EINVAL;
+	r = amdgpu_ctx_get(filp->driver_priv, wait->in.ctx_id, &ctx);
+	if (r)
+		return r;
 
 	fence = amdgpu_ctx_get_fence(ctx, ring, wait->in.handle);
 	if (IS_ERR(fence))
@@ -1208,9 +1206,9 @@ static struct fence *amdgpu_cs_get_fence(struct amdgpu_device *adev,
 	if (r)
 		return ERR_PTR(r);
 
-	ctx = amdgpu_ctx_get(filp->driver_priv, user->ctx_id);
-	if (ctx == NULL)
-		return ERR_PTR(-EINVAL);
+	r = amdgpu_ctx_get(filp->driver_priv, user->ctx_id, &ctx);
+	if (r)
+		return ERR_PTR(r);
 
 	fence = amdgpu_ctx_get_fence(ctx, ring, user->seq_no);
 	amdgpu_ctx_put(ctx);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 81438af..3947f63 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -146,6 +146,9 @@ static int amdgpu_ctx_free(struct amdgpu_fpriv *fpriv, uint32_t id)
 	mutex_lock(&mgr->lock);
 	ctx = idr_find(&mgr->ctx_handles, id);
 	if (ctx) {
+		if (ctx->guilty)
+			DRM_ERROR("Guilty context:%u detected! ID removed\n", id);
+
 		idr_remove(&mgr->ctx_handles, id);
 		kref_put(&ctx->refcount, amdgpu_ctx_do_release);
 		mutex_unlock(&mgr->lock);
@@ -222,22 +225,37 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 	return r;
 }
 
-struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id)
+int amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id, struct amdgpu_ctx **out)
 {
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
+	int r = -EINVAL;
 
-	if (!fpriv)
-		return NULL;
+	if (!fpriv || !out)
+		return r;
 
 	mgr = &fpriv->ctx_mgr;
 
 	mutex_lock(&mgr->lock);
 	ctx = idr_find(&mgr->ctx_handles, id);
-	if (ctx)
-		kref_get(&ctx->refcount);
+	if (ctx) {
+		if (!ctx->guilty) {
+			kref_get(&ctx->refcount);
+			*out = ctx;
+			r = 0;
+		} else {
+			DRM_ERROR("Guilty context:%u detected! handler removed\n", id);
+			/* if a guilty context is still alive but used by an upper client,
+			 * destroy it manually and return -ENODEV so libdrm_amdgpu re-creates it.
+			 */
+			idr_remove(&mgr->ctx_handles, id);
+			kref_put(&ctx->refcount, amdgpu_ctx_do_release);
+			r = -ENODEV;
+		}
+	}
+
 	mutex_unlock(&mgr->lock);
-	return ctx;
+	return r;
 }
 
 int amdgpu_ctx_put(struct amdgpu_ctx *ctx)
-- 
2.7.4


* [PATCH 3/5] drm/amdgpu:Impl guilty ctx feature for sriov TDR
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-05-01  7:22   ` [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished Monk Liu
  2017-05-01  7:22   ` [PATCH 2/5] drm/amdgpu:some modifications in amdgpu_ctx Monk Liu
@ 2017-05-01  7:22   ` Monk Liu
  2017-05-01  7:22   ` [PATCH 4/5] drm/amdgpu:change sriov_gpu_reset interface Monk Liu
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

If a job hangs for more iterations than the threshold allows, we consider the
entity/ctx behind it guilty and kick out all of its jobs/entities before
sched_recovery.

With this feature the driver won't suffer infinite job resubmission when a
job always causes a GPU hang.

A new module parameter "hang_limit" is introduced as the threshold that lets
the driver control how many times a job may hang before we tag its context
guilty.

Change-Id: I6c08ba126b985232e9b67530c304f09a5aeee78d
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 15 ++++-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 81 ++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +
 7 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 6312cc5..f3c3c36 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -111,6 +111,7 @@ extern int amdgpu_prim_buf_per_se;
 extern int amdgpu_pos_buf_per_se;
 extern int amdgpu_cntl_sb_buf_per_se;
 extern int amdgpu_param_buf_per_se;
+extern int amdgpu_hang_limit;
 
 #define AMDGPU_DEFAULT_GTT_SIZE_MB		3072ULL /* 3GB by default */
 #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	        3000
@@ -1148,7 +1149,7 @@ struct amdgpu_job {
 	/* user fence handling */
 	uint64_t		uf_addr;
 	uint64_t		uf_sequence;
-
+	atomic_t karma;
 };
 #define to_amdgpu_job(sched_job)		\
 		container_of((sched_job), struct amdgpu_job, base)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 3947f63..0083153 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -64,7 +64,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
 		if (r)
 			goto failed;
 
-		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
+		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel context/entity doesn't have ptr_guilty assigned */
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5573792..0c51fb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2619,6 +2619,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, bool voluntary)
 		if (!ring || !ring->sched.thread)
 			continue;
 
+		amd_sched_job_kickout_guilty(&ring->sched);
 		amd_sched_job_recovery(&ring->sched);
 		kthread_unpark(ring->sched.thread);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 416908a..b999990 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se = 0;
 int amdgpu_pos_buf_per_se = 0;
 int amdgpu_cntl_sb_buf_per_se = 0;
 int amdgpu_param_buf_per_se = 0;
+int amdgpu_hang_limit = 0;
 
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
 module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
@@ -237,6 +238,8 @@ module_param_named(cntl_sb_buf_per_se, amdgpu_cntl_sb_buf_per_se, int, 0444);
 MODULE_PARM_DESC(param_buf_per_se, "the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx)");
 module_param_named(param_buf_per_se, amdgpu_param_buf_per_se, int, 0444);
 
+MODULE_PARM_DESC(hang_limit, "how many times a job may hang before its context is marked guilty (default 0)");
+module_param_named(hang_limit, amdgpu_hang_limit, int, 0444);
 
 static const struct pci_device_id pciidlist[] = {
 #ifdef  CONFIG_DRM_AMDGPU_SI
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 208da11..0209c96 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -186,9 +186,22 @@ static struct fence *amdgpu_job_run(struct amd_sched_job *sched_job)
 	return fence;
 }
 
+static void amdgpu_invalidate_job(struct amd_sched_job *sched_job)
+{
+	struct amdgpu_job *job;
+
+	if (!sched_job || !sched_job->s_entity->ptr_guilty)
+		return;
+
+	job = to_amdgpu_job(sched_job);
+	if (atomic_inc_return(&job->karma) > amdgpu_hang_limit)
+		*sched_job->s_entity->ptr_guilty = true;
+}
+
 const struct amd_sched_backend_ops amdgpu_sched_ops = {
 	.dependency = amdgpu_job_dependency,
 	.run_job = amdgpu_job_run,
 	.timedout_job = amdgpu_job_timedout,
-	.free_job = amdgpu_job_free_cb
+	.free_job = amdgpu_job_free_cb,
+	.invalidate_job = amdgpu_invalidate_job,
 };
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 9100ca8..f671b1a 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -373,11 +373,87 @@ static void amd_sched_job_timedout(struct work_struct *work)
 	job->sched->ops->timedout_job(job);
 }
 
+static inline bool amd_sched_check_guilty(struct amd_sched_entity *entity)
+{
+	if (entity && entity->ptr_guilty != NULL)
+		return *entity->ptr_guilty;
+
+	/* a NULL ptr_guilty means the job belongs to a kernel entity */
+	return false;
+}
+
+void amd_sched_job_kickout_guilty(struct amd_gpu_scheduler *sched)
+{
+	struct amd_sched_job *s_job, *s_tmp;
+	struct amd_sched_rq *rq;
+	struct list_head guilty_head;
+	int i;
+
+	INIT_LIST_HEAD(&guilty_head);
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry_safe(s_job, s_tmp, &sched->ring_mirror_list, node)
+		if (amd_sched_check_guilty(s_job->s_entity))
+			list_move(&s_job->node, &guilty_head);
+	spin_unlock(&sched->job_list_lock);
+
+	/* since free_job may cause wait/schedule, we'd better run it without the spinlock
+	 * TODO: maybe we can just remove all spinlock protection in this routine because
+	 * it is invoked prior to job_recovery and kthread_unpark
+	 */
+	list_for_each_entry_safe(s_job, s_tmp, &guilty_head, node) {
+		/* the guilty job is fake-signaled; release the cs_wait on it
+		 *
+		 * TODO: we need to add more flags appended to FENCE_SIGNAL and
+		 * change the behavior of fence_wait to indicate that this fence's
+		 * signal is fake and due to gpu-reset, so the UMD knows that
+		 * CS_SUBMIT failed and its context is invalid.
+		 */
+
+		amd_sched_fence_finished(s_job->s_fence);
+		fence_put(&s_job->s_fence->finished);
+	}
+
+	/* Go through all entities and signal all jobs from the guilty */
+	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++) {
+		struct amd_sched_entity *entity, *e_tmp;
+
+		if (i == AMD_SCHED_PRIORITY_KERNEL)
+			continue; /* the kernel entity is never guilty and can't be kicked out */
+
+		rq = &sched->sched_rq[i];
+		spin_lock(&rq->lock);
+		list_for_each_entry_safe(entity, e_tmp, &rq->entities, list) {
+			struct amd_sched_job *guilty_job;
+
+			if (amd_sched_check_guilty(entity)) {
+				spin_lock(&entity->queue_lock);
+				while (!kfifo_is_empty(&entity->job_queue)) {
+					kfifo_out(&entity->job_queue, &guilty_job, sizeof(guilty_job));
+					spin_unlock(&entity->queue_lock);
+					amd_sched_fence_finished(guilty_job->s_fence);
+					fence_put(&guilty_job->s_fence->finished);
+					spin_lock(&entity->queue_lock);
+				}
+				spin_unlock(&entity->queue_lock);
+
+				list_del_init(&entity->list);
+				if (rq->current_entity == entity)
+					rq->current_entity = NULL;
+			}
+		}
+		spin_unlock(&rq->lock);
+	}
+}
+
 void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 {
-	struct amd_sched_job *s_job;
+	struct amd_sched_job *s_job, *first;
 
 	spin_lock(&sched->job_list_lock);
+	/* for the first job, consider it as guilty */
+	first = list_first_entry_or_null(&sched->ring_mirror_list,
+			struct amd_sched_job, node);
+
 	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) {
 		if (s_job->s_fence->parent &&
 		    fence_remove_callback(s_job->s_fence->parent,
@@ -388,6 +464,9 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 	}
 	atomic_set(&sched->hw_rq_count, 0);
 	spin_unlock(&sched->job_list_lock);
+
+	/* this will mark all entities behind this job's context as guilty */
+	sched->ops->invalidate_job(first);
 }
 
 void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index ccbbcb0..ab644a6 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -106,6 +106,7 @@ struct amd_sched_backend_ops {
 	struct fence *(*run_job)(struct amd_sched_job *sched_job);
 	void (*timedout_job)(struct amd_sched_job *sched_job);
 	void (*free_job)(struct amd_sched_job *sched_job);
+	void (*invalidate_job)(struct amd_sched_job *sched_job);
 };
 
 enum amd_sched_priority {
@@ -159,4 +160,5 @@ int amd_sched_job_init(struct amd_sched_job *job,
 		       void *owner);
 void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched);
 void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
+void amd_sched_job_kickout_guilty(struct amd_gpu_scheduler *sched);
 #endif
-- 
2.7.4


* [PATCH 4/5] drm/amdgpu:change sriov_gpu_reset interface
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-05-01  7:22   ` [PATCH 3/5] drm/amdgpu:Impl guilty ctx feature for sriov TDR Monk Liu
@ 2017-05-01  7:22   ` Monk Liu
  2017-05-01  7:22   ` [PATCH 5/5] drm/amdgpu:sriov TDR only recover hang ring Monk Liu
  2017-05-01 14:53   ` [PATCH 0/5] Patch serials to implement guilty ctx/entity for SRIOV TDR Christian König
  5 siblings, 0 replies; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1. add a new parameter @job to indicate which ring hangs
2. return early from amdgpu_gpu_reset in the SRIOV case

With this patch we reset & recover only the particular hung ring instead of
overkilling all rings.

Change-Id: I37dff5c16ef161e3a8425434516ebb285fbe6600
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 6 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0c51fb5..157d023 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2528,6 +2528,7 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
  * amdgpu_sriov_gpu_reset - reset the asic
  *
  * @adev: amdgpu device pointer
+ * @job: the job that led to the hang
  * @voluntary: if this reset is requested by guest.
  *             (true means by guest and false means by HYPERVISOR )
  *
@@ -2535,9 +2536,9 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
  * for SRIOV case.
  * Returns 0 for success or an error on failure.
  */
-int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, bool voluntary)
+int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job, bool voluntary)
 {
-	int i, r = 0;
+	int i, j, r = 0;
 	int resched;
 	struct amdgpu_bo *bo, *tmp;
 	struct amdgpu_ring *ring;
@@ -2652,7 +2653,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
 	bool need_full_reset;
 
 	if (amdgpu_sriov_vf(adev))
-		return amdgpu_sriov_gpu_reset(adev, true);
+		return -EINVAL;
 
 	if (!amdgpu_check_soft_reset(adev)) {
 		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0209c96..2cc7c63 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -36,7 +36,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
 		  job->base.sched->name,
 		  atomic_read(&job->ring->fence_drv.last_seq),
 		  job->ring->fence_drv.sync_seq);
-	amdgpu_gpu_reset(job->adev);
+
+	if (amdgpu_sriov_vf(job->adev))
+		amdgpu_sriov_gpu_reset(job->adev, job, true);
+	else
+		amdgpu_gpu_reset(job->adev);
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index a8ed162..2b641eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -97,7 +97,7 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
 int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
-int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, bool voluntary);
+int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job, bool voluntary);
 int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index e967a7b..4d71db3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -243,7 +243,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, false);
+	amdgpu_sriov_gpu_reset(adev, NULL, false);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index f0d64f1..a604bd1 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -514,7 +514,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, false);
+	amdgpu_sriov_gpu_reset(adev, NULL, false);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
-- 
2.7.4


* [PATCH 5/5] drm/amdgpu:sriov TDR only recover hang ring
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-05-01  7:22   ` [PATCH 4/5] drm/amdgpu:change sriov_gpu_reset interface Monk Liu
@ 2017-05-01  7:22   ` Monk Liu
  2017-05-01 14:53   ` [PATCH 0/5] Patch serials to implement guilty ctx/entity for SRIOV TDR Christian König
  5 siblings, 0 replies; 28+ messages in thread
From: Monk Liu @ 2017-05-01  7:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Instead of resetting/recovering all rings, we can work only on the
particular ring that is detected to hang.

Change-Id: Ie9de78819e1567e9f001d3593c9c52f749137c32
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 ++++++++++++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 157d023..4dbd121 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2551,19 +2551,26 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job, b
 	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
 
-	/* block scheduler */
-	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-		ring = adev->rings[i];
+	/* start from the ring that triggered the GPU hang */
+	j = job ? job->ring->idx : 0;
 
+	/* block scheduler */
+	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
+		ring = adev->rings[i % AMDGPU_MAX_RINGS];
 		if (!ring || !ring->sched.thread)
 			continue;
 
 		kthread_park(ring->sched.thread);
+
+		if (job && job->ring->idx != i)
+			continue;
+
+		/* only do job_reset on the hang ring if @job not NULL */
 		amd_sched_hw_job_reset(&ring->sched);
-	}
 
-	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
-	amdgpu_fence_driver_force_completion(adev);
+		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
+		amdgpu_fence_driver_force_completion_ring(ring);
+	}
 
 	/* request to take full control of GPU before re-initialization  */
 	if (voluntary)
@@ -2615,12 +2622,26 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job, b
 	}
 	fence_put(fence);
 
+	/* before recovery and unpark, kickout guilty for every rings */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-		struct amdgpu_ring *ring = adev->rings[i];
+		ring = adev->rings[i];
+
 		if (!ring || !ring->sched.thread)
 			continue;
 
 		amd_sched_job_kickout_guilty(&ring->sched);
+	}
+
+	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
+		ring = adev->rings[i % AMDGPU_MAX_RINGS];
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		if (job && job->ring->idx != i) {
+			kthread_unpark(ring->sched.thread);
+			continue;
+		}
+
 		amd_sched_job_recovery(&ring->sched);
 		kthread_unpark(ring->sched.thread);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 5772ef2..de4c851 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -541,6 +541,12 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
 	}
 }
 
+void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
+{
+	if (ring)
+		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
+}
+
 /*
  * Common fence implementation
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 5786cc3..2acaac6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -76,6 +76,7 @@ struct amdgpu_fence_driver {
 int amdgpu_fence_driver_init(struct amdgpu_device *adev);
 void amdgpu_fence_driver_fini(struct amdgpu_device *adev);
 void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev);
+void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring);
 
 int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 				  unsigned num_hw_submission);
-- 
2.7.4


* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]     ` <1493623371-32614-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-05-01 14:47       ` Christian König
       [not found]         ` <a4605d10-b1f7-7fee-63c9-829d612c63aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-01 14:47 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 01.05.2017 at 09:22, Monk Liu wrote:
> for the TDR guilty context feature, we need to access the ctx/s_entity
> field member through the sched job pointer, so the ctx must be kept
> alive until all jobs from it have signaled.

NAK, that is unnecessary and quite dangerous.

Instead, we have the status field designed into the fences, which should
be checked for that.

Regards,
Christian.

>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>   6 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e330009..8e031d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>   	uint32_t			flags;
>   };
>   
> +struct amdgpu_ctx;
> +
>   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   			     struct amdgpu_job **job);
>   
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>   
>   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>   
>   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>   			      struct fence *fence);
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>   	struct amdgpu_sync	sync;
>   	struct amdgpu_ib	*ibs;
>   	struct fence		*fence; /* the hw fence */
> +	struct amdgpu_ctx *ctx;
>   	uint32_t		preamble_status;
>   	uint32_t		num_ibs;
>   	void			*owner;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 699f5fe..267fb65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   		}
>   	}
>   
> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>   	if (ret)
>   		goto free_all_kdata;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index b4bbbb3..81438af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -25,6 +25,13 @@
>   #include <drm/drmP.h>
>   #include "amdgpu.h"
>   
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
> +{
> +	if (ctx)
> +		kref_get(&ctx->refcount);
> +	return ctx;
> +}
> +
>   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   {
>   	unsigned i, j;
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +
> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>   	}
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 690ef3d..208da11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>   {
>   	size_t size = sizeof(struct amdgpu_job);
>   
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>   	(*job)->vm = vm;
>   	(*job)->ibs = (void *)&(*job)[1];
>   	(*job)->num_ibs = num_ibs;
> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>   
>   	amdgpu_sync_create(&(*job)->sync);
>   
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   {
>   	int r;
>   
> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>   	if (r)
>   		return r;
>   
> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>   {
>   	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
> +	struct amdgpu_ctx *ctx = job->ctx;
> +
> +	if (ctx)
> +		amdgpu_ctx_put(ctx);
>   
>   	fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 6f4e31f..9100ca8 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>   	if (!amd_sched_entity_is_initialized(sched, entity))
>   		return;
>   
> -	/**
> -	 * The client will not queue more IBs during this fini, consume existing
> -	 * queued IBs
> -	*/
> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> -
>   	amd_sched_rq_remove_entity(rq, entity);
>   	kfifo_free(&entity->job_queue);
>   }
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index 8cb41d3..ccbbcb0 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct fence			*dependency;
>   	struct fence_cb			cb;
> +	bool *ptr_guilty;
>   };
>   
>   /**



* Re: [PATCH 0/5] Patch serials to implement guilty ctx/entity for SRIOV TDR
       [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-05-01  7:22   ` [PATCH 5/5] drm/amdgpu:sriov TDR only recover hang ring Monk Liu
@ 2017-05-01 14:53   ` Christian König
  5 siblings, 0 replies; 28+ messages in thread
From: Christian König @ 2017-05-01 14:53 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 01.05.2017 at 09:22, Monk Liu wrote:
> sometimes user space submits a bad command stream to the kernel, and with
> the current scheme the gpu-scheduler will always resubmit all un-signaled
> jobs to the hw ring after a gpu reset, so this bad submit will trigger a
> GPU hang indefinitely.
>
> this patch series implements a system called guilty context, which avoids
> submitting malicious jobs and invalidates the related context behind them;
> that way regular applications can continue to run, and other VFs also
> suffer less GPU time reduction.
>
> the guilty charge is simple: if a job hangs more times than the threshold,
> we consider it guilty, invalidate the context behind it, and pop out all
> jobs in its entities on each scheduler. the next IOCTL on this CTX handle
> will get an -ENODEV error, so the UMD knows this context was released by
> the driver due to its malicious command submission.

NAK to the whole approach. That would require that CTXs are kept alive
until all jobs in them are finished, which is a NO-GO for resource
management.

A process which is killed should release all of its resources as fast
as possible and not block on the last GPU command to finish
(only what the last GPU command is using should be kept alive).

Instead, build the whole thing around the fence status: if a job is
guilty, we note that inside the fence's status field.

Then, on the context query, we check the status of the pending fences
and can judge whether a context is guilty or not.
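
The fence-status approach suggested here can be sketched roughly as
follows. This is an illustrative toy model only: the struct names and
the error value are hypothetical stand-ins, not the real amdgpu/dma-fence
types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a fence carrying a per-job status/error field. */
struct toy_fence {
	bool signaled;
	int  status;	/* 0 = ok, negative = error (e.g. guilty job) */
};

#define TOY_PENDING 4

/* Pending fences belonging to one context's entity. */
struct toy_entity_fences {
	struct toy_fence *fences[TOY_PENDING];
};

/* Called from TDR when a job is judged guilty: mark its fence. */
static void toy_fence_set_guilty(struct toy_fence *f)
{
	f->status = -1;	/* stand-in for an ECANCELED/ETIME style code */
}

/* Context query: the context is guilty iff any pending fence carries an
 * error status. No ctx lifetime extension is needed, because the fences
 * themselves are refcounted and owned elsewhere. */
static bool toy_ctx_is_guilty(const struct toy_entity_fences *ef)
{
	for (size_t i = 0; i < TOY_PENDING; ++i) {
		const struct toy_fence *f = ef->fences[i];

		if (f && f->status < 0)
			return true;
	}
	return false;
}
```

The key design point is that the guilty information lives in the fence,
whose lifetime is already managed independently of the context.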

Regards,
Christian.

>
> Monk Liu (5):
>    drm/amdgpu:keep ctx alive till all job finished
>    drm/amdgpu:some modifications in amdgpu_ctx
>    drm/amdgpu:Impl guilty ctx feature for sriov TDR
>    drm/amdgpu:change sriov_gpu_reset interface
>    drm/amdgpu:sriov TDR only recover hang ring
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 26 ++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 39 ++++++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 43 ++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 30 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  2 +-
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  3 +
>   13 files changed, 209 insertions(+), 47 deletions(-)
>


* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]         ` <a4605d10-b1f7-7fee-63c9-829d612c63aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03  3:30           ` Liu, Monk
       [not found]             ` <DM5PR12MB16102746DB02DBE8ED69DA9C84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  3:30 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

1. This is necessary; otherwise, how can I access the entity pointer after a job has timed out? And why is it dangerous?
2. What is the status field in the fences you were referring to? I need to judge whether it can satisfy my requirement.



-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Monday, May 01, 2017 10:48 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

On 01.05.2017 at 09:22, Monk Liu wrote:
> for the TDR guilty context feature, we need to access the ctx/s_entity
> field member through the sched job pointer, so the ctx must be kept
> alive until all jobs from it have signaled.

NAK, that is unnecessary and quite dangerous.

Instead, we have the status field designed into the fences, which should be checked for that.

Regards,
Christian.

>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>   6 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e330009..8e031d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>   	uint32_t			flags;
>   };
>   
> +struct amdgpu_ctx;
> +
>   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
> +amdgpu_ctx *ctx);
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   			     struct amdgpu_job **job);
>   
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>   
>   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>   
>   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>   			      struct fence *fence);
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>   	struct amdgpu_sync	sync;
>   	struct amdgpu_ib	*ibs;
>   	struct fence		*fence; /* the hw fence */
> +	struct amdgpu_ctx *ctx;
>   	uint32_t		preamble_status;
>   	uint32_t		num_ibs;
>   	void			*owner;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 699f5fe..267fb65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   		}
>   	}
>   
> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>   	if (ret)
>   		goto free_all_kdata;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index b4bbbb3..81438af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -25,6 +25,13 @@
>   #include <drm/drmP.h>
>   #include "amdgpu.h"
>   
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
> +	if (ctx)
> +		kref_get(&ctx->refcount);
> +	return ctx;
> +}
> +
>   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   {
>   	unsigned i, j;
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +
> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity 
> +doesn't have ptr_guilty */
>   	}
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 690ef3d..208da11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
> +amdgpu_ctx *ctx)
>   {
>   	size_t size = sizeof(struct amdgpu_job);
>   
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>   	(*job)->vm = vm;
>   	(*job)->ibs = (void *)&(*job)[1];
>   	(*job)->num_ibs = num_ibs;
> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>   
>   	amdgpu_sync_create(&(*job)->sync);
>   
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   {
>   	int r;
>   
> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>   	if (r)
>   		return r;
>   
> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>   {
>   	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, 
> base);
> +	struct amdgpu_ctx *ctx = job->ctx;
> +
> +	if (ctx)
> +		amdgpu_ctx_put(ctx);
>   
>   	fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c 
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 6f4e31f..9100ca8 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>   	if (!amd_sched_entity_is_initialized(sched, entity))
>   		return;
>   
> -	/**
> -	 * The client will not queue more IBs during this fini, consume existing
> -	 * queued IBs
> -	*/
> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> -
>   	amd_sched_rq_remove_entity(rq, entity);
>   	kfifo_free(&entity->job_queue);
>   }
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h 
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index 8cb41d3..ccbbcb0 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct fence			*dependency;
>   	struct fence_cb			cb;
> +	bool *ptr_guilty;
>   };
>   
>   /**



* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]             ` <DM5PR12MB16102746DB02DBE8ED69DA9C84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03  3:57               ` Liu, Monk
       [not found]                 ` <DM5PR12MB16107F8A55F3EF0B1C834FC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-05-03  8:58               ` Christian König
  1 sibling, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  3:57 UTC (permalink / raw)
  To: Liu, Monk, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

@Christian,

The thing is I need to mark all entities behind this timed-out job as guilty, so I use a member in the entity named "ptr_guilty" that points to
the member in amdgpu_ctx named "guilty". That way, reading *entity->ptr_guilty tells you whether this entity is "invalid", and setting *entity->ptr_guilty
lets you mark all entities belonging to the context as "invalid".

The above logic guarantees that we can kick out all guilty entities in the entity kfifo and also block every IOCTL whose ctx handle points to this
guilty context, and we only recover the other jobs/entities/contexts after the scheduler is unparked.

If you reject the patch that keeps the ctx alive until all jobs have signaled, please give me a solution that satisfies the above logic.
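
For clarity, the ptr_guilty wiring described above can be sketched like
this. These are toy structures mirroring the idea, not the actual amdgpu
types: every entity of a context points at the single per-context guilty
flag, so flagging any one entity flags them all.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RINGS 3

struct toy_entity {
	bool *ptr_guilty;	/* points into the owning context */
};

struct toy_ctx {
	bool guilty;
	struct toy_entity rings[NUM_RINGS];
};

/* Wire every per-ring entity to the context's shared guilty flag,
 * mirroring what amdgpu_ctx_init does in the patch. */
static void toy_ctx_init(struct toy_ctx *ctx)
{
	ctx->guilty = false;
	for (int i = 0; i < NUM_RINGS; ++i)
		ctx->rings[i].ptr_guilty = &ctx->guilty;
}

/* Called when a job on one entity exceeds the hang threshold. */
static void toy_entity_mark_guilty(struct toy_entity *e)
{
	*e->ptr_guilty = true;
}

/* Checked when kicking jobs out of an entity's kfifo, and by IOCTLs
 * that carry a handle to this context. */
static bool toy_entity_is_guilty(const struct toy_entity *e)
{
	return *e->ptr_guilty;
}
```

Note that this sketch is exactly why the lifetime question matters:
dereferencing ptr_guilty is only safe while the owning context is alive.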

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:31 AM
To: Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

1. This is necessary; otherwise, how can I access the entity pointer after a job has timed out? And why is it dangerous?
2. What is the status field in the fences you were referring to? I need to judge whether it can satisfy my requirement.



-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de]
Sent: Monday, May 01, 2017 10:48 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

On 01.05.2017 at 09:22, Monk Liu wrote:
> for the TDR guilty context feature, we need to access the ctx/s_entity
> field member through the sched job pointer, so the ctx must be kept
> alive until all jobs from it have signaled.

NAK, that is unnecessary and quite dangerous.

Instead, we have the status field designed into the fences, which should be checked for that.

Regards,
Christian.

>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>   6 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e330009..8e031d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>   	uint32_t			flags;
>   };
>   
> +struct amdgpu_ctx;
> +
>   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
> +amdgpu_ctx *ctx);
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   			     struct amdgpu_job **job);
>   
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>   
>   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>   
>   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>   			      struct fence *fence);
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>   	struct amdgpu_sync	sync;
>   	struct amdgpu_ib	*ibs;
>   	struct fence		*fence; /* the hw fence */
> +	struct amdgpu_ctx *ctx;
>   	uint32_t		preamble_status;
>   	uint32_t		num_ibs;
>   	void			*owner;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 699f5fe..267fb65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   		}
>   	}
>   
> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>   	if (ret)
>   		goto free_all_kdata;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index b4bbbb3..81438af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -25,6 +25,13 @@
>   #include <drm/drmP.h>
>   #include "amdgpu.h"
>   
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
> +	if (ctx)
> +		kref_get(&ctx->refcount);
> +	return ctx;
> +}
> +
>   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   {
>   	unsigned i, j;
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +
> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity 
> +doesn't have ptr_guilty */
>   	}
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 690ef3d..208da11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
> +amdgpu_ctx *ctx)
>   {
>   	size_t size = sizeof(struct amdgpu_job);
>   
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>   	(*job)->vm = vm;
>   	(*job)->ibs = (void *)&(*job)[1];
>   	(*job)->num_ibs = num_ibs;
> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>   
>   	amdgpu_sync_create(&(*job)->sync);
>   
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   {
>   	int r;
>   
> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>   	if (r)
>   		return r;
>   
> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>   {
>   	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, 
> base);
> +	struct amdgpu_ctx *ctx = job->ctx;
> +
> +	if (ctx)
> +		amdgpu_ctx_put(ctx);
>   
>   	fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 6f4e31f..9100ca8 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>   	if (!amd_sched_entity_is_initialized(sched, entity))
>   		return;
>   
> -	/**
> -	 * The client will not queue more IBs during this fini, consume existing
> -	 * queued IBs
> -	*/
> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> -
>   	amd_sched_rq_remove_entity(rq, entity);
>   	kfifo_free(&entity->job_queue);
>   }
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index 8cb41d3..ccbbcb0 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct fence			*dependency;
>   	struct fence_cb			cb;
> +	bool *ptr_guilty;
>   };
>   
>   /**



* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                 ` <DM5PR12MB16107F8A55F3EF0B1C834FC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03  4:54                   ` Zhou, David(ChunMing)
       [not found]                     ` <MWHPR1201MB020601F998809FC8F0527723B4160-3iK1xFAIwjrUF/YbdlDdgWrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Zhou, David(ChunMing) @ 2017-05-03  4:54 UTC (permalink / raw)
  To: Liu, Monk

You can add ctx as a field of the job but not take a reference to it; when you try to use ctx, just check whether ctx == NULL.

Another stupid method:
Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the job->fence->fence_context with ctx->ring[].entity->fence_context. If no match is found, then the ctx has been freed; otherwise you can do your things for this ctx.
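
The lookup idea above can be sketched as follows. This is a toy model
under stated assumptions: a plain array stands in for the kernel idr, and
all struct and function names are hypothetical, not the real amdgpu API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CTX   4
#define NUM_RINGS 2

/* Each context's entities own distinct fence_context values. */
struct toy_ctx {
	unsigned long ring_fence_context[NUM_RINGS];
};

/* Toy stand-in for the ctx manager's idr: a NULL slot means the
 * context has already been freed. */
struct toy_ctx_mgr {
	struct toy_ctx *ctx[MAX_CTX];
};

/* Walk all live contexts and match a job's fence_context against each
 * entity's fence_context. Returns the owning context, or NULL if the
 * context is gone (in which case there is nothing left to mark guilty). */
static struct toy_ctx *toy_find_ctx(struct toy_ctx_mgr *mgr,
				    unsigned long fence_context)
{
	for (size_t i = 0; i < MAX_CTX; ++i) {
		struct toy_ctx *c = mgr->ctx[i];

		if (!c)
			continue;
		for (size_t r = 0; r < NUM_RINGS; ++r)
			if (c->ring_fence_context[r] == fence_context)
				return c;
	}
	return NULL;
}
```

The design trade-off is that this avoids extending the ctx lifetime at
the cost of an O(contexts x rings) walk at TDR time, which is acceptable
since TDR is a rare, slow path anyway.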

Regards,
David Zhou

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:57 AM
To: Liu, Monk <Monk.Liu@amd.com>; Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

@Christian,

The thing is I need to mark all entities behind this timed-out job as guilty, so I use a member in the entity named "ptr_guilty" that points to the member in amdgpu_ctx named "guilty". That way, reading *entity->ptr_guilty tells you whether this entity is "invalid", and setting *entity->ptr_guilty lets you mark all entities belonging to the context as "invalid".

The above logic guarantees that we can kick out all guilty entities in the entity kfifo and also block every IOCTL whose ctx handle points to this guilty context, and we only recover the other jobs/entities/contexts after the scheduler is unparked.

If you reject the patch that keeps the ctx alive until all jobs have signaled, please give me a solution that satisfies the above logic.

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:31 AM
To: Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

1. This is necessary; otherwise, how can I access the entity pointer after a job has timed out? And why is it dangerous?
2. What is the status field in the fences you were referring to? I need to judge whether it can satisfy my requirement.



-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de]
Sent: Monday, May 01, 2017 10:48 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

On 01.05.2017 at 09:22, Monk Liu wrote:
> for the TDR guilty context feature, we need to access the ctx/s_entity
> field member through the sched job pointer, so the ctx must be kept
> alive until all jobs from it have signaled.

NAK, that is unnecessary and quite dangerous.

Instead, we have the status field designed into the fences, which should be checked for that.

Regards,
Christian.

>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>   6 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e330009..8e031d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>   	uint32_t			flags;
>   };
>   
> +struct amdgpu_ctx;
> +
>   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   			     struct amdgpu_job **job);
>   
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>   
>   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>   
>   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>   			      struct fence *fence);
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>   	struct amdgpu_sync	sync;
>   	struct amdgpu_ib	*ibs;
>   	struct fence		*fence; /* the hw fence */
> +	struct amdgpu_ctx *ctx;
>   	uint32_t		preamble_status;
>   	uint32_t		num_ibs;
>   	void			*owner;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 699f5fe..267fb65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   		}
>   	}
>   
> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>   	if (ret)
>   		goto free_all_kdata;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index b4bbbb3..81438af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -25,6 +25,13 @@
>   #include <drm/drmP.h>
>   #include "amdgpu.h"
>   
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
> +	if (ctx)
> +		kref_get(&ctx->refcount);
> +	return ctx;
> +}
> +
>   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   {
>   	unsigned i, j;
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +
> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>   	}
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 690ef3d..208da11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>   {
>   	size_t size = sizeof(struct amdgpu_job);
>   
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>   	(*job)->vm = vm;
>   	(*job)->ibs = (void *)&(*job)[1];
>   	(*job)->num_ibs = num_ibs;
> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>   
>   	amdgpu_sync_create(&(*job)->sync);
>   
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   {
>   	int r;
>   
> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>   	if (r)
>   		return r;
>   
> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>   {
>   	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
> +	struct amdgpu_ctx *ctx = job->ctx;
> +
> +	if (ctx)
> +		amdgpu_ctx_put(ctx);
>   
>   	fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 6f4e31f..9100ca8 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>   	if (!amd_sched_entity_is_initialized(sched, entity))
>   		return;
>   
> -	/**
> -	 * The client will not queue more IBs during this fini, consume existing
> -	 * queued IBs
> -	*/
> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> -
>   	amd_sched_rq_remove_entity(rq, entity);
>   	kfifo_free(&entity->job_queue);
>   }
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index 8cb41d3..ccbbcb0 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct fence			*dependency;
>   	struct fence_cb			cb;
> +	bool *ptr_guilty;
>   };
>   
>   /**


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                     ` <MWHPR1201MB020601F998809FC8F0527723B4160-3iK1xFAIwjrUF/YbdlDdgWrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-05-03  6:02                       ` Liu, Monk
       [not found]                         ` <DM5PR12MB161082763FA0163FF22E1C1F84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  6:02 UTC (permalink / raw)
  To: Zhou, David(ChunMing),
	Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

You can add ctx as a field of job, but not take a reference to it; when you try to use ctx, just check whether ctx == NULL.
> that doesn't work at all... job->ctx will always be non-NULL after it is initialized; you would just be dereferencing a wild pointer after the CTX is released

Another stupid method:
Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the job->fence->fence_context with ctx->ring[].entity->fence_context. If no match is found, the ctx has been freed; otherwise you can do your work on this ctx.
> 1) fence_context has a chance to incorrectly represent the context behind it, because the numbers can be used up and will wrap around from the beginning
   2) why not just keep the CTX alive, which is much simpler than this method

BR

-----Original Message-----
From: Zhou, David(ChunMing) 
Sent: Wednesday, May 03, 2017 12:54 PM
To: Liu, Monk <Monk.Liu@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

You can add ctx as filed of job, but not get reference of it, when you try to use ctx, just check if ctx == NULL. 

Another stupid method:
Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the job->fence->fence_context with ctx->ring[].entity->fence_context. if not found, then ctx is freed, otherwise you can do your things for this ctx.

Regards,
David Zhou

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:57 AM
To: Liu, Monk <Monk.Liu@amd.com>; Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                         ` <DM5PR12MB161082763FA0163FF22E1C1F84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03  7:23                           ` zhoucm1
       [not found]                             ` <59098580.6090204-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: zhoucm1 @ 2017-05-03  7:23 UTC (permalink / raw)
  To: Liu, Monk, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW





On 2017-05-03 14:02, Liu, Monk wrote:
> You can add ctx as a field of job, but not take a reference to it; when
> you try to use ctx, just check whether ctx == NULL.
> > that doesn't work at all... job->ctx will always be non-NULL after
> it is initialized; you would just be dereferencing a wild pointer after the CTX is released
job->ctx could be a **ctx (a pointer to the ctx pointer), which would resolve this concern of yours.

>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
> job->fence->fence_context with ctx->ring[].entity->fence_context. If no
> match is found, the ctx has been freed; otherwise you can do your work on
> this ctx.
> > 1) fence_context has a chance to incorrectly represent the context
> behind it, because the numbers can be used up and will wrap around
> from the beginning
No, fence_context is globally unique in the kernel.

Regards,
David Zhou
>    2) why not just keep the CTX alive, which is much simpler than this method
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]             ` <DM5PR12MB16102746DB02DBE8ED69DA9C84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-05-03  3:57               ` Liu, Monk
@ 2017-05-03  8:58               ` Christian König
       [not found]                 ` <059fe927-90c8-0cf3-336c-56818d9277f0-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03  8:58 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> 1, This is necessary otherwise how can I access entity pointer after a job timedout
No, that isn't necessary.

The problem with your idea is that you want to actively push the 
feedback/status from the job execution back to userspace when an error 
(timeout) happens.

My idea is that userspace should rather gather the feedback during the 
next command submission. This has the advantage that you don't need to 
keep the ctx alive till all jobs are done.

>   , and why it is dangerous ?
You need to keep quite a bunch of stuff alive (VM, CSA) when you don't 
tear down the ctx immediately.

We could split ctx teardown into freeing the resources and freeing the 
structure, but I think just gathering the needed information on CS is 
easier to do.

> 2, what's the status field in the fences you were referring to ? I need to judge if it could satisfy my requirement
struct fence was renamed to struct dma_fence in newer kernels, and a 
status field was added for exactly this purpose.

The Intel guys did this because they ran into exactly the same problem.

Regards,
Christian.

On 03.05.2017 at 05:30, Liu, Monk wrote:
> 1. This is necessary; otherwise how can I access the entity pointer after a job has timed out, and why is it dangerous?
> 2. What's the status field in the fences you were referring to? I need to judge whether it could satisfy my requirement.
>
>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>    	if (ret)
>>    		goto free_all_kdata;
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index b4bbbb3..81438af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -25,6 +25,13 @@
>>    #include <drm/drmP.h>
>>    #include "amdgpu.h"
>>    
>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>> +	if (ctx)
>> +		kref_get(&ctx->refcount);
>> +	return ctx;
>> +}
>> +
>>    static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    {
>>    	unsigned i, j;
>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    					  rq, amdgpu_sched_jobs);
>>    		if (r)
>>    			goto failed;
>> +
>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity
>> +doesn't have ptr_guilty */
>>    	}
>>    
>>    	return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 690ef3d..208da11 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>    }
>>    
>>    int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>> +amdgpu_ctx *ctx)
>>    {
>>    	size_t size = sizeof(struct amdgpu_job);
>>    
>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>    	(*job)->vm = vm;
>>    	(*job)->ibs = (void *)&(*job)[1];
>>    	(*job)->num_ibs = num_ibs;
>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>    
>>    	amdgpu_sync_create(&(*job)->sync);
>>    
>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>    {
>>    	int r;
>>    
>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>    	if (r)
>>    		return r;
>>    
>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>    static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>    {
>>    	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job,
>> base);
>> +	struct amdgpu_ctx *ctx = job->ctx;
>> +
>> +	if (ctx)
>> +		amdgpu_ctx_put(ctx);
>>    
>>    	fence_put(job->fence);
>>    	amdgpu_sync_free(&job->sync);
>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> index 6f4e31f..9100ca8 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>    	if (!amd_sched_entity_is_initialized(sched, entity))
>>    		return;
>>    
>> -	/**
>> -	 * The client will not queue more IBs during this fini, consume existing
>> -	 * queued IBs
>> -	*/
>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>> -
>>    	amd_sched_rq_remove_entity(rq, entity);
>>    	kfifo_free(&entity->job_queue);
>>    }
>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> index 8cb41d3..ccbbcb0 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>    
>>    	struct fence			*dependency;
>>    	struct fence_cb			cb;
>> +	bool *ptr_guilty;
>>    };
>>    
>>    /**
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                 ` <059fe927-90c8-0cf3-336c-56818d9277f0-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03  9:08                   ` Liu, Monk
       [not found]                     ` <DM5PR12MB1610E867F75FA922A874D74884160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  9:08 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


1, My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.

> No, we need to clean the hw rings (cherry-pick out the guilty entities' jobs in all rings) after GPU reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
> I don't understand how your idea can solve my requirement ...

2, You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.

> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on whether the context is alive or not ...

3, struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.

The Intel guys did this because they ran into exactly the same problem.

> I'll see if dma_fence could solve my issue, but I wish you could give me your detailed idea


BR Monk



-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Wednesday, May 03, 2017 4:59 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> 1, This is necessary otherwise how can I access the entity pointer after a
> job timed out
No, that isn't necessary.

The problem with your idea is that you want to actively push the feedback/status from the job execution back to userspace when an error
(timeout) happens.

My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.

>   , and why it is dangerous ?
You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.

We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.

> 2, what's the status field in the fences you were referring to ? I 
> need to judge if it could satisfy my requirement
struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.

The Intel guys did this because they ran into exactly the same problem.

Regards,
Christian.

On 03.05.2017 at 05:30, Liu, Monk wrote:
> 1, This is necessary, otherwise how can I access the entity pointer after a job timed out, and why is it dangerous ?
> 2, what's the status field in the fences you were referring to ? I 
> need to judge if it could satisfy my requirement
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Monday, May 01, 2017 10:48 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
> On 01.05.2017 at 09:22, Monk Liu wrote:
>> for TDR guilty context feature, we need access ctx/s_entity field 
>> member through sched job pointer,so ctx must keep alive till all job 
>> from it signaled.
> NAK, that is unnecessary and quite dangerous.
>
> Instead we have the status field designed into the fences, which should be checked for that.
>
> Regards,
> Christian.
>
>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>    6 files changed, 23 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index e330009..8e031d6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>    	uint32_t			flags;
>>    };
>>    
>> +struct amdgpu_ctx;
>> +
>>    extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>    
>>    int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>> +amdgpu_ctx *ctx);
>>    int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>    			     struct amdgpu_job **job);
>>    
>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>    
>>    struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>    int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>    
>>    uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>    			      struct fence *fence);
>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>    	struct amdgpu_sync	sync;
>>    	struct amdgpu_ib	*ibs;
>>    	struct fence		*fence; /* the hw fence */
>> +	struct amdgpu_ctx *ctx;
>>    	uint32_t		preamble_status;
>>    	uint32_t		num_ibs;
>>    	void			*owner;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 699f5fe..267fb65 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>    		}
>>    	}
>>    
>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>    	if (ret)
>>    		goto free_all_kdata;
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index b4bbbb3..81438af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -25,6 +25,13 @@
>>    #include <drm/drmP.h>
>>    #include "amdgpu.h"
>>    
>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>> +	if (ctx)
>> +		kref_get(&ctx->refcount);
>> +	return ctx;
>> +}
>> +
>>    static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    {
>>    	unsigned i, j;
>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    					  rq, amdgpu_sched_jobs);
>>    		if (r)
>>    			goto failed;
>> +
>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity 
>> +doesn't have ptr_guilty */
>>    	}
>>    
>>    	return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 690ef3d..208da11 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>    }
>>    
>>    int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>> +amdgpu_ctx *ctx)
>>    {
>>    	size_t size = sizeof(struct amdgpu_job);
>>    
>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>    	(*job)->vm = vm;
>>    	(*job)->ibs = (void *)&(*job)[1];
>>    	(*job)->num_ibs = num_ibs;
>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>    
>>    	amdgpu_sync_create(&(*job)->sync);
>>    
>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>    {
>>    	int r;
>>    
>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>    	if (r)
>>    		return r;
>>    
>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>    static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>    {
>>    	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, 
>> base);
>> +	struct amdgpu_ctx *ctx = job->ctx;
>> +
>> +	if (ctx)
>> +		amdgpu_ctx_put(ctx);
>>    
>>    	fence_put(job->fence);
>>    	amdgpu_sync_free(&job->sync);
>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> index 6f4e31f..9100ca8 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>    	if (!amd_sched_entity_is_initialized(sched, entity))
>>    		return;
>>    
>> -	/**
>> -	 * The client will not queue more IBs during this fini, consume existing
>> -	 * queued IBs
>> -	*/
>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>> -
>>    	amd_sched_rq_remove_entity(rq, entity);
>>    	kfifo_free(&entity->job_queue);
>>    }
>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> index 8cb41d3..ccbbcb0 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>    
>>    	struct fence			*dependency;
>>    	struct fence_cb			cb;
>> +	bool *ptr_guilty;
>>    };
>>    
>>    /**
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                             ` <59098580.6090204-5C7GfCeVMHo@public.gmane.org>
@ 2017-05-03  9:11                               ` Christian König
       [not found]                                 ` <ba31391d-1f42-705b-5c94-bfd7bd1a194f-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03  9:11 UTC (permalink / raw)
  To: zhoucm1, Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



> > 1) fence_context has a chance to incorrectly represent the context
> behind it, because the numbers can be used up and will wrap around
> from the beginning
> No, fence_context is globally unique in the kernel.
Yeah, I would go with that as well.

You note the fence_context of the job which caused the trouble.

Then take a look at all jobs on the recovery list and remove the ones 
with the same fence_context number.

Then take a look at all entities on the runlist and remove the one with 
the same fence_context number. If it is already freed you won't find 
any, but that shouldn't be a problem.

This way you effectively can prevent other jobs from the same context 
from running and the context query can simply check if the entity is 
still on the runlist to figure out if it was guilty or not.

Regards,
Christian.
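The cleanup Christian sketches above (note the guilty job's fence_context, then drop every pending job carrying the same number) could look roughly like this as a self-contained userspace mock; all names are illustrative, not the actual scheduler code:

```c
#include <assert.h>
#include <stddef.h>

/* Each pending job on the recovery list carries the fence_context of the
 * entity that submitted it. */
struct mock_job {
	unsigned long long fence_context;
	struct mock_job *next;
};

/* Unlink every job whose fence_context matches 'guilty'.
 * Returns the number of jobs removed. */
static int drop_guilty_jobs(struct mock_job **head, unsigned long long guilty)
{
	int removed = 0;

	while (*head) {
		if ((*head)->fence_context == guilty) {
			*head = (*head)->next;	/* unlink the guilty job */
			removed++;
		} else {
			head = &(*head)->next;	/* keep innocent jobs */
		}
	}
	return removed;
}
```

Because fence_context numbers are globally unique, matching on the number alone is enough; no context pointer has to be kept alive for this comparison.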

On 03.05.2017 at 09:23, zhoucm1 wrote:
>
>
> On 2017-05-03 14:02, Liu, Monk wrote:
>> You can add ctx as a field of the job, but not take a reference to it; when
>> you try to use ctx, just check if ctx == NULL.
>> > that doesn't work at all... job->ctx will always be non-NULL after
>> it is initialized, you just refer to a wild pointer after the CTX is released
> job->ctx being a **ctx could resolve this concern of yours.
>
>>
>> Another stupid method:
>> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
>> job->fence->fence_context with ctx->ring[].entity->fence_context. If
>> not found, then the ctx has been freed; otherwise you can do your things for
>> this ctx.
>> > 1) fence_context has a chance to incorrectly represent the context
>> behind it, because the numbers can be used up and will wrap around
>> from the beginning
> No, fence_context is globally unique in the kernel.
>
> Regards,
> David Zhou
>>    2) why not just keep CTX alive, which is much simpler than this method
>>
>> BR
>>
>> -----Original Message-----
>> From: Zhou, David(ChunMing)
>> Sent: Wednesday, May 03, 2017 12:54 PM
>> To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; 
>> Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>> You can add ctx as filed of job, but not get reference of it, when 
>> you try to use ctx, just check if ctx == NULL.
>>
>> Another stupid method:
>> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the 
>> job->fence->fence_context with ctx->ring[].entity->fence_context. if 
>> not found, then ctx is freed, otherwise you can do your things for 
>> this ctx.
>>
>> Regards,
>> David Zhou
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org] On 
>> Behalf Of Liu, Monk
>> Sent: Wednesday, May 03, 2017 11:57 AM
>> To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Christian König 
>> <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>> @Christian,
>>
>> The thing is I need to mark all entities behind this timed-out job as
>> guilty, so I use a member in the entity named "ptr_guilty" that points to
>> the member in amdgpu_ctx named "guilty". That way reading
>> *entity->ptr_guilty tells you whether this entity is "invalid", and setting
>> *entity->ptr_guilty lets you mark all entities belonging to the
>> context as "invalid".
>>
>> The above logic is to guarantee we can kick out all guilty entities' jobs
>> from the entity kfifo, and also block any IOCTL with a ctx handle pointing to
>> this guilty context, and we only recover other jobs/entities/contexts
>> after the scheduler is unparked
>>
>> If you reject the patch that keeps the ctx alive till all jobs have signaled,
>> please give me a solution that satisfies the above logic
>>
>> BR Monk
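The ptr_guilty scheme described above can be sketched as a self-contained userspace mock (the mock_* names are illustrative; the real wiring lives in amdgpu_ctx_init() and the scheduler entities):

```c
#include <assert.h>

/* Every entity of a context points at the context's single 'guilty'
 * flag, so marking one entity guilty invalidates the whole context. */
#define MOCK_NUM_RINGS 2

struct mock_entity {
	int *ptr_guilty;	/* shared flag; NULL for kernel entities */
};

struct mock_ctx {
	int guilty;
	struct mock_entity rings[MOCK_NUM_RINGS];
};

static void mock_ctx_init(struct mock_ctx *ctx)
{
	ctx->guilty = 0;
	for (int i = 0; i < MOCK_NUM_RINGS; i++)
		ctx->rings[i].ptr_guilty = &ctx->guilty;
}

/* Timeout-handler side: mark the offending entity's context guilty. */
static void mock_entity_mark_guilty(struct mock_entity *e)
{
	if (e->ptr_guilty)
		*e->ptr_guilty = 1;
}

/* IOCTL side: any entity of a guilty context now reads back as invalid,
 * so the submission path can return -ENODEV. */
static int mock_entity_is_guilty(const struct mock_entity *e)
{
	return e->ptr_guilty && *e->ptr_guilty;
}
```

This only shows the shared-flag indirection; the actual kicking of queued jobs and the -ENODEV return happen elsewhere in the series.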
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org] On 
>> Behalf Of Liu, Monk
>> Sent: Wednesday, May 03, 2017 11:31 AM
>> To: Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>; 
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>> 1, This is necessary, otherwise how can I access the entity pointer after
>> a job timed out, and why is it dangerous ?
>> 2, what's the status field in the fences you were referring to ? I 
>> need to judge if it could satisfy my requirement
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org]
>> Sent: Monday, May 01, 2017 10:48 PM
>> To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>> On 01.05.2017 at 09:22, Monk Liu wrote:
>> > for TDR guilty context feature, we need access ctx/s_entity field
>> > member through sched job pointer,so ctx must keep alive till all job
>> > from it signaled.
>>
>> NAK, that is unnecessary and quite dangerous.
>>
>> Instead we have the status field designed into the fences, which should
>> be checked for that.
>>
>> Regards,
>> Christian.
>>
>> >
>> > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>> > Signed-off-by: Monk Liu <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
>> > ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>> >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>> >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>> >   6 files changed, 23 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > index e330009..8e031d6 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > @@ -760,10 +760,12 @@ struct amdgpu_ib {
>> >        uint32_t                        flags;
>> >   };
>> >
>> > +struct amdgpu_ctx;
>> > +
>> >   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>> >
>> >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> > -                  struct amdgpu_job **job, struct amdgpu_vm *vm);
>> > +                  struct amdgpu_job **job, struct amdgpu_vm *vm, 
>> struct
>> > +amdgpu_ctx *ctx);
>> >   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned 
>> size,
>> >                             struct amdgpu_job **job);
>> >
>> > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>> >
>> >   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, 
>> uint32_t id);
>> >   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>> >
>> >   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct 
>> amdgpu_ring *ring,
>> >                              struct fence *fence);
>> > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>> >        struct amdgpu_sync      sync;
>> >        struct amdgpu_ib        *ibs;
>> >        struct fence            *fence; /* the hw fence */
>> > +     struct amdgpu_ctx *ctx;
>> >        uint32_t                preamble_status;
>> >        uint32_t                num_ibs;
>> >        void                    *owner;
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> > index 699f5fe..267fb65 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct 
>> amdgpu_cs_parser *p, void *data)
>> >                }
>> >        }
>> >
>> > -     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>> > +     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>> >        if (ret)
>> >                goto free_all_kdata;
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> > index b4bbbb3..81438af 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> > @@ -25,6 +25,13 @@
>> >   #include <drm/drmP.h>
>> >   #include "amdgpu.h"
>> >
>> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>> > +     if (ctx)
>> > +             kref_get(&ctx->refcount);
>> > +     return ctx;
>> > +}
>> > +
>> >   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct 
>> amdgpu_ctx *ctx)
>> >   {
>> >        unsigned i, j;
>> > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device 
>> *adev, struct amdgpu_ctx *ctx)
>> >                                          rq, amdgpu_sched_jobs);
>> >                if (r)
>> >                        goto failed;
>> > +
>> > +             ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* 
>> kernel entity
>> > +doesn't have ptr_guilty */
>> >        }
>> >
>> >        return 0;
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> > index 690ef3d..208da11 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct 
>> amd_sched_job *s_job)
>> >   }
>> >
>> >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> > -                  struct amdgpu_job **job, struct amdgpu_vm *vm)
>> > +                  struct amdgpu_job **job, struct amdgpu_vm *vm, 
>> struct
>> > +amdgpu_ctx *ctx)
>> >   {
>> >        size_t size = sizeof(struct amdgpu_job);
>> >
>> > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, 
>> unsigned num_ibs,
>> >        (*job)->vm = vm;
>> >        (*job)->ibs = (void *)&(*job)[1];
>> >        (*job)->num_ibs = num_ibs;
>> > +     (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>> >
>> >        amdgpu_sync_create(&(*job)->sync);
>> >
>> > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device 
>> *adev, unsigned size,
>> >   {
>> >        int r;
>> >
>> > -     r = amdgpu_job_alloc(adev, 1, job, NULL);
>> > +     r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>> >        if (r)
>> >                return r;
>> >
>> > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job 
>> *job)
>> >   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>> >   {
>> >        struct amdgpu_job *job = container_of(s_job, struct amdgpu_job,
>> > base);
>> > +     struct amdgpu_ctx *ctx = job->ctx;
>> > +
>> > +     if (ctx)
>> > +             amdgpu_ctx_put(ctx);
>> >
>> >        fence_put(job->fence);
>> >        amdgpu_sync_free(&job->sync);
>> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> > index 6f4e31f..9100ca8 100644
>> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct 
>> amd_gpu_scheduler *sched,
>> >        if (!amd_sched_entity_is_initialized(sched, entity))
>> >                return;
>> >
>> > -     /**
>> > -      * The client will not queue more IBs during this fini, 
>> consume existing
>> > -      * queued IBs
>> > -     */
>> > -     wait_event(sched->job_scheduled, 
>> amd_sched_entity_is_idle(entity));
>> > -
>> >        amd_sched_rq_remove_entity(rq, entity);
>> >        kfifo_free(&entity->job_queue);
>> >   }
>> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> > index 8cb41d3..ccbbcb0 100644
>> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> > @@ -49,6 +49,7 @@ struct amd_sched_entity {
>> >
>> >        struct fence                    *dependency;
>> >        struct fence_cb                 cb;
>> > +     bool *ptr_guilty;
>> >   };
>> >
>> >   /**
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                 ` <ba31391d-1f42-705b-5c94-bfd7bd1a194f-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03  9:14                                   ` Liu, Monk
       [not found]                                     ` <DM5PR12MB1610875E9D1BC9E967BE119A84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  9:14 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



I thought of that in the first place as well: keep an ID and parse the ID of every unsignaled job to see if the job
belongs to something guilty. But keeping the ctx alive is simpler, so I chose this way.

But I admit it is doable as well, and I want to compare this method with the dma_fence status field you mentioned.

Please give me some details on the dma_fence approach, thanks !

BR Monk

From: Christian König [mailto:deathsimple@vodafone.de]
Sent: Wednesday, May 03, 2017 5:12 PM
To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> 1) fence_context has a chance to incorrectly represent the context behind it, because the numbers can be used up and will wrap around from the beginning
No, fence_context is globally unique in the kernel.
Yeah, I would go with that as well.

You note the fence_context of the job which caused the trouble.

Then take a look at all jobs on the recovery list and remove the ones with the same fence_context number.

Then take a look at all entities on the runlist and remove the one with the same fence_context number. If it is already freed you won't find any, but that shouldn't be a problem.

This way you effectively can prevent other jobs from the same context from running and the context query can simply check if the entity is still on the runlist to figure out if it was guilty or not.

Regards,
Christian.

On 03.05.2017 09:23, zhoucm1 wrote:

On 2017-05-03 14:02, Liu, Monk wrote:
You can add ctx as a field of the job, but without taking a reference on it; when you try to use ctx, just check if ctx == NULL.
> that doesn't work at all... job->ctx will always be non-NULL after it is initialized, you would just refer to a wild pointer after the CTX is released
job->ctx is a **ctx, which could resolve this concern of yours.



Another stupid method:
Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the job->fence->fence_context with ctx->ring[].entity->fence_context. If not found, then the ctx is freed; otherwise you can do your things for this ctx.
> 1) fence_context has a chance to incorrectly represent the context behind it, because the numbers can be used up and will wrap around from the beginning
No, fence_context is globally unique in the kernel.

Regards,
David Zhou

   2) why not just keep the CTX alive, which is much simpler than this method

BR

-----Original Message-----
From: Zhou, David(ChunMing)
Sent: Wednesday, May 03, 2017 12:54 PM
To: Liu, Monk <Monk.Liu@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

You can add ctx as a field of the job, but without taking a reference on it; when you try to use ctx, just check if ctx == NULL.

Another stupid method:
Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the job->fence->fence_context with ctx->ring[].entity->fence_context. If not found, then the ctx is freed; otherwise you can do your things for this ctx.
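The lookup David describes, stripped of the kernel idr machinery, amounts to the following sketch. The `ctx` struct and `find_ctx` helper here are hypothetical simplifications; the real code would iterate the ctx manager with idr_for_each_entry().

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RINGS 2

/* Hypothetical, simplified stand-in for amdgpu_ctx: each ring entity
 * owns one globally unique fence_context number. */
struct ctx {
	unsigned long long ring_fence_context[NUM_RINGS];
};

/* Return the live context owning the given fence_context, or NULL if
 * that context has already been freed. */
static struct ctx *find_ctx(struct ctx **ctxs, int n,
			    unsigned long long fence_context)
{
	for (int i = 0; i < n; i++) {
		if (!ctxs[i])
			continue;	/* slot already freed */
		for (int r = 0; r < NUM_RINGS; r++)
			if (ctxs[i]->ring_fence_context[r] == fence_context)
				return ctxs[i];
	}
	return NULL;
}
```

A NULL result means the job outlived its context, which is exactly the case this thread is worried about.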

Regards,
David Zhou

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:57 AM
To: Liu, Monk <Monk.Liu@amd.com>; Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

@Christian,

The thing is that I need to mark all entities behind this timed-out job as guilty, so I use a member in the entity named "ptr_guilty" that points to the member in amdgpu_ctx named "guilty". That way, reading *entity->ptr_guilty tells you whether this entity is "invalid", and setting *entity->ptr_guilty lets you mark all entities belonging to the context as "invalid".

The above logic guarantees that we can kick out all guilty entities' jobs from the entity kfifo, and also block every IOCTL whose ctx handle points to this guilty context; we only recover other jobs/entities/contexts after the scheduler is unparked.

If you reject the patch that keeps the ctx alive till all jobs are signaled, please give me a solution that satisfies the above logic
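The ptr_guilty wiring described above boils down to a shared flag. The sketch below uses hypothetical minimal types (not the real amdgpu_ctx / amd_sched_entity structs) purely to illustrate how one write invalidates every entity of the context:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RINGS 2

/* Hypothetical minimal stand-ins for amd_sched_entity / amdgpu_ctx. */
struct entity {
	bool *ptr_guilty;	/* points at the owning context's guilty flag */
};

struct ctx {
	bool guilty;
	struct entity rings[NUM_RINGS];
};

static void ctx_init(struct ctx *c)
{
	c->guilty = false;
	for (int i = 0; i < NUM_RINGS; i++)
		c->rings[i].ptr_guilty = &c->guilty; /* all entities share one flag */
}

/* Marking any one entity guilty invalidates the whole context:
 * every other entity sees the same flag through its pointer. */
static void entity_mark_guilty(struct entity *e)
{
	*e->ptr_guilty = true;
}
```

On the IOCTL path, checking *entity->ptr_guilty (or ctx->guilty directly) is then enough to reject submissions with -ENODEV.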

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, May 03, 2017 11:31 AM
To: Christian König <deathsimple@vodafone.de>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

1. This is necessary; otherwise how can I access the entity pointer after a job has timed out? And why is it dangerous?
2. What's the status field in the fences you were referring to? I need to judge whether it could satisfy my requirement



-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de]
Sent: Monday, May 01, 2017 10:48 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

On 01.05.2017 09:22, Monk Liu wrote:
> for TDR guilty context feature, we need access ctx/s_entity field
> member through sched job pointer,so ctx must keep alive till all job
> from it signaled.

NAK, that is unnecessary and quite dangerous.

Instead we have the designed status field in the fences which should be checked for that.

Regards,
Christian.

>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>   6 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e330009..8e031d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>        uint32_t                        flags;
>   };
>
> +struct amdgpu_ctx;
> +
>   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -                  struct amdgpu_job **job, struct amdgpu_vm *vm);
> +                  struct amdgpu_job **job, struct amdgpu_vm *vm, struct
> +amdgpu_ctx *ctx);
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>                             struct amdgpu_job **job);
>
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>
>   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>
>   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>                              struct fence *fence);
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>        struct amdgpu_sync      sync;
>        struct amdgpu_ib        *ibs;
>        struct fence            *fence; /* the hw fence */
> +     struct amdgpu_ctx *ctx;
>        uint32_t                preamble_status;
>        uint32_t                num_ibs;
>        void                    *owner;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 699f5fe..267fb65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>                }
>        }
>
> -     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> +     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>        if (ret)
>                goto free_all_kdata;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index b4bbbb3..81438af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -25,6 +25,13 @@
>   #include <drm/drmP.h>
>   #include "amdgpu.h"
>
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
> +     if (ctx)
> +             kref_get(&ctx->refcount);
> +     return ctx;
> +}
> +
>   static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   {
>        unsigned i, j;
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>                                          rq, amdgpu_sched_jobs);
>                if (r)
>                        goto failed;
> +
> +             ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity
> +doesn't have ptr_guilty */
>        }
>
>        return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 690ef3d..208da11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   }
>
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> -                  struct amdgpu_job **job, struct amdgpu_vm *vm)
> +                  struct amdgpu_job **job, struct amdgpu_vm *vm, struct
> +amdgpu_ctx *ctx)
>   {
>        size_t size = sizeof(struct amdgpu_job);
>
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>        (*job)->vm = vm;
>        (*job)->ibs = (void *)&(*job)[1];
>        (*job)->num_ibs = num_ibs;
> +     (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>
>        amdgpu_sync_create(&(*job)->sync);
>
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>   {
>        int r;
>
> -     r = amdgpu_job_alloc(adev, 1, job, NULL);
> +     r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>        if (r)
>                return r;
>
> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>   {
>        struct amdgpu_job *job = container_of(s_job, struct amdgpu_job,
> base);
> +     struct amdgpu_ctx *ctx = job->ctx;
> +
> +     if (ctx)
> +             amdgpu_ctx_put(ctx);
>
>        fence_put(job->fence);
>        amdgpu_sync_free(&job->sync);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 6f4e31f..9100ca8 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>        if (!amd_sched_entity_is_initialized(sched, entity))
>                return;
>
> -     /**
> -      * The client will not queue more IBs during this fini, consume existing
> -      * queued IBs
> -     */
> -     wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> -
>        amd_sched_rq_remove_entity(rq, entity);
>        kfifo_free(&entity->job_queue);
>   }
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index 8cb41d3..ccbbcb0 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>
>        struct fence                    *dependency;
>        struct fence_cb                 cb;
> +     bool *ptr_guilty;
>   };
>
>   /**


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




[-- Attachment #1.2: Type: text/html, Size: 24535 bytes --]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                     ` <DM5PR12MB1610E867F75FA922A874D74884160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03  9:18                       ` Christian König
       [not found]                         ` <eb637720-5c9a-636b-237e-228b499ff3bb-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03  9:18 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive or not ...

See the teardown order in amdgpu_driver_postclose_kms():
> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>
>         amdgpu_uvd_free_handles(adev, file_priv);
>         amdgpu_vce_free_handles(adev, file_priv);
>
>         amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>
>         if (amdgpu_sriov_vf(adev)) {
>                 /* TODO: how to handle reserve failure */
>                 BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>                 amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>                 fpriv->vm.csa_bo_va = NULL;
>                 amdgpu_bo_unreserve(adev->virt.csa_obj);
>         }
>
>         amdgpu_vm_fini(adev, &fpriv->vm);

amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all 
contexts of the current fd.

If we don't release the context here because some jobs are still 
executing, we need to keep the UVD and VCE handles, the PRT VAs, the CSA 
and even the whole VM structure alive.

> I'll see if dma_fence could solve my issue, but I wish you could give me your detailed idea
Please take a look at David's idea of using the fence_context to find 
which jobs and entities to skip; that is even better than mine about the 
fence status, and should be trivial to implement because all the data is 
already present, we just need to use it.

Regards,
Christian.

On 03.05.2017 11:08, Liu, Monk wrote:
> 1. My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>
>> No, we need to clean the hw rings (cherry-pick out the guilty entities' jobs from all rings) after a GPU reset, we need to fake-signal all sched_fences in the guilty entity, and we need to mark the context as guilty so that the next IOCTL on it will return -ENODEV.
>> I don't understand how your idea can solve my request ...
> 2. You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>
>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive or not ...
> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>
> The Intel guys did this because they ran into the exactly same problem.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give me your detailed idea
>
> BR Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 4:59 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>> 1. This is necessary; otherwise how can I access the entity
>> pointer after a job has timed out
> No, that isn't necessary.
>
> The problem with your idea is that you want to actively push the feedback/status from the job execution back to userspace when an error
> (timeout) happens.
>
> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>
>>    , and why it is dangerous ?
> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>
We could split ctx teardown into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>
>> 2, what's the status field in the fences you were referring to ? I
>> need to judge if it could satisfy my requirement
> struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>
> The Intel guys did this because they ran into the exactly same problem.
>
> Regards,
> Christian.
>
On 03.05.2017 05:30, Liu, Monk wrote:
>> 1. This is necessary; otherwise how can I access the entity pointer after a job has timed out? And why is it dangerous?
>> 2. What's the status field in the fences you were referring to? I
>> need to judge whether it could satisfy my requirement
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Monday, May 01, 2017 10:48 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>> finished
>>
>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>> for TDR guilty context feature, we need access ctx/s_entity field
>>> member through sched job pointer,so ctx must keep alive till all job
>>> from it signaled.
>> NAK, that is unnecessary and quite dangerous.
>>
>> Instead we have the designed status field in the fences which should be checked for that.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>     6 files changed, 23 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index e330009..8e031d6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>     	uint32_t			flags;
>>>     };
>>>     
>>> +struct amdgpu_ctx;
>>> +
>>>     extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>     
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>> +amdgpu_ctx *ctx);
>>>     int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>     			     struct amdgpu_job **job);
>>>     
>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>     
>>>     struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>     int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>     
>>>     uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>     			      struct fence *fence);
>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>     	struct amdgpu_sync	sync;
>>>     	struct amdgpu_ib	*ibs;
>>>     	struct fence		*fence; /* the hw fence */
>>> +	struct amdgpu_ctx *ctx;
>>>     	uint32_t		preamble_status;
>>>     	uint32_t		num_ibs;
>>>     	void			*owner;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 699f5fe..267fb65 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     		}
>>>     	}
>>>     
>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>     	if (ret)
>>>     		goto free_all_kdata;
>>>     
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index b4bbbb3..81438af 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -25,6 +25,13 @@
>>>     #include <drm/drmP.h>
>>>     #include "amdgpu.h"
>>>     
>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>> +	if (ctx)
>>> +		kref_get(&ctx->refcount);
>>> +	return ctx;
>>> +}
>>> +
>>>     static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     {
>>>     	unsigned i, j;
>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     					  rq, amdgpu_sched_jobs);
>>>     		if (r)
>>>     			goto failed;
>>> +
>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity
>>> +doesn't have ptr_guilty */
>>>     	}
>>>     
>>>     	return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 690ef3d..208da11 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>     }
>>>     
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>> +amdgpu_ctx *ctx)
>>>     {
>>>     	size_t size = sizeof(struct amdgpu_job);
>>>     
>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>     	(*job)->vm = vm;
>>>     	(*job)->ibs = (void *)&(*job)[1];
>>>     	(*job)->num_ibs = num_ibs;
>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>     
>>>     	amdgpu_sync_create(&(*job)->sync);
>>>     
>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>     {
>>>     	int r;
>>>     
>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>     	if (r)
>>>     		return r;
>>>     
>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>     static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>     {
>>>     	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job,
>>> base);
>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>> +
>>> +	if (ctx)
>>> +		amdgpu_ctx_put(ctx);
>>>     
>>>     	fence_put(job->fence);
>>>     	amdgpu_sync_free(&job->sync);
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 6f4e31f..9100ca8 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>     	if (!amd_sched_entity_is_initialized(sched, entity))
>>>     		return;
>>>     
>>> -	/**
>>> -	 * The client will not queue more IBs during this fini, consume existing
>>> -	 * queued IBs
>>> -	*/
>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>> -
>>>     	amd_sched_rq_remove_entity(rq, entity);
>>>     	kfifo_free(&entity->job_queue);
>>>     }
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> index 8cb41d3..ccbbcb0 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>     
>>>     	struct fence			*dependency;
>>>     	struct fence_cb			cb;
>>> +	bool *ptr_guilty;
>>>     };
>>>     
>>>     /**



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                     ` <DM5PR12MB1610875E9D1BC9E967BE119A84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03  9:23                                       ` Christian König
  0 siblings, 0 replies; 28+ messages in thread
From: Christian König @ 2017-05-03  9:23 UTC (permalink / raw)
  To: Liu, Monk, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 16795 bytes --]

> Please give me some detail approach on dma_fence, thanks !
>
I think the idea of using the fence_context is even better than using 
the fence status. It has been present for a while, and filtering the 
jobs/entities by it needs something like 10 lines of code.

The fence status is new and only available on newer kernels (we would 
need to rebase to amd-staging-4.11).
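For comparison, the fence-status approach (later exposed on struct dma_fence via dma_fence_set_error() / dma_fence_get_status() in kernels from 4.11 on) amounts to tagging the fence with an error when the job is dropped. The sketch below uses a hypothetical simplified fence struct, not the real dma_fence API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified fence: a signaled flag plus an error status. */
struct fence {
	bool signaled;
	int status;	/* 0 = OK, negative errno if the job was dropped */
};

/* On GPU reset, the scheduler fake-signals a guilty job's fence with an
 * error instead of resubmitting it to the hardware ring. */
static void fence_signal_with_error(struct fence *f, int error)
{
	f->status = error;
	f->signaled = true;
}

/* At the next command submission, userspace feedback is gathered by
 * checking the status of already-signaled fences. */
static int fence_get_status(const struct fence *f)
{
	return f->signaled ? f->status : 0;
}
```

This matches Christian's earlier point that userspace gathers the error on the next CS rather than the kernel pushing it out at timeout time.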

I will try to dig up later today how the Intel guys solved that problem.

Christian.

On 03.05.2017 11:14, Liu, Monk wrote:
>
> I thought of that in the first place as well: keep an ID and check the
> ID of every unsignaled job to see if the job belongs to something
> guilty. But keeping the ctx alive is simpler, so I chose this way.
>
> But I admit it is doable as well, and I want to compare this method
> with the dma_fence status field you mentioned.
>
> Please give me some detail approach on dma_fence, thanks !
>
> BR Monk
>
> *From:*Christian König [mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org]
> *Sent:* Wednesday, May 03, 2017 5:12 PM
> *To:* Zhou, David(ChunMing) <David1.Zhou-5C7GfCeVMHo@public.gmane.org>; Liu, Monk 
> <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>     > 1) fence_context has a chance to incorrectly represent the context
>     behind it, because the numbers can be used up and will wrap around
>     from the beginning
>
>     No, fence_context is globally unique in the kernel.
>
> Yeah, I would go with that as well.
>
> You note the fence_context of the job which caused the trouble.
>
> Then take a look at all jobs on the recovery list and remove the ones 
> with the same fence_context number.
>
> Then take a look at all entities on the runlist and remove the one 
> with the same fence_context number. If it is already freed you won't 
> find any, but that shouldn't be a problem.
>
> This way you effectively can prevent other jobs from the same context 
> from running and the context query can simply check if the entity is 
> still on the runlist to figure out if it was guilty or not.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 09:23 schrieb zhoucm1:
>
>     On 2017年05月03日 14:02, Liu, Monk wrote:
>
>         You can add ctx as a field of the job, but without taking a
>         reference on it; when you try to use ctx, just check if
>         ctx == NULL.
>         > that doesn't work at all... job->ctx will always be non-NULL
>         after it is initialized, you would just refer to a wild
>         pointer after the CTX is released
>
>     job->ctx is a **ctx, which could resolve this concern of yours.
>
>
>
>         Another stupid method:
>         Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
>         job->fence->fence_context with
>         ctx->ring[].entity->fence_context. if not found, then ctx is
>         freed, otherwise you can do your things for this ctx.
>         > 1) fence_context has a chance to incorrectly represent the
>         context behind it, because the numbers can be used up and will
>         wrap around from the beginning
>
>     No, fence_context is globally unique in the kernel.
>
>     Regards,
>     David Zhou
>
>            2) why not just keep the CTX alive, which is much simpler
>         than this method
>
>         BR
>
>         -----Original Message-----
>         From: Zhou, David(ChunMing)
>         Sent: Wednesday, May 03, 2017 12:54 PM
>         To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>         Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>         Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>         <mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>;
>         amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>         <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         You can add ctx as filed of job, but not get reference of it,
>         when you try to use ctx, just check if ctx == NULL.
>
>         Another stupid method:
>         Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
>         job->fence->fence_context with
>         ctx->ring[].entity->fence_context. if not found, then ctx is
>         freed, otherwise you can do your things for this ctx.
>
>         Regards,
>         David Zhou
>
>         -----Original Message-----
>         From: amd-gfx [mailto:amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org]
>         On Behalf Of Liu, Monk
>         Sent: Wednesday, May 03, 2017 11:57 AM
>         To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>         Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>         <mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>;
>         amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>         <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         @Christian,
>
>         The thing is that I need to mark all entities behind this
>         timed-out job as guilty, so I use a member in the entity named
>         "ptr_guilty" that points to the member in amdgpu_ctx named
>         "guilty". That way, reading *entity->ptr_guilty tells you
>         whether this entity is "invalid", and setting
>         *entity->ptr_guilty lets you mark all entities belonging to
>         the context as "invalid".
>
>         The above logic guarantees that we can kick out all guilty
>         entities' jobs from the entity kfifo, and also block every
>         IOCTL whose ctx handle points to this guilty context; we only
>         recover other jobs/entities/contexts after the scheduler is
>         unparked.
>
>         If you reject the patch that keeps the ctx alive till all jobs
>         are signaled, please give me a solution that satisfies the
>         above logic
>
>         BR Monk
>
>         -----Original Message-----
>         From: amd-gfx [mailto:amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org]
>         On Behalf Of Liu, Monk
>         Sent: Wednesday, May 03, 2017 11:31 AM
>         To: Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>         <mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>;
>         amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>         <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         1. This is necessary; otherwise how can I access the entity
>         pointer after a job has timed out? And why is it dangerous?
>         2. What's the status field in the fences you were referring
>         to? I need to judge whether it could satisfy my requirement
>
>
>
>         -----Original Message-----
>         From: Christian König [mailto:deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org]
>         Sent: Monday, May 01, 2017 10:48 PM
>         To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>         amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>         <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>         Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         Am 01.05.2017 um 09:22 schrieb Monk Liu:
>         > for TDR guilty context feature, we need access ctx/s_entity
>         field
>         > member through sched job pointer,so ctx must keep alive till
>         all job
>         > from it signaled.
>
>         NAK, that is unnecessary and quite dangerous.
>
>         Instead we have the designed status field in the fences which
>         should be checked for that.
>
>         Regards,
>         Christian.
>
>         >
>         > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>         > Signed-off-by: Monk Liu <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
>         > ---
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>         >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>         >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>         >   6 files changed, 23 insertions(+), 10 deletions(-)
>         >
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > index e330009..8e031d6 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > @@ -760,10 +760,12 @@ struct amdgpu_ib {
>         >        uint32_t                        flags;
>         >   };
>         >
>         > +struct amdgpu_ctx;
>         > +
>         >   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>         >
>         >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>         num_ibs,
>         > -                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm);
>         > +                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm, struct
>         > +amdgpu_ctx *ctx);
>         >   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>         unsigned size,
>         >                             struct amdgpu_job **job);
>         >
>         > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>         >
>         >   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv
>         *fpriv, uint32_t id);
>         >   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>         > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>         >
>         >   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx,
>         struct amdgpu_ring *ring,
>         >                              struct fence *fence);
>         > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>         >        struct amdgpu_sync      sync;
>         >        struct amdgpu_ib        *ibs;
>         >        struct fence            *fence; /* the hw fence */
>         > +     struct amdgpu_ctx *ctx;
>         >        uint32_t                preamble_status;
>         >        uint32_t                num_ibs;
>         >        void                    *owner;
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > index 699f5fe..267fb65 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct
>         amdgpu_cs_parser *p, void *data)
>         >                }
>         >        }
>         >
>         > -     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>         > +     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm,
>         p->ctx);
>         >        if (ret)
>         >                goto free_all_kdata;
>         >
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > index b4bbbb3..81438af 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > @@ -25,6 +25,13 @@
>         >   #include <drm/drmP.h>
>         >   #include "amdgpu.h"
>         >
>         > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx
>         *ctx) {
>         > +     if (ctx)
>         > +             kref_get(&ctx->refcount);
>         > +     return ctx;
>         > +}
>         > +
>         >   static int amdgpu_ctx_init(struct amdgpu_device *adev,
>         struct amdgpu_ctx *ctx)
>         >   {
>         >        unsigned i, j;
>         > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct
>         amdgpu_device *adev, struct amdgpu_ctx *ctx)
>         >                                          rq, amdgpu_sched_jobs);
>         >                if (r)
>         >                        goto failed;
>         > +
>         > +             ctx->rings[i].entity.ptr_guilty =
>         &ctx->guilty; /* kernel entity
>         > +doesn't have ptr_guilty */
>         >        }
>         >
>         >        return 0;
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > index 690ef3d..208da11 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct
>         amd_sched_job *s_job)
>         >   }
>         >
>         >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>         num_ibs,
>         > -                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm)
>         > +                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm, struct
>         > +amdgpu_ctx *ctx)
>         >   {
>         >        size_t size = sizeof(struct amdgpu_job);
>         >
>         > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device
>         *adev, unsigned num_ibs,
>         >        (*job)->vm = vm;
>         >        (*job)->ibs = (void *)&(*job)[1];
>         >        (*job)->num_ibs = num_ibs;
>         > +     (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>         >
>         >        amdgpu_sync_create(&(*job)->sync);
>         >
>         > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct
>         amdgpu_device *adev, unsigned size,
>         >   {
>         >        int r;
>         >
>         > -     r = amdgpu_job_alloc(adev, 1, job, NULL);
>         > +     r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>         >        if (r)
>         >                return r;
>         >
>         > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct
>         amdgpu_job *job)
>         >   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>         >   {
>         >        struct amdgpu_job *job = container_of(s_job, struct
>         amdgpu_job,
>         > base);
>         > +     struct amdgpu_ctx *ctx = job->ctx;
>         > +
>         > +     if (ctx)
>         > +             amdgpu_ctx_put(ctx);
>         >
>         >        fence_put(job->fence);
>         >        amdgpu_sync_free(&job->sync);
>         > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > index 6f4e31f..9100ca8 100644
>         > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct
>         amd_gpu_scheduler *sched,
>         >        if (!amd_sched_entity_is_initialized(sched, entity))
>         >                return;
>         >
>         > -     /**
>         > -      * The client will not queue more IBs during this
>         fini, consume existing
>         > -      * queued IBs
>         > -     */
>         > -     wait_event(sched->job_scheduled,
>         amd_sched_entity_is_idle(entity));
>         > -
>         >        amd_sched_rq_remove_entity(rq, entity);
>         >        kfifo_free(&entity->job_queue);
>         >   }
>         > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > index 8cb41d3..ccbbcb0 100644
>         > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > @@ -49,6 +49,7 @@ struct amd_sched_entity {
>         >
>         >        struct fence *dependency;
>         >        struct fence_cb                 cb;
>         > +     bool *ptr_guilty;
>         >   };
>         >
>         >   /**
>
>
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>         https://lists.freedesktop.org/mailman/listinfo/amd-gfx




* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                         ` <eb637720-5c9a-636b-237e-228b499ff3bb-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03  9:29                           ` zhoucm1
  2017-05-03  9:36                           ` Liu, Monk
  1 sibling, 0 replies; 28+ messages in thread
From: zhoucm1 @ 2017-05-03  9:29 UTC (permalink / raw)
  To: Christian König, Liu, Monk,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Somewhat off this thread's topic: it makes me think job->vm isn't safe 
either. The vm could be freed while it is still being used in 
amdgpu_ib_schedule(). Do you have any thoughts on how to solve that?

Regards,
David Zhou

On 2017-05-03 17:18, Christian König wrote:
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after 
>>> the app closes our FD. I don't see that amdgpu_vm_fini() depends on 
>>> the context still living ...
>
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>>         amdgpu_uvd_free_handles(adev, file_priv);
>>         amdgpu_vce_free_handles(adev, file_priv);
>>
>>         amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>>         if (amdgpu_sriov_vf(adev)) {
>>                 /* TODO: how to handle reserve failure */
>>                 BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>                 amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>                 fpriv->vm.csa_bo_va = NULL;
>>                 amdgpu_bo_unreserve(adev->virt.csa_obj);
>>         }
>>
>>         amdgpu_vm_fini(adev, &fpriv->vm);
>
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all 
> contexts of the current fd.
>
> If we don't release the context here because some jobs are still being 
> executed, we need to keep the UVD and VCE handles, the PRT VAs, the CSA 
> and even the whole VM structure alive.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give 
>> me your detailed idea
> Please take a look at David's idea of using the fence_context to find 
> which jobs and entities to skip; that is even better than my idea about 
> the fence status, and it should be trivial to implement because all the 
> data is already present, we just need to use it.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>> 1, My idea is that userspace should rather gather the feedback during 
>> the next command submission. This has the advantage that you don't 
>> need to keep userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw ring (cherry-pick out the guilty entities' 
>>> jobs in all rings) after a gpu reset, we need to fake-signal all 
>>> sched_fences in the guilty entity as well, and we need to mark the 
>>> context as guilty so the next IOCTL on it will return -ENODEV.
>>> I don't understand how your idea can solve my request ...
>> 2, You need to keep quite a bunch of stuff alive (VM, CSA) when you 
>> don't tear down the ctx immediately.
>>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after 
>>> the app closes our FD. I don't see that amdgpu_vm_fini() depends on 
>>> the context still living ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and 
>> a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you could give 
>>> me your detailed idea
>>
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>>> 1, This is necessary, otherwise how can I access the entity pointer
>>> after a job has timed out
>> No, that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the 
>> feedback/status from the job execution back to userspace when an error
>> (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during 
>> the next command submission. This has the advantage that you don't 
>> need to keep userspace alive till all jobs are done.
>>
>>>    , and why is it dangerous?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you 
>> don't tear down the ctx immediately.
>>
>> We could split ctx tear down into freeing the resources and freeing 
>> the structure, but I think just gathering the information needed on 
>> CS is easier to do.
>>
>>> 2, what's the status field in the fences you were referring to? I
>>> need to judge whether it could satisfy my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a 
>> status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>> 1, This is necessary, otherwise how can I access the entity pointer 
>>> after a job has timed out, and why is it dangerous?
>>> 2, what's the status field in the fences you were referring to? I
>>> need to judge whether it could satisfy my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>> for the TDR guilty-context feature, we need to access the ctx/s_entity
>>>> field members through the sched job pointer, so the ctx must be kept
>>>> alive until all jobs from it have signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead we have the status field designed into the fences, which 
>>> should be checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>     6 files changed, 23 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index e330009..8e031d6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>         uint32_t            flags;
>>>>     };
>>>>     +struct amdgpu_ctx;
>>>> +
>>>>     extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>         int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned 
>>>> num_ibs,
>>>> -             struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>> +             struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>> +amdgpu_ctx *ctx);
>>>>     int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, 
>>>> unsigned size,
>>>>                      struct amdgpu_job **job);
>>>>     @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>         struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv 
>>>> *fpriv, uint32_t id);
>>>>     int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>         uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, 
>>>> struct amdgpu_ring *ring,
>>>>                       struct fence *fence);
>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>         struct amdgpu_sync    sync;
>>>>         struct amdgpu_ib    *ibs;
>>>>         struct fence        *fence; /* the hw fence */
>>>> +    struct amdgpu_ctx *ctx;
>>>>         uint32_t        preamble_status;
>>>>         uint32_t        num_ibs;
>>>>         void            *owner;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 699f5fe..267fb65 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct 
>>>> amdgpu_cs_parser *p, void *data)
>>>>             }
>>>>         }
>>>>     -    ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>> +    ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>         if (ret)
>>>>             goto free_all_kdata;
>>>>     diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index b4bbbb3..81438af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -25,6 +25,13 @@
>>>>     #include <drm/drmP.h>
>>>>     #include "amdgpu.h"
>>>>     +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>> +    if (ctx)
>>>> +        kref_get(&ctx->refcount);
>>>> +    return ctx;
>>>> +}
>>>> +
>>>>     static int amdgpu_ctx_init(struct amdgpu_device *adev, struct 
>>>> amdgpu_ctx *ctx)
>>>>     {
>>>>         unsigned i, j;
>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device 
>>>> *adev, struct amdgpu_ctx *ctx)
>>>>                           rq, amdgpu_sched_jobs);
>>>>             if (r)
>>>>                 goto failed;
>>>> +
>>>> +        ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel 
>>>> entity
>>>> +doesn't have ptr_guilty */
>>>>         }
>>>>             return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..208da11 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct 
>>>> amd_sched_job *s_job)
>>>>     }
>>>>         int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned 
>>>> num_ibs,
>>>> -             struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>> +             struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>> +amdgpu_ctx *ctx)
>>>>     {
>>>>         size_t size = sizeof(struct amdgpu_job);
>>>>     @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device 
>>>> *adev, unsigned num_ibs,
>>>>         (*job)->vm = vm;
>>>>         (*job)->ibs = (void *)&(*job)[1];
>>>>         (*job)->num_ibs = num_ibs;
>>>> +    (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>             amdgpu_sync_create(&(*job)->sync);
>>>>     @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct 
>>>> amdgpu_device *adev, unsigned size,
>>>>     {
>>>>         int r;
>>>>     -    r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>> +    r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>         if (r)
>>>>             return r;
>>>>     @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct 
>>>> amdgpu_job *job)
>>>>     static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>     {
>>>>         struct amdgpu_job *job = container_of(s_job, struct 
>>>> amdgpu_job,
>>>> base);
>>>> +    struct amdgpu_ctx *ctx = job->ctx;
>>>> +
>>>> +    if (ctx)
>>>> +        amdgpu_ctx_put(ctx);
>>>>             fence_put(job->fence);
>>>>         amdgpu_sync_free(&job->sync);
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 6f4e31f..9100ca8 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct 
>>>> amd_gpu_scheduler *sched,
>>>>         if (!amd_sched_entity_is_initialized(sched, entity))
>>>>             return;
>>>>     -    /**
>>>> -     * The client will not queue more IBs during this fini, 
>>>> consume existing
>>>> -     * queued IBs
>>>> -    */
>>>> -    wait_event(sched->job_scheduled, 
>>>> amd_sched_entity_is_idle(entity));
>>>> -
>>>>         amd_sched_rq_remove_entity(rq, entity);
>>>>         kfifo_free(&entity->job_queue);
>>>>     }
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> index 8cb41d3..ccbbcb0 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>             struct fence            *dependency;
>>>>         struct fence_cb            cb;
>>>> +    bool *ptr_guilty;
>>>>     };
>>>>         /**


* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                         ` <eb637720-5c9a-636b-237e-228b499ff3bb-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  2017-05-03  9:29                           ` zhoucm1
@ 2017-05-03  9:36                           ` Liu, Monk
       [not found]                             ` <DM5PR12MB161020C82674A01805B8C8D384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03  9:36 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait, because amdgpu_ctx_do_release()
won't run while the kref is > 0, i.e. before all jobs have signaled.

That way amdgpu_driver_postclose_kms() can continue,
so the "UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure" actually won't be kept alive, and the irony is that I want them kept alive as well (especially the CSA and PRT)


BR Monk




-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Wednesday, May 03, 2017 5:19 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context still living ...

See the teardown order in amdgpu_driver_postclose_kms():
> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>
>         amdgpu_uvd_free_handles(adev, file_priv);
>         amdgpu_vce_free_handles(adev, file_priv);
>
>         amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>
>         if (amdgpu_sriov_vf(adev)) {
>                 /* TODO: how to handle reserve failure */
>                 BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>                 amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>                 fpriv->vm.csa_bo_va = NULL;
>                 amdgpu_bo_unreserve(adev->virt.csa_obj);
>         }
>
>         amdgpu_vm_fini(adev, &fpriv->vm);

amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.

If we don't release the context here because some jobs are still being executed, we need to keep the UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure alive.

> I'll see if dma_fence could solve my issue, but I wish you could give me 
> your detailed idea
Please take a look at David's idea of using the fence_context to find which jobs and entities to skip; that is even better than my idea about the fence status, and it should be trivial to implement because all the data is already present, we just need to use it.

Regards,
Christian.

Am 03.05.2017 um 11:08 schrieb Liu, Monk:
> 1, My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>
>> No, we need to clean the hw ring (cherry-pick out the guilty entities' jobs in all rings) after a gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>> I don't understand how your idea can solve my request ...
> 2, You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>
>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context still living ...
> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>
> The Intel guys did this because they ran into exactly the same problem.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give 
>> me your detailed idea
>
> BR Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 4:59 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
>> 1, This is necessary, otherwise how can I access the entity pointer 
>> after a job has timed out
> No, that isn't necessary.
>
> The problem with your idea is that you want to actively push the 
> feedback/status from the job execution back to userspace when an error
> (timeout) happens.
>
> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>
>>    , and why is it dangerous?
> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>
> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>
>> 2, what's the status field in the fences you were referring to? I 
>> need to judge whether it could satisfy my requirement
> struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>
> The Intel guys did this because they ran into exactly the same problem.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>> 1, This is necessary, otherwise how can I access the entity pointer after a job has timed out, and why is it dangerous?
>> 2, what's the status field in the fences you were referring to? I 
>> need to judge whether it could satisfy my requirement
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Monday, May 01, 2017 10:48 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>> finished
>>
>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>> for the TDR guilty-context feature, we need to access the ctx/s_entity 
>>> field members through the sched job pointer, so the ctx must be kept 
>>> alive until all jobs from it have signaled.
>> NAK, that is unnecessary and quite dangerous.
>>
>> Instead we have the status field designed into the fences, which should be checked for that.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>     6 files changed, 23 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index e330009..8e031d6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>     	uint32_t			flags;
>>>     };
>>>     
>>> +struct amdgpu_ctx;
>>> +
>>>     extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>     
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>> +amdgpu_ctx *ctx);
>>>     int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>     			     struct amdgpu_job **job);
>>>     
>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>     
>>>     struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>     int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>     
>>>     uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>     			      struct fence *fence);
>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>     	struct amdgpu_sync	sync;
>>>     	struct amdgpu_ib	*ibs;
>>>     	struct fence		*fence; /* the hw fence */
>>> +	struct amdgpu_ctx *ctx;
>>>     	uint32_t		preamble_status;
>>>     	uint32_t		num_ibs;
>>>     	void			*owner;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 699f5fe..267fb65 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     		}
>>>     	}
>>>     
>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>     	if (ret)
>>>     		goto free_all_kdata;
>>>     
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index b4bbbb3..81438af 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -25,6 +25,13 @@
>>>     #include <drm/drmP.h>
>>>     #include "amdgpu.h"
>>>     
>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>> +	if (ctx)
>>> +		kref_get(&ctx->refcount);
>>> +	return ctx;
>>> +}
>>> +
>>>     static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     {
>>>     	unsigned i, j;
>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     					  rq, amdgpu_sched_jobs);
>>>     		if (r)
>>>     			goto failed;
>>> +
>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>>>     	}
>>>     
>>>     	return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 690ef3d..208da11 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>     }
>>>     
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>>>     {
>>>     	size_t size = sizeof(struct amdgpu_job);
>>>     
>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>     	(*job)->vm = vm;
>>>     	(*job)->ibs = (void *)&(*job)[1];
>>>     	(*job)->num_ibs = num_ibs;
>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>     
>>>     	amdgpu_sync_create(&(*job)->sync);
>>>     
>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>     {
>>>     	int r;
>>>     
>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>     	if (r)
>>>     		return r;
>>>     
>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>     static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>     {
>>>     	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>> +
>>> +	if (ctx)
>>> +		amdgpu_ctx_put(ctx);
>>>     
>>>     	fence_put(job->fence);
>>>     	amdgpu_sync_free(&job->sync);
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 6f4e31f..9100ca8 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>     	if (!amd_sched_entity_is_initialized(sched, entity))
>>>     		return;
>>>     
>>> -	/**
>>> -	 * The client will not queue more IBs during this fini, consume existing
>>> -	 * queued IBs
>>> -	*/
>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>> -
>>>     	amd_sched_rq_remove_entity(rq, entity);
>>>     	kfifo_free(&entity->job_queue);
>>>     }
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> index 8cb41d3..ccbbcb0 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>     
>>>     	struct fence			*dependency;
>>>     	struct fence_cb			cb;
>>> +	bool *ptr_guilty;
>>>     };
>>>     
>>>     /**
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                             ` <DM5PR12MB161020C82674A01805B8C8D384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03 12:49                               ` Christian König
       [not found]                                 ` <cefbc7ee-36a7-3aba-7b4a-102a5a0f2e22-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03 12:49 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> and the ironic thing is I want them to stay alive as well (especially the CSA and PRT)
Yes, and exactly that is the danger I was talking about. We messed up 
the teardown order with that and try to access resources which are 
already freed by the time the job is eventually scheduled.

I would rather say we should get completely rid of the ctx kref 
counting, that was a rather bad idea in the first place.
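To illustrate the concern, here is a minimal userspace-style sketch (hypothetical; the names and the simplified refcount are illustrative stand-ins, not the kernel's kref API): once every in-flight job pins the context with a reference, the final teardown no longer happens at fd close but at some later, unpredictable point.

```c
#include <assert.h>

/* Simplified stand-in for a refcounted amdgpu_ctx. */
struct ctx {
	int refcount;
	int released;	/* set once the release path has run */
};

static void ctx_get(struct ctx *c)
{
	c->refcount++;
}

static void ctx_put(struct ctx *c)
{
	if (--c->refcount == 0)
		c->released = 1;	/* real code would tear down VM/CSA here */
}

/*
 * Each queued job takes a reference (as the patch proposes), then
 * userspace drops its own handle. Returns whether teardown ran.
 */
static int close_with_pending_jobs(struct ctx *c, int njobs)
{
	for (int i = 0; i < njobs; i++)
		ctx_get(c);		/* one ref per in-flight job */
	ctx_put(c);			/* fd close drops the user reference */
	return c->released;		/* 0: teardown deferred past close */
}
```

With jobs still pending, close_with_pending_jobs() returns 0: the context (and everything reachable from it) outlives the fd, which is exactly the tear-down-order hazard described above.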

Regards,
Christian.

On 03.05.2017 at 11:36, Liu, Monk wrote:
> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait ... because "amdgpu_ctx_do_release"
> isn't going to run (kref > 0 until all jobs have signaled).
>
> That way amdgpu_driver_postclose_kms() can continue,
> so actually the "UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure" won't be kept alive, and the ironic thing is I want them to stay alive as well (especially the CSA and PRT)
>
>
> BR Monk
>
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 5:19 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive ...
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>>          amdgpu_uvd_free_handles(adev, file_priv);
>>          amdgpu_vce_free_handles(adev, file_priv);
>>
>>          amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>>          if (amdgpu_sriov_vf(adev)) {
>>                  /* TODO: how to handle reserve failure */
>>                  BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>                  amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>                  fpriv->vm.csa_bo_va = NULL;
>>                  amdgpu_bo_unreserve(adev->virt.csa_obj);
>>          }
>>
>>          amdgpu_vm_fini(adev, &fpriv->vm);
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>
> If we don't release the context here because some jobs are still being executed, we need to keep the UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure alive.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give me
>> your detailed idea
> Please take a look at David's idea of using the fence_context to find which jobs and entities to skip; that is even better than my idea about the fence status, and it should be trivial to implement because all the data is already present, we just need to use it.
>
> Regards,
> Christian.
>
> On 03.05.2017 at 11:08, Liu, Monk wrote:
>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw rings (cherry-pick the guilty entity's jobs out of all rings) after a gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so that the next IOCTL on it will return -ENODEV.
>>> I don't understand how your idea solves my requirement ...
>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you could give
>>> me your detailed idea
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>> finished
>>
>>> 1, This is necessary, otherwise how can I access the entity pointer after
>>> a job has timed out
>> No, that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the
>> feedback/status from the job execution back to userspace when an error
>> (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>
>>>     , and why it is dangerous ?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>
>>> 2, what's the status field in the fences you were referring to ? I
>>> need to judge if it could satisfy my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>> Regards,
>> Christian.
>>
>> On 03.05.2017 at 05:30, Liu, Monk wrote:
>>> 1, This is necessary, otherwise how can I access the entity pointer after a job has timed out, and why is it dangerous?
>>> 2, what's the status field in the fences you were referring to ? I
>>> need to judge if it could satisfy my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>> On 01.05.2017 at 09:22, Monk Liu wrote:
>>>> for the TDR guilty-context feature, we need to access ctx/s_entity
>>>> fields through the sched job pointer, so the ctx must be kept alive
>>>> until all jobs from it have signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead we have the designed status field in the fences which should be checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>      6 files changed, 23 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index e330009..8e031d6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>      	uint32_t			flags;
>>>>      };
>>>>      
>>>> +struct amdgpu_ctx;
>>>> +
>>>>      extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>> +amdgpu_ctx *ctx);
>>>>      int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>      			     struct amdgpu_job **job);
>>>>      
>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>      
>>>>      struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>      int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>      
>>>>      uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>      			      struct fence *fence);
>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>      	struct amdgpu_sync	sync;
>>>>      	struct amdgpu_ib	*ibs;
>>>>      	struct fence		*fence; /* the hw fence */
>>>> +	struct amdgpu_ctx *ctx;
>>>>      	uint32_t		preamble_status;
>>>>      	uint32_t		num_ibs;
>>>>      	void			*owner;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 699f5fe..267fb65 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>      		}
>>>>      	}
>>>>      
>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>      	if (ret)
>>>>      		goto free_all_kdata;
>>>>      
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index b4bbbb3..81438af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -25,6 +25,13 @@
>>>>      #include <drm/drmP.h>
>>>>      #include "amdgpu.h"
>>>>      
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>> +	if (ctx)
>>>> +		kref_get(&ctx->refcount);
>>>> +	return ctx;
>>>> +}
>>>> +
>>>>      static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>      {
>>>>      	unsigned i, j;
>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>      					  rq, amdgpu_sched_jobs);
>>>>      		if (r)
>>>>      			goto failed;
>>>> +
>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>>>>      	}
>>>>      
>>>>      	return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..208da11 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>      }
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>>>>      {
>>>>      	size_t size = sizeof(struct amdgpu_job);
>>>>      
>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>      	(*job)->vm = vm;
>>>>      	(*job)->ibs = (void *)&(*job)[1];
>>>>      	(*job)->num_ibs = num_ibs;
>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>      
>>>>      	amdgpu_sync_create(&(*job)->sync);
>>>>      
>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>      {
>>>>      	int r;
>>>>      
>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>      	if (r)
>>>>      		return r;
>>>>      
>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>      static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>      {
>>>>      	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>> +
>>>> +	if (ctx)
>>>> +		amdgpu_ctx_put(ctx);
>>>>      
>>>>      	fence_put(job->fence);
>>>>      	amdgpu_sync_free(&job->sync);
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 6f4e31f..9100ca8 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>      	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>      		return;
>>>>      
>>>> -	/**
>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>> -	 * queued IBs
>>>> -	*/
>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>> -
>>>>      	amd_sched_rq_remove_entity(rq, entity);
>>>>      	kfifo_free(&entity->job_queue);
>>>>      }
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> index 8cb41d3..ccbbcb0 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>      
>>>>      	struct fence			*dependency;
>>>>      	struct fence_cb			cb;
>>>> +	bool *ptr_guilty;
>>>>      };
>>>>      
>>>>      /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                 ` <cefbc7ee-36a7-3aba-7b4a-102a5a0f2e22-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03 13:31                                   ` Liu, Monk
       [not found]                                     ` <DM5PR12MB1610C0502DE515B570B2F4C984160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03 13:31 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Even if we release the ctx the usual way,
can we guarantee the PDEs/PTEs and PRT/CSA (BOs, mappings) are all still alive when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?

You know an app can submit a command, release all its BOs, free the ctx, close the FD/VM, and exit very soon; it just doesn't wait for the fence to signal.
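The race described above can be sketched like this (hypothetical names; the flags only model object lifetime, not real driver structures): the app tears everything down right after submitting, so when the job completes later, the page tables it references may already be gone unless something keeps them pinned.

```c
#include <assert.h>

struct page_table { int freed; };

struct job {
	struct page_table *pt;	/* page tables this job's GPU work uses */
	int signaled;
};

/* App path: submit, then free everything and exit without waiting. */
static void app_teardown(struct page_table *pt)
{
	pt->freed = 1;
}

/* Completion path: returns 0 if the job raced with the teardown. */
static int job_complete(struct job *j)
{
	j->signaled = 1;
	return !j->pt->freed;
}
```

If app_teardown() runs before job_complete(), the completion touches already-freed state, which is the scenario the question above is probing.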

BR Monk

-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Wednesday, May 03, 2017 8:50 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> and the ironic thing is I want them to stay alive as well (especially the CSA and PRT)
Yes, and exactly that is the danger I was talking about. We messed up the teardown order with that and try to access resources which are already freed by the time the job is eventually scheduled.

I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.

Regards,
Christian.

On 03.05.2017 at 11:36, Liu, Monk wrote:
> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait ... because "amdgpu_ctx_do_release"
> isn't going to run (kref > 0 until all jobs have signaled).
>
> That way amdgpu_driver_postclose_kms() can continue, so
> actually the "UVD and VCE handles, the PRT VAs, the CSA and even the whole
> VM structure" won't be kept alive, and the ironic thing is I want them to
> stay alive as well (especially the CSA and PRT)
>
>
> BR Monk
>
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 5:19 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive ...
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>>          amdgpu_uvd_free_handles(adev, file_priv);
>>          amdgpu_vce_free_handles(adev, file_priv);
>>
>>          amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>>          if (amdgpu_sriov_vf(adev)) {
>>                  /* TODO: how to handle reserve failure */
>>                  BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>                  amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>                  fpriv->vm.csa_bo_va = NULL;
>>                  amdgpu_bo_unreserve(adev->virt.csa_obj);
>>          }
>>
>>          amdgpu_vm_fini(adev, &fpriv->vm);
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>
> If we don't release the context here because some jobs are still being executed, we need to keep the UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure alive.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give
>> me your detailed idea
> Please take a look at David's idea of using the fence_context to find which jobs and entities to skip; that is even better than my idea about the fence status, and it should be trivial to implement because all the data is already present, we just need to use it.
>
> Regards,
> Christian.
>
> On 03.05.2017 at 11:08, Liu, Monk wrote:
>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw rings (cherry-pick the guilty entity's jobs out of all rings) after a gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so that the next IOCTL on it will return -ENODEV.
>>> I don't understand how your idea solves my requirement ...
>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on whether the context is alive ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you could give
>>> me your detailed idea
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>> finished
>>
>>> 1, This is necessary, otherwise how can I access the entity pointer after
>>> a job has timed out
>> No, that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the 
>> feedback/status from the job execution back to userspace when an 
>> error
>> (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>
>>>     , and why it is dangerous ?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>
>>> 2, what's the status field in the fences you were referring to ? I 
>>> need to judge if it could satisfy my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>> Regards,
>> Christian.
>>
>> On 03.05.2017 at 05:30, Liu, Monk wrote:
>>> 1, This is necessary, otherwise how can I access the entity pointer after a job has timed out, and why is it dangerous?
>>> 2, what's the status field in the fences you were referring to ? I 
>>> need to judge if it could satisfy my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>> finished
>>>
>>> On 01.05.2017 at 09:22, Monk Liu wrote:
>>>> for the TDR guilty-context feature, we need to access ctx/s_entity
>>>> fields through the sched job pointer, so the ctx must be kept alive
>>>> until all jobs from it have signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead we have the designed status field in the fences which should be checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>      6 files changed, 23 insertions(+), 10 deletions(-)
>>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                     ` <DM5PR12MB1610C0502DE515B570B2F4C984160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03 13:34                                       ` Christian König
       [not found]                                         ` <200bd9aa-1374-69be-c155-689013ba49c5-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03 13:34 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Can we guarantee the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
Yes, that's why we add all fences of each command submission to the 
PD/PT BOs.

Regards,
Christian.

On 03.05.2017 at 15:31, Liu, Monk wrote:
> Even if we release the ctx in the usual way,
> can we guarantee the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
>
> You know an app can submit a command, release all BOs, free_ctx, close the FD/VM, and exit very soon; it just doesn't wait for the fence to signal
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 8:50 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>> and the ironic thing is I want them alive as well (especially CSA, PTR)
> Yes, and exactly that is the danger I was talking about. We messed up the tear down order with that and try to access resources which are already freed when the job is finally scheduled.
>
> I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.
>
> Regards,
> Christian.
>
> On 03.05.2017 at 11:36, Liu, Monk wrote:
>> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait, because amdgpu_ctx_do_release
>> won't run (kref > 0 before all jobs are signaled).
>>
>> That way amdgpu_driver_postclose_kms() can continue, so
>> the "UVD and VCE handles, the PRT VAs, the CSA and even the whole
>> VM structure" actually won't be kept alive; the ironic thing is that I
>> want them alive as well (especially CSA, PTR)
>>
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 5:19 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>> finished
>>
>>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close our FD, I don't see amdgpu_vm_fini() is depended on context living or not ...
>> See the teardown order in amdgpu_driver_postclose_kms():
>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>>
>>>           amdgpu_uvd_free_handles(adev, file_priv);
>>>           amdgpu_vce_free_handles(adev, file_priv);
>>>
>>>           amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>
>>>           if (amdgpu_sriov_vf(adev)) {
>>>                   /* TODO: how to handle reserve failure */
>>>                   BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>>                   amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>>                   fpriv->vm.csa_bo_va = NULL;
>>>                   amdgpu_bo_unreserve(adev->virt.csa_obj);
>>>           }
>>>
>>>           amdgpu_vm_fini(adev, &fpriv->vm);
>> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>>
>> If we don't release the context here because some jobs are still executed we need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole VM structure alive.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you can give
>>> me your detail idea
>> Please take a look at David's idea of using the fence_context to find which jobs and entity to skip, that is even better than mine about the fence status and should be trivial to implement because all the data is already present we just need to use it.
>>
>> Regards,
>> Christian.
>>
>> On 03.05.2017 at 11:08, Liu, Monk wrote:
>>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>
>>>> No, we need to clean the hw ring (cherry-pick out the guilty entities' jobs in all rings) after gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>>> I don't understand how your idea can solve my request ...
>>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>
>>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close our FD, I don't see amdgpu_vm_fini() is depended on context living or not ...
>>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>
>>> The Intel guys did this because they ran into the exactly same problem.
>>>
>>>> I'll see if dma_fence could solve my issue, but I wish you can give
>>>> me your detail idea
>>> BR Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Wednesday, May 03, 2017 4:59 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>>> 1, This is necessary otherwise how can I access entity pointer after
>>>> a job timedout
>>> No that isn't necessary.
>>>
>>> The problem with your idea is that you want to actively push the
>>> feedback/status from the job execution back to userspace when an
>>> error
>>> (timeout) happens.
>>>
>>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>
>>>>      , and why it is dangerous ?
>>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>
>>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>>
>>>> 2, what's the status field in the fences you were referring to ? I
>>>> need to judge if it could satisfy my requirement
>>> struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>
>>> The Intel guys did this because they ran into the exactly same problem.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 03.05.2017 at 05:30, Liu, Monk wrote:
>>>> 1, This is necessary otherwise how can I access entity pointer after a job timedout , and why it is dangerous ?
>>>> 2, what's the status field in the fences you were referring to ? I
>>>> need to judge if it could satisfy my requirement
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>> Sent: Monday, May 01, 2017 10:48 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>> finished
>>>>
>>>> On 01.05.2017 at 09:22, Monk Liu wrote:
>>>>> for TDR guilty context feature, we need access ctx/s_entity field
>>>>> member through sched job pointer,so ctx must keep alive till all
>>>>> job from it signaled.
>>>> NAK, that is unnecessary and quite dangerous.
>>>>
>>>> Instead we have the designed status field in the fences which should be checked for that.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>>       drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>>       drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>>       6 files changed, 23 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index e330009..8e031d6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>>       	uint32_t			flags;
>>>>>       };
>>>>>       
>>>>> +struct amdgpu_ctx;
>>>>> +
>>>>>       extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>>       
>>>>>       int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>>> +amdgpu_ctx *ctx);
>>>>>       int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>       			     struct amdgpu_job **job);
>>>>>       
>>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>>       
>>>>>       struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>>       int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>>       
>>>>>       uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>>       			      struct fence *fence); @@ -1129,6 +1132,7 @@ struct
>>>>> amdgpu_job {
>>>>>       	struct amdgpu_sync	sync;
>>>>>       	struct amdgpu_ib	*ibs;
>>>>>       	struct fence		*fence; /* the hw fence */
>>>>> +	struct amdgpu_ctx *ctx;
>>>>>       	uint32_t		preamble_status;
>>>>>       	uint32_t		num_ibs;
>>>>>       	void			*owner;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 699f5fe..267fb65 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>       		}
>>>>>       	}
>>>>>       
>>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>>       	if (ret)
>>>>>       		goto free_all_kdata;
>>>>>       
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> index b4bbbb3..81438af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> @@ -25,6 +25,13 @@
>>>>>       #include <drm/drmP.h>
>>>>>       #include "amdgpu.h"
>>>>>       
>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>>> +	if (ctx)
>>>>> +		kref_get(&ctx->refcount);
>>>>> +	return ctx;
>>>>> +}
>>>>> +
>>>>>       static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>       {
>>>>>       	unsigned i, j;
>>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>       					  rq, amdgpu_sched_jobs);
>>>>>       		if (r)
>>>>>       			goto failed;
>>>>> +
>>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity
>>>>> +doesn't have ptr_guilty */
>>>>>       	}
>>>>>       
>>>>>       	return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 690ef3d..208da11 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>>       }
>>>>>       
>>>>>       int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>>> +amdgpu_ctx *ctx)
>>>>>       {
>>>>>       	size_t size = sizeof(struct amdgpu_job);
>>>>>       
>>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>       	(*job)->vm = vm;
>>>>>       	(*job)->ibs = (void *)&(*job)[1];
>>>>>       	(*job)->num_ibs = num_ibs;
>>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>>       
>>>>>       	amdgpu_sync_create(&(*job)->sync);
>>>>>       
>>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>       {
>>>>>       	int r;
>>>>>       
>>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>>       	if (r)
>>>>>       		return r;
>>>>>       
>>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>>       static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>>       {
>>>>>       	struct amdgpu_job *job = container_of(s_job, struct
>>>>> amdgpu_job, base);
>>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>>> +
>>>>> +	if (ctx)
>>>>> +		amdgpu_ctx_put(ctx);
>>>>>       
>>>>>       	fence_put(job->fence);
>>>>>       	amdgpu_sync_free(&job->sync); diff --git
>>>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> index 6f4e31f..9100ca8 100644
>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>       	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>>       		return;
>>>>>       
>>>>> -	/**
>>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>>> -	 * queued IBs
>>>>> -	*/
>>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>>> -
>>>>>       	amd_sched_rq_remove_entity(rq, entity);
>>>>>       	kfifo_free(&entity->job_queue);
>>>>>       }
>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> index 8cb41d3..ccbbcb0 100644
>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>>       
>>>>>       	struct fence			*dependency;
>>>>>       	struct fence_cb			cb;
>>>>> +	bool *ptr_guilty;
>>>>>       };
>>>>>       
>>>>>       /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                         ` <200bd9aa-1374-69be-c155-689013ba49c5-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03 13:42                                           ` Liu, Monk
       [not found]                                             ` <DM5PR12MB1610435D144871B4CC2D88AF84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03 13:42 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

That should be the way you said, but I didn't see the logic that assures it.

If the kref of a BO goes down to 0, that BO will be destroyed (amdgpu_bo_destroy). I don't see any code that prevents this destroy from being invoked while the resv of the BO still has unsignaled fences; can you share how that trick works?

Because I found that if a job hangs and my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir doesn't get destroyed...

But a BO created through GEM by the app can be destroyed as expected.

BR Monk

-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Wednesday, May 03, 2017 9:34 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> Can we guarantee the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
Yes, that's why we add all fences of each command submission to the PD/PT BOs.

Regards,
Christian.

On 03.05.2017 at 15:31, Liu, Monk wrote:
> Even if we release the ctx in the usual way,
> can we guarantee the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
>
> You know an app can submit a command, release all BOs, free_ctx, close
> the FD/VM, and exit very soon; it just doesn't wait for the fence to signal
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 8:50 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
>> and the ironic thing is I want them alive as well (especially CSA, PTR)
> Yes, and exactly that is the danger I was talking about. We messed up the tear down order with that and try to access resources which are already freed when the job is finally scheduled.
>
> I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.
>
> Regards,
> Christian.
>
> On 03.05.2017 at 11:36, Liu, Monk wrote:
>> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait, because amdgpu_ctx_do_release
>> won't run (kref > 0 before all jobs are signaled).
>>
>> That way amdgpu_driver_postclose_kms() can continue, so
>> the "UVD and VCE handles, the PRT VAs, the CSA and even the
>> whole VM structure" actually won't be kept alive; the ironic thing is
>> that I want them alive as well (especially CSA, PTR)
>>
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 5:19 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>> finished
>>
>>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close our FD, I don't see amdgpu_vm_fini() is depended on context living or not ...
>> See the teardown order in amdgpu_driver_postclose_kms():
>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>>
>>>           amdgpu_uvd_free_handles(adev, file_priv);
>>>           amdgpu_vce_free_handles(adev, file_priv);
>>>
>>>           amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>
>>>           if (amdgpu_sriov_vf(adev)) {
>>>                   /* TODO: how to handle reserve failure */
>>>                   BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>>                   amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>>                   fpriv->vm.csa_bo_va = NULL;
>>>                   amdgpu_bo_unreserve(adev->virt.csa_obj);
>>>           }
>>>
>>>           amdgpu_vm_fini(adev, &fpriv->vm);
>> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>>
>> If we don't release the context here because some jobs are still executed we need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole VM structure alive.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you can give 
>>> me your detail idea
>> Please take a look at David's idea of using the fence_context to find which jobs and entity to skip, that is even better than mine about the fence status and should be trivial to implement because all the data is already present we just need to use it.
>>
>> Regards,
>> Christian.
>>
>> On 03.05.2017 at 11:08, Liu, Monk wrote:
>>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>
>>>> No, we need to clean the hw ring (cherry-pick out the guilty entities' jobs in all rings) after gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>>> I don't understand how your idea can solve my request ...
>>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>
>>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close our FD, I don't see amdgpu_vm_fini() is depended on context living or not ...
>>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>
>>> The Intel guys did this because they ran into the exactly same problem.
>>>
>>>> I'll see if dma_fence could solve my issue, but I wish you can give 
>>>> me your detail idea
>>> BR Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Wednesday, May 03, 2017 4:59 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>> finished
>>>
>>>> 1, This is necessary otherwise how can I access entity pointer 
>>>> after a job timedout
>>> No that isn't necessary.
>>>
>>> The problem with your idea is that you want to actively push the 
>>> feedback/status from the job execution back to userspace when an 
>>> error
>>> (timeout) happens.
>>>
>>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>
>>>>      , and why it is dangerous ?
>>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>
>>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>>
>>>> 2, what's the status field in the fences you were referring to ? I 
>>>> need to judge if it could satisfy my requirement
>>> struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>
>>> The Intel guys did this because they ran into the exactly same problem.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 03.05.2017 at 05:30, Liu, Monk wrote:
>>>> 1, This is necessary otherwise how can I access entity pointer after a job timedout , and why it is dangerous ?
>>>> 2, what's the status field in the fences you were referring to ? I 
>>>> need to judge if it could satisfy my requirement
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>> Sent: Monday, May 01, 2017 10:48 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>>> finished
>>>>
>>>> On 01.05.2017 at 09:22, Monk Liu wrote:
>>>>> for TDR guilty context feature, we need access ctx/s_entity field 
>>>>> member through sched job pointer,so ctx must keep alive till all 
>>>>> job from it signaled.
>>>> NAK, that is unnecessary and quite dangerous.
>>>>
>>>> Instead we have the designed status field in the fences which should be checked for that.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>>       drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>>       drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>>       6 files changed, 23 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index e330009..8e031d6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>>       	uint32_t			flags;
>>>>>       };
>>>>>       
>>>>> +struct amdgpu_ctx;
>>>>> +
>>>>>       extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>>       
>>>>>       int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>>>> +amdgpu_ctx *ctx);
>>>>>       int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>       			     struct amdgpu_job **job);
>>>>>       
>>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>>       
>>>>>       struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>>       int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>>       
>>>>>       uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>>       			      struct fence *fence); @@ -1129,6 +1132,7 @@ struct 
>>>>> amdgpu_job {
>>>>>       	struct amdgpu_sync	sync;
>>>>>       	struct amdgpu_ib	*ibs;
>>>>>       	struct fence		*fence; /* the hw fence */
>>>>> +	struct amdgpu_ctx *ctx;
>>>>>       	uint32_t		preamble_status;
>>>>>       	uint32_t		num_ibs;
>>>>>       	void			*owner;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 699f5fe..267fb65 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>       		}
>>>>>       	}
>>>>>       
>>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>>       	if (ret)
>>>>>       		goto free_all_kdata;
>>>>>       
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> index b4bbbb3..81438af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> @@ -25,6 +25,13 @@
>>>>>       #include <drm/drmP.h>
>>>>>       #include "amdgpu.h"
>>>>>       
>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>>> +	if (ctx)
>>>>> +		kref_get(&ctx->refcount);
>>>>> +	return ctx;
>>>>> +}
>>>>> +
>>>>>       static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>       {
>>>>>       	unsigned i, j;
>>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>       					  rq, amdgpu_sched_jobs);
>>>>>       		if (r)
>>>>>       			goto failed;
>>>>> +
>>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel 
>>>>> +entity doesn't have ptr_guilty */
>>>>>       	}
>>>>>       
>>>>>       	return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 690ef3d..208da11 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>>       }
>>>>>       
>>>>>       int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>>>> +amdgpu_ctx *ctx)
>>>>>       {
>>>>>       	size_t size = sizeof(struct amdgpu_job);
>>>>>       
>>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>       	(*job)->vm = vm;
>>>>>       	(*job)->ibs = (void *)&(*job)[1];
>>>>>       	(*job)->num_ibs = num_ibs;
>>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>>       
>>>>>       	amdgpu_sync_create(&(*job)->sync);
>>>>>       
>>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>       {
>>>>>       	int r;
>>>>>       
>>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>>       	if (r)
>>>>>       		return r;
>>>>>       
>>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>>       static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>>       {
>>>>>       	struct amdgpu_job *job = container_of(s_job, struct 
>>>>> amdgpu_job, base);
>>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>>> +
>>>>> +	if (ctx)
>>>>> +		amdgpu_ctx_put(ctx);
>>>>>       
>>>>>       	fence_put(job->fence);
>>>>>       	amdgpu_sync_free(&job->sync); diff --git 
>>>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> index 6f4e31f..9100ca8 100644
>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>       	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>>       		return;
>>>>>       
>>>>> -	/**
>>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>>> -	 * queued IBs
>>>>> -	*/
>>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>>> -
>>>>>       	amd_sched_rq_remove_entity(rq, entity);
>>>>>       	kfifo_free(&entity->job_queue);
>>>>>       }
>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> index 8cb41d3..ccbbcb0 100644
>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>>       
>>>>>       	struct fence			*dependency;
>>>>>       	struct fence_cb			cb;
>>>>> +	bool *ptr_guilty;
>>>>>       };
>>>>>       
>>>>>       /**
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                             ` <DM5PR12MB1610435D144871B4CC2D88AF84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03 14:04                                               ` Christian König
       [not found]                                                 ` <44d1cc5a-a064-322a-15a6-08015378311c-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2017-05-03 14:04 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> If the kref of a BO goes down to 0, the BO will be destroyed (amdgpu_bo_destroy); I don't see any code that prevents this destroy from being invoked while the resv of this BO
> still has unsignaled fences. Can you share that trick?
IIRC the destroy itself is not prevented, but TTM prevents reuse of 
the memory in question until the last fence is signaled.

I'd need to dig through the TTM code as well to find it, but it is 
something very basic in TTM, so I'm pretty sure it works as expected.
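[Editor's note: to make the lifetime split above concrete, here is a standalone C sketch of the behavior being described: dropping the last reference destroys the BO object itself, but the backing memory only becomes reusable once the last fence on its reservation has signaled. All names (`model_bo`, `model_fence`, …) are invented for illustration; this is not the actual TTM code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: destroying the BO is not prevented, but the
 * backing memory is only handed back once the last fence signals. */

struct model_fence {
	bool signaled;
};

struct model_bo {
	int kref;
	struct model_fence *fence;  /* last fence on the BO's resv */
	bool object_freed;          /* the struct itself */
	bool memory_reusable;       /* the VRAM/GTT backing store */
};

static void model_bo_destroy(struct model_bo *bo)
{
	bo->object_freed = true;
	/* memory is reclaimed lazily, not necessarily here */
	if (!bo->fence || bo->fence->signaled)
		bo->memory_reusable = true;
}

static void model_bo_put(struct model_bo *bo)
{
	if (--bo->kref == 0)
		model_bo_destroy(bo);
}

/* called when the hw fence signals, e.g. from a fence callback */
static void model_fence_signal(struct model_bo *bo)
{
	bo->fence->signaled = true;
	if (bo->object_freed)
		bo->memory_reusable = true;  /* delayed reclaim */
}
```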

> Because I found that if a job hangs, after my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ....
Puh, good question. Sounds like we somehow messed up the reference count 
or a fence isn't signaled when it should be.

> But a BO created through GEM by the app can be destroyed as expected
Mhm, there isn't much difference between the two regarding this.

No idea off hand what that could be; I would need to recreate the issue 
and take a look myself.

Regards,
Christian.
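[Editor's note: the "status field in the fences" referred to further down in this thread is, if I remember correctly, what newer kernels expose as dma_fence::error, set via dma_fence_set_error() before signaling. A minimal standalone model of that idea follows; all names are invented and only mirror, but are not, the real dma_fence API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the dma_fence error/status idea: the driver marks a fence
 * with an error code before signaling it, and a later CS can query
 * that status instead of the ctx being kept alive to carry it. */

struct model_dma_fence {
	bool signaled;
	int error;  /* 0 on success, negative errno for a guilty job */
};

static void model_fence_set_error(struct model_dma_fence *f, int error)
{
	/* only legal before the fence signals, as in the real API */
	assert(!f->signaled);
	f->error = error;
}

static void model_fence_signal(struct model_dma_fence *f)
{
	f->signaled = true;
}

/* what a later CS ioctl could do: peek at the status of old fences */
static int model_fence_status(const struct model_dma_fence *f)
{
	return f->signaled ? f->error : 0;
}
```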

On 03.05.2017 at 15:42, Liu, Monk wrote:
> That should be the way you said, but I didn't see the logic that assures that,
>
> If the kref of a BO goes down to 0, the BO will be destroyed (amdgpu_bo_destroy); I don't see any code that prevents this destroy from being invoked while the resv of this BO
> still has unsignaled fences. Can you share that trick?
>
> Because I found that if a job hangs, after my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ....
>
> But a BO created through GEM by the app can be destroyed as expected
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 9:34 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>> Can we guarantee that the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
> Yes, that's why we add all fences of each command submission to the PD/PT BOs.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 15:31 schrieb Liu, Monk:
>> Even if we release the ctx in the usual way,
>> can we guarantee that the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
>>
>> You know an app can submit a command, release all BOs, free the ctx, close
>> the FD/VM, and exit very soon; it just doesn't wait for the fence to signal
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 8:50 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>> finished
>>
>>> and the ironic thing is that I want them alive as well (especially CSA, PTR)
>> Yes, and exactly that is the danger I was talking about. We messed up the tear-down order with that and try to access resources which are already freed when the job is eventually scheduled.
>>
>> I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 11:36 schrieb Liu, Monk:
>>> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait ... because "amdgpu_ctx_do_release"
>>> isn't going to run (kref > 0 before all jobs signal).
>>>
>>> That way amdgpu_driver_postclose_kms() can continue on, so
>>> actually the "UVD and VCE handles, the PRT VAs, the CSA and even the
>>> whole VM structure" won't be kept alive, and the ironic thing is that I
>>> want them alive as well (especially CSA, PTR)
>>>
>>>
>>> BR Monk
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Wednesday, May 03, 2017 5:19 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on the context living or not ...
>>> See the teardown order in amdgpu_driver_postclose_kms():
>>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>>>
>>>>            amdgpu_uvd_free_handles(adev, file_priv);
>>>>            amdgpu_vce_free_handles(adev, file_priv);
>>>>
>>>>            amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>>
>>>>            if (amdgpu_sriov_vf(adev)) {
>>>>                    /* TODO: how to handle reserve failure */
>>>>                    BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>>>                    amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>>>                    fpriv->vm.csa_bo_va = NULL;
>>>>                    amdgpu_bo_unreserve(adev->virt.csa_obj);
>>>>            }
>>>>
>>>>            amdgpu_vm_fini(adev, &fpriv->vm);
>>> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>>>
>>> If we don't release the context here because some jobs are still executed we need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole VM structure alive.
>>>
>>>> I'll see if dma_fence could solve my issue, but I wish you can give
>>>> me your detail idea
>>> Please take a look at David's idea of using the fence_context to find which jobs and entity to skip, that is even better than mine about the fence status and should be trivial to implement because all the data is already present we just need to use it.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>>>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>
>>>>> No, we need to clean the hw rings (cherry-pick out the guilty entities' jobs in all rings) after gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>>>> I don't understand how your idea can solve my request ...
>>>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>
>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on the context living or not ...
>>>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>>
>>>> The Intel guys did this because they ran into exactly the same problem.
>>>>
>>>>> I'll see if dma_fence could solve my issue, but I wish you can give
>>>>> me your detail idea
>>>> BR Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>> Sent: Wednesday, May 03, 2017 4:59 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>> finished
>>>>
>>>>> 1, This is necessary, otherwise how can I access the entity
>>>>> pointer after a job times out
>>>> No that isn't necessary.
>>>>
>>>> The problem with your idea is that you want to actively push the
>>>> feedback/status from the job execution back to userspace when an
>>>> error
>>>> (timeout) happens.
>>>>
>>>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>
>>>>>       , and why it is dangerous ?
>>>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>
>>>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>>>
>>>>> 2, what's the status field in the fences you were referring to ? I
>>>>> need to judge if it could satisfy my requirement
>>>> struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>>
>>>> The Intel guys did this because they ran into exactly the same problem.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>>>> 1, This is necessary, otherwise how can I access the entity pointer after a job times out, and why is it dangerous ?
>>>>> 2, what's the status field in the fences you were referring to ? I
>>>>> need to judge if it could satisfy my requirement
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>>> Sent: Monday, May 01, 2017 10:48 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>>> finished
>>>>>
>>>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>>>> for TDR guilty context feature, we need access ctx/s_entity field
>>>>>> member through sched job pointer,so ctx must keep alive till all
>>>>>> job from it signaled.
>>>>> NAK, that is unnecessary and quite dangerous.
>>>>>
>>>>> Instead we have the designed status field in the fences which should be checked for that.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>>>        drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>>>        drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>>>        6 files changed, 23 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> index e330009..8e031d6 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>>>        	uint32_t			flags;
>>>>>>        };
>>>>>>        
>>>>>> +struct amdgpu_ctx;
>>>>>> +
>>>>>>        extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>>>        
>>>>>>        int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>>>> +amdgpu_ctx *ctx);
>>>>>>        int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>        			     struct amdgpu_job **job);
>>>>>>        
>>>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>>>        
>>>>>>        struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>>>        int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>>>        
>>>>>>        uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>>>        			      struct fence *fence);
>>>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>>>        	struct amdgpu_sync	sync;
>>>>>>        	struct amdgpu_ib	*ibs;
>>>>>>        	struct fence		*fence; /* the hw fence */
>>>>>> +	struct amdgpu_ctx *ctx;
>>>>>>        	uint32_t		preamble_status;
>>>>>>        	uint32_t		num_ibs;
>>>>>>        	void			*owner;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> index 699f5fe..267fb65 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>>        		}
>>>>>>        	}
>>>>>>        
>>>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>>>        	if (ret)
>>>>>>        		goto free_all_kdata;
>>>>>>        
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> index b4bbbb3..81438af 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> @@ -25,6 +25,13 @@
>>>>>>        #include <drm/drmP.h>
>>>>>>        #include "amdgpu.h"
>>>>>>        
>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>>>> +	if (ctx)
>>>>>> +		kref_get(&ctx->refcount);
>>>>>> +	return ctx;
>>>>>> +}
>>>>>> +
>>>>>>        static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>        {
>>>>>>        	unsigned i, j;
>>>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>        					  rq, amdgpu_sched_jobs);
>>>>>>        		if (r)
>>>>>>        			goto failed;
>>>>>> +
>>>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel
>>>>>> +entity doesn't have ptr_guilty */
>>>>>>        	}
>>>>>>        
>>>>>>        	return 0;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index 690ef3d..208da11 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>>>        }
>>>>>>        
>>>>>>        int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>>>> +amdgpu_ctx *ctx)
>>>>>>        {
>>>>>>        	size_t size = sizeof(struct amdgpu_job);
>>>>>>        
>>>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>>        	(*job)->vm = vm;
>>>>>>        	(*job)->ibs = (void *)&(*job)[1];
>>>>>>        	(*job)->num_ibs = num_ibs;
>>>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>>>        
>>>>>>        	amdgpu_sync_create(&(*job)->sync);
>>>>>>        
>>>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>        {
>>>>>>        	int r;
>>>>>>        
>>>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>>>        	if (r)
>>>>>>        		return r;
>>>>>>        
>>>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>>>        static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>>>        {
>>>>>>        	struct amdgpu_job *job = container_of(s_job, struct
>>>>>> amdgpu_job, base);
>>>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>>>> +
>>>>>> +	if (ctx)
>>>>>> +		amdgpu_ctx_put(ctx);
>>>>>>        
>>>>>>        	fence_put(job->fence);
>>>>>>        	amdgpu_sync_free(&job->sync);
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> index 6f4e31f..9100ca8 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>        	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>>>        		return;
>>>>>>        
>>>>>> -	/**
>>>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>>>> -	 * queued IBs
>>>>>> -	*/
>>>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>>>> -
>>>>>>        	amd_sched_rq_remove_entity(rq, entity);
>>>>>>        	kfifo_free(&entity->job_queue);
>>>>>>        }
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> index 8cb41d3..ccbbcb0 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>>>        
>>>>>>        	struct fence			*dependency;
>>>>>>        	struct fence_cb			cb;
>>>>>> +	bool *ptr_guilty;
>>>>>>        };
>>>>>>        
>>>>>>        /**
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                                 ` <44d1cc5a-a064-322a-15a6-08015378311c-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-05-03 15:17                                                   ` Liu, Monk
       [not found]                                                     ` <DM5PR12MB1610D501E6D8E8C1B25D4BC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Liu, Monk @ 2017-05-03 15:17 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

>I'd need to dig through the TTM code as well to find it, but it is something very basic in TTM, so I'm pretty sure it works as expected.
That's what makes me a little confused: 
if a BO is destroyed, then how does the TTM system track its resv pointer? Without this resv pointer, how can TTM wait on the fences hooked on this resv before reusing the BO's memory? 


BR Monk
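[Editor's note: as context for the refcounting debate in this thread, the amdgpu_ctx_kref_get() helper added in patch 1 is a get-or-NULL pattern: take an extra reference only when the pointer is non-NULL, and return it so the call can be used inline in an assignment. A simplified standalone model, using a plain int instead of the kernel's struct kref; all names are invented.]

```c
#include <assert.h>
#include <stddef.h>

/* Model of the get-or-NULL refcounting pattern from patch 1. */

struct model_ctx {
	int refcount;
};

static struct model_ctx *model_ctx_kref_get(struct model_ctx *ctx)
{
	if (ctx)
		ctx->refcount++;
	return ctx;  /* usable inline: job->ctx = model_ctx_kref_get(ctx) */
}

static int model_ctx_put(struct model_ctx *ctx)
{
	if (ctx && --ctx->refcount == 0)
		return 1;  /* last ref dropped: release the ctx */
	return 0;
}
```

This is exactly why, as discussed above, the ctx can outlive the fd close: the job's extra reference keeps the structure around until the job's free callback drops it.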

-----Original Message-----
From: Christian König [mailto:deathsimple@vodafone.de] 
Sent: Wednesday, May 3, 2017 10:05 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> If the kref of a BO goes down to 0, the BO will be destroyed 
> (amdgpu_bo_destroy); I don't see any code that prevents this destroy from being invoked while the resv of this BO still has unsignaled fences. Can you share that trick?
IIRC the destroy itself is not prevented, but TTM prevents reuse of the memory in question until the last fence is signaled.

I'd need to dig through the TTM code as well to find it, but it is something very basic in TTM, so I'm pretty sure it works as expected.

> Because I found that if a job hangs, after my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ....
Puh, good question. Sounds like we somehow messed up the reference count or a fence isn't signaled when it should be.

> But a BO created through GEM by the app can be destroyed as expected
Mhm, there isn't much difference between the two regarding this.

No idea off hand what that could be; I would need to recreate the issue and take a look myself.

Regards,
Christian.

On 03.05.2017 at 15:42, Liu, Monk wrote:
> That should be the way you said, but I didn't see the logic that assures 
> that,
>
> If the kref of a BO goes down to 0, the BO will be destroyed 
> (amdgpu_bo_destroy); I don't see any code that prevents this destroy from being invoked while the resv of this BO still has unsignaled fences. Can you share that trick?
>
> Because I found that if a job hangs, after my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ....
>
> But a BO created through GEM by the app can be destroyed as expected
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 03, 2017 9:34 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
>> Can we guarantee that the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
> Yes, that's why we add all fences of each command submission to the PD/PT BOs.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 15:31 schrieb Liu, Monk:
>> Even if we release the ctx in the usual way, can we guarantee that the pde/pte 
>> and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming this timed-out job can signal after the resubmit)?
>>
>> You know an app can submit a command, release all BOs, free the ctx, 
>> close the FD/VM, and exit very soon; it just doesn't wait for the fence 
>> to signal
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 8:50 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>> finished
>>
>>> and the ironic thing is that I want them alive as well (especially CSA, 
>>> PTR)
>> Yes, and exactly that is the danger I was talking about. We messed up the tear-down order with that and try to access resources which are already freed when the job is eventually scheduled.
>>
>> I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 11:36 schrieb Liu, Monk:
>>> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait ... because "amdgpu_ctx_do_release"
>>> isn't going to run (kref > 0 before all jobs signal).
>>>
>>> That way amdgpu_driver_postclose_kms() can continue on, so 
>>> actually the "UVD and VCE handles, the PRT VAs, the CSA and even the 
>>> whole VM structure" won't be kept alive, and the ironic thing is that I 
>>> want them alive as well (especially CSA, PTR)
>>>
>>>
>>> BR Monk
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Wednesday, May 03, 2017 5:19 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>> finished
>>>
>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on the context living or not ...
>>> See the teardown order in amdgpu_driver_postclose_kms():
>>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>>>
>>>>            amdgpu_uvd_free_handles(adev, file_priv);
>>>>            amdgpu_vce_free_handles(adev, file_priv);
>>>>
>>>>            amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>>
>>>>            if (amdgpu_sriov_vf(adev)) {
>>>>                    /* TODO: how to handle reserve failure */
>>>>                    BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>>>                    amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>>>                    fpriv->vm.csa_bo_va = NULL;
>>>>                    amdgpu_bo_unreserve(adev->virt.csa_obj);
>>>>            }
>>>>
>>>>            amdgpu_vm_fini(adev, &fpriv->vm);
>>> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>>>
>>> If we don't release the context here because some jobs are still executed we need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole VM structure alive.
>>>
>>>> I'll see if dma_fence could solve my issue, but I wish you can give 
>>>> me your detail idea
>>> Please take a look at David's idea of using the fence_context to find which jobs and entity to skip, that is even better than mine about the fence status and should be trivial to implement because all the data is already present we just need to use it.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>>>> 1,My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>
>>>>> No, we need to clean the hw rings (cherry-pick out the guilty entities' jobs in all rings) after gpu reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>>>> I don't understand how your idea can solve my request ...
>>>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>
>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD; I don't see that amdgpu_vm_fini() depends on the context living or not ...
>>>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>>
>>>> The Intel guys did this because they ran into exactly the same problem.
>>>>
>>>>> I'll see if dma_fence could solve my issue, but I wish you can 
>>>>> give me your detail idea
>>>> BR Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>> Sent: Wednesday, May 03, 2017 4:59 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>>> finished
>>>>
>>>>> 1, This is necessary, otherwise how can I access the entity 
>>>>> pointer after a job times out
>>>> No that isn't necessary.
>>>>
>>>> The problem with your idea is that you want to actively push the 
>>>> feedback/status from the job execution back to userspace when an 
>>>> error
>>>> (timeout) happens.
>>>>
>>>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>
>>>>>       , and why it is dangerous ?
>>>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>
>>>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>>>
>>>>> 2, what's the status field in the fences you were referring to ? I 
>>>>> need to judge if it could satisfy my requirement
>>>> struct fence was renamed to struct dma_fence on newer kernels and a status field added to exactly this purpose.
>>>>
>>>> The Intel guys did this because they ran into exactly the same problem.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>>>> 1, This is necessary, otherwise how can I access the entity pointer after a job times out, and why is it dangerous ?
>>>>> 2, what's the status field in the fences you were referring to ? I 
>>>>> need to judge if it could satisfy my requirement
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>>> Sent: Monday, May 01, 2017 10:48 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>>>> finished
>>>>>
>>>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>>>> for TDR guilty context feature, we need access ctx/s_entity field 
>>>>>> member through sched job pointer,so ctx must keep alive till all 
>>>>>> job from it signaled.
>>>>> NAK, that is unnecessary and quite dangerous.
>>>>>
>>>>> Instead we have the designed status field in the fences which should be checked for that.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>>>        drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>>>        drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>>>        6 files changed, 23 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> index e330009..8e031d6 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>>>        	uint32_t			flags;
>>>>>>        };
>>>>>>        
>>>>>> +struct amdgpu_ctx;
>>>>>> +
>>>>>>        extern const struct amd_sched_backend_ops 
>>>>>> amdgpu_sched_ops;
>>>>>>        
>>>>>>        int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>>>>> +amdgpu_ctx *ctx);
>>>>>>        int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>        			     struct amdgpu_job **job);
>>>>>>        
>>>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>>>        
>>>>>>        struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>>>        int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>>>        
>>>>>>        uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>>>        			      struct fence *fence);
>>>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>>>        	struct amdgpu_sync	sync;
>>>>>>        	struct amdgpu_ib	*ibs;
>>>>>>        	struct fence		*fence; /* the hw fence */
>>>>>> +	struct amdgpu_ctx *ctx;
>>>>>>        	uint32_t		preamble_status;
>>>>>>        	uint32_t		num_ibs;
>>>>>>        	void			*owner;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> index 699f5fe..267fb65 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>>        		}
>>>>>>        	}
>>>>>>        
>>>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>>>        	if (ret)
>>>>>>        		goto free_all_kdata;
>>>>>>        
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> index b4bbbb3..81438af 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>> @@ -25,6 +25,13 @@
>>>>>>        #include <drm/drmP.h>
>>>>>>        #include "amdgpu.h"
>>>>>>        
>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
>>>>>> +{
>>>>>> +	if (ctx)
>>>>>> +		kref_get(&ctx->refcount);
>>>>>> +	return ctx;
>>>>>> +}
>>>>>> +
>>>>>>        static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>        {
>>>>>>        	unsigned i, j;
>>>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>        					  rq, amdgpu_sched_jobs);
>>>>>>        		if (r)
>>>>>>        			goto failed;
>>>>>> +
>>>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>>>>>>        	}
>>>>>>        
>>>>>>        	return 0;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index 690ef3d..208da11 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>>>        }
>>>>>>        
>>>>>>        int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>>>>>>        {
>>>>>>        	size_t size = sizeof(struct amdgpu_job);
>>>>>>        
>>>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>>        	(*job)->vm = vm;
>>>>>>        	(*job)->ibs = (void *)&(*job)[1];
>>>>>>        	(*job)->num_ibs = num_ibs;
>>>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>>>        
>>>>>>        	amdgpu_sync_create(&(*job)->sync);
>>>>>>        
>>>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>        {
>>>>>>        	int r;
>>>>>>        
>>>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>>>        	if (r)
>>>>>>        		return r;
>>>>>>        
>>>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>>>        static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>>>        {
>>>>>>        	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
>>>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>>>> +
>>>>>> +	if (ctx)
>>>>>> +		amdgpu_ctx_put(ctx);
>>>>>>        
>>>>>>        	fence_put(job->fence);
>>>>>>        	amdgpu_sync_free(&job->sync);
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> index 6f4e31f..9100ca8 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>        	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>>>        		return;
>>>>>>        
>>>>>> -	/**
>>>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>>>> -	 * queued IBs
>>>>>> -	*/
>>>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>>>> -
>>>>>>        	amd_sched_rq_remove_entity(rq, entity);
>>>>>>        	kfifo_free(&entity->job_queue);
>>>>>>        }
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> index 8cb41d3..ccbbcb0 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>>>        
>>>>>>        	struct fence			*dependency;
>>>>>>        	struct fence_cb			cb;
>>>>>> +	bool *ptr_guilty;
>>>>>>        };
>>>>>>        
>>>>>>        /**
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
       [not found]                                                     ` <DM5PR12MB1610D501E6D8E8C1B25D4BC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-05-03 17:40                                                       ` Christian König
  0 siblings, 0 replies; 28+ messages in thread
From: Christian König @ 2017-05-03 17:40 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

See ttm_bo_release(), that one is called when the last reference to the 
BO goes away.

It is the first stage of cleanup and removes the VMA mapping (no more 
CPU access to the BO), and calls ttm_bo_cleanup_refs_or_queue().

ttm_bo_cleanup_refs_or_queue() then checks whether the BO is idle, i.e. 
whether it no longer has any fences assigned to it.

If it still has fences on it, the BO is queued to the delayed destruction 
queue.

Regards,
Christian.

Am 03.05.2017 um 17:17 schrieb Liu, Monk:
>> Need to dig through the TTM code as well to find that, but it is something very basic in TTM, so I'm pretty sure it works as expected.
> That's what makes me feel a little confused:
> if a BO is destroyed, how does TTM track its resv pointer? Without that resv pointer, how does TTM wait on the fences attached to the resv before reusing the BO's memory?
>
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Wednesday, May 3, 2017 10:05 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>> If the kref of a BO goes down to 0, the BO will be destroyed
>> (amdgpu_bo_destroy). I don't see any code that prevents this destroy from being invoked while the resv of the BO still has unsignaled fences. Can you share how that works?
> IIRC the destroy itself is not prevented, but TTM prevents reuse of the memory in question until the last fence is signaled.
>
> Need to dig through the TTM code as well to find that, but it is something very basic in TTM, so I'm pretty sure it works as expected.
>
>> Because I found that if a job hangs and my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ...
> Puh, good question. Sounds like we somehow messed up the reference count or a fence isn't signaled as it should.
>
>> But BOs created through GEM by the app can be destroyed as expected
> Mhm, there isn't much difference between the two regarding this.
>
> No idea of hand what that could be, I would need to recreate the issue and take a look myself.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 15:42 schrieb Liu, Monk:
>> That should be the way you said, but I didn't see the logic that
>> assures that.
>>
>> If the kref of a BO goes down to 0, the BO will be destroyed
>> (amdgpu_bo_destroy). I don't see any code that prevents this destroy from being invoked while the resv of the BO still has unsignaled fences. Can you share how that works?
>>
>> Because I found that if a job hangs and my code kicks it out of the scheduler (I manually call amd_sched_fence_finished() on the sched_fence of the timed-out job), the page_dir won't get destroyed ...
>>
>> But BOs created through GEM by the app can be destroyed as expected
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Wednesday, May 03, 2017 9:34 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>> finished
>>
>>> Can we guarantee the pde/pte and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming the timed-out job can signal after the resubmit)?
>> Yes, that's why we add all fences of each command submission to the PD/PT BOs.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 15:31 schrieb Liu, Monk:
>>> Even if we release the ctx the usual way, can we guarantee the pde/pte
>>> and PRT/CSA are all alive (BOs, mappings) when resubmitting the timed-out job (assuming the timed-out job can signal after the resubmit)?
>>>
>>> You know an app can submit a command, release all BOs, free the ctx,
>>> close the FD/VM, and exit very soon; it just doesn't wait for the fence
>>> to signal
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Wednesday, May 03, 2017 8:50 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>>> and the ironic thing is I want them to stay alive as well (especially
>>>> CSA, PRT)
>>> Yes, and exactly that is the danger I was talking about. We messed up the teardown order with that and try to access resources which are already freed by the time the job is scheduled.
>>>
>>> I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 03.05.2017 um 11:36 schrieb Liu, Monk:
>>>> Since I take one more kref on the ctx when creating jobs, amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually wait ... because amdgpu_ctx_do_release()
>>>> won't run (kref > 0 until all jobs have signaled).
>>>>
>>>> That way amdgpu_driver_postclose_kms() can continue on, so the "UVD and
>>>> VCE handles, the PRT VAs, the CSA and even the whole VM structure"
>>>> actually won't be kept alive, and the ironic thing is I want them to
>>>> stay alive as well (especially CSA, PRT)
>>>>
>>>>
>>>> BR Monk
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>> Sent: Wednesday, May 03, 2017 5:19 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>> finished
>>>>
>>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context being alive or not ...
>>>> See the teardown order in amdgpu_driver_postclose_kms():
>>>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>>>>
>>>>>             amdgpu_uvd_free_handles(adev, file_priv);
>>>>>             amdgpu_vce_free_handles(adev, file_priv);
>>>>>
>>>>>             amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>>>
>>>>>             if (amdgpu_sriov_vf(adev)) {
>>>>>                     /* TODO: how to handle reserve failure */
>>>>>                     BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>>>>                     amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>>>>                     fpriv->vm.csa_bo_va = NULL;
>>>>>                     amdgpu_bo_unreserve(adev->virt.csa_obj);
>>>>>             }
>>>>>
>>>>>             amdgpu_vm_fini(adev, &fpriv->vm);
>>>> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>>>>
>>>> If we don't release the context here because some jobs are still executed we need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole VM structure alive.
>>>>
>>>>> I'll see if dma_fence could solve my issue, but I wish you could give
>>>>> me your detailed idea
>>>> Please take a look at David's idea of using the fence_context to find which jobs and entities to skip; that is even better than my idea about the fence status, and it should be trivial to implement because all the data is already present, we just need to use it.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>>>>> 1, My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>>
>>>>>> No, we need to clean the hw ring (cherry-pick out the guilty entities' jobs in all rings) after GPU reset; we need to fake-signal all sched_fences in the guilty entity as well; and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>>>>> I don't understand how your idea can satisfy those requirements ...
>>>>> 2, You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>>
>>>>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context being alive or not ...
>>>>> 3, struct fence was renamed to struct dma_fence on newer kernels, and a status field was added for exactly this purpose.
>>>>>
>>>>> The Intel guys did this because they ran into the exact same problem.
>>>>>
>>>>>> I'll see if dma_fence could solve my issue, but I wish you could
>>>>>> give me your detailed idea
>>>>> BR Monk
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>>> Sent: Wednesday, May 03, 2017 4:59 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>>> finished
>>>>>
>>>>>> 1, This is necessary, otherwise how can I access the entity pointer
>>>>>> after a job has timed out
>>>>> No that isn't necessary.
>>>>>
>>>>> The problem with your idea is that you want to actively push the
>>>>> feedback/status from the job execution back to userspace when an
>>>>> error
>>>>> (timeout) happens.
>>>>>
>>>>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>>>>
>>>>>>        , and why is it dangerous?
>>>>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>>>>
>>>>> We could split ctx tear down into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>>>>
>>>>>> 2, what's the status field in the fences you were referring to? I
>>>>>> need to judge whether it satisfies my requirements
>>>>> struct fence was renamed to struct dma_fence on newer kernels, and a status field was added for exactly this purpose.
>>>>>
>>>>> The Intel guys did this because they ran into the exact same problem.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>>>>> 1, This is necessary, otherwise how can I access the entity pointer after a job has timed out, and why is it dangerous?
>>>>>> 2, what's the status field in the fences you were referring to? I
>>>>>> need to judge whether it satisfies my requirements
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>>>>> Sent: Monday, May 01, 2017 10:48 PM
>>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>>>>> finished
>>>>>>
>>>>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>>>>> for the TDR guilty-context feature, we need to access ctx/s_entity
>>>>>>> fields through the sched job pointer, so the ctx must be kept alive
>>>>>>> till all jobs from it have signaled.
>>>>>> NAK, that is unnecessary and quite dangerous.
>>>>>>
>>>>>> Instead we have the status field designed for this purpose in the fences, which should be checked for that.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>>> ---
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>>>>         drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>>>>         drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>>>>         6 files changed, 23 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index e330009..8e031d6 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>>>>         	uint32_t			flags;
>>>>>>>         };
>>>>>>>         
>>>>>>> +struct amdgpu_ctx;
>>>>>>> +
>>>>>>>         extern const struct amd_sched_backend_ops
>>>>>>> amdgpu_sched_ops;
>>>>>>>         
>>>>>>>         int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
>>>>>>>         int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>>         			     struct amdgpu_job **job);
>>>>>>>         
>>>>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>>>>         
>>>>>>>         struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>>>>         int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>>>>         
>>>>>>>         uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>>>>         			      struct fence *fence);
>>>>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>>>>         	struct amdgpu_sync	sync;
>>>>>>>         	struct amdgpu_ib	*ibs;
>>>>>>>         	struct fence		*fence; /* the hw fence */
>>>>>>> +	struct amdgpu_ctx *ctx;
>>>>>>>         	uint32_t		preamble_status;
>>>>>>>         	uint32_t		num_ibs;
>>>>>>>         	void			*owner;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>>> index 699f5fe..267fb65 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>>>         		}
>>>>>>>         	}
>>>>>>>         
>>>>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>>>>         	if (ret)
>>>>>>>         		goto free_all_kdata;
>>>>>>>         
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>>> index b4bbbb3..81438af 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>>>> @@ -25,6 +25,13 @@
>>>>>>>         #include <drm/drmP.h>
>>>>>>>         #include "amdgpu.h"
>>>>>>>         
>>>>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
>>>>>>> +{
>>>>>>> +	if (ctx)
>>>>>>> +		kref_get(&ctx->refcount);
>>>>>>> +	return ctx;
>>>>>>> +}
>>>>>>> +
>>>>>>>         static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>>         {
>>>>>>>         	unsigned i, j;
>>>>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>>>>         					  rq, amdgpu_sched_jobs);
>>>>>>>         		if (r)
>>>>>>>         			goto failed;
>>>>>>> +
>>>>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>>>>>>>         	}
>>>>>>>         
>>>>>>>         	return 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> index 690ef3d..208da11 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>>>>         }
>>>>>>>         
>>>>>>>         int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>>>>>>>         {
>>>>>>>         	size_t size = sizeof(struct amdgpu_job);
>>>>>>>         
>>>>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>>>>         	(*job)->vm = vm;
>>>>>>>         	(*job)->ibs = (void *)&(*job)[1];
>>>>>>>         	(*job)->num_ibs = num_ibs;
>>>>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>>>>         
>>>>>>>         	amdgpu_sync_create(&(*job)->sync);
>>>>>>>         
>>>>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>>>>         {
>>>>>>>         	int r;
>>>>>>>         
>>>>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>>>>         	if (r)
>>>>>>>         		return r;
>>>>>>>         
>>>>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>>>>         static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>>>>         {
>>>>>>>         	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
>>>>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>>>>> +
>>>>>>> +	if (ctx)
>>>>>>> +		amdgpu_ctx_put(ctx);
>>>>>>>         
>>>>>>>         	fence_put(job->fence);
>>>>>>>         	amdgpu_sync_free(&job->sync);
>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>> index 6f4e31f..9100ca8 100644
>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>>         	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>>>>         		return;
>>>>>>>         
>>>>>>> -	/**
>>>>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>>>>> -	 * queued IBs
>>>>>>> -	*/
>>>>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>>>>> -
>>>>>>>         	amd_sched_rq_remove_entity(rq, entity);
>>>>>>>         	kfifo_free(&entity->job_queue);
>>>>>>>         }
>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>>> index 8cb41d3..ccbbcb0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>>>>         
>>>>>>>         	struct fence			*dependency;
>>>>>>>         	struct fence_cb			cb;
>>>>>>> +	bool *ptr_guilty;
>>>>>>>         };
>>>>>>>         
>>>>>>>         /**




end of thread, other threads:[~2017-05-03 17:40 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-01  7:22 [PATCH 0/5] Patch serials to implement guilty ctx/entity for SRIOV TDR Monk Liu
     [not found] ` <1493623371-32614-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-05-01  7:22   ` [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished Monk Liu
     [not found]     ` <1493623371-32614-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-05-01 14:47       ` Christian König
     [not found]         ` <a4605d10-b1f7-7fee-63c9-829d612c63aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03  3:30           ` Liu, Monk
     [not found]             ` <DM5PR12MB16102746DB02DBE8ED69DA9C84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03  3:57               ` Liu, Monk
     [not found]                 ` <DM5PR12MB16107F8A55F3EF0B1C834FC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03  4:54                   ` Zhou, David(ChunMing)
     [not found]                     ` <MWHPR1201MB020601F998809FC8F0527723B4160-3iK1xFAIwjrUF/YbdlDdgWrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-05-03  6:02                       ` Liu, Monk
     [not found]                         ` <DM5PR12MB161082763FA0163FF22E1C1F84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03  7:23                           ` zhoucm1
     [not found]                             ` <59098580.6090204-5C7GfCeVMHo@public.gmane.org>
2017-05-03  9:11                               ` Christian König
     [not found]                                 ` <ba31391d-1f42-705b-5c94-bfd7bd1a194f-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03  9:14                                   ` Liu, Monk
     [not found]                                     ` <DM5PR12MB1610875E9D1BC9E967BE119A84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03  9:23                                       ` Christian König
2017-05-03  8:58               ` Christian König
     [not found]                 ` <059fe927-90c8-0cf3-336c-56818d9277f0-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03  9:08                   ` Liu, Monk
     [not found]                     ` <DM5PR12MB1610E867F75FA922A874D74884160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03  9:18                       ` Christian König
     [not found]                         ` <eb637720-5c9a-636b-237e-228b499ff3bb-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03  9:29                           ` zhoucm1
2017-05-03  9:36                           ` Liu, Monk
     [not found]                             ` <DM5PR12MB161020C82674A01805B8C8D384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03 12:49                               ` Christian König
     [not found]                                 ` <cefbc7ee-36a7-3aba-7b4a-102a5a0f2e22-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03 13:31                                   ` Liu, Monk
     [not found]                                     ` <DM5PR12MB1610C0502DE515B570B2F4C984160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03 13:34                                       ` Christian König
     [not found]                                         ` <200bd9aa-1374-69be-c155-689013ba49c5-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03 13:42                                           ` Liu, Monk
     [not found]                                             ` <DM5PR12MB1610435D144871B4CC2D88AF84160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03 14:04                                               ` Christian König
     [not found]                                                 ` <44d1cc5a-a064-322a-15a6-08015378311c-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-05-03 15:17                                                   ` Liu, Monk
     [not found]                                                     ` <DM5PR12MB1610D501E6D8E8C1B25D4BC384160-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-05-03 17:40                                                       ` Christian König
2017-05-01  7:22   ` [PATCH 2/5] drm/amdgpu:some modifications in amdgpu_ctx Monk Liu
2017-05-01  7:22   ` [PATCH 3/5] drm/amdgpu:Impl guilty ctx feature for sriov TDR Monk Liu
2017-05-01  7:22   ` [PATCH 4/5] drm/amdgpu:change sriov_gpu_reset interface Monk Liu
2017-05-01  7:22   ` [PATCH 5/5] drm/amdgpu:sriov TDR only recover hang ring Monk Liu
2017-05-01 14:53   ` [PATCH 0/5] Patch serials to implement guilty ctx/entity for SRIOV TDR Christian König
