From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>, John Harrison <John.C.Harrison@Intel.com>, DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin <tvrtko.ursulin@intel.com> Subject: [PATCH v4 3/7] drm/i915: Allow error capture without a request Date: Fri, 20 Jan 2023 15:28:27 -0800 [thread overview] Message-ID: <20230120232831.28177-4-John.C.Harrison@Intel.com> (raw) In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> From: John Harrison <John.C.Harrison@Intel.com> There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just prentending there was no hang at all. So update the error capture code to always record engine information if a context is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a seaprate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) v4: Pull in improved info message from next patch and fix up potential leak of GuC register state (Daniele) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++--------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index b20bd6365615b..225f1b11a6b93 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee) } static bool record_context(struct i915_gem_context_coredump *e, - const struct i915_request *rq) + struct intel_context *ce) { struct i915_gem_context *ctx; struct task_struct *task; bool simulated; rcu_read_lock(); - ctx = rcu_dereference(rq->context->gem_context); + ctx = rcu_dereference(ce->gem_context); if (ctx && !kref_get_unless_zero(&ctx->ref)) ctx = NULL; rcu_read_unlock(); @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e, e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); - e->total_runtime = intel_context_get_total_runtime_ns(rq->context); - e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context); + e->total_runtime = intel_context_get_total_runtime_ns(ce); + e->avg_runtime = intel_context_get_avg_runtime_ns(ce); simulated = i915_gem_context_no_error_capture(ctx); @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_ return ee; } +static struct intel_engine_capture_vma * +engine_coredump_add_context(struct intel_engine_coredump *ee, + struct intel_context *ce, + gfp_t gfp) +{ + struct intel_engine_capture_vma *vma = NULL; + + ee->simulated |= record_context(&ee->context, ce); + if (ee->simulated) + return NULL; + + /* + * We need to copy these to an anonymous buffer + * as the simplest method to avoid being overwritten + * by userspace. + */ + vma = capture_vma(vma, ce->ring->vma, "ring", gfp); + vma = capture_vma(vma, ce->state, "HW context", gfp); + + return vma; +} + struct intel_engine_capture_vma * intel_engine_coredump_add_request(struct intel_engine_coredump *ee, struct i915_request *rq, gfp_t gfp) { - struct intel_engine_capture_vma *vma = NULL; + struct intel_engine_capture_vma *vma; - ee->simulated |= record_context(&ee->context, rq); - if (ee->simulated) + vma = engine_coredump_add_context(ee, rq->context, gfp); + if (!vma) return NULL; /* @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee, */ vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch"); vma = capture_user(vma, rq, gfp); - vma = capture_vma(vma, rq->ring->vma, "ring", gfp); - vma = capture_vma(vma, rq->context->state, "HW context", gfp); ee->rq_head = rq->head; ee->rq_post = rq->postfix; @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; + if (rq && !i915_request_started(rq)) { + drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); + i915_request_put(rq); + rq = NULL; + } - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) - goto no_request_capture; - if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) - intel_guc_capture_get_matching_node(engine->gt, ee, ce); + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); + i915_request_put(rq); + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); + } - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); - return ee; + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) + intel_guc_capture_get_matching_node(engine->gt, ee, ce); + } else { + kfree(ee); + ee = NULL; + } -no_request_capture: - if (rq) - i915_request_put(rq); - kfree(ee); - return NULL; + return ee; } static void -- 2.39.0
WARNING: multiple messages have this Message-ID (diff)
From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Cc: DRI-Devel@Lists.FreeDesktop.Org Subject: [Intel-gfx] [PATCH v4 3/7] drm/i915: Allow error capture without a request Date: Fri, 20 Jan 2023 15:28:27 -0800 [thread overview] Message-ID: <20230120232831.28177-4-John.C.Harrison@Intel.com> (raw) In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> From: John Harrison <John.C.Harrison@Intel.com> There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just prentending there was no hang at all. So update the error capture code to always record engine information if a context is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a seaprate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) v4: Pull in improved info message from next patch and fix up potential leak of GuC register state (Daniele) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++--------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index b20bd6365615b..225f1b11a6b93 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee) } static bool record_context(struct i915_gem_context_coredump *e, - const struct i915_request *rq) + struct intel_context *ce) { struct i915_gem_context *ctx; struct task_struct *task; bool simulated; rcu_read_lock(); - ctx = rcu_dereference(rq->context->gem_context); + ctx = rcu_dereference(ce->gem_context); if (ctx && !kref_get_unless_zero(&ctx->ref)) ctx = NULL; rcu_read_unlock(); @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e, e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); - e->total_runtime = intel_context_get_total_runtime_ns(rq->context); - e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context); + e->total_runtime = intel_context_get_total_runtime_ns(ce); + e->avg_runtime = intel_context_get_avg_runtime_ns(ce); simulated = i915_gem_context_no_error_capture(ctx); @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_ return ee; } +static struct intel_engine_capture_vma * +engine_coredump_add_context(struct intel_engine_coredump *ee, + struct intel_context *ce, + gfp_t gfp) +{ + struct intel_engine_capture_vma *vma = NULL; + + ee->simulated |= record_context(&ee->context, ce); + if (ee->simulated) + return NULL; + + /* + * We need to copy these to an anonymous buffer + * as the simplest method to avoid being overwritten + * by userspace. + */ + vma = capture_vma(vma, ce->ring->vma, "ring", gfp); + vma = capture_vma(vma, ce->state, "HW context", gfp); + + return vma; +} + struct intel_engine_capture_vma * intel_engine_coredump_add_request(struct intel_engine_coredump *ee, struct i915_request *rq, gfp_t gfp) { - struct intel_engine_capture_vma *vma = NULL; + struct intel_engine_capture_vma *vma; - ee->simulated |= record_context(&ee->context, rq); - if (ee->simulated) + vma = engine_coredump_add_context(ee, rq->context, gfp); + if (!vma) return NULL; /* @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee, */ vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch"); vma = capture_user(vma, rq, gfp); - vma = capture_vma(vma, rq->ring->vma, "ring", gfp); - vma = capture_vma(vma, rq->context->state, "HW context", gfp); ee->rq_head = rq->head; ee->rq_post = rq->postfix; @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; + if (rq && !i915_request_started(rq)) { + drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); + i915_request_put(rq); + rq = NULL; + } - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) - goto no_request_capture; - if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) - intel_guc_capture_get_matching_node(engine->gt, ee, ce); + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); + i915_request_put(rq); + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); + } - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); - return ee; + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) + intel_guc_capture_get_matching_node(engine->gt, ee, ce); + } else { + kfree(ee); + ee = NULL; + } -no_request_capture: - if (rq) - i915_request_put(rq); - kfree(ee); - return NULL; + return ee; } static void -- 2.39.0
next prev parent reply other threads:[~2023-01-20 23:29 UTC|newest] Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top 2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-23 17:51 ` Tvrtko Ursulin 2023-01-23 17:51 ` [Intel-gfx] " Tvrtko Ursulin 2023-01-23 20:35 ` John Harrison 2023-01-23 20:35 ` [Intel-gfx] " John Harrison 2023-01-25 22:04 ` John Harrison 2023-01-25 22:04 ` John Harrison 2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-20 23:40 ` John Harrison 2023-01-20 23:40 ` [Intel-gfx] " John Harrison 2023-01-24 14:40 ` Tvrtko Ursulin 2023-01-25 18:00 ` John Harrison 2023-01-25 18:12 ` Tvrtko Ursulin 2023-01-25 18:17 ` John Harrison 2023-01-25 0:31 ` Ceraolo Spurio, Daniele 2023-01-20 23:28 ` John.C.Harrison [this message] 2023-01-20 23:28 ` [Intel-gfx] [PATCH v4 3/7] drm/i915: Allow error capture without a request John.C.Harrison 2023-01-25 0:39 ` Ceraolo Spurio, Daniele 2023-01-25 0:56 ` John Harrison 2023-01-20 23:28 ` [PATCH v4 4/7] drm/i915: Allow error capture of a pending request John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-20 23:28 ` [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-20 23:28 ` [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-20 23:28 ` [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious John.C.Harrison 2023-01-20 23:28 ` [Intel-gfx] " John.C.Harrison 2023-01-25 0:44 ` Ceraolo Spurio, Daniele 2023-01-20 23:57 ` [Intel-gfx] ✓ Fi.CI.BAT: success for Allow error capture without a request & fix locking issues (rev2) Patchwork 2023-01-21 21:41 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20230120232831.28177-4-John.C.Harrison@Intel.com \ --to=john.c.harrison@intel.com \ --cc=DRI-Devel@Lists.FreeDesktop.Org \ --cc=Intel-GFX@Lists.FreeDesktop.Org \ --cc=tvrtko.ursulin@intel.com \ --cc=umesh.nerlige.ramappa@intel.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.