All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Brost <matthew.brost@intel.com>
To: John Harrison <john.c.harrison@intel.com>
Cc: thomas.hellstrom@linux.intel.com,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
Date: Wed, 19 Jan 2022 12:54:05 -0800	[thread overview]
Message-ID: <20220119205405.GA32440@jons-linux-dev-box> (raw)
In-Reply-To: <50355add-0758-c4cc-df74-a6bb329ceb15@intel.com>

On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
> On 1/18/2022 13:43, Matthew Brost wrote:
> > The G2H handler needs to be flushed during a GT reset but a G2H
> > indicating engine reset failure can trigger a GT reset. Add a worker to
> > trigger the GT when a engine reset failure is received to break this
> s/a/an/
> 

Yep.

> > circular dependency.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
> >   2 files changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9d26a86fe557a..60ea8deef5392 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -119,6 +119,11 @@ struct intel_guc {
> >   		 * function as it might be in an atomic context (no sleeping)
> >   		 */
> >   		struct work_struct destroyed_worker;
> > +		/**
> > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > +		 * reset fails
> > +		 */
> > +		struct work_struct reset_worker;
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 23a40f10d376d..cdd8d691251ff 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   }
> >   static void destroyed_worker_func(struct work_struct *w);
> > +static void reset_worker_func(struct work_struct *w);
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	INIT_WORK(&guc->submission_state.reset_worker,
> > +		  reset_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> >   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >   	return gt->engine_class[engine_class][instance];
> >   }
> > +static void reset_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.reset_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +
> > +	intel_gt_handle_error(gt, ALL_ENGINES,
> > +			      I915_ERROR_CAPTURE,
> > +			      "GuC failed to reset a engine\n");
> s/a/an/
> 

Yep.

> > +}
> > +
> >   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   					 const u32 *msg, u32 len)
> >   {
> > @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> >   		guc_class, instance, engine->name, reason);
> > -	intel_gt_handle_error(gt, engine->mask,
> > -			      I915_ERROR_CAPTURE,
> > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > -			      engine->name, reason);
> The engine name and reason code are lost from the error capture? I guess we
> still get it in the drm_err above, though. So probably not an issue. We
> shouldn't be getting these from end users and any internal CI run is only
> likely to give us the dmesg, not the error capture anyway! However, still

That was my reasoning on the msg too.

> seems like it is work saving engine->mask in the submission_state structure
> (ORing in, in case there are multiple resets). Clearing it should be safe
> because once a GT reset has happened, we aren't getting any more G2Hs. And
> we can't have multiple message handlers running concurrently, right? So no
> need to protect the OR either.
> 

I could do that but the engine->mask is really only used for the error
capture with GuC submission as any i915 based reset with GuC submission
is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
all engine state before doing a GT reset which probably isn't a bad
thing, right?

I can update the commit message explaining this if that helps.

Matt 

> John.
> 
> 
> > +	/*
> > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > +	 * another worker to trigger a GT reset.
> > +	 */
> > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> >   	return 0;
> >   }
> 

WARNING: multiple messages have this Message-ID (diff)
From: Matthew Brost <matthew.brost@intel.com>
To: John Harrison <john.c.harrison@intel.com>
Cc: thomas.hellstrom@linux.intel.com,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
Date: Wed, 19 Jan 2022 12:54:05 -0800	[thread overview]
Message-ID: <20220119205405.GA32440@jons-linux-dev-box> (raw)
In-Reply-To: <50355add-0758-c4cc-df74-a6bb329ceb15@intel.com>

On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
> On 1/18/2022 13:43, Matthew Brost wrote:
> > The G2H handler needs to be flushed during a GT reset but a G2H
> > indicating engine reset failure can trigger a GT reset. Add a worker to
> > trigger the GT when a engine reset failure is received to break this
> s/a/an/
> 

Yep.

> > circular dependency.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
> >   2 files changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9d26a86fe557a..60ea8deef5392 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -119,6 +119,11 @@ struct intel_guc {
> >   		 * function as it might be in an atomic context (no sleeping)
> >   		 */
> >   		struct work_struct destroyed_worker;
> > +		/**
> > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > +		 * reset fails
> > +		 */
> > +		struct work_struct reset_worker;
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 23a40f10d376d..cdd8d691251ff 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   }
> >   static void destroyed_worker_func(struct work_struct *w);
> > +static void reset_worker_func(struct work_struct *w);
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	INIT_WORK(&guc->submission_state.reset_worker,
> > +		  reset_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> >   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >   	return gt->engine_class[engine_class][instance];
> >   }
> > +static void reset_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.reset_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +
> > +	intel_gt_handle_error(gt, ALL_ENGINES,
> > +			      I915_ERROR_CAPTURE,
> > +			      "GuC failed to reset a engine\n");
> s/a/an/
> 

Yep.

> > +}
> > +
> >   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   					 const u32 *msg, u32 len)
> >   {
> > @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> >   		guc_class, instance, engine->name, reason);
> > -	intel_gt_handle_error(gt, engine->mask,
> > -			      I915_ERROR_CAPTURE,
> > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > -			      engine->name, reason);
> The engine name and reason code are lost from the error capture? I guess we
> still get it in the drm_err above, though. So probably not an issue. We
> shouldn't be getting these from end users and any internal CI run is only
> likely to give us the dmesg, not the error capture anyway! However, still

That was my reasoning on the msg too.

> seems like it is work saving engine->mask in the submission_state structure
> (ORing in, in case there are multiple resets). Clearing it should be safe
> because once a GT reset has happened, we aren't getting any more G2Hs. And
> we can't have multiple message handlers running concurrently, right? So no
> need to protect the OR either.
> 

I could do that but the engine->mask is really only used for the error
capture with GuC submission as any i915 based reset with GuC submission
is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
all engine state before doing a GT reset which probably isn't a bad
thing, right?

I can update the commit message explaining this if that helps.

Matt 

> John.
> 
> 
> > +	/*
> > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > +	 * another worker to trigger a GT reset.
> > +	 */
> > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> >   	return 0;
> >   }
> 

  reply	other threads:[~2022-01-19 21:00 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-18 21:43 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
2022-01-18 21:43 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:29   ` John Harrison
2022-01-19  1:29     ` [Intel-gfx] " John Harrison
2022-01-19 20:47     ` Matthew Brost
2022-01-19 20:47       ` [Intel-gfx] " Matthew Brost
2022-01-19 20:56       ` John Harrison
2022-01-19 20:56         ` [Intel-gfx] " John Harrison
2022-01-18 21:43 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:37   ` John Harrison
2022-01-19  1:37     ` [Intel-gfx] " John Harrison
2022-01-19 20:54     ` Matthew Brost [this message]
2022-01-19 20:54       ` Matthew Brost
2022-01-19 21:07       ` John Harrison
2022-01-19 21:07         ` [Intel-gfx] " John Harrison
2022-01-19 21:05         ` Matthew Brost
2022-01-19 21:05           ` [Intel-gfx] " Matthew Brost
2022-01-18 21:43 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:38   ` John Harrison
2022-01-19  1:38     ` [Intel-gfx] " John Harrison
2022-01-18 22:01 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Flush G2H handler during a GT reset (rev2) Patchwork
2022-01-18 22:02 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-01-18 22:32 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-01-19  1:02 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
2022-01-21  1:34   ` John Harrison
2022-01-21  4:04     ` Matthew Brost
2022-01-21  4:31 [PATCH 0/3] Flush G2H handler during " Matthew Brost
2022-01-21  4:31 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
2022-01-21 18:53   ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220119205405.GA32440@jons-linux-dev-box \
    --to=matthew.brost@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=john.c.harrison@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.