dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Flush G2H handler during a GT reset
@ 2022-01-19 21:24 Matthew Brost
  2022-01-19 21:24 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Matthew Brost @ 2022-01-19 21:24 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: john.c.harrison

After a small fix to error capture code, we now can flush G2H during a
GT reset which simplifies code and seals some extreme corner case races. 

v2:
 (CI)
  - Don't trigger GT reset from G2H handler
v3:
  - Address John Harrison's comments

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (3):
  drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  drm/i915/guc: Add work queue to trigger a GT reset
  drm/i915/guc: Flush G2H handler during a GT reset

 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 55 ++++++++++++-------
 drivers/gpu/drm/i915/i915_gpu_error.c         |  2 +-
 3 files changed, 44 insertions(+), 22 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
@ 2022-01-19 21:24 ` Matthew Brost
  2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
  2022-01-19 21:24 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
  2 siblings, 0 replies; 11+ messages in thread
From: Matthew Brost @ 2022-01-19 21:24 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: john.c.harrison

Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than
GFP_KERNEL to fully decouple the error capture from fence signalling.

v2:
 (John Harrison)
  - Fix typo in commit message (s/do/to)

Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 67f3515f07e7..aee42eae4729 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
 	if (!ee)
 		return NULL;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
  2022-01-19 21:24 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
@ 2022-01-19 21:24 ` Matthew Brost
  2022-01-21  1:34   ` John Harrison
  2022-01-19 21:24 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
  2 siblings, 1 reply; 11+ messages in thread
From: Matthew Brost @ 2022-01-19 21:24 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: john.c.harrison

The G2H handler needs to be flushed during a GT reset but a G2H
indicating engine reset failure can trigger a GT reset. Add a worker to
trigger the GT when an engine reset failure is received to break this
circular dependency.

v2:
 (John Harrison)
  - Store engine reset mask
  - Fix typo in commit message

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 9d26a86fe557..c4a9fc7dd246 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -119,6 +119,15 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @reset_worker: worker to trigger a GT reset after an engine
+		 * reset fails
+		 */
+		struct work_struct reset_worker;
+		/**
+		 * @reset_mask: mask of engines that failed to reset
+		 */
+		intel_engine_mask_t reset_mask;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3918f1be114f..514b3060b141 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 }
 
 static void destroyed_worker_func(struct work_struct *w);
+static void reset_worker_func(struct work_struct *w);
 
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_WORK(&guc->submission_state.reset_worker,
+		  reset_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
@@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
 	return gt->engine_class[engine_class][instance];
 }
 
+static void reset_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.reset_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_engine_mask_t reset_mask;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	reset_mask = guc->submission_state.reset_mask;
+	guc->submission_state.reset_mask = 0;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	if (likely(reset_mask))
+		intel_gt_handle_error(gt, reset_mask,
+				      I915_ERROR_CAPTURE,
+				      "GuC failed to reset engine mask=0x%x\n",
+				      reset_mask);
+}
+
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len)
 {
@@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	struct intel_gt *gt = guc_to_gt(guc);
 	u8 guc_class, instance;
 	u32 reason;
+	unsigned long flags;
 
 	if (unlikely(len != 3)) {
 		drm_err(&gt->i915->drm, "Invalid length %u", len);
@@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
 		guc_class, instance, engine->name, reason);
 
-	intel_gt_handle_error(gt, engine->mask,
-			      I915_ERROR_CAPTURE,
-			      "GuC failed to reset %s (reason=0x%08x)\n",
-			      engine->name, reason);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	guc->submission_state.reset_mask |= engine->mask;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	/*
+	 * A GT reset flushes this worker queue (G2H handler) so we must use
+	 * another worker to trigger a GT reset.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
  2022-01-19 21:24 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
  2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
@ 2022-01-19 21:24 ` Matthew Brost
  2022-01-21  1:36   ` John Harrison
  2 siblings, 1 reply; 11+ messages in thread
From: Matthew Brost @ 2022-01-19 21:24 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: john.c.harrison

Now that the error capture is fully decoupled from fence signalling
(request retirement to free memory, which in turn depends on resets) we
can safely flush the G2H handler during a GT reset. This is eliminates
corner cases where GuC generated G2H (e.g. engine resets) race with a GT
reset.

v2:
 (John Harrison)
  - Fix typo in commit message (s/is/in)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 514b3060b141..406dd2e3f5a9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
@ 2022-01-21  1:34   ` John Harrison
  2022-01-21  4:04     ` Matthew Brost
  0 siblings, 1 reply; 11+ messages in thread
From: John Harrison @ 2022-01-21  1:34 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-gfx

On 1/19/2022 13:24, Matthew Brost wrote:
> The G2H handler needs to be flushed during a GT reset but a G2H
> indicating engine reset failure can trigger a GT reset. Add a worker to
> trigger the GT when an engine reset failure is received to break this
trigger the GT reset?

> circular dependency.
>
> v2:
>   (John Harrison)
>    - Store engine reset mask
>    - Fix typo in commit message
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
>   2 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9d26a86fe557..c4a9fc7dd246 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -119,6 +119,15 @@ struct intel_guc {
>   		 * function as it might be in an atomic context (no sleeping)
>   		 */
>   		struct work_struct destroyed_worker;
> +		/**
> +		 * @reset_worker: worker to trigger a GT reset after an engine
> +		 * reset fails
> +		 */
> +		struct work_struct reset_worker;
> +		/**
> +		 * @reset_mask: mask of engines that failed to reset
> +		 */
> +		intel_engine_mask_t reset_mask;
reset_fail_mask might be a less ambiguous name? Same for the worker 
struct and function.

John.

>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3918f1be114f..514b3060b141 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   }
>   
>   static void destroyed_worker_func(struct work_struct *w);
> +static void reset_worker_func(struct work_struct *w);
>   
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
> @@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
> +	INIT_WORK(&guc->submission_state.reset_worker,
> +		  reset_worker_func);
>   
>   	guc->submission_state.guc_ids_bitmap =
>   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> @@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>   	return gt->engine_class[engine_class][instance];
>   }
>   
> +static void reset_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.reset_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_engine_mask_t reset_mask;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	reset_mask = guc->submission_state.reset_mask;
> +	guc->submission_state.reset_mask = 0;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	if (likely(reset_mask))
> +		intel_gt_handle_error(gt, reset_mask,
> +				      I915_ERROR_CAPTURE,
> +				      "GuC failed to reset engine mask=0x%x\n",
> +				      reset_mask);
> +}
> +
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len)
>   {
> @@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	u8 guc_class, instance;
>   	u32 reason;
> +	unsigned long flags;
>   
>   	if (unlikely(len != 3)) {
>   		drm_err(&gt->i915->drm, "Invalid length %u", len);
> @@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
>   		guc_class, instance, engine->name, reason);
>   
> -	intel_gt_handle_error(gt, engine->mask,
> -			      I915_ERROR_CAPTURE,
> -			      "GuC failed to reset %s (reason=0x%08x)\n",
> -			      engine->name, reason);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	guc->submission_state.reset_mask |= engine->mask;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	/*
> +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> +	 * another worker to trigger a GT reset.
> +	 */
> +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
>   
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-19 21:24 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
@ 2022-01-21  1:36   ` John Harrison
  2022-01-21  4:05     ` Matthew Brost
  0 siblings, 1 reply; 11+ messages in thread
From: John Harrison @ 2022-01-21  1:36 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-gfx

On 1/19/2022 13:24, Matthew Brost wrote:
> Now that the error capture is fully decoupled from fence signalling
> (request retirement to free memory, which in turn depends on resets) we
> can safely flush the G2H handler during a GT reset. This is eliminates
This eliminates

John.

> corner cases where GuC generated G2H (e.g. engine resets) race with a GT
> reset.
>
> v2:
>   (John Harrison)
>    - Fix typo in commit message (s/is/in)
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
>   1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 514b3060b141..406dd2e3f5a9 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>   
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
> -	int i;
> -
>   	if (unlikely(!guc_submission_initialized(guc))) {
>   		/* Reset called during driver load? GuC not yet initialised! */
>   		return;
> @@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   
>   	guc_flush_submissions(guc);
>   	guc_flush_destroyed_contexts(guc);
> -
> -	/*
> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> -	 * each pass as interrupt have been disabled. We always scrub for
> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> -	 * be incremented after the context state update.
> -	 */
> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> -		intel_guc_to_host_event_handler(guc);
> -#define wait_for_reset(guc, wait_var) \
> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> -		do {
> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> -		} while (!list_empty(&guc->ct.requests.incoming));
> -	}
> +	flush_work(&guc->ct.requests.worker);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-21  1:34   ` John Harrison
@ 2022-01-21  4:04     ` Matthew Brost
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Brost @ 2022-01-21  4:04 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel

On Thu, Jan 20, 2022 at 05:34:54PM -0800, John Harrison wrote:
> On 1/19/2022 13:24, Matthew Brost wrote:
> > The G2H handler needs to be flushed during a GT reset but a G2H
> > indicating engine reset failure can trigger a GT reset. Add a worker to
> > trigger the GT when an engine reset failure is received to break this
> trigger the GT reset?
> 

Yes.

> > circular dependency.
> > 
> > v2:
> >   (John Harrison)
> >    - Store engine reset mask
> >    - Fix typo in commit message
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
> >   2 files changed, 42 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9d26a86fe557..c4a9fc7dd246 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -119,6 +119,15 @@ struct intel_guc {
> >   		 * function as it might be in an atomic context (no sleeping)
> >   		 */
> >   		struct work_struct destroyed_worker;
> > +		/**
> > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > +		 * reset fails
> > +		 */
> > +		struct work_struct reset_worker;
> > +		/**
> > +		 * @reset_mask: mask of engines that failed to reset
> > +		 */
> > +		intel_engine_mask_t reset_mask;
> reset_fail_mask might be a less ambiguous name? Same for the worker struct
> and function.
> 

How about:

struct {
	worker;
	mask;
} engine_reset_fail;

Matt

> John.
> 
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3918f1be114f..514b3060b141 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   }
> >   static void destroyed_worker_func(struct work_struct *w);
> > +static void reset_worker_func(struct work_struct *w);
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > @@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	INIT_WORK(&guc->submission_state.reset_worker,
> > +		  reset_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> >   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > @@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >   	return gt->engine_class[engine_class][instance];
> >   }
> > +static void reset_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.reset_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	intel_engine_mask_t reset_mask;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	reset_mask = guc->submission_state.reset_mask;
> > +	guc->submission_state.reset_mask = 0;
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +
> > +	if (likely(reset_mask))
> > +		intel_gt_handle_error(gt, reset_mask,
> > +				      I915_ERROR_CAPTURE,
> > +				      "GuC failed to reset engine mask=0x%x\n",
> > +				      reset_mask);
> > +}
> > +
> >   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   					 const u32 *msg, u32 len)
> >   {
> > @@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	struct intel_gt *gt = guc_to_gt(guc);
> >   	u8 guc_class, instance;
> >   	u32 reason;
> > +	unsigned long flags;
> >   	if (unlikely(len != 3)) {
> >   		drm_err(&gt->i915->drm, "Invalid length %u", len);
> > @@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> >   		guc_class, instance, engine->name, reason);
> > -	intel_gt_handle_error(gt, engine->mask,
> > -			      I915_ERROR_CAPTURE,
> > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > -			      engine->name, reason);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	guc->submission_state.reset_mask |= engine->mask;
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +
> > +	/*
> > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > +	 * another worker to trigger a GT reset.
> > +	 */
> > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> >   	return 0;
> >   }
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-21  1:36   ` John Harrison
@ 2022-01-21  4:05     ` Matthew Brost
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Brost @ 2022-01-21  4:05 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel

On Thu, Jan 20, 2022 at 05:36:22PM -0800, John Harrison wrote:
> On 1/19/2022 13:24, Matthew Brost wrote:
> > Now that the error capture is fully decoupled from fence signalling
> > (request retirement to free memory, which in turn depends on resets) we
> > can safely flush the G2H handler during a GT reset. This is eliminates
> This eliminates
> 
> John.
> 

Yep, will fixup in the next rev.

Matt

> > corner cases where GuC generated G2H (e.g. engine resets) race with a GT
> > reset.
> > 
> > v2:
> >   (John Harrison)
> >    - Fix typo in commit message (s/is/in)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
> >   1 file changed, 1 insertion(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 514b3060b141..406dd2e3f5a9 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> > -	int i;
> > -
> >   	if (unlikely(!guc_submission_initialized(guc))) {
> >   		/* Reset called during driver load? GuC not yet initialised! */
> >   		return;
> > @@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	guc_flush_submissions(guc);
> >   	guc_flush_destroyed_contexts(guc);
> > -
> > -	/*
> > -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> > -	 * each pass as interrupt have been disabled. We always scrub for
> > -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> > -	 * be incremented after the context state update.
> > -	 */
> > -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> > -		intel_guc_to_host_event_handler(guc);
> > -#define wait_for_reset(guc, wait_var) \
> > -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> > -		do {
> > -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > -		} while (!list_empty(&guc->ct.requests.incoming));
> > -	}
> > +	flush_work(&guc->ct.requests.worker);
> >   	scrub_guc_desc_for_outstanding_g2h(guc);
> >   }
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-21  4:31 [PATCH 0/3] " Matthew Brost
@ 2022-01-21  4:31 ` Matthew Brost
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Brost @ 2022-01-21  4:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison

Now that the error capture is fully decoupled from fence signalling
(request retirement to free memory, which in turn depends on resets) we
can safely flush the G2H handler during a GT reset. This eliminates
corner cases where GuC generated G2H (e.g. engine resets) race with a GT
reset.

v2:
 (John Harrison)
  - Fix typo in commit message (s/is/in)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9a3f503d201aa..1331ff91c5b05 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-18 21:43 ` [PATCH 3/3] drm/i915/guc: " Matthew Brost
@ 2022-01-19  1:38   ` John Harrison
  0 siblings, 0 replies; 11+ messages in thread
From: John Harrison @ 2022-01-19  1:38 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: thomas.hellstrom

On 1/18/2022 13:43, Matthew Brost wrote:
> Now that the error capture is fully decoupled from fence signalling
> (request retirement to free memory, which is turn depends on resets) we
s/is/in/

With that fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

John.

> can safely flush the G2H handler during a GT reset. This is eliminates
> corner cases where GuC generated G2H (e.g. engine resets) race with a GT
> reset.
>
> Signed-off-by: Matthew Brost <mattthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
>   1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index cdd8d691251ff..1a11e8986948b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>   
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
> -	int i;
> -
>   	if (unlikely(!guc_submission_initialized(guc))) {
>   		/* Reset called during driver load? GuC not yet initialised! */
>   		return;
> @@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   
>   	guc_flush_submissions(guc);
>   	guc_flush_destroyed_contexts(guc);
> -
> -	/*
> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> -	 * each pass as interrupt have been disabled. We always scrub for
> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> -	 * be incremented after the context state update.
> -	 */
> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> -		intel_guc_to_host_event_handler(guc);
> -#define wait_for_reset(guc, wait_var) \
> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> -		do {
> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> -		} while (!list_empty(&guc->ct.requests.incoming));
> -	}
> +	flush_work(&guc->ct.requests.worker);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-18 21:43 [PATCH 0/3] " Matthew Brost
@ 2022-01-18 21:43 ` Matthew Brost
  2022-01-19  1:38   ` John Harrison
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Brost @ 2022-01-18 21:43 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: thomas.hellstrom, john.c.harrison

Now that the error capture is fully decoupled from fence signalling
(request retirement to free memory, which is turn depends on resets) we
can safely flush the G2H handler during a GT reset. This is eliminates
corner cases where GuC generated G2H (e.g. engine resets) race with a GT
reset.

Signed-off-by: Matthew Brost <mattthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index cdd8d691251ff..1a11e8986948b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-21  4:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
2022-01-19 21:24 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
2022-01-21  1:34   ` John Harrison
2022-01-21  4:04     ` Matthew Brost
2022-01-19 21:24 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
2022-01-21  1:36   ` John Harrison
2022-01-21  4:05     ` Matthew Brost
  -- strict thread matches above, loose matches on Subject: below --
2022-01-21  4:31 [PATCH 0/3] " Matthew Brost
2022-01-21  4:31 ` [PATCH 3/3] drm/i915/guc: " Matthew Brost
2022-01-18 21:43 [PATCH 0/3] " Matthew Brost
2022-01-18 21:43 ` [PATCH 3/3] drm/i915/guc: " Matthew Brost
2022-01-19  1:38   ` John Harrison

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).