* [PATCH] drm/i915/gt: Reset twice
@ 2022-12-12 16:13 ` Andi Shyti
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Shyti @ 2022-12-12 16:13 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Chris Wilson, Mika Kuoppala, Andi Shyti, Andi Shyti

From: Chris Wilson <chris@chris-wilson.co.uk>

After applying an engine reset, on some platforms like Jasperlake, we
occasionally detect that the engine state is not cleared until shortly
after the resume. As we try to resume the engine with volatile internal
state, the first request fails with a spurious CS event (it looks like
it reports a lite-restore to the hung context, instead of the expected
idle->active context switch).

Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 34 ++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index ffde89c5835a4..88dfc0c5316ff 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
 static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
 {
 	struct intel_uncore *uncore = gt->uncore;
+	int loops = 2;
 	int err;
 
 	/*
@@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
 	 * for fifo space for the write or forcewake the chip for
 	 * the read
 	 */
-	intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
+	do {
+		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
 
-	/* Wait for the device to ack the reset requests */
-	err = __intel_wait_for_register_fw(uncore,
-					   GEN6_GDRST, hw_domain_mask, 0,
-					   500, 0,
-					   NULL);
+		/*
+		 * Wait for the device to ack the reset requests.
+		 *
+		 * On some platforms, e.g. Jasperlake, we see see that the
+		 * engine register state is not cleared until shortly after
+		 * GDRST reports completion, causing a failure as we try
+		 * to immediately resume while the internal state is still
+		 * in flux. If we immediately repeat the reset, the second
+		 * reset appears to serialise with the first, and since
+		 * it is a no-op, the registers should retain their reset
+		 * value. However, there is still a concern that upon
+		 * leaving the second reset, the internal engine state
+		 * is still in flux and not ready for resuming.
+		 */
+		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
+						   hw_domain_mask, 0,
+						   2000, 0,
+						   NULL);
+	} while (err == 0 && --loops);
 	if (err)
 		GT_TRACE(gt,
 			 "Wait for 0x%08x engines reset failed\n",
 			 hw_domain_mask);
 
+	/*
+	 * As we have observed that the engine state is still volatile
+	 * after GDRST is acked, impose a small delay to let everything settle.
+	 */
+	udelay(50);
+
 	return err;
 }
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread
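
Editorial sketch (not part of the patch): the control flow of the
do/while loop above can be tried out in user space. Everything below is
a hypothetical stand-in and not i915 code: mock_gdrst, mock_write(),
mock_read(), mock_wait_for_register() and reset_twice() are invented
for illustration, and the poll counts are arbitrary. It only shows that
the reset is requested at most twice and that a timed-out wait ends the
loop early; it does not model the trailing udelay(50) or any real
hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GDRST register (assumption, not i915 code). */
struct mock_gdrst {
	uint32_t value;		/* pending reset bits */
	int polls_until_ack;	/* reads left before the bits clear */
	int ack_latency;	/* reads each reset needs (arbitrary) */
	int writes;		/* number of reset requests issued */
};

/* Request a reset: set the bits and restart the ack countdown. */
static void mock_write(struct mock_gdrst *reg, uint32_t mask)
{
	reg->value |= mask;
	reg->polls_until_ack = reg->ack_latency;
	reg->writes++;
}

/* Each read counts as one poll; the bits clear once the countdown ends. */
static uint32_t mock_read(struct mock_gdrst *reg)
{
	if (reg->value && --reg->polls_until_ack <= 0)
		reg->value = 0;
	return reg->value;
}

/*
 * Poll until (reg & mask) == expected or max_polls reads have been made.
 * Returns 0 on success and -1 on timeout.
 */
static int mock_wait_for_register(struct mock_gdrst *reg, uint32_t mask,
				  uint32_t expected, int max_polls)
{
	while (max_polls-- > 0) {
		if ((mock_read(reg) & mask) == expected)
			return 0;
	}
	return -1;
}

/* Same loop shape as the patch: repeat the reset only if the wait succeeded. */
static int reset_twice(struct mock_gdrst *reg, uint32_t mask, int max_polls)
{
	int loops = 2;
	int err;

	assert(mask != 0);

	do {
		mock_write(reg, mask);
		err = mock_wait_for_register(reg, mask, 0, max_polls);
	} while (err == 0 && --loops);

	return err;
}
```

With ack_latency = 3 and max_polls = 10, reset_twice() returns 0 after
two writes; with ack_latency = 1000 it returns -1 after a single write,
because the failed wait short-circuits the loop condition.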

* Re: [PATCH] drm/i915/gt: Reset twice
  2022-12-12 16:13 ` Andi Shyti
  (?)
@ 2022-12-12 16:55   ` Rodrigo Vivi
  -1 siblings, 0 replies; 27+ messages in thread
From: Rodrigo Vivi @ 2022-12-12 16:55 UTC (permalink / raw)
  To: Andi Shyti
  Cc: intel-gfx, dri-devel, stable, Mika Kuoppala, Andi Shyti, Chris Wilson

On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> After applying an engine reset, on some platforms like Jasperlake, we
> occasionally detect that the engine state is not cleared until shortly
> after the resume. As we try to resume the engine with volatile internal
> state, the first request fails with a spurious CS event (it looks like
> it reports a lite-restore to the hung context, instead of the expected
> idle->active context switch).
> 
> Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>

There's a typo in the signature email I'm afraid...

Other than that, have we checked the possibility of using the driver-initiated-flr bit
instead of this second loop? That should be the right way to guarantee everything is
cleared on gen11+...

> Cc: stable@vger.kernel.org
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 34 ++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index ffde89c5835a4..88dfc0c5316ff 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
>  static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>  {
>  	struct intel_uncore *uncore = gt->uncore;
> +	int loops = 2;
>  	int err;
>  
>  	/*
> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>  	 * for fifo space for the write or forcewake the chip for
>  	 * the read
>  	 */
> -	intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> +	do {
> +		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
>  
> -	/* Wait for the device to ack the reset requests */
> -	err = __intel_wait_for_register_fw(uncore,
> -					   GEN6_GDRST, hw_domain_mask, 0,
> -					   500, 0,
> -					   NULL);
> +		/*
> +		 * Wait for the device to ack the reset requests.
> +		 *
> +		 * On some platforms, e.g. Jasperlake, we see see that the
> +		 * engine register state is not cleared until shortly after
> +		 * GDRST reports completion, causing a failure as we try
> +		 * to immediately resume while the internal state is still
> +		 * in flux. If we immediately repeat the reset, the second
> +		 * reset appears to serialise with the first, and since
> +		 * it is a no-op, the registers should retain their reset
> +		 * value. However, there is still a concern that upon
> +		 * leaving the second reset, the internal engine state
> +		 * is still in flux and not ready for resuming.
> +		 */
> +		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
> +						   hw_domain_mask, 0,
> +						   2000, 0,
> +						   NULL);
> +	} while (err == 0 && --loops);
>  	if (err)
>  		GT_TRACE(gt,
>  			 "Wait for 0x%08x engines reset failed\n",
>  			 hw_domain_mask);
>  
> +	/*
> +	 * As we have observed that the engine state is still volatile
> +	 * after GDRST is acked, impose a small delay to let everything settle.
> +	 */
> +	udelay(50);
> +
>  	return err;
>  }
>  
> -- 
> 2.38.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915/gt: Reset twice
  2022-12-12 16:13 ` Andi Shyti
                   ` (2 preceding siblings ...)
  (?)
@ 2022-12-12 18:34 ` Patchwork
  -1 siblings, 0 replies; 27+ messages in thread
From: Patchwork @ 2022-12-12 18:34 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/gt: Reset twice
URL   : https://patchwork.freedesktop.org/series/111859/
State : warning

== Summary ==

Error: dim checkpatch failed
1fbbd5f0943a drm/i915/gt: Reset twice
-:46: WARNING:REPEATED_WORD: Possible repeated word: 'see'
#46: FILE: drivers/gpu/drm/i915/gt/intel_reset.c:285:
+		 * On some platforms, e.g. Jasperlake, we see see that the

-:71: CHECK:USLEEP_RANGE: usleep_range is preferred over udelay; see Documentation/timers/timers-howto.rst
#71: FILE: drivers/gpu/drm/i915/gt/intel_reset.c:310:
+	udelay(50);

-:75: WARNING:FROM_SIGN_OFF_MISMATCH: From:/Signed-off-by: email address mismatch: 'From: Chris Wilson <chris@chris-wilson.co.uk>' != 'Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>'

total: 0 errors, 2 warnings, 1 checks, 52 lines checked



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915/gt: Reset twice
  2022-12-12 16:13 ` Andi Shyti
                   ` (3 preceding siblings ...)
  (?)
@ 2022-12-12 18:46 ` Patchwork
  -1 siblings, 0 replies; 27+ messages in thread
From: Patchwork @ 2022-12-12 18:46 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/gt: Reset twice
URL   : https://patchwork.freedesktop.org/series/111859/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_12496 -> Patchwork_111859v1
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/index.html

Participating hosts (18 -> 19)
------------------------------

  Additional (1): fi-hsw-4770 

Known issues
------------

  Here are the changes found in Patchwork_111859v1 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_gttfill@basic:
    - fi-pnv-d510:        [PASS][1] -> [FAIL][2] ([i915#7229])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/fi-pnv-d510/igt@gem_exec_gttfill@basic.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-pnv-d510/igt@gem_exec_gttfill@basic.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-hsw-4770:        NOTRUN -> [SKIP][3] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-hsw-4770/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_psr@sprite_plane_onoff:
    - fi-hsw-4770:        NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#1072]) +3 similar issues
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-hsw-4770/igt@kms_psr@sprite_plane_onoff.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - fi-hsw-4770:        NOTRUN -> [SKIP][5] ([fdo#109271]) +11 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-hsw-4770/igt@kms_setmode@basic-clone-single-crtc.html
    - fi-snb-2600:        NOTRUN -> [SKIP][6] ([fdo#109271])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-snb-2600/igt@kms_setmode@basic-clone-single-crtc.html

  
#### Warnings ####

  * igt@i915_suspend@basic-s3-without-i915:
    - fi-rkl-11600:       [FAIL][7] ([fdo#103375]) -> [INCOMPLETE][8] ([i915#4817])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html

  
  [fdo#103375]: https://bugs.freedesktop.org/show_bug.cgi?id=103375
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#4817]: https://gitlab.freedesktop.org/drm/intel/issues/4817
  [i915#7229]: https://gitlab.freedesktop.org/drm/intel/issues/7229


Build changes
-------------

  * Linux: CI_DRM_12496 -> Patchwork_111859v1

  CI-20190529: 20190529
  CI_DRM_12496: da695a0fe3c49c4c8709e1e6daabd07fc405cf81 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7091: b8015f920c9f469d3733854263cb878373c1df51 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111859v1: da695a0fe3c49c4c8709e1e6daabd07fc405cf81 @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

94f3ea5ccb5c drm/i915/gt: Reset twice

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/index.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] drm/i915/gt: Reset twice
  2022-12-12 16:55   ` Rodrigo Vivi
  (?)
@ 2022-12-12 23:08     ` Andi Shyti
  -1 siblings, 0 replies; 27+ messages in thread
From: Andi Shyti @ 2022-12-12 23:08 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Andi Shyti, intel-gfx, dri-devel, stable, Mika Kuoppala,
	Andi Shyti, Chris Wilson

Hi Rodrigo,

On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > After applying an engine reset, on some platforms like Jasperlake, we
> > occasionally detect that the engine state is not cleared until shortly
> > after the resume. As we try to resume the engine with volatile internal
> > state, the first request fails with a spurious CS event (it looks like
> > it reports a lite-restore to the hung context, instead of the expected
> > idle->active context switch).
> > 
> > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> 
> There's a typo in the signature email I'm afraid...

oh yes, I forgot the 'C' :)

> Other than that, have we checked the possibility of using the driver-initiated-flr bit
> instead of this second loop? That should be the right way to guarantee everything is
> cleared on gen11+...

Maybe I am misinterpreting it, but is FLR the same as resetting
hardware domains individually?

How am I supposed to use driver_initiated_flr() in this context?

Thanks,
Andi

> > Cc: stable@vger.kernel.org
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_reset.c | 34 ++++++++++++++++++++++-----
> >  1 file changed, 28 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > index ffde89c5835a4..88dfc0c5316ff 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
> >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> >  {
> >  	struct intel_uncore *uncore = gt->uncore;
> > +	int loops = 2;
> >  	int err;
> >  
> >  	/*
> > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> >  	 * for fifo space for the write or forcewake the chip for
> >  	 * the read
> >  	 */
> > -	intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> > +	do {
> > +		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> >  
> > -	/* Wait for the device to ack the reset requests */
> > -	err = __intel_wait_for_register_fw(uncore,
> > -					   GEN6_GDRST, hw_domain_mask, 0,
> > -					   500, 0,
> > -					   NULL);
> > +		/*
> > +		 * Wait for the device to ack the reset requests.
> > +		 *
> > +		 * On some platforms, e.g. Jasperlake, we see see that the
> > +		 * engine register state is not cleared until shortly after
> > +		 * GDRST reports completion, causing a failure as we try
> > +		 * to immediately resume while the internal state is still
> > +		 * in flux. If we immediately repeat the reset, the second
> > +		 * reset appears to serialise with the first, and since
> > +		 * it is a no-op, the registers should retain their reset
> > +		 * value. However, there is still a concern that upon
> > +		 * leaving the second reset, the internal engine state
> > +		 * is still in flux and not ready for resuming.
> > +		 */
> > +		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
> > +						   hw_domain_mask, 0,
> > +						   2000, 0,
> > +						   NULL);
> > +	} while (err == 0 && --loops);
> >  	if (err)
> >  		GT_TRACE(gt,
> >  			 "Wait for 0x%08x engines reset failed\n",
> >  			 hw_domain_mask);
> >  
> > +	/*
> > +	 * As we have observed that the engine state is still volatile
> > +	 * after GDRST is acked, impose a small delay to let everything settle.
> > +	 */
> > +	udelay(50);
> > +
> >  	return err;
> >  }
> >  
> > -- 
> > 2.38.1
> > 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] drm/i915/gt: Reset twice
@ 2022-12-12 23:08     ` Andi Shyti
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Shyti @ 2022-12-12 23:08 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Andi Shyti, Mika Kuoppala, intel-gfx, stable, Chris Wilson,
	dri-devel, Andi Shyti

Hi Rodrigo,

On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > After applying an engine reset, on some platforms like Jasperlake, we
> > occasionally detect that the engine state is not cleared until shortly
> > after the resume. As we try to resume the engine with volatile internal
> > state, the first request fails with a spurious CS event (it looks like
> > it reports a lite-restore to the hung context, instead of the expected
> > idle->active context switch).
> > 
> > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> 
> There's a typo in the signature email I'm afraid...

oh yes, I forgot the 'C' :)

> Other than that, have we checked the possibility of using the driver-initiated-flr bit
> instead of this second loop? That should be the right way to guarantee everything is
> cleared on gen11+...

maybe I am misinterpreting it, but is FLR the same as resetting
hardware domains individually?

How am I supposed to use driver_initiated_flr() in this context?

Thanks,
Andi

> > Cc: stable@vger.kernel.org
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_reset.c | 34 ++++++++++++++++++++++-----
> >  1 file changed, 28 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > index ffde89c5835a4..88dfc0c5316ff 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
> >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> >  {
> >  	struct intel_uncore *uncore = gt->uncore;
> > +	int loops = 2;
> >  	int err;
> >  
> >  	/*
> > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> >  	 * for fifo space for the write or forcewake the chip for
> >  	 * the read
> >  	 */
> > -	intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> > +	do {
> > +		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> >  
> > -	/* Wait for the device to ack the reset requests */
> > -	err = __intel_wait_for_register_fw(uncore,
> > -					   GEN6_GDRST, hw_domain_mask, 0,
> > -					   500, 0,
> > -					   NULL);
> > +		/*
> > +		 * Wait for the device to ack the reset requests.
> > +		 *
> > +		 * On some platforms, e.g. Jasperlake, we see that the
> > +		 * engine register state is not cleared until shortly after
> > +		 * GDRST reports completion, causing a failure as we try
> > +		 * to immediately resume while the internal state is still
> > +		 * in flux. If we immediately repeat the reset, the second
> > +		 * reset appears to serialise with the first, and since
> > +		 * it is a no-op, the registers should retain their reset
> > +		 * value. However, there is still a concern that upon
> > +		 * leaving the second reset, the internal engine state
> > +		 * is still in flux and not ready for resuming.
> > +		 */
> > +		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
> > +						   hw_domain_mask, 0,
> > +						   2000, 0,
> > +						   NULL);
> > +	} while (err == 0 && --loops);
> >  	if (err)
> >  		GT_TRACE(gt,
> >  			 "Wait for 0x%08x engines reset failed\n",
> >  			 hw_domain_mask);
> >  
> > +	/*
> > +	 * As we have observed that the engine state is still volatile
> > +	 * after GDRST is acked, impose a small delay to let everything settle.
> > +	 */
> > +	udelay(50);
> > +
> >  	return err;
> >  }
> >  
> > -- 
> > 2.38.1
> > 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/gt: Reset twice
  2022-12-12 16:13 ` Andi Shyti
                   ` (4 preceding siblings ...)
  (?)
@ 2022-12-13 10:11 ` Patchwork
  -1 siblings, 0 replies; 27+ messages in thread
From: Patchwork @ 2022-12-13 10:11 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 17363 bytes --]

== Series Details ==

Series: drm/i915/gt: Reset twice
URL   : https://patchwork.freedesktop.org/series/111859/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12496_full -> Patchwork_111859v1_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_111859v1_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_111859v1_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (14 -> 9)
------------------------------

  Missing    (5): shard-tglu-9 shard-tglu-10 shard-tglu shard-rkl shard-dg1 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_111859v1_full:

### IGT changes ###

#### Possible regressions ####

  * igt@drm_import_export@prime:
    - shard-tglb:         [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb7/igt@drm_import_export@prime.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb2/igt@drm_import_export@prime.html

  
Known issues
------------

  Here are the changes found in Patchwork_111859v1_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ctx_persistence@smoketest:
    - shard-tglb:         [PASS][3] -> [FAIL][4] ([i915#5099])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb2/igt@gem_ctx_persistence@smoketest.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb7/igt@gem_ctx_persistence@smoketest.html

  * igt@gem_exec_balancer@parallel-bb-first:
    - shard-iclb:         [PASS][5] -> [SKIP][6] ([i915#4525]) +2 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb1/igt@gem_exec_balancer@parallel-bb-first.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb3/igt@gem_exec_balancer@parallel-bb-first.html

  * igt@gem_softpin@noreloc-s3:
    - shard-skl:          [PASS][7] -> [INCOMPLETE][8] ([i915#7236])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl4/igt@gem_softpin@noreloc-s3.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@gem_softpin@noreloc-s3.html

  * igt@gem_userptr_blits@input-checking:
    - shard-skl:          NOTRUN -> [DMESG-WARN][9] ([i915#4991])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@gem_userptr_blits@input-checking.html

  * igt@i915_pm_rc6_residency@rc6-idle@vcs0:
    - shard-skl:          [PASS][10] -> [WARN][11] ([i915#1804])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl1/igt@i915_pm_rc6_residency@rc6-idle@vcs0.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl1/igt@i915_pm_rc6_residency@rc6-idle@vcs0.html

  * igt@i915_pm_rpm@modeset-non-lpsp-stress-no-wait:
    - shard-skl:          NOTRUN -> [SKIP][12] ([fdo#109271]) +29 similar issues
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@i915_pm_rpm@modeset-non-lpsp-stress-no-wait.html

  * igt@kms_async_flips@alternate-sync-async-flip@pipe-b-edp-1:
    - shard-skl:          [PASS][13] -> [FAIL][14] ([i915#2521]) +1 similar issue
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl1/igt@kms_async_flips@alternate-sync-async-flip@pipe-b-edp-1.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl1/igt@kms_async_flips@alternate-sync-async-flip@pipe-b-edp-1.html

  * igt@kms_chamelium@dp-edid-change-during-suspend:
    - shard-skl:          NOTRUN -> [SKIP][15] ([fdo#109271] / [fdo#111827]) +3 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@kms_chamelium@dp-edid-change-during-suspend.html

  * igt@kms_cursor_legacy@cursor-vs-flip@atomic-transitions:
    - shard-iclb:         [PASS][16] -> [FAIL][17] ([i915#5072])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb3/igt@kms_cursor_legacy@cursor-vs-flip@atomic-transitions.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb7/igt@kms_cursor_legacy@cursor-vs-flip@atomic-transitions.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-apl:          NOTRUN -> [FAIL][18] ([i915#4767])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-apl8/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1:
    - shard-skl:          [PASS][19] -> [FAIL][20] ([i915#79])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl7/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl6/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling@pipe-a-default-mode:
    - shard-iclb:         NOTRUN -> [SKIP][21] ([i915#2672] / [i915#3555])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb3/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling@pipe-a-default-mode.html

  * igt@kms_flip_scaled_crc@flip-64bpp-4tile-to-32bpp-4tiledg2rcccs-downscaling@pipe-a-valid-mode:
    - shard-iclb:         NOTRUN -> [SKIP][22] ([i915#2587] / [i915#2672]) +1 similar issue
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb5/igt@kms_flip_scaled_crc@flip-64bpp-4tile-to-32bpp-4tiledg2rcccs-downscaling@pipe-a-valid-mode.html

  * igt@kms_flip_scaled_crc@flip-64bpp-xtile-to-16bpp-xtile-downscaling@pipe-a-default-mode:
    - shard-iclb:         NOTRUN -> [SKIP][23] ([i915#3555])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@kms_flip_scaled_crc@flip-64bpp-xtile-to-16bpp-xtile-downscaling@pipe-a-default-mode.html

  * igt@kms_flip_scaled_crc@flip-64bpp-yftile-to-32bpp-yftile-upscaling@pipe-a-default-mode:
    - shard-iclb:         NOTRUN -> [SKIP][24] ([i915#2672]) +7 similar issues
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@kms_flip_scaled_crc@flip-64bpp-yftile-to-32bpp-yftile-upscaling@pipe-a-default-mode.html

  * igt@kms_plane_scaling@invalid-num-scalers@pipe-a-edp-1-invalid-num-scalers:
    - shard-skl:          NOTRUN -> [SKIP][25] ([fdo#109271] / [i915#5776]) +2 similar issues
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@kms_plane_scaling@invalid-num-scalers@pipe-a-edp-1-invalid-num-scalers.html

  * igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-pixel-formats@pipe-b-edp-1:
    - shard-iclb:         [PASS][26] -> [SKIP][27] ([i915#5176]) +1 similar issue
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb8/igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-pixel-formats@pipe-b-edp-1.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb3/igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-pixel-formats@pipe-b-edp-1.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-big-fb:
    - shard-skl:          NOTRUN -> [SKIP][28] ([fdo#109271] / [i915#658])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-big-fb.html

  * igt@kms_psr2_su@page_flip-p010@pipe-b-edp-1:
    - shard-iclb:         NOTRUN -> [FAIL][29] ([i915#5939]) +2 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@kms_psr2_su@page_flip-p010@pipe-b-edp-1.html

  * igt@kms_psr2_su@page_flip-xrgb8888:
    - shard-iclb:         NOTRUN -> [SKIP][30] ([fdo#109642] / [fdo#111068] / [i915#658]) +1 similar issue
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb6/igt@kms_psr2_su@page_flip-xrgb8888.html

  * igt@kms_psr@psr2_primary_blt:
    - shard-iclb:         [PASS][31] -> [SKIP][32] ([fdo#109441]) +4 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb2/igt@kms_psr@psr2_primary_blt.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb7/igt@kms_psr@psr2_primary_blt.html

  * igt@kms_psr_stress_test@flip-primary-invalidate-overlay:
    - shard-tglb:         [PASS][33] -> [SKIP][34] ([i915#5519])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb1/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb3/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html

  * igt@perf@polling:
    - shard-skl:          [PASS][35] -> [FAIL][36] ([i915#1542])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl7/igt@perf@polling.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl6/igt@perf@polling.html

  
#### Possible fixes ####

  * igt@gem_exec_balancer@parallel-balancer:
    - shard-iclb:         [SKIP][37] ([i915#4525]) -> [PASS][38]
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb8/igt@gem_exec_balancer@parallel-balancer.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@gem_exec_balancer@parallel-balancer.html

  * igt@gem_huc_copy@huc-copy:
    - shard-tglb:         [SKIP][39] ([i915#2190]) -> [PASS][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb7/igt@gem_huc_copy@huc-copy.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb1/igt@gem_huc_copy@huc-copy.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-iclb:         [FAIL][41] ([i915#3989] / [i915#454]) -> [PASS][42]
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb6/igt@i915_pm_dc@dc6-psr.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb1/igt@i915_pm_dc@dc6-psr.html

  * igt@kms_flip@flip-vs-suspend-interruptible@c-edp1:
    - shard-tglb:         [DMESG-WARN][43] ([i915#2411] / [i915#2867]) -> [PASS][44]
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb3/igt@kms_flip@flip-vs-suspend-interruptible@c-edp1.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb5/igt@kms_flip@flip-vs-suspend-interruptible@c-edp1.html

  * igt@kms_flip@flip-vs-suspend@b-dp1:
    - shard-apl:          [DMESG-WARN][45] ([i915#180]) -> [PASS][46]
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-apl6/igt@kms_flip@flip-vs-suspend@b-dp1.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-apl8/igt@kms_flip@flip-vs-suspend@b-dp1.html

  * igt@kms_flip@plain-flip-fb-recreate@a-edp1:
    - shard-skl:          [FAIL][47] ([i915#2122]) -> [PASS][48] +1 similar issue
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl6/igt@kms_flip@plain-flip-fb-recreate@a-edp1.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@kms_flip@plain-flip-fb-recreate@a-edp1.html

  * igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1:
    - shard-iclb:         [SKIP][49] ([i915#5176]) -> [PASS][50] +2 similar issues
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb2/igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb6/igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1.html

  * igt@kms_psr@psr2_sprite_render:
    - shard-iclb:         [SKIP][51] ([fdo#109441]) -> [PASS][52] +1 similar issue
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb8/igt@kms_psr@psr2_sprite_render.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@kms_psr@psr2_sprite_render.html

  * igt@kms_psr_stress_test@invalidate-primary-flip-overlay:
    - shard-tglb:         [SKIP][53] ([i915#5519]) -> [PASS][54]
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-tglb5/igt@kms_psr_stress_test@invalidate-primary-flip-overlay.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-tglb6/igt@kms_psr_stress_test@invalidate-primary-flip-overlay.html

  
#### Warnings ####

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area:
    - shard-iclb:         [SKIP][55] ([fdo#111068] / [i915#658]) -> [SKIP][56] ([i915#2920])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-iclb1/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-iclb2/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area.html

  * igt@runner@aborted:
    - shard-apl:          ([FAIL][57], [FAIL][58], [FAIL][59]) ([i915#180] / [i915#3002] / [i915#4312]) -> ([FAIL][60], [FAIL][61]) ([i915#3002] / [i915#4312])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-apl1/igt@runner@aborted.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-apl3/igt@runner@aborted.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-apl6/igt@runner@aborted.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-apl6/igt@runner@aborted.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-apl6/igt@runner@aborted.html
    - shard-skl:          ([FAIL][62], [FAIL][63]) ([i915#4312] / [i915#6949]) -> ([FAIL][64], [FAIL][65]) ([i915#3002] / [i915#4312] / [i915#6949])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl6/igt@runner@aborted.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12496/shard-skl6/igt@runner@aborted.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@runner@aborted.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/shard-skl7/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1542]: https://gitlab.freedesktop.org/drm/intel/issues/1542
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1804]: https://gitlab.freedesktop.org/drm/intel/issues/1804
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2411]: https://gitlab.freedesktop.org/drm/intel/issues/2411
  [i915#2521]: https://gitlab.freedesktop.org/drm/intel/issues/2521
  [i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
  [i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3989]: https://gitlab.freedesktop.org/drm/intel/issues/3989
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#4767]: https://gitlab.freedesktop.org/drm/intel/issues/4767
  [i915#4991]: https://gitlab.freedesktop.org/drm/intel/issues/4991
  [i915#5072]: https://gitlab.freedesktop.org/drm/intel/issues/5072
  [i915#5099]: https://gitlab.freedesktop.org/drm/intel/issues/5099
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5519]: https://gitlab.freedesktop.org/drm/intel/issues/5519
  [i915#5776]: https://gitlab.freedesktop.org/drm/intel/issues/5776
  [i915#5939]: https://gitlab.freedesktop.org/drm/intel/issues/5939
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
  [i915#6949]: https://gitlab.freedesktop.org/drm/intel/issues/6949
  [i915#7236]: https://gitlab.freedesktop.org/drm/intel/issues/7236
  [i915#7688]: https://gitlab.freedesktop.org/drm/intel/issues/7688
  [i915#79]: https://gitlab.freedesktop.org/drm/intel/issues/79


Build changes
-------------

  * Linux: CI_DRM_12496 -> Patchwork_111859v1

  CI-20190529: 20190529
  CI_DRM_12496: da695a0fe3c49c4c8709e1e6daabd07fc405cf81 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7091: b8015f920c9f469d3733854263cb878373c1df51 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111859v1: da695a0fe3c49c4c8709e1e6daabd07fc405cf81 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111859v1/index.html

[-- Attachment #2: Type: text/html, Size: 20107 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] drm/i915/gt: Reset twice
  2022-12-12 23:08     ` Andi Shyti
  (?)
@ 2022-12-13 13:18       ` Vivi, Rodrigo
  -1 siblings, 0 replies; 27+ messages in thread
From: Vivi, Rodrigo @ 2022-12-13 13:18 UTC (permalink / raw)
  To: andi.shyti; +Cc: dri-devel, mika.kuoppala, stable, intel-gfx, andi, chris

On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
> Hi Rodrigo,
> 
> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> > On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > > From: Chris Wilson <chris@chris-wilson.co.uk>
> > > 
> > > After applying an engine reset, on some platforms like
> > > Jasperlake, we
> > > occasionally detect that the engine state is not cleared until
> > > shortly
> > > after the resume. As we try to resume the engine with volatile
> > > internal
> > > state, the first request fails with a spurious CS event (it looks
> > > like
> > > it reports a lite-restore to the hung context, instead of the
> > > expected
> > > idle->active context switch).
> > > 
> > > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> > 
> > There's a typo in the signature email I'm afraid...
> 
> oh yes, I forgot the 'C' :)

you forgot?
who signed it off?

> 
> > Other than that, have we checked the possibility of using the
> > driver-initiated-flr bit
> > instead of this second loop? That should be the right way to
> > guarantee everything is
> > cleared on gen11+...
> 
> maybe I am misinterpreting it, but is FLR the same as resetting
> hardware domains individually?

No, it is bigger than that... almost the PCI FLR with some exceptions:

https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html

> How am I supposed to use driver_initiated_flr() in this context?

Some drivers no longer use this GT reset at all and go directly to the
driver-initiated FLR when the GuC reset fails.

I believe we can keep the GT reset as we currently have it, but instead
of retrying it until it succeeds we should probably go to the next
level and do the driver-initiated FLR when the GT reset fails.
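
A minimal sketch of that escalation order, assuming a mock device: try
the GT reset once and fall back to the heavier FLR only if it fails.
struct mock_device, mock_gt_reset(), mock_driver_flr() and
reset_with_escalation() are hypothetical names made up for this
illustration; none of them is the i915 API.

```c
#include <stdbool.h>

/* Which reset level ended up bringing the device back, if any. */
enum reset_level { RESET_NONE, RESET_GT, RESET_FLR };

/* Hypothetical mock device; the flags decide which resets succeed. */
struct mock_device {
	bool gt_reset_works;
	bool flr_works;
	int gt_attempts;
	int flr_attempts;
};

/* Stand-in for the GT reset; returns 0 on success. */
static int mock_gt_reset(struct mock_device *dev)
{
	dev->gt_attempts++;
	return dev->gt_reset_works ? 0 : -1;
}

/* Stand-in for a driver-initiated FLR; returns 0 on success. */
static int mock_driver_flr(struct mock_device *dev)
{
	dev->flr_attempts++;
	return dev->flr_works ? 0 : -1;
}

/* Try the lighter reset first and escalate once, without retrying. */
static int reset_with_escalation(struct mock_device *dev,
				 enum reset_level *used)
{
	int err = mock_gt_reset(dev);

	if (!err) {
		*used = RESET_GT;
		return 0;
	}

	err = mock_driver_flr(dev);
	*used = err ? RESET_NONE : RESET_FLR;
	return err;
}
```

The point of the sketch is only the ordering: the FLR is never attempted
when the GT reset succeeds, and the GT reset is attempted exactly once
before escalating.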

> 
> Thanks,
> Andi
> 
> > > Cc: stable@vger.kernel.org
> > > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_reset.c | 34
> > > ++++++++++++++++++++++-----
> > >  1 file changed, 28 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
> > > intel_engine_mask_t engine_mask,
> > >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32
> > > hw_domain_mask)
> > >  {
> > >         struct intel_uncore *uncore = gt->uncore;
> > > +       int loops = 2;
> > >         int err;
> > >  
> > >         /*
> > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
> > > intel_gt *gt, u32 hw_domain_mask)
> > >          * for fifo space for the write or forcewake the chip for
> > >          * the read
> > >          */
> > > -       intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > hw_domain_mask);
> > > +       do {
> > > +               intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > hw_domain_mask);
> > >  
> > > -       /* Wait for the device to ack the reset requests */
> > > -       err = __intel_wait_for_register_fw(uncore,
> > > -                                          GEN6_GDRST,
> > > hw_domain_mask, 0,
> > > -                                          500, 0,
> > > -                                          NULL);
> > > +               /*
> > > +                * Wait for the device to ack the reset requests.
> > > +                *
> > > > +                * On some platforms, e.g. Jasperlake, we see
> > > that the
> > > +                * engine register state is not cleared until
> > > shortly after
> > > +                * GDRST reports completion, causing a failure as
> > > we try
> > > +                * to immediately resume while the internal state
> > > is still
> > > +                * in flux. If we immediately repeat the reset,
> > > the second
> > > +                * reset appears to serialise with the first, and
> > > since
> > > +                * it is a no-op, the registers should retain
> > > their reset
> > > +                * value. However, there is still a concern that
> > > upon
> > > +                * leaving the second reset, the internal engine
> > > state
> > > +                * is still in flux and not ready for resuming.
> > > +                */
> > > +               err = __intel_wait_for_register_fw(uncore,
> > > GEN6_GDRST,
> > > +                                                 
> > > hw_domain_mask, 0,
> > > +                                                  2000, 0,
> > > +                                                  NULL);
> > > +       } while (err == 0 && --loops);
> > >         if (err)
> > >                 GT_TRACE(gt,
> > >                          "Wait for 0x%08x engines reset
> > > failed\n",
> > >                          hw_domain_mask);
> > >  
> > > +       /*
> > > +        * As we have observed that the engine state is still
> > > volatile
> > > +        * after GDRST is acked, impose a small delay to let
> > > everything settle.
> > > +        */
> > > +       udelay(50);
> > > +
> > >         return err;
> > >  }
> > >  
> > > -- 
> > > 2.38.1
> > > 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] drm/i915/gt: Reset twice
@ 2022-12-13 13:18       ` Vivi, Rodrigo
  0 siblings, 0 replies; 27+ messages in thread
From: Vivi, Rodrigo @ 2022-12-13 13:18 UTC (permalink / raw)
  To: andi.shyti; +Cc: mika.kuoppala, intel-gfx, stable, chris, dri-devel, andi

On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
> Hi Rodrigo,
> 
> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> > On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > > From: Chris Wilson <chris@chris-wilson.co.uk>
> > > 
> > > After applying an engine reset, on some platforms like
> > > Jasperlake, we
> > > occasionally detect that the engine state is not cleared until
> > > shortly
> > > after the resume. As we try to resume the engine with volatile
> > > internal
> > > state, the first request fails with a spurious CS event (it looks
> > > like
> > > it reports a lite-restore to the hung context, instead of the
> > > expected
> > > idle->active context switch).
> > > 
> > > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> > 
> > There's a typo in the signature email I'm afraid...
> 
> oh yes, I forgot the 'C' :)

you forgot?
who signed it off?

> 
> > Other than that, have we checked the possibility of using the
> > driver-initiated-flr bit
> > instead of this second loop? That should be the right way to
> > guarantee everything is
> > cleared on gen11+...
> 
> maybe I am misinterpreting it, but is FLR the same as resetting
> hardware domains individually?

No, it is bigger than that... almost the PCI FLR with some exceptions:

https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html

> How am I supposed to use driver_initiated_flr() in this context?

Some drivers no longer use this GT reset at all and go directly to the
driver-initiated FLR when the GuC reset fails.

I believe we can keep the GT reset as we currently have it, but instead
of retrying it until it succeeds we should probably go to the next
level and do the driver-initiated FLR when the GT reset fails.

> 
> Thanks,
> Andi
> 
> > > Cc: stable@vger.kernel.org
> > > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_reset.c | 34
> > > ++++++++++++++++++++++-----
> > >  1 file changed, 28 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
> > > intel_engine_mask_t engine_mask,
> > >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32
> > > hw_domain_mask)
> > >  {
> > >         struct intel_uncore *uncore = gt->uncore;
> > > +       int loops = 2;
> > >         int err;
> > >  
> > >         /*
> > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
> > > intel_gt *gt, u32 hw_domain_mask)
> > >          * for fifo space for the write or forcewake the chip for
> > >          * the read
> > >          */
> > > -       intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > hw_domain_mask);
> > > +       do {
> > > +               intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > hw_domain_mask);
> > >  
> > > -       /* Wait for the device to ack the reset requests */
> > > -       err = __intel_wait_for_register_fw(uncore,
> > > -                                          GEN6_GDRST,
> > > hw_domain_mask, 0,
> > > -                                          500, 0,
> > > -                                          NULL);
> > > +               /*
> > > +                * Wait for the device to ack the reset requests.
> > > +                *
> > > +                * On some platforms, e.g. Jasperlake, we see see
> > > that the
> > > +                * engine register state is not cleared until
> > > shortly after
> > > +                * GDRST reports completion, causing a failure as
> > > we try
> > > +                * to immediately resume while the internal state
> > > is still
> > > +                * in flux. If we immediately repeat the reset,
> > > the second
> > > +                * reset appears to serialise with the first, and
> > > since
> > > +                * it is a no-op, the registers should retain
> > > their reset
> > > +                * value. However, there is still a concern that
> > > upon
> > > +                * leaving the second reset, the internal engine
> > > state
> > > +                * is still in flux and not ready for resuming.
> > > +                */
> > > +               err = __intel_wait_for_register_fw(uncore,
> > > GEN6_GDRST,
> > > +                                                 
> > > hw_domain_mask, 0,
> > > +                                                  2000, 0,
> > > +                                                  NULL);
> > > +       } while (err == 0 && --loops);
> > >         if (err)
> > >                 GT_TRACE(gt,
> > >                          "Wait for 0x%08x engines reset
> > > failed\n",
> > >                          hw_domain_mask);
> > >  
> > > +       /*
> > > +        * As we have observed that the engine state is still
> > > volatile
> > > +        * after GDRST is acked, impose a small delay to let
> > > everything settle.
> > > +        */
> > > +       udelay(50);
> > > +
> > >         return err;
> > >  }
> > >  
> > > -- 
> > > 2.38.1
> > > 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] drm/i915/gt: Reset twice
  2022-12-13 13:18       ` Vivi, Rodrigo
  (?)
@ 2022-12-14 22:37         ` Andi Shyti
  -1 siblings, 0 replies; 27+ messages in thread
From: Andi Shyti @ 2022-12-14 22:37 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: andi.shyti, dri-devel, mika.kuoppala, stable, intel-gfx, andi, chris

Hi Rodrigo,

On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
> > Hi Rodrigo,
> > 
> > On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> > > On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > > > From: Chris Wilson <chris@chris-wilson.co.uk>
> > > > 
> > > > After applying an engine reset, on some platforms like
> > > > Jasperlake, we
> > > > occasionally detect that the engine state is not cleared until
> > > > shortly
> > > > after the resume. As we try to resume the engine with volatile
> > > > internal
> > > > state, the first request fails with a spurious CS event (it looks
> > > > like
> > > > it reports a lite-restore to the hung context, instead of the
> > > > expected
> > > > idle->active context switch).
> > > > 
> > > > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> > > 
> > > There's a typo in the signature email I'm afraid...
> > 
> > oh yes, I forgot the 'C' :)
> 
> you forgot?
> who signed it off?

Chris, but as I was copy/pasting SoB's I might have
unintentionally removed the 'c'.

> > > Other than that, have we checked the possibility of using the
> > > driver-initiated-flr bit
> > > instead of this second loop? That should be the right way to
> > > guarantee everything is
> > > cleared on gen11+...
> > 
> > maybe I am misinterpreting it, but is FLR the same as resetting
> > hardware domains individually?
> 
> No, it is bigger than that... almost the PCI FLR with some exceptions:
> 
> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html

yes, exactly... I would use the FLR feedback if I were performing an
FLR. But I'm not doing that here; I'm simply gating off some power
domains. It so happens that turning those power domains off and on
resets the engines.

FLR has nothing to do with this, also because if you want to reset
a single engine you go through this function, instead of resetting
the whole GPU and everything attached to it.

This patch is not fixing the "reset" concept of i915; it is fixing
missing feedback that affects one single platform when gating a
domain on/off.

Maybe I am completely off track, but I don't see the connection
with FLR here.

(besides, FLR might not be present on all platforms)

Thanks a lot for your inputs,
Andi

> > How am I supposed to use driver_initiated_flr() in this context?
> 
> Some drivers are not even using this gt reset anymore and going
> directly with the driver-initiated FLR in case that guc reset failed.
> 
> I believe we can still keep the gt reset in our case as we currently
> have, but instead of keep retrying it until it succeeds we probably
> should go to the next level and do the driver initiated FLR when the GT
> reset failed.
> 
> > 
> > Thanks,
> > Andi
> > 
> > > > Cc: stable@vger.kernel.org
> > > > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_reset.c | 34
> > > > ++++++++++++++++++++++-----
> > > >  1 file changed, 28 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
> > > > intel_engine_mask_t engine_mask,
> > > >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32
> > > > hw_domain_mask)
> > > >  {
> > > >         struct intel_uncore *uncore = gt->uncore;
> > > > +       int loops = 2;
> > > >         int err;
> > > >  
> > > >         /*
> > > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
> > > > intel_gt *gt, u32 hw_domain_mask)
> > > >          * for fifo space for the write or forcewake the chip for
> > > >          * the read
> > > >          */
> > > > -       intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > hw_domain_mask);
> > > > +       do {
> > > > +               intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > hw_domain_mask);
> > > >  
> > > > -       /* Wait for the device to ack the reset requests */
> > > > -       err = __intel_wait_for_register_fw(uncore,
> > > > -                                          GEN6_GDRST,
> > > > hw_domain_mask, 0,
> > > > -                                          500, 0,
> > > > -                                          NULL);
> > > > +               /*
> > > > +                * Wait for the device to ack the reset requests.
> > > > +                *
> > > > +                * On some platforms, e.g. Jasperlake, we see see
> > > > that the
> > > > +                * engine register state is not cleared until
> > > > shortly after
> > > > +                * GDRST reports completion, causing a failure as
> > > > we try
> > > > +                * to immediately resume while the internal state
> > > > is still
> > > > +                * in flux. If we immediately repeat the reset,
> > > > the second
> > > > +                * reset appears to serialise with the first, and
> > > > since
> > > > +                * it is a no-op, the registers should retain
> > > > their reset
> > > > +                * value. However, there is still a concern that
> > > > upon
> > > > +                * leaving the second reset, the internal engine
> > > > state
> > > > +                * is still in flux and not ready for resuming.
> > > > +                */
> > > > +               err = __intel_wait_for_register_fw(uncore,
> > > > GEN6_GDRST,
> > > > +                                                 
> > > > hw_domain_mask, 0,
> > > > +                                                  2000, 0,
> > > > +                                                  NULL);
> > > > +       } while (err == 0 && --loops);
> > > >         if (err)
> > > >                 GT_TRACE(gt,
> > > >                          "Wait for 0x%08x engines reset
> > > > failed\n",
> > > >                          hw_domain_mask);
> > > >  
> > > > +       /*
> > > > +        * As we have observed that the engine state is still
> > > > volatile
> > > > +        * after GDRST is acked, impose a small delay to let
> > > > everything settle.
> > > > +        */
> > > > +       udelay(50);
> > > > +
> > > >         return err;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.38.1
> > > > 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/gt: Reset twice
  2022-12-14 22:37         ` Andi Shyti
@ 2022-12-15 20:07           ` Rodrigo Vivi
  -1 siblings, 0 replies; 27+ messages in thread
From: Rodrigo Vivi @ 2022-12-15 20:07 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx, stable, chris, dri-devel

On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
> Hi Rodrigo,
> 
> On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
> > On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
> > > Hi Rodrigo,
> > > 
> > > On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> > > > On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > > > > From: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > 
> > > > > After applying an engine reset, on some platforms like
> > > > > Jasperlake, we
> > > > > occasionally detect that the engine state is not cleared until
> > > > > shortly
> > > > > after the resume. As we try to resume the engine with volatile
> > > > > internal
> > > > > state, the first request fails with a spurious CS event (it looks
> > > > > like
> > > > > it reports a lite-restore to the hung context, instead of the
> > > > > expected
> > > > > idle->active context switch).
> > > > > 
> > > > > Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
> > > > 
> > > > There's a typo in the signature email I'm afraid...
> > > 
> > > oh yes, I forgot the 'C' :)
> > 
> > you forgot?
> > who signed it off?
> 
> Chris, but as I was copy/pasting SoB's I might have
> unintentionally removed the 'c'.
> 
> > > > Other than that, have we checked the possibility of using the
> > > > driver-initiated-flr bit
> > > > instead of this second loop? That should be the right way to
> > > > guarantee everything is
> > > > cleared on gen11+...
> > > 
> > > maybe I am misinterpreting it, but is FLR the same as resetting
> > > hardware domains individually?
> > 
> > No, it is bigger than that... almost the PCI FLR with some exceptions:
> > 
> > https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html
> 
> yes, exactly... I would use FLR feedback if I was performing an
> FLR reset. But here I'm not doing that, here I'm simply gating
> off some power domains. It happens that those power domains turn
> on and off engines making them reset.

Is this issue only seen when this reset is called from the
sanitize functions at probe and resume?
Or from any kind of GT reset?

I don't remember seeing any reference link to the bug in the patch,
hence I'm assuming this is happening in any kind of gt reset that
ends up in this function.

> 
> FLR doesn't have anything to do here, also because if you want to
> reset a single engine you go through this function, instead of
> resetting the whole GPU with whatever is annexed.

Yep, that might be too extreme depending on the case. But all I
asked was whether we were considering this option, since this is the
recommended way of resetting our engines nowadays.

> 
> This patch is not fixing the "reset" concept of i915, but it's
> fixing a missing feedback that happens in one single platform
> when trying to gate on/off a domain.

But it is changing the reset concept and timeouts for all the reset
cases in all the platforms.

> 
> Maybe I am completely off track, but I don't see connection with
> FLR here.

The point is that if a reset is needed, for any reason,
the recommended way for Jasperlake, and any other newer platform,
is to use the FLR rather than the engine reset. But we are using
the engine reset, and now twice, rather than attempting the recommended
way.
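
One way to read the escalation idea raised in this thread (keep the
engine reset, but fall back to a driver-initiated FLR when it fails) is
sketched below. This is a user-space model only, not i915 code: the
names reset_ops, reset_with_escalation(), stub_ok() and stub_fail() are
hypothetical, and the real reset paths are replaced by callbacks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcome: which reset level, if any, succeeded. */
enum reset_level { RESET_FAILED, RESET_ENGINE, RESET_FLR };

/* Hypothetical callbacks standing in for the real reset paths. */
struct reset_ops {
	int (*engine_reset)(void *ctx);
	int (*flr)(void *ctx);	/* NULL where FLR is not available */
};

/* Stubs used only to exercise the model. */
static int stub_ok(void *ctx) { (void)ctx; return 0; }
static int stub_fail(void *ctx) { (void)ctx; return -EIO; }

/*
 * Try the cheaper engine reset first; escalate to FLR only if the
 * engine reset fails and the platform provides an FLR path.
 */
static enum reset_level
reset_with_escalation(const struct reset_ops *ops, void *ctx)
{
	assert(ops && ops->engine_reset);

	if (ops->engine_reset(ctx) == 0)
		return RESET_ENGINE;

	if (ops->flr && ops->flr(ctx) == 0)
		return RESET_FLR;

	return RESET_FAILED;
}
```

With stub_fail as the engine reset and stub_ok as the FLR callback the
model returns RESET_FLR; with no FLR callback it returns RESET_FAILED.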

> 
> (besides FLR might not be present in all the platforms)

This issue is also not present in all the platforms and you are still
increasing the loops and delay for all the platforms.

> 
> Thanks a lot for your inputs,

Have we looked at the Jasperlake workarounds to see if we are missing
anything there that could help us in this case, instead of this extreme
approach of randomly increasing timeouts and attempts for all the
platforms?

> Andi
> 
> > > How am I supposed to use driver_initiated_flr() in this context?
> > 
> > Some drivers are not even using this gt reset anymore and going
> > directly with the driver-initiated FLR in case that guc reset failed.
> > 
> > I believe we can still keep the gt reset in our case as we currently
> > have, but instead of keep retrying it until it succeeds we probably
> > should go to the next level and do the driver initiated FLR when the GT
> > reset failed.
> > 
> > > 
> > > Thanks,
> > > Andi
> > > 
> > > > > Cc: stable@vger.kernel.org
> > > > > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gt/intel_reset.c | 34
> > > > > ++++++++++++++++++++++-----
> > > > >  1 file changed, 28 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
> > > > > intel_engine_mask_t engine_mask,
> > > > >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32
> > > > > hw_domain_mask)
> > > > >  {
> > > > >         struct intel_uncore *uncore = gt->uncore;
> > > > > +       int loops = 2;
> > > > >         int err;
> > > > >  
> > > > >         /*
> > > > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
> > > > > intel_gt *gt, u32 hw_domain_mask)
> > > > >          * for fifo space for the write or forcewake the chip for
> > > > >          * the read
> > > > >          */
> > > > > -       intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > > hw_domain_mask);
> > > > > +       do {
> > > > > +               intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > > hw_domain_mask);
> > > > >  
> > > > > -       /* Wait for the device to ack the reset requests */
> > > > > -       err = __intel_wait_for_register_fw(uncore,
> > > > > -                                          GEN6_GDRST,
> > > > > hw_domain_mask, 0,
> > > > > -                                          500, 0,
> > > > > -                                          NULL);
> > > > > +               /*
> > > > > +                * Wait for the device to ack the reset requests.
> > > > > +                *
> > > > > +                * On some platforms, e.g. Jasperlake, we see see
> > > > > that the
> > > > > +                * engine register state is not cleared until
> > > > > shortly after
> > > > > +                * GDRST reports completion, causing a failure as
> > > > > we try
> > > > > +                * to immediately resume while the internal state
> > > > > is still
> > > > > +                * in flux. If we immediately repeat the reset,
> > > > > the second
> > > > > +                * reset appears to serialise with the first, and
> > > > > since
> > > > > +                * it is a no-op, the registers should retain
> > > > > their reset
> > > > > +                * value. However, there is still a concern that
> > > > > upon
> > > > > +                * leaving the second reset, the internal engine
> > > > > state
> > > > > +                * is still in flux and not ready for resuming.
> > > > > +                */
> > > > > +               err = __intel_wait_for_register_fw(uncore,
> > > > > GEN6_GDRST,
> > > > > +                                                 
> > > > > hw_domain_mask, 0,
> > > > > +                                                  2000, 0,
> > > > > +                                                  NULL);
> > > > > +       } while (err == 0 && --loops);
> > > > >         if (err)
> > > > >                 GT_TRACE(gt,
> > > > >                          "Wait for 0x%08x engines reset
> > > > > failed\n",
> > > > >                          hw_domain_mask);
> > > > >  
> > > > > +       /*
> > > > > +        * As we have observed that the engine state is still
> > > > > volatile
> > > > > +        * after GDRST is acked, impose a small delay to let
> > > > > everything settle.
> > > > > +        */
> > > > > +       udelay(50);
> > > > > +
> > > > >         return err;
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.38.1
> > > > > 
> > 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/gt: Reset twice
  2022-12-15 20:07           ` Rodrigo Vivi
@ 2022-12-22  9:28             ` Gwan-gyeong Mun
  -1 siblings, 0 replies; 27+ messages in thread
From: Gwan-gyeong Mun @ 2022-12-22  9:28 UTC (permalink / raw)
  To: Rodrigo Vivi, Andi Shyti; +Cc: intel-gfx, dri-devel, stable, chris



On 12/15/22 10:07 PM, Rodrigo Vivi wrote:
> On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
>> Hi Rodrigo,
>>
>> On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
>>> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
>>>> Hi Rodrigo,
>>>>
>>>> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
>>>>> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
>>>>>> From: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>
>>>>>> After applying an engine reset, on some platforms like
>>>>>> Jasperlake, we
>>>>>> occasionally detect that the engine state is not cleared until
>>>>>> shortly
>>>>>> after the resume. As we try to resume the engine with volatile
>>>>>> internal
>>>>>> state, the first request fails with a spurious CS event (it looks
>>>>>> like
>>>>>> it reports a lite-restore to the hung context, instead of the
>>>>>> expected
>>>>>> idle->active context switch).
>>>>>>
>>>>>> Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
>>>>>
>>>>> There's a typo in the signature email I'm afraid...
>>>>
>>>> oh yes, I forgot the 'C' :)
>>>
>>> you forgot?
>>> who signed it off?
>>
>> Chris, but as I was copy/pasting SoB's I might have
>> unintentionally removed the 'c'.
>>
>>>>> Other than that, have we checked the possibility of using the
>>>>> driver-initiated-flr bit
>>>>> instead of this second loop? That should be the right way to
>>>>> guarantee everything is
>>>>> cleared on gen11+...
>>>>
>>>> maybe I am misinterpreting it, but is FLR the same as resetting
>>>> hardware domains individually?
>>>
>>> No, it is bigger than that... almost the PCI FLR with some exceptions:
>>>
>>> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html
>>
>> yes, exactly... I would use FLR feedback if I was performing an
>> FLR reset. But here I'm not doing that, here I'm simply gating
>> off some power domains. It happens that those power domains turn
>> on and off engines making them reset.
> 
> is this issue only seeing when this reset is called from the
> sanitize functions at probe and resumes?
> Or from any kind of gt reset?
> 
> I don't remember seeing any reference link to the bug in the patch,
> hence I'm assuming this is happening in any kind of gt reset that
> ends up in this function.
> 
>>
>> FLR doesn't have anything to do here, also because if you want to
>> reset a single engine you go through this function, instead of
>> resetting the whole GPU with whatever is annexed.
> 
> yeap. That might be to extreme depending on the case. But all that
> I asked was if we were considering this option since this is the
> recommended way of reseting our engines nowadays.
> 
>>
>> This patch is not fixing the "reset" concept of i915, but it's
>> fixing a missing feedback that happens in one single platform
>> when trying to gate on/off a domain.
> 
> But it is changing the reset concept and timeouts for all the reset
> cases in all the platforms.
> 
>>
>> Maybe I am completely off track, but I don't see connection with
>> FLR here.
> 
> The point is that if a reset is needed, for any reason,
> the recommended way for Jasperlake, and any other newer platforms,
> is to use the FLR rather than the engine reset. But we are using
> the engine reset, and now twice, rather then attempt the recommended
> way.
> 
>>
>> (besides FLR might not be present in all the platforms)
> 
> This issue is also not present in all the platforms and you are still
> increasing the loops and delay for all the platforms.
> 
>>
>> Thanks a lot for your inputs,
> 
> have we looked to the Jasperlake workarounds to see if we are missing
> anything there that could help us in this case instead of this extreme
> approach of randomly increasing timeouts and attempts for all the
> platforms?
> 
Hi, Rodrigo
JSL_WA (Bspec: 33451) doesn't seem to have a WA similar to this issue.
(Please correct me if I missed it.)

>> Andi
>>
>>>> How am I supposed to use driver_initiated_flr() in this context?
>>>
>>> Some drivers are not even using this gt reset anymore and going
>>> directly with the driver-initiated FLR in case that guc reset failed.
>>>
>>> I believe we can still keep the gt reset in our case as we currently
>>> have, but instead of keep retrying it until it succeeds we probably
>>> should go to the next level and do the driver initiated FLR when the GT
>>> reset failed.
>>>
>>>>
>>>> Thanks,
>>>> Andi
>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_reset.c | 34
>>>>>> ++++++++++++++++++++++-----
>>>>>>   1 file changed, 28 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> index ffde89c5835a4..88dfc0c5316ff 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
>>>>>> intel_engine_mask_t engine_mask,
>>>>>>   static int gen6_hw_domain_reset(struct intel_gt *gt, u32
>>>>>> hw_domain_mask)
>>>>>>   {
>>>>>>          struct intel_uncore *uncore = gt->uncore;
>>>>>> +       int loops = 2;
>>>>>>          int err;
>>>>>>   
>>>>>>          /*
>>>>>> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
>>>>>> intel_gt *gt, u32 hw_domain_mask)
>>>>>>           * for fifo space for the write or forcewake the chip for
>>>>>>           * the read
>>>>>>           */
>>>>>> -       intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>> +       do {
>>>>>> +               intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>>   
>>>>>> -       /* Wait for the device to ack the reset requests */
>>>>>> -       err = __intel_wait_for_register_fw(uncore,
>>>>>> -                                          GEN6_GDRST,
>>>>>> hw_domain_mask, 0,
>>>>>> -                                          500, 0,
>>>>>> -                                          NULL);
>>>>>> +               /*
>>>>>> +                * Wait for the device to ack the reset requests.
>>>>>> +                *
>>>>>> +                * On some platforms, e.g. Jasperlake, we see see
>>>>>> that the
>>>>>> +                * engine register state is not cleared until
>>>>>> shortly after
>>>>>> +                * GDRST reports completion, causing a failure as
>>>>>> we try
>>>>>> +                * to immediately resume while the internal state
>>>>>> is still
>>>>>> +                * in flux. If we immediately repeat the reset,
>>>>>> the second
>>>>>> +                * reset appears to serialise with the first, and
>>>>>> since
>>>>>> +                * it is a no-op, the registers should retain
>>>>>> their reset
>>>>>> +                * value. However, there is still a concern that
>>>>>> upon
>>>>>> +                * leaving the second reset, the internal engine
>>>>>> state
>>>>>> +                * is still in flux and not ready for resuming.
>>>>>> +                */
>>>>>> +               err = __intel_wait_for_register_fw(uncore,
>>>>>> GEN6_GDRST,
>>>>>> +
>>>>>> hw_domain_mask, 0,
>>>>>> +                                                  2000, 0,
>>>>>> +                                                  NULL);
Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it 
tries to reset it once more. How was this value of 2000 calculated?
>>>>>> +       } while (err == 0 && --loops);
>>>>>>          if (err)
>>>>>>                  GT_TRACE(gt,
>>>>>>                           "Wait for 0x%08x engines reset
>>>>>> failed\n",
>>>>>>                           hw_domain_mask);
Did GT_TRACE report an error in a situation where the problem was reported?
>>>>>>   
>>>>>> +       /*
>>>>>> +        * As we have observed that the engine state is still
>>>>>> volatile
>>>>>> +        * after GDRST is acked, impose a small delay to let
>>>>>> everything settle.
>>>>>> +        */
>>>>>> +       udelay(50);
udelay(50) affects all platforms that can call gen6_hw_domain_reset(), 
is that intended?

Br,

G.G.
>>>>>> +
>>>>>>          return err;
>>>>>>   }
>>>>>>   
>>>>>> -- 
>>>>>> 2.38.1
>>>>>>
>>>
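
For reference, the control flow of the patched loop can be modelled in
user space. This is a minimal sketch, not driver code: fake_hw,
fake_reset_and_wait() and reset_twice_model() are hypothetical names,
and the register wait is replaced by a flag. It shows that the
"do { ... } while (err == 0 && --loops)" form issues the second reset
only when the first wait succeeds; a timeout leaves the loop after a
single attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the hardware. */
struct fake_hw {
	int writes;	/* number of reset requests issued so far */
	bool ack_ok;	/* whether the wait for the ack succeeds */
};

/* Models one GDRST write followed by the wait for the ack. */
static int fake_reset_and_wait(struct fake_hw *hw)
{
	assert(hw);

	hw->writes++;
	return hw->ack_ok ? 0 : -ETIMEDOUT;
}

/* Same loop shape as the patched gen6_hw_domain_reset(). */
static int reset_twice_model(struct fake_hw *hw)
{
	int loops = 2;
	int err;

	do {
		err = fake_reset_and_wait(hw);
	} while (err == 0 && --loops);

	return err;
}
```

In this model a successful ack leads to two writes in total, while a
timed-out ack leads to one write and an error return.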

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/gt: Reset twice
  2022-12-22  9:28             ` Gwan-gyeong Mun
  (?)
@ 2022-12-22 13:47               ` Andi Shyti
  -1 siblings, 0 replies; 27+ messages in thread
From: Andi Shyti @ 2022-12-22 13:47 UTC (permalink / raw)
  To: Gwan-gyeong Mun
  Cc: Rodrigo Vivi, Andi Shyti, intel-gfx, dri-devel, stable, chris

Hi GG,

> > > > > > >   drivers/gpu/drm/i915/gt/intel_reset.c | 34
> > > > > > > ++++++++++++++++++++++-----
> > > > > > >   1 file changed, 28 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > > > b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > > > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
> > > > > > > intel_engine_mask_t engine_mask,
> > > > > > >   static int gen6_hw_domain_reset(struct intel_gt *gt, u32
> > > > > > > hw_domain_mask)
> > > > > > >   {
> > > > > > >          struct intel_uncore *uncore = gt->uncore;
> > > > > > > +       int loops = 2;
> > > > > > >          int err;
> > > > > > >          /*
> > > > > > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
> > > > > > > intel_gt *gt, u32 hw_domain_mask)
> > > > > > >           * for fifo space for the write or forcewake the chip for
> > > > > > >           * the read
> > > > > > >           */
> > > > > > > -       intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > > > > hw_domain_mask);
> > > > > > > +       do {
> > > > > > > +               intel_uncore_write_fw(uncore, GEN6_GDRST,
> > > > > > > hw_domain_mask);
> > > > > > > -       /* Wait for the device to ack the reset requests */
> > > > > > > -       err = __intel_wait_for_register_fw(uncore,
> > > > > > > -                                          GEN6_GDRST,
> > > > > > > hw_domain_mask, 0,
> > > > > > > -                                          500, 0,
> > > > > > > -                                          NULL);
> > > > > > > +               /*
> > > > > > > +                * Wait for the device to ack the reset requests.
> > > > > > > +                *
> > > > > > > +                * On some platforms, e.g. Jasperlake, we see see
> > > > > > > that the
> > > > > > > +                * engine register state is not cleared until
> > > > > > > shortly after
> > > > > > > +                * GDRST reports completion, causing a failure as
> > > > > > > we try
> > > > > > > +                * to immediately resume while the internal state
> > > > > > > is still
> > > > > > > +                * in flux. If we immediately repeat the reset,
> > > > > > > the second
> > > > > > > +                * reset appears to serialise with the first, and
> > > > > > > since
> > > > > > > +                * it is a no-op, the registers should retain
> > > > > > > their reset
> > > > > > > +                * value. However, there is still a concern that
> > > > > > > upon
> > > > > > > +                * leaving the second reset, the internal engine
> > > > > > > state
> > > > > > > +                * is still in flux and not ready for resuming.
> > > > > > > +                */
> > > > > > > +               err = __intel_wait_for_register_fw(uncore,
> > > > > > > GEN6_GDRST,
> > > > > > > +
> > > > > > > hw_domain_mask, 0,
> > > > > > > +                                                  2000, 0,
> > > > > > > +                                                  NULL);

> Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it
> tries to reset it once more. How was this value of 2000 calculated?

No real reason, it's just an empirical choice to make the call a
bit more robust and less sensitive to a delayed ack.
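
To make the timeout discussion concrete, here is a minimal userspace
sketch of a bounded register poll. It is not the i915 implementation:
fake_reg, fake_reg_read() and poll_for_register() are hypothetical
names, and a poll count stands in for the microsecond budget (500 vs.
2000) that the real wait helper takes. It only illustrates why a larger
budget tolerates a later ack.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a hardware register: the reset bits stay
 * set until the reads_until_ack-th read, which returns 0 as if the
 * device had acked. A value of 0 means the device never acks.
 */
struct fake_reg {
	uint32_t value;
	unsigned int reads_until_ack;
};

static uint32_t fake_reg_read(struct fake_reg *reg)
{
	if (reg->reads_until_ack > 0 && --reg->reads_until_ack == 0)
		reg->value = 0;
	return reg->value;
}

/*
 * Poll until (reg & mask) == expected, giving up after max_polls reads.
 * max_polls plays the role of the timeout in this sketch (assumption).
 * Returns 0 on success, -1 on timeout.
 */
static int poll_for_register(struct fake_reg *reg, uint32_t mask,
			     uint32_t expected, unsigned int max_polls)
{
	assert(max_polls > 0);

	while (max_polls--) {
		if ((fake_reg_read(reg) & mask) == expected)
			return 0;
	}
	return -1;
}
```

With a fake register that acks on its 1000th read, a budget of 500
polls times out while a budget of 2000 succeeds.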

> > > > > > > +       } while (err == 0 && --loops);
> > > > > > >          if (err)
> > > > > > >                  GT_TRACE(gt,
> > > > > > >                           "Wait for 0x%08x engines reset
> > > > > > > failed\n",
> > > > > > >                           hw_domain_mask);

> Did GT_TRACE report an error in a situation where the problem was reported?

I guess so, in Jasperlake.
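
The control flow of the quoted do/while can be sketched in isolation.
Read literally, the condition "err == 0 && --loops" repeats the reset
only when the wait succeeded and stops at the first timeout, so the
trace message is reached only when a wait fails. The sketch below
replays that condition against canned wait results. run_reset_loop()
and struct reset_outcome are hypothetical names for illustration, not
driver code.

```c
#include <assert.h>

/* Hypothetical result record for this sketch. */
struct reset_outcome {
	int err;          /* 0 on success, negative on timeout */
	int reset_writes; /* how many times the reset was requested */
};

/*
 * wait_results[i] is the value the i-th ack wait would return.
 * The loop condition is copied from the quoted patch.
 */
static struct reset_outcome run_reset_loop(const int *wait_results, int loops)
{
	struct reset_outcome out = { 0, 0 };

	assert(loops > 0);

	do {
		/* Stands in for the GDRST write. */
		out.reset_writes++;
		/* Stands in for waiting on the ack. */
		out.err = wait_results[out.reset_writes - 1];
	} while (out.err == 0 && --loops);

	return out;
}
```

With loops = 2, two successful waits give two reset writes, whereas a
timeout on the first wait ends the loop after a single write with the
error preserved.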

> > > > > > > +       /*
> > > > > > > +        * As we have observed that the engine state is still
> > > > > > > volatile
> > > > > > > +        * after GDRST is acked, impose a small delay to let
> > > > > > > everything settle.
> > > > > > > +        */
> > > > > > > +       udelay(50);

> udelay(50) affects all platforms that can call gen6_hw_domain_reset(), is
> that intended?

Yes, that's intended, as apparently we need to give the engines a
bit more time to recover from the reset. We are in atomic context
here, so we cannot sleep and have to busy-wait, hence udelay().
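
For readers less familiar with the distinction, the sketch below shows
what a busy-wait delay looks like in plain userspace C: it spins on a
monotonic clock and never yields, which is the property needed where
sleeping is not allowed. busy_wait_us() and now_ns() are hypothetical
helpers; the kernel's udelay() is implemented differently, so treat
this only as an analogy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin until at least usecs microseconds have elapsed. The caller
 * keeps the CPU for the whole duration; nothing here sleeps or yields.
 * Returns the elapsed time in nanoseconds.
 */
static uint64_t busy_wait_us(unsigned int usecs)
{
	const uint64_t start = now_ns();
	const uint64_t deadline = start + (uint64_t)usecs * 1000ull;
	uint64_t now;

	do {
		now = now_ns();
	} while (now < deadline);

	return now - start;
}
```

Calling busy_wait_us(50) returns only after at least 50 microseconds
have passed, at the cost of occupying the CPU for that whole time.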

Thank you,
Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/gt: Reset twice
  2022-12-22 13:47               ` Andi Shyti
@ 2022-12-23  6:24                 ` Gwan-gyeong Mun
  -1 siblings, 0 replies; 27+ messages in thread
From: Gwan-gyeong Mun @ 2022-12-23  6:24 UTC (permalink / raw)
  To: Andi Shyti; +Cc: Rodrigo Vivi, intel-gfx, dri-devel, stable, chris



On 12/22/22 3:47 PM, Andi Shyti wrote:
> Hi GG,
> 
>>>>>>>>    drivers/gpu/drm/i915/gt/intel_reset.c | 34
>>>>>>>> ++++++++++++++++++++++-----
>>>>>>>>    1 file changed, 28 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>>>> index ffde89c5835a4..88dfc0c5316ff 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>>>> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
>>>>>>>> intel_engine_mask_t engine_mask,
>>>>>>>>    static int gen6_hw_domain_reset(struct intel_gt *gt, u32
>>>>>>>> hw_domain_mask)
>>>>>>>>    {
>>>>>>>>           struct intel_uncore *uncore = gt->uncore;
>>>>>>>> +       int loops = 2;
>>>>>>>>           int err;
>>>>>>>>           /*
>>>>>>>> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
>>>>>>>> intel_gt *gt, u32 hw_domain_mask)
>>>>>>>>            * for fifo space for the write or forcewake the chip for
>>>>>>>>            * the read
>>>>>>>>            */
>>>>>>>> -       intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>>>> hw_domain_mask);
>>>>>>>> +       do {
>>>>>>>> +               intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>>>> hw_domain_mask);
>>>>>>>> -       /* Wait for the device to ack the reset requests */
>>>>>>>> -       err = __intel_wait_for_register_fw(uncore,
>>>>>>>> -                                          GEN6_GDRST,
>>>>>>>> hw_domain_mask, 0,
>>>>>>>> -                                          500, 0,
>>>>>>>> -                                          NULL);
>>>>>>>> +               /*
>>>>>>>> +                * Wait for the device to ack the reset requests.
>>>>>>>> +                *
>>>>>>>> +                * On some platforms, e.g. Jasperlake, we see
>>>>>>>> that the
>>>>>>>> +                * engine register state is not cleared until
>>>>>>>> shortly after
>>>>>>>> +                * GDRST reports completion, causing a failure as
>>>>>>>> we try
>>>>>>>> +                * to immediately resume while the internal state
>>>>>>>> is still
>>>>>>>> +                * in flux. If we immediately repeat the reset,
>>>>>>>> the second
>>>>>>>> +                * reset appears to serialise with the first, and
>>>>>>>> since
>>>>>>>> +                * it is a no-op, the registers should retain
>>>>>>>> their reset
>>>>>>>> +                * value. However, there is still a concern that
>>>>>>>> upon
>>>>>>>> +                * leaving the second reset, the internal engine
>>>>>>>> state
>>>>>>>> +                * is still in flux and not ready for resuming.
>>>>>>>> +                */
>>>>>>>> +               err = __intel_wait_for_register_fw(uncore,
>>>>>>>> GEN6_GDRST,
>>>>>>>> +
>>>>>>>> hw_domain_mask, 0,
>>>>>>>> +                                                  2000, 0,
>>>>>>>> +                                                  NULL);
> 
>> Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it
>> tries to reset it once more. How was this value of 2000 calculated?
> 
> No real reason, it's just an empirical choice to make the call a
> bit more robust and less sensitive to a delayed ack.
> 
>>>>>>>> +       } while (err == 0 && --loops);
>>>>>>>>           if (err)
>>>>>>>>                   GT_TRACE(gt,
>>>>>>>>                            "Wait for 0x%08x engines reset
>>>>>>>> failed\n",
>>>>>>>>                            hw_domain_mask);
> 
>> Did GT_TRACE report an error in a situation where the problem was reported?
> 
> I guess so, in Jasperlake.
> 
>>>>>>>> +       /*
>>>>>>>> +        * As we have observed that the engine state is still
>>>>>>>> volatile
>>>>>>>> +        * after GDRST is acked, impose a small delay to let
>>>>>>>> everything settle.
>>>>>>>> +        */
>>>>>>>> +       udelay(50);
> 
>> udelay(50) affects all platforms that can call gen6_hw_domain_reset(), is
>> that intended?
> 
> Yes, that's intended, as apparently we need to give the engines a
> bit more time to recover from the reset. We are in atomic context
> here, so we cannot sleep and have to busy-wait, hence udelay().
> 
Hi Andi,

In scenarios/platforms where GSC firmware is not used, reset through FLR
is not possible, so this reset function is used.
Therefore, if this problem cannot be avoided by other WAs and this method
is the only one, we might have to apply this patch as a temporary fix.
But we should also ask the person in charge of this HW platform
(Jasperlake, or all of GEN11?) to analyze the problem, and you need to
make sure to get proper WA guidance as a next step.

Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>

Br,
G.G.
> Thank you,
> Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-12-23  6:26 UTC | newest]

Thread overview: 27+ messages
2022-12-12 16:13 [PATCH] drm/i915/gt: Reset twice Andi Shyti
2022-12-12 16:13 ` [Intel-gfx] " Andi Shyti
2022-12-12 16:13 ` Andi Shyti
2022-12-12 16:55 ` Rodrigo Vivi
2022-12-12 16:55   ` [Intel-gfx] " Rodrigo Vivi
2022-12-12 16:55   ` Rodrigo Vivi
2022-12-12 23:08   ` Andi Shyti
2022-12-12 23:08     ` [Intel-gfx] " Andi Shyti
2022-12-12 23:08     ` Andi Shyti
2022-12-13 13:18     ` Vivi, Rodrigo
2022-12-13 13:18       ` [Intel-gfx] " Vivi, Rodrigo
2022-12-13 13:18       ` Vivi, Rodrigo
2022-12-14 22:37       ` Andi Shyti
2022-12-14 22:37         ` [Intel-gfx] " Andi Shyti
2022-12-14 22:37         ` Andi Shyti
2022-12-15 20:07         ` [Intel-gfx] " Rodrigo Vivi
2022-12-15 20:07           ` Rodrigo Vivi
2022-12-22  9:28           ` Gwan-gyeong Mun
2022-12-22  9:28             ` Gwan-gyeong Mun
2022-12-22 13:47             ` Andi Shyti
2022-12-22 13:47               ` Andi Shyti
2022-12-22 13:47               ` Andi Shyti
2022-12-23  6:24               ` Gwan-gyeong Mun
2022-12-23  6:24                 ` Gwan-gyeong Mun
2022-12-12 18:34 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2022-12-12 18:46 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-12-13 10:11 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
