All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Disable per-engine reset for Broxton
@ 2017-07-04 16:09 Chris Wilson
  2017-07-06  1:24 ` Michel Thierry
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2017-07-04 16:09 UTC (permalink / raw)
  To: intel-gfx

Triggering a GPU reset for one engine affects another, notably
corrupting the context status buffer (CSB) effectively losing track of
inflight requests.

Adding a few printks:
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index ad41836fa5e5..a969456bc0fa 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1953,6 +1953,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
                goto out;
        }

+       pr_err("Resetting %s\n", engine->name);
        ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
        if (ret) {
                /* If we fail here, we expect to fallback to a global reset */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 716e5c9ea222..a72bc35d0870 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -355,6 +355,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
                                execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
                        port_set(&port[n], port_pack(rq, count));
                        desc = execlists_update_context(rq);
+                       pr_err("%s: in (rq=%x) ctx=%d\n", engine->name, rq->global_seqno, upper_32_bits(desc));
                        GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
                } else {
                        GEM_BUG_ON(!n);
@@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned long data)
                        if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
                                continue;

+                       pr_err("%s: out CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
+                                       engine->name,
+                                       readl(csb_mmio),
+                                       head, tail,
+                                       readl(buf+2*head+1),
+                                       port->context_id);
+
                        /* Check the context/desc id for this event matches */
-                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
-                                        port->context_id);
+                       if (readl(buf + 2 * head + 1) != port->context_id) {
+                               pr_err("%s: BUG CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
+                                               engine->name,
+                                               readl(csb_mmio),
+                                               head, tail,
+                                               readl(buf+2*head+1),
+                                               port->context_id);
+                               BUG();
+                       }

                        rq = port_unpack(port, &count);
                        GEM_BUG_ON(count == 0);

Results in:

[ 6423.006602] Resetting rcs0
[ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
[ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
[ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
[ 6423.009619] Resetting bcs0
[ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3

Note that this bug may be affect all machines and not just Broxton,
Broxton is just the first machine on which I have confirmed this bug.

Fixes: 142bc7d99bcf ("drm/i915: Modify error handler for per engine hang recovery")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 04aaf553e3fa..dcf4481f24ee 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -398,6 +398,7 @@ static const struct intel_device_info intel_broxton_info = {
 	GEN9_LP_FEATURES,
 	.platform = INTEL_BROXTON,
 	.ddb_size = 512,
+	.has_reset_engine = false,
 };
 
 static const struct intel_device_info intel_geminilake_info = {
-- 
2.13.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Disable per-engine reset for Broxton
  2017-07-04 16:09 [PATCH] drm/i915: Disable per-engine reset for Broxton Chris Wilson
@ 2017-07-06  1:24 ` Michel Thierry
  2017-07-06  7:11   ` Chris Wilson
  0 siblings, 1 reply; 5+ messages in thread
From: Michel Thierry @ 2017-07-06  1:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 04/07/17 09:09, Chris Wilson wrote:
> Triggering a GPU reset for one engine affects another, notably
> corrupting the context status buffer (CSB) effectively losing track of
> inflight requests.
>
> Adding a few printks:
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index ad41836fa5e5..a969456bc0fa 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>                 goto out;
>         }
>
> +       pr_err("Resetting %s\n", engine->name);
>         ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
>         if (ret) {
>                 /* If we fail here, we expect to fallback to a global reset */
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 716e5c9ea222..a72bc35d0870 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>                                 execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>                         port_set(&port[n], port_pack(rq, count));
>                         desc = execlists_update_context(rq);
> +                       pr_err("%s: in (rq=%x) ctx=%d\n", engine->name, rq->global_seqno, upper_32_bits(desc));
>                         GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
>                 } else {
>                         GEM_BUG_ON(!n);
> @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned long data)
>                         if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
>                                 continue;
>
> +                       pr_err("%s: out CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> +                                       engine->name,
> +                                       readl(csb_mmio),
> +                                       head, tail,
> +                                       readl(buf+2*head+1),
> +                                       port->context_id);
> +
>                         /* Check the context/desc id for this event matches */
> -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
> -                                        port->context_id);
> +                       if (readl(buf + 2 * head + 1) != port->context_id) {
> +                               pr_err("%s: BUG CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> +                                               engine->name,
> +                                               readl(csb_mmio),
> +                                               head, tail,
> +                                               readl(buf+2*head+1),
> +                                               port->context_id);
> +                               BUG();
> +                       }
>
>                         rq = port_unpack(port, &count);
>                         GEM_BUG_ON(count == 0);
>
> Results in:
>
> [ 6423.006602] Resetting rcs0
> [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
> [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
> [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
> [ 6423.009619] Resetting bcs0
> [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
>

It took me a while, but  I was able to replicate this (with your 
igt_reset_active_engines) a couple of times. I also captured the value 
of the CSB events at that point and it looked like this.

[   55.134393] Resetting rcs0
[   55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
[   55.134755]  bcs0: HWSP[16-17] Execlist CSB[0]:   0x00000018 _ 0x0000000a
[   55.134759]  bcs0: HWSP[18-19] Execlist CSB[1]:   0x00000012 _ 0x0000000a
[   55.134762]  bcs0: HWSP[20-21] Execlist CSB[2]:   0x00008002 _ 0x00000004
[   55.134765]  bcs0: HWSP[22-23] Execlist CSB[3]:   0x00000014 _ 0x00000004
[   55.134767]  bcs0: HWSP[24-25] Execlist CSB[4]:   0x00000018 _ 0x0000000a
[   55.134770]  bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _ 0x00000000

The problem is ctx 10 finished in CSB[0] (ctx_complete & 
active_to_idle), but then somehow CSB[1] has the same ctx 10 with 
'preempted' & ctx_complete.

To make things worse, in CSB[2], the hw claims to have lite-restored ctx 4.

So it seems, we could ignore events like CSB[1], i.e. preempted && 
ctx_complete && !lite_restore.


> Note that this bug may be affect all machines and not just Broxton,
> Broxton is just the first machine on which I have confirmed this bug.
>
> Fixes: 142bc7d99bcf ("drm/i915: Modify error handler for per engine hang recovery")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_pci.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 04aaf553e3fa..dcf4481f24ee 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -398,6 +398,7 @@ static const struct intel_device_info intel_broxton_info = {
>  	GEN9_LP_FEATURES,
>  	.platform = INTEL_BROXTON,
>  	.ddb_size = 512,
> +	.has_reset_engine = false,
>  };
>
>  static const struct intel_device_info intel_geminilake_info = {
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Disable per-engine reset for Broxton
  2017-07-06  1:24 ` Michel Thierry
@ 2017-07-06  7:11   ` Chris Wilson
  2017-07-06 17:02     ` Michel Thierry
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2017-07-06  7:11 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-07-06 02:24:26)
> On 04/07/17 09:09, Chris Wilson wrote:
> > Triggering a GPU reset for one engine affects another, notably
> > corrupting the context status buffer (CSB) effectively losing track of
> > inflight requests.
> >
> > Adding a few printks:
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index ad41836fa5e5..a969456bc0fa 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
> >                 goto out;
> >         }
> >
> > +       pr_err("Resetting %s\n", engine->name);
> >         ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
> >         if (ret) {
> >                 /* If we fail here, we expect to fallback to a global reset */
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 716e5c9ea222..a72bc35d0870 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
> >                                 execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
> >                         port_set(&port[n], port_pack(rq, count));
> >                         desc = execlists_update_context(rq);
> > +                       pr_err("%s: in (rq=%x) ctx=%d\n", engine->name, rq->global_seqno, upper_32_bits(desc));
> >                         GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
> >                 } else {
> >                         GEM_BUG_ON(!n);
> > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned long data)
> >                         if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
> >                                 continue;
> >
> > +                       pr_err("%s: out CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> > +                                       engine->name,
> > +                                       readl(csb_mmio),
> > +                                       head, tail,
> > +                                       readl(buf+2*head+1),
> > +                                       port->context_id);
> > +
> >                         /* Check the context/desc id for this event matches */
> > -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
> > -                                        port->context_id);
> > +                       if (readl(buf + 2 * head + 1) != port->context_id) {
> > +                               pr_err("%s: BUG CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> > +                                               engine->name,
> > +                                               readl(csb_mmio),
> > +                                               head, tail,
> > +                                               readl(buf+2*head+1),
> > +                                               port->context_id);
> > +                               BUG();
> > +                       }
> >
> >                         rq = port_unpack(port, &count);
> >                         GEM_BUG_ON(count == 0);
> >
> > Results in:
> >
> > [ 6423.006602] Resetting rcs0
> > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
> > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
> > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
> > [ 6423.009619] Resetting bcs0
> > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
> >
> 
> It took me a while, but  I was able to replicate this (with your 
> igt_reset_active_engines) a couple of times. I also captured the value 
> of the CSB events at that point and it looked like this.

Ah, that's a separate issue that definitely isn't limited to bxt. In the
bug on my machine the CSB is distinctly zeroed.
 
> [   55.134393] Resetting rcs0
> [   55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
> [   55.134755]  bcs0: HWSP[16-17] Execlist CSB[0]:   0x00000018 _ 0x0000000a
> [   55.134759]  bcs0: HWSP[18-19] Execlist CSB[1]:   0x00000012 _ 0x0000000a
> [   55.134762]  bcs0: HWSP[20-21] Execlist CSB[2]:   0x00008002 _ 0x00000004
> [   55.134765]  bcs0: HWSP[22-23] Execlist CSB[3]:   0x00000014 _ 0x00000004
> [   55.134767]  bcs0: HWSP[24-25] Execlist CSB[4]:   0x00000018 _ 0x0000000a
> [   55.134770]  bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _ 0x00000000
> 
> The problem is ctx 10 finished in CSB[0] (ctx_complete & 
> active_to_idle), but then somehow CSB[1] has the same ctx 10 with 
> 'preempted' & ctx_complete.
> 
> To make things worse, in CSB[2], the hw claims to have lite-restored ctx 4.
> 
> So it seems, we could ignore events like CSB[1], i.e. preempted && 
> ctx_complete && !lite_restore.

I think this bug is fixed in the series I sent last week, this is just
the race in reset vs the tasklet.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Disable per-engine reset for Broxton
  2017-07-06  7:11   ` Chris Wilson
@ 2017-07-06 17:02     ` Michel Thierry
  2017-07-06 19:55       ` Chris Wilson
  0 siblings, 1 reply; 5+ messages in thread
From: Michel Thierry @ 2017-07-06 17:02 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Thu, Jul 6, 2017 at 12:11 AM, Chris Wilson 
<chris@chris-wilson.co.uk> wrote:
> Quoting Michel Thierry (2017-07-06 02:24:26)
>>  On 04/07/17 09:09, Chris Wilson wrote:
>>  > Triggering a GPU reset for one engine affects another, notably
>>  > corrupting the context status buffer (CSB) effectively losing 
>> track of
>>  > inflight requests.
>>  >
>>  > Adding a few printks:
>>  > diff --git a/drivers/gpu/drm/i915/i915_drv.c 
>> b/drivers/gpu/drm/i915/i915_drv.c
>>  > index ad41836fa5e5..a969456bc0fa 100644
>>  > --- a/drivers/gpu/drm/i915/i915_drv.c
>>  > +++ b/drivers/gpu/drm/i915/i915_drv.c
>>  > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct 
>> intel_engine_cs *engine)
>>  >                 goto out;
>>  >         }
>>  >
>>  > +       pr_err("Resetting %s\n", engine->name);
>>  >         ret = intel_gpu_reset(engine->i915, 
>> intel_engine_flag(engine));
>>  >         if (ret) {
>>  >                 /* If we fail here, we expect to fallback to a 
>> global reset */
>>  > diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>>  > index 716e5c9ea222..a72bc35d0870 100644
>>  > --- a/drivers/gpu/drm/i915/intel_lrc.c
>>  > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>  > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct 
>> intel_engine_cs *engine)
>>  >                                 
>> execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>>  >                         port_set(&port[n], port_pack(rq, count));
>>  >                         desc = execlists_update_context(rq);
>>  > +                       pr_err("%s: in (rq=%x) ctx=%d\n", 
>> engine->name, rq->global_seqno, upper_32_bits(desc));
>>  >                         GEM_DEBUG_EXEC(port[n].context_id = 
>> upper_32_bits(desc));
>>  >                 } else {
>>  >                         GEM_BUG_ON(!n);
>>  > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned 
>> long data)
>>  >                         if (!(status & 
>> GEN8_CTX_STATUS_COMPLETED_MASK))
>>  >                                 continue;
>>  >
>>  > +                       pr_err("%s: out CSB (%x head=%d, 
>> tail=%d), ctx=%d, rq=%d\n",
>>  > +                                       engine->name,
>>  > +                                       readl(csb_mmio),
>>  > +                                       head, tail,
>>  > +                                       readl(buf+2*head+1),
>>  > +                                       port->context_id);
>>  > +
>>  >                         /* Check the context/desc id for this 
>> event matches */
>>  > -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 
>> 1) !=
>>  > -                                        port->context_id);
>>  > +                       if (readl(buf + 2 * head + 1) != 
>> port->context_id) {
>>  > +                               pr_err("%s: BUG CSB (%x head=%d, 
>> tail=%d), ctx=%d, rq=%d\n",
>>  > +                                               engine->name,
>>  > +                                               readl(csb_mmio),
>>  > +                                               head, tail,
>>  > +                                               
>> readl(buf+2*head+1),
>>  > +                                               port->context_id);
>>  > +                               BUG();
>>  > +                       }
>>  >
>>  >                         rq = port_unpack(port, &count);
>>  >                         GEM_BUG_ON(count == 0);
>>  >
>>  > Results in:
>>  >
>>  > [ 6423.006602] Resetting rcs0
>>  > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
>>  > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
>>  > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
>>  > [ 6423.009619] Resetting bcs0
>>  > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
>>  >
>>  
>>  It took me a while, but  I was able to replicate this (with your 
>>  igt_reset_active_engines) a couple of times. I also captured the 
>> value 
>>  of the CSB events at that point and it looked like this.
> 
> Ah, that's a separate issue that definitely isn't limited to bxt. In 
> the
> bug on my machine the CSB is distinctly zeroed.

Oh, if you saw them being zeroed, then there's not much I can argue (do 
you remember if the CSB write pointer was reset to '7' too?).

Hopefully is only limited to bxt, so sadly 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

> 
>  
>>  [   55.134393] Resetting rcs0
>>  [   55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
>>  [   55.134755]  bcs0: HWSP[16-17] Execlist CSB[0]:   0x00000018 _ 
>> 0x0000000a
>>  [   55.134759]  bcs0: HWSP[18-19] Execlist CSB[1]:   0x00000012 _ 
>> 0x0000000a
>>  [   55.134762]  bcs0: HWSP[20-21] Execlist CSB[2]:   0x00008002 _ 
>> 0x00000004
>>  [   55.134765]  bcs0: HWSP[22-23] Execlist CSB[3]:   0x00000014 _ 
>> 0x00000004
>>  [   55.134767]  bcs0: HWSP[24-25] Execlist CSB[4]:   0x00000018 _ 
>> 0x0000000a
>>  [   55.134770]  bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _ 
>> 0x00000000
>>  
>>  The problem is ctx 10 finished in CSB[0] (ctx_complete & 
>>  active_to_idle), but then somehow CSB[1] has the same ctx 10 with 
>>  'preempted' & ctx_complete.
>>  
>>  To make things worse, in CSB[2], the hw claims to have 
>> lite-restored ctx 4.
>>  
>>  So it seems, we could ignore events like CSB[1], i.e. preempted && 
>>  ctx_complete && !lite_restore.
> 
> I think this bug is fixed in the series I sent last week, this is just
> the race in reset vs the tasklet.

I'll try with those patches.

Thanks,

> 
> -Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Disable per-engine reset for Broxton
  2017-07-06 17:02     ` Michel Thierry
@ 2017-07-06 19:55       ` Chris Wilson
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Wilson @ 2017-07-06 19:55 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

Quoting Michel Thierry (2017-07-06 18:02:13)
> On Thu, Jul 6, 2017 at 12:11 AM, Chris Wilson 
> <chris@chris-wilson.co.uk> wrote:
> > Quoting Michel Thierry (2017-07-06 02:24:26)
> >>  On 04/07/17 09:09, Chris Wilson wrote:
> >>  > Triggering a GPU reset for one engine affects another, notably
> >>  > corrupting the context status buffer (CSB) effectively losing 
> >> track of
> >>  > inflight requests.
> >>  >
> >>  > Adding a few printks:
> >>  > diff --git a/drivers/gpu/drm/i915/i915_drv.c 
> >> b/drivers/gpu/drm/i915/i915_drv.c
> >>  > index ad41836fa5e5..a969456bc0fa 100644
> >>  > --- a/drivers/gpu/drm/i915/i915_drv.c
> >>  > +++ b/drivers/gpu/drm/i915/i915_drv.c
> >>  > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct 
> >> intel_engine_cs *engine)
> >>  >                 goto out;
> >>  >         }
> >>  >
> >>  > +       pr_err("Resetting %s\n", engine->name);
> >>  >         ret = intel_gpu_reset(engine->i915, 
> >> intel_engine_flag(engine));
> >>  >         if (ret) {
> >>  >                 /* If we fail here, we expect to fallback to a 
> >> global reset */
> >>  > diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
> >> b/drivers/gpu/drm/i915/intel_lrc.c
> >>  > index 716e5c9ea222..a72bc35d0870 100644
> >>  > --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>  > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>  > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct 
> >> intel_engine_cs *engine)
> >>  >                                 
> >> execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
> >>  >                         port_set(&port[n], port_pack(rq, count));
> >>  >                         desc = execlists_update_context(rq);
> >>  > +                       pr_err("%s: in (rq=%x) ctx=%d\n", 
> >> engine->name, rq->global_seqno, upper_32_bits(desc));
> >>  >                         GEM_DEBUG_EXEC(port[n].context_id = 
> >> upper_32_bits(desc));
> >>  >                 } else {
> >>  >                         GEM_BUG_ON(!n);
> >>  > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned 
> >> long data)
> >>  >                         if (!(status & 
> >> GEN8_CTX_STATUS_COMPLETED_MASK))
> >>  >                                 continue;
> >>  >
> >>  > +                       pr_err("%s: out CSB (%x head=%d, 
> >> tail=%d), ctx=%d, rq=%d\n",
> >>  > +                                       engine->name,
> >>  > +                                       readl(csb_mmio),
> >>  > +                                       head, tail,
> >>  > +                                       readl(buf+2*head+1),
> >>  > +                                       port->context_id);
> >>  > +
> >>  >                         /* Check the context/desc id for this 
> >> event matches */
> >>  > -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 
> >> 1) !=
> >>  > -                                        port->context_id);
> >>  > +                       if (readl(buf + 2 * head + 1) != 
> >> port->context_id) {
> >>  > +                               pr_err("%s: BUG CSB (%x head=%d, 
> >> tail=%d), ctx=%d, rq=%d\n",
> >>  > +                                               engine->name,
> >>  > +                                               readl(csb_mmio),
> >>  > +                                               head, tail,
> >>  > +                                               
> >> readl(buf+2*head+1),
> >>  > +                                               port->context_id);
> >>  > +                               BUG();
> >>  > +                       }
> >>  >
> >>  >                         rq = port_unpack(port, &count);
> >>  >                         GEM_BUG_ON(count == 0);
> >>  >
> >>  > Results in:
> >>  >
> >>  > [ 6423.006602] Resetting rcs0
> >>  > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
> >>  > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
> >>  > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
> >>  > [ 6423.009619] Resetting bcs0
> >>  > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
> >>  >
> >>  
> >>  It took me a while, but  I was able to replicate this (with your 
> >>  igt_reset_active_engines) a couple of times. I also captured the 
> >> value 
> >>  of the CSB events at that point and it looked like this.
> > 
> > Ah, that's a separate issue that definitely isn't limited to bxt. In 
> > the
> > bug on my machine the CSB is distinctly zeroed.
> 
> Oh, if you saw them being zeroed, then there's not much I can argue (do 
> you remember if the CSB write pointer was reset to '7' too?).

I didn't look at the pointer actually on reset (tried manually writing
to it, just in case it stuck across the power context reloads) then
realised that the fault was always due to the CSB reads returning 0 and
then pin-pointed it to the concurrent reset on the other engine.
 
> Hopefully is only limited to bxt, so sadly 

Broadwell has survived over a day so far (so back to the earlier
stability, -tip is decidedly unstable for resets atm), need to get around
to checking the other platforms carefully.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-07-06 19:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-04 16:09 [PATCH] drm/i915: Disable per-engine reset for Broxton Chris Wilson
2017-07-06  1:24 ` Michel Thierry
2017-07-06  7:11   ` Chris Wilson
2017-07-06 17:02     ` Michel Thierry
2017-07-06 19:55       ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.