* [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request
@ 2017-07-20 13:24 Chris Wilson
  2017-07-20 13:39 ` Chris Wilson
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Chris Wilson @ 2017-07-20 13:24 UTC
  To: intel-gfx; +Cc: Daniel Vetter

If the request has been completed before the reset took effect, we don't
need to mark it up as being a victim. Touching fence->error after the
fence has been signaled is detected by dma_fence_set_error() and
triggers a BUG:

[  231.743133] kernel BUG at ./include/linux/dma-fence.h:434!
[  231.743156] invalid opcode: 0000 [#1] SMP KASAN
[  231.743172] Modules linked in: i915 drm_kms_helper drm iptable_nat nf_nat_ipv4 nf_nat x86_pkg_temp_thermal iosf_mbi i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb font fbdev [last unloaded: drm]
[  231.743221] CPU: 2 PID: 20 Comm: kworker/2:0 Tainted: G     U          4.13.0-rc1+ #52
[  231.743236] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
[  231.743363] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  231.743382] task: ffff8801f42e9780 task.stack: ffff8801f42f8000
[  231.743489] RIP: 0010:i915_gem_reset_engine+0x45a/0x460 [i915]
[  231.743505] RSP: 0018:ffff8801f42ff770 EFLAGS: 00010202
[  231.743521] RAX: 0000000000000007 RBX: ffff8801bf6b1880 RCX: ffffffffa02881a6
[  231.743537] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8801bf6b18c8
[  231.743551] RBP: ffff8801f42ff7c8 R08: 0000000000000001 R09: 0000000000000000
[  231.743566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801edb02d00
[  231.743581] R13: ffff8801e19d4200 R14: 000000000000001d R15: ffff8801ce2a4000
[  231.743599] FS:  0000000000000000(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
[  231.743614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.743629] CR2: 00007f0ebd1add10 CR3: 0000000002621000 CR4: 00000000000406e0
[  231.743643] Call Trace:
[  231.743752]  i915_gem_reset+0x6c/0x150 [i915]
[  231.743853]  i915_reset+0x175/0x210 [i915]
[  231.743958]  i915_reset_device+0x33b/0x350 [i915]
[  231.744061]  ? valleyview_pipestat_irq_handler+0xe0/0xe0 [i915]
[  231.744081]  ? trace_hardirqs_off_caller+0x70/0x110
[  231.744102]  ? _raw_spin_unlock_irqrestore+0x46/0x50
[  231.744120]  ? find_held_lock+0x119/0x150
[  231.744138]  ? mark_lock+0x6d/0x850
[  231.744241]  ? gen8_gt_irq_ack+0x1f0/0x1f0 [i915]
[  231.744262]  ? work_on_cpu_safe+0x60/0x60
[  231.744284]  ? rcu_read_lock_sched_held+0x57/0xa0
[  231.744400]  ? gen6_read32+0x2ba/0x320 [i915]
[  231.744506]  i915_handle_error+0x382/0x5f0 [i915]
[  231.744611]  ? gen6_rps_reset_ei+0x20/0x20 [i915]
[  231.744630]  ? vsnprintf+0x128/0x8e0
[  231.744649]  ? pointer+0x6b0/0x6b0
[  231.744667]  ? debug_check_no_locks_freed+0x1a0/0x1a0
[  231.744688]  ? scnprintf+0x92/0xe0
[  231.744706]  ? snprintf+0xb0/0xb0
[  231.744820]  hangcheck_declare_hang+0x15a/0x1a0 [i915]
[  231.744932]  ? engine_stuck+0x440/0x440 [i915]
[  231.744951]  ? rcu_read_lock_sched_held+0x57/0xa0
[  231.745062]  ? gen6_read32+0x2ba/0x320 [i915]
[  231.745173]  ? gen6_read16+0x320/0x320 [i915]
[  231.745284]  ? intel_engine_get_active_head+0x91/0x170 [i915]
[  231.745401]  i915_hangcheck_elapsed+0x3d8/0x400 [i915]
[  231.745424]  process_one_work+0x3e8/0xac0
[  231.745444]  ? pwq_dec_nr_in_flight+0x110/0x110
[  231.745464]  ? do_raw_spin_lock+0x8e/0x120
[  231.745484]  worker_thread+0x8d/0x720
[  231.745506]  kthread+0x19e/0x1f0
[  231.745524]  ? process_one_work+0xac0/0xac0
[  231.745541]  ? kthread_create_on_node+0xa0/0xa0
[  231.745560]  ret_from_fork+0x27/0x40
[  231.745581] Code: 8b 7d c8 e8 49 0d 02 e1 49 8b 7f 38 48 8b 75 b8 48 83 c7 10 e8 b8 89 be e1 e9 95 fc ff ff 4c 89 e7 e8 4b b9 ff ff e9 30 ff ff ff <0f> 0b 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe
[  231.745767] RIP: i915_gem_reset_engine+0x45a/0x460 [i915] RSP: ffff8801f42ff770

At first glance this looks to be related to commit c64992e035d7
("drm/i915: Look for active requests earlier in the reset path"), but it
could easily happen before as well. On the other hand, we no longer
logged victims due to the active_request being dropped earlier.

Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Fixes: c64992e035d7 ("drm/i915: Look for active requests earlier in the reset path")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b1520052a5e4..3ad1352dc871 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2914,11 +2914,9 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	if (engine->irq_seqno_barrier)
 		engine->irq_seqno_barrier(engine);
 
-	if (engine_stalled(engine)) {
-		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			request = ERR_PTR(-EIO); /* Previous reset failed! */
-	}
+	request = i915_gem_find_active_request(engine);
+	if (request && request->fence.error == -EIO)
+		request = ERR_PTR(-EIO); /* Previous reset failed! */
 
 	return request;
 }
@@ -2987,12 +2985,10 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 }
 
-/* Returns true if the request was guilty of hang */
-static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
+/* Returns the request if it was guilty of the hang */
+static struct drm_i915_gem_request *
+i915_gem_reset_request(struct drm_i915_gem_request *request)
 {
-	/* Read once and return the resolution */
-	const bool guilty = !i915_gem_request_completed(request);
-
 	/* The guilty request will get skipped on a hung engine.
 	 *
 	 * Users of client default contexts do not rely on logical
@@ -3014,27 +3010,32 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	 * subsequent hangs.
 	 */
 
-	if (guilty) {
+	if (engine_stalled(request->engine)) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
+
+		/* If this context is now banned, skip all pending requests. */
+		if (i915_gem_context_is_banned(request->ctx))
+			engine_skip_context(request);
 	} else {
 		i915_gem_context_mark_innocent(request->ctx);
-		dma_fence_set_error(&request->fence, -EAGAIN);
+		if (!i915_gem_request_completed(request))
+			dma_fence_set_error(&request->fence, -EAGAIN);
+		request = NULL;
 	}
 
-	return guilty;
+	return request;
 }
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine,
 			   struct drm_i915_gem_request *request)
 {
-	if (request && i915_gem_reset_request(request)) {
+	if (request)
+		request = i915_gem_reset_request(request);
+
+	if (request) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
-
-		/* If this context is now banned, skip all pending requests. */
-		if (i915_gem_context_is_banned(request->ctx))
-			engine_skip_context(request);
 	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
-- 
2.13.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
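
The rule behind the oops above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: `toy_fence`,
`toy_fence_set_error()` and `toy_mark_innocent()` are hypothetical stand-ins,
not kernel types or functions, and the model treats "request completed" and
"fence signaled" as the same flag, which is a simplification. The stand-in
returns false at the point where, according to the commit message,
dma_fence_set_error() would trigger a BUG.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fence; not the kernel's struct dma_fence. */
struct toy_fence {
	bool signaled;	/* set once the work behind the fence has completed */
	int error;	/* 0 or a negative errno */
};

/*
 * Models the rule from the commit message: the error may only be set
 * before the fence is signaled. Returns false in the case where the
 * commit message says the kernel helper would BUG.
 */
static bool toy_fence_set_error(struct toy_fence *fence, int error)
{
	if (fence->signaled)
		return false;

	fence->error = error;
	return true;
}

/*
 * Models the innocent-request branch of the patch: the fence is only
 * marked with -EAGAIN if it has not already completed.
 */
static void toy_mark_innocent(struct toy_fence *fence)
{
	if (!fence->signaled)
		(void)toy_fence_set_error(fence, -EAGAIN);
}
```

With this guard a completed fence is left untouched, while a still-pending
one is flagged with -EAGAIN.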


* Re: [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request
  2017-07-20 13:24 [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request Chris Wilson
@ 2017-07-20 13:39 ` Chris Wilson
  2017-07-20 13:48 ` [PATCH v2] " Chris Wilson
  2017-07-20 15:56 ` ✓ Fi.CI.BAT: success for drm/i915: Don't touch fence->error when resetting an innocent request (rev2) Patchwork
  2 siblings, 0 replies; 6+ messages in thread
From: Chris Wilson @ 2017-07-20 13:39 UTC
  To: intel-gfx; +Cc: Daniel Vetter

Quoting Chris Wilson (2017-07-20 14:24:29)
> If the request has been completed before the reset took effect, we don't
> need to mark it up as being a victim. Touching fence->error after the
> fence has been signaled is detected by dma_fence_set_error() and
> triggers a BUG:
> 
> [  231.743133] kernel BUG at ./include/linux/dma-fence.h:434!
> [  231.743156] invalid opcode: 0000 [#1] SMP KASAN
> [  231.743172] Modules linked in: i915 drm_kms_helper drm iptable_nat nf_nat_ipv4 nf_nat x86_pkg_temp_thermal iosf_mbi i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb font fbdev [last unloaded: drm]
> [  231.743221] CPU: 2 PID: 20 Comm: kworker/2:0 Tainted: G     U          4.13.0-rc1+ #52
> [  231.743236] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
> [  231.743363] Workqueue: events_long i915_hangcheck_elapsed [i915]
> [  231.743382] task: ffff8801f42e9780 task.stack: ffff8801f42f8000
> [  231.743489] RIP: 0010:i915_gem_reset_engine+0x45a/0x460 [i915]
> [  231.743505] RSP: 0018:ffff8801f42ff770 EFLAGS: 00010202
> [  231.743521] RAX: 0000000000000007 RBX: ffff8801bf6b1880 RCX: ffffffffa02881a6
> [  231.743537] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8801bf6b18c8
> [  231.743551] RBP: ffff8801f42ff7c8 R08: 0000000000000001 R09: 0000000000000000
> [  231.743566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801edb02d00
> [  231.743581] R13: ffff8801e19d4200 R14: 000000000000001d R15: ffff8801ce2a4000
> [  231.743599] FS:  0000000000000000(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
> [  231.743614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  231.743629] CR2: 00007f0ebd1add10 CR3: 0000000002621000 CR4: 00000000000406e0
> [  231.743643] Call Trace:
> [  231.743752]  i915_gem_reset+0x6c/0x150 [i915]
> [  231.743853]  i915_reset+0x175/0x210 [i915]
> [  231.743958]  i915_reset_device+0x33b/0x350 [i915]
> [  231.744061]  ? valleyview_pipestat_irq_handler+0xe0/0xe0 [i915]
> [  231.744081]  ? trace_hardirqs_off_caller+0x70/0x110
> [  231.744102]  ? _raw_spin_unlock_irqrestore+0x46/0x50
> [  231.744120]  ? find_held_lock+0x119/0x150
> [  231.744138]  ? mark_lock+0x6d/0x850
> [  231.744241]  ? gen8_gt_irq_ack+0x1f0/0x1f0 [i915]
> [  231.744262]  ? work_on_cpu_safe+0x60/0x60
> [  231.744284]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.744400]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.744506]  i915_handle_error+0x382/0x5f0 [i915]
> [  231.744611]  ? gen6_rps_reset_ei+0x20/0x20 [i915]
> [  231.744630]  ? vsnprintf+0x128/0x8e0
> [  231.744649]  ? pointer+0x6b0/0x6b0
> [  231.744667]  ? debug_check_no_locks_freed+0x1a0/0x1a0
> [  231.744688]  ? scnprintf+0x92/0xe0
> [  231.744706]  ? snprintf+0xb0/0xb0
> [  231.744820]  hangcheck_declare_hang+0x15a/0x1a0 [i915]
> [  231.744932]  ? engine_stuck+0x440/0x440 [i915]
> [  231.744951]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.745062]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.745173]  ? gen6_read16+0x320/0x320 [i915]
> [  231.745284]  ? intel_engine_get_active_head+0x91/0x170 [i915]
> [  231.745401]  i915_hangcheck_elapsed+0x3d8/0x400 [i915]
> [  231.745424]  process_one_work+0x3e8/0xac0
> [  231.745444]  ? pwq_dec_nr_in_flight+0x110/0x110
> [  231.745464]  ? do_raw_spin_lock+0x8e/0x120
> [  231.745484]  worker_thread+0x8d/0x720
> [  231.745506]  kthread+0x19e/0x1f0
> [  231.745524]  ? process_one_work+0xac0/0xac0
> [  231.745541]  ? kthread_create_on_node+0xa0/0xa0
> [  231.745560]  ret_from_fork+0x27/0x40
> [  231.745581] Code: 8b 7d c8 e8 49 0d 02 e1 49 8b 7f 38 48 8b 75 b8 48 83 c7 10 e8 b8 89 be e1 e9 95 fc ff ff 4c 89 e7 e8 4b b9 ff ff e9 30 ff ff ff <0f> 0b 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe
> [  231.745767] RIP: i915_gem_reset_engine+0x45a/0x460 [i915] RSP: ffff8801f42ff770
> 
> At first glance this looks to be related to commit c64992e035d7
> ("drm/i915: Look for active requests earlier in the reset path"), but it
> could easily happen before as well. On the other hand, we no longer
> logged victims due to the active_request being dropped earlier.
> 
> Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Fixes: c64992e035d7 ("drm/i915: Look for active requests earlier in the reset path")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 37 +++++++++++++++++++------------------
>  1 file changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b1520052a5e4..3ad1352dc871 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2914,11 +2914,9 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>         if (engine->irq_seqno_barrier)
>                 engine->irq_seqno_barrier(engine);
>  
> -       if (engine_stalled(engine)) {
> -               request = i915_gem_find_active_request(engine);
> -               if (request && request->fence.error == -EIO)
> -                       request = ERR_PTR(-EIO); /* Previous reset failed! */
> -       }
> +       request = i915_gem_find_active_request(engine);
> +       if (request && request->fence.error == -EIO)
> +               request = ERR_PTR(-EIO); /* Previous reset failed! */
>  
>         return request;
>  }
> @@ -2987,12 +2985,10 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
>         spin_unlock_irqrestore(&engine->timeline->lock, flags);
>  }
>  
> -/* Returns true if the request was guilty of hang */
> -static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
> +/* Returns the request if it was guilty of the hang */
> +static struct drm_i915_gem_request *
> +i915_gem_reset_request(struct drm_i915_gem_request *request)
>  {
> -       /* Read once and return the resolution */
> -       const bool guilty = !i915_gem_request_completed(request);
> -
>         /* The guilty request will get skipped on a hung engine.
>          *
>          * Users of client default contexts do not rely on logical
> @@ -3014,27 +3010,32 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>          * subsequent hangs.
>          */
>  
> -       if (guilty) {
> +       if (engine_stalled(request->engine)) {
>                 i915_gem_context_mark_guilty(request->ctx);
>                 skip_request(request);
> +
> +               /* If this context is now banned, skip all pending requests. */
> +               if (i915_gem_context_is_banned(request->ctx))
> +                       engine_skip_context(request);
>         } else {
>                 i915_gem_context_mark_innocent(request->ctx);
> -               dma_fence_set_error(&request->fence, -EAGAIN);
> +               if (!i915_gem_request_completed(request))
> +                       dma_fence_set_error(&request->fence, -EAGAIN);
> +               request = NULL;

Hmm, this needs to be trickier. If innocent, we want to rerun the request, so:

request = list_prev_entry(request, link);
if (&request->link == &engine->timeline->requests)
	request = NULL;

The problem essentially is that we need the request to restart the
engine (for the lockless per-engine reset path), but otherwise we could
rely on the retire_requests updating the ring->head. As we don't have
the lock, we have to find a request.
-Chris
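
The rewind step suggested in the reply above can be sketched in plain
userspace C. This is an illustration under stated assumptions, not the driver
code: `list_node`, `toy_request`, `toy_entry()` and `toy_rewind()` are
hypothetical names modeled loosely on the kernel's circular list, and no
locking is shown. Unlike the snippet in the reply, the sketch compares the
previous node against the list head before converting it to a request, so it
never forms a request pointer from the head itself.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, modeled on a kernel list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical request embedding a list link, as the timeline requests do. */
struct toy_request {
	int seqno;
	struct list_node link;
};

/* Recover the enclosing toy_request from a pointer to its link member. */
#define toy_entry(ptr) \
	((struct toy_request *)((char *)(ptr) - offsetof(struct toy_request, link)))

/* An empty circular list has a head that points at itself. */
static void toy_list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Append node just before head, i.e. at the tail of the list. */
static void toy_list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Step back to the request queued before rq. Returns NULL when rq is the
 * first request, i.e. when stepping back would land on the list head.
 */
static struct toy_request *toy_rewind(struct toy_request *rq,
				      struct list_node *head)
{
	struct list_node *prev = rq->link.prev;

	if (prev == head)
		return NULL;

	return toy_entry(prev);
}
```

For a list holding requests A then B, rewinding from B yields A, and rewinding
from A yields NULL because there is nothing earlier to restart from.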

* [PATCH v2] drm/i915: Don't touch fence->error when resetting an innocent request
  2017-07-20 13:24 [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request Chris Wilson
  2017-07-20 13:39 ` Chris Wilson
@ 2017-07-20 13:48 ` Chris Wilson
  2017-07-20 20:41   ` Daniel Vetter
  2017-07-21  9:33   ` Joonas Lahtinen
  2017-07-20 15:56 ` ✓ Fi.CI.BAT: success for drm/i915: Don't touch fence->error when resetting an innocent request (rev2) Patchwork
  2 siblings, 2 replies; 6+ messages in thread
From: Chris Wilson @ 2017-07-20 13:48 UTC
  To: intel-gfx; +Cc: Daniel Vetter

If the request has been completed before the reset took effect, we don't
need to mark it up as being a victim. Touching fence->error after the
fence has been signaled is detected by dma_fence_set_error() and
triggers a BUG:

[  231.743133] kernel BUG at ./include/linux/dma-fence.h:434!
[  231.743156] invalid opcode: 0000 [#1] SMP KASAN
[  231.743172] Modules linked in: i915 drm_kms_helper drm iptable_nat nf_nat_ipv4 nf_nat x86_pkg_temp_thermal iosf_mbi i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb font fbdev [last unloaded: drm]
[  231.743221] CPU: 2 PID: 20 Comm: kworker/2:0 Tainted: G     U          4.13.0-rc1+ #52
[  231.743236] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
[  231.743363] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  231.743382] task: ffff8801f42e9780 task.stack: ffff8801f42f8000
[  231.743489] RIP: 0010:i915_gem_reset_engine+0x45a/0x460 [i915]
[  231.743505] RSP: 0018:ffff8801f42ff770 EFLAGS: 00010202
[  231.743521] RAX: 0000000000000007 RBX: ffff8801bf6b1880 RCX: ffffffffa02881a6
[  231.743537] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8801bf6b18c8
[  231.743551] RBP: ffff8801f42ff7c8 R08: 0000000000000001 R09: 0000000000000000
[  231.743566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801edb02d00
[  231.743581] R13: ffff8801e19d4200 R14: 000000000000001d R15: ffff8801ce2a4000
[  231.743599] FS:  0000000000000000(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
[  231.743614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.743629] CR2: 00007f0ebd1add10 CR3: 0000000002621000 CR4: 00000000000406e0
[  231.743643] Call Trace:
[  231.743752]  i915_gem_reset+0x6c/0x150 [i915]
[  231.743853]  i915_reset+0x175/0x210 [i915]
[  231.743958]  i915_reset_device+0x33b/0x350 [i915]
[  231.744061]  ? valleyview_pipestat_irq_handler+0xe0/0xe0 [i915]
[  231.744081]  ? trace_hardirqs_off_caller+0x70/0x110
[  231.744102]  ? _raw_spin_unlock_irqrestore+0x46/0x50
[  231.744120]  ? find_held_lock+0x119/0x150
[  231.744138]  ? mark_lock+0x6d/0x850
[  231.744241]  ? gen8_gt_irq_ack+0x1f0/0x1f0 [i915]
[  231.744262]  ? work_on_cpu_safe+0x60/0x60
[  231.744284]  ? rcu_read_lock_sched_held+0x57/0xa0
[  231.744400]  ? gen6_read32+0x2ba/0x320 [i915]
[  231.744506]  i915_handle_error+0x382/0x5f0 [i915]
[  231.744611]  ? gen6_rps_reset_ei+0x20/0x20 [i915]
[  231.744630]  ? vsnprintf+0x128/0x8e0
[  231.744649]  ? pointer+0x6b0/0x6b0
[  231.744667]  ? debug_check_no_locks_freed+0x1a0/0x1a0
[  231.744688]  ? scnprintf+0x92/0xe0
[  231.744706]  ? snprintf+0xb0/0xb0
[  231.744820]  hangcheck_declare_hang+0x15a/0x1a0 [i915]
[  231.744932]  ? engine_stuck+0x440/0x440 [i915]
[  231.744951]  ? rcu_read_lock_sched_held+0x57/0xa0
[  231.745062]  ? gen6_read32+0x2ba/0x320 [i915]
[  231.745173]  ? gen6_read16+0x320/0x320 [i915]
[  231.745284]  ? intel_engine_get_active_head+0x91/0x170 [i915]
[  231.745401]  i915_hangcheck_elapsed+0x3d8/0x400 [i915]
[  231.745424]  process_one_work+0x3e8/0xac0
[  231.745444]  ? pwq_dec_nr_in_flight+0x110/0x110
[  231.745464]  ? do_raw_spin_lock+0x8e/0x120
[  231.745484]  worker_thread+0x8d/0x720
[  231.745506]  kthread+0x19e/0x1f0
[  231.745524]  ? process_one_work+0xac0/0xac0
[  231.745541]  ? kthread_create_on_node+0xa0/0xa0
[  231.745560]  ret_from_fork+0x27/0x40
[  231.745581] Code: 8b 7d c8 e8 49 0d 02 e1 49 8b 7f 38 48 8b 75 b8 48 83 c7 10 e8 b8 89 be e1 e9 95 fc ff ff 4c 89 e7 e8 4b b9 ff ff e9 30 ff ff ff <0f> 0b 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe
[  231.745767] RIP: i915_gem_reset_engine+0x45a/0x460 [i915] RSP: ffff8801f42ff770

At first glance this looks to be related to commit c64992e035d7
("drm/i915: Look for active requests earlier in the reset path"), but it
could easily happen before as well. On the other hand, we no longer
logged victims due to the active_request being dropped earlier.

v2: Be trickier to unwind the incomplete request as we cannot rely on
request retirement for the lockless per-engine reset.

Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Fixes: c64992e035d7 ("drm/i915: Look for active requests earlier in the reset path")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 45 ++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b1520052a5e4..607791724fa7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2914,11 +2914,9 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	if (engine->irq_seqno_barrier)
 		engine->irq_seqno_barrier(engine);
 
-	if (engine_stalled(engine)) {
-		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			request = ERR_PTR(-EIO); /* Previous reset failed! */
-	}
+	request = i915_gem_find_active_request(engine);
+	if (request && request->fence.error == -EIO)
+		request = ERR_PTR(-EIO); /* Previous reset failed! */
 
 	return request;
 }
@@ -2987,12 +2985,11 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 }
 
-/* Returns true if the request was guilty of hang */
-static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
+/* Returns the request if it was guilty of the hang */
+static struct drm_i915_gem_request *
+i915_gem_reset_request(struct intel_engine_cs *engine,
+		       struct drm_i915_gem_request *request)
 {
-	/* Read once and return the resolution */
-	const bool guilty = !i915_gem_request_completed(request);
-
 	/* The guilty request will get skipped on a hung engine.
 	 *
 	 * Users of client default contexts do not rely on logical
@@ -3014,27 +3011,39 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	 * subsequent hangs.
 	 */
 
-	if (guilty) {
+	if (engine_stalled(engine)) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
+
+		/* If this context is now banned, skip all pending requests. */
+		if (i915_gem_context_is_banned(request->ctx))
+			engine_skip_context(request);
 	} else {
 		i915_gem_context_mark_innocent(request->ctx);
-		dma_fence_set_error(&request->fence, -EAGAIN);
+		if (!i915_gem_request_completed(request)) {
+			dma_fence_set_error(&request->fence, -EAGAIN);
+
+			/* Rewind the engine to before this incomplete rq */ 
+			spin_lock_irq(&engine->timeline->lock);
+			request = list_prev_entry(request, link);
+			if (&request->link == &engine->timeline->requests)
+				request = NULL;
+			spin_unlock_irq(&engine->timeline->lock);
+		}
 	}
 
-	return guilty;
+	return request;
 }
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine,
 			   struct drm_i915_gem_request *request)
 {
-	if (request && i915_gem_reset_request(request)) {
+	if (request)
+		request = i915_gem_reset_request(engine, request);
+
+	if (request) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
-
-		/* If this context is now banned, skip all pending requests. */
-		if (i915_gem_context_is_banned(request->ctx))
-			engine_skip_context(request);
 	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
-- 
2.13.3


* ✓ Fi.CI.BAT: success for drm/i915: Don't touch fence->error when resetting an innocent request (rev2)
  2017-07-20 13:24 [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request Chris Wilson
  2017-07-20 13:39 ` Chris Wilson
  2017-07-20 13:48 ` [PATCH v2] " Chris Wilson
@ 2017-07-20 15:56 ` Patchwork
  2 siblings, 0 replies; 6+ messages in thread
From: Patchwork @ 2017-07-20 15:56 UTC
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Don't touch fence->error when resetting an innocent request (rev2)
URL   : https://patchwork.freedesktop.org/series/27641/
State : success

== Summary ==

Series 27641v2 drm/i915: Don't touch fence->error when resetting an innocent request
https://patchwork.freedesktop.org/api/1.0/series/27641/revisions/2/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> FAIL       (fi-snb-2600) fdo#100007
Test kms_flip:
        Subgroup basic-flip-vs-modeset:
                pass       -> SKIP       (fi-skl-x1585l) fdo#101781
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                dmesg-warn -> PASS       (fi-byt-n2820) fdo#101705

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007
fdo#101781 https://bugs.freedesktop.org/show_bug.cgi?id=101781
fdo#101705 https://bugs.freedesktop.org/show_bug.cgi?id=101705

fi-bdw-5557u     total:279  pass:268  dwarn:0   dfail:0   fail:0   skip:11  time:443s
fi-bdw-gvtdvm    total:279  pass:265  dwarn:0   dfail:0   fail:0   skip:14  time:427s
fi-blb-e6850     total:279  pass:224  dwarn:1   dfail:0   fail:0   skip:54  time:356s
fi-bsw-n3050     total:279  pass:243  dwarn:0   dfail:0   fail:0   skip:36  time:537s
fi-bxt-j4205     total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:518s
fi-byt-j1900     total:279  pass:255  dwarn:0   dfail:0   fail:0   skip:24  time:492s
fi-byt-n2820     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:482s
fi-glk-2a        total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:596s
fi-hsw-4770      total:279  pass:263  dwarn:0   dfail:0   fail:0   skip:16  time:439s
fi-hsw-4770r     total:279  pass:263  dwarn:0   dfail:0   fail:0   skip:16  time:426s
fi-ilk-650       total:279  pass:229  dwarn:0   dfail:0   fail:0   skip:50  time:416s
fi-ivb-3520m     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:503s
fi-ivb-3770      total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:476s
fi-kbl-7500u     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:478s
fi-kbl-7560u     total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:581s
fi-kbl-r         total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:586s
fi-pnv-d510      total:279  pass:221  dwarn:3   dfail:0   fail:0   skip:55  time:566s
fi-skl-6260u     total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:456s
fi-skl-6700hq    total:279  pass:262  dwarn:0   dfail:0   fail:0   skip:17  time:583s
fi-skl-6700k     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:468s
fi-skl-6770hq    total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:477s
fi-skl-gvtdvm    total:279  pass:266  dwarn:0   dfail:0   fail:0   skip:13  time:434s
fi-skl-x1585l    total:279  pass:268  dwarn:0   dfail:0   fail:0   skip:11  time:480s
fi-snb-2520m     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:548s
fi-snb-2600      total:279  pass:249  dwarn:0   dfail:0   fail:1   skip:29  time:404s

05cf23d5c759917f3b50aea0dcb0df1464366b93 drm-tip: 2017y-07m-20d-14h-24m-27s UTC integration manifest
69bd28c drm/i915: Don't touch fence->error when resetting an innocent request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5253/

* Re: [PATCH v2] drm/i915: Don't touch fence->error when resetting an innocent request
  2017-07-20 13:48 ` [PATCH v2] " Chris Wilson
@ 2017-07-20 20:41   ` Daniel Vetter
  2017-07-21  9:33   ` Joonas Lahtinen
  1 sibling, 0 replies; 6+ messages in thread
From: Daniel Vetter @ 2017-07-20 20:41 UTC
  To: Chris Wilson; +Cc: Daniel Vetter, intel-gfx

On Thu, Jul 20, 2017 at 02:48:19PM +0100, Chris Wilson wrote:
> If the request has been completed before the reset took effect, we don't
> need to mark it up as being a victim. Touching fence->error after the
> fence has been signaled is detected by dma_fence_set_error() and
> triggers a BUG:
> 
> [  231.743133] kernel BUG at ./include/linux/dma-fence.h:434!
> [  231.743156] invalid opcode: 0000 [#1] SMP KASAN
> [  231.743172] Modules linked in: i915 drm_kms_helper drm iptable_nat nf_nat_ipv4 nf_nat x86_pkg_temp_thermal iosf_mbi i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb font fbdev [last unloaded: drm]
> [  231.743221] CPU: 2 PID: 20 Comm: kworker/2:0 Tainted: G     U          4.13.0-rc1+ #52
> [  231.743236] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
> [  231.743363] Workqueue: events_long i915_hangcheck_elapsed [i915]
> [  231.743382] task: ffff8801f42e9780 task.stack: ffff8801f42f8000
> [  231.743489] RIP: 0010:i915_gem_reset_engine+0x45a/0x460 [i915]
> [  231.743505] RSP: 0018:ffff8801f42ff770 EFLAGS: 00010202
> [  231.743521] RAX: 0000000000000007 RBX: ffff8801bf6b1880 RCX: ffffffffa02881a6
> [  231.743537] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8801bf6b18c8
> [  231.743551] RBP: ffff8801f42ff7c8 R08: 0000000000000001 R09: 0000000000000000
> [  231.743566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801edb02d00
> [  231.743581] R13: ffff8801e19d4200 R14: 000000000000001d R15: ffff8801ce2a4000
> [  231.743599] FS:  0000000000000000(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
> [  231.743614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  231.743629] CR2: 00007f0ebd1add10 CR3: 0000000002621000 CR4: 00000000000406e0
> [  231.743643] Call Trace:
> [  231.743752]  i915_gem_reset+0x6c/0x150 [i915]
> [  231.743853]  i915_reset+0x175/0x210 [i915]
> [  231.743958]  i915_reset_device+0x33b/0x350 [i915]
> [  231.744061]  ? valleyview_pipestat_irq_handler+0xe0/0xe0 [i915]
> [  231.744081]  ? trace_hardirqs_off_caller+0x70/0x110
> [  231.744102]  ? _raw_spin_unlock_irqrestore+0x46/0x50
> [  231.744120]  ? find_held_lock+0x119/0x150
> [  231.744138]  ? mark_lock+0x6d/0x850
> [  231.744241]  ? gen8_gt_irq_ack+0x1f0/0x1f0 [i915]
> [  231.744262]  ? work_on_cpu_safe+0x60/0x60
> [  231.744284]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.744400]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.744506]  i915_handle_error+0x382/0x5f0 [i915]
> [  231.744611]  ? gen6_rps_reset_ei+0x20/0x20 [i915]
> [  231.744630]  ? vsnprintf+0x128/0x8e0
> [  231.744649]  ? pointer+0x6b0/0x6b0
> [  231.744667]  ? debug_check_no_locks_freed+0x1a0/0x1a0
> [  231.744688]  ? scnprintf+0x92/0xe0
> [  231.744706]  ? snprintf+0xb0/0xb0
> [  231.744820]  hangcheck_declare_hang+0x15a/0x1a0 [i915]
> [  231.744932]  ? engine_stuck+0x440/0x440 [i915]
> [  231.744951]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.745062]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.745173]  ? gen6_read16+0x320/0x320 [i915]
> [  231.745284]  ? intel_engine_get_active_head+0x91/0x170 [i915]
> [  231.745401]  i915_hangcheck_elapsed+0x3d8/0x400 [i915]
> [  231.745424]  process_one_work+0x3e8/0xac0
> [  231.745444]  ? pwq_dec_nr_in_flight+0x110/0x110
> [  231.745464]  ? do_raw_spin_lock+0x8e/0x120
> [  231.745484]  worker_thread+0x8d/0x720
> [  231.745506]  kthread+0x19e/0x1f0
> [  231.745524]  ? process_one_work+0xac0/0xac0
> [  231.745541]  ? kthread_create_on_node+0xa0/0xa0
> [  231.745560]  ret_from_fork+0x27/0x40
> [  231.745581] Code: 8b 7d c8 e8 49 0d 02 e1 49 8b 7f 38 48 8b 75 b8 48 83 c7 10 e8 b8 89 be e1 e9 95 fc ff ff 4c 89 e7 e8 4b b9 ff ff e9 30 ff ff ff <0f> 0b 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe
> [  231.745767] RIP: i915_gem_reset_engine+0x45a/0x460 [i915] RSP: ffff8801f42ff770
> 
> At first glance this looks to be related to commit c64992e035d7
> ("drm/i915: Look for active requests earlier in the reset path"), but it
> could easily happen before as well. On the other hand, we no longer
> logged victims due to the active_request being dropped earlier.
> 
> v2: Be trickier to unwind the incomplete request as we cannot rely on
> request retirement for the lockless per-engine reset.
> 
> Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Fixes: c64992e035d7 ("drm/i915: Look for active requests earlier in the reset path")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>

I observed this by running igt/kms_busy; with this patch the issue seems
to be gone.

Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 45 ++++++++++++++++++++++++-----------------
>  1 file changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b1520052a5e4..607791724fa7 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2914,11 +2914,9 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  	if (engine->irq_seqno_barrier)
>  		engine->irq_seqno_barrier(engine);
>  
> -	if (engine_stalled(engine)) {
> -		request = i915_gem_find_active_request(engine);
> -		if (request && request->fence.error == -EIO)
> -			request = ERR_PTR(-EIO); /* Previous reset failed! */
> -	}
> +	request = i915_gem_find_active_request(engine);
> +	if (request && request->fence.error == -EIO)
> +		request = ERR_PTR(-EIO); /* Previous reset failed! */
>  
>  	return request;
>  }
> @@ -2987,12 +2985,11 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
>  	spin_unlock_irqrestore(&engine->timeline->lock, flags);
>  }
>  
> -/* Returns true if the request was guilty of hang */
> -static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
> +/* Returns the request if it was guilty of the hang */
> +static struct drm_i915_gem_request *
> +i915_gem_reset_request(struct intel_engine_cs *engine,
> +		       struct drm_i915_gem_request *request)
>  {
> -	/* Read once and return the resolution */
> -	const bool guilty = !i915_gem_request_completed(request);
> -
>  	/* The guilty request will get skipped on a hung engine.
>  	 *
>  	 * Users of client default contexts do not rely on logical
> @@ -3014,27 +3011,39 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>  	 * subsequent hangs.
>  	 */
>  
> -	if (guilty) {
> +	if (engine_stalled(engine)) {
>  		i915_gem_context_mark_guilty(request->ctx);
>  		skip_request(request);
> +
> +		/* If this context is now banned, skip all pending requests. */
> +		if (i915_gem_context_is_banned(request->ctx))
> +			engine_skip_context(request);
>  	} else {
>  		i915_gem_context_mark_innocent(request->ctx);
> -		dma_fence_set_error(&request->fence, -EAGAIN);
> +		if (!i915_gem_request_completed(request)) {
> +			dma_fence_set_error(&request->fence, -EAGAIN);
> +
> +			/* Rewind the engine to before this incomplete rq */ 
> +			spin_lock_irq(&engine->timeline->lock);
> +			request = list_prev_entry(request, link);
> +			if (&request->link == &engine->timeline->requests)
> +				request = NULL;
> +			spin_unlock_irq(&engine->timeline->lock);
> +		}
>  	}
>  
> -	return guilty;
> +	return request;
>  }
>  
>  void i915_gem_reset_engine(struct intel_engine_cs *engine,
>  			   struct drm_i915_gem_request *request)
>  {
> -	if (request && i915_gem_reset_request(request)) {
> +	if (request)
> +		request = i915_gem_reset_request(engine, request);
> +
> +	if (request) {
>  		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
>  				 engine->name, request->global_seqno);
> -
> -		/* If this context is now banned, skip all pending requests. */
> -		if (i915_gem_context_is_banned(request->ctx))
> -			engine_skip_context(request);
>  	}
>  
>  	/* Setup the CS to resume from the breadcrumb of the hung request */
> -- 
> 2.13.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
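
The "rewind the engine to before this incomplete rq" step in the quoted diff can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the driver code: the toy_* names are hypothetical stand-ins for list_head, the request and the engine timeline, and the spin_lock_irq()/spin_unlock_irq() pair that the patch takes around the list walk is left out. It only shows the idea of stepping back one entry and returning NULL when the previous entry is the list head itself.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's
 * list_head. All toy_* names are hypothetical and exist only for this sketch.
 */
struct toy_list {
	struct toy_list *prev, *next;
};

struct toy_rq {
	unsigned int seqno;
	struct toy_list link;		/* node in the timeline's request list */
};

struct toy_timeline {
	struct toy_list requests;	/* list head, not embedded in any request */
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head;
	head->next = head;
}

static void toy_list_add_tail(struct toy_list *entry, struct toy_list *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Return the request queued immediately before rq, or NULL if rq is the
 * first request, i.e. the previous node is the list head.
 * The real patch does this walk under the timeline lock; omitted here.
 */
static struct toy_rq *toy_rq_before(struct toy_timeline *tl, struct toy_rq *rq)
{
	struct toy_list *prev = rq->link.prev;

	if (prev == &tl->requests)
		return NULL;

	/* Open-coded container_of(): recover the request from its link. */
	return (struct toy_rq *)((char *)prev - offsetof(struct toy_rq, link));
}
```

With two requests a and b queued in that order, toy_rq_before() returns a for b and NULL for a, which corresponds to the case where the patch sets request to NULL and the debug message in i915_gem_reset_engine() is skipped.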


* Re: [PATCH v2] drm/i915: Don't touch fence->error when resetting an innocent request
  2017-07-20 13:48 ` [PATCH v2] " Chris Wilson
  2017-07-20 20:41   ` Daniel Vetter
@ 2017-07-21  9:33   ` Joonas Lahtinen
  1 sibling, 0 replies; 6+ messages in thread
From: Joonas Lahtinen @ 2017-07-21  9:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter

On Thu, 2017-07-20 at 14:48 +0100, Chris Wilson wrote:
> If the request has been completed before the reset took effect, we don't
> need to mark it up as being a victim. Touching fence->error after the
> fence has been signaled is detected by dma_fence_set_error() and
> triggers a BUG:
> 
> [  231.743133] kernel BUG at ./include/linux/dma-fence.h:434!
> [  231.743156] invalid opcode: 0000 [#1] SMP KASAN
> [  231.743172] Modules linked in: i915 drm_kms_helper drm iptable_nat nf_nat_ipv4 nf_nat x86_pkg_temp_thermal iosf_mbi i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb font fbdev [last unloaded: drm]
> [  231.743221] CPU: 2 PID: 20 Comm: kworker/2:0 Tainted: G     U          4.13.0-rc1+ #52
> [  231.743236] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
> [  231.743363] Workqueue: events_long i915_hangcheck_elapsed [i915]
> [  231.743382] task: ffff8801f42e9780 task.stack: ffff8801f42f8000
> [  231.743489] RIP: 0010:i915_gem_reset_engine+0x45a/0x460 [i915]
> [  231.743505] RSP: 0018:ffff8801f42ff770 EFLAGS: 00010202
> [  231.743521] RAX: 0000000000000007 RBX: ffff8801bf6b1880 RCX: ffffffffa02881a6
> [  231.743537] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8801bf6b18c8
> [  231.743551] RBP: ffff8801f42ff7c8 R08: 0000000000000001 R09: 0000000000000000
> [  231.743566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801edb02d00
> [  231.743581] R13: ffff8801e19d4200 R14: 000000000000001d R15: ffff8801ce2a4000
> [  231.743599] FS:  0000000000000000(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
> [  231.743614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  231.743629] CR2: 00007f0ebd1add10 CR3: 0000000002621000 CR4: 00000000000406e0
> [  231.743643] Call Trace:
> [  231.743752]  i915_gem_reset+0x6c/0x150 [i915]
> [  231.743853]  i915_reset+0x175/0x210 [i915]
> [  231.743958]  i915_reset_device+0x33b/0x350 [i915]
> [  231.744061]  ? valleyview_pipestat_irq_handler+0xe0/0xe0 [i915]
> [  231.744081]  ? trace_hardirqs_off_caller+0x70/0x110
> [  231.744102]  ? _raw_spin_unlock_irqrestore+0x46/0x50
> [  231.744120]  ? find_held_lock+0x119/0x150
> [  231.744138]  ? mark_lock+0x6d/0x850
> [  231.744241]  ? gen8_gt_irq_ack+0x1f0/0x1f0 [i915]
> [  231.744262]  ? work_on_cpu_safe+0x60/0x60
> [  231.744284]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.744400]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.744506]  i915_handle_error+0x382/0x5f0 [i915]
> [  231.744611]  ? gen6_rps_reset_ei+0x20/0x20 [i915]
> [  231.744630]  ? vsnprintf+0x128/0x8e0
> [  231.744649]  ? pointer+0x6b0/0x6b0
> [  231.744667]  ? debug_check_no_locks_freed+0x1a0/0x1a0
> [  231.744688]  ? scnprintf+0x92/0xe0
> [  231.744706]  ? snprintf+0xb0/0xb0
> [  231.744820]  hangcheck_declare_hang+0x15a/0x1a0 [i915]
> [  231.744932]  ? engine_stuck+0x440/0x440 [i915]
> [  231.744951]  ? rcu_read_lock_sched_held+0x57/0xa0
> [  231.745062]  ? gen6_read32+0x2ba/0x320 [i915]
> [  231.745173]  ? gen6_read16+0x320/0x320 [i915]
> [  231.745284]  ? intel_engine_get_active_head+0x91/0x170 [i915]
> [  231.745401]  i915_hangcheck_elapsed+0x3d8/0x400 [i915]
> [  231.745424]  process_one_work+0x3e8/0xac0
> [  231.745444]  ? pwq_dec_nr_in_flight+0x110/0x110
> [  231.745464]  ? do_raw_spin_lock+0x8e/0x120
> [  231.745484]  worker_thread+0x8d/0x720
> [  231.745506]  kthread+0x19e/0x1f0
> [  231.745524]  ? process_one_work+0xac0/0xac0
> [  231.745541]  ? kthread_create_on_node+0xa0/0xa0
> [  231.745560]  ret_from_fork+0x27/0x40
> [  231.745581] Code: 8b 7d c8 e8 49 0d 02 e1 49 8b 7f 38 48 8b 75 b8 48 83 c7 10 e8 b8 89 be e1 e9 95 fc ff ff 4c 89 e7 e8 4b b9 ff ff e9 30 ff ff ff <0f> 0b 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe
> [  231.745767] RIP: i915_gem_reset_engine+0x45a/0x460 [i915] RSP: ffff8801f42ff770
> 
> At first glance this looks to be related to commit c64992e035d7
> ("drm/i915: Look for active requests earlier in the reset path"), but it
> could easily happen before as well. On the other hand, we no longer
> logged victims due to the active_request being dropped earlier.
> 
> v2: Be trickier to unwind the incomplete request as we cannot rely on
> request retirement for the lockless per-engine reset.
> 
> Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Fixes: c64992e035d7 ("drm/i915: Look for active requests earlier in the reset path")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

I'm not Mika, but:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Might want to mention "global reset" in the commit message to make
some lights light up for a reader like me.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
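
The rule described in the commit message, that fence->error must not be touched once the fence has been signaled, can be modelled in a few lines of userspace C. This is a rough sketch under stated assumptions, not the kernel implementation: the toy_* types and functions are hypothetical, completion of the request is assumed to be the same thing as the fence being signaled, and the setter returns false at the point where the backtrace above shows dma_fence_set_error() hitting a BUG.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel objects; every toy_* name is hypothetical. */
struct toy_fence {
	bool signaled;	/* set once the fence has completed */
	int error;	/* 0, or a negative errno recorded before signaling */
};

struct toy_request {
	struct toy_fence fence;
};

/* Models the rule from the commit message: the error may only be recorded
 * while the fence is still unsignaled. Returns false where the kernel
 * helper would trigger a BUG instead.
 */
static bool toy_fence_set_error(struct toy_fence *fence, int error)
{
	if (fence->signaled)
		return false;

	fence->error = error;
	return true;
}

/* Assumption for this sketch: a request is complete iff its fence signaled. */
static bool toy_request_completed(const struct toy_request *rq)
{
	return rq->fence.signaled;
}

/* Shape of the patched innocent branch: only an incomplete request is
 * marked with -EAGAIN; a completed one is left untouched.
 */
static void toy_reset_innocent(struct toy_request *rq)
{
	if (!toy_request_completed(rq))
		toy_fence_set_error(&rq->fence, -EAGAIN);
}
```

Calling toy_reset_innocent() on a request whose fence is already signaled leaves its error at 0, while an unsignaled request ends up with -EAGAIN. Calling toy_fence_set_error() directly on the signaled fence is the pre-patch behaviour that the setter rejects.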


Thread overview: 6+ messages
2017-07-20 13:24 [PATCH] drm/i915: Don't touch fence->error when resetting an innocent request Chris Wilson
2017-07-20 13:39 ` Chris Wilson
2017-07-20 13:48 ` [PATCH v2] " Chris Wilson
2017-07-20 20:41   ` Daniel Vetter
2017-07-21  9:33   ` Joonas Lahtinen
2017-07-20 15:56 ` ✓ Fi.CI.BAT: success for drm/i915: Don't touch fence->error when resetting an innocent request (rev2) Patchwork
