[PATCH 0/3] Do error capture async, flush G2H processing on reset

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] Do error capture async, flush G2H processing on reset
@ 2021-10-23 18:05 ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Rather allocating an error capture in nowait context to break a lockdep
splat [1], do the error capture async compared to the G2H processing.

v2: Fix Docs warning
v3: Rebase, resend for CI
v4: Resend for CI
v5: Fix CI splat: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21428/fi-rkl-guc/igt@i915_selftest@live@hangcheck.html

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5

John Harrison (1):
  drm/i915/guc: Refcount context during error capture

Matthew Brost (2):
  drm/i915/guc: Do error capture asynchronously
  drm/i915/guc: Flush G2H work queue during reset

 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  10 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 107 ++++++++++++++----
 4 files changed, 101 insertions(+), 25 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Intel-gfx] [PATCH 0/3] Do error capture async, flush G2H processing on reset
@ 2021-10-23 18:05 ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Rather allocating an error capture in nowait context to break a lockdep
splat [1], do the error capture async compared to the G2H processing.

v2: Fix Docs warning
v3: Rebase, resend for CI
v4: Resend for CI
v5: Fix CI splat: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21428/fi-rkl-guc/igt@i915_selftest@live@hangcheck.html

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5

John Harrison (1):
  drm/i915/guc: Refcount context during error capture

Matthew Brost (2):
  drm/i915/guc: Do error capture asynchronously
  drm/i915/guc: Flush G2H work queue during reset

 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  10 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 107 ++++++++++++++----
 4 files changed, 101 insertions(+), 25 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] drm/i915/guc: Do error capture asynchronously
  2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
@ 2021-10-23 18:05   ` Matthew Brost
  -1 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

An error capture allocates memory, memory allocations depend on resets,
and resets need to flush the G2H handlers to seal several races. If the
error capture is done from the G2H handler this creates a circular
dependency. To work around this, do a error capture in a work queue
asynchronously from the G2H handler. This should be fine as (eventually)
all register state is put into a buffer by the GuC so it is safe to
restart the context before the error capture is complete. At worst, a
full GT reset clobbers an error capture and some information is lost.

Example of lockdep splat below:

[  154.625989] ======================================================
[  154.632195] WARNING: possible circular locking dependency detected
[  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
[  154.643991] ------------------------------------------------------
[  154.650196] i915_selftest/1673 is trying to acquire lock:
[  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0 [  154.665826]
               but task is already holding lock:
[  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] [  154.680659]
               which lock already depends on the new lock.

[  154.688857]
               the existing dependency chain (in reverse order) is:
[  154.696365]
               -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
[  154.702571]        lock_acquire+0xd2/0x300
[  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
[  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
[  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
[  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
[  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
[  154.733891]        pci_device_probe+0x9b/0x110
[  154.738362]        really_probe+0x1a6/0x3a0
[  154.742568]        __driver_probe_device+0xf9/0x170
[  154.747468]        driver_probe_device+0x19/0x90
[  154.752114]        __driver_attach+0x99/0x170
[  154.756492]        bus_for_each_dev+0x73/0xc0
[  154.760870]        bus_add_driver+0x14b/0x1f0
[  154.765248]        driver_register+0x67/0xb0
[  154.769542]        i915_init+0x18/0x8c [i915]
[  154.773964]        do_one_initcall+0x53/0x2e0
[  154.778343]        do_init_module+0x56/0x210
[  154.782639]        load_module+0x25fc/0x29f0
[  154.786934]        __do_sys_finit_module+0xae/0x110
[  154.791835]        do_syscall_64+0x38/0xc0
[  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  154.801558]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  154.807241]        lock_acquire+0xd2/0x300
[  154.811361]        fs_reclaim_acquire+0x9e/0xd0
[  154.815914]        kmem_cache_alloc_trace+0x30/0x790
[  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
[  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
[  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
[  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
[  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
[  154.850898]        process_one_work+0x264/0x590
[  154.855451]        worker_thread+0x4b/0x3a0
[  154.859655]        kthread+0x147/0x170
[  154.863428]        ret_from_fork+0x1f/0x30
[  154.867548]
               -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
[  154.875747]        check_prev_add+0x90/0xc30
[  154.880042]        __lock_acquire+0x1643/0x2110
[  154.884595]        lock_acquire+0xd2/0x300
[  154.888715]        __flush_work+0x373/0x4d0
[  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
[  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
[  154.905166]        reset_prepare+0x55/0x60 [i915]
[  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
[  154.914984]        do_device_reset+0x13/0x20 [i915]
[  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
[  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
[  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
[  154.937428]        __run_selftests.cold+0x96/0xee [i915]
[  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
[  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
[  154.953076]        pci_device_probe+0x9b/0x110
[  154.957545]        really_probe+0x1a6/0x3a0
[  154.961749]        __driver_probe_device+0xf9/0x170
[  154.966653]        driver_probe_device+0x19/0x90
[  154.971290]        __driver_attach+0x99/0x170
[  154.975671]        bus_for_each_dev+0x73/0xc0
[  154.980053]        bus_add_driver+0x14b/0x1f0
[  154.984431]        driver_register+0x67/0xb0
[  154.988725]        i915_init+0x18/0x8c [i915]
[  154.993149]        do_one_initcall+0x53/0x2e0
[  154.997527]        do_init_module+0x56/0x210
[  155.001822]        load_module+0x25fc/0x29f0
[  155.006118]        __do_sys_finit_module+0xae/0x110
[  155.011019]        do_syscall_64+0x38/0xc0
[  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  155.020729]
               other info that might help us debug this:

[  155.028752] Chain exists of:
                 (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex

[  155.041294]  Possible unsafe locking scenario:

[  155.047240]        CPU0                    CPU1
[  155.051791]        ----                    ----
[  155.056344]   lock(&gt->reset.mutex);
[  155.060026]                                lock(fs_reclaim);
[  155.065706]                                lock(&gt->reset.mutex);
[  155.071912]   lock((work_completion)(&ct->requests.worker));
[  155.077595]
                *** DEADLOCK ***

v2:
  - Resolve ce within lock, requeue workqueue if list not empty
v3:
 (CI)
  - Only add ce to capture list if empty

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |  7 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 10 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 65 +++++++++++++++++--
 4 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 5634d14052bc..c8990d7c030a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -406,6 +406,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	INIT_LIST_HEAD(&ce->parallel.child_list);
 
+	INIT_LIST_HEAD(&ce->guc_capture_link);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 9e0177dc5484..ca714c866512 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -276,6 +276,13 @@ struct intel_context {
 		} guc;
 	} parallel;
 
+	/**
+	 * @guc_capture_link: in guc->submission_state.capture_list when an
+	 * error capture is pending on this context, protected by
+	 * guc->submission_state.lock
+	 */
+	struct list_head guc_capture_link;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 31cf9fb48c7e..540a567268ef 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -114,6 +114,16 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @capture_list: list of intel_context that need to capture
+		 * error state
+		 */
+		struct list_head capture_list;
+		/**
+		 * @capture_worker: worker to do error capture when the GuC
+		 * signals a context has been reset
+		 */
+		struct work_struct capture_worker;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d7710debcd47..9a5aeba033e0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -91,8 +91,8 @@
  * used for all of GuC submission but that could change in the future.
  *
  * guc->submission_state.lock
- * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
- * list.
+ * Global lock for GuC submission state. Protects guc_ids, destroyed contexts
+ * list, and capture list.
  *
  * ce->guc_state.lock
  * Protects everything under ce->guc_state. Ensures that a context is in the
@@ -1478,6 +1478,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 
 static void destroyed_worker_func(struct work_struct *w);
 
+static void capture_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1506,6 +1508,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_LIST_HEAD(&guc->submission_state.capture_list);
+	INIT_WORK(&guc->submission_state.capture_worker, capture_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
@@ -3652,17 +3656,66 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	return 0;
 }
 
-static void capture_error_state(struct intel_guc *guc,
-				struct intel_context *ce)
+static void capture_worker_func(struct work_struct *w)
 {
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.capture_worker);
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct drm_i915_private *i915 = gt->i915;
-	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+	unsigned long flags;
 	intel_wakeref_t wakeref;
 
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	ce = list_first_entry(&guc->submission_state.capture_list,
+			      struct intel_context, guc_capture_link);
+	list_del_init(&ce->guc_capture_link);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	engine = __context_to_physical_engine(ce);
 	intel_engine_set_hung_context(engine, ce);
 	with_intel_runtime_pm(&i915->runtime_pm, wakeref)
-		i915_capture_error_state(gt, engine->mask);
+		i915_capture_error_state(gt, ce->engine->mask);
+
+	if (!list_empty(&guc->submission_state.capture_list))
+		queue_work(system_unbound_wq,
+			   &guc->submission_state.capture_worker);
+}
+
+static void capture_error_state(struct intel_guc *guc,
+				struct intel_context *ce)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct drm_i915_private *i915 = gt->i915;
+	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	/*
+	 * Corner case where if resets happen faster than the error captures
+	 * (selftests, igts that intentionally cause hangs), suppress 2nd the
+	 * error capture.
+	 */
+	if (list_empty(&ce->guc_capture_link))
+		list_add_tail(&guc->submission_state.capture_list,
+			      &ce->guc_capture_link);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	/*
+	 * We do the error capture async as an error capture can allocate
+	 * memory, the reset path must flush the G2H handler in order to seal
+	 * several races, and the memory allocations depend on the reset path
+	 * (circular dependecy if error capture done here in the G2H handler).
+	 * Doing the error capture async should be ok, as (eventually) all
+	 * register state is captured in buffer by the GuC (i.e. it is ok to
+	 * restart the context before the error capture is complete).
+	 *
+	 * FIXME: Long term, fix the error capture to be able to allocate memory
+	 * in a no sleep context 1 page at time so we can do the error capture
+	 * immediately.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.capture_worker);
 	atomic_inc(&i915->gpu_error.reset_engine_count[engine->uabi_class]);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Intel-gfx] [PATCH 1/3] drm/i915/guc: Do error capture asynchronously
@ 2021-10-23 18:05   ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

An error capture allocates memory, memory allocations depend on resets,
and resets need to flush the G2H handlers to seal several races. If the
error capture is done from the G2H handler this creates a circular
dependency. To work around this, do a error capture in a work queue
asynchronously from the G2H handler. This should be fine as (eventually)
all register state is put into a buffer by the GuC so it is safe to
restart the context before the error capture is complete. At worst, a
full GT reset clobbers an error capture and some information is lost.

Example of lockdep splat below:

[  154.625989] ======================================================
[  154.632195] WARNING: possible circular locking dependency detected
[  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
[  154.643991] ------------------------------------------------------
[  154.650196] i915_selftest/1673 is trying to acquire lock:
[  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0 [  154.665826]
               but task is already holding lock:
[  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] [  154.680659]
               which lock already depends on the new lock.

[  154.688857]
               the existing dependency chain (in reverse order) is:
[  154.696365]
               -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
[  154.702571]        lock_acquire+0xd2/0x300
[  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
[  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
[  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
[  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
[  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
[  154.733891]        pci_device_probe+0x9b/0x110
[  154.738362]        really_probe+0x1a6/0x3a0
[  154.742568]        __driver_probe_device+0xf9/0x170
[  154.747468]        driver_probe_device+0x19/0x90
[  154.752114]        __driver_attach+0x99/0x170
[  154.756492]        bus_for_each_dev+0x73/0xc0
[  154.760870]        bus_add_driver+0x14b/0x1f0
[  154.765248]        driver_register+0x67/0xb0
[  154.769542]        i915_init+0x18/0x8c [i915]
[  154.773964]        do_one_initcall+0x53/0x2e0
[  154.778343]        do_init_module+0x56/0x210
[  154.782639]        load_module+0x25fc/0x29f0
[  154.786934]        __do_sys_finit_module+0xae/0x110
[  154.791835]        do_syscall_64+0x38/0xc0
[  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  154.801558]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  154.807241]        lock_acquire+0xd2/0x300
[  154.811361]        fs_reclaim_acquire+0x9e/0xd0
[  154.815914]        kmem_cache_alloc_trace+0x30/0x790
[  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
[  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
[  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
[  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
[  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
[  154.850898]        process_one_work+0x264/0x590
[  154.855451]        worker_thread+0x4b/0x3a0
[  154.859655]        kthread+0x147/0x170
[  154.863428]        ret_from_fork+0x1f/0x30
[  154.867548]
               -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
[  154.875747]        check_prev_add+0x90/0xc30
[  154.880042]        __lock_acquire+0x1643/0x2110
[  154.884595]        lock_acquire+0xd2/0x300
[  154.888715]        __flush_work+0x373/0x4d0
[  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
[  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
[  154.905166]        reset_prepare+0x55/0x60 [i915]
[  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
[  154.914984]        do_device_reset+0x13/0x20 [i915]
[  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
[  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
[  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
[  154.937428]        __run_selftests.cold+0x96/0xee [i915]
[  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
[  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
[  154.953076]        pci_device_probe+0x9b/0x110
[  154.957545]        really_probe+0x1a6/0x3a0
[  154.961749]        __driver_probe_device+0xf9/0x170
[  154.966653]        driver_probe_device+0x19/0x90
[  154.971290]        __driver_attach+0x99/0x170
[  154.975671]        bus_for_each_dev+0x73/0xc0
[  154.980053]        bus_add_driver+0x14b/0x1f0
[  154.984431]        driver_register+0x67/0xb0
[  154.988725]        i915_init+0x18/0x8c [i915]
[  154.993149]        do_one_initcall+0x53/0x2e0
[  154.997527]        do_init_module+0x56/0x210
[  155.001822]        load_module+0x25fc/0x29f0
[  155.006118]        __do_sys_finit_module+0xae/0x110
[  155.011019]        do_syscall_64+0x38/0xc0
[  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  155.020729]
               other info that might help us debug this:

[  155.028752] Chain exists of:
                 (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex

[  155.041294]  Possible unsafe locking scenario:

[  155.047240]        CPU0                    CPU1
[  155.051791]        ----                    ----
[  155.056344]   lock(&gt->reset.mutex);
[  155.060026]                                lock(fs_reclaim);
[  155.065706]                                lock(&gt->reset.mutex);
[  155.071912]   lock((work_completion)(&ct->requests.worker));
[  155.077595]
                *** DEADLOCK ***

v2:
  - Resolve ce within lock, requeue workqueue if list not empty
v3:
 (CI)
  - Only add ce to capture list if empty

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |  7 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 10 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 65 +++++++++++++++++--
 4 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 5634d14052bc..c8990d7c030a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -406,6 +406,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	INIT_LIST_HEAD(&ce->parallel.child_list);
 
+	INIT_LIST_HEAD(&ce->guc_capture_link);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 9e0177dc5484..ca714c866512 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -276,6 +276,13 @@ struct intel_context {
 		} guc;
 	} parallel;
 
+	/**
+	 * @guc_capture_link: in guc->submission_state.capture_list when an
+	 * error capture is pending on this context, protected by
+	 * guc->submission_state.lock
+	 */
+	struct list_head guc_capture_link;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 31cf9fb48c7e..540a567268ef 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -114,6 +114,16 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @capture_list: list of intel_context that need to capture
+		 * error state
+		 */
+		struct list_head capture_list;
+		/**
+		 * @capture_worker: worker to do error capture when the GuC
+		 * signals a context has been reset
+		 */
+		struct work_struct capture_worker;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d7710debcd47..9a5aeba033e0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -91,8 +91,8 @@
  * used for all of GuC submission but that could change in the future.
  *
  * guc->submission_state.lock
- * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
- * list.
+ * Global lock for GuC submission state. Protects guc_ids, destroyed contexts
+ * list, and capture list.
  *
  * ce->guc_state.lock
  * Protects everything under ce->guc_state. Ensures that a context is in the
@@ -1478,6 +1478,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 
 static void destroyed_worker_func(struct work_struct *w);
 
+static void capture_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1506,6 +1508,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_LIST_HEAD(&guc->submission_state.capture_list);
+	INIT_WORK(&guc->submission_state.capture_worker, capture_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
@@ -3652,17 +3656,66 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	return 0;
 }
 
-static void capture_error_state(struct intel_guc *guc,
-				struct intel_context *ce)
+static void capture_worker_func(struct work_struct *w)
 {
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.capture_worker);
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct drm_i915_private *i915 = gt->i915;
-	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+	unsigned long flags;
 	intel_wakeref_t wakeref;
 
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	ce = list_first_entry(&guc->submission_state.capture_list,
+			      struct intel_context, guc_capture_link);
+	list_del_init(&ce->guc_capture_link);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	engine = __context_to_physical_engine(ce);
 	intel_engine_set_hung_context(engine, ce);
 	with_intel_runtime_pm(&i915->runtime_pm, wakeref)
-		i915_capture_error_state(gt, engine->mask);
+		i915_capture_error_state(gt, ce->engine->mask);
+
+	if (!list_empty(&guc->submission_state.capture_list))
+		queue_work(system_unbound_wq,
+			   &guc->submission_state.capture_worker);
+}
+
+static void capture_error_state(struct intel_guc *guc,
+				struct intel_context *ce)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct drm_i915_private *i915 = gt->i915;
+	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	/*
+	 * Corner case where if resets happen faster than the error captures
+	 * (selftests, igts that intentionally cause hangs), suppress 2nd the
+	 * error capture.
+	 */
+	if (list_empty(&ce->guc_capture_link))
+		list_add_tail(&guc->submission_state.capture_list,
+			      &ce->guc_capture_link);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	/*
+	 * We do the error capture async as an error capture can allocate
+	 * memory, the reset path must flush the G2H handler in order to seal
+	 * several races, and the memory allocations depend on the reset path
+	 * (circular dependecy if error capture done here in the G2H handler).
+	 * Doing the error capture async should be ok, as (eventually) all
+	 * register state is captured in buffer by the GuC (i.e. it is ok to
+	 * restart the context before the error capture is complete).
+	 *
+	 * FIXME: Long term, fix the error capture to be able to allocate memory
+	 * in a no sleep context 1 page at time so we can do the error capture
+	 * immediately.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.capture_worker);
 	atomic_inc(&i915->gpu_error.reset_engine_count[engine->uabi_class]);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset
  2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
@ 2021-10-23 18:05   ` Matthew Brost
  -1 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

It isn't safe to scrub for missing G2H or continue with the reset until
all G2H processing is complete. Flush the G2H work queue during reset to
ensure it is done running. No need to call the IRQ handler directly
either as the scrubbing code can deal with any missing G2H.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9a5aeba033e0..5f584283631b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1128,8 +1128,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1145,21 +1143,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Intel-gfx] [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset
@ 2021-10-23 18:05   ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

It isn't safe to scrub for missing G2H or continue with the reset until
all G2H processing is complete. Flush the G2H work queue during reset to
ensure it is done running. No need to call the IRQ handler directly
either as the scrubbing code can deal with any missing G2H.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9a5aeba033e0..5f584283631b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1128,8 +1128,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1145,21 +1143,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] drm/i915/guc: Refcount context during error capture
  2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
@ 2021-10-23 18:05   ` Matthew Brost
  -1 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

From: John Harrison <John.C.Harrison@Intel.com>

When i915 receives a context reset notification from GuC, it triggers
an error capture before resetting any outstanding requsts of that
context. Unfortunately, the error capture is not a time bound
operation. In certain situations it can take a long time, particularly
when multiple large LMEM buffers must be read back and eoncoded. If
this delay is longer than other timeouts (heartbeat, test recovery,
etc.) then a full GT reset can be triggered in the middle.

That can result in the context being reset by GuC actually being
destroyed before the error capture completes and the GuC submission
code resumes. Thus, the GuC side can start dereferencing stale
pointers and Bad Things ensue.

So add a refcount get of the context during the entire reset
operation. That way, the context can't be destroyed part way through
no matter what other resets or user interactions occur.

v2:
 (Matthew Brost)
  - Update patch to work with async error capture

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5f584283631b..846c3dfcf146 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3662,6 +3662,8 @@ static void capture_worker_func(struct work_struct *w)
 	with_intel_runtime_pm(&i915->runtime_pm, wakeref)
 		i915_capture_error_state(gt, ce->engine->mask);
 
+	intel_context_put(ce);
+
 	if (!list_empty(&guc->submission_state.capture_list))
 		queue_work(system_unbound_wq,
 			   &guc->submission_state.capture_worker);
@@ -3711,7 +3713,7 @@ static void guc_context_replay(struct intel_context *ce)
 	tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
-static void guc_handle_context_reset(struct intel_guc *guc,
+static bool guc_handle_context_reset(struct intel_guc *guc,
 				     struct intel_context *ce)
 {
 	trace_intel_context_reset(ce);
@@ -3724,7 +3726,11 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
+
+		return false;
 	}
+
+	return true;
 }
 
 int intel_guc_context_reset_process_msg(struct intel_guc *guc,
@@ -3732,6 +3738,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	int desc_idx;
+	unsigned long flags;
 
 	if (unlikely(len != 1)) {
 		drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
@@ -3739,11 +3746,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 	}
 
 	desc_idx = msg[0];
+
+	/*
+	 * The context lookup uses the xarray but lookups only require an RCU lock
+	 * not the full spinlock. So take the lock explicitly and keep it until the
+	 * context has been reference count locked to ensure it can't be destroyed
+	 * asynchronously until the reset is done.
+	 */
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	ce = g2h_context_lookup(guc, desc_idx);
+	if (ce)
+		intel_context_get(ce);
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	if (unlikely(!ce))
 		return -EPROTO;
 
-	guc_handle_context_reset(guc, ce);
+	if (guc_handle_context_reset(guc, ce))
+		intel_context_put(ce);
 
 	return 0;
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Intel-gfx] [PATCH 3/3] drm/i915/guc: Refcount context during error capture
@ 2021-10-23 18:05   ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23 18:05 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

From: John Harrison <John.C.Harrison@Intel.com>

When i915 receives a context reset notification from GuC, it triggers
an error capture before resetting any outstanding requsts of that
context. Unfortunately, the error capture is not a time bound
operation. In certain situations it can take a long time, particularly
when multiple large LMEM buffers must be read back and eoncoded. If
this delay is longer than other timeouts (heartbeat, test recovery,
etc.) then a full GT reset can be triggered in the middle.

That can result in the context being reset by GuC actually being
destroyed before the error capture completes and the GuC submission
code resumes. Thus, the GuC side can start dereferencing stale
pointers and Bad Things ensue.

So add a refcount get of the context during the entire reset
operation. That way, the context can't be destroyed part way through
no matter what other resets or user interactions occur.

v2:
 (Matthew Brost)
  - Update patch to work with async error capture

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5f584283631b..846c3dfcf146 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3662,6 +3662,8 @@ static void capture_worker_func(struct work_struct *w)
 	with_intel_runtime_pm(&i915->runtime_pm, wakeref)
 		i915_capture_error_state(gt, ce->engine->mask);
 
+	intel_context_put(ce);
+
 	if (!list_empty(&guc->submission_state.capture_list))
 		queue_work(system_unbound_wq,
 			   &guc->submission_state.capture_worker);
@@ -3711,7 +3713,7 @@ static void guc_context_replay(struct intel_context *ce)
 	tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
-static void guc_handle_context_reset(struct intel_guc *guc,
+static bool guc_handle_context_reset(struct intel_guc *guc,
 				     struct intel_context *ce)
 {
 	trace_intel_context_reset(ce);
@@ -3724,7 +3726,11 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
+
+		return false;
 	}
+
+	return true;
 }
 
 int intel_guc_context_reset_process_msg(struct intel_guc *guc,
@@ -3732,6 +3738,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	int desc_idx;
+	unsigned long flags;
 
 	if (unlikely(len != 1)) {
 		drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
@@ -3739,11 +3746,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 	}
 
 	desc_idx = msg[0];
+
+	/*
+	 * The context lookup uses the xarray but lookups only require an RCU lock
+	 * not the full spinlock. So take the lock explicitly and keep it until the
+	 * context has been reference count locked to ensure it can't be destroyed
+	 * asynchronously until the reset is done.
+	 */
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	ce = g2h_context_lookup(guc, desc_idx);
+	if (ce)
+		intel_context_get(ce);
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	if (unlikely(!ce))
 		return -EPROTO;
 
-	guc_handle_context_reset(guc, ce);
+	if (guc_handle_context_reset(guc, ce))
+		intel_context_put(ce);
 
 	return 0;
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Do error capture async, flush G2H processing on reset
  2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
                   ` (3 preceding siblings ...)
  (?)
@ 2021-10-23 18:22 ` Patchwork
  -1 siblings, 0 replies; 12+ messages in thread
From: Patchwork @ 2021-10-23 18:22 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Do error capture async, flush G2H processing on reset
URL   : https://patchwork.freedesktop.org/series/96216/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216
+./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080)
+./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080)
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Do error capture async, flush G2H processing on reset
  2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
                   ` (4 preceding siblings ...)
  (?)
@ 2021-10-23 18:54 ` Patchwork
  -1 siblings, 0 replies; 12+ messages in thread
From: Patchwork @ 2021-10-23 18:54 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 7616 bytes --]

== Series Details ==

Series: Do error capture async, flush G2H processing on reset
URL   : https://patchwork.freedesktop.org/series/96216/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10782 -> Patchwork_21429
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21429 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21429, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/index.html

Participating hosts (38 -> 35)
------------------------------

  Additional (2): fi-tgl-1115g4 fi-pnv-d510 
  Missing    (5): bat-dg1-6 fi-hsw-4200u fi-bsw-cyan bat-adlp-4 fi-ctg-p8600 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21429:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@hangcheck:
    - fi-rkl-guc:         [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10782/fi-rkl-guc/igt@i915_selftest@live@hangcheck.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-rkl-guc/igt@i915_selftest@live@hangcheck.html

  
Known issues
------------

  Here are the changes found in Patchwork_21429 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@query-info:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][3] ([fdo#109315])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@amdgpu/amd_basic@query-info.html

  * igt@amdgpu/amd_cs_nop@nop-gfx0:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][4] ([fdo#109315] / [i915#2575]) +16 similar issues
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@amdgpu/amd_cs_nop@nop-gfx0.html

  * igt@amdgpu/amd_cs_nop@sync-fork-compute0:
    - fi-snb-2600:        NOTRUN -> [SKIP][5] ([fdo#109271]) +17 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-snb-2600/igt@amdgpu/amd_cs_nop@sync-fork-compute0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][6] ([i915#2190])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@gem_huc_copy@huc-copy.html

  * igt@i915_pm_backlight@basic-brightness:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][7] ([i915#1155])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@i915_pm_backlight@basic-brightness.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][8] ([fdo#111827]) +8 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][9] ([i915#4103]) +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_flip@basic-flip-vs-modeset@c-dp1:
    - fi-cfl-8109u:       [PASS][10] -> [FAIL][11] ([i915#4165])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10782/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-modeset@c-dp1.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-modeset@c-dp1.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][12] ([fdo#109285])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_frontbuffer_tracking@basic:
    - fi-cfl-8109u:       [PASS][13] -> [FAIL][14] ([i915#2546])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10782/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][15] ([i915#1072]) +3 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@kms_psr@primary_mmap_gtt.html

  * igt@prime_vgem@basic-userptr:
    - fi-pnv-d510:        NOTRUN -> [SKIP][16] ([fdo#109271]) +53 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-pnv-d510/igt@prime_vgem@basic-userptr.html
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][17] ([i915#3301])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-tgl-1115g4/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-bdw-5557u:       NOTRUN -> [FAIL][18] ([i915#1602] / [i915#2426] / [i915#4312])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-bdw-5557u/igt@runner@aborted.html
    - fi-rkl-guc:         NOTRUN -> [FAIL][19] ([i915#3928] / [i915#4312])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-rkl-guc/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@hangcheck:
    - fi-snb-2600:        [INCOMPLETE][20] ([i915#3921]) -> [PASS][21]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10782/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/fi-snb-2600/igt@i915_selftest@live@hangcheck.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1602]: https://gitlab.freedesktop.org/drm/intel/issues/1602
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2546]: https://gitlab.freedesktop.org/drm/intel/issues/2546
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#3928]: https://gitlab.freedesktop.org/drm/intel/issues/3928
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4165]: https://gitlab.freedesktop.org/drm/intel/issues/4165
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312


Build changes
-------------

  * Linux: CI_DRM_10782 -> Patchwork_21429

  CI-20190529: 20190529
  CI_DRM_10782: 6eff63a9b932a4aa1e1f6e521cd919aaf57c058f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6259: 89629f64da9f12b144f913865b08d2c9efcd10d7 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21429: 6e3989feb5a49e7c2be536bd79978860830dafb9 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

6e3989feb5a4 drm/i915/guc: Refcount context during error capture
5e336f70a919 drm/i915/guc: Flush G2H work queue during reset
d00a78b738bb drm/i915/guc: Do error capture asynchronously

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21429/index.html

[-- Attachment #2: Type: text/html, Size: 8856 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset
  2021-10-23  0:39 [PATCH 0/3] " Matthew Brost
@ 2021-10-23  0:39 ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-23  0:39 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

It isn't safe to scrub for missing G2H or continue with the reset until
all G2H processing is complete. Flush the G2H work queue during reset to
ensure it is done running. No need to call the IRQ handler directly
either as the scrubbing code can deal with any missing G2H.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 57c07244b3c5..8513003f4820 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1128,8 +1128,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1145,21 +1143,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset
  2021-10-22  1:29 [PATCH 0/3] Do error capture async, flush G2H processing on reset Matthew Brost
@ 2021-10-22  1:29 ` Matthew Brost
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Brost @ 2021-10-22  1:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

It isn't safe to scrub for missing G2H or continue with the reset until
all G2H processing is complete. Flush the G2H work queue during reset to
ensure it is done running. No need to call the IRQ handler directly
either as the scrubbing code can deal with any missing G2H.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 57c07244b3c5..8513003f4820 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1128,8 +1128,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1145,21 +1143,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-10-23 18:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-23 18:05 [PATCH 0/3] Do error capture async, flush G2H processing on reset Matthew Brost
2021-10-23 18:05 ` [Intel-gfx] " Matthew Brost
2021-10-23 18:05 ` [PATCH 1/3] drm/i915/guc: Do error capture asynchronously Matthew Brost
2021-10-23 18:05   ` [Intel-gfx] " Matthew Brost
2021-10-23 18:05 ` [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
2021-10-23 18:05   ` [Intel-gfx] " Matthew Brost
2021-10-23 18:05 ` [PATCH 3/3] drm/i915/guc: Refcount context during error capture Matthew Brost
2021-10-23 18:05   ` [Intel-gfx] " Matthew Brost
2021-10-23 18:22 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Do error capture async, flush G2H processing on reset Patchwork
2021-10-23 18:54 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2021-10-23  0:39 [PATCH 0/3] " Matthew Brost
2021-10-23  0:39 ` [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
2021-10-22  1:29 [PATCH 0/3] Do error capture async, flush G2H processing on reset Matthew Brost
2021-10-22  1:29 ` [PATCH 2/3] drm/i915/guc: Flush G2H work queue during reset Matthew Brost

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.