dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Don't wait forever in drop_caches
@ 2022-11-03  1:35 John.C.Harrison
  0 siblings, 0 replies; 3+ messages in thread
From: John.C.Harrison @ 2022-11-03  1:35 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

At the end of each test, IGT does a drop caches call via debugfs with
special flags set. One of the possible paths waits for idle with an
infinite timeout. That causes problems for debugging issues when CI
catches a "can't go idle" test failure. Best case, the CI system times
out (after 90s), attempts a bunch of state dump actions and then
reboots the system to recover it. Worst case, the CI system can't do
anything at all and then times out (after 1000s) and simply reboots.
Sometimes a serial port log of dmesg might be available, sometimes not.

So rather than making life hard for ourselves, change the timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue with a
working system (if possible).

v2: Rationalise timeout defines (review feedback from Jani & Tvrtko).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 10 ++++++++--
 drivers/gpu/drm/i915/i915_drv.h     |  2 --
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..a224584ea4eb1 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -621,6 +621,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
 			i915_perf_noa_delay_set,
 			"%llu\n");
 
+#define DROPCACHE_IDLE_ENGINES_TIMEOUT_MS	200
+#define DROPCACHE_IDLE_GT_TIMEOUT		(HZ * 10)
+
 #define DROP_UNBOUND	BIT(0)
 #define DROP_BOUND	BIT(1)
 #define DROP_RETIRE	BIT(2)
@@ -641,6 +644,7 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
 		  DROP_RESET_ACTIVE | \
 		  DROP_RESET_SEQNO | \
 		  DROP_RCU)
+
 static int
 i915_drop_caches_get(void *data, u64 *val)
 {
@@ -654,14 +658,16 @@ gt_drop_caches(struct intel_gt *gt, u64 val)
 	int ret;
 
 	if (val & DROP_RESET_ACTIVE &&
-	    wait_for(intel_engines_are_idle(gt), I915_IDLE_ENGINES_TIMEOUT))
+	    wait_for(intel_engines_are_idle(gt), DROPCACHE_IDLE_ENGINES_TIMEOUT_MS))
 		intel_gt_set_wedged(gt);
 
 	if (val & DROP_RETIRE)
 		intel_gt_retire_requests(gt);
 
 	if (val & (DROP_IDLE | DROP_ACTIVE)) {
-		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
+		ret = intel_gt_wait_for_idle(gt, DROPCACHE_IDLE_GT_TIMEOUT);
+		if (ret == -ETIME)
+			intel_gt_set_wedged(gt);
 		if (ret)
 			return ret;
 	}
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 05b3300cc4edf..4c2adaad8e9ed 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -162,8 +162,6 @@ struct i915_gem_mm {
 	u32 shrink_count;
 };
 
-#define I915_IDLE_ENGINES_TIMEOUT (200) /* in ms */
-
 unsigned long i915_fence_context_timeout(const struct drm_i915_private *i915,
 					 u64 context);
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/i915: Don't wait forever in drop_caches
  2022-11-01 23:50 John.C.Harrison
@ 2022-11-02 12:12 ` Jani Nikula
  0 siblings, 0 replies; 3+ messages in thread
From: Jani Nikula @ 2022-11-02 12:12 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel, John Harrison

On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> At the end of each test, IGT does a drop caches call via sysfs with

sysfs?

> special flags set. One of the possible paths waits for idle with an
> infinite timeout. That causes problems for debugging issues when CI
> catches a "can't go idle" test failure. Best case, the CI system times
> out (after 90s), attempts a bunch of state dump actions and then
> reboots the system to recover it. Worst case, the CI system can't do
> anything at all and then times out (after 1000s) and simply reboots.
> Sometimes a serial port log of dmesg might be available, sometimes not.
>
> So rather than making life hard for ourselves, change the timeout to
> be 10s rather than infinite. Also, trigger the standard
> wedge/reset/recover sequence so that testing can continue with a
> working system (if possible).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index ae987e92251dd..9d916fbbfc27c 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>  		  DROP_RESET_ACTIVE | \
>  		  DROP_RESET_SEQNO | \
>  		  DROP_RCU)
> +
> +#define DROP_IDLE_TIMEOUT	(HZ * 10)

I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
here.

I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
gt/intel_gt.c.

I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.

I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.

My head spins.


BR,
Jani.


> +
>  static int
>  i915_drop_caches_get(void *data, u64 *val)
>  {
> @@ -661,7 +664,9 @@ gt_drop_caches(struct intel_gt *gt, u64 val)
>  		intel_gt_retire_requests(gt);
>  
>  	if (val & (DROP_IDLE | DROP_ACTIVE)) {
> -		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
> +		ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
> +		if (ret == -ETIME)
> +			intel_gt_set_wedged(gt);
>  		if (ret)
>  			return ret;
>  	}

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH] drm/i915: Don't wait forever in drop_caches
@ 2022-11-01 23:50 John.C.Harrison
  2022-11-02 12:12 ` Jani Nikula
  0 siblings, 1 reply; 3+ messages in thread
From: John.C.Harrison @ 2022-11-01 23:50 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

At the end of each test, IGT does a drop caches call via sysfs with
special flags set. One of the possible paths waits for idle with an
infinite timeout. That causes problems for debugging issues when CI
catches a "can't go idle" test failure. Best case, the CI system times
out (after 90s), attempts a bunch of state dump actions and then
reboots the system to recover it. Worst case, the CI system can't do
anything at all and then times out (after 1000s) and simply reboots.
Sometimes a serial port log of dmesg might be available, sometimes not.

So rather than making life hard for ourselves, change the timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue with a
working system (if possible).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..9d916fbbfc27c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
 		  DROP_RESET_ACTIVE | \
 		  DROP_RESET_SEQNO | \
 		  DROP_RCU)
+
+#define DROP_IDLE_TIMEOUT	(HZ * 10)
+
 static int
 i915_drop_caches_get(void *data, u64 *val)
 {
@@ -661,7 +664,9 @@ gt_drop_caches(struct intel_gt *gt, u64 val)
 		intel_gt_retire_requests(gt);
 
 	if (val & (DROP_IDLE | DROP_ACTIVE)) {
-		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
+		ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
+		if (ret == -ETIME)
+			intel_gt_set_wedged(gt);
 		if (ret)
 			return ret;
 	}
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-11-03  1:33 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-03  1:35 [PATCH] drm/i915: Don't wait forever in drop_caches John.C.Harrison
  -- strict thread matches above, loose matches on Subject: below --
2022-11-01 23:50 John.C.Harrison
2022-11-02 12:12 ` Jani Nikula

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).