stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32
@ 2020-08-26 13:27 Chris Wilson
  2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Harald Arnesen, stable

On 32b, highmem uses a finite set of indirect PTE (i.e. vmap) to provide
virtual mappings of the high pages. As these are finite, map_new_virtual()
must wait for some other kmap() to finish when it runs out. If we map a
large number of objects, there is no method for it to tell us to release
the mappings, and we deadlock.

However, if we make an explicit vmap of the page, that uses a larger
vmalloc arena, and also has the ability to tell us to release unwanted
mappings. Most importantly, it will fail and propagate an error instead
of waiting forever.

Fixes: fb8621d3bee8 ("drm/i915: Avoid allocating a vmap arena for a single page") #x86-32
References: e87666b52f00 ("drm/i915/shrinker: Hook up vmap allocation failure notifier")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Harald Arnesen <harald@skogtun.org>
Cc: <stable@vger.kernel.org> # v4.7+
---
 drivers/gpu/drm/i915/gem/i915_gem_pages.c | 26 +++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 7050519c87a4..51b63e05dbe4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -255,8 +255,30 @@ static void *i915_gem_object_map(struct drm_i915_gem_object *obj,
 		return NULL;
 
 	/* A single page can always be kmapped */
-	if (n_pte == 1 && type == I915_MAP_WB)
-		return kmap(sg_page(sgt->sgl));
+	if (n_pte == 1 && type == I915_MAP_WB) {
+		struct page *page = sg_page(sgt->sgl);
+
+		/*
+		 * On 32b, highmem uses a finite set of indirect PTE (i.e.
+		 * vmap) to provide virtual mappings of the high pages.
+		 * As these are finite, map_new_virtual() must wait for some
+		 * other kmap() to finish when it runs out. If we map a large
+		 * number of objects, there is no method for it to tell us
+		 * to release the mappings, and we deadlock.
+		 *
+		 * However, if we make an explicit vmap of the page, that
+		 * uses a larger vmalloc arena, and also has the ability
+		 * to tell us to release unwanted mappings. Most importantly,
+		 * it will fail and propagate an error instead of waiting
+		 * forever.
+		 *
+		 * So if the page is beyond the 32b boundary, make an explicit
+		 * vmap. On 64b, this check will be optimised away as we can
+		 * directly kmap any page on the system.
+		 */
+		if (!PageHighMem(page))
+			return kmap(page);
+	}
 
 	mem = stack;
 	if (n_pte > ARRAY_SIZE(stack)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
@ 2020-08-26 13:27 ` Chris Wilson
  2020-08-26 13:27 ` [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Matthew Auld, stable

Let's not try and use PAT attributes for I915_MAP_WC is the CPU doesn't
support PAT.

Fixes: 6056e50033d9 ("drm/i915/gem: Support discontiguous lmem object maps")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
---
 drivers/gpu/drm/i915/gem/i915_gem_pages.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 0c3d0d6429ae..d729f6d05b97 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -265,6 +265,9 @@ static void *i915_gem_object_map(struct drm_i915_gem_object *obj,
 	if (!i915_gem_object_has_struct_page(obj) && type != I915_MAP_WC)
 		return NULL;
 
+	if (GEM_WARN_ON(type == I915_MAP_WC && !boot_cpu_has(X86_FEATURE_PAT)))
+		return NULL;
+
 	/* A single page can always be kmapped */
 	if (n_pte == 1 && type == I915_MAP_WB) {
 		struct page *page = sg_page(sgt->sgl);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
  2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson
@ 2020-08-26 13:27 ` Chris Wilson
  2020-08-28 14:04   ` Mika Kuoppala
  2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Bruce Chang, Mika Kuoppala, stable

On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
Forcibly evict stale csb entries") where, presumably, due to a missing
Global Observation Point synchronisation, the write pointer of the CSB
ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
we see the GPU report more context-switch entries for us to parse, but
those entries have not been written, leading us to process stale events,
and eventually report a hung GPU.

However, this effect appears to be much more severe than we previously
saw on Icelake (though it might be best if we try the same approach
there as well and measure), and Bruce suggested the good idea of resetting
the CSB entry after use so that we can detect when it has been updated by
the GPU. By instrumenting how long that may be, we can set a reliable
upper bound for how long we should wait for:

    513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index d6e0f62337b4..d75712a503b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* HSD#22011248461 */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;
 
 	/*
@@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */
 
+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
  2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson
  2020-08-26 13:27 ` [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
@ 2020-08-26 13:27 ` Chris Wilson
  2020-08-29  9:30   ` [Intel-gfx] " Mika Kuoppala
  2020-08-26 13:27 ` [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

We only allow persistent requests to remain on the GPU past the closure
of their containing context (and process) so long as they are continuously
checked for hangs or allow other requests to preempt them, as we need to
ensure forward progress of the system. If we allow persistent contexts
to remain on the system after the the hangcheck mechanism is disabled,
the system may grind to a halt. On disabling the mechanism, we sent a
pulse along the engine to remove all executing contexts from the engine
which would check for hung contexts -- but we did not prevent those
contexts from being resubmitted if they survived the final hangcheck.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-stop
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
 drivers/gpu/drm/i915/i915_request.c    | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 08e2c000dcc3..7c3a1012e702 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+static inline bool
+intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
+{
+	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
+		return false;
+
+	return READ_ONCE(engine->props.heartbeat_interval_ms);
+}
+
 #endif /* _INTEL_RINGBUFFER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index a931b8b571d1..c187e1ec0278 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
 	if (i915_request_completed(request))
 		goto xfer;
 
+	if (unlikely(intel_context_is_closed(request->context) &&
+		     !intel_engine_has_heartbeat(engine)))
+		intel_context_set_banned(request->context);
+
 	if (unlikely(intel_context_is_banned(request->context)))
 		i915_request_set_error_once(request, -EIO);
+
 	if (unlikely(fatal_error(request->fence.error)))
 		__i915_request_skip(request);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
                   ` (2 preceding siblings ...)
  2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
@ 2020-08-26 13:27 ` Chris Wilson
  2020-08-26 13:27 ` [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context Chris Wilson
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

Currently, we check we can send a pulse prior to disabling the
heartbeat to verify that we can change the heartbeat, but since we may
re-evaluate execution upon changing the heartbeat interval we need another
pulse afterwards to refresh execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 8ffdf676c0a0..d09df370f7cd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
 	WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
 
 	if (intel_engine_pm_get_if_awake(engine)) {
-		if (delay)
+		if (delay) {
 			intel_engine_unpark_heartbeat(engine);
-		else
+		} else {
 			intel_engine_park_heartbeat(engine);
+			intel_engine_pulse(engine); /* recheck execution */
+		}
 		intel_engine_pm_put(engine);
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
                   ` (3 preceding siblings ...)
  2020-08-26 13:27 ` [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson
@ 2020-08-26 13:27 ` Chris Wilson
  2020-08-26 13:28 ` [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched() Chris Wilson
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

Verify that if a context is active at the time it is closed, that it is
either persistent and preemptible (with hangcheck running) or it shall
be removed from execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-close
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index a548626fa8bc..4fd38101bb56 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
 	return rcu_dereference_protected(ctx->engines, true);
 }
 
-static bool __reset_engine(struct intel_engine_cs *engine)
-{
-	struct intel_gt *gt = engine->gt;
-	bool success = false;
-
-	if (!intel_has_reset_engine(gt))
-		return false;
-
-	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
-			      &gt->reset.flags)) {
-		success = intel_engine_reset(engine, NULL) == 0;
-		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
-				      &gt->reset.flags);
-	}
-
-	return success;
-}
-
 static void __reset_context(struct i915_gem_context *ctx,
 			    struct intel_engine_cs *engine)
 {
@@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
 	 * kill the banned context, we fallback to doing a local reset
 	 * instead.
 	 */
-	if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
-	    !intel_engine_pulse(engine))
-		return true;
-
-	/* If we are unable to send a pulse, try resetting this engine. */
-	return __reset_engine(engine);
+	return intel_engine_pulse(engine) == 0;
 }
 
 static bool
@@ -506,7 +483,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 	return engine;
 }
 
-static void kill_engines(struct i915_gem_engines *engines)
+static void kill_engines(struct i915_gem_engines *engines, bool ban)
 {
 	struct i915_gem_engines_iter it;
 	struct intel_context *ce;
@@ -521,7 +498,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 	for_each_gem_engine(ce, engines, it) {
 		struct intel_engine_cs *engine;
 
-		if (intel_context_set_banned(ce))
+		if (ban && intel_context_set_banned(ce))
 			continue;
 
 		/*
@@ -534,7 +511,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 		engine = active_engine(ce);
 
 		/* First attempt to gracefully cancel the context */
-		if (engine && !__cancel_engine(engine))
+		if (engine && !__cancel_engine(engine) && ban)
 			/*
 			 * If we are unable to send a preemptive pulse to bump
 			 * the context from the GPU, we have to resort to a full
@@ -544,8 +521,10 @@ static void kill_engines(struct i915_gem_engines *engines)
 	}
 }
 
-static void kill_stale_engines(struct i915_gem_context *ctx)
+static void kill_context(struct i915_gem_context *ctx)
 {
+	bool ban = (!i915_gem_context_is_persistent(ctx) ||
+		    !ctx->i915->params.enable_hangcheck);
 	struct i915_gem_engines *pos, *next;
 
 	spin_lock_irq(&ctx->stale.lock);
@@ -558,7 +537,7 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 
 		spin_unlock_irq(&ctx->stale.lock);
 
-		kill_engines(pos);
+		kill_engines(pos, ban);
 
 		spin_lock_irq(&ctx->stale.lock);
 		GEM_BUG_ON(i915_sw_fence_signaled(&pos->fence));
@@ -570,11 +549,6 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 	spin_unlock_irq(&ctx->stale.lock);
 }
 
-static void kill_context(struct i915_gem_context *ctx)
-{
-	kill_stale_engines(ctx);
-}
-
 static void engines_idle_release(struct i915_gem_context *ctx,
 				 struct i915_gem_engines *engines)
 {
@@ -609,7 +583,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
 
 kill:
 	if (list_empty(&engines->link)) /* raced, already closed */
-		kill_engines(engines);
+		kill_engines(engines, true);
 
 	i915_sw_fence_commit(&engines->fence);
 }
@@ -667,9 +641,7 @@ static void context_close(struct i915_gem_context *ctx)
 	 * case we opt to forcibly kill off all remaining requests on
 	 * context close.
 	 */
-	if (!i915_gem_context_is_persistent(ctx) ||
-	    !ctx->i915->params.enable_hangcheck)
-		kill_context(ctx);
+	kill_context(ctx);
 
 	i915_gem_context_put(ctx);
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched()
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
                   ` (4 preceding siblings ...)
  2020-08-26 13:27 ` [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context Chris Wilson
@ 2020-08-26 13:28 ` Chris Wilson
  2020-08-26 14:20 ` [Intel-gfx] [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Matthew Auld
  2020-08-26 20:43 ` Harald Arnesen
  7 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2020-08-26 13:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Mika Kuoppala, stable

As the error capture will compress user buffers as directed to by the
user, it can take an arbitrary amount of time and space. Break up the
compression loops with a call to cond_resched(), that will allow other
processes to schedule (avoiding the soft lockups) and also serve as a
warning should we try to make this loop atomic in the future.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2d76f3ddc571..70274fa0664d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -311,6 +311,8 @@ static int compress_page(struct i915_vma_compress *c,
 
 		if (zlib_deflate(zstream, Z_NO_FLUSH) != Z_OK)
 			return -EIO;
+
+		cond_resched();
 	} while (zstream->avail_in);
 
 	/* Fallback to uncompressed if we increase size? */
@@ -397,6 +399,7 @@ static int compress_page(struct i915_vma_compress *c,
 	if (!(wc && i915_memcpy_from_wc(ptr, src, PAGE_SIZE)))
 		memcpy(ptr, src, PAGE_SIZE);
 	dst->pages[dst->page_count++] = ptr;
+	cond_resched();
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Intel-gfx] [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
                   ` (5 preceding siblings ...)
  2020-08-26 13:28 ` [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched() Chris Wilson
@ 2020-08-26 14:20 ` Matthew Auld
  2020-08-26 20:43 ` Harald Arnesen
  7 siblings, 0 replies; 11+ messages in thread
From: Matthew Auld @ 2020-08-26 14:20 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Intel Graphics Development, Harald Arnesen, stable

On Wed, 26 Aug 2020 at 14:29, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> On 32b, highmem uses a finite set of indirect PTE (i.e. vmap) to provide
> virtual mappings of the high pages. As these are finite, map_new_virtual()
> must wait for some other kmap() to finish when it runs out. If we map a
> large number of objects, there is no method for it to tell us to release
> the mappings, and we deadlock.
>
> However, if we make an explicit vmap of the page, that uses a larger
> vmalloc arena, and also has the ability to tell us to release unwanted
> mappings. Most importantly, it will fail and propagate an error instead
> of waiting forever.
>
> Fixes: fb8621d3bee8 ("drm/i915: Avoid allocating a vmap arena for a single page") #x86-32
> References: e87666b52f00 ("drm/i915/shrinker: Hook up vmap allocation failure notifier")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Harald Arnesen <harald@skogtun.org>
> Cc: <stable@vger.kernel.org> # v4.7+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32
  2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
                   ` (6 preceding siblings ...)
  2020-08-26 14:20 ` [Intel-gfx] [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Matthew Auld
@ 2020-08-26 20:43 ` Harald Arnesen
  7 siblings, 0 replies; 11+ messages in thread
From: Harald Arnesen @ 2020-08-26 20:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable

Chris Wilson [26.08.2020 15:27]:

> On 32b, highmem uses a finite set of indirect PTE (i.e. vmap) to provide
> virtual mappings of the high pages. As these are finite, map_new_virtual()
> must wait for some other kmap() to finish when it runs out. If we map a
> large number of objects, there is no method for it to tell us to release
> the mappings, and we deadlock.
> 
> However, if we make an explicit vmap of the page, that uses a larger
> vmalloc arena, and also has the ability to tell us to release unwanted
> mappings. Most importantly, it will fail and propagate an error instead
> of waiting forever.
> 
> Fixes: fb8621d3bee8 ("drm/i915: Avoid allocating a vmap arena for a single page") #x86-32
> References: e87666b52f00 ("drm/i915/shrinker: Hook up vmap allocation failure notifier")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Harald Arnesen <harald@skogtun.org>
> Cc: <stable@vger.kernel.org> # v4.7+

Sorry, doesn't help on my machine (Thinkpad T520).
-- 
Hilsen Harald

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake
  2020-08-26 13:27 ` [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
@ 2020-08-28 14:04   ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2020-08-28 14:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Chris Wilson, Bruce Chang, stable

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
> Forcibly evict stale csb entries") where, presumably, due to a missing
> Global Observation Point synchronisation, the write pointer of the CSB
> ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
> we see the GPU report more context-switch entries for us to parse, but
> those entries have not been written, leading us to process stale events,
> and eventually report a hung GPU.
>
> However, this effect appears to be much more severe than we previously
> saw on Icelake (though it might be best if we try the same approach
> there as well and measure), and Bruce suggested the good idea of resetting
> the CSB entry after use so that we can detect when it has been updated by
> the GPU. By instrumenting how long that may be, we can set a reliable
> upper bound for how long we should wait for:
>
>     513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
>
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")

References: HSDES#1508287568

> Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: stable@vger.kernel.org # v5.4
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index d6e0f62337b4..d75712a503b7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
>   */
>  static inline bool gen12_csb_parse(const u64 *csb)
>  {
> -	u64 entry = READ_ONCE(*csb);
> -	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
> -	bool new_queue =
> +	bool ctx_away_valid;
> +	bool new_queue;
> +	u64 entry;
> +
> +	/* HSD#22011248461 */

s/220011248461/1508287568

> +	entry = READ_ONCE(*csb);
> +	if (unlikely(entry == -1)) {
> +		preempt_disable();

As this seems to rather rare, consider falling back to mmio read
for this entry without the poll?

-Mika

> +		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
> +			GEM_WARN_ON("50us CSB timeout");
> +		preempt_enable();
> +	}
> +	WRITE_ONCE(*(u64 *)csb, -1);
> +
> +	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
> +	new_queue =
>  		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;
>  
>  	/*
> @@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
>  	WRITE_ONCE(*execlists->csb_write, reset_value);
>  	wmb(); /* Make sure this is visible to HW (paranoia?) */
>  
> +	/* Check that the GPU does indeed update the CSB entries! */
> +	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
>  	invalidate_csb_entries(&execlists->csb_status[0],
>  			       &execlists->csb_status[reset_value]);
>  
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Intel-gfx] [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
  2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
@ 2020-08-29  9:30   ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2020-08-29  9:30 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable, Chris Wilson

Chris Wilson <chris@chris-wilson.co.uk> writes:

> We only allow persistent requests to remain on the GPU past the closure
> of their containing context (and process) so long as they are continuously
> checked for hangs or allow other requests to preempt them, as we need to
> ensure forward progress of the system. If we allow persistent contexts
> to remain on the system after the the hangcheck mechanism is disabled,
> the system may grind to a halt. On disabling the mechanism, we sent a
> pulse along the engine to remove all executing contexts from the engine
> which would check for hung contexts -- but we did not prevent those
> contexts from being resubmitted if they survived the final hangcheck.
>
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-stop
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
>  drivers/gpu/drm/i915/i915_request.c    | 5 +++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 08e2c000dcc3..7c3a1012e702 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>  	return intel_engine_has_preemption(engine);
>  }
>  
> +static inline bool
> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> +{
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> +		return false;
> +
> +	return READ_ONCE(engine->props.heartbeat_interval_ms);
> +}
> +
>  #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index a931b8b571d1..c187e1ec0278 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
>  	if (i915_request_completed(request))
>  		goto xfer;
>  
> +	if (unlikely(intel_context_is_closed(request->context) &&
> +		     !intel_engine_has_heartbeat(engine)))
> +		intel_context_set_banned(request->context);
> +
>  	if (unlikely(intel_context_is_banned(request->context)))
>  		i915_request_set_error_once(request, -EIO);
> +
>  	if (unlikely(fatal_error(request->fence.error)))
>  		__i915_request_skip(request);
>  
> -- 
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-08-29  9:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson
2020-08-26 13:27 ` [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
2020-08-28 14:04   ` Mika Kuoppala
2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
2020-08-29  9:30   ` [Intel-gfx] " Mika Kuoppala
2020-08-26 13:27 ` [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson
2020-08-26 13:27 ` [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context Chris Wilson
2020-08-26 13:28 ` [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched() Chris Wilson
2020-08-26 14:20 ` [Intel-gfx] [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Matthew Auld
2020-08-26 20:43 ` Harald Arnesen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).