All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>,
	Bruce Chang <yu.bruce.chang@intel.com>,
	Mika Kuoppala <mika.kuoppala@linux.intel.com>,
	stable@vger.kernel.org
Subject: [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake
Date: Wed, 26 Aug 2020 14:27:38 +0100	[thread overview]
Message-ID: <20200826132811.17577-6-chris@chris-wilson.co.uk> (raw)
In-Reply-To: <20200826132811.17577-1-chris@chris-wilson.co.uk>

On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
Forcibly evict stale csb entries") where, presumably, due to a missing
Global Observation Point synchronisation, the write pointer of the CSB
ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
we see the GPU report more context-switch entries for us to parse, but
those entries have not been written, leading us to process stale events,
and eventually report a hung GPU.

However, this effect appears to be much more severe than we previously
saw on Icelake (though it might be best if we try the same approach
there as well and measure), and Bruce suggested the good idea of resetting
the CSB entry after use so that we can detect when it has been updated by
the GPU. By instrumenting how long that may be, we can set a reliable
upper bound for how long we should wait for:

    513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index d6e0f62337b4..d75712a503b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* HSD#22011248461 */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;
 
 	/*
@@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */
 
+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
 
-- 
2.20.1


WARNING: multiple messages have this Message-ID (diff)
From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Cc: stable@vger.kernel.org, Chris Wilson <chris@chris-wilson.co.uk>
Subject: [Intel-gfx] [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake
Date: Wed, 26 Aug 2020 14:27:38 +0100	[thread overview]
Message-ID: <20200826132811.17577-6-chris@chris-wilson.co.uk> (raw)
In-Reply-To: <20200826132811.17577-1-chris@chris-wilson.co.uk>

On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
Forcibly evict stale csb entries") where, presumably, due to a missing
Global Observation Point synchronisation, the write pointer of the CSB
ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
we see the GPU report more context-switch entries for us to parse, but
those entries have not been written, leading us to process stale events,
and eventually report a hung GPU.

However, this effect appears to be much more severe than we previously
saw on Icelake (though it might be best if we try the same approach
there as well and measure), and Bruce suggested the good idea of resetting
the CSB entry after use so that we can detect when it has been updated by
the GPU. By instrumenting how long that may be, we can set a reliable
upper bound for how long we should wait for:

    513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index d6e0f62337b4..d75712a503b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* HSD#22011248461 */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;
 
 	/*
@@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */
 
+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2020-08-26 13:28 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 02/39] drm/i915/gem: Use set_pte_at() for assigning the vmapped PTE Chris Wilson
2020-08-26 16:36   ` Matthew Auld
2020-08-26 16:55     ` Chris Wilson
2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson
2020-08-26 13:27   ` [Intel-gfx] " Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 04/39] drm/i915/gt: Clear the buffer pool age before use Chris Wilson
2020-08-26 17:15   ` Matthew Auld
2020-08-26 13:27 ` [Intel-gfx] [PATCH 05/39] drm/i915/gt: Widen CSB pointer to u64 for the parsers Chris Wilson
2020-08-26 13:27 ` Chris Wilson [this message]
2020-08-26 13:27   ` [Intel-gfx] [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
2020-08-28 14:04   ` Mika Kuoppala
2020-08-28 14:04     ` [Intel-gfx] " Mika Kuoppala
2020-08-26 13:27 ` [Intel-gfx] [PATCH 07/39] drm/i915/gt: Apply the CSB w/a for all Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 08/39] drm/i915/gt: Show engine properties in the pretty printer Chris Wilson
2020-08-27 12:25   ` Mika Kuoppala
2020-08-26 13:27 ` [Intel-gfx] [PATCH 09/39] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
2020-08-26 13:27   ` [Intel-gfx] " Chris Wilson
2020-08-29  9:30   ` Mika Kuoppala
2020-08-29  9:30     ` Mika Kuoppala
2020-08-26 13:27 ` [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson
2020-08-26 13:27   ` [Intel-gfx] " Chris Wilson
2020-08-26 13:27 ` [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context Chris Wilson
2020-08-26 13:27   ` [Intel-gfx] " Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 13/39] drm/i915/gt: Signal cancelled requests Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 14/39] drm/i915/selftests: Finish pending mock requests on cancellation Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 15/39] drm/i915/gt: Retire cancelled requests on unload Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 16/39] drm/i915/gt: Remove defunct intel_virtual_engine_get_sibling() Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 17/39] drm/i915/gt: Defer enabling the breadcrumb interrupt to after submission Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 18/39] drm/i915/gt: Track signaled breadcrumbs outside of the breadcrumb spinlock Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 19/39] drm/i915/gt: Don't cancel the interrupt shadow too early Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 20/39] drm/i915/gt: Free stale request on destroying the virtual engine Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 21/39] drm/i915/gt: Protect context lifetime with RCU Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 22/39] drm/i915/gt: Split the breadcrumb spinlock between global and contexts Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 23/39] drm/i915/gt: Move the breadcrumb to the signaler if completed upon cancel Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 24/39] drm/i915/gt: Decouple completed requests on unwind Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 25/39] drm/i915/gt: Check for a completed last request once Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 26/39] drm/i915/gt: Replace direct submit with direct call to tasklet Chris Wilson
2020-08-26 13:27 ` [Intel-gfx] [PATCH 27/39] drm/i915/gt: ce->inflight updates are now serialised Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 28/39] drm/i915/gt: Use virtual_engine during execlists_dequeue Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 29/39] drm/i915/gt: Decouple inflight virtual engines Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 30/39] drm/i915/gt: Defer schedule_out until after the next dequeue Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 31/39] drm/i915/gt: Remove virtual breadcrumb before transfer Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 32/39] drm/i915/gt: Shrink the critical section for irq signaling Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 33/39] drm/i915/gt: Resubmit the virtual engine on schedule-out Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 34/39] drm/i915/gt: Simplify virtual engine handling for execlists_hold() Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 35/39] drm/i915: Encode fence specific waitqueue behaviour into the wait.flags Chris Wilson
2020-09-02 14:02   ` Thomas Hellström (Intel)
2020-09-03  9:50     ` Thomas Hellström (Intel)
2020-09-03 11:32       ` Chris Wilson
2020-09-03 12:08         ` Thomas Hellström (Intel)
2020-08-26 13:28 ` [Intel-gfx] [PATCH 36/39] drm/i915/selftests: Confirm RING_TIMESTAMP / CTX_TIMESTAMP share a clock Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 37/39] drm/i915/gt: Consolidate the CS timestamp clocks Chris Wilson
2020-08-26 13:28 ` [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched() Chris Wilson
2020-08-26 13:28   ` [Intel-gfx] " Chris Wilson
2020-08-26 13:28 ` [Intel-gfx] [PATCH 39/39] drm/i915: Reduce GPU error capture mutex hold time Chris Wilson
2020-08-26 13:53 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Patchwork
2020-08-26 13:54 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-08-26 14:11 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2020-08-26 14:20 ` [Intel-gfx] [PATCH 01/39] " Matthew Auld
2020-08-26 14:20   ` Matthew Auld
2020-08-26 20:43 ` Harald Arnesen
2020-08-26 20:43   ` [Intel-gfx] " Harald Arnesen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200826132811.17577-6-chris@chris-wilson.co.uk \
    --to=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=mika.kuoppala@linux.intel.com \
    --cc=stable@vger.kernel.org \
    --cc=yu.bruce.chang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.