From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Cc: Chris Wilson <chris@chris-wilson.co.uk>, Bruce Chang <yu.bruce.chang@intel.com>, Mika Kuoppala <mika.kuoppala@linux.intel.com>, stable@vger.kernel.org Subject: [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Date: Wed, 26 Aug 2020 14:27:38 +0100 [thread overview] Message-ID: <20200826132811.17577-6-chris@chris-wilson.co.uk> (raw) In-Reply-To: <20200826132811.17577-1-chris@chris-wilson.co.uk> On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") where, presumably, due to a missing Global Observation Point synchronisation, the write pointer of the CSB ringbuffer is updated _prior_ to the contents of the ringbuffer. That is we see the GPU report more context-switch entries for us to parse, but those entries have not been written, leading us to process stale events, and eventually report a hung GPU. However, this effect appears to be much more severe than we previously saw on Icelake (though it might be best if we try the same approach there as well and measure), and Bruce suggested the good idea of resetting the CSB entry after use so that we can detect when it has been updated by the GPU. By instrumenting how long that may be, we can set a reliable upper bound for how long we should wait for: 513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns) Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045 References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") Suggested-by: Bruce Chang <yu.bruce.chang@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Bruce Chang <yu.bruce.chang@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: stable@vger.kernel.org # v5.4 --- drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index d6e0f62337b4..d75712a503b7 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last) */ static inline bool gen12_csb_parse(const u64 *csb) { - u64 entry = READ_ONCE(*csb); - bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); - bool new_queue = + bool ctx_away_valid; + bool new_queue; + u64 entry; + + /* HSD#22011248461 */ + entry = READ_ONCE(*csb); + if (unlikely(entry == -1)) { + preempt_disable(); + if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50)) + GEM_WARN_ON("50us CSB timeout"); + preempt_enable(); + } + WRITE_ONCE(*(u64 *)csb, -1); + + ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); + new_queue = lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE; /* @@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine) WRITE_ONCE(*execlists->csb_write, reset_value); wmb(); /* Make sure this is visible to HW (paranoia?) */ + /* Check that the GPU does indeed update the CSB entries! */ + memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64)); invalidate_csb_entries(&execlists->csb_status[0], &execlists->csb_status[reset_value]); -- 2.20.1
WARNING: multiple messages have this Message-ID (diff)
From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Cc: stable@vger.kernel.org, Chris Wilson <chris@chris-wilson.co.uk> Subject: [Intel-gfx] [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Date: Wed, 26 Aug 2020 14:27:38 +0100 [thread overview] Message-ID: <20200826132811.17577-6-chris@chris-wilson.co.uk> (raw) In-Reply-To: <20200826132811.17577-1-chris@chris-wilson.co.uk> On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") where, presumably, due to a missing Global Observation Point synchronisation, the write pointer of the CSB ringbuffer is updated _prior_ to the contents of the ringbuffer. That is we see the GPU report more context-switch entries for us to parse, but those entries have not been written, leading us to process stale events, and eventually report a hung GPU. However, this effect appears to be much more severe than we previously saw on Icelake (though it might be best if we try the same approach there as well and measure), and Bruce suggested the good idea of resetting the CSB entry after use so that we can detect when it has been updated by the GPU. By instrumenting how long that may be, we can set a reliable upper bound for how long we should wait for: 513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns) Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045 References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") Suggested-by: Bruce Chang <yu.bruce.chang@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Bruce Chang <yu.bruce.chang@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: stable@vger.kernel.org # v5.4 --- drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index d6e0f62337b4..d75712a503b7 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last) */ static inline bool gen12_csb_parse(const u64 *csb) { - u64 entry = READ_ONCE(*csb); - bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); - bool new_queue = + bool ctx_away_valid; + bool new_queue; + u64 entry; + + /* HSD#22011248461 */ + entry = READ_ONCE(*csb); + if (unlikely(entry == -1)) { + preempt_disable(); + if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50)) + GEM_WARN_ON("50us CSB timeout"); + preempt_enable(); + } + WRITE_ONCE(*(u64 *)csb, -1); + + ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); + new_queue = lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE; /* @@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine) WRITE_ONCE(*execlists->csb_write, reset_value); wmb(); /* Make sure this is visible to HW (paranoia?) */ + /* Check that the GPU does indeed update the CSB entries! */ + memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64)); invalidate_csb_entries(&execlists->csb_status[0], &execlists->csb_status[reset_value]); -- 2.20.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2020-08-26 13:28 UTC|newest] Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-08-26 13:27 [PATCH 01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 02/39] drm/i915/gem: Use set_pte_at() for assigning the vmapped PTE Chris Wilson 2020-08-26 16:36 ` Matthew Auld 2020-08-26 16:55 ` Chris Wilson 2020-08-26 13:27 ` [PATCH 03/39] drm/i915/gem: Prevent using pgprot_writecombine() if PAT is not supported Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 04/39] drm/i915/gt: Clear the buffer pool age before use Chris Wilson 2020-08-26 17:15 ` Matthew Auld 2020-08-26 13:27 ` [Intel-gfx] [PATCH 05/39] drm/i915/gt: Widen CSB pointer to u64 for the parsers Chris Wilson 2020-08-26 13:27 ` Chris Wilson [this message] 2020-08-26 13:27 ` [Intel-gfx] [PATCH 06/39] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson 2020-08-28 14:04 ` Mika Kuoppala 2020-08-28 14:04 ` [Intel-gfx] " Mika Kuoppala 2020-08-26 13:27 ` [Intel-gfx] [PATCH 07/39] drm/i915/gt: Apply the CSB w/a for all Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 08/39] drm/i915/gt: Show engine properties in the pretty printer Chris Wilson 2020-08-27 12:25 ` Mika Kuoppala 2020-08-26 13:27 ` [Intel-gfx] [PATCH 09/39] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson 2020-08-26 13:27 ` [PATCH 10/39] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson 2020-08-29 9:30 ` Mika Kuoppala 2020-08-29 9:30 ` Mika Kuoppala 2020-08-26 13:27 ` [PATCH 11/39] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson 2020-08-26 13:27 ` [PATCH 12/39] drm/i915/gem: Always test execution status on closing the context Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] " Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 13/39] drm/i915/gt: Signal cancelled requests Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 14/39] drm/i915/selftests: Finish pending mock requests on cancellation Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 15/39] drm/i915/gt: Retire cancelled requests on unload Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 16/39] drm/i915/gt: Remove defunct intel_virtual_engine_get_sibling() Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 17/39] drm/i915/gt: Defer enabling the breadcrumb interrupt to after submission Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 18/39] drm/i915/gt: Track signaled breadcrumbs outside of the breadcrumb spinlock Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 19/39] drm/i915/gt: Don't cancel the interrupt shadow too early Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 20/39] drm/i915/gt: Free stale request on destroying the virtual engine Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 21/39] drm/i915/gt: Protect context lifetime with RCU Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 22/39] drm/i915/gt: Split the breadcrumb spinlock between global and contexts Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 23/39] drm/i915/gt: Move the breadcrumb to the signaler if completed upon cancel Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 24/39] drm/i915/gt: Decouple completed requests on unwind Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 25/39] drm/i915/gt: Check for a completed last request once Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 26/39] drm/i915/gt: Replace direct submit with direct call to tasklet Chris Wilson 2020-08-26 13:27 ` [Intel-gfx] [PATCH 27/39] drm/i915/gt: ce->inflight updates are now serialised Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 28/39] drm/i915/gt: Use virtual_engine during execlists_dequeue Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 29/39] drm/i915/gt: Decouple inflight virtual engines Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 30/39] drm/i915/gt: Defer schedule_out until after the next dequeue Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 31/39] drm/i915/gt: Remove virtual breadcrumb before transfer Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 32/39] drm/i915/gt: Shrink the critical section for irq signaling Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 33/39] drm/i915/gt: Resubmit the virtual engine on schedule-out Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 34/39] drm/i915/gt: Simplify virtual engine handling for execlists_hold() Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 35/39] drm/i915: Encode fence specific waitqueue behaviour into the wait.flags Chris Wilson 2020-09-02 14:02 ` Thomas Hellström (Intel) 2020-09-03 9:50 ` Thomas Hellström (Intel) 2020-09-03 11:32 ` Chris Wilson 2020-09-03 12:08 ` Thomas Hellström (Intel) 2020-08-26 13:28 ` [Intel-gfx] [PATCH 36/39] drm/i915/selftests: Confirm RING_TIMESTAMP / CTX_TIMESTAMP share a clock Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 37/39] drm/i915/gt: Consolidate the CS timestamp clocks Chris Wilson 2020-08-26 13:28 ` [PATCH 38/39] drm/i915: Break up error capture compression loops with cond_resched() Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] " Chris Wilson 2020-08-26 13:28 ` [Intel-gfx] [PATCH 39/39] drm/i915: Reduce GPU error capture mutex hold time Chris Wilson 2020-08-26 13:53 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/39] drm/i915/gem: Avoid implicit vmap for highmem on x86-32 Patchwork 2020-08-26 13:54 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork 2020-08-26 14:11 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork 2020-08-26 14:20 ` [Intel-gfx] [PATCH 01/39] " Matthew Auld 2020-08-26 14:20 ` Matthew Auld 2020-08-26 20:43 ` Harald Arnesen 2020-08-26 20:43 ` [Intel-gfx] " Harald Arnesen
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20200826132811.17577-6-chris@chris-wilson.co.uk \ --to=chris@chris-wilson.co.uk \ --cc=intel-gfx@lists.freedesktop.org \ --cc=mika.kuoppala@linux.intel.com \ --cc=stable@vger.kernel.org \ --cc=yu.bruce.chang@intel.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.