* [PATCH 0/2] drm/i915/bxt: work around HW coherency issue
@ 2015-06-08 16:28 Imre Deak
  2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
  2015-06-08 16:28 ` [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings Imre Deak
  0 siblings, 2 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-08 16:28 UTC (permalink / raw)
  To: intel-gfx

Running some basic igt tests on BXT A1 stepping uncovered a coherency
problem where a cached CPU mapping won't be updated by a GPU store via a
snooped mapping. While I opened an internal ticket to investigate this
further, this patchset is an attempt to work around the problem until a
better solution is found or the problem is fixed in a later stepping.

The problem can be easily triggered with workloads where we submit
multiple small requests to the GPU. Here the CPU's view of the seqno
eventually becomes stale with respect to the value stored by the GPU at
the end of the request, and __i915_wait_request will stall (and trigger
the lost interrupt/GPU reset mechanism). I could reproduce this with
igt/gem_store_dword_loop_render.

I also added a new igt/gem_store_dword_batches_loop test for cached
mappings, which shows the same problem in the case of an
MI_STORE_DATA_IMM write to a cached buffer. The difference between this
and the above case is that here the
buffer is mapped through PPGTT, while above the status page is mapped
through GGTT.
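
For reference, the CPU-side pattern that distinguishes a stale cached
view from the actual memory contents looks roughly like the following
standalone sketch (illustrative only, not the IGT test code; 'p' stands
in for a pointer into the snooped mapping):

	#include <emmintrin.h>
	#include <stdint.h>

	/* Read a value with the cached view deliberately dropped first.
	 * clflush tosses the (possibly stale) cacheline, so the load
	 * that follows has to refetch from memory. This mirrors what
	 * the workaround in patch 1 does on coherent seqno reads. */
	static uint32_t read_u32_flushed(volatile uint32_t *p)
	{
		_mm_clflush((const void *)p);
		_mm_mfence();
		return *p;
	}

If a plain *p read keeps returning the old value while
read_u32_flushed() returns the new one, the store reached memory but
the CPU cacheline was never invalidated, which is exactly the behavior
described above.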

Imre Deak (2):
  drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  drm/i915/bxt: work around HW coherency issue for cached GEM mappings

 drivers/gpu/drm/i915/i915_gem.c         |  6 +++++-
 drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

-- 
2.1.4


* [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 16:28 [PATCH 0/2] drm/i915/bxt: work around HW coherency issue Imre Deak
@ 2015-06-08 16:28 ` Imre Deak
  2015-06-08 17:08   ` Dave Gordon
                     ` (2 more replies)
  2015-06-08 16:28 ` [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings Imre Deak
  1 sibling, 3 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-08 16:28 UTC (permalink / raw)
  To: intel-gfx

By running igt/store_dword_loop_render on BXT we can hit a coherency
problem where the seqno written at GPU command completion time is not
seen by the CPU. This results in __i915_wait_request seeing the stale
seqno and not completing the request (not considering the lost
interrupt/GPU reset mechanism). I also verified that this isn't a case
of a lost interrupt, or that the command didn't complete somehow: when
the coherency issue occurred I read the seqno via an uncached GTT mapping
too. While the cached version of the seqno still showed the stale value
the one read via the uncached mapping was the correct one.

Work around this issue by clflushing the corresponding CPU cacheline
following any store of the seqno and preceding any reading of it. When
reading it do this only when the caller expects a coherent view.

Testcase: igt/store_dword_loop_render
Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9f5485d..88bc5525 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
 
 static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
 {
+	/*
+	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
+	 * storing the completed request's seqno occasionally doesn't
+	 * invalidate the CPU cache. Work around this by clflushing the
+	 * corresponding cacheline whenever the caller wants the coherency to
+	 * be guaranteed. Note that this cacheline is known to be
+	 * clean at this point, since we only write it in gen8_set_seqno(),
+	 * where we also do a clflush after the write. So this clflush in
+	 * practice becomes an invalidate operation.
+	 */
+	if (IS_BROXTON(ring->dev) & !lazy_coherency)
+		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
+
 	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
 }
 
 static void gen8_set_seqno(struct intel_engine_cs *ring, u32 seqno)
 {
 	intel_write_status_page(ring, I915_GEM_HWS_INDEX, seqno);
+
+	/* See gen8_get_seqno() explaining the reason for the clflush. */
+	if (IS_BROXTON(ring->dev))
+		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
 }
 
 static int gen8_emit_request(struct intel_ringbuffer *ringbuf,
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39f6dfc..224a25b 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -352,6 +352,13 @@ intel_ring_sync_index(struct intel_engine_cs *ring,
 	return idx;
 }
 
+static inline void
+intel_flush_status_page(struct intel_engine_cs *ring, int reg)
+{
+	drm_clflush_virt_range(&ring->status_page.page_addr[reg],
+			       sizeof(uint32_t));
+}
+
 static inline u32
 intel_read_status_page(struct intel_engine_cs *ring,
 		       int reg)
-- 
2.1.4


* [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings
  2015-06-08 16:28 [PATCH 0/2] drm/i915/bxt: work around HW coherency issue Imre Deak
  2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
@ 2015-06-08 16:28 ` Imre Deak
  2015-06-13 18:04   ` shuang.he
  1 sibling, 1 reply; 29+ messages in thread
From: Imre Deak @ 2015-06-08 16:28 UTC (permalink / raw)
  To: intel-gfx

Due to a coherency issue on BXT-A1 we can't guarantee a coherent view of
cached GPU mappings, so fall back to uncached mappings. Note that this
still won't fix cases where userspace expects a coherent view without
synchronizing (via a set domain call). It still makes sense to limit the
kernel's notion of the mapping to be uncached, for example for
relocations to work properly at execbuffer time. Also, in case
userspace does synchronize the buffer, this still guarantees that we do
the proper clflushing for the buffer.

Testcase: igt/gem_store_dword_batches_loop/cached-mapping
Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index be35f04..212b51c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4062,7 +4062,11 @@ int i915_gem_set_caching_ioctl(struct drm_device *dev, void *data,
 		level = I915_CACHE_NONE;
 		break;
 	case I915_CACHING_CACHED:
-		level = I915_CACHE_LLC;
+		/*
+		 * On BXT-A1 we can't guarantee a coherent view for cached
+		 * mappings, so fall back to uncached mappings.
+		 */
+		level = IS_BROXTON(dev) ? I915_CACHE_NONE : I915_CACHE_LLC;
 		break;
 	case I915_CACHING_DISPLAY:
 		level = HAS_WT(dev) ? I915_CACHE_WT : I915_CACHE_NONE;
-- 
2.1.4


* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
@ 2015-06-08 17:08   ` Dave Gordon
  2015-06-08 17:12     ` Chris Wilson
  2015-06-08 17:14     ` Imre Deak
  2015-06-09  8:21   ` Jani Nikula
  2015-07-01 13:40   ` Mika Kuoppala
  2 siblings, 2 replies; 29+ messages in thread
From: Dave Gordon @ 2015-06-08 17:08 UTC (permalink / raw)
  To: Imre Deak, intel-gfx

On 08/06/15 17:28, Imre Deak wrote:
> By running igt/store_dword_loop_render on BXT we can hit a coherency
> problem where the seqno written at GPU command completion time is not
> seen by the CPU. This results in __i915_wait_request seeing the stale
> seqno and not completing the request (not considering the lost
> interrupt/GPU reset mechanism). I also verified that this isn't a case
> of a lost interrupt, or that the command didn't complete somehow: when
> the coherency issue occurred I read the seqno via an uncached GTT mapping
> too. While the cached version of the seqno still showed the stale value
> the one read via the uncached mapping was the correct one.
> 
> Work around this issue by clflushing the corresponding CPU cacheline
> following any store of the seqno and preceding any reading of it. When
> reading it do this only when the caller expects a coherent view.
> 
> Testcase: igt/store_dword_loop_render
> Signed-off-by: Imre Deak <imre.deak@intel.com>

Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
(and MI_STORE_DATA_INDEX):

	This command simply initiates the write operation with
	command execution proceeding normally. Although the write
	operation is guaranteed to complete "eventually", there is
	no mechanism to synchronize command execution with the
	completion (or even initiation) of these operations.

So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
the HWSP instead?
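
For the non-render rings on the gen8 execlists path that might look
roughly like the sketch below, reusing the existing MI_FLUSH_DW flag
definitions (an untested sketch; hws_seqno_addr and seqno are
placeholders):

	/* MI_FLUSH_DW with a post-sync DWord write of the seqno to the
	 * HWSP (gen8 form with a 64-bit address); the parser pauses
	 * until prior operations have completed before the write. */
	intel_logical_ring_emit(ringbuf, (MI_FLUSH_DW + 1) |
					 MI_FLUSH_DW_OP_STOREDW);
	intel_logical_ring_emit(ringbuf, hws_seqno_addr |
					 MI_FLUSH_DW_USE_GTT);
	intel_logical_ring_emit(ringbuf, 0); /* address, high dword */
	intel_logical_ring_emit(ringbuf, seqno);
	intel_logical_ring_emit(ringbuf, MI_USER_INTERRUPT);
	intel_logical_ring_emit(ringbuf, MI_NOOP);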

.Dave.

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 17:08   ` Dave Gordon
@ 2015-06-08 17:12     ` Chris Wilson
  2015-06-08 17:34       ` Ville Syrjälä
  2015-06-08 17:14     ` Imre Deak
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-06-08 17:12 UTC (permalink / raw)
  To: Dave Gordon; +Cc: intel-gfx

On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> On 08/06/15 17:28, Imre Deak wrote:
> > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > problem where the seqno written at GPU command completion time is not
> > seen by the CPU. This results in __i915_wait_request seeing the stale
> > seqno and not completing the request (not considering the lost
> > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > of a lost interrupt, or that the command didn't complete somehow: when
> > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > too. While the cached version of the seqno still showed the stale value
> > the one read via the uncached mapping was the correct one.
> > 
> > Work around this issue by clflushing the corresponding CPU cacheline
> > following any store of the seqno and preceding any reading of it. When
> > reading it do this only when the caller expects a coherent view.
> > 
> > Testcase: igt/store_dword_loop_render
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> 
> Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> (and MI_STORE_DATA_INDEX):
> 
> 	This command simply initiates the write operation with
> 	command execution proceeding normally. Although the write
> 	operation is guaranteed to complete "eventually", there is
> 	no mechanism to synchronize command execution with the
> 	completion (or even initiation) of these operations.
> 
> So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> the HWSP instead?

iirc there is also no guarantee for when the post-sync write op is
completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
corrected!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 17:08   ` Dave Gordon
  2015-06-08 17:12     ` Chris Wilson
@ 2015-06-08 17:14     ` Imre Deak
  1 sibling, 0 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-08 17:14 UTC (permalink / raw)
  To: Dave Gordon; +Cc: intel-gfx

On ma, 2015-06-08 at 18:08 +0100, Dave Gordon wrote:
> On 08/06/15 17:28, Imre Deak wrote:
> > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > problem where the seqno written at GPU command completion time is not
> > seen by the CPU. This results in __i915_wait_request seeing the stale
> > seqno and not completing the request (not considering the lost
> > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > of a lost interrupt, or that the command didn't complete somehow: when
> > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > too. While the cached version of the seqno still showed the stale value
> > the one read via the uncached mapping was the correct one.
> > 
> > Work around this issue by clflushing the corresponding CPU cacheline
> > following any store of the seqno and preceding any reading of it. When
> > reading it do this only when the caller expects a coherent view.
> > 
> > Testcase: igt/store_dword_loop_render
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> 
> Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> (and MI_STORE_DATA_INDEX):
> 
> 	This command simply initiates the write operation with
> 	command execution proceeding normally. Although the write
> 	operation is guaranteed to complete "eventually", there is
> 	no mechanism to synchronize command execution with the
> 	completion (or even initiation) of these operations.
> 
> So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> the HWSP instead?

I have tried to use PIPE_CONTROL instead of MI_STORE_DATA_IMM, but that
didn't solve this issue. With the new
igt/store_dword_batches_loop/cached-mapping you can also verify that
this isn't some ordering issue: even when waiting long enough after the
command has completed, the CPU's view is still stale, while reading the
value after a clflush returns the correct one.

--Imre



* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 17:12     ` Chris Wilson
@ 2015-06-08 17:34       ` Ville Syrjälä
  2015-06-08 18:00         ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Ville Syrjälä @ 2015-06-08 17:34 UTC (permalink / raw)
  To: Chris Wilson, Dave Gordon, Imre Deak, intel-gfx

On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> > On 08/06/15 17:28, Imre Deak wrote:
> > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > problem where the seqno written at GPU command completion time is not
> > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > seqno and not completing the request (not considering the lost
> > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > of a lost interrupt, or that the command didn't complete somehow: when
> > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > too. While the cached version of the seqno still showed the stale value
> > > the one read via the uncached mapping was the correct one.
> > > 
> > > Work around this issue by clflushing the corresponding CPU cacheline
> > > following any store of the seqno and preceding any reading of it. When
> > > reading it do this only when the caller expects a coherent view.
> > > 
> > > Testcase: igt/store_dword_loop_render
> > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > 
> > Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> > (and MI_STORE_DATA_INDEX):
> > 
> > 	This command simply initiates the write operation with
> > 	command execution proceeding normally. Although the write
> > 	operation is guaranteed to complete "eventually", there is
> > 	no mechanism to synchronize command execution with the
> > 	completion (or even initiation) of these operations.
> > 
> > So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> > the HWSP instead?
> 
> iirc there is also no guarantee for when the post-sync write op is
> completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
> corrected!

My reading of the spec suggests that something like this could work:
PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE
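
In the gen8 LRC emit style that could look something like this (a
sketch only, using the existing PIPE_CONTROL flag definitions;
hws_seqno_addr and seqno are placeholders, and this is untested):

	/* First PIPE_CONTROL: post-sync QWord write of the seqno. */
	intel_logical_ring_emit(ringbuf, GFX_OP_PIPE_CONTROL(6));
	intel_logical_ring_emit(ringbuf, PIPE_CONTROL_GLOBAL_GTT_IVB |
					 PIPE_CONTROL_CS_STALL |
					 PIPE_CONTROL_QW_WRITE);
	intel_logical_ring_emit(ringbuf, hws_seqno_addr);
	intel_logical_ring_emit(ringbuf, 0);     /* address, high dword */
	intel_logical_ring_emit(ringbuf, seqno); /* value, low dword */
	intel_logical_ring_emit(ringbuf, 0);     /* value, high dword */

	/* Second PIPE_CONTROL: wait for the previous post-sync write
	 * to complete, then raise the notify interrupt. */
	intel_logical_ring_emit(ringbuf, GFX_OP_PIPE_CONTROL(6));
	intel_logical_ring_emit(ringbuf, PIPE_CONTROL_FLUSH_ENABLE |
					 PIPE_CONTROL_NOTIFY);
	intel_logical_ring_emit(ringbuf, 0);
	intel_logical_ring_emit(ringbuf, 0);
	intel_logical_ring_emit(ringbuf, 0);
	intel_logical_ring_emit(ringbuf, 0);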

-- 
Ville Syrjälä
Intel OTC

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 17:34       ` Ville Syrjälä
@ 2015-06-08 18:00         ` Chris Wilson
  2015-06-08 18:40           ` Ville Syrjälä
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-06-08 18:00 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On Mon, Jun 08, 2015 at 08:34:51PM +0300, Ville Syrjälä wrote:
> On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
> > On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> > > On 08/06/15 17:28, Imre Deak wrote:
> > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > problem where the seqno written at GPU command completion time is not
> > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > seqno and not completing the request (not considering the lost
> > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > too. While the cached version of the seqno still showed the stale value
> > > > the one read via the uncached mapping was the correct one.
> > > > 
> > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > following any store of the seqno and preceding any reading of it. When
> > > > reading it do this only when the caller expects a coherent view.
> > > > 
> > > > Testcase: igt/store_dword_loop_render
> > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > 
> > > Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> > > (and MI_STORE_DATA_INDEX):
> > > 
> > > 	This command simply initiates the write operation with
> > > 	command execution proceeding normally. Although the write
> > > 	operation is guaranteed to complete "eventually", there is
> > > 	no mechanism to synchronize command execution with the
> > > 	completion (or even initiation) of these operations.
> > > 
> > > So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> > > the HWSP instead?
> > 
> > iirc there is also no guarantee for when the post-sync write op is
> > completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
> > corrected!
> 
> My reading of the spec suggests that something like this could work:
> PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
> PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE

Absolutely sure? The issue is not the completion of the PIPE_CONTROL,
but of ensuring that the value has been written to memory and the CPU
cache snooped. I don't remember there being anything as clear as the
gen2-5 statement that all writes are coherent before the interrupt is
raised.

We can hit the issue that the seqno writes aren't coherent before the
interrupt with our current method - I have seen it with hsw, so this is
definitely something worth improving.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 18:00         ` Chris Wilson
@ 2015-06-08 18:40           ` Ville Syrjälä
  2015-06-08 19:33             ` Dave Gordon
  0 siblings, 1 reply; 29+ messages in thread
From: Ville Syrjälä @ 2015-06-08 18:40 UTC (permalink / raw)
  To: Chris Wilson, Dave Gordon, Imre Deak, intel-gfx

On Mon, Jun 08, 2015 at 07:00:49PM +0100, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 08:34:51PM +0300, Ville Syrjälä wrote:
> > On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> > > > On 08/06/15 17:28, Imre Deak wrote:
> > > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > > problem where the seqno written at GPU command completion time is not
> > > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > > seqno and not completing the request (not considering the lost
> > > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > > too. While the cached version of the seqno still showed the stale value
> > > > > the one read via the uncached mapping was the correct one.
> > > > > 
> > > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > > following any store of the seqno and preceding any reading of it. When
> > > > > reading it do this only when the caller expects a coherent view.
> > > > > 
> > > > > Testcase: igt/store_dword_loop_render
> > > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > 
> > > > Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> > > > (and MI_STORE_DATA_INDEX):
> > > > 
> > > > 	This command simply initiates the write operation with
> > > > 	command execution proceeding normally. Although the write
> > > > 	operation is guaranteed to complete "eventually", there is
> > > > 	no mechanism to synchronize command execution with the
> > > > 	completion (or even initiation) of these operations.
> > > > 
> > > > So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> > > > the HWSP instead?
> > > 
> > > iirc there is also no guarantee for when the post-sync write op is
> > > completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
> > > corrected!
> > 
> > My reading of the spec suggests that something like this could work:
> > PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
> > PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE
> 
> Absolutely sure? The issue is not the completion of the PIPE_CONTROL,
> > but of ensuring that the value has been written to memory and the CPU
> cache snooped. I don't remember there being anything as clear as the
> gen2-5 statement that all writes are coherent before the interrupt is
> raised.
> 
> We can hit the issue that the seqno writes aren't coherent before the
> interrupt with our current method - I have seen it with hsw, so this is
> definitely something worth improving.

What I get from the spec is:
- The post-sync operation is started after previous and current flushes
  have completed
- The flush enable bit causes the CS to wait until all previous
  post-sync operations have completed, which hopefully means the
  store is visible to everyone
- The notify interrupt is signalled after the current sync operation
  has completed, which I hope means the flush enable stall has also
  finished (and if not a three PIPE_CONTROL sequence could be used
  instead)

So no, I'm not absolutely sure by any means.

-- 
Ville Syrjälä
Intel OTC

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 18:40           ` Ville Syrjälä
@ 2015-06-08 19:33             ` Dave Gordon
  2015-06-10 10:59               ` Imre Deak
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Gordon @ 2015-06-08 19:33 UTC (permalink / raw)
  To: Ville Syrjälä, Chris Wilson, Imre Deak, intel-gfx

On 08/06/15 19:40, Ville Syrjälä wrote:
> On Mon, Jun 08, 2015 at 07:00:49PM +0100, Chris Wilson wrote:
>> On Mon, Jun 08, 2015 at 08:34:51PM +0300, Ville Syrjälä wrote:
>>> On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
>>>> On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
>>>>> On 08/06/15 17:28, Imre Deak wrote:
>>>>>> By running igt/store_dword_loop_render on BXT we can hit a coherency
>>>>>> problem where the seqno written at GPU command completion time is not
>>>>>> seen by the CPU. This results in __i915_wait_request seeing the stale
>>>>>> seqno and not completing the request (not considering the lost
>>>>>> interrupt/GPU reset mechanism). I also verified that this isn't a case
>>>>>> of a lost interrupt, or that the command didn't complete somehow: when
>>>>>> the coherency issue occurred I read the seqno via an uncached GTT mapping
>>>>>> too. While the cached version of the seqno still showed the stale value
>>>>>> the one read via the uncached mapping was the correct one.
>>>>>>
>>>>>> Work around this issue by clflushing the corresponding CPU cacheline
>>>>>> following any store of the seqno and preceding any reading of it. When
>>>>>> reading it do this only when the caller expects a coherent view.
>>>>>>
>>>>>> Testcase: igt/store_dword_loop_render
>>>>>> Signed-off-by: Imre Deak <imre.deak@intel.com>
>>>>>
>>>>> Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
>>>>> (and MI_STORE_DATA_INDEX):
>>>>>
>>>>> 	This command simply initiates the write operation with
>>>>> 	command execution proceeding normally. Although the write
>>>>> 	operation is guaranteed to complete "eventually", there is
>>>>> 	no mechanism to synchronize command execution with the
>>>>> 	completion (or even initiation) of these operations.
>>>>>
>>>>> So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
>>>>> the HWSP instead?
>>>>
>>>> iirc there is also no guarantee for when the post-sync write op is
>>>> completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
>>>> corrected!
>>>
>>> My reading of the spec suggests that something like this could work:
>>> PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
>>> PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE
>>
>> Absolutely sure? The issue is not the completion of the PIPE_CONTROL,
>> but of ensuring that the value has been written to memory and the CPU
>> cache snooped. I don't remember there being anything as clear as the
>> gen2-5 statement that all writes are coherent before the interrupt is
>> raised.
>>
>> We can hit the issue that the seqno writes aren't coherent before the
>> interrupt with our current method - I have seen it with hsw, so this is
>> definitely something worth improving.
> 
> What I get from the spec is:
> - The post-sync operation is started after previous and current flushes
>   have completed
> - The flush enable bit causes the CS to wait until all previous
>   post-sync operations have completed, which hopefully means the
>   store is visible to everyone
> - The notify interrupt is signalled after the current sync operation
>   has completed, which I hope means the flush enable stall has also
>   finished (and if not a three PIPE_CONTROL sequence could be used
>   instead)
> 
> So no, I'm not absolutely sure by any means.

Here's the commentary for MI_FLUSH_DW:

	The MI_FLUSH_DW command is used to perform an internal "flush"
	operation. The parser pauses on an internal flush until all
	drawing engines have completed any pending operations. In
	addition, this command can also be used to: Flush any dirty
	data to memory. Invalidate the TLB cache inside the hardware

	Usage note: After this command is completed with a Store DWord
	enabled, CPU access to graphics memory will be coherent
	(assuming the Render Cache flush is not inhibited).

	...

	NOTIFY: If ENABLED, a Sync Completion Interrupt will be
	generated (if enabled by the MI Interrupt Control registers)
	once the sync operation is complete. See Interrupt Control
	Registers in Memory Interface Registers for details.

So that's likewise unclear about whether the interrupt comes after just
the "sync" or after the "post sync operation" as well.

Then there's the DC_FLUSH bit in the PIPE_CONTROL instruction:

	DC Flush (L3 Flush) by default doesn’t result in
	flushing/invalidating the IA Coherent lines from L3$, however
	this can be achieved by setting control bit “Pipe line flush
	Coherent lines” in “L3SQCREG4” register.

And is the GPU's MI_CLFLUSH instruction of any use?

.Dave.

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
  2015-06-08 17:08   ` Dave Gordon
@ 2015-06-09  8:21   ` Jani Nikula
  2015-06-10 14:07     ` Imre Deak
  2015-07-01 13:40   ` Mika Kuoppala
  2 siblings, 1 reply; 29+ messages in thread
From: Jani Nikula @ 2015-06-09  8:21 UTC (permalink / raw)
  To: Imre Deak, intel-gfx

On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> By running igt/store_dword_loop_render on BXT we can hit a coherency
> problem where the seqno written at GPU command completion time is not
> seen by the CPU. This results in __i915_wait_request seeing the stale
> seqno and not completing the request (not considering the lost
> interrupt/GPU reset mechanism). I also verified that this isn't a case
> of a lost interrupt, or that the command didn't complete somehow: when
> the coherency issue occurred I read the seqno via an uncached GTT mapping
> too. While the cached version of the seqno still showed the stale value
> the one read via the uncached mapping was the correct one.
>
> Work around this issue by clflushing the corresponding CPU cacheline
> following any store of the seqno and preceding any reading of it. When
> reading it do this only when the caller expects a coherent view.
>
> Testcase: igt/store_dword_loop_render
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 9f5485d..88bc5525 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
>  
>  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
>  {
> +	/*
> +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> +	 * storing the completed request's seqno occasionally doesn't
> +	 * invalidate the CPU cache. Work around this by clflushing the
> +	 * corresponding cacheline whenever the caller wants the coherency to
> +	 * be guaranteed. Note that this cacheline is known to be
> +	 * clean at this point, since we only write it in gen8_set_seqno(),
> +	 * where we also do a clflush after the write. So this clflush in
> +	 * practice becomes an invalidate operation.
> +	 */
> +	if (IS_BROXTON(ring->dev) & !lazy_coherency)

Should be &&.

BR,
Jani.

> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
> +
>  	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
>  }
>  
>  static void gen8_set_seqno(struct intel_engine_cs *ring, u32 seqno)
>  {
>  	intel_write_status_page(ring, I915_GEM_HWS_INDEX, seqno);
> +
> +	/* See gen8_get_seqno() explaining the reason for the clflush. */
> +	if (IS_BROXTON(ring->dev))
> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
>  }
>  
>  static int gen8_emit_request(struct intel_ringbuffer *ringbuf,
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 39f6dfc..224a25b 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -352,6 +352,13 @@ intel_ring_sync_index(struct intel_engine_cs *ring,
>  	return idx;
>  }
>  
> +static inline void
> +intel_flush_status_page(struct intel_engine_cs *ring, int reg)
> +{
> +	drm_clflush_virt_range(&ring->status_page.page_addr[reg],
> +			       sizeof(uint32_t));
> +}
> +
>  static inline u32
>  intel_read_status_page(struct intel_engine_cs *ring,
>  		       int reg)
> -- 
> 2.1.4
>

-- 
Jani Nikula, Intel Open Source Technology Center

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 19:33             ` Dave Gordon
@ 2015-06-10 10:59               ` Imre Deak
  2015-06-10 15:10                 ` Jesse Barnes
  0 siblings, 1 reply; 29+ messages in thread
From: Imre Deak @ 2015-06-10 10:59 UTC (permalink / raw)
  To: Dave Gordon; +Cc: intel-gfx

On ma, 2015-06-08 at 20:33 +0100, Dave Gordon wrote:
> On 08/06/15 19:40, Ville Syrjälä wrote:
> > On Mon, Jun 08, 2015 at 07:00:49PM +0100, Chris Wilson wrote:
> >> On Mon, Jun 08, 2015 at 08:34:51PM +0300, Ville Syrjälä wrote:
> >>> On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
> >>>> On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> >>>>> On 08/06/15 17:28, Imre Deak wrote:
> >>>>>> By running igt/store_dword_loop_render on BXT we can hit a coherency
> >>>>>> problem where the seqno written at GPU command completion time is not
> >>>>>> seen by the CPU. This results in __i915_wait_request seeing the stale
> >>>>>> seqno and not completing the request (not considering the lost
> >>>>>> interrupt/GPU reset mechanism). I also verified that this isn't a case
> >>>>>> of a lost interrupt, or that the command didn't complete somehow: when
> >>>>>> the coherency issue occurred I read the seqno via an uncached GTT mapping
> >>>>>> too. While the cached version of the seqno still showed the stale value
> >>>>>> the one read via the uncached mapping was the correct one.
> >>>>>>
> >>>>>> Work around this issue by clflushing the corresponding CPU cacheline
> >>>>>> following any store of the seqno and preceding any reading of it. When
> >>>>>> reading it do this only when the caller expects a coherent view.
> >>>>>>
> >>>>>> Testcase: igt/store_dword_loop_render
> >>>>>> Signed-off-by: Imre Deak <imre.deak@intel.com>
> >>>>>
> >>>>> Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> >>>>> (and MI_STORE_DATA_INDEX):
> >>>>>
> >>>>> 	This command simply initiates the write operation with
> >>>>> 	command execution proceeding normally. Although the write
> >>>>> 	operation is guaranteed to complete "eventually", there is
> >>>>> 	no mechanism to synchronize command execution with the
> >>>>> 	completion (or even initiation) of these operations.
> >>>>>
> >>>>> So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> >>>>> the HWSP instead?
> >>>>
> >>>> iirc there is also no guarantee for when the post-sync write op is
> >>>> completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
> >>>> corrected!
> >>>
> >>> My reading of the spec suggests that something like this could work:
> >>> PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
> >>> PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE
> >>
> >> Absolutely sure? The issue is not the completion of the PIPE_CONTROL,
> >> but of ensuring that the value has been written to memory and the CPU
> >> cache snooped. I don't remember there being anything as clear as the
> >> gen2-5 statement that all writes are coherent before the interrupt is
> >> raised.
> >>
> >> We can hit the issue that the seqno writes aren't coherent before the
> >> interrupt with our current method - I have seen it with hsw, so this is
> >> definitely something worth improving.
> > 
> > What I get from the spec is:
> > - The post-sync operation is started after previous and current flushes
> >   have completed
> > - The flush enable bit causes the CS to wait until all previous
> >   post-sync operations have completed, which hopefully means the
> >   store is visible to everyone
> > - The notify interrupt is signalled after the current sync operation
> >   has completed, which I hope means the flush enable stall has also
> >   finished (and if not a three PIPE_CONTROL sequence could be used
> >   instead)
> > 
> > So no, I'm not absolutely sure by any means.
> 
> Here's the commentary for MI_FLUSH_DW:
> 
> 	The MI_FLUSH_DW command is used to perform an internal "flush"
> 	operation. The parser pauses on an internal flush until all
> 	drawing engines have completed any pending operations. In
> 	addition, this command can also be used to: Flush any dirty
> 	data to memory. Invalidate the TLB cache inside the hardware
> 
> 	Usage note: After this command is completed with a Store DWord
> 	enabled, CPU access to graphics memory will be coherent
> 	(assuming the Render Cache flush is not inhibited).
> 
> 	...
> 
> 	NOTIFY: If ENABLED, a Sync Completion Interrupt will be
> 	generated (if enabled by the MI Interrupt Control registers)
> 	once the sync operation is complete. See Interrupt Control
> 	Registers in Memory Interface Registers for details.
> 
> So that's likewise unclear about whether the interrupt comes after just
> the "sync" or after the "post sync operation" as well.
> 
> Then there's the DC_FLUSH bit in the PIPE_CONTROL instruction:
> 
> 	DC Flush (L3 Flush) by default doesn’t result in
> 	flushing/invalidating the IA Coherent lines from L3$, however
> 	this can be achieved by setting control bit “Pipe line flush
> 	Coherent lines” in “L3SQCREG4” register.

I think the discussion here is about two separate things:
1. Possible ordering issue between the seqno store and the completion
interrupt
2. Coherency issue that leaves the CPU with a stale view of the seqno
indefinitely, which this patch works around

I'm confident that in my case the problem is not due to ordering. If it
was "only" ordering then the value would show up eventually. This is not
the case though: __i915_wait_request will see the stale value
indefinitely even though it gets woken up periodically afterwards by the
lost IRQ logic (with hangcheck disabled).

I have tried to replace the MI_STORE_DATA_IMM+MI_USER_INTERRUPT on the
render ring with a single PIPE_CONTROL to store the seqno and send a
notify interrupt, setting the CS_STALL, PIPE_CONTROL_FLUSH_ENABLE,
DC_FLUSH_ENABLE bits. This didn't get rid of the problem, I had the same
behavior as before. I also tried setting L3SQCREG4 bit 21 as you
suggested, but it didn't have an effect.

Similarly, I have now tried to replace the MI_STORE_DATA_IMM+
MI_USER_INTERRUPT on the BSD/BLT/VEBOX rings with an MI_FLUSH_DW storing
the seqno and sending a sync interrupt. As above, this didn't solve the
problem.

I wonder if stores by MI_STORE_DATA_IMM even go through the L3$.
According to Ville they do not, and so I'm not sure how the above flush
operations would solve this problem.

> And is the GPU's MI_CLFLUSH instruction of any use?

According to the spec this command is RCS specific, so it wouldn't help
in case of the other rings (where I also see this problem without this
patch).

--Imre


* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-09  8:21   ` Jani Nikula
@ 2015-06-10 14:07     ` Imre Deak
  2015-06-10 14:21       ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Imre Deak @ 2015-06-10 14:07 UTC (permalink / raw)
  To: Jani Nikula; +Cc: intel-gfx

On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > problem where the seqno written at GPU command completion time is not
> > seen by the CPU. This results in __i915_wait_request seeing the stale
> > seqno and not completing the request (not considering the lost
> > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > of a lost interrupt, or that the command didn't complete somehow: when
> > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > too. While the cached version of the seqno still showed the stale value
> > the one read via the uncached mapping was the correct one.
> >
> > Work around this issue by clflushing the corresponding CPU cacheline
> > following any store of the seqno and preceding any reading of it. When
> > reading it do this only when the caller expects a coherent view.
> >
> > Testcase: igt/store_dword_loop_render
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> >  2 files changed, 24 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 9f5485d..88bc5525 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> >  
> >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> >  {
> > +	/*
> > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > +	 * storing the completed request's seqno occasionally doesn't
> > +	 * invalidate the CPU cache. Work around this by clflushing the
> > +	 * corresponding cacheline whenever the caller wants the coherency to
> > +	 * be guaranteed. Note that this cacheline is known to be
> > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > +	 * where we also do a clflush after the write. So this clflush in
> > +	 * practice becomes an invalidate operation.
> > +	 */
> > +	if (IS_BROXTON(ring->dev) & !lazy_coherency)
> 
> Should be &&.

Thanks for catching it, I'll send a v2 with this fixed if there is no
more feedback.

> 
> BR,
> Jani.
> 
> > +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
> > +
> >  	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
> >  }
> >  
> >  static void gen8_set_seqno(struct intel_engine_cs *ring, u32 seqno)
> >  {
> >  	intel_write_status_page(ring, I915_GEM_HWS_INDEX, seqno);
> > +
> > +	/* See gen8_get_seqno() explaining the reason for the clflush. */
> > +	if (IS_BROXTON(ring->dev))
> > +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
> >  }
> >  
> >  static int gen8_emit_request(struct intel_ringbuffer *ringbuf,
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > index 39f6dfc..224a25b 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > @@ -352,6 +352,13 @@ intel_ring_sync_index(struct intel_engine_cs *ring,
> >  	return idx;
> >  }
> >  
> > +static inline void
> > +intel_flush_status_page(struct intel_engine_cs *ring, int reg)
> > +{
> > +	drm_clflush_virt_range(&ring->status_page.page_addr[reg],
> > +			       sizeof(uint32_t));
> > +}
> > +
> >  static inline u32
> >  intel_read_status_page(struct intel_engine_cs *ring,
> >  		       int reg)
> > -- 
> > 2.1.4
> >
> 



* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 14:07     ` Imre Deak
@ 2015-06-10 14:21       ` Chris Wilson
  2015-06-10 14:55         ` Imre Deak
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-06-10 14:21 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Wed, Jun 10, 2015 at 05:07:46PM +0300, Imre Deak wrote:
> On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> > On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > problem where the seqno written at GPU command completion time is not
> > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > seqno and not completing the request (not considering the lost
> > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > of a lost interrupt, or that the command didn't complete somehow: when
> > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > too. While the cached version of the seqno still showed the stale value
> > > the one read via the uncached mapping was the correct one.
> > >
> > > Work around this issue by clflushing the corresponding CPU cacheline
> > > following any store of the seqno and preceding any reading of it. When
> > > reading it do this only when the caller expects a coherent view.
> > >
> > > Testcase: igt/store_dword_loop_render
> > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> > >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> > >  2 files changed, 24 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > index 9f5485d..88bc5525 100644
> > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> > >  
> > >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> > >  {
> > > +	/*
> > > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > > +	 * storing the completed request's seqno occasionally doesn't
> > > +	 * invalidate the CPU cache. Work around this by clflushing the
> > > +	 * corresponding cacheline whenever the caller wants the coherency to
> > > +	 * be guaranteed. Note that this cacheline is known to be
> > > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > > +	 * where we also do a clflush after the write. So this clflush in
> > > +	 * practice becomes an invalidate operation.

Did you compare and contrast with the gen6+ w/a? A clflush may just work
out quicker considering that the posting read would involve a spinlock
and fw dance.
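
For reference, that gen6+ workaround is roughly the following in
gen6_ring_get_seqno(), where a coherent read is preceded by an mmio
posting read of a CS register:

	if (!lazy_coherency) {
		struct drm_i915_private *dev_priv = ring->dev->dev_private;
		POSTING_READ(RING_ACTHD(ring->mmio_base));
	}
	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);

On gen6+ that mmio read can take the uncore spinlock and a forcewake
reference, which is the overhead being compared against a plain
clflush here.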
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 14:21       ` Chris Wilson
@ 2015-06-10 14:55         ` Imre Deak
  2015-06-10 15:00           ` Ville Syrjälä
  0 siblings, 1 reply; 29+ messages in thread
From: Imre Deak @ 2015-06-10 14:55 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On ke, 2015-06-10 at 15:21 +0100, Chris Wilson wrote:
> On Wed, Jun 10, 2015 at 05:07:46PM +0300, Imre Deak wrote:
> > On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> > > On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > problem where the seqno written at GPU command completion time is not
> > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > seqno and not completing the request (not considering the lost
> > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > too. While the cached version of the seqno still showed the stale value
> > > > the one read via the uncached mapping was the correct one.
> > > >
> > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > following any store of the seqno and preceding any reading of it. When
> > > > reading it do this only when the caller expects a coherent view.
> > > >
> > > > Testcase: igt/store_dword_loop_render
> > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> > > >  2 files changed, 24 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > > index 9f5485d..88bc5525 100644
> > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> > > >  
> > > >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> > > >  {
> > > > +	/*
> > > > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > > > +	 * storing the completed request's seqno occasionally doesn't
> > > > +	 * invalidate the CPU cache. Work around this by clflushing the
> > > > +	 * corresponding cacheline whenever the caller wants the coherency to
> > > > +	 * be guaranteed. Note that this cacheline is known to be
> > > > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > > > +	 * where we also do a clflush after the write. So this clflush in
> > > > +	 * practice becomes an invalidate operation.
> 
> Did you compare and contrast with the gen6+ w/a? A clflush may just work
> out quicker considering that the posting read would involve a spinlock
> and fw dance.

Actually, I did, but only saw that it works; I didn't benchmark it.
I'd also think that clflush would be faster, since it's only a cache
invalidate at this point. But I will compare the two things now.

> -Chris
> 



* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 14:55         ` Imre Deak
@ 2015-06-10 15:00           ` Ville Syrjälä
  2015-06-10 15:16             ` Imre Deak
  0 siblings, 1 reply; 29+ messages in thread
From: Ville Syrjälä @ 2015-06-10 15:00 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Wed, Jun 10, 2015 at 05:55:24PM +0300, Imre Deak wrote:
> On ke, 2015-06-10 at 15:21 +0100, Chris Wilson wrote:
> > On Wed, Jun 10, 2015 at 05:07:46PM +0300, Imre Deak wrote:
> > > On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> > > > On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > > problem where the seqno written at GPU command completion time is not
> > > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > > seqno and not completing the request (not considering the lost
> > > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > > too. While the cached version of the seqno still showed the stale value
> > > > > the one read via the uncached mapping was the correct one.
> > > > >
> > > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > > following any store of the seqno and preceding any reading of it. When
> > > > > reading it do this only when the caller expects a coherent view.
> > > > >
> > > > > Testcase: igt/store_dword_loop_render
> > > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> > > > >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> > > > >  2 files changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > index 9f5485d..88bc5525 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> > > > >  
> > > > >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> > > > >  {
> > > > > +	/*
> > > > > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > > > > +	 * storing the completed request's seqno occasionally doesn't
> > > > > +	 * invalidate the CPU cache. Work around this by clflushing the
> > > > > +	 * corresponding cacheline whenever the caller wants the coherency to
> > > > > +	 * be guaranteed. Note that this cacheline is known to be
> > > > > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > > > > +	 * where we also do a clflush after the write. So this clflush in
> > > > > +	 * practice becomes an invalidate operation.
> > 
> > Did you compare and contrast with the gen6+ w/a? A clflush may just work
> > out quicker considering that the posting read would involve a spinlock
> > and fw dance.
> 
> Actually, I did, but only saw that it works; I didn't benchmark it.
> I'd also think that clflush would be faster, since it's only a cache
> invalidate at this point. But I will compare the two things now.

If an mmio read fixes it then it doesn't feel like a snoop problem after
all.

-- 
Ville Syrjälä
Intel OTC

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 10:59               ` Imre Deak
@ 2015-06-10 15:10                 ` Jesse Barnes
  2015-06-10 15:26                   ` Imre Deak
  0 siblings, 1 reply; 29+ messages in thread
From: Jesse Barnes @ 2015-06-10 15:10 UTC (permalink / raw)
  To: imre.deak, Dave Gordon; +Cc: intel-gfx

On 06/10/2015 03:59 AM, Imre Deak wrote:
> I think the discussion here is about two separate things:
> 1. Possible ordering issue between the seqno store and the completion
> interrupt
> 2. Coherency issue that leaves the CPU with a stale view of the seqno
> indefinitely, which this patch works around
> 
> I'm confident that in my case the problem is not due to ordering. If it
> was "only" ordering then the value would show up eventually. This is not
> the case though, __wait_for_request will see the stale value
> indefinitely even though it gets woken up periodically afterwards by the
> lost IRQ logic (with hangcheck disabled).

Yeah, based on your workaround it sounds like the write from the CS is
landing in memory but failing to invalidate the associated CPU
cacheline.  I assume mapping the HWSP as uncached also works around this
issue?

I wonder if this is just an issue with GGTT mappings on BXT.  If we had
per-context HWSPs using PPGTT (and maybe even 48 bit PPGTT) mappings,
the snoop may be performed correctly...  Looks like we don't have a
store_dword variant for explicit coherent or incoherent buffer writes
(though it does test PPGTT writes at least); that would make this easy
to test.
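
Something along these lines would let us compare the two views directly
(untested sketch; dev_priv->gtt.mappable_base, status_page.obj and the
mapping helpers are assumed from the current driver, error handling
omitted):

static void check_seqno_coherency(struct intel_engine_cs *ring)
{
	struct drm_i915_private *dev_priv = ring->dev->dev_private;
	u32 __iomem *uc;
	u32 cached, uncached;

	/* Uncached GTT alias of the status page vs the cached kernel map */
	uc = ioremap_nocache(dev_priv->gtt.mappable_base +
			     i915_gem_obj_ggtt_offset(ring->status_page.obj),
			     PAGE_SIZE);
	cached = intel_read_status_page(ring, I915_GEM_HWS_INDEX);
	uncached = readl(uc + I915_GEM_HWS_INDEX);
	WARN(cached != uncached,
	     "stale seqno: cached %u vs uncached %u\n", cached, uncached);
	iounmap(uc);
}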

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:00           ` Ville Syrjälä
@ 2015-06-10 15:16             ` Imre Deak
  2015-06-10 15:35               ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Imre Deak @ 2015-06-10 15:16 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On ke, 2015-06-10 at 18:00 +0300, Ville Syrjälä wrote:
> On Wed, Jun 10, 2015 at 05:55:24PM +0300, Imre Deak wrote:
> > On ke, 2015-06-10 at 15:21 +0100, Chris Wilson wrote:
> > > On Wed, Jun 10, 2015 at 05:07:46PM +0300, Imre Deak wrote:
> > > > On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> > > > > On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > > > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > > > problem where the seqno written at GPU command completion time is not
> > > > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > > > seqno and not completing the request (not considering the lost
> > > > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > > > too. While the cached version of the seqno still showed the stale value
> > > > > > the one read via the uncached mapping was the correct one.
> > > > > >
> > > > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > > > following any store of the seqno and preceding any reading of it. When
> > > > > > reading it do this only when the caller expects a coherent view.
> > > > > >
> > > > > > Testcase: igt/store_dword_loop_render
> > > > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> > > > > >  2 files changed, 24 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > index 9f5485d..88bc5525 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> > > > > >  
> > > > > >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> > > > > >  {
> > > > > > +	/*
> > > > > > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > > > > > +	 * storing the completed request's seqno occasionally doesn't
> > > > > > +	 * invalidate the CPU cache. Work around this by clflushing the
> > > > > > +	 * corresponding cacheline whenever the caller wants the coherency to
> > > > > > +	 * be guaranteed. Note that this cacheline is known to be
> > > > > > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > > > > > +	 * where we also do a clflush after the write. So this clflush in
> > > > > > +	 * practice becomes an invalidate operation.
> > > 
> > > Did you compare and contrast with the gen6+ w/a? A clflush may just work
> > > out quicker considering that the posting read would involve a spinlock
> > > and fw dance.
> > 
> > Actually, I did, but only saw that it works; I didn't benchmark it.
> > I'd also think that clflush would be faster, since it's only a cache
> > invalidate at this point. But I will compare the two things now.
> 
> If an mmio read fixes it then it doesn't feel like a snoop problem after
> all.

Ok, I retract what I just said. I tried now with the patch below and
still see the problem. I must have remembered the testcase where I
created a separate GTT mapping for the status page and read the seqno
from that. Sorry for the confusion.

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9f5485d..36e5fd6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1288,6 +1288,21 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
 
 static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
 {
+	if (!lazy_coherency) {
+		struct drm_i915_private *dev_priv = ring->dev->dev_private;
+		POSTING_READ(RING_ACTHD(ring->mmio_base));
+	}
+
 	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
 }
 



* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:10                 ` Jesse Barnes
@ 2015-06-10 15:26                   ` Imre Deak
  2015-06-10 15:33                     ` Jesse Barnes
  2015-06-10 15:52                     ` Chris Wilson
  0 siblings, 2 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-10 15:26 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: intel-gfx

On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
> On 06/10/2015 03:59 AM, Imre Deak wrote:
> > I think the discussion here is about two separate things:
> > 1. Possible ordering issue between the seqno store and the completion
> > interrupt
> > 2. Coherency issue that leaves the CPU with a stale view of the seqno
> > indefinitely, which this patch works around
> > 
> > I'm confident that in my case the problem is not due to ordering. If it
> > was "only" ordering then the value would show up eventually. This is not
> > the case though, __wait_for_request will see the stale value
> > indefinitely even though it gets woken up periodically afterwards by the
> > lost IRQ logic (with hangcheck disabled).
> 
> Yeah, based on your workaround it sounds like the write from the CS is
> landing in memory but failing to invalidate the associated CPU
> cacheline.  I assume mapping the HWSP as uncached also works around this
> issue?

I assume it would, but it would of course have a bigger overhead. Based
on my testing, the coherency problem happens only occasionally, so for
the rest of the time we would still want to benefit from cached reads.
See especially __i915_spin_request().

> I wonder if this is just an issue with GGTT mappings on BXT.  If we had
> per-context HWSPs using PPGTT (and maybe even 48 bit PPGTT) mappings,
> the snoop may be performed correctly...  Looks like we don't have a
> store_dword variant for explicit coherent or incoherent buffer writes
> (though it does test PPGTT writes at least); that would make this easy
> to test.

Well, I tried to play around with the GGTT PTE/PAT flags to see if
we're missing something, or if there is a bit that is significant
despite the specification. The spec says it's only the snoop flag that
matters and everything maps to PAT index 0. That looks correct in our
code. In any case I would expect MI_STORE_DATA_IMM to be coherent,
since this is explicitly stated in the spec.

I also added a new testcase to gem_storedw_batches_loop that tests the
same thing via PPGTT mappings; it fails the same way.

--Imre


* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:26                   ` Imre Deak
@ 2015-06-10 15:33                     ` Jesse Barnes
  2015-06-10 15:55                       ` Imre Deak
  2015-06-10 15:52                     ` Chris Wilson
  1 sibling, 1 reply; 29+ messages in thread
From: Jesse Barnes @ 2015-06-10 15:33 UTC (permalink / raw)
  To: imre.deak; +Cc: intel-gfx

On 06/10/2015 08:26 AM, Imre Deak wrote:
> On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
>> On 06/10/2015 03:59 AM, Imre Deak wrote:
>>> I think the discussion here is about two separate things:
>>> 1. Possible ordering issue between the seqno store and the completion
>>> interrupt
>>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
>>> indefinitely, which this patch works around
>>>
>>> I'm confident that in my case the problem is not due to ordering. If it
>>> was "only" ordering then the value would show up eventually. This is not
>>> the case though, __wait_for_request will see the stale value
>>> indefinitely even though it gets woken up periodically afterwards by the
>>> lost IRQ logic (with hangcheck disabled).
>>
>> Yeah, based on your workaround it sounds like the write from the CS is
>> landing in memory but failing to invalidate the associated CPU
>> cacheline.  I assume mapping the HWSP as uncached also works around this
>> issue?
> 
> I assume it would, but it would of course have a bigger overhead. Based
> on my testing, the coherency problem happens only occasionally, so for
> the rest of the time we would still want to benefit from cached reads.
> See especially __i915_spin_request().

Yeah, pretty sure we want it cached given how often we read from it.  I
was just curious if the UC mapping would address this, just to narrow
things down even further.

>> I wonder if this is just an issue with GGTT mappings on BXT.  If we had
>> per-context HWSPs using PPGTT (and maybe even 48 bit PPGTT) mappings,
>> the snoop may be performed correctly...  Looks like we don't have a
>> store_dword variant for explicit coherent or incoherent buffer writes
>> (though it does test PPGTT writes at least); that would make this easy
>> to test.
> 
> Well, I tried to play around with the GGTT PTE/PAT flags to see if
> we're missing something, or if there is a bit that is significant
> despite the specification. The spec says it's only the snoop flag that
> matters and everything maps to PAT index 0. That looks correct in our
> code. In any case I would expect MI_STORE_DATA_IMM to be coherent,
> since this is explicitly stated in the spec.
> 
> I also added a new testcase to gem_storedw_batches_loop that tests the
> same thing via PPGTT mappings; it fails the same way.

Ok interesting, so it sounds like a more general problem than just the GGTT.

Thanks,
Jesse

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:16             ` Imre Deak
@ 2015-06-10 15:35               ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-06-10 15:35 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Wed, Jun 10, 2015 at 06:16:20PM +0300, Imre Deak wrote:
> On ke, 2015-06-10 at 18:00 +0300, Ville Syrjälä wrote:
> > On Wed, Jun 10, 2015 at 05:55:24PM +0300, Imre Deak wrote:
> > > On ke, 2015-06-10 at 15:21 +0100, Chris Wilson wrote:
> > > > On Wed, Jun 10, 2015 at 05:07:46PM +0300, Imre Deak wrote:
> > > > > On ti, 2015-06-09 at 11:21 +0300, Jani Nikula wrote:
> > > > > > On Mon, 08 Jun 2015, Imre Deak <imre.deak@intel.com> wrote:
> > > > > > > By running igt/store_dword_loop_render on BXT we can hit a coherency
> > > > > > > problem where the seqno written at GPU command completion time is not
> > > > > > > seen by the CPU. This results in __i915_wait_request seeing the stale
> > > > > > > seqno and not completing the request (not considering the lost
> > > > > > > interrupt/GPU reset mechanism). I also verified that this isn't a case
> > > > > > > of a lost interrupt, or that the command didn't complete somehow: when
> > > > > > > the coherency issue occurred I read the seqno via an uncached GTT mapping
> > > > > > > too. While the cached version of the seqno still showed the stale value
> > > > > > > the one read via the uncached mapping was the correct one.
> > > > > > >
> > > > > > > Work around this issue by clflushing the corresponding CPU cacheline
> > > > > > > following any store of the seqno and preceding any reading of it. When
> > > > > > > reading it do this only when the caller expects a coherent view.
> > > > > > >
> > > > > > > Testcase: igt/store_dword_loop_render
> > > > > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
> > > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
> > > > > > >  2 files changed, 24 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > > index 9f5485d..88bc5525 100644
> > > > > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > > > > @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
> > > > > > >  
> > > > > > >  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
> > > > > > >  {
> > > > > > > +	/*
> > > > > > > +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> > > > > > > +	 * storing the completed request's seqno occasionally doesn't
> > > > > > > +	 * invalidate the CPU cache. Work around this by clflushing the
> > > > > > > +	 * corresponding cacheline whenever the caller wants the coherency to
> > > > > > > +	 * be guaranteed. Note that this cacheline is known to be
> > > > > > > +	 * clean at this point, since we only write it in gen8_set_seqno(),
> > > > > > > +	 * where we also do a clflush after the write. So this clflush in
> > > > > > > +	 * practice becomes an invalidate operation.
> > > > 
> > > > Did you compare and contrast with the gen6+ w/a? A clflush may just work
> > > > out quicker considering that the posting read would involve a spinlock
> > > > and fw dance.
> > > 
> > > Actually, I did, but only saw that it works; I didn't benchmark it.
> > > I'd also think that clflush would be faster, since it's only a cache
> > > invalidate at this point. But I will compare the two things now.
> > 
> > If an mmio read fixes it then it doesn't feel like a snoop problem after
> > all.
> 
> Ok, I retract what I just said. I tried now with the patch below and
> still see the problem. I must have remembered the testcase where I
> created a separate GTT mapping for the status page and read the seqno
> from that. Sorry for the confusion.

Useful to know. Also something else to try is set_pages_array_wc() (or
set_memory_wc()) for our internal mapping of the hws. Though clflush is
likely to be less of a maintenance issue.
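
Something like this, perhaps (untested sketch; assumes the usual sg walk
for the status page object's backing page):

/* Untested sketch: switch the CPU mapping of the status page to
 * write-combining so reads bypass the cacheline entirely. */
struct page *page = sg_page(ring->status_page.obj->pages->sgl);

if (set_pages_array_wc(&page, 1))
	DRM_ERROR("failed to remap HWS page as WC\n");
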
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:26                   ` Imre Deak
  2015-06-10 15:33                     ` Jesse Barnes
@ 2015-06-10 15:52                     ` Chris Wilson
  2015-06-11  8:02                       ` Dave Gordon
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-06-10 15:52 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Wed, Jun 10, 2015 at 06:26:58PM +0300, Imre Deak wrote:
> On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
> > On 06/10/2015 03:59 AM, Imre Deak wrote:
> > > I think the discussion here is about two separate things:
> > > 1. Possible ordering issue between the seqno store and the completion
> > > interrupt
> > > 2. Coherency issue that leaves the CPU with a stale view of the seqno
> > > indefinitely, which this patch works around
> > > 
> > > I'm confident that in my case the problem is not due to ordering. If it
> > > was "only" ordering then the value would show up eventually. This is not
> > > the case though, __wait_for_request will see the stale value
> > > indefinitely even though it gets woken up periodically afterwards by the
> > > lost IRQ logic (with hangcheck disabled).
> > 
> > Yeah, based on your workaround it sounds like the write from the CS is
> > landing in memory but failing to invalidate the associated CPU
> > cacheline.  I assume mapping the HWSP as uncached also works around this
> > issue?
> 
> I assume it would, but it would of course have a bigger overhead. Based
> on my testing, the coherency problem happens only occasionally, so for
> the rest of the time we would still want to benefit from cached reads.
> See especially __i915_spin_request().

Yeah, I am not quite sure about that myself. The reason we don't do
forced coherent reads there is the impact that has with fw and the
snb/ivb/hsw workaround.  If we have a viable strategy that gives us an
accurate seqno at marginal cost, then it will be beneficial to do so
from within the busy-spin - in order to minimise the number of cycles we
spin.

We always spend a jiffie of CPU time, we might as well spend it getting
accurate results (so the coherent read is quick enough and doesn't slow
down the normal case).

I seem to recall that we tried clflushing for gen6+, but we didn't
record any details, so it may be worth retesting ivb with that w/a.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:33                     ` Jesse Barnes
@ 2015-06-10 15:55                       ` Imre Deak
  0 siblings, 0 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-10 15:55 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: intel-gfx

On ke, 2015-06-10 at 08:33 -0700, Jesse Barnes wrote:
> On 06/10/2015 08:26 AM, Imre Deak wrote:
> > On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
> >> On 06/10/2015 03:59 AM, Imre Deak wrote:
> >>> I think the discussion here is about two separate things:
> >>> 1. Possible ordering issue between the seqno store and the completion
> >>> interrupt
> >>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
> >>> indefinitely, which this patch works around
> >>>
> >>> I'm confident that in my case the problem is not due to ordering. If it
> >>> was "only" ordering then the value would show up eventually. This is not
> >>> the case though, __wait_for_request will see the stale value
> >>> indefinitely even though it gets woken up periodically afterwards by the
> >>> lost IRQ logic (with hangcheck disabled).
> >>
> >> Yeah, based on your workaround it sounds like the write from the CS is
> >> landing in memory but failing to invalidate the associated CPU
> >> cacheline.  I assume mapping the HWSP as uncached also works around this
> >> issue?
> > 
> > I assume it would, but it would of course have a bigger overhead. Based
> > on my testing, the coherency problem happens only occasionally, so for
> > the rest of the time we would still want to benefit from cached reads.
> > See especially __i915_spin_request().
> 
> Yeah, pretty sure we want it cached given how often we read from it.  I
> was just curious if the UC mapping would address this, just to narrow
> things down even further.

I must admit that I also remembered our A0 power-on event, where the
Android guys had a similar issue and came up with a solution to map the
status page uncached. At that time we didn't see this problem with the
upstream kernel. I'm not sure what the difference was then or why the
problem is back now; maybe we just didn't run the tests long enough. So
I'm pretty sure it's the same issue.

--Imre


* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-10 15:52                     ` Chris Wilson
@ 2015-06-11  8:02                       ` Dave Gordon
  2015-06-11  8:20                         ` Chris Wilson
  2015-06-11 19:14                         ` Imre Deak
  0 siblings, 2 replies; 29+ messages in thread
From: Dave Gordon @ 2015-06-11  8:02 UTC (permalink / raw)
  To: Chris Wilson, Imre Deak, Jesse Barnes, intel-gfx

On 10/06/15 16:52, Chris Wilson wrote:
> On Wed, Jun 10, 2015 at 06:26:58PM +0300, Imre Deak wrote:
>> On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
>>> On 06/10/2015 03:59 AM, Imre Deak wrote:
>>>> I think the discussion here is about two separate things:
>>>> 1. Possible ordering issue between the seqno store and the completion
>>>> interrupt
>>>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
>>>> indefinitely, which this patch works around
>>>>
>>>> I'm confident that in my case the problem is not due to ordering. If it
>>>> was "only" ordering then the value would show up eventually. This is not
>>>> the case though, __wait_for_request will see the stale value
>>>> indefinitely even though it gets woken up periodically afterwards by the
>>>> lost IRQ logic (with hangcheck disabled).
>>>
>>> Yeah, based on your workaround it sounds like the write from the CS is
>>> landing in memory but failing to invalidate the associated CPU
>>> cacheline.  I assume mapping the HWSP as uncached also works around this
>>> issue?
>>
>> I assume it would, but it would of course have a bigger overhead.

Would it though? We shouldn't be repeatedly reading from the HWSP unless
we're waiting for something to change, in which case we absolutely want
the latest value without any caching whatsoever.

Since the only thing we want to read from the HWSP is the seqno (via
engine->get_seqno()), and that has a 'lazy_coherency' flag, we could
cache the last value read explicitly in the intel_engine_cs and return
that if lazy_coherency is set, and get it directly from the uncached
HWSP only when lazy_coherency is false (i.e. it would become
very-lazy-coherency).
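
Say, something like this (untested sketch; last_seqno and uc_addr -- an
uncached alias of the HWSP -- would both be new fields):

static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
{
	/* Coherent read straight from the uncached alias; lazy readers
	 * just get whatever value we saw last time. */
	if (!lazy_coherency)
		ring->last_seqno = readl(ring->status_page.uc_addr +
					 I915_GEM_HWS_INDEX);

	return ring->last_seqno;
}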

>> Based
>> on my testing, the coherency problem happens only occasionally, so for
>> the rest of the time we would still want to benefit from cached reads.
>> See especially __i915_spin_request().

I would have expected __i915_spin_request() NOT to allow lazy coherency,
at least inside the loop. It could do one pre-check with lazy coherency
on, for the presumably-common case where the request has already
completed, but then use fully uncached reads while looping?
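
i.e. roughly this (untested sketch, reusing the existing
request-completed helper):

static bool __i915_spin_request_coherent(struct drm_i915_gem_request *req)
{
	unsigned long timeout = jiffies + 1;

	/* Lazy pre-check for the common already-completed case */
	if (i915_gem_request_completed(req, true))
		return true;

	/* Then fully coherent reads while we busy-wait, up to a jiffie */
	while (time_before(jiffies, timeout)) {
		if (i915_gem_request_completed(req, false))
			return true;
		cpu_relax();
	}

	return false;
}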

> Yeah, I am not quite sure about that myself. The reason we don't do
> forced coherent reads there is the impact that has with fw and the
> snb/ivb/hsw workaround.  If we have a viable strategy that gives us an
> accurate seqno at marginal cost, then it will be beneficial to do so
> from within the busy-spin - in order to minimise the number of cycles
> we spin.

If we mapped the HWSP uncached, and used a GPU instruction that
guaranteed making the seqno update globally-visible ASAP, we might not
need to read a register at all, and then we wouldn't have the
complications with forcewake :)

> We always spend a jiffie of CPU time, we might as well spend it getting
> accurate results (so the coherent read is quick enough and doesn't slow
> down the normal case).
> 
> I seem to recall that we tried clflushing for gen6+, but we didn't
> record any details, so it may be worth retesting ivb with that w/a.
> -Chris

Is that clflush on the CPU, or MI_CLFLUSH on the GPU? Does the latter
help at all?

.Dave.


* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-11  8:02                       ` Dave Gordon
@ 2015-06-11  8:20                         ` Chris Wilson
  2015-06-11 19:14                         ` Imre Deak
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-06-11  8:20 UTC (permalink / raw)
  To: Dave Gordon; +Cc: intel-gfx

On Thu, Jun 11, 2015 at 09:02:44AM +0100, Dave Gordon wrote:
> On 10/06/15 16:52, Chris Wilson wrote:
> > I seem to recall that we tried clflushing for gen6+, but we didn't
> > record any details, so it may be worth retesting ivb with that w/a.
> 
> Is that clflush on the CPU, or MI_CLFLUSH on the GPU? Does the latter
> help at all?

Only CPU. The w/a is required for all rings, and MI_CLFLUSH is RCS
only.

Anyway, so far using clflush for gen6/ivb/hsw I haven't yet triggered the
failures, but there is a measurable improvement over the I915_READ().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-11  8:02                       ` Dave Gordon
  2015-06-11  8:20                         ` Chris Wilson
@ 2015-06-11 19:14                         ` Imre Deak
  1 sibling, 0 replies; 29+ messages in thread
From: Imre Deak @ 2015-06-11 19:14 UTC (permalink / raw)
  To: Dave Gordon; +Cc: intel-gfx

On Thu, 2015-06-11 at 09:02 +0100, Dave Gordon wrote:
> On 10/06/15 16:52, Chris Wilson wrote:
> > On Wed, Jun 10, 2015 at 06:26:58PM +0300, Imre Deak wrote:
> >> On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
> >>> On 06/10/2015 03:59 AM, Imre Deak wrote:
> >>>> I think the discussion here is about two separate things:
> >>>> 1. Possible ordering issue between the seqno store and the completion
> >>>> interrupt
> >>>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
> >>>> indefinitely, which this patch works around
> >>>>
> >>>> I'm confident that in my case the problem is not due to ordering. If it
> >>>> was "only" ordering then the value would show up eventually. This is not
> >>>> the case though, __wait_for_request will see the stale value
> >>>> indefinitely even though it gets woken up periodically afterwards by the
> >>>> lost IRQ logic (with hangcheck disabled).
> >>>
> >>> Yeah, based on your workaround it sounds like the write from the CS is
> >>> landing in memory but failing to invalidate the associated CPU
> >>> cacheline.  I assume mapping the HWSP as uncached also works around this
> >>> issue?
> >>
> >> I assume it would, but it would of course have a bigger overhead.
> 
> Would it though? We shouldn't be repeatedly reading from the HWSP unless
> we're waiting for something to change, in which case we absolutely want
> the latest value without any caching whatsoever.

Normally on GEN6+ reading the cached value always gives you the very
latest value; that's the whole point of coherency between the GPU and
CPU, and this also holds for BXT. We use the lazy_coherency flag to work
around the relatively rare occasions when this coherency is violated for
a reason that is yet to be found (for example some missing GT workaround
on BXT). I still prefer this form of the workaround, where we do an
uncached read only when it's absolutely necessary, to one where we do
uncached reads all the time. I imagine that doing such uncached reads in
a loop in __i915_spin_request() is definitely worse than cached reads
(and again, these cached reads will give you the same response time as
uncached reads in general).

> Since the only thing we want to read from the HWSP is the seqno (via
> engine->get_seqno()), and that has a 'lazy_coherency' flag, we could
> cache the last value read explicitly in the intel_engine_cs and return
> that if lazy_coherency is set, and get it directly from the uncached
> HWSP only when lazy_coherency is false (i.e. it would become
> very-lazy-coherency).

engine->get_seqno() will give you the very latest value even with
lazy_coherency=true in general. The exception is when the coherency
problem is hit, which should be relatively rare.

> >> Based
> >> on my testing, the coherency problem happens only occasionally, so for
> >> the rest of the time we would still want to benefit from cached reads.
> >> See especially __i915_spin_request().
>
> I would have expected __i915_spin_request() NOT to allow lazy coherency,
> at least inside the loop. It could do one pre-check with lazy coherency
> on, for the presumably-common case where the request has already
> completed, but then use fully uncached reads while looping?

Based on the above, I think we benefit most from coherent cached reads
inside such loops.

--Imre


* Re: [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings
  2015-06-08 16:28 ` [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings Imre Deak
@ 2015-06-13 18:04   ` shuang.he
  0 siblings, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-06-13 18:04 UTC (permalink / raw)
  To: shuang.he, lei.a.liu, intel-gfx, imre.deak

Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6556
-------------------------------------Summary-------------------------------------
Platform   Delta   drm-intel-nightly   Series Applied
PNV                276/276             276/276
ILK                303/303             303/303
SNB                312/312             312/312
IVB                343/343             343/343
BYT                287/287             287/287
BDW                321/321             321/321
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
Note: You need to pay more attention to line start with '*'

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
  2015-06-08 17:08   ` Dave Gordon
  2015-06-09  8:21   ` Jani Nikula
@ 2015-07-01 13:40   ` Mika Kuoppala
  2015-07-01 13:53     ` Mika Kuoppala
  2 siblings, 1 reply; 29+ messages in thread
From: Mika Kuoppala @ 2015-07-01 13:40 UTC (permalink / raw)
  To: Imre Deak, intel-gfx

Imre Deak <imre.deak@intel.com> writes:

> By running igt/store_dword_loop_render on BXT we can hit a coherency
> problem where the seqno written at GPU command completion time is not
> seen by the CPU. This results in __i915_wait_request seeing the stale
> seqno and not completing the request (not considering the lost
> interrupt/GPU reset mechanism). I also verified that this isn't a case
> of a lost interrupt, or that the command didn't complete somehow: when
> the coherency issue occurred I read the seqno via an uncached GTT mapping
> too. While the cached version of the seqno still showed the stale value
> the one read via the uncached mapping was the correct one.
>
> Work around this issue by clflushing the corresponding CPU cacheline
> following any store of the seqno and preceding any reading of it. When
> reading it do this only when the caller expects a coherent view.
>
> Testcase: igt/store_dword_loop_render
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 9f5485d..88bc5525 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
>  
>  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
>  {
> +	/*
> +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
> +	 * storing the completed request's seqno occasionally doesn't
> +	 * invalidate the CPU cache. Work around this by clflushing the
> +	 * corresponding cacheline whenever the caller wants the coherency to
> +	 * be guaranteed. Note that this cacheline is known to be
> +	 * clean at this point, since we only write it in gen8_set_seqno(),
> +	 * where we also do a clflush after the write. So this clflush in
> +	 * practice becomes an invalidate operation.
> +	 */
> +	if (IS_BROXTON(ring->dev) & !lazy_coherency)

s/&/&& ?

-Mika

> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
> +
>  	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
>  }
>  
>  static void gen8_set_seqno(struct intel_engine_cs *ring, u32 seqno)
>  {
>  	intel_write_status_page(ring, I915_GEM_HWS_INDEX, seqno);
> +
> +	/* See gen8_get_seqno() explaining the reason for the clflush. */
> +	if (IS_BROXTON(ring->dev))
> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
>  }
>  
>  static int gen8_emit_request(struct intel_ringbuffer *ringbuf,
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 39f6dfc..224a25b 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -352,6 +352,13 @@ intel_ring_sync_index(struct intel_engine_cs *ring,
>  	return idx;
>  }
>  
> +static inline void
> +intel_flush_status_page(struct intel_engine_cs *ring, int reg)
> +{
> +	drm_clflush_virt_range(&ring->status_page.page_addr[reg],
> +			       sizeof(uint32_t));
> +}
> +
>  static inline u32
>  intel_read_status_page(struct intel_engine_cs *ring,
>  		       int reg)
> -- 
> 2.1.4
>

* Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
  2015-07-01 13:40   ` Mika Kuoppala
@ 2015-07-01 13:53     ` Mika Kuoppala
  0 siblings, 0 replies; 29+ messages in thread
From: Mika Kuoppala @ 2015-07-01 13:53 UTC (permalink / raw)
  To: Imre Deak, intel-gfx

Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:

> Imre Deak <imre.deak@intel.com> writes:
>
>> By running igt/store_dword_loop_render on BXT we can hit a coherency
>> problem where the seqno written at GPU command completion time is not
>> seen by the CPU. This results in __i915_wait_request seeing the stale
>> seqno and not completing the request (not considering the lost
>> interrupt/GPU reset mechanism). I also verified that this isn't a case
>> of a lost interrupt, or that the command didn't complete somehow: when
>> the coherency issue occurred I read the seqno via an uncached GTT mapping
>> too. While the cached version of the seqno still showed the stale value
>> the one read via the uncached mapping was the correct one.
>>
>> Work around this issue by clflushing the corresponding CPU cacheline
>> following any store of the seqno and preceding any reading of it. When
>> reading it do this only when the caller expects a coherent view.
>>
>> Testcase: igt/store_dword_loop_render
>> Signed-off-by: Imre Deak <imre.deak@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_lrc.c        | 17 +++++++++++++++++
>>  drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++++++
>>  2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 9f5485d..88bc5525 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1288,12 +1288,29 @@ static int gen8_emit_flush_render(struct intel_ringbuffer *ringbuf,
>>  
>>  static u32 gen8_get_seqno(struct intel_engine_cs *ring, bool lazy_coherency)
>>  {
>> +	/*
>> +	 * On BXT-A1 there is a coherency issue whereby the MI_STORE_DATA_IMM
>> +	 * storing the completed request's seqno occasionally doesn't
>> +	 * invalidate the CPU cache. Work around this by clflushing the
>> +	 * corresponding cacheline whenever the caller wants the coherency to
>> +	 * be guaranteed. Note that this cacheline is known to be
>> +	 * clean at this point, since we only write it in gen8_set_seqno(),
>> +	 * where we also do a clflush after the write. So this clflush in
>> +	 * practice becomes an invalidate operation.
>> +	 */
>> +	if (IS_BROXTON(ring->dev) & !lazy_coherency)
>
> s/&/&& ?

s//Read The Whole Thread Before Replying

-Mika

> -Mika
>
>> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
>> +
>>  	return intel_read_status_page(ring, I915_GEM_HWS_INDEX);
>>  }
>>  
>>  static void gen8_set_seqno(struct intel_engine_cs *ring, u32 seqno)
>>  {
>>  	intel_write_status_page(ring, I915_GEM_HWS_INDEX, seqno);
>> +
>> +	/* See gen8_get_seqno() explaining the reason for the clflush. */
>> +	if (IS_BROXTON(ring->dev))
>> +		intel_flush_status_page(ring, I915_GEM_HWS_INDEX);
>>  }
>>  
>>  static int gen8_emit_request(struct intel_ringbuffer *ringbuf,
>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> index 39f6dfc..224a25b 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> @@ -352,6 +352,13 @@ intel_ring_sync_index(struct intel_engine_cs *ring,
>>  	return idx;
>>  }
>>  
>> +static inline void
>> +intel_flush_status_page(struct intel_engine_cs *ring, int reg)
>> +{
>> +	drm_clflush_virt_range(&ring->status_page.page_addr[reg],
>> +			       sizeof(uint32_t));
>> +}
>> +
>>  static inline u32
>>  intel_read_status_page(struct intel_engine_cs *ring,
>>  		       int reg)
>> -- 
>> 2.1.4
>>

end of thread

Thread overview: 29+ messages
2015-06-08 16:28 [PATCH 0/2] drm/i915/bxt: work around HW coherency issue Imre Deak
2015-06-08 16:28 ` [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno Imre Deak
2015-06-08 17:08   ` Dave Gordon
2015-06-08 17:12     ` Chris Wilson
2015-06-08 17:34       ` Ville Syrjälä
2015-06-08 18:00         ` Chris Wilson
2015-06-08 18:40           ` Ville Syrjälä
2015-06-08 19:33             ` Dave Gordon
2015-06-10 10:59               ` Imre Deak
2015-06-10 15:10                 ` Jesse Barnes
2015-06-10 15:26                   ` Imre Deak
2015-06-10 15:33                     ` Jesse Barnes
2015-06-10 15:55                       ` Imre Deak
2015-06-10 15:52                     ` Chris Wilson
2015-06-11  8:02                       ` Dave Gordon
2015-06-11  8:20                         ` Chris Wilson
2015-06-11 19:14                         ` Imre Deak
2015-06-08 17:14     ` Imre Deak
2015-06-09  8:21   ` Jani Nikula
2015-06-10 14:07     ` Imre Deak
2015-06-10 14:21       ` Chris Wilson
2015-06-10 14:55         ` Imre Deak
2015-06-10 15:00           ` Ville Syrjälä
2015-06-10 15:16             ` Imre Deak
2015-06-10 15:35               ` Chris Wilson
2015-07-01 13:40   ` Mika Kuoppala
2015-07-01 13:53     ` Mika Kuoppala
2015-06-08 16:28 ` [PATCH 2/2] drm/i915/bxt: work around HW coherency issue for cached GEM mappings Imre Deak
2015-06-13 18:04   ` shuang.he
