* [PATCH v6 0/6] Add Per-context WA using WA batch buffers
@ 2015-06-19 17:37 Arun Siluvery
  2015-06-19 17:37 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize " Arun Siluvery
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

From Gen8 onwards we have some workarounds that are applied per context,
using special batch buffers called WA batch buffers. HW executes them at
specific stages during context save/restore.
The patches in this series add this framework to i915.

I did some basic testing on BDW by running glmark2 and didn't see any issues.
These WA are mainly required when preemption is enabled.

[v1] http://lists.freedesktop.org/archives/intel-gfx/2015-February/060707.html
[v2] http://www.spinics.net/lists/intel-gfx/msg67804.html

[v3] In v2, two separate ring_buffer objects were used to load WA instructions
and they were part of every context, which is not really required.
Chris suggested a better approach of adding a page to the context itself and
using it for this purpose. Since GuC is also planning to do the same, it could
probably be shared with GuC. But after discussions it was agreed to use an
independent page as the GuC area might grow in the future. An independent page
also makes sense because these WA are only initialized once and not changed
afterwards, so we can share them across all contexts.

[v4] Changes in this revision:
In the previous version the size of the batch buffers was fixed during
initialization, which is not a good idea. This is corrected by updating the
functions that load WA to return the number of dwords written; the caller
updates the size once all WA are initialized. The functions now also accept an
offset field, which allows us to have multiple batches so that the required
batch can be selected based on some criterion. This is not a requirement at
this point but could be useful in future.
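The calling convention described above can be sketched in plain C; the names,
sizes, and padding policy below are illustrative, not the driver's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_DWORDS      1024u	/* one 4 KiB page of 32-bit commands */
#define CACHELINE_DWORDS 16u	/* 64-byte cacheline */
#define MI_NOOP          0u

/*
 * Hypothetical loader following the pattern above: it emits commands
 * at *offset, pads to a cacheline boundary, and reports the updated
 * offset back to the caller, so the total batch size is only known
 * after all WA are emitted.
 */
int emit_wa_batch(uint32_t *page, uint32_t *offset,
		  const uint32_t *cmds, uint32_t count)
{
	uint32_t index = *offset;

	if (index + count > PAGE_DWORDS)
		return -1;	/* would overflow the shared page */

	memcpy(&page[index], cmds, count * sizeof(*cmds));
	index += count;

	while (index % CACHELINE_DWORDS)	/* pad to end of cacheline */
		page[index++] = MI_NOOP;

	*offset = index;
	return 0;
}
```

Because the caller keeps a single running offset, a second batch placed after
the first starts on a cacheline boundary automatically.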

The WaFlushCoherentL3CacheLinesAtContextSwitch implementation was incomplete;
this is fixed and the programming restrictions are correctly applied.

http://www.spinics.net/lists/intel-gfx/msg68947.html

[v5] No major changes in this revision, but I switched to a new revision as the
changes affected all patches. Introduced a macro to add commands, which also
checks for page overflow. Moved code around to simplify, fixed indentation and
made other improvements suggested by Chris.

Since we don't know the number of WA applied upfront, Chris suggested a
two-pass approach, but that brings additional complexity which is not
necessary. Discussed with Chris and agreed on a single-page setup, as simpler
code wins and a single page is sufficient for our requirements.

http://www.spinics.net/lists/intel-gfx/msg69176.html

[v6] Changes made to make the code easy to modify for future Gens. It is much
neater and more contained now; big thanks to Chris Wilson for the review.
Since these changes affect all patches I am sending this as a separate revision.

Please see the patches for more details.

Arun Siluvery (6):
  drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  drm/i915/gen8: Re-order init pipe_control in lrc mode
  drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround
  drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch
    workaround
  drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround

 drivers/gpu/drm/i915/i915_reg.h         |  32 +++-
 drivers/gpu/drm/i915/intel_lrc.c        | 328 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  21 ++
 3 files changed, 374 insertions(+), 7 deletions(-)

-- 
2.3.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-19 17:50   ` Chris Wilson
  2015-06-19 17:37 ` [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode Arun Siluvery
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

Some of the WA must be applied during context save but before restore, and
some at the end of context save/restore but before executing the instructions
in the ring. These WA cannot be applied using the normal means, so WA batch
buffers are created for this purpose. Each context has two registers that hold
the offsets of these batch buffers. If they are non-zero, HW understands that
it needs to execute these batches.

v1: In this version two separate ring_buffer objects were used to load WA
instructions for indirect and per context batch buffers and they were part
of every context.

v2: Chris suggested including an additional page in the context and using it
to load these WA instead of creating separate objects. This simplifies a lot
of things as we need not explicitly pin/unpin them. Thomas Daniel further
pointed out that GuC is planning to use a similar setup to share data between
GuC and the driver, and the WA batch buffers could probably share that page.
However, after discussions with Dave, who is implementing the GuC changes, he
suggested using an independent page for these reasons: the GuC area might
grow, and these WA are initialized only once and not changed afterwards, so we
can share them across all contexts.

The page is updated with WA during render ring init. This has the advantage of
not adding more special cases to default_context.

We don't know upfront the number of WA we will apply using these batch buffers.
For this reason the size was fixed in earlier revisions, which is not a good
idea. To fix this, the functions that load instructions are modified to report
the number of commands inserted, and the size is now calculated after the
batch is updated. A macro is introduced to add commands to these batch
buffers; it also checks for overflow and returns an error.
We have a full page dedicated to these WA, which should be sufficient for a
good number of WA; anything more means we have major issues.
The list for Gen8 is small, as it is for Gen9; maybe a few more get added
going forward, but nowhere near enough to fill the entire page. Chris suggested
a two-pass approach, but we agreed to go with the single-page setup as it is a
one-off routine and simpler code wins.

One additional option is the offset field, which is helpful if we would like
to have multiple batches at different offsets within the page and select them
based on some criterion. This is not a requirement at this point but could
help in future (Dave).

Chris provided some helpful macros and suggestions which further simplified
the code; they will also help in reducing code duplication when WA for other
Gens are added. Detailed comments are added to explain the restrictions.

(Many thanks to Chris, Dave and Thomas for their reviews and inputs)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 222 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  21 +++
 2 files changed, 239 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0413b8f..c9255fc 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -211,6 +211,7 @@ enum {
 	FAULT_AND_CONTINUE /* Unsupported */
 };
 #define GEN8_CTX_ID_SHIFT 32
+#define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
 
 static int intel_lr_context_pin(struct intel_engine_cs *ring,
 		struct intel_context *ctx);
@@ -1077,6 +1078,190 @@ static int intel_logical_ring_workarounds_emit(struct intel_engine_cs *ring,
 	return 0;
 }
 
+#define wa_ctx_emit(batch, cmd) {	\
+		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
+			return -ENOSPC;					\
+		}							\
+		batch[index++] = (cmd);					\
+	}
+
+static inline uint32_t wa_ctx_start(struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t offset,
+				    uint32_t start_alignment)
+{
+	return wa_ctx->offset = ALIGN(offset, start_alignment);
+}
+
+static inline int wa_ctx_end(struct i915_wa_ctx_bb *wa_ctx,
+			     uint32_t offset,
+			     uint32_t size_alignment)
+{
+	wa_ctx->size = offset - wa_ctx->offset;
+
+	WARN(wa_ctx->size % size_alignment,
+	     "wa_ctx_bb failed sanity checks: size %d is not aligned to %d\n",
+	     wa_ctx->size, size_alignment);
+	return 0;
+}
+
+/**
+ * gen8_init_indirectctx_bb() - initialize indirect ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *  offset: specifies start of the batch, should be cache-aligned. This is updated
+ *    with the offset value received as input.
+ *  size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @batch: page in which WA are loaded
+ * @offset: This field specifies the start of the batch, it should be
+ *  cache-aligned otherwise it is adjusted accordingly.
+ *  Typically we only have one indirect_ctx and per_ctx batch buffer which are
+ *  initialized at the beginning and shared across all contexts but this field
+ *  helps us to have multiple batches at different offsets and select them based
+ *  on a criteria. At the moment this batch always start at the beginning of the page
+ *  and at this point we don't have multiple wa_ctx batch buffers.
+ *
+ *  The number of WA applied is not known at the beginning; we use this field
+ *  to return the number of DWORDS written.
+ *
+ *  It is to be noted that this batch does not contain MI_BATCH_BUFFER_END
+ *  so it adds NOOPs as padding to make it cacheline aligned.
+ *  MI_BATCH_BUFFER_END will be added to perctx batch and both of them together
+ *  makes a complete batch buffer.
+ *
+ * Return: non-zero if we exceed the PAGE_SIZE limit.
+ */
+
+static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
+				    struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t *const batch,
+				    uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	/* FIXME: Replace me with WA */
+	wa_ctx_emit(batch, MI_NOOP);
+
+	/* Pad to end of cacheline */
+	while (index % CACHELINE_DWORDS)
+		wa_ctx_emit(batch, MI_NOOP);
+
+	/*
+	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
+	 * execution depends on the length specified in terms of cache lines
+	 * in the register CTX_RCS_INDIRECT_CTX
+	 */
+
+	return wa_ctx_end(wa_ctx, *offset = index, CACHELINE_DWORDS);
+}
+
+/**
+ * gen8_init_perctx_bb() - initialize per ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *  offset: specifies start of the batch, should be cache-aligned.
+ *  size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @offset: This field specifies the start of this batch.
+ *   This batch is started immediately after indirect_ctx batch. Since we ensure
+ *   that indirect_ctx ends on a cacheline this batch is aligned automatically.
+ *
+ *   The number of DWORDS written are returned using this field.
+ *
+ *  This batch is terminated with MI_BATCH_BUFFER_END and so we need not add padding
+ *  to align it with cacheline as padding after MI_BATCH_BUFFER_END is redundant.
+ */
+static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
+			       struct i915_wa_ctx_bb *wa_ctx,
+			       uint32_t *const batch,
+			       uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
+
+	return wa_ctx_end(wa_ctx, *offset = index, 1);
+}
+
+static int lrc_setup_wa_ctx_obj(struct intel_engine_cs *ring, u32 size)
+{
+	int ret;
+
+	ring->wa_ctx.obj = i915_gem_alloc_object(ring->dev, PAGE_ALIGN(size));
+	if (!ring->wa_ctx.obj) {
+		DRM_DEBUG_DRIVER("alloc LRC WA ctx backing obj failed.\n");
+		return -ENOMEM;
+	}
+
+	ret = i915_gem_obj_ggtt_pin(ring->wa_ctx.obj, PAGE_SIZE, 0);
+	if (ret) {
+		DRM_DEBUG_DRIVER("pin LRC WA ctx backing obj failed: %d\n",
+				 ret);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void lrc_destroy_wa_ctx_obj(struct intel_engine_cs *ring)
+{
+	if (ring->wa_ctx.obj) {
+		i915_gem_object_ggtt_unpin(ring->wa_ctx.obj);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		ring->wa_ctx.obj = NULL;
+	}
+}
+
+static int intel_init_workaround_bb(struct intel_engine_cs *ring)
+{
+	int ret;
+	uint32_t *batch;
+	uint32_t offset;
+	struct page *page;
+	struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+
+	WARN_ON(ring->id != RCS);
+
+	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
+	if (ret) {
+		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
+		return ret;
+	}
+
+	page = i915_gem_object_get_page(wa_ctx->obj, 0);
+	batch = kmap_atomic(page);
+	offset = 0;
+
+	if (INTEL_INFO(ring->dev)->gen == 8) {
+		ret = gen8_init_indirectctx_bb(ring,
+					       &wa_ctx->indirect_ctx,
+					       batch,
+					       &offset);
+		if (ret)
+			goto out;
+
+		ret = gen8_init_perctx_bb(ring,
+					  &wa_ctx->per_ctx,
+					  batch,
+					  &offset);
+		if (ret)
+			goto out;
+	} else {
+		WARN(INTEL_INFO(ring->dev)->gen >= 8,
+		     "WA batch buffer is not initialized for Gen%d\n",
+		     INTEL_INFO(ring->dev)->gen);
+		lrc_destroy_wa_ctx_obj(ring);
+	}
+
+out:
+	kunmap_atomic(batch);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
+}
+
 static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
@@ -1411,6 +1596,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		kunmap(sg_page(ring->status_page.obj->pages->sgl));
 		ring->status_page.obj = NULL;
 	}
+
+	lrc_destroy_wa_ctx_obj(ring);
 }
 
 static int logical_ring_init(struct drm_device *dev, struct intel_engine_cs *ring)
@@ -1474,7 +1661,22 @@ static int logical_render_ring_init(struct drm_device *dev)
 	if (ret)
 		return ret;
 
-	return intel_init_pipe_control(ring);
+	ret = intel_init_workaround_bb(ring);
+	if (ret) {
+		/*
+		 * We continue even if we fail to initialize WA batch
+		 * because we only expect rare glitches but nothing
+		 * critical to prevent us from using GPU
+		 */
+		DRM_ERROR("WA batch buffer initialization failed: %d\n",
+			  ret);
+	}
+
+	ret = intel_init_pipe_control(ring);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
 }
 
 static int logical_bsd_ring_init(struct drm_device *dev)
@@ -1754,15 +1956,27 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 	reg_state[CTX_SECOND_BB_STATE] = ring->mmio_base + 0x118;
 	reg_state[CTX_SECOND_BB_STATE+1] = 0;
 	if (ring->id == RCS) {
-		/* TODO: according to BSpec, the register state context
-		 * for CHV does not have these. OTOH, these registers do
-		 * exist in CHV. I'm waiting for a clarification */
 		reg_state[CTX_BB_PER_CTX_PTR] = ring->mmio_base + 0x1c0;
 		reg_state[CTX_BB_PER_CTX_PTR+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX] = ring->mmio_base + 0x1c4;
 		reg_state[CTX_RCS_INDIRECT_CTX+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET] = ring->mmio_base + 0x1c8;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] = 0;
+		if (ring->wa_ctx.obj) {
+			struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+			uint32_t ggtt_offset = i915_gem_obj_ggtt_offset(wa_ctx->obj);
+
+			reg_state[CTX_RCS_INDIRECT_CTX+1] =
+				(ggtt_offset + wa_ctx->indirect_ctx.offset * sizeof(uint32_t)) |
+				(wa_ctx->indirect_ctx.size / CACHELINE_DWORDS);
+
+			reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] =
+				CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
+
+			reg_state[CTX_BB_PER_CTX_PTR+1] =
+				(ggtt_offset + wa_ctx->per_ctx.offset * sizeof(uint32_t)) |
+				0x01;
+		}
 	}
 	reg_state[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9);
 	reg_state[CTX_LRI_HEADER_1] |= MI_LRI_FORCE_POSTED;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39f6dfc..06467c6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -12,6 +12,7 @@
  * workarounds!
  */
 #define CACHELINE_BYTES 64
+#define CACHELINE_DWORDS (CACHELINE_BYTES / sizeof(uint32_t))
 
 /*
  * Gen2 BSpec "1. Programming Environment" / 1.4.4.6 "Ring Buffer Use"
@@ -119,6 +120,25 @@ struct intel_ringbuffer {
 
 struct	intel_context;
 
+/*
+ * we use a single page to load ctx workarounds so all of these
+ * values are referred in terms of dwords
+ *
+ * struct i915_wa_ctx_bb:
+ *  offset: specifies batch starting position, also helpful in case
+ *    if we want to have multiple batches at different offsets based on
+ *    some criteria. It is not a requirement at the moment but provides
+ *    an option for future use.
+ *  size: size of the batch in DWORDS
+ */
+struct  i915_ctx_workarounds {
+	struct i915_wa_ctx_bb {
+		u32 offset;
+		u32 size;
+	} indirect_ctx, per_ctx;
+	struct drm_i915_gem_object *obj;
+};
+
 struct  intel_engine_cs {
 	const char	*name;
 	enum intel_ring_id {
@@ -142,6 +162,7 @@ struct  intel_engine_cs {
 	struct i915_gem_batch_pool batch_pool;
 
 	struct intel_hw_status_page status_page;
+	struct i915_ctx_workarounds wa_ctx;
 
 	unsigned irq_refcount; /* protected by dev_priv->irq_lock */
 	u32		irq_enable_mask;	/* bitmask to enable ring interrupt */
-- 
2.3.0


* [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
  2015-06-19 17:37 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize " Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-19 17:58   ` Chris Wilson
  2015-06-19 17:37 ` [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround Arun Siluvery
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

Some of the WA applied using WA batch buffers perform writes to the scratch
page. In the current flow, WA are initialized before the scratch obj is
allocated. This patch reorders intel_init_pipe_control() so that we have a
valid scratch obj before we initialize the WA.

v2: Check for a valid scratch page before initializing WA, as some of them
perform writes to it.
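The reordering can be modelled as a toy in plain C; all names here are ours,
not the driver's:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: the scratch page must exist before the WA batch is
 * built, because some workarounds write to it. */
struct toy_ring {
	bool scratch_valid;
	bool wa_initialized;
};

int toy_init_pipe_control(struct toy_ring *ring)
{
	ring->scratch_valid = true;	/* allocates the scratch page */
	return 0;
}

int toy_init_workaround_bb(struct toy_ring *ring)
{
	if (!ring->scratch_valid)	/* the v2 check: refuse without scratch */
		return -EINVAL;
	ring->wa_initialized = true;
	return 0;
}

/* Patch order: pipe control (scratch) first, then the WA batch. */
int toy_render_ring_init(struct toy_ring *ring)
{
	int ret = toy_init_pipe_control(ring);
	if (ret)
		return ret;
	return toy_init_workaround_bb(ring);
}
```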

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index c9255fc..0d350f6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1223,6 +1223,12 @@ static int intel_init_workaround_bb(struct intel_engine_cs *ring)
 
 	WARN_ON(ring->id != RCS);
 
+	/* some WA perform writes to scratch page, ensure it is valid */
+	if (ring->scratch.obj == NULL) {
+		DRM_ERROR("scratch page not allocated for %s\n", ring->name);
+		return -EINVAL;
+	}
+
 	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
 	if (ret) {
 		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
@@ -1657,7 +1663,8 @@ static int logical_render_ring_init(struct drm_device *dev)
 	ring->emit_bb_start = gen8_emit_bb_start;
 
 	ring->dev = dev;
-	ret = logical_ring_init(dev, ring);
+
+	ret = intel_init_pipe_control(ring);
 	if (ret)
 		return ret;
 
@@ -1672,9 +1679,10 @@ static int logical_render_ring_init(struct drm_device *dev)
 			  ret);
 	}
 
-	ret = intel_init_pipe_control(ring);
-	if (ret)
+	ret = logical_ring_init(dev, ring);
+	if (ret) {
 		lrc_destroy_wa_ctx_obj(ring);
+	}
 
 	return ret;
 }
-- 
2.3.0


* [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
  2015-06-19 17:37 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize " Arun Siluvery
  2015-06-19 17:37 ` [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-19 18:11   ` Chris Wilson
  2015-06-19 17:37 ` [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround Arun Siluvery
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

In Indirect and Per context w/a batch buffer,
+WaDisableCtxRestoreArbitration

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0d350f6..62c7eeb 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1139,8 +1139,8 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 {
 	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
 
-	/* FIXME: Replace me with WA */
-	wa_ctx_emit(batch, MI_NOOP);
+	/* WaDisableCtxRestoreArbitration:bdw,chv */
+	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_DISABLE);
 
 	/* Pad to end of cacheline */
 	while (index % CACHELINE_DWORDS)
@@ -1178,6 +1178,9 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
 {
 	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
 
+	/* WaDisableCtxRestoreArbitration:bdw,chv */
+	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
+
 	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
 
 	return wa_ctx_end(wa_ctx, *offset = index, 1);
-- 
2.3.0


* [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
                   ` (2 preceding siblings ...)
  2015-06-19 17:37 ` [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-19 18:12   ` Chris Wilson
  2015-06-19 17:37 ` [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround Arun Siluvery
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

In Indirect context w/a batch buffer,
+WaFlushCoherentL3CacheLinesAtContextSwitch:bdw

v2: Add LRI commands to set/reset the bit that invalidates coherent lines,
update the WA to include the programming restrictions, and exclude CHV as it
is not required there (Ville).

v3: Avoid an unnecessary read-back; the register only needs to be read once
and the value reused (Chris).
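The emitted set/flush/clear sequence can be sketched as a standalone C
function. The encodings are transcribed from i915_reg.h conventions and the
patch below; treat the sketch as illustrative rather than the driver's code:

```c
#include <assert.h>
#include <stdint.h>

#define MI_INSTR(opcode, flags)	(((uint32_t)(opcode) << 23) | (flags))
#define MI_LOAD_REGISTER_IMM(x)	MI_INSTR(0x22, 2*(x)-1)
#define GFX_OP_PIPE_CONTROL(len) ((0x3u<<29)|(0x3u<<27)|(0x2u<<24)|((len)-2))
#define PIPE_CONTROL_CS_STALL		(1u<<20)
#define PIPE_CONTROL_DC_FLUSH_ENABLE	(1u<<5)
#define GEN8_L3SQCREG4			0xb118u
#define GEN8_LQSC_FLUSH_COHERENT_LINES	(1u<<21)

/* Emits the 12-dword sequence: LRI to set the flush bit, a stalling
 * DC-flush PIPE_CONTROL, then LRI to restore the original value.
 * The register is read once by the caller (the v3 point) and the
 * current value is passed in. */
uint32_t emit_l3sqc4_flush(uint32_t *batch, uint32_t l3sqc4_current)
{
	uint32_t i = 0;
	uint32_t set = l3sqc4_current | GEN8_LQSC_FLUSH_COHERENT_LINES;

	batch[i++] = MI_LOAD_REGISTER_IMM(1);
	batch[i++] = GEN8_L3SQCREG4;
	batch[i++] = set;

	batch[i++] = GFX_OP_PIPE_CONTROL(6);
	batch[i++] = PIPE_CONTROL_CS_STALL | PIPE_CONTROL_DC_FLUSH_ENABLE;
	batch[i++] = 0;
	batch[i++] = 0;
	batch[i++] = 0;
	batch[i++] = 0;

	batch[i++] = MI_LOAD_REGISTER_IMM(1);
	batch[i++] = GEN8_L3SQCREG4;
	batch[i++] = set & ~GEN8_LQSC_FLUSH_COHERENT_LINES;

	return i;	/* dwords written */
}
```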

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h  |  2 ++
 drivers/gpu/drm/i915/intel_lrc.c | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 84af255..d14ad20 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -426,6 +426,7 @@
 #define   PIPE_CONTROL_INDIRECT_STATE_DISABLE		(1<<9)
 #define   PIPE_CONTROL_NOTIFY				(1<<8)
 #define   PIPE_CONTROL_FLUSH_ENABLE			(1<<7) /* gen7+ */
+#define   PIPE_CONTROL_DC_FLUSH_ENABLE			(1<<5)
 #define   PIPE_CONTROL_VF_CACHE_INVALIDATE		(1<<4)
 #define   PIPE_CONTROL_CONST_CACHE_INVALIDATE		(1<<3)
 #define   PIPE_CONTROL_STATE_CACHE_INVALIDATE		(1<<2)
@@ -5788,6 +5789,7 @@ enum skl_disp_power_wells {
 
 #define GEN8_L3SQCREG4				0xb118
 #define  GEN8_LQSC_RO_PERF_DIS			(1<<27)
+#define  GEN8_LQSC_FLUSH_COHERENT_LINES		(1<<21)
 
 /* GEN8 chicken */
 #define HDC_CHICKEN0				0x7300
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 62c7eeb..3e7aaa9 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1142,6 +1142,29 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 	/* WaDisableCtxRestoreArbitration:bdw,chv */
 	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_DISABLE);
 
+	/* WaFlushCoherentL3CacheLinesAtContextSwitch:bdw */
+	if (IS_BROADWELL(ring->dev)) {
+		struct drm_i915_private *dev_priv = to_i915(ring->dev);
+		uint32_t l3sqc4_flush = (I915_READ(GEN8_L3SQCREG4) |
+					 GEN8_LQSC_FLUSH_COHERENT_LINES);
+
+		wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
+		wa_ctx_emit(batch, GEN8_L3SQCREG4);
+		wa_ctx_emit(batch, l3sqc4_flush);
+
+		wa_ctx_emit(batch, GFX_OP_PIPE_CONTROL(6));
+		wa_ctx_emit(batch, (PIPE_CONTROL_CS_STALL |
+				    PIPE_CONTROL_DC_FLUSH_ENABLE));
+		wa_ctx_emit(batch, 0);
+		wa_ctx_emit(batch, 0);
+		wa_ctx_emit(batch, 0);
+		wa_ctx_emit(batch, 0);
+
+		wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
+		wa_ctx_emit(batch, GEN8_L3SQCREG4);
+		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
+	}
+
 	/* Pad to end of cacheline */
 	while (index % CACHELINE_DWORDS)
 		wa_ctx_emit(batch, MI_NOOP);
-- 
2.3.0


* [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
                   ` (3 preceding siblings ...)
  2015-06-19 17:37 ` [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-19 18:09   ` Chris Wilson
  2015-06-23 14:46   ` [PATCH v6 5/5] " Arun Siluvery
  2015-06-19 17:37 ` [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround Arun Siluvery
  2015-06-19 18:07 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers Arun Siluvery
  6 siblings, 2 replies; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

In Indirect context w/a batch buffer,
WaClearSlmSpaceAtContextSwitch

This WA performs writes to the scratch page, so it must be valid; this check
is performed before initializing the batch with this WA.

v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
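The address the PIPE_CONTROL QW write targets can be sketched as below; the
constants follow the patch and i915_reg.h, but the function name is ours:

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE_BYTES		64u
#define PIPE_CONTROL_GLOBAL_GTT	(1u<<2)	/* encoding per i915_reg.h */

/* The actual write lands 2 cachelines (128 bytes) into the scratch
 * page, and the low bit selects global GTT address space; the page is
 * at least 4K-aligned so the low bits of the offset are free. */
uint32_t wa_scratch_addr(uint32_t scratch_gtt_offset)
{
	return (scratch_gtt_offset + 2 * CACHELINE_BYTES) |
	       PIPE_CONTROL_GLOBAL_GTT;
}
```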

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h  |  1 +
 drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index d14ad20..7637e64 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -410,6 +410,7 @@
 #define   DISPLAY_PLANE_A           (0<<20)
 #define   DISPLAY_PLANE_B           (1<<20)
 #define GFX_OP_PIPE_CONTROL(len)	((0x3<<29)|(0x3<<27)|(0x2<<24)|(len-2))
+#define   PIPE_CONTROL_FLUSH_L3				(1<<27)
 #define   PIPE_CONTROL_GLOBAL_GTT_IVB			(1<<24) /* gen7+ */
 #define   PIPE_CONTROL_MMIO_WRITE			(1<<23)
 #define   PIPE_CONTROL_STORE_DATA_INDEX			(1<<21)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3e7aaa9..664455c 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1137,6 +1137,7 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 				    uint32_t *const batch,
 				    uint32_t *offset)
 {
+	uint32_t scratch_addr;
 	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
 
 	/* WaDisableCtxRestoreArbitration:bdw,chv */
@@ -1165,6 +1166,21 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
 	}
 
+	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
+	/* Actual scratch location is at 128 bytes offset */
+	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
+	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
+
+	wa_ctx_emit(batch, GFX_OP_PIPE_CONTROL(6));
+	wa_ctx_emit(batch, (PIPE_CONTROL_FLUSH_L3 |
+			    PIPE_CONTROL_GLOBAL_GTT_IVB |
+			    PIPE_CONTROL_CS_STALL |
+			    PIPE_CONTROL_QW_WRITE));
+	wa_ctx_emit(batch, scratch_addr);
+	wa_ctx_emit(batch, 0);
+	wa_ctx_emit(batch, 0);
+	wa_ctx_emit(batch, 0);
+
 	/* Pad to end of cacheline */
 	while (index % CACHELINE_DWORDS)
 		wa_ctx_emit(batch, MI_NOOP);
-- 
2.3.0


* [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
                   ` (4 preceding siblings ...)
  2015-06-19 17:37 ` [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround Arun Siluvery
@ 2015-06-19 17:37 ` Arun Siluvery
  2015-06-22 11:30   ` Siluvery, Arun
  2015-06-22 16:21   ` Ville Syrjälä
  2015-06-19 18:07 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers Arun Siluvery
  6 siblings, 2 replies; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 17:37 UTC (permalink / raw)
  To: intel-gfx

In Per context w/a batch buffer,
WaRsRestoreWithPerCtxtBb

This WA performs writes to the scratch page, so it must be valid; this check
is performed before initializing the batch with this WA.

v2: This patch modifies the definitions of MI_LOAD_REGISTER_MEM and
MI_LOAD_REGISTER_REG; add GEN8-specific defines for these instructions
so as to not break any future users of the existing definitions (Michel).

v3: The length in the current definitions of the LRM and LRR instructions was
specified as 0. This seems to be a common convention for instructions whose
length varies between platforms. It has not been an issue so far because they
were not used anywhere except the command parser; now that we use them in this
patch, update them with the correct length and move them out of the command
parser placeholder to the appropriate place. Also remove unnecessary padding
and follow the WA programming sequence exactly as mentioned in the spec, which
is essential for this WA (Dave).
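The length convention can be illustrated with the MI_INSTR() encoding used in
i915_reg.h: the opcode sits in bits 31:23 and the low bits carry "total dwords
minus 2". The instruction values below are transcribed from the patch; the
helper is ours:

```c
#include <assert.h>
#include <stdint.h>

#define MI_INSTR(opcode, flags)	(((uint32_t)(opcode) << 23) | (flags))

#define MI_LOAD_REGISTER_MEM		MI_INSTR(0x29, 1) /* 3 dwords: 32-bit address */
#define MI_LOAD_REGISTER_MEM_GEN8	MI_INSTR(0x29, 2) /* 4 dwords: 64-bit address */
#define MI_LOAD_REGISTER_REG		MI_INSTR(0x2A, 1) /* 3 dwords */

/* Decode the total instruction length; the length field is dwords - 2,
 * which is why a placeholder length of 0 still parses but emits a
 * wrongly-sized instruction if executed. */
uint32_t mi_total_dwords(uint32_t instr)
{
	return (instr & 0xffu) + 2;
}
```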

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h  | 29 +++++++++++++++++++--
 drivers/gpu/drm/i915/intel_lrc.c | 54 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 7637e64..208620d 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -347,6 +347,31 @@
 #define   MI_INVALIDATE_BSD		(1<<7)
 #define   MI_FLUSH_DW_USE_GTT		(1<<2)
 #define   MI_FLUSH_DW_USE_PPGTT		(0<<2)
+#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 1)
+#define MI_LOAD_REGISTER_MEM_GEN8 MI_INSTR(0x29, 2)
+#define   MI_LRM_USE_GLOBAL_GTT (1<<22)
+#define   MI_LRM_ASYNC_MODE_ENABLE (1<<21)
+#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 1)
+#define MI_ATOMIC(len)	MI_INSTR(0x2F, (len-2))
+#define   MI_ATOMIC_MEMORY_TYPE_GGTT	(1<<22)
+#define   MI_ATOMIC_INLINE_DATA		(1<<18)
+#define   MI_ATOMIC_CS_STALL		(1<<17)
+#define   MI_ATOMIC_RETURN_DATA_CTL	(1<<16)
+#define MI_ATOMIC_OP_MASK(op)  ((op) << 8)
+#define MI_ATOMIC_AND	MI_ATOMIC_OP_MASK(0x01)
+#define MI_ATOMIC_OR	MI_ATOMIC_OP_MASK(0x02)
+#define MI_ATOMIC_XOR	MI_ATOMIC_OP_MASK(0x03)
+#define MI_ATOMIC_MOVE	MI_ATOMIC_OP_MASK(0x04)
+#define MI_ATOMIC_INC	MI_ATOMIC_OP_MASK(0x05)
+#define MI_ATOMIC_DEC	MI_ATOMIC_OP_MASK(0x06)
+#define MI_ATOMIC_ADD	MI_ATOMIC_OP_MASK(0x07)
+#define MI_ATOMIC_SUB	MI_ATOMIC_OP_MASK(0x08)
+#define MI_ATOMIC_RSUB	MI_ATOMIC_OP_MASK(0x09)
+#define MI_ATOMIC_IMAX	MI_ATOMIC_OP_MASK(0x0A)
+#define MI_ATOMIC_IMIN	MI_ATOMIC_OP_MASK(0x0B)
+#define MI_ATOMIC_UMAX	MI_ATOMIC_OP_MASK(0x0C)
+#define MI_ATOMIC_UMIN	MI_ATOMIC_OP_MASK(0x0D)
+
 #define MI_BATCH_BUFFER		MI_INSTR(0x30, 1)
 #define   MI_BATCH_NON_SECURE		(1)
 /* for snb/ivb/vlv this also means "batch in ppgtt" when ppgtt is enabled. */
@@ -451,8 +476,6 @@
 #define MI_CLFLUSH              MI_INSTR(0x27, 0)
 #define MI_REPORT_PERF_COUNT    MI_INSTR(0x28, 0)
 #define   MI_REPORT_PERF_COUNT_GGTT (1<<0)
-#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 0)
-#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 0)
 #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
 #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
 #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
@@ -1799,6 +1822,8 @@ enum skl_disp_power_wells {
 #define   GEN8_RC_SEMA_IDLE_MSG_DISABLE	(1 << 12)
 #define   GEN8_FF_DOP_CLOCK_GATE_DISABLE	(1<<10)
 
+#define GEN8_RS_PREEMPT_STATUS		0x215C
+
 /* Fuse readout registers for GT */
 #define CHV_FUSE_GT			(VLV_DISPLAY_BASE + 0x2168)
 #define   CHV_FGT_DISABLE_SS0		(1 << 10)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 664455c..28198c4 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1215,11 +1215,65 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
 			       uint32_t *const batch,
 			       uint32_t *offset)
 {
+	uint32_t scratch_addr;
 	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
 
+	/* Actual scratch location is at 128 bytes offset */
+	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
+	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
+
 	/* WaDisableCtxRestoreArbitration:bdw,chv */
 	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
 
+	/*
+	 * As per Bspec, to workaround a known HW issue, SW must perform the
+	 * below programming sequence prior to programming MI_BATCH_BUFFER_END.
+	 *
+	 * This is only applicable for Gen8.
+	 */
+
+	/* WaRsRestoreWithPerCtxtBb:bdw,chv */
+	wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
+	wa_ctx_emit(batch, INSTPM);
+	wa_ctx_emit(batch, _MASKED_BIT_DISABLE(INSTPM_FORCE_ORDERING));
+
+	wa_ctx_emit(batch, (MI_ATOMIC(5) |
+			    MI_ATOMIC_MEMORY_TYPE_GGTT |
+			    MI_ATOMIC_INLINE_DATA |
+			    MI_ATOMIC_CS_STALL |
+			    MI_ATOMIC_RETURN_DATA_CTL |
+			    MI_ATOMIC_MOVE));
+	wa_ctx_emit(batch, scratch_addr);
+	wa_ctx_emit(batch, 0);
+	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
+	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
+
+	/*
+	 * BSpec says MI_LOAD_REGISTER_MEM, MI_LOAD_REGISTER_REG and
+	 * MI_BATCH_BUFFER_END instructions in this sequence need to be
+	 * in the same cacheline. To satisfy this case even if more WA are
+	 * added in future, pad current cacheline and start remaining sequence
+	 * in new cacheline.
+	 */
+	while (index % CACHELINE_DWORDS)
+		wa_ctx_emit(batch, MI_NOOP);
+
+	wa_ctx_emit(batch, (MI_LOAD_REGISTER_MEM_GEN8 |
+			    MI_LRM_USE_GLOBAL_GTT |
+			    MI_LRM_ASYNC_MODE_ENABLE));
+	wa_ctx_emit(batch, INSTPM);
+	wa_ctx_emit(batch, scratch_addr);
+	wa_ctx_emit(batch, 0);
+
+	/*
+	 * BSpec says there should not be any commands programmed
+	 * between MI_LOAD_REGISTER_REG and MI_BATCH_BUFFER_END so
+	 * do not add any new commands
+	 */
+	wa_ctx_emit(batch, MI_LOAD_REGISTER_REG);
+	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
+	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
+
 	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
 
 	return wa_ctx_end(wa_ctx, *offset = index, 1);
-- 
2.3.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-19 17:37 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize " Arun Siluvery
@ 2015-06-19 17:50   ` Chris Wilson
  2015-06-22 15:36     ` Daniel Vetter
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wilson @ 2015-06-19 17:50 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
> Some of the WA are to be applied during context save but before restore, and
> some at the end of context save/restore but before executing the instructions
> in the ring; these WA cannot be applied using normal means, so WA batch buffers
> are created for this purpose. Each context has two registers to load the
> offsets of these batch buffers. If they are non-zero, HW understands that it
> needs to execute these batches.
> 
> v1: In this version two separate ring_buffer objects were used to load WA
> instructions for indirect and per context batch buffers and they were part
> of every context.
> 
> v2: Chris suggested including an additional page in the context and using it to
> load these WA instead of creating separate objects. This simplifies a lot of
> things as we need not explicitly pin/unpin them. Thomas Daniel further pointed
> out that GuC is planning to use a similar setup to share data between GuC and
> the driver, and WA batch buffers can probably share that page. However, after
> discussions with Dave, who is implementing the GuC changes, he suggested using
> an independent page for these reasons: the GuC area might grow, and these WA
> are initialized only once and not changed afterwards, so we can share them
> across all contexts.
> 
> The page is updated with WA during render ring init. This has an advantage of
> not adding more special cases to default_context.
> 
> We don't know upfront the number of WA we will be applying using these batch
> buffers. For this reason the size was fixed earlier, but that is not a good
> idea. To fix this, the functions that load instructions are modified to report
> the number of commands inserted, and the size is now calculated after the batch
> is updated. A macro is introduced to add commands to these batch buffers which
> also checks for overflow and returns an error.
> We have a full page dedicated to these WA, so that should be sufficient for a
> good number of WA; anything more means we have major issues.
> The list for Gen8 is small, as is the one for Gen9; maybe a few more get added
> going forward, but nowhere close to filling the entire page. Chris suggested a
> two-pass approach, but we agreed to go with the single-page setup as it is a
> one-off routine and simpler code wins.
> 
> One additional option is the offset field, which is helpful if we would like to
> have multiple batches at different offsets within the page and select them
> based on some criteria. This is not a requirement at this point but could
> help in future (Dave).
> 
> Chris provided some helpful macros and suggestions which further simplified
> the code; they will also help in reducing code duplication when WA for
> other Gens are added. Add detailed comments explaining restrictions.
> 
> (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Sigh, after all that, I found one minor thing, but nevertheless
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

> +#define wa_ctx_emit(batch, cmd) {	\
> +		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
> +			return -ENOSPC;					\
> +		}							\
> +		batch[index++] = (cmd);					\
> +	}

We should have wrapped this in do { } while(0) - think of all those
trailing semicolons we have in the code! Fortunately we haven't used
this in an if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.
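For illustration, the difference the do { } while (0) wrapper makes - this is a
stripped-down hypothetical emit macro, not the driver's wa_ctx_emit(), which
additionally checks for page overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Wrapped in do { } while (0), the macro expands to a single statement, so
 * the trailing semicolon at the call site is parsed correctly and the macro
 * can sit in an if/else without extra braces. */
#define EMIT(batch, cmd) do { batch[index++] = (cmd); } while (0)

static int fill(uint32_t *batch, int use_first)
{
	int index = 0;

	/* With a bare { } block instead of do/while, this if/else would not
	 * compile: the ';' after EMIT() would terminate the if statement and
	 * orphan the else. */
	if (use_first)
		EMIT(batch, 0x1);
	else
		EMIT(batch, 0x2);

	return index;
}
```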
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode
  2015-06-19 17:37 ` [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode Arun Siluvery
@ 2015-06-19 17:58   ` Chris Wilson
  0 siblings, 0 replies; 27+ messages in thread
From: Chris Wilson @ 2015-06-19 17:58 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:11PM +0100, Arun Siluvery wrote:
> Some of the WA applied using WA batch buffers perform writes to the scratch page.
> In the current flow, WA are initialized before the scratch obj is allocated.
> This patch reorders intel_init_pipe_control() to have a valid scratch obj
> before we initialize WA.
> 
> v2: Check for valid scratch page before initializing WA as some of them
> perform writes to it.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Ok, I'm not completely happy with the sequence of this, but it works.

logical_ring_init() is the beast that initialises the intel_engine_cs,
and it looks dubious to be setting fields (like dev and scratch_obj)
before we do the main initialisation. I don't have a pretty solution
(thinking of passing around the scratch object, or reordering everything
still further with resulting code duplication for the other rings).

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
                   ` (5 preceding siblings ...)
  2015-06-19 17:37 ` [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround Arun Siluvery
@ 2015-06-19 18:07 ` Arun Siluvery
  2015-06-22 15:41   ` Daniel Vetter
  6 siblings, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-19 18:07 UTC (permalink / raw)
  To: intel-gfx

Some of the WA are to be applied during context save but before restore, and
some at the end of context save/restore but before executing the instructions
in the ring; these WA cannot be applied using normal means, so WA batch buffers
are created for this purpose. Each context has two registers to load the
offsets of these batch buffers. If they are non-zero, HW understands that it
needs to execute these batches.
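As a rough sketch of what loading those offsets amounts to: the indirect-ctx
register takes the batch address in its upper bits and the batch length in
cachelines OR'ed into the low bits, as programmed in the populate_lr_context()
hunk below. The struct and helper here are hypothetical illustrations, not
driver code:

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE_DWORDS 16

struct wa_bb { uint32_t offset; uint32_t size; };	/* both in DWORDs */

/* CTX_RCS_INDIRECT_CTX value: GGTT address of the batch combined with its
 * length in cachelines. A value of 0 tells HW there is no batch to run. */
static uint32_t indirect_ctx_value(uint32_t ggtt_offset, const struct wa_bb *bb)
{
	return (uint32_t)(ggtt_offset + bb->offset * sizeof(uint32_t)) |
	       (bb->size / CACHELINE_DWORDS);
}
```

The per-ctx pointer register is programmed similarly, except with bit 0 set as
a valid indicator (the `| 0x01` in the patch).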

v1: In this version two separate ring_buffer objects were used to load WA
instructions for indirect and per context batch buffers and they were part
of every context.

v2: Chris suggested including an additional page in the context and using it to
load these WA instead of creating separate objects. This simplifies a lot of
things as we need not explicitly pin/unpin them. Thomas Daniel further pointed
out that GuC is planning to use a similar setup to share data between GuC and
the driver, and WA batch buffers can probably share that page. However, after
discussions with Dave, who is implementing the GuC changes, he suggested using
an independent page for these reasons: the GuC area might grow, and these WA
are initialized only once and not changed afterwards, so we can share them
across all contexts.

The page is updated with WA during render ring init. This has an advantage of
not adding more special cases to default_context.

We don't know upfront the number of WA we will be applying using these batch
buffers. For this reason the size was fixed earlier, but that is not a good
idea. To fix this, the functions that load instructions are modified to report
the number of commands inserted, and the size is now calculated after the batch
is updated. A macro is introduced to add commands to these batch buffers which
also checks for overflow and returns an error.
We have a full page dedicated to these WA, so that should be sufficient for a
good number of WA; anything more means we have major issues.
The list for Gen8 is small, as is the one for Gen9; maybe a few more get added
going forward, but nowhere close to filling the entire page. Chris suggested a
two-pass approach, but we agreed to go with the single-page setup as it is a
one-off routine and simpler code wins.

One additional option is the offset field, which is helpful if we would like to
have multiple batches at different offsets within the page and select them
based on some criteria. This is not a requirement at this point but could
help in future (Dave).

Chris provided some helpful macros and suggestions which further simplified
the code; they will also help in reducing code duplication when WA for
other Gens are added. Add detailed comments explaining restrictions.
Use do {} while(0) for the wa_ctx_emit() macro.

(Many thanks to Chris, Dave and Thomas for their reviews and inputs)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 223 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  21 +++
 2 files changed, 240 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0413b8f..0585298 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -211,6 +211,7 @@ enum {
 	FAULT_AND_CONTINUE /* Unsupported */
 };
 #define GEN8_CTX_ID_SHIFT 32
+#define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
 
 static int intel_lr_context_pin(struct intel_engine_cs *ring,
 		struct intel_context *ctx);
@@ -1077,6 +1078,191 @@ static int intel_logical_ring_workarounds_emit(struct intel_engine_cs *ring,
 	return 0;
 }
 
+#define wa_ctx_emit(batch, cmd)						\
+	do {								\
+		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
+			return -ENOSPC;					\
+		}							\
+		batch[index++] = (cmd);					\
+	} while (0)
+
+static inline uint32_t wa_ctx_start(struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t offset,
+				    uint32_t start_alignment)
+{
+	return wa_ctx->offset = ALIGN(offset, start_alignment);
+}
+
+static inline int wa_ctx_end(struct i915_wa_ctx_bb *wa_ctx,
+			     uint32_t offset,
+			     uint32_t size_alignment)
+{
+	wa_ctx->size = offset - wa_ctx->offset;
+
+	WARN(wa_ctx->size % size_alignment,
+	     "wa_ctx_bb failed sanity checks: size %d is not aligned to %d\n",
+	     wa_ctx->size, size_alignment);
+	return 0;
+}
+
+/**
+ * gen8_init_indirectctx_bb() - initialize indirect ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *  offset: specifies start of the batch, should be cache-aligned. This is updated
+ *    with the offset value received as input.
+ *  size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @batch: page in which WA are loaded
+ * @offset: This field specifies the start of the batch; it should be
+ *  cache-aligned, otherwise it is adjusted accordingly.
+ *  Typically we only have one indirect_ctx and per_ctx batch buffer which are
+ *  initialized at the beginning and shared across all contexts, but this field
+ *  helps us to have multiple batches at different offsets and select them based
+ *  on some criteria. At the moment this batch always starts at the beginning of the page
+ *  and at this point we don't have multiple wa_ctx batch buffers.
+ *
+ *  The number of WA applied is not known at the beginning; we use this field
+ *  to return the number of DWORDS written.
+ *
+ *  It is to be noted that this batch does not contain MI_BATCH_BUFFER_END,
+ *  so it adds NOOPs as padding to make it cacheline aligned.
+ *  MI_BATCH_BUFFER_END will be added to the perctx batch, and both of them together
+ *  make a complete batch buffer.
+ *
+ * Return: non-zero if we exceed the PAGE_SIZE limit.
+ */
+
+static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
+				    struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t *const batch,
+				    uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	/* FIXME: Replace me with WA */
+	wa_ctx_emit(batch, MI_NOOP);
+
+	/* Pad to end of cacheline */
+	while (index % CACHELINE_DWORDS)
+		wa_ctx_emit(batch, MI_NOOP);
+
+	/*
+	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
+	 * execution depends on the length specified in terms of cache lines
+	 * in the register CTX_RCS_INDIRECT_CTX
+	 */
+
+	return wa_ctx_end(wa_ctx, *offset = index, CACHELINE_DWORDS);
+}
+
+/**
+ * gen8_init_perctx_bb() - initialize per ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *  offset: specifies start of the batch, should be cache-aligned.
+ *  size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @offset: This field specifies the start of this batch.
+ *   This batch is started immediately after indirect_ctx batch. Since we ensure
+ *   that indirect_ctx ends on a cacheline this batch is aligned automatically.
+ *
+ *   The number of DWORDS written are returned using this field.
+ *
+ *  This batch is terminated with MI_BATCH_BUFFER_END and so we need not add padding
+ *  to align it with cacheline as padding after MI_BATCH_BUFFER_END is redundant.
+ */
+static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
+			       struct i915_wa_ctx_bb *wa_ctx,
+			       uint32_t *const batch,
+			       uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
+
+	return wa_ctx_end(wa_ctx, *offset = index, 1);
+}
+
+static int lrc_setup_wa_ctx_obj(struct intel_engine_cs *ring, u32 size)
+{
+	int ret;
+
+	ring->wa_ctx.obj = i915_gem_alloc_object(ring->dev, PAGE_ALIGN(size));
+	if (!ring->wa_ctx.obj) {
+		DRM_DEBUG_DRIVER("alloc LRC WA ctx backing obj failed.\n");
+		return -ENOMEM;
+	}
+
+	ret = i915_gem_obj_ggtt_pin(ring->wa_ctx.obj, PAGE_SIZE, 0);
+	if (ret) {
+		DRM_DEBUG_DRIVER("pin LRC WA ctx backing obj failed: %d\n",
+				 ret);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void lrc_destroy_wa_ctx_obj(struct intel_engine_cs *ring)
+{
+	if (ring->wa_ctx.obj) {
+		i915_gem_object_ggtt_unpin(ring->wa_ctx.obj);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		ring->wa_ctx.obj = NULL;
+	}
+}
+
+static int intel_init_workaround_bb(struct intel_engine_cs *ring)
+{
+	int ret;
+	uint32_t *batch;
+	uint32_t offset;
+	struct page *page;
+	struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+
+	WARN_ON(ring->id != RCS);
+
+	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
+	if (ret) {
+		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
+		return ret;
+	}
+
+	page = i915_gem_object_get_page(wa_ctx->obj, 0);
+	batch = kmap_atomic(page);
+	offset = 0;
+
+	if (INTEL_INFO(ring->dev)->gen == 8) {
+		ret = gen8_init_indirectctx_bb(ring,
+					       &wa_ctx->indirect_ctx,
+					       batch,
+					       &offset);
+		if (ret)
+			goto out;
+
+		ret = gen8_init_perctx_bb(ring,
+					  &wa_ctx->per_ctx,
+					  batch,
+					  &offset);
+		if (ret)
+			goto out;
+	} else {
+		WARN(INTEL_INFO(ring->dev)->gen >= 8,
+		     "WA batch buffer is not initialized for Gen%d\n",
+		     INTEL_INFO(ring->dev)->gen);
+		lrc_destroy_wa_ctx_obj(ring);
+	}
+
+out:
+	kunmap_atomic(batch);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
+}
+
 static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
@@ -1411,6 +1597,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		kunmap(sg_page(ring->status_page.obj->pages->sgl));
 		ring->status_page.obj = NULL;
 	}
+
+	lrc_destroy_wa_ctx_obj(ring);
 }
 
 static int logical_ring_init(struct drm_device *dev, struct intel_engine_cs *ring)
@@ -1474,7 +1662,22 @@ static int logical_render_ring_init(struct drm_device *dev)
 	if (ret)
 		return ret;
 
-	return intel_init_pipe_control(ring);
+	ret = intel_init_workaround_bb(ring);
+	if (ret) {
+		/*
+		 * We continue even if we fail to initialize WA batch
+		 * because we only expect rare glitches but nothing
+		 * critical to prevent us from using GPU
+		 */
+		DRM_ERROR("WA batch buffer initialization failed: %d\n",
+			  ret);
+	}
+
+	ret = intel_init_pipe_control(ring);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
 }
 
 static int logical_bsd_ring_init(struct drm_device *dev)
@@ -1754,15 +1957,27 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 	reg_state[CTX_SECOND_BB_STATE] = ring->mmio_base + 0x118;
 	reg_state[CTX_SECOND_BB_STATE+1] = 0;
 	if (ring->id == RCS) {
-		/* TODO: according to BSpec, the register state context
-		 * for CHV does not have these. OTOH, these registers do
-		 * exist in CHV. I'm waiting for a clarification */
 		reg_state[CTX_BB_PER_CTX_PTR] = ring->mmio_base + 0x1c0;
 		reg_state[CTX_BB_PER_CTX_PTR+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX] = ring->mmio_base + 0x1c4;
 		reg_state[CTX_RCS_INDIRECT_CTX+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET] = ring->mmio_base + 0x1c8;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] = 0;
+		if (ring->wa_ctx.obj) {
+			struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+			uint32_t ggtt_offset = i915_gem_obj_ggtt_offset(wa_ctx->obj);
+
+			reg_state[CTX_RCS_INDIRECT_CTX+1] =
+				(ggtt_offset + wa_ctx->indirect_ctx.offset * sizeof(uint32_t)) |
+				(wa_ctx->indirect_ctx.size / CACHELINE_DWORDS);
+
+			reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] =
+				CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
+
+			reg_state[CTX_BB_PER_CTX_PTR+1] =
+				(ggtt_offset + wa_ctx->per_ctx.offset * sizeof(uint32_t)) |
+				0x01;
+		}
 	}
 	reg_state[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9);
 	reg_state[CTX_LRI_HEADER_1] |= MI_LRI_FORCE_POSTED;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39f6dfc..06467c6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -12,6 +12,7 @@
  * workarounds!
  */
 #define CACHELINE_BYTES 64
+#define CACHELINE_DWORDS (CACHELINE_BYTES / sizeof(uint32_t))
 
 /*
  * Gen2 BSpec "1. Programming Environment" / 1.4.4.6 "Ring Buffer Use"
@@ -119,6 +120,25 @@ struct intel_ringbuffer {
 
 struct	intel_context;
 
+/*
+ * we use a single page to load ctx workarounds so all of these
+ * values are referred in terms of dwords
+ *
+ * struct i915_wa_ctx_bb:
+ *  offset: specifies batch starting position, also helpful in case
+ *    if we want to have multiple batches at different offsets based on
+ *    some criteria. It is not a requirement at the moment but provides
+ *    an option for future use.
+ *  size: size of the batch in DWORDS
+ */
+struct  i915_ctx_workarounds {
+	struct i915_wa_ctx_bb {
+		u32 offset;
+		u32 size;
+	} indirect_ctx, per_ctx;
+	struct drm_i915_gem_object *obj;
+};
+
 struct  intel_engine_cs {
 	const char	*name;
 	enum intel_ring_id {
@@ -142,6 +162,7 @@ struct  intel_engine_cs {
 	struct i915_gem_batch_pool batch_pool;
 
 	struct intel_hw_status_page status_page;
+	struct i915_ctx_workarounds wa_ctx;
 
 	unsigned irq_refcount; /* protected by dev_priv->irq_lock */
 	u32		irq_enable_mask;	/* bitmask to enable ring interrupt */
-- 
2.3.0



* Re: [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-19 17:37 ` [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround Arun Siluvery
@ 2015-06-19 18:09   ` Chris Wilson
  2015-06-22 11:29     ` Siluvery, Arun
  2015-06-23 14:46   ` [PATCH v6 5/5] " Arun Siluvery
  1 sibling, 1 reply; 27+ messages in thread
From: Chris Wilson @ 2015-06-19 18:09 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:14PM +0100, Arun Siluvery wrote:
> In Indirect context w/a batch buffer,
> WaClearSlmSpaceAtContextSwitch
> 
> This WA performs writes to the scratch page, so the page must be valid; this
> check is performed before initializing the batch with this WA.
> 
> v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reg.h  |  1 +
>  drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index d14ad20..7637e64 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -410,6 +410,7 @@
>  #define   DISPLAY_PLANE_A           (0<<20)
>  #define   DISPLAY_PLANE_B           (1<<20)
>  #define GFX_OP_PIPE_CONTROL(len)	((0x3<<29)|(0x3<<27)|(0x2<<24)|(len-2))
> +#define   PIPE_CONTROL_FLUSH_L3				(1<<27)
>  #define   PIPE_CONTROL_GLOBAL_GTT_IVB			(1<<24) /* gen7+ */
>  #define   PIPE_CONTROL_MMIO_WRITE			(1<<23)
>  #define   PIPE_CONTROL_STORE_DATA_INDEX			(1<<21)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3e7aaa9..664455c 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1137,6 +1137,7 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
>  				    uint32_t *const batch,
>  				    uint32_t *offset)
>  {
> +	uint32_t scratch_addr;
>  	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>  
>  	/* WaDisableCtxRestoreArbitration:bdw,chv */
> @@ -1165,6 +1166,21 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
>  		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
>  	}
>  
> +	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
> +	/* Actual scratch location is at 128 bytes offset */
> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;

I thought this bit was now mbz - that's how we treat it elsewhere e.g.
gen8_emit_flush_render, and that the address has to be naturally aligned
for the target write. (Similar bit in patch 6 fwiw.)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround
  2015-06-19 17:37 ` [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround Arun Siluvery
@ 2015-06-19 18:11   ` Chris Wilson
  0 siblings, 0 replies; 27+ messages in thread
From: Chris Wilson @ 2015-06-19 18:11 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:12PM +0100, Arun Siluvery wrote:
> In Indirect and Per context w/a batch buffer,
> +WaDisableCtxRestoreArbitration
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Acked-by: Chris Wilson <chris@chris-wilson.co.uk> # too lazy to verify hsd
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround
  2015-06-19 17:37 ` [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround Arun Siluvery
@ 2015-06-19 18:12   ` Chris Wilson
  2015-06-22 15:41     ` Daniel Vetter
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wilson @ 2015-06-19 18:12 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:13PM +0100, Arun Siluvery wrote:
> In Indirect context w/a batch buffer,
> +WaFlushCoherentL3CacheLinesAtContextSwitch:bdw
> 
> v2: Add LRI commands to set/reset bit that invalidates coherent lines,
> update WA to include programming restrictions and exclude CHV as
> it is not required (Ville)
> 
> v3: Avoid unnecessary read when it can be done by reading register once (Chris).
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-19 18:09   ` Chris Wilson
@ 2015-06-22 11:29     ` Siluvery, Arun
  2015-06-22 15:39       ` Daniel Vetter
  0 siblings, 1 reply; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-22 11:29 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Dave Gordon, Rafael Barbalho

On 19/06/2015 19:09, Chris Wilson wrote:
> On Fri, Jun 19, 2015 at 06:37:14PM +0100, Arun Siluvery wrote:
>> In Indirect context w/a batch buffer,
>> WaClearSlmSpaceAtContextSwitch
>>
>> This WA performs writes to the scratch page, so the page must be valid; this
>> check is performed before initializing the batch with this WA.
>>
>> v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Dave Gordon <david.s.gordon@intel.com>
>> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_reg.h  |  1 +
>>   drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
>>   2 files changed, 17 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index d14ad20..7637e64 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -410,6 +410,7 @@
>>   #define   DISPLAY_PLANE_A           (0<<20)
>>   #define   DISPLAY_PLANE_B           (1<<20)
>>   #define GFX_OP_PIPE_CONTROL(len)	((0x3<<29)|(0x3<<27)|(0x2<<24)|(len-2))
>> +#define   PIPE_CONTROL_FLUSH_L3				(1<<27)
>>   #define   PIPE_CONTROL_GLOBAL_GTT_IVB			(1<<24) /* gen7+ */
>>   #define   PIPE_CONTROL_MMIO_WRITE			(1<<23)
>>   #define   PIPE_CONTROL_STORE_DATA_INDEX			(1<<21)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 3e7aaa9..664455c 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1137,6 +1137,7 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
>>   				    uint32_t *const batch,
>>   				    uint32_t *offset)
>>   {
>> +	uint32_t scratch_addr;
>>   	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>>
>>   	/* WaDisableCtxRestoreArbitration:bdw,chv */
>> @@ -1165,6 +1166,21 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
>>   		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
>>   	}
>>
>> +	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
>> +	/* Actual scratch location is at 128 bytes offset */
>> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
>> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
>
> I thought this bit was now mbz - that's how we treat it elsewhere e.g.
> gen8_emit_flush_render, and that the address has to be naturally aligned
> for the target write. (Similar bit in patch 6 fwiw.)
You are correct, this bit is MBZ.

Daniel, could you please remove this line when applying the patches?
Sorry for the additional work.

 >> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;

regards
Arun

> -Chris
>


* Re: [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround
  2015-06-19 17:37 ` [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround Arun Siluvery
@ 2015-06-22 11:30   ` Siluvery, Arun
  2015-06-22 16:21   ` Ville Syrjälä
  1 sibling, 0 replies; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-22 11:30 UTC (permalink / raw)
  To: intel-gfx

On 19/06/2015 18:37, Arun Siluvery wrote:
> In Per context w/a batch buffer,
> WaRsRestoreWithPerCtxtBb
>
> This WA performs writes to the scratch page, so it must be valid; this check
> is performed before initializing the batch with this WA.
>
> v2: This patches modifies definitions of MI_LOAD_REGISTER_MEM and
> MI_LOAD_REGISTER_REG; Add GEN8 specific defines for these instructions
> so as to not break any future users of existing definitions (Michel)
>
> v3: The length in the current definitions of the LRM, LRR instructions was
> specified as 0; this seems to be a common convention for instructions whose
> length varies between platforms. This was not an issue so far because they are
> not used anywhere except the command parser; now that we use them in this patch,
> update them with the correct length and also move them out of the command parser
> placeholder to an appropriate place. Remove unnecessary padding and follow the
> WA programming sequence exactly as mentioned in the spec, which is essential
> for this WA (Dave).
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_reg.h  | 29 +++++++++++++++++++--
>   drivers/gpu/drm/i915/intel_lrc.c | 54 ++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 7637e64..208620d 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -347,6 +347,31 @@
>   #define   MI_INVALIDATE_BSD		(1<<7)
>   #define   MI_FLUSH_DW_USE_GTT		(1<<2)
>   #define   MI_FLUSH_DW_USE_PPGTT		(0<<2)
> +#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 1)
> +#define MI_LOAD_REGISTER_MEM_GEN8 MI_INSTR(0x29, 2)
> +#define   MI_LRM_USE_GLOBAL_GTT (1<<22)
> +#define   MI_LRM_ASYNC_MODE_ENABLE (1<<21)
> +#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 1)
> +#define MI_ATOMIC(len)	MI_INSTR(0x2F, (len-2))
> +#define   MI_ATOMIC_MEMORY_TYPE_GGTT	(1<<22)
> +#define   MI_ATOMIC_INLINE_DATA		(1<<18)
> +#define   MI_ATOMIC_CS_STALL		(1<<17)
> +#define   MI_ATOMIC_RETURN_DATA_CTL	(1<<16)
> +#define MI_ATOMIC_OP_MASK(op)  ((op) << 8)
> +#define MI_ATOMIC_AND	MI_ATOMIC_OP_MASK(0x01)
> +#define MI_ATOMIC_OR	MI_ATOMIC_OP_MASK(0x02)
> +#define MI_ATOMIC_XOR	MI_ATOMIC_OP_MASK(0x03)
> +#define MI_ATOMIC_MOVE	MI_ATOMIC_OP_MASK(0x04)
> +#define MI_ATOMIC_INC	MI_ATOMIC_OP_MASK(0x05)
> +#define MI_ATOMIC_DEC	MI_ATOMIC_OP_MASK(0x06)
> +#define MI_ATOMIC_ADD	MI_ATOMIC_OP_MASK(0x07)
> +#define MI_ATOMIC_SUB	MI_ATOMIC_OP_MASK(0x08)
> +#define MI_ATOMIC_RSUB	MI_ATOMIC_OP_MASK(0x09)
> +#define MI_ATOMIC_IMAX	MI_ATOMIC_OP_MASK(0x0A)
> +#define MI_ATOMIC_IMIN	MI_ATOMIC_OP_MASK(0x0B)
> +#define MI_ATOMIC_UMAX	MI_ATOMIC_OP_MASK(0x0C)
> +#define MI_ATOMIC_UMIN	MI_ATOMIC_OP_MASK(0x0D)
> +
>   #define MI_BATCH_BUFFER		MI_INSTR(0x30, 1)
>   #define   MI_BATCH_NON_SECURE		(1)
>   /* for snb/ivb/vlv this also means "batch in ppgtt" when ppgtt is enabled. */
> @@ -451,8 +476,6 @@
>   #define MI_CLFLUSH              MI_INSTR(0x27, 0)
>   #define MI_REPORT_PERF_COUNT    MI_INSTR(0x28, 0)
>   #define   MI_REPORT_PERF_COUNT_GGTT (1<<0)
> -#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 0)
> -#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 0)
>   #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
>   #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
>   #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
> @@ -1799,6 +1822,8 @@ enum skl_disp_power_wells {
>   #define   GEN8_RC_SEMA_IDLE_MSG_DISABLE	(1 << 12)
>   #define   GEN8_FF_DOP_CLOCK_GATE_DISABLE	(1<<10)
>
> +#define GEN8_RS_PREEMPT_STATUS		0x215C
> +
>   /* Fuse readout registers for GT */
>   #define CHV_FUSE_GT			(VLV_DISPLAY_BASE + 0x2168)
>   #define   CHV_FGT_DISABLE_SS0		(1 << 10)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 664455c..28198c4 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1215,11 +1215,65 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
>   			       uint32_t *const batch,
>   			       uint32_t *offset)
>   {
> +	uint32_t scratch_addr;
>   	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>
> +	/* Actual scratch location is at 128 bytes offset */
> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
> +

Daniel, could you please remove this line when applying this patch?
Sorry for the additional work.

 > +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;

regards
Arun

>   	/* WaDisableCtxRestoreArbitration:bdw,chv */
>   	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>
> +	/*
> +	 * As per Bspec, to workaround a known HW issue, SW must perform the
> +	 * below programming sequence prior to programming MI_BATCH_BUFFER_END.
> +	 *
> +	 * This is only applicable for Gen8.
> +	 */
> +
> +	/* WaRsRestoreWithPerCtxtBb:bdw,chv */
> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
> +	wa_ctx_emit(batch, INSTPM);
> +	wa_ctx_emit(batch, _MASKED_BIT_DISABLE(INSTPM_FORCE_ORDERING));
> +
> +	wa_ctx_emit(batch, (MI_ATOMIC(5) |
> +			    MI_ATOMIC_MEMORY_TYPE_GGTT |
> +			    MI_ATOMIC_INLINE_DATA |
> +			    MI_ATOMIC_CS_STALL |
> +			    MI_ATOMIC_RETURN_DATA_CTL |
> +			    MI_ATOMIC_MOVE));
> +	wa_ctx_emit(batch, scratch_addr);
> +	wa_ctx_emit(batch, 0);
> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
> +
> +	/*
> +	 * BSpec says MI_LOAD_REGISTER_MEM, MI_LOAD_REGISTER_REG and
> +	 * MI_BATCH_BUFFER_END instructions in this sequence need to be
> +	 * in the same cacheline. To satisfy this case even if more WA are
> +	 * added in future, pad current cacheline and start remaining sequence
> +	 * in new cacheline.
> +	 */
> +	while (index % CACHELINE_DWORDS)
> +		wa_ctx_emit(batch, MI_NOOP);
> +
> +	wa_ctx_emit(batch, (MI_LOAD_REGISTER_MEM_GEN8 |
> +			    MI_LRM_USE_GLOBAL_GTT |
> +			    MI_LRM_ASYNC_MODE_ENABLE));
> +	wa_ctx_emit(batch, INSTPM);
> +	wa_ctx_emit(batch, scratch_addr);
> +	wa_ctx_emit(batch, 0);
> +
> +	/*
> +	 * BSpec says there should not be any commands programmed
> +	 * between MI_LOAD_REGISTER_REG and MI_BATCH_BUFFER_END so
> +	 * do not add any new commands
> +	 */
> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_REG);
> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
> +
>   	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
>
>   	return wa_ctx_end(wa_ctx, *offset = index, 1);
>


* Re: [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-19 17:50   ` Chris Wilson
@ 2015-06-22 15:36     ` Daniel Vetter
  2015-06-22 15:37       ` Siluvery, Arun
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:36 UTC (permalink / raw)
  To: Chris Wilson, Arun Siluvery, intel-gfx, Dave Gordon, Rafael Barbalho

On Fri, Jun 19, 2015 at 06:50:36PM +0100, Chris Wilson wrote:
> On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
> > Some of the WA are to be applied during context save but before restore and
> > some at the end of context save/restore but before executing the instructions
> > in the ring, WA batch buffers are created for this purpose and these WA cannot
> > be applied using normal means. Each context has two registers to load the
> > offsets of these batch buffers. If they are non-zero, HW understands that it
> > need to execute these batches.
> > 
> > v1: In this version two separate ring_buffer objects were used to load WA
> > instructions for indirect and per context batch buffers and they were part
> > of every context.
> > 
> > v2: Chris suggested to include additional page in context and use it to load
> > these WA instead of creating separate objects. This will simplify a lot of things
> > as we need not explicitly pin/unpin them. Thomas Daniel further pointed out that GuC
> > is planning to use a similar setup to share data between GuC and driver and
> > WA batch buffers can probably share that page. However after discussions with
> > Dave who is implementing GuC changes, he suggested to use an independent page
> > for the reasons - GuC area might grow and these WA are initialized only once and
> > are not changed afterwards, so we can share them across all contexts.
> > 
> > The page is updated with WA during render ring init. This has an advantage of
> > not adding more special cases to default_context.
> > 
> > We don't know upfront the number of WA we will be applying using these batch buffers.
> > For this reason the size was fixed earlier but it is not a good idea. To fix this,
> > the functions that load instructions are modified to report the number of commands
> > inserted and the size is now calculated after the batch is updated. A macro is
> > introduced to add commands to these batch buffers which also checks for overflow
> > and returns error.
> > We have a full page dedicated for these WA so that should be sufficient for
> > good number of WA, anything more means we have major issues.
> > The list for Gen8 is small, same for Gen9 also, maybe few more gets added
> > going forward but not close to filling entire page. Chris suggested a two-pass
> > approach but we agreed to go with single page setup as it is a one-off routine
> > and simpler code wins.
> > 
> > One additional option is offset field which is helpful if we would like to
> > have multiple batches at different offsets within the page and select them
> > based on some criteria. This is not a requirement at this point but could
> > help in future (Dave).
> > 
> > Chris provided some helpful macros and suggestions which further simplified
> > the code, they will also help in reducing code duplication when WA for
> > other Gen are added. Add detailed comments explaining restrictions.
> > 
> > (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Dave Gordon <david.s.gordon@intel.com>
> > Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> Sigh, after all that, I found one minor thing, but nevertheless
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> > +#define wa_ctx_emit(batch, cmd) {	\
> > +		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
> > +			return -ENOSPC;					\
> > +		}							\
> > +		batch[index++] = (cmd);					\
> > +	}
> 
> We should have wrapped this in do { } while(0) - think of all those
> trailing semicolons we have in the code! Fortunately we haven't used
> this in a if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.

Uh yes, this is a critical one. Arun, can you please do a follow-up patch
to wrap your macro in a do {} while(0) like Chris suggested? I'll apply
the patches meanwhile.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-22 15:36     ` Daniel Vetter
@ 2015-06-22 15:37       ` Siluvery, Arun
  0 siblings, 0 replies; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-22 15:37 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson, intel-gfx, Dave Gordon, Rafael Barbalho

On 22/06/2015 16:36, Daniel Vetter wrote:
> On Fri, Jun 19, 2015 at 06:50:36PM +0100, Chris Wilson wrote:
>> On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
>>> Some of the WA are to be applied during context save but before restore and
>>> some at the end of context save/restore but before executing the instructions
>>> in the ring, WA batch buffers are created for this purpose and these WA cannot
>>> be applied using normal means. Each context has two registers to load the
>>> offsets of these batch buffers. If they are non-zero, HW understands that it
>>> need to execute these batches.
>>>
>>> v1: In this version two separate ring_buffer objects were used to load WA
>>> instructions for indirect and per context batch buffers and they were part
>>> of every context.
>>>
>>> v2: Chris suggested to include additional page in context and use it to load
>>> these WA instead of creating separate objects. This will simplify a lot of things
>>> as we need not explicitly pin/unpin them. Thomas Daniel further pointed out that GuC
>>> is planning to use a similar setup to share data between GuC and driver and
>>> WA batch buffers can probably share that page. However after discussions with
>>> Dave who is implementing GuC changes, he suggested to use an independent page
>>> for the reasons - GuC area might grow and these WA are initialized only once and
>>> are not changed afterwards, so we can share them across all contexts.
>>>
>>> The page is updated with WA during render ring init. This has an advantage of
>>> not adding more special cases to default_context.
>>>
>>> We don't know upfront the number of WA we will be applying using these batch buffers.
>>> For this reason the size was fixed earlier but it is not a good idea. To fix this,
>>> the functions that load instructions are modified to report the number of commands
>>> inserted and the size is now calculated after the batch is updated. A macro is
>>> introduced to add commands to these batch buffers which also checks for overflow
>>> and returns error.
>>> We have a full page dedicated for these WA so that should be sufficient for
>>> good number of WA, anything more means we have major issues.
>>> The list for Gen8 is small, same for Gen9 also, maybe few more gets added
>>> going forward but not close to filling entire page. Chris suggested a two-pass
>>> approach but we agreed to go with single page setup as it is a one-off routine
>>> and simpler code wins.
>>>
>>> One additional option is offset field which is helpful if we would like to
>>> have multiple batches at different offsets within the page and select them
>>> based on some criteria. This is not a requirement at this point but could
>>> help in future (Dave).
>>>
>>> Chris provided some helpful macros and suggestions which further simplified
>>> the code, they will also help in reducing code duplication when WA for
>>> other Gen are added. Add detailed comments explaining restrictions.
>>>
>>> (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
>>>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Dave Gordon <david.s.gordon@intel.com>
>>> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> Sigh, after all that, I found one minor thing, but nevertheless
>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>>> +#define wa_ctx_emit(batch, cmd) {	\
>>> +		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
>>> +			return -ENOSPC;					\
>>> +		}							\
>>> +		batch[index++] = (cmd);					\
>>> +	}
>>
>> We should have wrapped this in do { } while(0) - think of all those
>> trailing semicolons we have in the code! Fortunately we haven't used
>> this in a if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.
>
> Uh yes, this is a critical one. Arun, can you please do a follow-up patch
> to wrap your macro in a do {} while(0) like Chris suggested? I'll apply
> the patches meanwhile.

Hi Daniel,

Already sent the updated patch.
I think I got the message-id wrong; the updated patch I sent is
showing up as the last message in this series.

regards
Arun

>
> Thanks, Daniel
>


* Re: [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-22 11:29     ` Siluvery, Arun
@ 2015-06-22 15:39       ` Daniel Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:39 UTC (permalink / raw)
  To: Siluvery, Arun; +Cc: intel-gfx

On Mon, Jun 22, 2015 at 12:29:05PM +0100, Siluvery, Arun wrote:
> On 19/06/2015 19:09, Chris Wilson wrote:
> >On Fri, Jun 19, 2015 at 06:37:14PM +0100, Arun Siluvery wrote:
> >>In Indirect context w/a batch buffer,
> >>WaClearSlmSpaceAtContextSwitch
> >>
> >>This WA performs writes to the scratch page, so it must be valid; this check
> >>is performed before initializing the batch with this WA.
> >>
> >>v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
> >>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>Cc: Dave Gordon <david.s.gordon@intel.com>
> >>Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>---
> >>  drivers/gpu/drm/i915/i915_reg.h  |  1 +
> >>  drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
> >>  2 files changed, 17 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> >>index d14ad20..7637e64 100644
> >>--- a/drivers/gpu/drm/i915/i915_reg.h
> >>+++ b/drivers/gpu/drm/i915/i915_reg.h
> >>@@ -410,6 +410,7 @@
> >>  #define   DISPLAY_PLANE_A           (0<<20)
> >>  #define   DISPLAY_PLANE_B           (1<<20)
> >>  #define GFX_OP_PIPE_CONTROL(len)	((0x3<<29)|(0x3<<27)|(0x2<<24)|(len-2))
> >>+#define   PIPE_CONTROL_FLUSH_L3				(1<<27)
> >>  #define   PIPE_CONTROL_GLOBAL_GTT_IVB			(1<<24) /* gen7+ */
> >>  #define   PIPE_CONTROL_MMIO_WRITE			(1<<23)
> >>  #define   PIPE_CONTROL_STORE_DATA_INDEX			(1<<21)
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index 3e7aaa9..664455c 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -1137,6 +1137,7 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
> >>  				    uint32_t *const batch,
> >>  				    uint32_t *offset)
> >>  {
> >>+	uint32_t scratch_addr;
> >>  	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
> >>
> >>  	/* WaDisableCtxRestoreArbitration:bdw,chv */
> >>@@ -1165,6 +1166,21 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
> >>  		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
> >>  	}
> >>
> >>+	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
> >>+	/* Actual scratch location is at 128 bytes offset */
> >>+	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
> >>+	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
> >
> >I thought this bit was now mbz - that's how we treat it elsewhere e.g.
> >gen8_emit_flush_render, and that the address has to be naturally aligned
> >for the target write. (Similar bit in patch 6 fwiw.)
> you are correct, this bit is mbz.
> 
> Daniel, could you please remove this line when applying patches?
> sorry for additional work.
> 
> >> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;

Since you need to send out a follow-up anyway, can you pls do that when
resending? I'm still applying all the other ones.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-19 18:07 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers Arun Siluvery
@ 2015-06-22 15:41   ` Daniel Vetter
  2015-06-22 15:43     ` Siluvery, Arun
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:41 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 07:07:01PM +0100, Arun Siluvery wrote:
> Some of the WA are to be applied during context save but before restore and
> some at the end of context save/restore but before executing the instructions
> in the ring, WA batch buffers are created for this purpose and these WA cannot
> be applied using normal means. Each context has two registers to load the
> offsets of these batch buffers. If they are non-zero, HW understands that it
> need to execute these batches.
> 
> v1: In this version two separate ring_buffer objects were used to load WA
> instructions for indirect and per context batch buffers and they were part
> of every context.
> 
> v2: Chris suggested to include additional page in context and use it to load
> these WA instead of creating separate objects. This will simplify a lot of things
> as we need not explicitly pin/unpin them. Thomas Daniel further pointed out that GuC
> is planning to use a similar setup to share data between GuC and driver and
> WA batch buffers can probably share that page. However after discussions with
> Dave who is implementing GuC changes, he suggested to use an independent page
> for the reasons - GuC area might grow and these WA are initialized only once and
> are not changed afterwards, so we can share them across all contexts.
> 
> The page is updated with WA during render ring init. This has an advantage of
> not adding more special cases to default_context.
> 
> We don't know upfront the number of WA we will be applying using these batch buffers.
> For this reason the size was fixed earlier but it is not a good idea. To fix this,
> the functions that load instructions are modified to report the number of commands
> inserted and the size is now calculated after the batch is updated. A macro is
> introduced to add commands to these batch buffers which also checks for overflow
> and returns error.
> We have a full page dedicated for these WA so that should be sufficient for
> good number of WA, anything more means we have major issues.
> The list for Gen8 is small, same for Gen9 also, maybe few more gets added
> going forward but not close to filling entire page. Chris suggested a two-pass
> approach but we agreed to go with single page setup as it is a one-off routine
> and simpler code wins.
> 
> One additional option is offset field which is helpful if we would like to
> have multiple batches at different offsets within the page and select them
> based on some criteria. This is not a requirement at this point but could
> help in future (Dave).
> 
> Chris provided some helpful macros and suggestions which further simplified
> the code, they will also help in reducing code duplication when WA for
> other Gen are added. Add detailed comments explaining restrictions.
> Use do {} while(0) for wa_ctx_emit() macro.
> 
> (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Why did you resend this one? I don't spot any updates in the commit
message. Also, when resending, please send it in-reply-to the corresponding
previous version of that patch, not the cover letter of the series.
-Daniel

> ---
>  drivers/gpu/drm/i915/intel_lrc.c        | 223 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  21 +++
>  2 files changed, 240 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 0413b8f..0585298 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -211,6 +211,7 @@ enum {
>  	FAULT_AND_CONTINUE /* Unsupported */
>  };
>  #define GEN8_CTX_ID_SHIFT 32
> +#define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
>  
>  static int intel_lr_context_pin(struct intel_engine_cs *ring,
>  		struct intel_context *ctx);
> @@ -1077,6 +1078,191 @@ static int intel_logical_ring_workarounds_emit(struct intel_engine_cs *ring,
>  	return 0;
>  }
>  
> +#define wa_ctx_emit(batch, cmd)						\
> +	do {								\
> +		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
> +			return -ENOSPC;					\
> +		}							\
> +		batch[index++] = (cmd);					\
> +	} while (0)
> +
> +static inline uint32_t wa_ctx_start(struct i915_wa_ctx_bb *wa_ctx,
> +				    uint32_t offset,
> +				    uint32_t start_alignment)
> +{
> +	return wa_ctx->offset = ALIGN(offset, start_alignment);
> +}
> +
> +static inline int wa_ctx_end(struct i915_wa_ctx_bb *wa_ctx,
> +			     uint32_t offset,
> +			     uint32_t size_alignment)
> +{
> +	wa_ctx->size = offset - wa_ctx->offset;
> +
> +	WARN(wa_ctx->size % size_alignment,
> +	     "wa_ctx_bb failed sanity checks: size %d is not aligned to %d\n",
> +	     wa_ctx->size, size_alignment);
> +	return 0;
> +}
> +
> +/**
> + * gen8_init_indirectctx_bb() - initialize indirect ctx batch with WA
> + *
> + * @ring: only applicable for RCS
> + * @wa_ctx: structure representing wa_ctx
> + *  offset: specifies start of the batch, should be cache-aligned. This is updated
> + *    with the offset value received as input.
> + *  size: size of the batch in DWORDS but HW expects in terms of cachelines
> + * @batch: page in which WA are loaded
> + * @offset: This field specifies the start of the batch, it should be
> + *  cache-aligned otherwise it is adjusted accordingly.
> + *  Typically we only have one indirect_ctx and per_ctx batch buffer which are
> + *  initialized at the beginning and shared across all contexts but this field
> + *  helps us to have multiple batches at different offsets and select them based
> + *  on a criteria. At the moment this batch always start at the beginning of the page
> + *  and at this point we don't have multiple wa_ctx batch buffers.
> + *
> + *  The number of WA applied are not known at the beginning; we use this field
> + *  to return the no of DWORDS written.
> +
> + *  It is to be noted that this batch does not contain MI_BATCH_BUFFER_END
> + *  so it adds NOOPs as padding to make it cacheline aligned.
> + *  MI_BATCH_BUFFER_END will be added to perctx batch and both of them together
> + *  makes a complete batch buffer.
> + *
> + * Return: non-zero if we exceed the PAGE_SIZE limit.
> + */
> +
> +static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
> +				    struct i915_wa_ctx_bb *wa_ctx,
> +				    uint32_t *const batch,
> +				    uint32_t *offset)
> +{
> +	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
> +
> +	/* FIXME: Replace me with WA */
> +	wa_ctx_emit(batch, MI_NOOP);
> +
> +	/* Pad to end of cacheline */
> +	while (index % CACHELINE_DWORDS)
> +		wa_ctx_emit(batch, MI_NOOP);
> +
> +	/*
> +	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
> +	 * execution depends on the length specified in terms of cache lines
> +	 * in the register CTX_RCS_INDIRECT_CTX
> +	 */
> +
> +	return wa_ctx_end(wa_ctx, *offset = index, CACHELINE_DWORDS);
> +}
> +
> +/**
> + * gen8_init_perctx_bb() - initialize per ctx batch with WA
> + *
> + * @ring: only applicable for RCS
> + * @wa_ctx: structure representing wa_ctx
> + *  offset: specifies start of the batch, should be cache-aligned.
> + *  size: size of the batch in DWORDS but HW expects in terms of cachelines
> + * @offset: This field specifies the start of this batch.
> + *   This batch is started immediately after indirect_ctx batch. Since we ensure
> + *   that indirect_ctx ends on a cacheline this batch is aligned automatically.
> + *
> + *   The number of DWORDS written are returned using this field.
> + *
> + *  This batch is terminated with MI_BATCH_BUFFER_END and so we need not add padding
> + *  to align it with cacheline as padding after MI_BATCH_BUFFER_END is redundant.
> + */
> +static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
> +			       struct i915_wa_ctx_bb *wa_ctx,
> +			       uint32_t *const batch,
> +			       uint32_t *offset)
> +{
> +	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
> +
> +	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
> +
> +	return wa_ctx_end(wa_ctx, *offset = index, 1);
> +}
> +
> +static int lrc_setup_wa_ctx_obj(struct intel_engine_cs *ring, u32 size)
> +{
> +	int ret;
> +
> +	ring->wa_ctx.obj = i915_gem_alloc_object(ring->dev, PAGE_ALIGN(size));
> +	if (!ring->wa_ctx.obj) {
> +		DRM_DEBUG_DRIVER("alloc LRC WA ctx backing obj failed.\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = i915_gem_obj_ggtt_pin(ring->wa_ctx.obj, PAGE_SIZE, 0);
> +	if (ret) {
> +		DRM_DEBUG_DRIVER("pin LRC WA ctx backing obj failed: %d\n",
> +				 ret);
> +		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void lrc_destroy_wa_ctx_obj(struct intel_engine_cs *ring)
> +{
> +	if (ring->wa_ctx.obj) {
> +		i915_gem_object_ggtt_unpin(ring->wa_ctx.obj);
> +		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
> +		ring->wa_ctx.obj = NULL;
> +	}
> +}
> +
> +static int intel_init_workaround_bb(struct intel_engine_cs *ring)
> +{
> +	int ret;
> +	uint32_t *batch;
> +	uint32_t offset;
> +	struct page *page;
> +	struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
> +
> +	WARN_ON(ring->id != RCS);
> +
> +	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
> +	if (ret) {
> +		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
> +		return ret;
> +	}
> +
> +	page = i915_gem_object_get_page(wa_ctx->obj, 0);
> +	batch = kmap_atomic(page);
> +	offset = 0;
> +
> +	if (INTEL_INFO(ring->dev)->gen == 8) {
> +		ret = gen8_init_indirectctx_bb(ring,
> +					       &wa_ctx->indirect_ctx,
> +					       batch,
> +					       &offset);
> +		if (ret)
> +			goto out;
> +
> +		ret = gen8_init_perctx_bb(ring,
> +					  &wa_ctx->per_ctx,
> +					  batch,
> +					  &offset);
> +		if (ret)
> +			goto out;
> +	} else {
> +		WARN(INTEL_INFO(ring->dev)->gen >= 8,
> +		     "WA batch buffer is not initialized for Gen%d\n",
> +		     INTEL_INFO(ring->dev)->gen);
> +		lrc_destroy_wa_ctx_obj(ring);
> +	}
> +
> +out:
> +	kunmap_atomic(batch);
> +	if (ret)
> +		lrc_destroy_wa_ctx_obj(ring);
> +
> +	return ret;
> +}
> +
>  static int gen8_init_common_ring(struct intel_engine_cs *ring)
>  {
>  	struct drm_device *dev = ring->dev;
> @@ -1411,6 +1597,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
>  		kunmap(sg_page(ring->status_page.obj->pages->sgl));
>  		ring->status_page.obj = NULL;
>  	}
> +
> +	lrc_destroy_wa_ctx_obj(ring);
>  }
>  
>  static int logical_ring_init(struct drm_device *dev, struct intel_engine_cs *ring)
> @@ -1474,7 +1662,22 @@ static int logical_render_ring_init(struct drm_device *dev)
>  	if (ret)
>  		return ret;
>  
> -	return intel_init_pipe_control(ring);
> +	ret = intel_init_workaround_bb(ring);
> +	if (ret) {
> +		/*
> +		 * Continue even if the WA batch fails to initialize:
> +		 * we only expect rare glitches, nothing critical enough
> +		 * to prevent us from using the GPU.
> +		 */
> +		DRM_ERROR("WA batch buffer initialization failed: %d\n",
> +			  ret);
> +	}
> +
> +	ret = intel_init_pipe_control(ring);
> +	if (ret)
> +		lrc_destroy_wa_ctx_obj(ring);
> +
> +	return ret;
>  }
>  
>  static int logical_bsd_ring_init(struct drm_device *dev)
> @@ -1754,15 +1957,27 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
>  	reg_state[CTX_SECOND_BB_STATE] = ring->mmio_base + 0x118;
>  	reg_state[CTX_SECOND_BB_STATE+1] = 0;
>  	if (ring->id == RCS) {
> -		/* TODO: according to BSpec, the register state context
> -		 * for CHV does not have these. OTOH, these registers do
> -		 * exist in CHV. I'm waiting for a clarification */
>  		reg_state[CTX_BB_PER_CTX_PTR] = ring->mmio_base + 0x1c0;
>  		reg_state[CTX_BB_PER_CTX_PTR+1] = 0;
>  		reg_state[CTX_RCS_INDIRECT_CTX] = ring->mmio_base + 0x1c4;
>  		reg_state[CTX_RCS_INDIRECT_CTX+1] = 0;
>  		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET] = ring->mmio_base + 0x1c8;
>  		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] = 0;
> +		if (ring->wa_ctx.obj) {
> +			struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
> +			uint32_t ggtt_offset = i915_gem_obj_ggtt_offset(wa_ctx->obj);
> +
> +			reg_state[CTX_RCS_INDIRECT_CTX+1] =
> +				(ggtt_offset + wa_ctx->indirect_ctx.offset * sizeof(uint32_t)) |
> +				(wa_ctx->indirect_ctx.size / CACHELINE_DWORDS);
> +
> +			reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] =
> +				CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
> +
> +			reg_state[CTX_BB_PER_CTX_PTR+1] =
> +				(ggtt_offset + wa_ctx->per_ctx.offset * sizeof(uint32_t)) |
> +				0x01;
> +		}
>  	}
>  	reg_state[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9);
>  	reg_state[CTX_LRI_HEADER_1] |= MI_LRI_FORCE_POSTED;
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 39f6dfc..06467c6 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -12,6 +12,7 @@
>   * workarounds!
>   */
>  #define CACHELINE_BYTES 64
> +#define CACHELINE_DWORDS (CACHELINE_BYTES / sizeof(uint32_t))
>  
>  /*
>   * Gen2 BSpec "1. Programming Environment" / 1.4.4.6 "Ring Buffer Use"
> @@ -119,6 +120,25 @@ struct intel_ringbuffer {
>  
>  struct	intel_context;
>  
> +/*
> + * we use a single page to load ctx workarounds so all of these
> + * values are expressed in terms of dwords
> + *
> + * struct i915_wa_ctx_bb:
> + *  offset: specifies batch starting position, also helpful in case
> + *    we want to have multiple batches at different offsets based on
> + *    some criteria. It is not a requirement at the moment but provides
> + *    an option for future use.
> + *  size: size of the batch in DWORDS
> + */
> +struct  i915_ctx_workarounds {
> +	struct i915_wa_ctx_bb {
> +		u32 offset;
> +		u32 size;
> +	} indirect_ctx, per_ctx;
> +	struct drm_i915_gem_object *obj;
> +};
> +
>  struct  intel_engine_cs {
>  	const char	*name;
>  	enum intel_ring_id {
> @@ -142,6 +162,7 @@ struct  intel_engine_cs {
>  	struct i915_gem_batch_pool batch_pool;
>  
>  	struct intel_hw_status_page status_page;
> +	struct i915_ctx_workarounds wa_ctx;
>  
>  	unsigned irq_refcount; /* protected by dev_priv->irq_lock */
>  	u32		irq_enable_mask;	/* bitmask to enable ring interrupt */
> -- 
> 2.3.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround
  2015-06-19 18:12   ` Chris Wilson
@ 2015-06-22 15:41     ` Daniel Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:41 UTC (permalink / raw)
  To: Chris Wilson, Arun Siluvery, intel-gfx, Dave Gordon, Rafael Barbalho

On Fri, Jun 19, 2015 at 07:12:21PM +0100, Chris Wilson wrote:
> On Fri, Jun 19, 2015 at 06:37:13PM +0100, Arun Siluvery wrote:
> > In Indirect context w/a batch buffer,
> > +WaFlushCoherentL3CacheLinesAtContextSwitch:bdw
> > 
> > v2: Add LRI commands to set/reset bit that invalidates coherent lines,
> > update WA to include programming restrictions and exclude CHV as
> > it is not required (Ville)
> > 
> > v3: Avoid unnecessary read when it can be done by reading register once (Chris).
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Dave Gordon <david.s.gordon@intel.com>
> > Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>

Merged up to this patch, thanks.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers
  2015-06-22 15:41   ` Daniel Vetter
@ 2015-06-22 15:43     ` Siluvery, Arun
  0 siblings, 0 replies; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-22 15:43 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On 22/06/2015 16:41, Daniel Vetter wrote:
> On Fri, Jun 19, 2015 at 07:07:01PM +0100, Arun Siluvery wrote:
>> Some of the WA are to be applied during context save but before restore, and
>> some at the end of context save/restore but before executing the instructions
>> in the ring. WA batch buffers are created for this purpose, as these WA cannot
>> be applied using normal means. Each context has two registers to load the
>> offsets of these batch buffers. If they are non-zero, HW understands that it
>> needs to execute these batches.
>>
>> v1: In this version two separate ring_buffer objects were used to load WA
>> instructions for indirect and per context batch buffers and they were part
>> of every context.
>>
>> v2: Chris suggested including an additional page in the context and using it to
>> load these WA instead of creating separate objects. This simplifies a lot of
>> things as we need not explicitly pin/unpin them. Thomas Daniel further pointed
>> out that GuC is planning to use a similar setup to share data between GuC and
>> the driver, and WA batch buffers could probably share that page. However, after
>> discussions with Dave, who is implementing the GuC changes, he suggested using
>> an independent page for these reasons: the GuC area might grow, and these WA
>> are initialized only once and not changed afterwards, so we can share them
>> across all contexts.
>>
>> The page is updated with WA during render ring init. This has an advantage of
>> not adding more special cases to default_context.
>>
>> We don't know upfront the number of WA we will be applying using these batch
>> buffers. For this reason the size was fixed earlier, which is not a good idea.
>> To fix this, the functions that load instructions are modified to report the
>> number of commands inserted, and the size is now calculated after the batch is
>> updated. A macro is
>> introduced to add commands to these batch buffers which also checks for overflow
>> and returns error.
>> We have a full page dedicated for these WA so that should be sufficient for
>> good number of WA, anything more means we have major issues.
>> The list for Gen8 is small, same for Gen9 also, maybe few more gets added
>> going forward but not close to filling entire page. Chris suggested a two-pass
>> approach but we agreed to go with single page setup as it is a one-off routine
>> and simpler code wins.
>>
>> One additional option is offset field which is helpful if we would like to
>> have multiple batches at different offsets within the page and select them
>> based on some criteria. This is not a requirement at this point but could
>> help in future (Dave).
>>
>> Chris provided some helpful macros and suggestions which further simplified
>> the code, they will also help in reducing code duplication when WA for
>> other Gen are added. Add detailed comments explaining restrictions.
>> Use do {} while(0) for wa_ctx_emit() macro.
>>
>> (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Dave Gordon <david.s.gordon@intel.com>
>> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> Why did you resend this one - I don't spot any updates in the commit
> message? Also when resending please in-reply to the corresponding previous
> version of that patch, not the cover letter of the series.

Hi Daniel,

This is the updated patch with do {} while (0).
I picked the message-id of the cover letter by mistake, which is why it is 
a reply to the cover letter instead of the corresponding patch.

regards
Arun

> -Daniel
>
>> ---
>>   drivers/gpu/drm/i915/intel_lrc.c        | 223 +++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/i915/intel_ringbuffer.h |  21 +++
>>   2 files changed, 240 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 0413b8f..0585298 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -211,6 +211,7 @@ enum {
>>   	FAULT_AND_CONTINUE /* Unsupported */
>>   };
>>   #define GEN8_CTX_ID_SHIFT 32
>> +#define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
>>
>>   static int intel_lr_context_pin(struct intel_engine_cs *ring,
>>   		struct intel_context *ctx);
>> @@ -1077,6 +1078,191 @@ static int intel_logical_ring_workarounds_emit(struct intel_engine_cs *ring,
>>   	return 0;
>>   }
>>
>> +#define wa_ctx_emit(batch, cmd)						\
>> +	do {								\
>> +		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) {	\
>> +			return -ENOSPC;					\
>> +		}							\
>> +		batch[index++] = (cmd);					\
>> +	} while (0)
>> +
>> +static inline uint32_t wa_ctx_start(struct i915_wa_ctx_bb *wa_ctx,
>> +				    uint32_t offset,
>> +				    uint32_t start_alignment)
>> +{
>> +	return wa_ctx->offset = ALIGN(offset, start_alignment);
>> +}
>> +
>> +static inline int wa_ctx_end(struct i915_wa_ctx_bb *wa_ctx,
>> +			     uint32_t offset,
>> +			     uint32_t size_alignment)
>> +{
>> +	wa_ctx->size = offset - wa_ctx->offset;
>> +
>> +	WARN(wa_ctx->size % size_alignment,
>> +	     "wa_ctx_bb failed sanity checks: size %d is not aligned to %d\n",
>> +	     wa_ctx->size, size_alignment);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * gen8_init_indirectctx_bb() - initialize indirect ctx batch with WA
>> + *
>> + * @ring: only applicable for RCS
>> + * @wa_ctx: structure representing wa_ctx
>> + *  offset: specifies start of the batch, should be cache-aligned. This is updated
>> + *    with the offset value received as input.
>> + *  size: size of the batch in DWORDS but HW expects in terms of cachelines
>> + * @batch: page in which WA are loaded
>> + * @offset: This field specifies the start of the batch, it should be
>> + *  cache-aligned otherwise it is adjusted accordingly.
>> + *  Typically we only have one indirect_ctx and per_ctx batch buffer which are
>> + *  initialized at the beginning and shared across all contexts but this field
>> + *  helps us to have multiple batches at different offsets and select them based
>> + *  on a criteria. At the moment this batch always starts at the beginning of the page
>> + *  and at this point we don't have multiple wa_ctx batch buffers.
>> + *
>> + *  The number of WA applied is not known at the beginning; we use this field
>> + *  to return the number of DWORDS written.
>> + *
>> + *  It is to be noted that this batch does not contain MI_BATCH_BUFFER_END
>> + *  so it adds NOOPs as padding to make it cacheline aligned.
>> + *  MI_BATCH_BUFFER_END will be added to perctx batch and both of them together
>> + *  makes a complete batch buffer.
>> + *
>> + * Return: non-zero if we exceed the PAGE_SIZE limit.
>> + */
>> +
>> +static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
>> +				    struct i915_wa_ctx_bb *wa_ctx,
>> +				    uint32_t *const batch,
>> +				    uint32_t *offset)
>> +{
>> +	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>> +
>> +	/* FIXME: Replace me with WA */
>> +	wa_ctx_emit(batch, MI_NOOP);
>> +
>> +	/* Pad to end of cacheline */
>> +	while (index % CACHELINE_DWORDS)
>> +		wa_ctx_emit(batch, MI_NOOP);
>> +
>> +	/*
>> +	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
>> +	 * execution depends on the length specified in terms of cache lines
>> +	 * in the register CTX_RCS_INDIRECT_CTX
>> +	 */
>> +
>> +	return wa_ctx_end(wa_ctx, *offset = index, CACHELINE_DWORDS);
>> +}
>> +
>> +/**
>> + * gen8_init_perctx_bb() - initialize per ctx batch with WA
>> + *
>> + * @ring: only applicable for RCS
>> + * @wa_ctx: structure representing wa_ctx
>> + *  offset: specifies start of the batch, should be cache-aligned.
>> + *  size: size of the batch in DWORDS but HW expects in terms of cachelines
>> + * @offset: This field specifies the start of this batch.
>> + *   This batch is started immediately after indirect_ctx batch. Since we ensure
>> + *   that indirect_ctx ends on a cacheline this batch is aligned automatically.
>> + *
>> + *   The number of DWORDS written is returned using this field.
>> + *
>> + *  This batch is terminated with MI_BATCH_BUFFER_END and so we need not add padding
>> + *  to align it with cacheline as padding after MI_BATCH_BUFFER_END is redundant.
>> + */
>> +static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
>> +			       struct i915_wa_ctx_bb *wa_ctx,
>> +			       uint32_t *const batch,
>> +			       uint32_t *offset)
>> +{
>> +	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>> +
>> +	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
>> +
>> +	return wa_ctx_end(wa_ctx, *offset = index, 1);
>> +}
>> +
>> +static int lrc_setup_wa_ctx_obj(struct intel_engine_cs *ring, u32 size)
>> +{
>> +	int ret;
>> +
>> +	ring->wa_ctx.obj = i915_gem_alloc_object(ring->dev, PAGE_ALIGN(size));
>> +	if (!ring->wa_ctx.obj) {
>> +		DRM_DEBUG_DRIVER("alloc LRC WA ctx backing obj failed.\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	ret = i915_gem_obj_ggtt_pin(ring->wa_ctx.obj, PAGE_SIZE, 0);
>> +	if (ret) {
>> +		DRM_DEBUG_DRIVER("pin LRC WA ctx backing obj failed: %d\n",
>> +				 ret);
>> +		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
>> +		return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void lrc_destroy_wa_ctx_obj(struct intel_engine_cs *ring)
>> +{
>> +	if (ring->wa_ctx.obj) {
>> +		i915_gem_object_ggtt_unpin(ring->wa_ctx.obj);
>> +		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
>> +		ring->wa_ctx.obj = NULL;
>> +	}
>> +}
>> +
>> +static int intel_init_workaround_bb(struct intel_engine_cs *ring)
>> +{
>> +	int ret;
>> +	uint32_t *batch;
>> +	uint32_t offset;
>> +	struct page *page;
>> +	struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
>> +
>> +	WARN_ON(ring->id != RCS);
>> +
>> +	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
>> +	if (ret) {
>> +		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
>> +		return ret;
>> +	}
>> +
>> +	page = i915_gem_object_get_page(wa_ctx->obj, 0);
>> +	batch = kmap_atomic(page);
>> +	offset = 0;
>> +
>> +	if (INTEL_INFO(ring->dev)->gen == 8) {
>> +		ret = gen8_init_indirectctx_bb(ring,
>> +					       &wa_ctx->indirect_ctx,
>> +					       batch,
>> +					       &offset);
>> +		if (ret)
>> +			goto out;
>> +
>> +		ret = gen8_init_perctx_bb(ring,
>> +					  &wa_ctx->per_ctx,
>> +					  batch,
>> +					  &offset);
>> +		if (ret)
>> +			goto out;
>> +	} else {
>> +		WARN(INTEL_INFO(ring->dev)->gen >= 8,
>> +		     "WA batch buffer is not initialized for Gen%d\n",
>> +		     INTEL_INFO(ring->dev)->gen);
>> +		lrc_destroy_wa_ctx_obj(ring);
>> +	}
>> +
>> +out:
>> +	kunmap_atomic(batch);
>> +	if (ret)
>> +		lrc_destroy_wa_ctx_obj(ring);
>> +
>> +	return ret;
>> +}
>> +
>>   static int gen8_init_common_ring(struct intel_engine_cs *ring)
>>   {
>>   	struct drm_device *dev = ring->dev;
>> @@ -1411,6 +1597,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
>>   		kunmap(sg_page(ring->status_page.obj->pages->sgl));
>>   		ring->status_page.obj = NULL;
>>   	}
>> +
>> +	lrc_destroy_wa_ctx_obj(ring);
>>   }
>>
>>   static int logical_ring_init(struct drm_device *dev, struct intel_engine_cs *ring)
>> @@ -1474,7 +1662,22 @@ static int logical_render_ring_init(struct drm_device *dev)
>>   	if (ret)
>>   		return ret;
>>
>> -	return intel_init_pipe_control(ring);
>> +	ret = intel_init_workaround_bb(ring);
>> +	if (ret) {
>> +		/*
>> +		 * Continue even if the WA batch fails to initialize:
>> +		 * we only expect rare glitches, nothing critical enough
>> +		 * to prevent us from using the GPU.
>> +		 */
>> +		DRM_ERROR("WA batch buffer initialization failed: %d\n",
>> +			  ret);
>> +	}
>> +
>> +	ret = intel_init_pipe_control(ring);
>> +	if (ret)
>> +		lrc_destroy_wa_ctx_obj(ring);
>> +
>> +	return ret;
>>   }
>>
>>   static int logical_bsd_ring_init(struct drm_device *dev)
>> @@ -1754,15 +1957,27 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
>>   	reg_state[CTX_SECOND_BB_STATE] = ring->mmio_base + 0x118;
>>   	reg_state[CTX_SECOND_BB_STATE+1] = 0;
>>   	if (ring->id == RCS) {
>> -		/* TODO: according to BSpec, the register state context
>> -		 * for CHV does not have these. OTOH, these registers do
>> -		 * exist in CHV. I'm waiting for a clarification */
>>   		reg_state[CTX_BB_PER_CTX_PTR] = ring->mmio_base + 0x1c0;
>>   		reg_state[CTX_BB_PER_CTX_PTR+1] = 0;
>>   		reg_state[CTX_RCS_INDIRECT_CTX] = ring->mmio_base + 0x1c4;
>>   		reg_state[CTX_RCS_INDIRECT_CTX+1] = 0;
>>   		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET] = ring->mmio_base + 0x1c8;
>>   		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] = 0;
>> +		if (ring->wa_ctx.obj) {
>> +			struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
>> +			uint32_t ggtt_offset = i915_gem_obj_ggtt_offset(wa_ctx->obj);
>> +
>> +			reg_state[CTX_RCS_INDIRECT_CTX+1] =
>> +				(ggtt_offset + wa_ctx->indirect_ctx.offset * sizeof(uint32_t)) |
>> +				(wa_ctx->indirect_ctx.size / CACHELINE_DWORDS);
>> +
>> +			reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] =
>> +				CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
>> +
>> +			reg_state[CTX_BB_PER_CTX_PTR+1] =
>> +				(ggtt_offset + wa_ctx->per_ctx.offset * sizeof(uint32_t)) |
>> +				0x01;
>> +		}
>>   	}
>>   	reg_state[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9);
>>   	reg_state[CTX_LRI_HEADER_1] |= MI_LRI_FORCE_POSTED;
>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> index 39f6dfc..06467c6 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> @@ -12,6 +12,7 @@
>>    * workarounds!
>>    */
>>   #define CACHELINE_BYTES 64
>> +#define CACHELINE_DWORDS (CACHELINE_BYTES / sizeof(uint32_t))
>>
>>   /*
>>    * Gen2 BSpec "1. Programming Environment" / 1.4.4.6 "Ring Buffer Use"
>> @@ -119,6 +120,25 @@ struct intel_ringbuffer {
>>
>>   struct	intel_context;
>>
>> +/*
>> + * we use a single page to load ctx workarounds so all of these
>> + * values are expressed in terms of dwords
>> + *
>> + * struct i915_wa_ctx_bb:
>> + *  offset: specifies batch starting position, also helpful in case
>> + *    we want to have multiple batches at different offsets based on
>> + *    some criteria. It is not a requirement at the moment but provides
>> + *    an option for future use.
>> + *  size: size of the batch in DWORDS
>> + */
>> +struct  i915_ctx_workarounds {
>> +	struct i915_wa_ctx_bb {
>> +		u32 offset;
>> +		u32 size;
>> +	} indirect_ctx, per_ctx;
>> +	struct drm_i915_gem_object *obj;
>> +};
>> +
>>   struct  intel_engine_cs {
>>   	const char	*name;
>>   	enum intel_ring_id {
>> @@ -142,6 +162,7 @@ struct  intel_engine_cs {
>>   	struct i915_gem_batch_pool batch_pool;
>>
>>   	struct intel_hw_status_page status_page;
>> +	struct i915_ctx_workarounds wa_ctx;
>>
>>   	unsigned irq_refcount; /* protected by dev_priv->irq_lock */
>>   	u32		irq_enable_mask;	/* bitmask to enable ring interrupt */
>> --
>> 2.3.0
>>
>



* Re: [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround
  2015-06-19 17:37 ` [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround Arun Siluvery
  2015-06-22 11:30   ` Siluvery, Arun
@ 2015-06-22 16:21   ` Ville Syrjälä
  2015-06-22 16:59     ` Siluvery, Arun
  1 sibling, 1 reply; 27+ messages in thread
From: Ville Syrjälä @ 2015-06-22 16:21 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Fri, Jun 19, 2015 at 06:37:15PM +0100, Arun Siluvery wrote:
> In Per context w/a batch buffer,
> WaRsRestoreWithPerCtxtBb
> 
> This WA performs writes to scratch page so it must be valid, this check
> is performed before initializing the batch with this WA.
> 
> v2: This patch modifies the definitions of MI_LOAD_REGISTER_MEM and
> MI_LOAD_REGISTER_REG; Add GEN8 specific defines for these instructions
> so as to not break any future users of existing definitions (Michel)
> 
> v3: The length in the current definitions of the LRM and LRR instructions was
> specified as 0; this seems to be a common convention for instructions whose
> length varies between platforms. It was not an issue so far because they were
> not used anywhere except the command parser; now that this patch uses them,
> update them with the correct length and also move them out of the command
> parser placeholder to the appropriate place.
> Remove unnecessary padding and follow the WA programming sequence exactly
> as mentioned in the spec, which is essential for this WA (Dave).
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reg.h  | 29 +++++++++++++++++++--
>  drivers/gpu/drm/i915/intel_lrc.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 7637e64..208620d 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -347,6 +347,31 @@
>  #define   MI_INVALIDATE_BSD		(1<<7)
>  #define   MI_FLUSH_DW_USE_GTT		(1<<2)
>  #define   MI_FLUSH_DW_USE_PPGTT		(0<<2)
> +#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 1)
> +#define MI_LOAD_REGISTER_MEM_GEN8 MI_INSTR(0x29, 2)
> +#define   MI_LRM_USE_GLOBAL_GTT (1<<22)
> +#define   MI_LRM_ASYNC_MODE_ENABLE (1<<21)
> +#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 1)
> +#define MI_ATOMIC(len)	MI_INSTR(0x2F, (len-2))
> +#define   MI_ATOMIC_MEMORY_TYPE_GGTT	(1<<22)
> +#define   MI_ATOMIC_INLINE_DATA		(1<<18)
> +#define   MI_ATOMIC_CS_STALL		(1<<17)
> +#define   MI_ATOMIC_RETURN_DATA_CTL	(1<<16)
> +#define MI_ATOMIC_OP_MASK(op)  ((op) << 8)
> +#define MI_ATOMIC_AND	MI_ATOMIC_OP_MASK(0x01)
> +#define MI_ATOMIC_OR	MI_ATOMIC_OP_MASK(0x02)
> +#define MI_ATOMIC_XOR	MI_ATOMIC_OP_MASK(0x03)
> +#define MI_ATOMIC_MOVE	MI_ATOMIC_OP_MASK(0x04)
> +#define MI_ATOMIC_INC	MI_ATOMIC_OP_MASK(0x05)
> +#define MI_ATOMIC_DEC	MI_ATOMIC_OP_MASK(0x06)
> +#define MI_ATOMIC_ADD	MI_ATOMIC_OP_MASK(0x07)
> +#define MI_ATOMIC_SUB	MI_ATOMIC_OP_MASK(0x08)
> +#define MI_ATOMIC_RSUB	MI_ATOMIC_OP_MASK(0x09)
> +#define MI_ATOMIC_IMAX	MI_ATOMIC_OP_MASK(0x0A)
> +#define MI_ATOMIC_IMIN	MI_ATOMIC_OP_MASK(0x0B)
> +#define MI_ATOMIC_UMAX	MI_ATOMIC_OP_MASK(0x0C)
> +#define MI_ATOMIC_UMIN	MI_ATOMIC_OP_MASK(0x0D)
> +
>  #define MI_BATCH_BUFFER		MI_INSTR(0x30, 1)
>  #define   MI_BATCH_NON_SECURE		(1)
>  /* for snb/ivb/vlv this also means "batch in ppgtt" when ppgtt is enabled. */
> @@ -451,8 +476,6 @@
>  #define MI_CLFLUSH              MI_INSTR(0x27, 0)
>  #define MI_REPORT_PERF_COUNT    MI_INSTR(0x28, 0)
>  #define   MI_REPORT_PERF_COUNT_GGTT (1<<0)
> -#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 0)
> -#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 0)
>  #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
>  #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
>  #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
> @@ -1799,6 +1822,8 @@ enum skl_disp_power_wells {
>  #define   GEN8_RC_SEMA_IDLE_MSG_DISABLE	(1 << 12)
>  #define   GEN8_FF_DOP_CLOCK_GATE_DISABLE	(1<<10)
>  
> +#define GEN8_RS_PREEMPT_STATUS		0x215C
> +
>  /* Fuse readout registers for GT */
>  #define CHV_FUSE_GT			(VLV_DISPLAY_BASE + 0x2168)
>  #define   CHV_FGT_DISABLE_SS0		(1 << 10)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 664455c..28198c4 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1215,11 +1215,65 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
>  			       uint32_t *const batch,
>  			       uint32_t *offset)
>  {
> +	uint32_t scratch_addr;
>  	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>  
> +	/* Actual scratch location is at a 128-byte offset */
> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
> +
>  	/* WaDisableCtxRestoreArbitration:bdw,chv */
>  	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>  
> +	/*
> +	 * As per Bspec, to workaround a known HW issue, SW must perform the
> +	 * below programming sequence prior to programming MI_BATCH_BUFFER_END.
> +	 *
> +	 * This is only applicable for Gen8.
> +	 */
> +
> +	/* WaRsRestoreWithPerCtxtBb:bdw,chv */

This w/a doesn't seem to be needed for CHV. Also BDW seems to have
gained a chicken bit in H0 (FF_SLICE_CS_CHICKEN3[5]) that supposedly
means we shouldn't need this w/a on BDW either.

> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
> +	wa_ctx_emit(batch, INSTPM);
> +	wa_ctx_emit(batch, _MASKED_BIT_DISABLE(INSTPM_FORCE_ORDERING));
> +
> +	wa_ctx_emit(batch, (MI_ATOMIC(5) |
> +			    MI_ATOMIC_MEMORY_TYPE_GGTT |
> +			    MI_ATOMIC_INLINE_DATA |
> +			    MI_ATOMIC_CS_STALL |
> +			    MI_ATOMIC_RETURN_DATA_CTL |
> +			    MI_ATOMIC_MOVE));
> +	wa_ctx_emit(batch, scratch_addr);
> +	wa_ctx_emit(batch, 0);
> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
> +
> +	/*
> +	 * BSpec says MI_LOAD_REGISTER_MEM, MI_LOAD_REGISTER_REG and
> +	 * MI_BATCH_BUFFER_END instructions in this sequence need to be
> +	 * in the same cacheline. To satisfy this even if more WA are added
> +	 * in future, pad the current cacheline and start the remaining
> +	 * sequence in a new cacheline.
> +	 */
> +	while (index % CACHELINE_DWORDS)
> +		wa_ctx_emit(batch, MI_NOOP);
> +
> +	wa_ctx_emit(batch, (MI_LOAD_REGISTER_MEM_GEN8 |
> +			    MI_LRM_USE_GLOBAL_GTT |
> +			    MI_LRM_ASYNC_MODE_ENABLE));
> +	wa_ctx_emit(batch, INSTPM);
> +	wa_ctx_emit(batch, scratch_addr);
> +	wa_ctx_emit(batch, 0);
> +
> +	/*
> +	 * BSpec says there should not be any commands programmed
> +	 * between MI_LOAD_REGISTER_REG and MI_BATCH_BUFFER_END so
> +	 * do not add any new commands
> +	 */
> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_REG);
> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
> +
>  	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
>  
>  	return wa_ctx_end(wa_ctx, *offset = index, 1);
> -- 
> 2.3.0
> 

-- 
Ville Syrjälä
Intel OTC


* Re: [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround
  2015-06-22 16:21   ` Ville Syrjälä
@ 2015-06-22 16:59     ` Siluvery, Arun
  2015-06-23 14:48       ` Siluvery, Arun
  0 siblings, 1 reply; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-22 16:59 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On 22/06/2015 17:21, Ville Syrjälä wrote:
> On Fri, Jun 19, 2015 at 06:37:15PM +0100, Arun Siluvery wrote:
>> In Per context w/a batch buffer,
>> WaRsRestoreWithPerCtxtBb
>>
>> This WA performs writes to scratch page so it must be valid, this check
>> is performed before initializing the batch with this WA.
>>
>> v2: This patch modifies the definitions of MI_LOAD_REGISTER_MEM and
>> MI_LOAD_REGISTER_REG; Add GEN8 specific defines for these instructions
>> so as to not break any future users of existing definitions (Michel)
>>
>> v3: The length in the current definitions of the LRM and LRR instructions was
>> specified as 0; this seems to be a common convention for instructions whose
>> length varies between platforms. It was not an issue so far because they were
>> not used anywhere except the command parser; now that this patch uses them,
>> update them with the correct length and also move them out of the command
>> parser placeholder to the appropriate place.
>> Remove unnecessary padding and follow the WA programming sequence exactly
>> as mentioned in the spec, which is essential for this WA (Dave).
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Dave Gordon <david.s.gordon@intel.com>
>> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_reg.h  | 29 +++++++++++++++++++--
>>   drivers/gpu/drm/i915/intel_lrc.c | 54 ++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 81 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index 7637e64..208620d 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -347,6 +347,31 @@
>>   #define   MI_INVALIDATE_BSD		(1<<7)
>>   #define   MI_FLUSH_DW_USE_GTT		(1<<2)
>>   #define   MI_FLUSH_DW_USE_PPGTT		(0<<2)
>> +#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 1)
>> +#define MI_LOAD_REGISTER_MEM_GEN8 MI_INSTR(0x29, 2)
>> +#define   MI_LRM_USE_GLOBAL_GTT (1<<22)
>> +#define   MI_LRM_ASYNC_MODE_ENABLE (1<<21)
>> +#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 1)
>> +#define MI_ATOMIC(len)	MI_INSTR(0x2F, (len-2))
>> +#define   MI_ATOMIC_MEMORY_TYPE_GGTT	(1<<22)
>> +#define   MI_ATOMIC_INLINE_DATA		(1<<18)
>> +#define   MI_ATOMIC_CS_STALL		(1<<17)
>> +#define   MI_ATOMIC_RETURN_DATA_CTL	(1<<16)
>> +#define MI_ATOMIC_OP_MASK(op)  ((op) << 8)
>> +#define MI_ATOMIC_AND	MI_ATOMIC_OP_MASK(0x01)
>> +#define MI_ATOMIC_OR	MI_ATOMIC_OP_MASK(0x02)
>> +#define MI_ATOMIC_XOR	MI_ATOMIC_OP_MASK(0x03)
>> +#define MI_ATOMIC_MOVE	MI_ATOMIC_OP_MASK(0x04)
>> +#define MI_ATOMIC_INC	MI_ATOMIC_OP_MASK(0x05)
>> +#define MI_ATOMIC_DEC	MI_ATOMIC_OP_MASK(0x06)
>> +#define MI_ATOMIC_ADD	MI_ATOMIC_OP_MASK(0x07)
>> +#define MI_ATOMIC_SUB	MI_ATOMIC_OP_MASK(0x08)
>> +#define MI_ATOMIC_RSUB	MI_ATOMIC_OP_MASK(0x09)
>> +#define MI_ATOMIC_IMAX	MI_ATOMIC_OP_MASK(0x0A)
>> +#define MI_ATOMIC_IMIN	MI_ATOMIC_OP_MASK(0x0B)
>> +#define MI_ATOMIC_UMAX	MI_ATOMIC_OP_MASK(0x0C)
>> +#define MI_ATOMIC_UMIN	MI_ATOMIC_OP_MASK(0x0D)
>> +
>>   #define MI_BATCH_BUFFER		MI_INSTR(0x30, 1)
>>   #define   MI_BATCH_NON_SECURE		(1)
>>   /* for snb/ivb/vlv this also means "batch in ppgtt" when ppgtt is enabled. */
>> @@ -451,8 +476,6 @@
>>   #define MI_CLFLUSH              MI_INSTR(0x27, 0)
>>   #define MI_REPORT_PERF_COUNT    MI_INSTR(0x28, 0)
>>   #define   MI_REPORT_PERF_COUNT_GGTT (1<<0)
>> -#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 0)
>> -#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 0)
>>   #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
>>   #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
>>   #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
>> @@ -1799,6 +1822,8 @@ enum skl_disp_power_wells {
>>   #define   GEN8_RC_SEMA_IDLE_MSG_DISABLE	(1 << 12)
>>   #define   GEN8_FF_DOP_CLOCK_GATE_DISABLE	(1<<10)
>>
>> +#define GEN8_RS_PREEMPT_STATUS		0x215C
>> +
>>   /* Fuse readout registers for GT */
>>   #define CHV_FUSE_GT			(VLV_DISPLAY_BASE + 0x2168)
>>   #define   CHV_FGT_DISABLE_SS0		(1 << 10)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 664455c..28198c4 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1215,11 +1215,65 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
>>   			       uint32_t *const batch,
>>   			       uint32_t *offset)
>>   {
>> +	uint32_t scratch_addr;
>>   	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>>
>> +	/* Actual scratch location is at 128 bytes offset */
>> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
>> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
>> +
>>   	/* WaDisableCtxRestoreArbitration:bdw,chv */
>>   	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>>
>> +	/*
>> +	 * As per Bspec, to workaround a known HW issue, SW must perform the
>> +	 * below programming sequence prior to programming MI_BATCH_BUFFER_END.
>> +	 *
>> +	 * This is only applicable for Gen8.
>> +	 */
>> +
>> +	/* WaRsRestoreWithPerCtxtBb:bdw,chv */
>
> This w/a doesn't seem to be needed for CHV. Also BDW seems to have
> gained a chicken bit in H0 (FF_SLICE_CS_CHICKEN3[5]) that supposedly
> means we shouldn't need this w/a on BDW either.
>
It looks like this chicken bit applies the same WA; if it works as 
expected then we can ignore this patch. I will try to get some 
confirmation on this.

regards
Arun

>> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
>> +	wa_ctx_emit(batch, INSTPM);
>> +	wa_ctx_emit(batch, _MASKED_BIT_DISABLE(INSTPM_FORCE_ORDERING));
>> +
>> +	wa_ctx_emit(batch, (MI_ATOMIC(5) |
>> +			    MI_ATOMIC_MEMORY_TYPE_GGTT |
>> +			    MI_ATOMIC_INLINE_DATA |
>> +			    MI_ATOMIC_CS_STALL |
>> +			    MI_ATOMIC_RETURN_DATA_CTL |
>> +			    MI_ATOMIC_MOVE));
>> +	wa_ctx_emit(batch, scratch_addr);
>> +	wa_ctx_emit(batch, 0);
>> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
>> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
>> +
>> +	/*
>> +	 * BSpec says MI_LOAD_REGISTER_MEM, MI_LOAD_REGISTER_REG and
>> +	 * MI_BATCH_BUFFER_END instructions in this sequence need to be
>> +	 * in the same cacheline. To satisfy this case even if more WA are
>> +	 * added in future, pad current cacheline and start remaining sequence
>> +	 * in new cacheline.
>> +	 */
>> +	while (index % CACHELINE_DWORDS)
>> +		wa_ctx_emit(batch, MI_NOOP);
>> +
>> +	wa_ctx_emit(batch, (MI_LOAD_REGISTER_MEM_GEN8 |
>> +			    MI_LRM_USE_GLOBAL_GTT |
>> +			    MI_LRM_ASYNC_MODE_ENABLE));
>> +	wa_ctx_emit(batch, INSTPM);
>> +	wa_ctx_emit(batch, scratch_addr);
>> +	wa_ctx_emit(batch, 0);
>> +
>> +	/*
>> +	 * BSpec says there should not be any commands programmed
>> +	 * between MI_LOAD_REGISTER_REG and MI_BATCH_BUFFER_END so
>> +	 * do not add any new commands
>> +	 */
>> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_REG);
>> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
>> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
>> +
>>   	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
>>
>>   	return wa_ctx_end(wa_ctx, *offset = index, 1);
>> --
>> 2.3.0
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v6 5/5] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-19 17:37 ` [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround Arun Siluvery
  2015-06-19 18:09   ` Chris Wilson
@ 2015-06-23 14:46   ` Arun Siluvery
  2015-06-23 15:14     ` Chris Wilson
  1 sibling, 1 reply; 27+ messages in thread
From: Arun Siluvery @ 2015-06-23 14:46 UTC (permalink / raw)
  To: intel-gfx

In Indirect context w/a batch buffer,
WaClearSlmSpaceAtContextSwitch

This WA performs writes to the scratch page, so it must be valid; this
check is performed before initializing the batch with this WA.

v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)

v3: GTT bit in scratch address should be mbz (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h  |  1 +
 drivers/gpu/drm/i915/intel_lrc.c | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index d99f45d..fa9ccb87 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -415,6 +415,7 @@
 #define   DISPLAY_PLANE_A           (0<<20)
 #define   DISPLAY_PLANE_B           (1<<20)
 #define GFX_OP_PIPE_CONTROL(len)	((0x3<<29)|(0x3<<27)|(0x2<<24)|(len-2))
+#define   PIPE_CONTROL_FLUSH_L3				(1<<27)
 #define   PIPE_CONTROL_GLOBAL_GTT_IVB			(1<<24) /* gen7+ */
 #define   PIPE_CONTROL_MMIO_WRITE			(1<<23)
 #define   PIPE_CONTROL_STORE_DATA_INDEX			(1<<21)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b36e064..e4ebe05 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1146,6 +1146,7 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 				    uint32_t *const batch,
 				    uint32_t *offset)
 {
+	uint32_t scratch_addr;
 	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
 
 	/* WaDisableCtxRestoreArbitration:bdw,chv */
@@ -1174,6 +1175,20 @@ static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
 		wa_ctx_emit(batch, l3sqc4_flush & ~GEN8_LQSC_FLUSH_COHERENT_LINES);
 	}
 
+	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
+	/* Actual scratch location is at 128 bytes offset */
+	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
+
+	wa_ctx_emit(batch, GFX_OP_PIPE_CONTROL(6));
+	wa_ctx_emit(batch, (PIPE_CONTROL_FLUSH_L3 |
+			    PIPE_CONTROL_GLOBAL_GTT_IVB |
+			    PIPE_CONTROL_CS_STALL |
+			    PIPE_CONTROL_QW_WRITE));
+	wa_ctx_emit(batch, scratch_addr);
+	wa_ctx_emit(batch, 0);
+	wa_ctx_emit(batch, 0);
+	wa_ctx_emit(batch, 0);
+
 	/* Pad to end of cacheline */
 	while (index % CACHELINE_DWORDS)
 		wa_ctx_emit(batch, MI_NOOP);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround
  2015-06-22 16:59     ` Siluvery, Arun
@ 2015-06-23 14:48       ` Siluvery, Arun
  0 siblings, 0 replies; 27+ messages in thread
From: Siluvery, Arun @ 2015-06-23 14:48 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On 22/06/2015 17:59, Siluvery, Arun wrote:
> On 22/06/2015 17:21, Ville Syrjälä wrote:
>> On Fri, Jun 19, 2015 at 06:37:15PM +0100, Arun Siluvery wrote:
>>> In Per context w/a batch buffer,
>>> WaRsRestoreWithPerCtxtBb
>>>
>>> This WA performs writes to scratch page so it must be valid, this check
>>> is performed before initializing the batch with this WA.
>>>
>>> v2: This patches modifies definitions of MI_LOAD_REGISTER_MEM and
>>> MI_LOAD_REGISTER_REG; Add GEN8 specific defines for these instructions
>>> so as to not break any future users of existing definitions (Michel)
>>>
>>> v3: Length defined in current definitions of LRM, LRR instructions was specified
>>> as 0. It seems it is common convention for instructions whose length vary between
>>> platforms. This is not an issue so far because they are not used anywhere except
>>> command parser; now that we use in this patch update them with correct length
>>> and also move them out of command parser placeholder to appropriate place.
>>> remove unnecessary padding and follow the WA programming sequence exactly
>>> as mentioned in spec which is essential for this WA (Dave).
>>>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Dave Gordon <david.s.gordon@intel.com>
>>> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_reg.h  | 29 +++++++++++++++++++--
>>>    drivers/gpu/drm/i915/intel_lrc.c | 54 ++++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 81 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>> index 7637e64..208620d 100644
>>> --- a/drivers/gpu/drm/i915/i915_reg.h
>>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>>> @@ -347,6 +347,31 @@
>>>    #define   MI_INVALIDATE_BSD		(1<<7)
>>>    #define   MI_FLUSH_DW_USE_GTT		(1<<2)
>>>    #define   MI_FLUSH_DW_USE_PPGTT		(0<<2)
>>> +#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 1)
>>> +#define MI_LOAD_REGISTER_MEM_GEN8 MI_INSTR(0x29, 2)
>>> +#define   MI_LRM_USE_GLOBAL_GTT (1<<22)
>>> +#define   MI_LRM_ASYNC_MODE_ENABLE (1<<21)
>>> +#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 1)
>>> +#define MI_ATOMIC(len)	MI_INSTR(0x2F, (len-2))
>>> +#define   MI_ATOMIC_MEMORY_TYPE_GGTT	(1<<22)
>>> +#define   MI_ATOMIC_INLINE_DATA		(1<<18)
>>> +#define   MI_ATOMIC_CS_STALL		(1<<17)
>>> +#define   MI_ATOMIC_RETURN_DATA_CTL	(1<<16)
>>> +#define MI_ATOMIC_OP_MASK(op)  ((op) << 8)
>>> +#define MI_ATOMIC_AND	MI_ATOMIC_OP_MASK(0x01)
>>> +#define MI_ATOMIC_OR	MI_ATOMIC_OP_MASK(0x02)
>>> +#define MI_ATOMIC_XOR	MI_ATOMIC_OP_MASK(0x03)
>>> +#define MI_ATOMIC_MOVE	MI_ATOMIC_OP_MASK(0x04)
>>> +#define MI_ATOMIC_INC	MI_ATOMIC_OP_MASK(0x05)
>>> +#define MI_ATOMIC_DEC	MI_ATOMIC_OP_MASK(0x06)
>>> +#define MI_ATOMIC_ADD	MI_ATOMIC_OP_MASK(0x07)
>>> +#define MI_ATOMIC_SUB	MI_ATOMIC_OP_MASK(0x08)
>>> +#define MI_ATOMIC_RSUB	MI_ATOMIC_OP_MASK(0x09)
>>> +#define MI_ATOMIC_IMAX	MI_ATOMIC_OP_MASK(0x0A)
>>> +#define MI_ATOMIC_IMIN	MI_ATOMIC_OP_MASK(0x0B)
>>> +#define MI_ATOMIC_UMAX	MI_ATOMIC_OP_MASK(0x0C)
>>> +#define MI_ATOMIC_UMIN	MI_ATOMIC_OP_MASK(0x0D)
>>> +
>>>    #define MI_BATCH_BUFFER		MI_INSTR(0x30, 1)
>>>    #define   MI_BATCH_NON_SECURE		(1)
>>>    /* for snb/ivb/vlv this also means "batch in ppgtt" when ppgtt is enabled. */
>>> @@ -451,8 +476,6 @@
>>>    #define MI_CLFLUSH              MI_INSTR(0x27, 0)
>>>    #define MI_REPORT_PERF_COUNT    MI_INSTR(0x28, 0)
>>>    #define   MI_REPORT_PERF_COUNT_GGTT (1<<0)
>>> -#define MI_LOAD_REGISTER_MEM    MI_INSTR(0x29, 0)
>>> -#define MI_LOAD_REGISTER_REG    MI_INSTR(0x2A, 0)
>>>    #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
>>>    #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
>>>    #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
>>> @@ -1799,6 +1822,8 @@ enum skl_disp_power_wells {
>>>    #define   GEN8_RC_SEMA_IDLE_MSG_DISABLE	(1 << 12)
>>>    #define   GEN8_FF_DOP_CLOCK_GATE_DISABLE	(1<<10)
>>>
>>> +#define GEN8_RS_PREEMPT_STATUS		0x215C
>>> +
>>>    /* Fuse readout registers for GT */
>>>    #define CHV_FUSE_GT			(VLV_DISPLAY_BASE + 0x2168)
>>>    #define   CHV_FGT_DISABLE_SS0		(1 << 10)
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>> index 664455c..28198c4 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -1215,11 +1215,65 @@ static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
>>>    			       uint32_t *const batch,
>>>    			       uint32_t *offset)
>>>    {
>>> +	uint32_t scratch_addr;
>>>    	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
>>>
>>> +	/* Actual scratch location is at 128 bytes offset */
>>> +	scratch_addr = ring->scratch.gtt_offset + 2*CACHELINE_BYTES;
>>> +	scratch_addr |= PIPE_CONTROL_GLOBAL_GTT;
>>> +
>>>    	/* WaDisableCtxRestoreArbitration:bdw,chv */
>>>    	wa_ctx_emit(batch, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>>>
>>> +	/*
>>> +	 * As per Bspec, to workaround a known HW issue, SW must perform the
>>> +	 * below programming sequence prior to programming MI_BATCH_BUFFER_END.
>>> +	 *
>>> +	 * This is only applicable for Gen8.
>>> +	 */
>>> +
>>> +	/* WaRsRestoreWithPerCtxtBb:bdw,chv */
>>
>> This w/a doesn't seem to be needed for CHV. Also BDW seems to have
>> gained a chicken bit in H0 (FF_SLICE_CS_CHICKEN3[5]) that supposedly
>> means we shouldn't need this w/a on BDW either.
>>
> It looks like this chicken bit applies the same WA; if it works as
> expected then we can ignore this patch. I will try to get some
> confirmation on this.

I got confirmation from the hardware team that the chicken bit is enough, 
so this patch can be ignored.

regards
Arun

>
> regards
> Arun
>
>>> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_IMM(1));
>>> +	wa_ctx_emit(batch, INSTPM);
>>> +	wa_ctx_emit(batch, _MASKED_BIT_DISABLE(INSTPM_FORCE_ORDERING));
>>> +
>>> +	wa_ctx_emit(batch, (MI_ATOMIC(5) |
>>> +			    MI_ATOMIC_MEMORY_TYPE_GGTT |
>>> +			    MI_ATOMIC_INLINE_DATA |
>>> +			    MI_ATOMIC_CS_STALL |
>>> +			    MI_ATOMIC_RETURN_DATA_CTL |
>>> +			    MI_ATOMIC_MOVE));
>>> +	wa_ctx_emit(batch, scratch_addr);
>>> +	wa_ctx_emit(batch, 0);
>>> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
>>> +	wa_ctx_emit(batch, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING));
>>> +
>>> +	/*
>>> +	 * BSpec says MI_LOAD_REGISTER_MEM, MI_LOAD_REGISTER_REG and
>>> +	 * MI_BATCH_BUFFER_END instructions in this sequence need to be
>>> +	 * in the same cacheline. To satisfy this case even if more WA are
>>> +	 * added in future, pad current cacheline and start remaining sequence
>>> +	 * in new cacheline.
>>> +	 */
>>> +	while (index % CACHELINE_DWORDS)
>>> +		wa_ctx_emit(batch, MI_NOOP);
>>> +
>>> +	wa_ctx_emit(batch, (MI_LOAD_REGISTER_MEM_GEN8 |
>>> +			    MI_LRM_USE_GLOBAL_GTT |
>>> +			    MI_LRM_ASYNC_MODE_ENABLE));
>>> +	wa_ctx_emit(batch, INSTPM);
>>> +	wa_ctx_emit(batch, scratch_addr);
>>> +	wa_ctx_emit(batch, 0);
>>> +
>>> +	/*
>>> +	 * BSpec says there should not be any commands programmed
>>> +	 * between MI_LOAD_REGISTER_REG and MI_BATCH_BUFFER_END so
>>> +	 * do not add any new commands
>>> +	 */
>>> +	wa_ctx_emit(batch, MI_LOAD_REGISTER_REG);
>>> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
>>> +	wa_ctx_emit(batch, GEN8_RS_PREEMPT_STATUS);
>>> +
>>>    	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
>>>
>>>    	return wa_ctx_end(wa_ctx, *offset = index, 1);
>>> --
>>> 2.3.0
>>>
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 5/5] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-23 14:46   ` [PATCH v6 5/5] " Arun Siluvery
@ 2015-06-23 15:14     ` Chris Wilson
  2015-06-23 21:22       ` Daniel Vetter
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wilson @ 2015-06-23 15:14 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Tue, Jun 23, 2015 at 03:46:57PM +0100, Arun Siluvery wrote:
> In Indirect context w/a batch buffer,
> WaClearSlmSpaceAtContextSwitch
> 
> This WA performs writes to scratch page so it must be valid, this check
> is performed before initializing the batch with this WA.
> 
> v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
> 
> v3: GTT bit in scratch address should be mbz (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Presuming the wa is still valid ;-)
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 5/5] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround
  2015-06-23 15:14     ` Chris Wilson
@ 2015-06-23 21:22       ` Daniel Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Vetter @ 2015-06-23 21:22 UTC (permalink / raw)
  To: Chris Wilson, Arun Siluvery, intel-gfx, Dave Gordon, Rafael Barbalho

On Tue, Jun 23, 2015 at 04:14:19PM +0100, Chris Wilson wrote:
> On Tue, Jun 23, 2015 at 03:46:57PM +0100, Arun Siluvery wrote:
> > In Indirect context w/a batch buffer,
> > WaClearSlmSpaceAtContextSwitch
> > 
> > This WA performs writes to scratch page so it must be valid, this check
> > is performed before initializing the batch with this WA.
> > 
> > v2: s/PIPE_CONTROL_FLUSH_RO_CACHES/PIPE_CONTROL_FLUSH_L3 (Ville)
> > 
> > v3: GTT bit in scratch address should be mbz (Chris)
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Dave Gordon <david.s.gordon@intel.com>
> > Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> Presuming the wa is still valid ;-)
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Queued for -next, thanks for the patch. And patch 6 ignored as requested.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-06-23 21:19 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-19 17:37 [PATCH v6 0/6] Add Per-context WA using WA batch buffers Arun Siluvery
2015-06-19 17:37 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize " Arun Siluvery
2015-06-19 17:50   ` Chris Wilson
2015-06-22 15:36     ` Daniel Vetter
2015-06-22 15:37       ` Siluvery, Arun
2015-06-19 17:37 ` [PATCH v6 2/6] drm/i915/gen8: Re-order init pipe_control in lrc mode Arun Siluvery
2015-06-19 17:58   ` Chris Wilson
2015-06-19 17:37 ` [PATCH v6 3/6] drm/i915/gen8: Add WaDisableCtxRestoreArbitration workaround Arun Siluvery
2015-06-19 18:11   ` Chris Wilson
2015-06-19 17:37 ` [PATCH v6 4/6] drm/i915/gen8: Add WaFlushCoherentL3CacheLinesAtContextSwitch workaround Arun Siluvery
2015-06-19 18:12   ` Chris Wilson
2015-06-22 15:41     ` Daniel Vetter
2015-06-19 17:37 ` [PATCH v6 5/6] drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch workaround Arun Siluvery
2015-06-19 18:09   ` Chris Wilson
2015-06-22 11:29     ` Siluvery, Arun
2015-06-22 15:39       ` Daniel Vetter
2015-06-23 14:46   ` [PATCH v6 5/5] " Arun Siluvery
2015-06-23 15:14     ` Chris Wilson
2015-06-23 21:22       ` Daniel Vetter
2015-06-19 17:37 ` [PATCH v6 6/6] drm/i915/gen8: Add WaRsRestoreWithPerCtxtBb workaround Arun Siluvery
2015-06-22 11:30   ` Siluvery, Arun
2015-06-22 16:21   ` Ville Syrjälä
2015-06-22 16:59     ` Siluvery, Arun
2015-06-23 14:48       ` Siluvery, Arun
2015-06-19 18:07 ` [PATCH v6 1/6] drm/i915/gen8: Add infrastructure to initialize WA batch buffers Arun Siluvery
2015-06-22 15:41   ` Daniel Vetter
2015-06-22 15:43     ` Siluvery, Arun
