* [PATCH] drm/i915: Execlists small cleanups and micro-optimisations
@ 2016-02-26 15:37 Tvrtko Ursulin
  2016-02-26 16:03 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-26 15:37 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Assorted changes in the areas of code cleanup, reduction of
invariant conditionals in the interrupt handler, lock
contention reduction and MMIO access optimisation.

 * Remove needless initialization.
 * Improve cache locality by reorganizing code and/or using
   branch hints to keep unexpected or error conditions out
   of line.
 * Favor busy submit path vs. empty queue.
 * Less branching in hot paths.

v2:

 * Avoid mmio reads when possible. (Chris Wilson)
 * Use natural integer size for csb indices.
 * Remove useless return value from execlists_update_context.
 * Extract 32-bit ppgtt PDPs update so it is out of line and
   shared with two callers.
 * Grab forcewake across all mmio operations to ease the
   load on the uncore lock and use cheaper mmio ops.

v3:

 * Removed some more pointless u8 data types.
 * Removed unused return from execlists_context_queue.
 * Commit message updates.

Version 3 now makes the irq handling code path ~20% smaller on
48-bit PPGTT hardware, and a little less elsewhere. Hot
paths are mostly in-line now, and hammering on the uncore
spinlock is greatly reduced, together with mmio traffic to an
extent.

Benchmarking with "gem_latency -n 100" (keep submitting
batches with 100 nop instructions) shows approximately 4% higher
throughput, 2% less CPU time and 22% smaller latencies. This was
on a big core, while small cores could benefit even more.

The most likely reasons for the improvements are the MMIO
optimisation and the reduction in uncore lock traffic.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 208 ++++++++++++++++----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |   3 +-
 2 files changed, 108 insertions(+), 103 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f0a57afc8dff..0f4dae257fa6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -269,6 +269,9 @@ logical_ring_init_platform_invariants(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 
+	if (IS_GEN8(dev) || IS_GEN9(dev))
+		ring->idle_lite_restore_wa = ~0;
+
 	ring->disable_lite_restore_wa = (IS_SKL_REVID(dev, 0, SKL_REVID_B0) ||
 					IS_BXT_REVID(dev, 0, BXT_REVID_A1)) &&
 					(ring->id == VCS || ring->id == VCS2);
@@ -372,8 +375,6 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 	rq0->elsp_submitted++;
 
 	/* You must always write both descriptors in the order below. */
-	spin_lock(&dev_priv->uncore.lock);
-	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 	I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[1]));
 	I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[1]));
 
@@ -383,11 +384,18 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 
 	/* ELSP is a wo register, use another nearby reg for posting */
 	POSTING_READ_FW(RING_EXECLIST_STATUS_LO(ring));
-	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
 }
 
-static int execlists_update_context(struct drm_i915_gem_request *rq)
+static void
+execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state)
+{
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+}
+
+static void execlists_update_context(struct drm_i915_gem_request *rq)
 {
 	struct intel_engine_cs *ring = rq->ring;
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
@@ -395,19 +403,13 @@ static int execlists_update_context(struct drm_i915_gem_request *rq)
 
 	reg_state[CTX_RING_TAIL+1] = rq->tail;
 
-	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev)) {
-		/* True 32b PPGTT with dynamic page allocation: update PDP
-		 * registers and point the unallocated PDPs to scratch page.
-		 * PML4 is allocated during ppgtt init, so this is not needed
-		 * in 48-bit mode.
-		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
-	}
-
-	return 0;
+	/* True 32b PPGTT with dynamic page allocation: update PDP
+	 * registers and point the unallocated PDPs to scratch page.
+	 * PML4 is allocated during ppgtt init, so this is not needed
+	 * in 48-bit mode.
+	 */
+	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev))
+		execlists_update_context_pdps(ppgtt, reg_state);
 }
 
 static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
@@ -424,7 +426,7 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 static void execlists_context_unqueue(struct intel_engine_cs *ring)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
-	struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
+	struct drm_i915_gem_request *cursor, *tmp;
 
 	assert_spin_locked(&ring->execlist_lock);
 
@@ -434,9 +436,6 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 	 */
 	WARN_ON(!intel_irqs_enabled(ring->dev->dev_private));
 
-	if (list_empty(&ring->execlist_queue))
-		return;
-
 	/* Try to read in pairs */
 	list_for_each_entry_safe(cursor, tmp, &ring->execlist_queue,
 				 execlist_link) {
@@ -451,37 +450,35 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 			req0 = cursor;
 		} else {
 			req1 = cursor;
+			WARN_ON(req1->elsp_submitted);
 			break;
 		}
 	}
 
-	if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) {
+	if (unlikely(!req0))
+		return;
+
+	if (req0->elsp_submitted & ring->idle_lite_restore_wa) {
 		/*
-		 * WaIdleLiteRestore: make sure we never cause a lite
-		 * restore with HEAD==TAIL
+		 * WaIdleLiteRestore: make sure we never cause a lite restore
+		 * with HEAD==TAIL.
+		 *
+		 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL as we
+		 * resubmit the request. See gen8_emit_request() for where we
+		 * prepare the padding after the end of the request.
 		 */
-		if (req0->elsp_submitted) {
-			/*
-			 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL
-			 * as we resubmit the request. See gen8_emit_request()
-			 * for where we prepare the padding after the end of the
-			 * request.
-			 */
-			struct intel_ringbuffer *ringbuf;
+		struct intel_ringbuffer *ringbuf;
 
-			ringbuf = req0->ctx->engine[ring->id].ringbuf;
-			req0->tail += 8;
-			req0->tail &= ringbuf->size - 1;
-		}
+		ringbuf = req0->ctx->engine[ring->id].ringbuf;
+		req0->tail += 8;
+		req0->tail &= ringbuf->size - 1;
 	}
 
-	WARN_ON(req1 && req1->elsp_submitted);
-
 	execlists_submit_requests(req0, req1);
 }
 
-static bool execlists_check_remove_request(struct intel_engine_cs *ring,
-					   u32 request_id)
+static unsigned int
+execlists_check_remove_request(struct intel_engine_cs *ring, u32 request_id)
 {
 	struct drm_i915_gem_request *head_req;
 
@@ -491,33 +488,41 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring,
 					    struct drm_i915_gem_request,
 					    execlist_link);
 
-	if (head_req != NULL) {
-		if (intel_execlists_ctx_id(head_req->ctx, ring) == request_id) {
-			WARN(head_req->elsp_submitted == 0,
-			     "Never submitted head request\n");
+	if (!head_req)
+		return 0;
 
-			if (--head_req->elsp_submitted <= 0) {
-				list_move_tail(&head_req->execlist_link,
-					       &ring->execlist_retired_req_list);
-				return true;
-			}
-		}
-	}
+	if (unlikely(intel_execlists_ctx_id(head_req->ctx, ring) != request_id))
+		return 0;
+
+	WARN(head_req->elsp_submitted == 0, "Never submitted head request\n");
+
+	if (--head_req->elsp_submitted > 0)
+		return 0;
 
-	return false;
+	list_move_tail(&head_req->execlist_link,
+		       &ring->execlist_retired_req_list);
+
+	return 1;
 }
 
-static void get_context_status(struct intel_engine_cs *ring,
-			       u8 read_pointer,
-			       u32 *status, u32 *context_id)
+static u32
+get_context_status(struct intel_engine_cs *ring, unsigned int read_pointer,
+		   u32 *context_id)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	u32 status;
 
-	if (WARN_ON(read_pointer >= GEN8_CSB_ENTRIES))
-		return;
+	read_pointer %= GEN8_CSB_ENTRIES;
+
+	status = I915_READ_FW(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
 
-	*status = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
-	*context_id = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, read_pointer));
+	if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
+		return 0;
+
+	*context_id = I915_READ_FW(RING_CONTEXT_STATUS_BUF_HI(ring,
+							      read_pointer));
+
+	return status;
 }
 
 /**
@@ -531,30 +536,27 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	u32 status_pointer;
-	u8 read_pointer;
-	u8 write_pointer;
+	unsigned int read_pointer, write_pointer;
 	u32 status = 0;
 	u32 status_id;
-	u32 submit_contexts = 0;
+	unsigned int submit_contexts = 0;
 
-	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
+	spin_lock(&ring->execlist_lock);
+
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
+	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(ring));
 
 	read_pointer = ring->next_context_status_buffer;
 	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
 	if (read_pointer > write_pointer)
 		write_pointer += GEN8_CSB_ENTRIES;
 
-	spin_lock(&ring->execlist_lock);
-
 	while (read_pointer < write_pointer) {
+		status = get_context_status(ring, ++read_pointer, &status_id);
 
-		get_context_status(ring, ++read_pointer % GEN8_CSB_ENTRIES,
-				   &status, &status_id);
-
-		if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
-			continue;
-
-		if (status & GEN8_CTX_STATUS_PREEMPTED) {
+		if (unlikely(status & GEN8_CTX_STATUS_PREEMPTED)) {
 			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
 				if (execlists_check_remove_request(ring, status_id))
 					WARN(1, "Lite Restored request removed from queue\n");
@@ -562,38 +564,37 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 				WARN(1, "Preemption without Lite Restore\n");
 		}
 
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) ||
-		    (status & GEN8_CTX_STATUS_ELEMENT_SWITCH)) {
-			if (execlists_check_remove_request(ring, status_id))
-				submit_contexts++;
-		}
+		if (status & (GEN8_CTX_STATUS_ACTIVE_IDLE |
+		    GEN8_CTX_STATUS_ELEMENT_SWITCH))
+			submit_contexts +=
+				execlists_check_remove_request(ring, status_id);
 	}
 
-	if (ring->disable_lite_restore_wa) {
-		/* Prevent a ctx to preempt itself */
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
-		    (submit_contexts != 0))
-			execlists_context_unqueue(ring);
-	} else if (submit_contexts != 0) {
+	if (submit_contexts && (!ring->disable_lite_restore_wa ||
+	    (ring->disable_lite_restore_wa && (status &
+	    GEN8_CTX_STATUS_ACTIVE_IDLE))))
 		execlists_context_unqueue(ring);
-	}
-
-	spin_unlock(&ring->execlist_lock);
-
-	if (unlikely(submit_contexts > 2))
-		DRM_ERROR("More than two context complete events?\n");
 
 	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
 
 	/* Update the read pointer to the old write pointer. Manual ringbuffer
 	 * management ftw </sarcasm> */
-	I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
-		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
-				 ring->next_context_status_buffer << 8));
+	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(ring),
+		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
+				    ring->next_context_status_buffer << 8));
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+
+	spin_unlock(&ring->execlist_lock);
+
+	if (unlikely(submit_contexts > 2))
+		DRM_ERROR("More than two context complete events?\n");
 }
 
-static int execlists_context_queue(struct drm_i915_gem_request *request)
+static void execlists_context_queue(struct drm_i915_gem_request *request)
 {
+	struct drm_i915_private *dev_priv = request->i915;
 	struct intel_engine_cs *ring = request->ring;
 	struct drm_i915_gem_request *cursor;
 	int num_elements = 0;
@@ -625,12 +626,17 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
 	}
 
 	list_add_tail(&request->execlist_link, &ring->execlist_queue);
-	if (num_elements == 0)
+	if (num_elements == 0) {
+		spin_lock(&dev_priv->uncore.lock);
+		intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
 		execlists_context_unqueue(ring);
 
-	spin_unlock_irq(&ring->execlist_lock);
+		intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+		spin_unlock(&dev_priv->uncore.lock);
+	}
 
-	return 0;
+	spin_unlock_irq(&ring->execlist_lock);
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -1549,7 +1555,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	u8 next_context_status_buffer_hw;
+	unsigned int next_context_status_buffer_hw;
 
 	lrc_setup_hardware_status_page(ring,
 				dev_priv->kernel_context->engine[ring->id].state);
@@ -2012,6 +2018,7 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		ring->status_page.obj = NULL;
 	}
 
+	ring->idle_lite_restore_wa = 0;
 	ring->disable_lite_restore_wa = false;
 	ring->ctx_desc_template = 0;
 
@@ -2416,10 +2423,7 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 		 * With dynamic page allocation, PDPs may not be allocated at
 		 * this point. Point the unallocated PDPs to the scratch page
 		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+		execlists_update_context_pdps(ppgtt, reg_state);
 	}
 
 	if (ring->id == RCS) {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..dd910d30a380 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -271,7 +271,8 @@ struct  intel_engine_cs {
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;
-	u8 next_context_status_buffer;
+	unsigned int next_context_status_buffer;
+	unsigned int idle_lite_restore_wa;
 	bool disable_lite_restore_wa;
 	u32 ctx_desc_template;
 	u32             irq_keep_mask; /* bitmask for interrupts that should not be masked */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-26 15:37 [PATCH] drm/i915: Execlists small cleanups and micro-optimisations Tvrtko Ursulin
@ 2016-02-26 16:03 ` Patchwork
  2016-02-26 16:36 ` [PATCH] " Chris Wilson
  2016-02-26 17:27 ` ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2) Patchwork
  2 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2016-02-26 16:03 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Execlists small cleanups and micro-optimisations
URL   : https://patchwork.freedesktop.org/series/3853/
State : failure

== Summary ==

Series 3853v1 drm/i915: Execlists small cleanups and micro-optimisations
http://patchwork.freedesktop.org/api/1.0/series/3853/revisions/1/mbox/

Test kms_flip:
        Subgroup basic-flip-vs-modeset:
                incomplete -> PASS       (ilk-hp8440p) UNSTABLE
Test kms_force_connector_basic:
        Subgroup force-load-detect:
                dmesg-fail -> FAIL       (hsw-gt2)
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                pass       -> DMESG-WARN (skl-i7k-2) UNSTABLE
        Subgroup suspend-read-crc-pipe-b:
                incomplete -> PASS       (hsw-gt2)
        Subgroup suspend-read-crc-pipe-c:
                dmesg-warn -> PASS       (skl-i7k-2) UNSTABLE
Test pm_rpm:
        Subgroup basic-rte:
                pass       -> FAIL       (hsw-brixbox)

bdw-nuci7        total:166  pass:155  dwarn:0   dfail:0   fail:0   skip:11 
bdw-ultra        total:169  pass:155  dwarn:0   dfail:0   fail:0   skip:14 
bsw-nuc-2        total:169  pass:138  dwarn:0   dfail:0   fail:1   skip:30 
hsw-brixbox      total:169  pass:154  dwarn:0   dfail:0   fail:1   skip:14 
hsw-gt2          total:169  pass:158  dwarn:0   dfail:0   fail:1   skip:10 
ilk-hp8440p      total:169  pass:118  dwarn:1   dfail:1   fail:0   skip:49 
ivb-t430s        total:169  pass:154  dwarn:0   dfail:0   fail:1   skip:14 
skl-i7k-2        total:169  pass:152  dwarn:1   dfail:0   fail:0   skip:16 
snb-dellxps      total:169  pass:146  dwarn:0   dfail:0   fail:1   skip:22 

Results at /archive/results/CI_IGT_test/Patchwork_1484/

e511a05b4b3bb4d1dbca99b00af6d0dc0a65d295 drm-intel-nightly: 2016y-02m-26d-13h-42m-46s UTC integration manifest
24c236658acc8d2be9cc1fef21fa2feefd92d506 drm/i915: Execlists small cleanups and micro-optimisations



* Re: [PATCH] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-26 15:37 [PATCH] drm/i915: Execlists small cleanups and micro-optimisations Tvrtko Ursulin
  2016-02-26 16:03 ` ✗ Fi.CI.BAT: failure for " Patchwork
@ 2016-02-26 16:36 ` Chris Wilson
  2016-02-26 16:58   ` [PATCH v4] " Tvrtko Ursulin
  2016-02-26 17:27 ` ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2) Patchwork
  2 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-02-26 16:36 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Fri, Feb 26, 2016 at 03:37:35PM +0000, Tvrtko Ursulin wrote:
> -	if (ring->disable_lite_restore_wa) {
> -		/* Prevent a ctx to preempt itself */
> -		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
> -		    (submit_contexts != 0))
> -			execlists_context_unqueue(ring);
> -	} else if (submit_contexts != 0) {
> +	if (submit_contexts && (!ring->disable_lite_restore_wa ||
> +	    (ring->disable_lite_restore_wa && (status &
> +	    GEN8_CTX_STATUS_ACTIVE_IDLE))))

A little clumsy.

if (submit_contexts) {
	if (!ring->disable_lite_restore_wa ||
	    status & GEN8_CTX_STATUS_ACTIVE_IDLE)
		execlists_context_unqueue__locked(ring);
}

i.e. checking for ring->disable_lite_restore_wa != 0 is redundant (as it
must be true along the false branch of !ring->disable_lite_restore_wa)

And if we take a moment to clean up the logic there

>  	list_add_tail(&request->execlist_link, &ring->execlist_queue);
> -	if (num_elements == 0)
> +	if (num_elements == 0) {
> +		spin_lock(&dev_priv->uncore.lock);
> +		intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
> +
>  		execlists_context_unqueue(ring);
>  
> -	spin_unlock_irq(&ring->execlist_lock);
> +		intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> +		spin_unlock(&dev_priv->uncore.lock);
> +	}

should we hide the locks here with
void execlists_context_unqueue()
{
	spin_lock(&dev_priv->uncore.lock);
	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

	execlists_context_unqueue__locked(ring);

	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
	spin_unlock(&dev_priv->uncore.lock);

}
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-26 16:36 ` [PATCH] " Chris Wilson
@ 2016-02-26 16:58   ` Tvrtko Ursulin
  2016-02-26 20:24     ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-26 16:58 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Assorted changes in the areas of code cleanup, reduction of
invariant conditionals in the interrupt handler, lock
contention reduction and MMIO access optimisation.

 * Remove needless initialization.
 * Improve cache locality by reorganizing code and/or using
   branch hints to keep unexpected or error conditions out
   of line.
 * Favor busy submit path vs. empty queue.
 * Less branching in hot paths.

v2:

 * Avoid mmio reads when possible. (Chris Wilson)
 * Use natural integer size for csb indices.
 * Remove useless return value from execlists_update_context.
 * Extract 32-bit ppgtt PDPs update so it is out of line and
   shared with two callers.
 * Grab forcewake across all mmio operations to ease the
   load on the uncore lock and use cheaper mmio ops.

v3:

 * Removed some more pointless u8 data types.
 * Removed unused return from execlists_context_queue.
 * Commit message updates.

v4:
 * Unclumsify the unqueue if statement. (Chris Wilson)
 * Hide forcewake from the queuing function. (Chris Wilson)

Version 3 now makes the irq handling code path ~20% smaller on
48-bit PPGTT hardware, and a little less elsewhere. Hot
paths are mostly in-line now, and hammering on the uncore
spinlock is greatly reduced, together with mmio traffic to an
extent.

Benchmarking with "gem_latency -n 100" (keep submitting
batches with 100 nop instructions) shows approximately 4% higher
throughput, 2% less CPU time and 22% smaller latencies. This was
on a big core, while small cores could benefit even more.

The most likely reasons for the improvements are the MMIO
optimisation and the reduction in uncore lock traffic.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 214 +++++++++++++++++---------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |   3 +-
 2 files changed, 114 insertions(+), 103 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f0a57afc8dff..247daf80664b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -269,6 +269,9 @@ logical_ring_init_platform_invariants(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 
+	if (IS_GEN8(dev) || IS_GEN9(dev))
+		ring->idle_lite_restore_wa = ~0;
+
 	ring->disable_lite_restore_wa = (IS_SKL_REVID(dev, 0, SKL_REVID_B0) ||
 					IS_BXT_REVID(dev, 0, BXT_REVID_A1)) &&
 					(ring->id == VCS || ring->id == VCS2);
@@ -372,8 +375,6 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 	rq0->elsp_submitted++;
 
 	/* You must always write both descriptors in the order below. */
-	spin_lock(&dev_priv->uncore.lock);
-	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 	I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[1]));
 	I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[1]));
 
@@ -383,11 +384,18 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 
 	/* ELSP is a wo register, use another nearby reg for posting */
 	POSTING_READ_FW(RING_EXECLIST_STATUS_LO(ring));
-	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
 }
 
-static int execlists_update_context(struct drm_i915_gem_request *rq)
+static void
+execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state)
+{
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+}
+
+static void execlists_update_context(struct drm_i915_gem_request *rq)
 {
 	struct intel_engine_cs *ring = rq->ring;
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
@@ -395,19 +403,13 @@ static int execlists_update_context(struct drm_i915_gem_request *rq)
 
 	reg_state[CTX_RING_TAIL+1] = rq->tail;
 
-	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev)) {
-		/* True 32b PPGTT with dynamic page allocation: update PDP
-		 * registers and point the unallocated PDPs to scratch page.
-		 * PML4 is allocated during ppgtt init, so this is not needed
-		 * in 48-bit mode.
-		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
-	}
-
-	return 0;
+	/* True 32b PPGTT with dynamic page allocation: update PDP
+	 * registers and point the unallocated PDPs to scratch page.
+	 * PML4 is allocated during ppgtt init, so this is not needed
+	 * in 48-bit mode.
+	 */
+	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev))
+		execlists_update_context_pdps(ppgtt, reg_state);
 }
 
 static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
@@ -421,10 +423,10 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 	execlists_elsp_write(rq0, rq1);
 }
 
-static void execlists_context_unqueue(struct intel_engine_cs *ring)
+static void execlists_context_unqueue__locked(struct intel_engine_cs *ring)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
-	struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
+	struct drm_i915_gem_request *cursor, *tmp;
 
 	assert_spin_locked(&ring->execlist_lock);
 
@@ -434,9 +436,6 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 	 */
 	WARN_ON(!intel_irqs_enabled(ring->dev->dev_private));
 
-	if (list_empty(&ring->execlist_queue))
-		return;
-
 	/* Try to read in pairs */
 	list_for_each_entry_safe(cursor, tmp, &ring->execlist_queue,
 				 execlist_link) {
@@ -451,37 +450,48 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 			req0 = cursor;
 		} else {
 			req1 = cursor;
+			WARN_ON(req1->elsp_submitted);
 			break;
 		}
 	}
 
-	if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) {
+	if (unlikely(!req0))
+		return;
+
+	if (req0->elsp_submitted & ring->idle_lite_restore_wa) {
 		/*
-		 * WaIdleLiteRestore: make sure we never cause a lite
-		 * restore with HEAD==TAIL
+		 * WaIdleLiteRestore: make sure we never cause a lite restore
+		 * with HEAD==TAIL.
+		 *
+		 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL as we
+		 * resubmit the request. See gen8_emit_request() for where we
+		 * prepare the padding after the end of the request.
 		 */
-		if (req0->elsp_submitted) {
-			/*
-			 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL
-			 * as we resubmit the request. See gen8_emit_request()
-			 * for where we prepare the padding after the end of the
-			 * request.
-			 */
-			struct intel_ringbuffer *ringbuf;
+		struct intel_ringbuffer *ringbuf;
 
-			ringbuf = req0->ctx->engine[ring->id].ringbuf;
-			req0->tail += 8;
-			req0->tail &= ringbuf->size - 1;
-		}
+		ringbuf = req0->ctx->engine[ring->id].ringbuf;
+		req0->tail += 8;
+		req0->tail &= ringbuf->size - 1;
 	}
 
-	WARN_ON(req1 && req1->elsp_submitted);
-
 	execlists_submit_requests(req0, req1);
 }
 
-static bool execlists_check_remove_request(struct intel_engine_cs *ring,
-					   u32 request_id)
+static void execlists_context_unqueue(struct intel_engine_cs *ring)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
+	execlists_context_unqueue__locked(ring);
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+}
+
+static unsigned int
+execlists_check_remove_request(struct intel_engine_cs *ring, u32 request_id)
 {
 	struct drm_i915_gem_request *head_req;
 
@@ -491,33 +501,41 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring,
 					    struct drm_i915_gem_request,
 					    execlist_link);
 
-	if (head_req != NULL) {
-		if (intel_execlists_ctx_id(head_req->ctx, ring) == request_id) {
-			WARN(head_req->elsp_submitted == 0,
-			     "Never submitted head request\n");
+	if (!head_req)
+		return 0;
 
-			if (--head_req->elsp_submitted <= 0) {
-				list_move_tail(&head_req->execlist_link,
-					       &ring->execlist_retired_req_list);
-				return true;
-			}
-		}
-	}
+	if (unlikely(intel_execlists_ctx_id(head_req->ctx, ring) != request_id))
+		return 0;
+
+	WARN(head_req->elsp_submitted == 0, "Never submitted head request\n");
+
+	if (--head_req->elsp_submitted > 0)
+		return 0;
+
+	list_move_tail(&head_req->execlist_link,
+		       &ring->execlist_retired_req_list);
 
-	return false;
+	return 1;
 }
 
-static void get_context_status(struct intel_engine_cs *ring,
-			       u8 read_pointer,
-			       u32 *status, u32 *context_id)
+static u32
+get_context_status(struct intel_engine_cs *ring, unsigned int read_pointer,
+		   u32 *context_id)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	u32 status;
 
-	if (WARN_ON(read_pointer >= GEN8_CSB_ENTRIES))
-		return;
+	read_pointer %= GEN8_CSB_ENTRIES;
+
+	status = I915_READ_FW(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
+
+	if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
+		return 0;
 
-	*status = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
-	*context_id = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, read_pointer));
+	*context_id = I915_READ_FW(RING_CONTEXT_STATUS_BUF_HI(ring,
+							      read_pointer));
+
+	return status;
 }
 
 /**
@@ -531,30 +549,27 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	u32 status_pointer;
-	u8 read_pointer;
-	u8 write_pointer;
+	unsigned int read_pointer, write_pointer;
 	u32 status = 0;
 	u32 status_id;
-	u32 submit_contexts = 0;
+	unsigned int submit_contexts = 0;
+
+	spin_lock(&ring->execlist_lock);
 
-	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
+	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(ring));
 
 	read_pointer = ring->next_context_status_buffer;
 	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
 	if (read_pointer > write_pointer)
 		write_pointer += GEN8_CSB_ENTRIES;
 
-	spin_lock(&ring->execlist_lock);
-
 	while (read_pointer < write_pointer) {
+		status = get_context_status(ring, ++read_pointer, &status_id);
 
-		get_context_status(ring, ++read_pointer % GEN8_CSB_ENTRIES,
-				   &status, &status_id);
-
-		if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
-			continue;
-
-		if (status & GEN8_CTX_STATUS_PREEMPTED) {
+		if (unlikely(status & GEN8_CTX_STATUS_PREEMPTED)) {
 			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
 				if (execlists_check_remove_request(ring, status_id))
 					WARN(1, "Lite Restored request removed from queue\n");
@@ -562,37 +577,36 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 				WARN(1, "Preemption without Lite Restore\n");
 		}
 
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) ||
-		    (status & GEN8_CTX_STATUS_ELEMENT_SWITCH)) {
-			if (execlists_check_remove_request(ring, status_id))
-				submit_contexts++;
-		}
+		if (status & (GEN8_CTX_STATUS_ACTIVE_IDLE |
+		    GEN8_CTX_STATUS_ELEMENT_SWITCH))
+			submit_contexts +=
+				execlists_check_remove_request(ring, status_id);
 	}
 
-	if (ring->disable_lite_restore_wa) {
-		/* Prevent a ctx to preempt itself */
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
-		    (submit_contexts != 0))
-			execlists_context_unqueue(ring);
-	} else if (submit_contexts != 0) {
-		execlists_context_unqueue(ring);
+	if (submit_contexts) {
+		if (!ring->disable_lite_restore_wa ||
+		    (status & GEN8_CTX_STATUS_ACTIVE_IDLE))
+			execlists_context_unqueue__locked(ring);
 	}
 
-	spin_unlock(&ring->execlist_lock);
-
-	if (unlikely(submit_contexts > 2))
-		DRM_ERROR("More than two context complete events?\n");
-
 	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
 
 	/* Update the read pointer to the old write pointer. Manual ringbuffer
 	 * management ftw </sarcasm> */
-	I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
-		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
-				 ring->next_context_status_buffer << 8));
+	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(ring),
+		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
+				    ring->next_context_status_buffer << 8));
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+
+	spin_unlock(&ring->execlist_lock);
+
+	if (unlikely(submit_contexts > 2))
+		DRM_ERROR("More than two context complete events?\n");
 }
 
-static int execlists_context_queue(struct drm_i915_gem_request *request)
+static void execlists_context_queue(struct drm_i915_gem_request *request)
 {
 	struct intel_engine_cs *ring = request->ring;
 	struct drm_i915_gem_request *cursor;
@@ -629,8 +643,6 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
 		execlists_context_unqueue(ring);
 
 	spin_unlock_irq(&ring->execlist_lock);
-
-	return 0;
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -1549,7 +1561,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	u8 next_context_status_buffer_hw;
+	unsigned int next_context_status_buffer_hw;
 
 	lrc_setup_hardware_status_page(ring,
 				dev_priv->kernel_context->engine[ring->id].state);
@@ -2012,6 +2024,7 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		ring->status_page.obj = NULL;
 	}
 
+	ring->idle_lite_restore_wa = 0;
 	ring->disable_lite_restore_wa = false;
 	ring->ctx_desc_template = 0;
 
@@ -2416,10 +2429,7 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 		 * With dynamic page allocation, PDPs may not be allocated at
 		 * this point. Point the unallocated PDPs to the scratch page
 		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+		execlists_update_context_pdps(ppgtt, reg_state);
 	}
 
 	if (ring->id == RCS) {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..dd910d30a380 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -271,7 +271,8 @@ struct  intel_engine_cs {
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;
-	u8 next_context_status_buffer;
+	unsigned int next_context_status_buffer;
+	unsigned int idle_lite_restore_wa;
 	bool disable_lite_restore_wa;
 	u32 ctx_desc_template;
 	u32             irq_keep_mask; /* bitmask for interrupts that should not be masked */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2)
  2016-02-26 15:37 [PATCH] drm/i915: Execlists small cleanups and micro-optimisations Tvrtko Ursulin
  2016-02-26 16:03 ` ✗ Fi.CI.BAT: failure for " Patchwork
  2016-02-26 16:36 ` [PATCH] " Chris Wilson
@ 2016-02-26 17:27 ` Patchwork
  2016-02-29 10:16   ` Tvrtko Ursulin
  2 siblings, 1 reply; 19+ messages in thread
From: Patchwork @ 2016-02-26 17:27 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Execlists small cleanups and micro-optimisations (rev2)
URL   : https://patchwork.freedesktop.org/series/3853/
State : failure

== Summary ==

Series 3853v2 drm/i915: Execlists small cleanups and micro-optimisations
http://patchwork.freedesktop.org/api/1.0/series/3853/revisions/2/mbox/

Test drv_hangman:
        Subgroup error-state-basic:
                pass       -> FAIL       (ilk-hp8440p)
Test kms_flip:
        Subgroup basic-flip-vs-dpms:
                dmesg-warn -> PASS       (ilk-hp8440p) UNSTABLE
        Subgroup basic-flip-vs-modeset:
                incomplete -> PASS       (ilk-hp8440p) UNSTABLE
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                pass       -> DMESG-WARN (skl-i7k-2) UNSTABLE
        Subgroup suspend-read-crc-pipe-b:
                incomplete -> PASS       (hsw-gt2)
        Subgroup suspend-read-crc-pipe-c:
                dmesg-warn -> PASS       (skl-i7k-2) UNSTABLE

bdw-nuci7        total:166  pass:155  dwarn:0   dfail:0   fail:0   skip:11 
bdw-ultra        total:169  pass:155  dwarn:0   dfail:0   fail:0   skip:14 
byt-nuc          total:169  pass:144  dwarn:0   dfail:0   fail:0   skip:25 
hsw-brixbox      total:169  pass:155  dwarn:0   dfail:0   fail:0   skip:14 
hsw-gt2          total:169  pass:158  dwarn:0   dfail:1   fail:0   skip:10 
ilk-hp8440p      total:169  pass:118  dwarn:0   dfail:1   fail:1   skip:49 
ivb-t430s        total:169  pass:154  dwarn:0   dfail:0   fail:1   skip:14 
skl-i7k-2        total:169  pass:152  dwarn:1   dfail:0   fail:0   skip:16 
snb-dellxps      total:169  pass:146  dwarn:0   dfail:0   fail:1   skip:22 
snb-x220t        total:169  pass:146  dwarn:0   dfail:0   fail:2   skip:21 

Results at /archive/results/CI_IGT_test/Patchwork_1487/

e511a05b4b3bb4d1dbca99b00af6d0dc0a65d295 drm-intel-nightly: 2016y-02m-26d-13h-42m-46s UTC integration manifest
b5bf538d5948d2f82c61c92bd452decab535d880 drm/i915: Execlists small cleanups and micro-optimisations


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-26 16:58   ` [PATCH v4] " Tvrtko Ursulin
@ 2016-02-26 20:24     ` Chris Wilson
  2016-02-29 10:45       ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-02-26 20:24 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Fri, Feb 26, 2016 at 04:58:32PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Assorted changes in the areas of code cleanup, reduction of
> invariant conditional in the interrupt handler and lock
> contention and MMIO access optimisation.
> 
>  * Remove needless initialization.
>  * Improve cache locality by reorganizing code and/or using
>    branch hints to keep unexpected or error conditions out
>    of line.
>  * Favor busy submit path vs. empty queue.
>  * Less branching in hot-paths.
> 
> v2:
> 
>  * Avoid mmio reads when possible. (Chris Wilson)
>  * Use natural integer size for csb indices.
>  * Remove useless return value from execlists_update_context.
>  * Extract 32-bit ppgtt PDPs update so it is out of line and
>    shared with two callers.
>  * Grab forcewake across all mmio operations to ease the
>    load on uncore lock and use cheaper mmio ops.
> 
> v3:
> 
>  * Removed some more pointless u8 data types.
>  * Removed unused return from execlists_context_queue.
>  * Commit message updates.
> 
> v4:
>  * Unclumsify the unqueue if statement. (Chris Wilson)
>  * Hide forcewake from the queuing function. (Chris Wilson)
> 
> Version 3 now makes the irq handling code path ~20% smaller on
> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
> paths are mostly in-line now and hammering on the uncore
> spinlock is greatly reduced together with mmio traffic to an
> extent.
> 
> Benchmarking with "gem_latency -n 100" (keep submitting
> batches with 100 nop instructions) shows approximately 4% higher
> throughput, 2% less CPU time and 22% smaller latencies. This was
> on a big-core while small-cores could benefit even more.

Just add a quick comment about "gem_latency -n 0" suggesting an oddity
with synchronous workloads that bears further study (just so that we
have the hint/reminder about the test case to run).
 
> Most likely reasons for the improvements are the MMIO
> optimization and uncore lock traffic reduction.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2)
  2016-02-26 17:27 ` ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2) Patchwork
@ 2016-02-29 10:16   ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-29 10:16 UTC (permalink / raw)
  To: intel-gfx


On 26/02/16 17:27, Patchwork wrote:
> == Series Details ==
>
> Series: drm/i915: Execlists small cleanups and micro-optimisations (rev2)
> URL   : https://patchwork.freedesktop.org/series/3853/
> State : failure
>
> == Summary ==
>
> Series 3853v2 drm/i915: Execlists small cleanups and micro-optimisations
> http://patchwork.freedesktop.org/api/1.0/series/3853/revisions/2/mbox/
>
> Test drv_hangman:
>          Subgroup error-state-basic:
>                  pass       -> FAIL       (ilk-hp8440p)

Known unrelated: https://bugs.freedesktop.org/show_bug.cgi?id=94305

> Test kms_flip:
>          Subgroup basic-flip-vs-dpms:
>                  dmesg-warn -> PASS       (ilk-hp8440p) UNSTABLE
>          Subgroup basic-flip-vs-modeset:
>                  incomplete -> PASS       (ilk-hp8440p) UNSTABLE
> Test kms_pipe_crc_basic:
>          Subgroup suspend-read-crc-pipe-a:
>                  pass       -> DMESG-WARN (skl-i7k-2) UNSTABLE

Known unrelated: https://bugs.freedesktop.org/show_bug.cgi?id=93294

>          Subgroup suspend-read-crc-pipe-b:
>                  incomplete -> PASS       (hsw-gt2)
>          Subgroup suspend-read-crc-pipe-c:
>                  dmesg-warn -> PASS       (skl-i7k-2) UNSTABLE
>
> bdw-nuci7        total:166  pass:155  dwarn:0   dfail:0   fail:0   skip:11
> bdw-ultra        total:169  pass:155  dwarn:0   dfail:0   fail:0   skip:14
> byt-nuc          total:169  pass:144  dwarn:0   dfail:0   fail:0   skip:25
> hsw-brixbox      total:169  pass:155  dwarn:0   dfail:0   fail:0   skip:14
> hsw-gt2          total:169  pass:158  dwarn:0   dfail:1   fail:0   skip:10
> ilk-hp8440p      total:169  pass:118  dwarn:0   dfail:1   fail:1   skip:49
> ivb-t430s        total:169  pass:154  dwarn:0   dfail:0   fail:1   skip:14
> skl-i7k-2        total:169  pass:152  dwarn:1   dfail:0   fail:0   skip:16
> snb-dellxps      total:169  pass:146  dwarn:0   dfail:0   fail:1   skip:22
> snb-x220t        total:169  pass:146  dwarn:0   dfail:0   fail:2   skip:21
>
> Results at /archive/results/CI_IGT_test/Patchwork_1487/
>
> e511a05b4b3bb4d1dbca99b00af6d0dc0a65d295 drm-intel-nightly: 2016y-02m-26d-13h-42m-46s UTC integration manifest
> b5bf538d5948d2f82c61c92bd452decab535d880 drm/i915: Execlists small cleanups and micro-optimisations

Regards,

Tvrtko


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-26 20:24     ` Chris Wilson
@ 2016-02-29 10:45       ` Tvrtko Ursulin
  2016-02-29 10:53         ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-29 10:45 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin



On 26/02/16 20:24, Chris Wilson wrote:
> On Fri, Feb 26, 2016 at 04:58:32PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Assorted changes in the areas of code cleanup, reduction of
>> invariant conditional in the interrupt handler and lock
>> contention and MMIO access optimisation.
>>
>>   * Remove needless initialization.
>>   * Improve cache locality by reorganizing code and/or using
>>     branch hints to keep unexpected or error conditions out
>>     of line.
>>   * Favor busy submit path vs. empty queue.
>>   * Less branching in hot-paths.
>>
>> v2:
>>
>>   * Avoid mmio reads when possible. (Chris Wilson)
>>   * Use natural integer size for csb indices.
>>   * Remove useless return value from execlists_update_context.
>>   * Extract 32-bit ppgtt PDPs update so it is out of line and
>>     shared with two callers.
>>   * Grab forcewake across all mmio operations to ease the
>>     load on uncore lock and use cheaper mmio ops.
>>
>> v3:
>>
>>   * Removed some more pointless u8 data types.
>>   * Removed unused return from execlists_context_queue.
>>   * Commit message updates.
>>
>> v4:
>>   * Unclumsify the unqueue if statement. (Chris Wilson)
>>   * Hide forcewake from the queuing function. (Chris Wilson)
>>
>> Version 3 now makes the irq handling code path ~20% smaller on
>> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
>> paths are mostly in-line now and hammering on the uncore
>> spinlock is greatly reduced together with mmio traffic to an
>> extent.
>>
>> Benchmarking with "gem_latency -n 100" (keep submitting
>> batches with 100 nop instructions) shows approximately 4% higher
>> throughput, 2% less CPU time and 22% smaller latencies. This was
>> on a big-core while small-cores could benefit even more.
>
> Just add a quick comment about "gem_latency -n 0" suggesting an oddity
> with synchronous workloads that bears further study (just so that we
> have the hint/reminder about the test case to run).
>

This ok?

"""
One unexplained result is with "gem_latency -n 0" (dispatching
empty batches) which shows 5% more throughput, 8% less CPU time,
25% better producer and consumer latencies, but 15% higher
dispatch latency which looks like a possible measuring artifact.
"""

>> Most likely reasons for the improvements are the MMIO
>> optimization and uncore lock traffic reduction.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Thanks!

Regards,

Tvrtko


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 10:45       ` Tvrtko Ursulin
@ 2016-02-29 10:53         ` Chris Wilson
  2016-02-29 11:01           ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-02-29 10:53 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> This ok?
> 
> """
> One unexplained result is with "gem_latency -n 0" (dispatching
> empty batches) which shows 5% more throughput, 8% less CPU time,
> 25% better producer and consumer latencies, but 15% higher
> dispatch latency which looks like a possible measuring artifact.
> """

I doubt it is a measuring artefact since throughput = 1/(dispatch +
latency + test overhead), and the dispatch latency here is larger than
the wakeup latency and so has greater impact on throughput in this
scenario.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 10:53         ` Chris Wilson
@ 2016-02-29 11:01           ` Tvrtko Ursulin
  2016-02-29 11:13             ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-29 11:01 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin


On 29/02/16 10:53, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>> This ok?
>>
>> """
>> One unexplained result is with "gem_latency -n 0" (dispatching
>> empty batches) which shows 5% more throughput, 8% less CPU time,
>> 25% better producer and consumer latencies, but 15% higher
>> dispatch latency which looks like a possible measuring artifact.
>> """
>
> I doubt it is a measuring artefact since throughput = 1/(dispatch +
> latency + test overhead), and the dispatch latency here is larger than
> the wakeup latency and so has greater impact on throughput in this
> scenario.

I don't follow you: if dispatch latency has the larger effect on 
throughput, how do you explain the increase alongside still better 
throughput?

I see in gem_latency this block:

	measure_latency(p, &p->latency);
	igt_stats_push(&p->dispatch, *p->last_timestamp - start);

measure_latency waits for the batch to complete and then dispatch 
latency uses p->last_timestamp which is something written by the GPU and 
not a CPU view of the latency ?

Regards,

Tvrtko


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:01           ` Tvrtko Ursulin
@ 2016-02-29 11:13             ` Chris Wilson
  2016-02-29 11:40               ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-02-29 11:13 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> 
> On 29/02/16 10:53, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>This ok?
> >>
> >>"""
> >>One unexplained result is with "gem_latency -n 0" (dispatching
> >>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>25% better producer and consumer latencies, but 15% higher
> >>dispatch latency which looks like a possible measuring artifact.
> >>"""
> >
> >I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >latency + test overhead), and the dispatch latency here is larger than
> >the wakeup latency and so has greater impact on throughput in this
> >scenario.
> 
> I don't follow you, if dispatch latency has larger effect on
> throughput how to explain the increase and still better throughput?
> 
> I see in gem_latency this block:
> 
> 	measure_latency(p, &p->latency);
> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> 
> measure_latency waits for the batch to complete and then dispatch
> latency uses p->last_timestamp which is something written by the GPU
> and not a CPU view of the latency ?

Exactly, measurements are entirely made from the running engine clock
(which is a ~80ns clock, and should be verified during init). The register
is read before dispatch, inside the batch and then at wakeup, but the
information is presented as dispatch = batch - start and
wakeup = end - batch, so to get the duration (end - start) we need
to add the two together. Throughput will also include some overhead from
the test iteration (that will mainly be scheduler interference).

My comment about dispatch having the greater effect is in terms of
its higher absolute value (so the relative % means a larger change wrt
throughput).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:13             ` Chris Wilson
@ 2016-02-29 11:40               ` Tvrtko Ursulin
  2016-02-29 11:48                 ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-29 11:40 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin



On 29/02/16 11:13, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>
>> On 29/02/16 10:53, Chris Wilson wrote:
>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>> This ok?
>>>>
>>>> """
>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>> 25% better producer and consumer latencies, but 15% higher
>>>> dispatch latency which looks like a possible measuring artifact.
>>>> """
>>>
>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>> latency + test overhead), and the dispatch latency here is larger than
>>> the wakeup latency and so has greater impact on throughput in this
>>> scenario.
>>
>> I don't follow you, if dispatch latency has larger effect on
>> throughput how to explain the increase and still better throughput?
>>
>> I see in gem_latency this block:
>>
>> 	measure_latency(p, &p->latency);
>> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>
>> measure_latency waits for the batch to complete and then dispatch
>> latency uses p->last_timestamp which is something written by the GPU
>> and not a CPU view of the latency ?
>
> Exactly, measurements are entirely made from the running engine clock
> (which is ~80ns clock, and should be verified during init). The register
> is read before dispatch, inside the batch and then at wakeup, but the
> information is presented as dispatch = batch - start and
> wakeup = end - batch, so to get the duration (end - start) we need
> to add the two together. Throughput will also include some overhead from
> the test iteration (that will mainly be scheduler interference).
>
> My comment about dispatch having greater effect, is in terms of
> its higher absolute value (so the relative % means a larger change wrt
> throughput).

Change to this then?

"""
     One unexplained result is with "gem_latency -n 0" (dispatching
     empty batches) which shows 5% more throughput, 8% less CPU time,
     25% better producer and consumer latencies, but 15% higher
     dispatch latency which looks like an amplified effect of test
     overhead.
"""

Regards,

Tvrtko

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:40               ` Tvrtko Ursulin
@ 2016-02-29 11:48                 ` Chris Wilson
  2016-02-29 11:59                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-02-29 11:48 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:13, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 29/02/16 10:53, Chris Wilson wrote:
> >>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>This ok?
> >>>>
> >>>>"""
> >>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>25% better producer and consumer latencies, but 15% higher
> >>>>dispatch latency which looks like a possible measuring artifact.
> >>>>"""
> >>>
> >>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>latency + test overhead), and the dispatch latency here is larger than
> >>>the wakeup latency and so has greater impact on throughput in this
> >>>scenario.
> >>
> >>I don't follow you, if dispatch latency has larger effect on
> >>throughput how to explain the increase and still better throughput?
> >>
> >>I see in gem_latency this block:
> >>
> >>	measure_latency(p, &p->latency);
> >>	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>
> >>measure_latency waits for the batch to complete and then dispatch
> >>latency uses p->last_timestamp which is something written by the GPU
> >>and not a CPU view of the latency ?
> >
> >Exactly, measurements are entirely made from the running engine clock
> >(which is ~80ns clock, and should be verified during init). The register
> >is read before dispatch, inside the batch and then at wakeup, but the
> >information is presented as dispatch = batch - start and
> >wakeup = end - batch, so to get the duration (end - start) we need
> >to add the two together. Throughput will also include some overhead from
> >the test iteration (that will mainly be scheduler interference).
> >
> >My comment about dispatch having greater effect, is in terms of
> >its higher absolute value (so the relative % means a larger change wrt
> >throughput).
> 
> Change to this then?
> 
> """
>     One unexplained result is with "gem_latency -n 0" (dispatching
>     empty batches) which shows 5% more throughput, 8% less CPU time,
>     25% better producer and consumer latencies, but 15% higher
>     dispatch latency which looks like an amplified effect of test
>     overhead.
> """

No. Dispatch latency is important, and this attempts to pass the change
off as a test effect when, to the best of my knowledge, it is a valid
external observation of the system.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:48                 ` Chris Wilson
@ 2016-02-29 11:59                   ` Tvrtko Ursulin
  2016-03-01 10:21                     ` Tvrtko Ursulin
  2016-03-01 10:32                     ` Chris Wilson
  0 siblings, 2 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-02-29 11:59 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin



On 29/02/16 11:48, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>
>>
>> On 29/02/16 11:13, Chris Wilson wrote:
>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>> This ok?
>>>>>>
>>>>>> """
>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>> """
>>>>>
>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>> latency + test overhead), and the dispatch latency here is larger than
>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>> scenario.
>>>>
>>>> I don't follow you, if dispatch latency has larger effect on
>>>> throughput how to explain the increase and still better throughput?
>>>>
>>>> I see in gem_latency this block:
>>>>
>>>> 	measure_latency(p, &p->latency);
>>>> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>
>>>> measure_latency waits for the batch to complete and then dispatch
>>>> latency uses p->last_timestamp which is something written by the GPU
>>>> and not a CPU view of the latency ?
>>>
>>> Exactly, measurements are entirely made from the running engine clock
>>> (which is ~80ns clock, and should be verified during init). The register
>>> is read before dispatch, inside the batch and then at wakeup, but the
>>> information is presented as dispatch = batch - start and
>>> wakeup = end - batch, so to get the duration (end - start) we need
>>> to add the two together. Throughput will also include some overhead from
>>> the test iteration (that will mainly be scheduler interference).
>>>
>>> My comment about dispatch having greater effect, is in terms of
>>> its higher absolute value (so the relative % means a larger change wrt
>>> throughput).
>>
>> Change to this then?
>>
>> """
>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>      25% better producer and consumer latencies, but 15% higher
>>      dispatch latency which looks like an amplified effect of test
>>      overhead.
>> """
>
> No. Dispatch latency is important and this attempts to pass the change
> off a test effect when to the best of my knowledge it is a valid external
> observation of the system.

I just don't understand how it can be valid when we have executed more 
empty batches than before in a unit of time.

Even the dispatch + wake-up latency is worse, but throughput is 
still better.

Sounds impossible to me, so it must be the effect of using two different 
time sources: the CPU side to measure throughput and the GPU side to 
measure dispatch latency. I don't know, could you suggest a paragraph to 
add to the commit message so we can close on this?

Regards,

Tvrtko

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:59                   ` Tvrtko Ursulin
@ 2016-03-01 10:21                     ` Tvrtko Ursulin
  2016-03-01 10:32                       ` Chris Wilson
  2016-03-01 10:32                     ` Chris Wilson
  1 sibling, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-03-01 10:21 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin



On 29/02/16 11:59, Tvrtko Ursulin wrote:
>
>
> On 29/02/16 11:48, Chris Wilson wrote:
>> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 29/02/16 11:13, Chris Wilson wrote:
>>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>>> This ok?
>>>>>>>
>>>>>>> """
>>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>>> """
>>>>>>
>>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>>> latency + test overhead), and the dispatch latency here is larger
>>>>>> than
>>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>>> scenario.
>>>>>
>>>>> I don't follow you, if dispatch latency has larger effect on
>>>>> throughput how to explain the increase and still better throughput?
>>>>>
>>>>> I see in gem_latency this block:
>>>>>
>>>>>     measure_latency(p, &p->latency);
>>>>>     igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>>
>>>>> measure_latency waits for the batch to complete and then dispatch
>>>>> latency uses p->last_timestamp which is something written by the GPU
>>>>> and not a CPU view of the latency ?
>>>>
>>>> Exactly, measurements are entirely made from the running engine clock
>>>> (which is ~80ns clock, and should be verified during init). The
>>>> register
>>>> is read before dispatch, inside the batch and then at wakeup, but the
>>>> information is presented as dispatch = batch - start and
>>>> wakeup = end - batch, so to get the duration (end - start) we need
>>>> to add the two together. Throughput will also include some overhead
>>>> from
>>>> the test iteration (that will mainly be scheduler interference).
>>>>
>>>> My comment about dispatch having greater effect, is in terms of
>>>> its higher absolute value (so the relative % means a larger change wrt
>>>> throughput).
>>>
>>> Change to this then?
>>>
>>> """
>>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>>      25% better producer and consumer latencies, but 15% higher
>>>      dispatch latency which looks like an amplified effect of test
>>>      overhead.
>>> """
>>
>> No. Dispatch latency is important and this attempts to pass the change
>> off a test effect when to the best of my knowledge it is a valid external
>> observation of the system.
>
> I just don't understand how it can be valid when we have executed more
> empty batches than before in a unit of time.
>
> Because even dispatch + wake up latency is worse, but throughput is
> still better.
>
> Sounds impossible to me so it must be the effect of using two different
> time sources. CPU side to measure throughput and GPU side to measure
> dispatch latency. I don't know, could you suggest a paragraph to add to
> the commit message so we can close on this?

Happy with simply leaving out any attempt at explaining the oddity, like:

"""
One odd result is with "gem_latency -n 0" (dispatching empty batches) 
which shows 5% more throughput, 8% less CPU time, 25% better producer 
and consumer latencies, but 15% higher dispatch latency which is yet 
unexplained.
"""

?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-02-29 11:59                   ` Tvrtko Ursulin
  2016-03-01 10:21                     ` Tvrtko Ursulin
@ 2016-03-01 10:32                     ` Chris Wilson
  1 sibling, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2016-03-01 10:32 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Mon, Feb 29, 2016 at 11:59:26AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:48, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> >>
> >>
> >>On 29/02/16 11:13, Chris Wilson wrote:
> >>>On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 29/02/16 10:53, Chris Wilson wrote:
> >>>>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>>>This ok?
> >>>>>>
> >>>>>>"""
> >>>>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>>>25% better producer and consumer latencies, but 15% higher
> >>>>>>dispatch latency which looks like a possible measuring artifact.
> >>>>>>"""
> >>>>>
> >>>>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>>>latency + test overhead), and the dispatch latency here is larger than
> >>>>>the wakeup latency and so has greater impact on throughput in this
> >>>>>scenario.
> >>>>
> >>>>I don't follow you: if dispatch latency has a larger effect on
> >>>>throughput, how do we explain its increase alongside still better throughput?
> >>>>
> >>>>I see in gem_latency this block:
> >>>>
> >>>>	measure_latency(p, &p->latency);
> >>>>	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>>>
> >>>>measure_latency waits for the batch to complete and then dispatch
> >>>>latency uses p->last_timestamp which is something written by the GPU
> >>>>and not a CPU view of the latency ?
> >>>
> >>>Exactly, measurements are entirely made from the running engine clock
> >>>(which is ~80ns clock, and should be verified during init). The register
> >>>is read before dispatch, inside the batch and then at wakeup, but the
> >>>information is presented as dispatch = batch - start and
> >>>wakeup = end - batch, so to get the duration (end - start) we need
> >>>to add the two together. Throughput will also include some overhead from
> >>>the test iteration (that will mainly be scheduler interference).
> >>>
> >>>My comment about dispatch having greater effect, is in terms of
> >>>its higher absolute value (so the relative % means a larger change wrt
> >>>throughput).
> >>
> >>Change to this then?
> >>
> >>"""
> >>     One unexplained result is with "gem_latency -n 0" (dispatching
> >>     empty batches) which shows 5% more throughput, 8% less CPU time,
> >>     25% better producer and consumer latencies, but 15% higher
> >>     dispatch latency which looks like an amplified effect of test
> >>     overhead.
> >>"""
> >
> >No. Dispatch latency is important and this attempts to pass the change
> >off a test effect when to the best of my knowledge it is a valid external
> >observation of the system.
> 
> I just don't understand how it can be valid when we have executed
> more empty batches than before in a unit of time.
> 
> Because even dispatch + wake up latency is worse, but throughput is
> still better.
> 
> Sounds impossible to me so it must be the effect of using two
> different time sources. CPU side to measure throughput and GPU side
> to measure dispatch latency.

Sure, but it is not impossible. The actual throughput (measured as the
number of cycles per test run) is a product of the CPU scheduler on top
of everything else. We could measure the test overhead by measuring the
GPU cycles between batches, and the most likely explanation is exactly
whom the isr is interrupting; setting maxcpus=1 would cancel the
variation should that be the case.

> I don't know, could you suggest a
> paragraph to add to the commit message so we can close on this?

On to the next...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-03-01 10:21                     ` Tvrtko Ursulin
@ 2016-03-01 10:32                       ` Chris Wilson
  2016-03-01 10:41                         ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2016-03-01 10:32 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Tue, Mar 01, 2016 at 10:21:45AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:59, Tvrtko Ursulin wrote:
> >
> >
> >On 29/02/16 11:48, Chris Wilson wrote:
> >>On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> >>>
> >>>
> >>>On 29/02/16 11:13, Chris Wilson wrote:
> >>>>On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>>>>
> >>>>>On 29/02/16 10:53, Chris Wilson wrote:
> >>>>>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>>>>This ok?
> >>>>>>>
> >>>>>>>"""
> >>>>>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>>>>25% better producer and consumer latencies, but 15% higher
> >>>>>>>dispatch latency which looks like a possible measuring artifact.
> >>>>>>>"""
> >>>>>>
> >>>>>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>>>>latency + test overhead), and the dispatch latency here is larger
> >>>>>>than
> >>>>>>the wakeup latency and so has greater impact on throughput in this
> >>>>>>scenario.
> >>>>>
> >>>>>I don't follow you: if dispatch latency has a larger effect on
> >>>>>throughput, how do we explain its increase alongside still better throughput?
> >>>>>
> >>>>>I see in gem_latency this block:
> >>>>>
> >>>>>    measure_latency(p, &p->latency);
> >>>>>    igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>>>>
> >>>>>measure_latency waits for the batch to complete and then dispatch
> >>>>>latency uses p->last_timestamp which is something written by the GPU
> >>>>>and not a CPU view of the latency ?
> >>>>
> >>>>Exactly, measurements are entirely made from the running engine clock
> >>>>(which is ~80ns clock, and should be verified during init). The
> >>>>register
> >>>>is read before dispatch, inside the batch and then at wakeup, but the
> >>>>information is presented as dispatch = batch - start and
> >>>>wakeup = end - batch, so to get the duration (end - start) we need
> >>>>to add the two together. Throughput will also include some overhead
> >>>>from
> >>>>the test iteration (that will mainly be scheduler interference).
> >>>>
> >>>>My comment about dispatch having greater effect, is in terms of
> >>>>its higher absolute value (so the relative % means a larger change wrt
> >>>>throughput).
> >>>
> >>>Change to this then?
> >>>
> >>>"""
> >>>     One unexplained result is with "gem_latency -n 0" (dispatching
> >>>     empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>     25% better producer and consumer latencies, but 15% higher
> >>>     dispatch latency which looks like an amplified effect of test
> >>>     overhead.
> >>>"""
> >>
> >>No. Dispatch latency is important and this attempts to pass the change
> >>off a test effect when to the best of my knowledge it is a valid external
> >>observation of the system.
> >
> >I just don't understand how it can be valid when we have executed more
> >empty batches than before in a unit of time.
> >
> >Because even dispatch + wake up latency is worse, but throughput is
> >still better.
> >
> >Sounds impossible to me so it must be the effect of using two different
> >time sources. CPU side to measure throughput and GPU side to measure
> >dispatch latency. I don't know, could you suggest a paragraph to add to
> >the commit message so we can close on this?
> 
> Happy with simply leaving out any attempt at explaining the oddity, like:
> 
> """
> One odd result is with "gem_latency -n 0" (dispatching empty
> batches) which shows 5% more throughput, 8% less CPU time, 25%
> better producer and consumer latencies, but 15% higher dispatch
> latency which is yet unexplained.
> """

Yes!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-03-01 10:32                       ` Chris Wilson
@ 2016-03-01 10:41                         ` Tvrtko Ursulin
  2016-03-01 10:57                           ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2016-03-01 10:41 UTC (permalink / raw)
  To: Chris Wilson, Intel-gfx, Tvrtko Ursulin


On 01/03/16 10:32, Chris Wilson wrote:
> On Tue, Mar 01, 2016 at 10:21:45AM +0000, Tvrtko Ursulin wrote:
>>
>>
>> On 29/02/16 11:59, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 29/02/16 11:48, Chris Wilson wrote:
>>>> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>>
>>>>> On 29/02/16 11:13, Chris Wilson wrote:
>>>>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>>>>> This ok?
>>>>>>>>>
>>>>>>>>> """
>>>>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>>>>> """
>>>>>>>>
>>>>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>>>>> latency + test overhead), and the dispatch latency here is larger
>>>>>>>> than
>>>>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>>>>> scenario.
>>>>>>>
>>>>>>> I don't follow you: if dispatch latency has a larger effect on
>>>>>>> throughput, how do we explain its increase alongside still better throughput?
>>>>>>>
>>>>>>> I see in gem_latency this block:
>>>>>>>
>>>>>>>     measure_latency(p, &p->latency);
>>>>>>>     igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>>>>
>>>>>>> measure_latency waits for the batch to complete and then dispatch
>>>>>>> latency uses p->last_timestamp which is something written by the GPU
>>>>>>> and not a CPU view of the latency ?
>>>>>>
>>>>>> Exactly, measurements are entirely made from the running engine clock
>>>>>> (which is ~80ns clock, and should be verified during init). The
>>>>>> register
>>>>>> is read before dispatch, inside the batch and then at wakeup, but the
>>>>>> information is presented as dispatch = batch - start and
>>>>>> wakeup = end - batch, so to get the duration (end - start) we need
>>>>>> to add the two together. Throughput will also include some overhead
>>>>>> from
>>>>>> the test iteration (that will mainly be scheduler interference).
>>>>>>
>>>>>> My comment about dispatch having greater effect, is in terms of
>>>>>> its higher absolute value (so the relative % means a larger change wrt
>>>>>> throughput).
>>>>>
>>>>> Change to this then?
>>>>>
>>>>> """
>>>>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>      25% better producer and consumer latencies, but 15% higher
>>>>>      dispatch latency which looks like an amplified effect of test
>>>>>      overhead.
>>>>> """
>>>>
>>>> No. Dispatch latency is important and this attempts to pass the change
>>>> off a test effect when to the best of my knowledge it is a valid external
>>>> observation of the system.
>>>
>>> I just don't understand how it can be valid when we have executed more
>>> empty batches than before in a unit of time.
>>>
>>> Because even dispatch + wake up latency is worse, but throughput is
>>> still better.
>>>
>>> Sounds impossible to me so it must be the effect of using two different
>>> time sources. CPU side to measure throughput and GPU side to measure
>>> dispatch latency. I don't know, could you suggest a paragraph to add to
>>> the commit message so we can close on this?
>>
>> Happy with simply leaving out any attempt at explaining the oddity, like:
>>
>> """
>> One odd result is with "gem_latency -n 0" (dispatching empty
>> batches) which shows 5% more throughput, 8% less CPU time, 25%
>> better producer and consumer latencies, but 15% higher dispatch
>> latency which is yet unexplained.
>> """
>
> Yes!

Thanks! Patch merged.

I'll try reading the CSB outside the execlists lock (into a temporary 
buffer) to see if that helps any.

What about your patch to move it all to a bottom-half handler? Are we 
going to progress that one?

Regards,

Tvrtko


* Re: [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations
  2016-03-01 10:41                         ` Tvrtko Ursulin
@ 2016-03-01 10:57                           ` Chris Wilson
  0 siblings, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2016-03-01 10:57 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Tue, Mar 01, 2016 at 10:41:26AM +0000, Tvrtko Ursulin wrote:
> What about your patch to move it all to a bottom handler? Are we
> going to progress that one?

I looked for the regressions and found an interesting slowdown in single-nop
batches (almost 2x as slow). That is another one that should not happen, as
along that idle-execute-sync path the code should be the same (the elsp
should be written directly by the submitter in both cases). CPU time looks
pretty much identical between the two, so I'm left questioning the
scheduler and the impact of using another kthread. The effect is also
only apparent on bsw, not big-core; perhaps more indication of a
context switch versus the CPU cache? I also found another way to lock up
the interrupt handler, thanks to its while(1) looping over GT + display
interrupts.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


end of thread, other threads:[~2016-03-01 10:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-26 15:37 [PATCH] drm/i915: Execlists small cleanups and micro-optimisations Tvrtko Ursulin
2016-02-26 16:03 ` ✗ Fi.CI.BAT: failure for " Patchwork
2016-02-26 16:36 ` [PATCH] " Chris Wilson
2016-02-26 16:58   ` [PATCH v4] " Tvrtko Ursulin
2016-02-26 20:24     ` Chris Wilson
2016-02-29 10:45       ` Tvrtko Ursulin
2016-02-29 10:53         ` Chris Wilson
2016-02-29 11:01           ` Tvrtko Ursulin
2016-02-29 11:13             ` Chris Wilson
2016-02-29 11:40               ` Tvrtko Ursulin
2016-02-29 11:48                 ` Chris Wilson
2016-02-29 11:59                   ` Tvrtko Ursulin
2016-03-01 10:21                     ` Tvrtko Ursulin
2016-03-01 10:32                       ` Chris Wilson
2016-03-01 10:41                         ` Tvrtko Ursulin
2016-03-01 10:57                           ` Chris Wilson
2016-03-01 10:32                     ` Chris Wilson
2016-02-26 17:27 ` ✗ Fi.CI.BAT: failure for drm/i915: Execlists small cleanups and micro-optimisations (rev2) Patchwork
2016-02-29 10:16   ` Tvrtko Ursulin
