* [PATCH v2] drm/i915: Optimistically spin for the request completion
@ 2015-03-11 15:29 Chris Wilson
  2015-03-11 21:18 ` Chris Wilson
  2015-03-11 23:07 ` [PATCH v2] " shuang.he
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-11 15:29 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.

v2: Account for user timeouts

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index de15bd319bd0..f8895ed368cb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,6 +1191,32 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
+static int __i915_spin_request(struct drm_i915_gem_request *req,
+			       unsigned long timeout)
+{
+	struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
+	struct drm_i915_private *dev_priv = to_i915(ring->dev);
+	int ret = -EBUSY;
+
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+	while (!need_resched()) {
+		if (i915_gem_request_completed(req, true)) {
+			ret = 0;
+			goto out;
+		}
+
+		if (timeout && time_after_eq(jiffies, timeout))
+			break;
+
+		cpu_relax_lowlatency();
+	}
+	if (i915_gem_request_completed(req, false))
+		ret = 0;
+out:
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	return ret;
+}
+
 /**
  * __i915_wait_request - wait until execution of request has finished
  * @req: duh!
@@ -1235,12 +1261,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
 		gen6_rps_boost(dev_priv, file_priv);
 
-	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
-		return -ENODEV;
-
 	/* Record current time in case interrupted by signal, or wedged */
 	trace_i915_gem_request_wait_begin(req);
 	before = ktime_get_raw_ns();
+
+	/* Optimistic spin before touching IRQs */
+	ret = __i915_spin_request(req, timeout_expire);
+	if (ret == 0)
+		goto out;
+
+	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	for (;;) {
 		struct timer_list timer;
 
@@ -1289,14 +1323,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			destroy_timer_on_stack(&timer);
 		}
 	}
-	now = ktime_get_raw_ns();
-	trace_i915_gem_request_wait_end(req);
-
 	if (!irq_test_in_progress)
 		ring->irq_put(ring);
 
 	finish_wait(&ring->irq_queue, &wait);
 
+out:
+	now = ktime_get_raw_ns();
+	trace_i915_gem_request_wait_end(req);
+
 	if (timeout) {
 		s64 tres = *timeout - (now - before);
 
-- 
2.1.4


* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
  2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
@ 2015-03-11 21:18 ` Chris Wilson
  2015-03-12  9:07   ` Chris Wilson
  2015-03-12  9:17   ` Daniel Vetter
  2015-03-11 23:07 ` [PATCH v2] " shuang.he
  1 sibling, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-11 21:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

On Wed, Mar 11, 2015 at 03:29:19PM +0000, Chris Wilson wrote:
> +	while (!need_resched()) {
> +		if (i915_gem_request_completed(req, true)) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		if (timeout && time_after_eq(jiffies, timeout))
> +			break;
> +
> +		cpu_relax_lowlatency();
> +	}

Hmm. This unfortunately doesn't quite work the same as the optimistic
mutex spinner. The difference being that the scheduler knows that two
processes are contending for the mutex, but it doesn't know about the
contention between the thread and the GPU. The result is that unless
there is other activity on the system we simply degrade into a busy-spin
until the GPU is finished, rather than what I intended as spin for the
next ~10ms and then fallback to the interrupt.
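
To make the failure mode concrete, here is the loop from the hunk above
restated with its exit conditions annotated (a restatement only, nothing
new is being proposed here):

        while (!need_resched()) {              /* exit 1: another task wants this CPU */
                if (i915_gem_request_completed(req, true)) {
                        ret = 0;                /* exit 2: the GPU completed the request */
                        goto out;
                }

                if (timeout && time_after_eq(jiffies, timeout))
                        break;                  /* exit 3: only taken when a user timeout was passed */

                cpu_relax_lowlatency();
        }

On an otherwise idle system exit 1 never fires, and exit 3 only applies
to waits that carry a user timeout, so an untimed wait spins for the full
duration of the GPU work.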

Arguably busy-spinning on an idle system isn't totally evil, but it
certainly is likely to come at a power cost. On the other hand, spinning
is relatively rare outside of benchmarks. Rare enough to be useful?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
  2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
  2015-03-11 21:18 ` Chris Wilson
@ 2015-03-11 23:07 ` shuang.he
  1 sibling, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-11 23:07 UTC (permalink / raw)
  To: shuang.he, ethan.gao, intel-gfx, chris

Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5934
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                 -9              281/281              272/281
ILK                 -1              308/308              307/308
SNB                 -1              284/284              283/284
IVB                                  375/375              375/375
BYT                                  294/294              294/294
HSW                                  384/384              384/384
BDW                                  315/315              315/315
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*PNV  igt_gem_fence_thrash_bo-write-verify-none      PASS(2)      FAIL(1)NRUN(1)
*PNV  igt_gem_fence_thrash_bo-write-verify-x      PASS(2)      FAIL(1)NO_RESULT(1)
*PNV  igt_gem_fence_thrash_bo-write-verify-y      PASS(2)      FAIL(1)NO_RESULT(1)
 PNV  igt_gem_userptr_blits_coherency-sync      CRASH(4)PASS(2)      CRASH(1)PASS(1)
 PNV  igt_gen3_render_linear_blits      FAIL(2)PASS(1)      FAIL(1)PASS(1)
*PNV  igt_gen3_render_mixed_blits      FAIL(1)PASS(2)      FAIL(1)NO_RESULT(1)
 PNV  igt_gen3_render_tiledx_blits      FAIL(3)PASS(1)      FAIL(1)PASS(1)
 PNV  igt_gen3_render_tiledy_blits      FAIL(4)PASS(1)      FAIL(1)PASS(1)
 PNV  igt_gem_fence_thrash_bo-write-verify-threaded-none      CRASH(1)PASS(4)      CRASH(1)PASS(1)
*ILK  igt_kms_flip_wf_vblank-ts-check      PASS(2)      DMESG_WARN(1)PASS(1)
*SNB  igt_gem_fenced_exec_thrash_too-many-fences      PASS(4)      DMESG_WARN(1)PASS(1)
Note: You need to pay more attention to line start with '*'

* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
  2015-03-11 21:18 ` Chris Wilson
@ 2015-03-12  9:07   ` Chris Wilson
  2015-03-12  9:17   ` Daniel Vetter
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12  9:07 UTC (permalink / raw)
  To: intel-gfx, Daniel Vetter

On Wed, Mar 11, 2015 at 09:18:19PM +0000, Chris Wilson wrote:
> Arguably busy-spinning on an idle system isn't totally evil, but it
> certainly is likely to come at a power cost. On the other hand, spinning
> is relatively rare outside of benchmarks. Rare enough to be useful?

As a counterpoint, I plugged in the qm_pos trick we use for lowlatency
dma interrupts (gmbus, dp-aux) and found no improvement at all over a
plain interrupt driven __i915_wait_request.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
  2015-03-11 21:18 ` Chris Wilson
  2015-03-12  9:07   ` Chris Wilson
@ 2015-03-12  9:17   ` Daniel Vetter
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-12  9:17 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter

On Wed, Mar 11, 2015 at 09:18:19PM +0000, Chris Wilson wrote:
> On Wed, Mar 11, 2015 at 03:29:19PM +0000, Chris Wilson wrote:
> > +	while (!need_resched()) {
> > +		if (i915_gem_request_completed(req, true)) {
> > +			ret = 0;
> > +			goto out;
> > +		}
> > +
> > +		if (timeout && time_after_eq(jiffies, timeout))
> > +			break;
> > +
> > +		cpu_relax_lowlatency();
> > +	}
> 
> Hmm. This unfortunately doesn't quite work the same as the optimistic
> mutex spinner. The difference being that the scheduler knows that two
> processes are contending for the mutex, but it doesn't know about the
> contention between the thread and the GPU. The result is that unless
> there is other activity on the system we simply degrade into a busy-spin
> until the GPU is finished, rather than what I intended as spin for the
> next ~10ms and then fallback to the interrupt.
> 
> Arguably busy-spinning on an idle system isn't totally evil, but it
> certainly is likely to come at a power cost. On the other hand, spinning
> is relatively rare outside of benchmarks. Rare enough to be useful?

Maybe throw in a limit of a few useconds under the assumption that it's ok
to spin for roughly the average sleep+wakeup latency? And once spun out go
to sleep?
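
For illustration, a minimal sketch of what such a microsecond-scale bound
might look like, assuming it is acceptable to read the raw clock on every
iteration rather than arming a timer (ktime_get_raw_ns() is already used
by __i915_wait_request(); the 5us budget is purely illustrative). The
patches that follow bound the spin at one jiffie instead:

        u64 deadline = ktime_get_raw_ns() + 5 * NSEC_PER_USEC;

        while (!need_resched()) {
                if (i915_gem_request_completed(req, true))
                        return 0;

                if (ktime_get_raw_ns() > deadline)
                        break;

                cpu_relax_lowlatency();
        }
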
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12  9:17   ` Daniel Vetter
@ 2015-03-12 11:11     ` Chris Wilson
  2015-03-12 12:06       ` Chris Wilson
                         ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 11:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.

v2: Account for user timeouts
v3: Limit the spinning to a single jiffie (~1us) at most. On an
otherwise idle system, there is no scheduler contention and so without a
limit we would spin until the GPU is ready.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 48 +++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1cc46558bf2a..df96f9704cf2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,6 +1191,33 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
+static int __i915_spin_request(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
+	struct drm_i915_private *dev_priv = to_i915(ring->dev);
+	unsigned long timeout;
+	int ret = -EBUSY;
+
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+	timeout = jiffies + 1;
+	while (!need_resched()) {
+		if (i915_gem_request_completed(req, true)) {
+			ret = 0;
+			goto out;
+		}
+
+		if (time_after_eq(jiffies, timeout))
+			break;
+
+		cpu_relax_lowlatency();
+	}
+	if (i915_gem_request_completed(req, false))
+		ret = 0;
+out:
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	return ret;
+}
+
 /**
  * __i915_wait_request - wait until execution of request has finished
  * @req: duh!
@@ -1235,12 +1262,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
 		gen6_rps_boost(dev_priv, file_priv);
 
-	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
-		return -ENODEV;
-
 	/* Record current time in case interrupted by signal, or wedged */
 	trace_i915_gem_request_wait_begin(req);
 	before = ktime_get_raw_ns();
+
+	/* Optimistic spin for the next jiffie before touching IRQs */
+	ret = __i915_spin_request(req);
+	if (ret == 0)
+		goto out;
+
+	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	for (;;) {
 		struct timer_list timer;
 
@@ -1289,14 +1324,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			destroy_timer_on_stack(&timer);
 		}
 	}
-	now = ktime_get_raw_ns();
-	trace_i915_gem_request_wait_end(req);
-
 	if (!irq_test_in_progress)
 		ring->irq_put(ring);
 
 	finish_wait(&ring->irq_queue, &wait);
 
+out:
+	now = ktime_get_raw_ns();
+	trace_i915_gem_request_wait_end(req);
+
 	if (timeout) {
 		s64 tres = *timeout - (now - before);
 
-- 
2.1.4


* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
@ 2015-03-12 12:06       ` Chris Wilson
  2015-03-12 13:14       ` Tvrtko Ursulin
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 12:06 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
> 
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

I'm really happy with this tradeoff. Can we throw this to the vultures
to see if anyone notices?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
  2015-03-12 12:06       ` Chris Wilson
@ 2015-03-12 13:14       ` Tvrtko Ursulin
  2015-03-12 13:18         ` Chris Wilson
  2015-03-12 19:27       ` shuang.he
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 13:14 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter

On 03/12/2015 11:11 AM, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.

Isn't one jiffie 1-10ms typically?

Regards,

Tvrtko


* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 13:14       ` Tvrtko Ursulin
@ 2015-03-12 13:18         ` Chris Wilson
  2015-03-12 15:18           ` Tvrtko Ursulin
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 13:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx

On Thu, Mar 12, 2015 at 01:14:30PM +0000, Tvrtko Ursulin wrote:
> On 03/12/2015 11:11 AM, Chris Wilson wrote:
> >This provides a nice boost to mesa in swap bound scenarios (as mesa
> >throttles itself to the previous frame and given the scenario that will
> >complete shortly). It will also provide a good boost to systems running
> >with semaphores disabled and so frequently waiting on the GPU as it
> >switches rings. In the most favourable of microbenchmarks, this can
> >increase performance by around 15% - though in practice improvements
> >will be marginal and rarely noticeable.
> >
> >v2: Account for user timeouts
> >v3: Limit the spinning to a single jiffie (~1us) at most. On an
> >otherwise idle system, there is no scheduler contention and so without a
> >limit we would spin until the GPU is ready.
> 
> Isn't one jiffie 1-10ms typically?

1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
was a jiffie, hence the confusion. At any rate, it is still the minimum
we can trivially wait for (without an expensive hrtimer).
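
(For reference, the arithmetic: one jiffie is 1/HZ seconds, i.e. 1000/HZ
milliseconds, so HZ=1000 gives 1 ms, HZ=250 gives 4 ms and HZ=100 gives
10 ms per tick.)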
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 13:18         ` Chris Wilson
@ 2015-03-12 15:18           ` Tvrtko Ursulin
  2015-03-12 16:28             ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 15:18 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter


On 03/12/2015 01:18 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 01:14:30PM +0000, Tvrtko Ursulin wrote:
>> On 03/12/2015 11:11 AM, Chris Wilson wrote:
>>> This provides a nice boost to mesa in swap bound scenarios (as mesa
>>> throttles itself to the previous frame and given the scenario that will
>>> complete shortly). It will also provide a good boost to systems running
>>> with semaphores disabled and so frequently waiting on the GPU as it
>>> switches rings. In the most favourable of microbenchmarks, this can
>>> increase performance by around 15% - though in practice improvements
>>> will be marginal and rarely noticeable.
>>>
>>> v2: Account for user timeouts
>>> v3: Limit the spinning to a single jiffie (~1us) at most. On an
>>> otherwise idle system, there is no scheduler contention and so without a
>>> limit we would spin until the GPU is ready.
>>
>> Isn't one jiffie 1-10ms typically?
>
> 1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
> was a jiffie, hence the confusion. At any rate, it is still the minimum
> we can trivially wait for (without an expensive hrtimer).

Unless I lost track with the times, that's CONFIG_HZ right?

I don't know what server distributions do, but this Ubuntu LTS I am 
running has HZ=250 which means 4ms.

That would mean on a system where throughput is more important than 
latency, you lose most throughput by spinning the longest. In theory at 
least, no?

So perhaps this should be a tunable? Optionally auto-select the initial 
state based on HZ.
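
A minimal sketch of such a toggle, purely for illustration (the parameter
name, the default heuristic and the early-out are assumptions, not
something any patch in this thread proposes):

        static bool spin_request = IS_ENABLED(CONFIG_PREEMPT) && HZ >= 1000;
        module_param(spin_request, bool, 0600);
        MODULE_PARM_DESC(spin_request,
                         "Busy-wait briefly for request completion before sleeping");

        /* ... and __i915_spin_request() would then bail out early: */
        if (!spin_request)
                return -EBUSY;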

Regards,

Tvrtko

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 15:18           ` Tvrtko Ursulin
@ 2015-03-12 16:28             ` Chris Wilson
  2015-03-12 16:41               ` Tvrtko Ursulin
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 16:28 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx

On Thu, Mar 12, 2015 at 03:18:01PM +0000, Tvrtko Ursulin wrote:
> On 03/12/2015 01:18 PM, Chris Wilson wrote:
> >1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
> >was a jiffie, hence the confusion. At any rate, it is still the minimum
> >we can trivially wait for (without an expensive hrtimer).
> 
> Unless I lost track with the times, that's CONFIG_HZ right?
> 
> I don't know what server distributions do, but this Ubuntu LTS I am
> running has HZ=250 which means 4ms.
> 
> That would mean on a system where throughput is more important than
> latency, you lose most throughput by spinning the longest. In theory
> at least, no?

Only in theory, and only if you mean throughput of non-i915 workloads
with preemption disabled.  Spinning here improves both latency and
throughput for gfx clients. Using up the timeslice for the client may
have secondary effects though - otherwise they would get iowait credits.

> So perhaps this should be a tunable? Optionally auto-select the
> initial state based on HZ.

Spinning less than a jiffie requires hrtimers at which point you may as
well just use the i915 interrupt (rather than setup a timer interrupt).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 16:28             ` Chris Wilson
@ 2015-03-12 16:41               ` Tvrtko Ursulin
  2015-03-12 16:50                 ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 16:41 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter


On 03/12/2015 04:28 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 03:18:01PM +0000, Tvrtko Ursulin wrote:
>> On 03/12/2015 01:18 PM, Chris Wilson wrote:
>>> 1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
>>> was a jiffie, hence the confusion. At any rate, it is still the minimum
>>> we can trivially wait for (without an expensive hrtimer).
>>
>> Unless I lost track with the times, that's CONFIG_HZ right?
>>
>> I don't know what server distributions do, but this Ubuntu LTS I am
>> running has HZ=250 which means 4ms.
>>
>> That would mean on a system where throughput is more important than
>> latency, you lose most throughput by spinning the longest. In theory
>> at least, no?
>
> Only in theory, and only if you mean throughput of non-i915 workloads
> with preemption disabled.  Spinning here improves both latency and
> throughput for gfx clients. Using up the timeslice for the client may
> have secondary effects though - otherwise they would get iowait credits.

Yes, I meant CPU workloads. And low HZ and no preemption usually go 
together.

>> So perhaps this should be a tunable? Optionally auto-select the
>> initial state based on HZ.
>
> Spinning less than a jiffie requires hrtimers at which point you may as
> well just use the i915 interrupt (rather than setup a timer interrupt).

Yes I didn't mean that - but to have a boolean spinning-wait=on/off. 
Maybe default to "on" on HZ=1000 with preemption, or the opposite, 
something like that.

Regards,

Tvrtko

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 16:41               ` Tvrtko Ursulin
@ 2015-03-12 16:50                 ` Chris Wilson
  2015-03-12 17:32                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 16:50 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx

On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
> Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
> Maybe default to "on" on HZ=1000 with preemption, or the opposite,
> something like that.

I don't see the point in having the complication, until someone
complains. In my defense, I will point to the optimistic mutex spinning
equally having no configurable option. And since the idea is that you
only hit this if you are abusing i915 (e.g. benchmarking, or you have a
readback on the critical path, or if we haven't implemented
semaphores), I would rather fix those scenarios as they arrive rather
than giving the user an option to break.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 16:50                 ` Chris Wilson
@ 2015-03-12 17:32                   ` Tvrtko Ursulin
  2015-03-13  9:33                     ` Daniel Vetter
  0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 17:32 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter


On 03/12/2015 04:50 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
>> Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
>> Maybe default to "on" on HZ=1000 with preemption, or the opposite,
>> something like that.
>
> I don't see the point in having the complication, until someone
> complains. In my defense, I will point to the optimistic mutex spinning
> equally having no configurable option. And since the idea is that you
> only hit this if you are abusing i915 (e.g. benchmarking, or you have a
> readback on the critical path, or if we haven't implemented
> semaphores), I would rather fix those scenarios as they arrive rather
> than giving the user an option to break.

I simply pointed out a theoretical potential to burn more CPU on 
servers. If you know that is unlikely or only theoretical that's fine by me.

But I'll say I was more convinced before you mentioned "until someone 
complains" and "option to break". :)

Regards,

Tvrtko

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
  2015-03-12 12:06       ` Chris Wilson
  2015-03-12 13:14       ` Tvrtko Ursulin
@ 2015-03-12 19:27       ` shuang.he
  2015-03-19 15:16       ` Chris Wilson
  2015-03-20 14:36       ` [PATCH v4] " Chris Wilson
  4 siblings, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-12 19:27 UTC (permalink / raw)
  To: shuang.he, ethan.gao, intel-gfx, chris

Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5941
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                 -7              274/274              267/274
ILK                                  301/301              301/301
SNB                 -1              277/277              276/277
IVB                                  341/341              341/341
BYT                                  287/287              287/287
HSW                                  360/360              360/360
BDW                                  308/308              308/308
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*PNV  igt_gem_fence_thrash_bo-write-verify-none      PASS(4)      FAIL(1)PASS(1)
*PNV  igt_gem_fence_thrash_bo-write-verify-x      PASS(4)      FAIL(1)PASS(1)
*PNV  igt_gem_fence_thrash_bo-write-verify-y      PASS(4)      FAIL(1)PASS(1)
 PNV  igt_gem_userptr_blits_coherency-sync      CRASH(5)PASS(4)      CRASH(1)PASS(1)
 PNV  igt_gem_userptr_blits_coherency-unsync      CRASH(1)PASS(3)      CRASH(1)PASS(1)
 PNV  igt_gen3_render_linear_blits      FAIL(3)PASS(4)      FAIL(1)PASS(1)
 PNV  igt_gen3_render_mixed_blits      FAIL(3)PASS(4)      FAIL(2)
*SNB  igt_gem_mmap_gtt_write-read      PASS(2)      DMESG_WARN(1)PASS(1)
Note: You need to pay more attention to line start with '*'

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 17:32                   ` Tvrtko Ursulin
@ 2015-03-13  9:33                     ` Daniel Vetter
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Vetter @ 2015-03-13  9:33 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx

On Thu, Mar 12, 2015 at 05:32:26PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/12/2015 04:50 PM, Chris Wilson wrote:
> >On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
> >>Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
> >>Maybe default to "on" on HZ=1000 with preemption, or the opposite,
> >>something like that.
> >
> >I don't see the point in having the complication, until someone
> >complains. In my defense, I will point to the optimistic mutex spinning
> >equally having no configurable option. And since the idea is that you
> >only hit this if you are abusing i915 (e.g. benchmarking, or you have a
> >readback on the critical path, or if we haven't implemented
> >semaphores), I would rather fix those scenarios as they arrive rather
> >than giving the user an option to break.
> 
> I simply pointed out a theoretical potential to burn more CPU on servers. If
> you know that is unlikely or only theoretical that's fine by me.
> 
> But I'll say I was more convinced before you mentioned "until someone
> complains" and "option to break". :)

Server workloads on intel machines are currently transcode stuff (and
there's more room in libva for better parallelism than this) and ocl
(which tends to go with the busy-loops for synchronization model anyway).

Also we'll stop spinning the second someone else wants the cpu, so there
are really just secondary effects of the scheduler due to fewer iowait credits
and more timeslices used up. I'll go with Chris and will start to care
when someone complains with real benchmark numbers attached ;-)

Ofc we still want broader numbers for the benchmark benefits for this one
first.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
                         ` (2 preceding siblings ...)
  2015-03-12 19:27       ` shuang.he
@ 2015-03-19 15:16       ` Chris Wilson
  2015-03-20 14:54         ` Daniel Vetter
  2015-03-20 14:36       ` [PATCH v4] " Chris Wilson
  4 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-19 15:16 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
> 
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Just recording ideas for the future. Replace the busy-spin with
monitor/mwait. This requires Pentium4+, a cooperating GPU with working
cacheline snooping and that we use HWS seqno.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 85e71e0e2340..454a38d4caa3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -37,6 +37,7 @@
 #include <linux/swap.h>
 #include <linux/pci.h>
 #include <linux/dma-buf.h>
+#include <asm/mwait.h>
 
 #define RQ_BUG_ON(expr)
 
@@ -1187,18 +1188,42 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
        unsigned long timeout;
        int ret = -EBUSY;
 
+       if (ring->irq_refcount) /* IRQ is already active, keep using it */
+               return ret;
+
        intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
        timeout = jiffies + 1;
-       while (!need_resched()) {
-               if (i915_gem_request_completed(req, true)) {
-                       ret = 0;
-                       goto out;
-               }
+       if (this_cpu_has(X86_FEATURE_MWAIT)) {
+               do {
+                       unsigned long ecx = 1; /* break on interrupt */
+                       unsigned long eax = 0; /* cstate */
 
-               if (time_after_eq(jiffies, timeout))
-                       break;
+                       __monitor((void *)&ring->status_page.page_addr[I915_GEM_HWS_INDEX], 0, 0);
+                       if (need_resched())
+                               break;
+
+                       if (i915_gem_request_completed(req, true)) {
+                               ret = 0;
+                               goto out;
+                       }
 
-               cpu_relax_lowlatency();
+                       if (time_after_eq(jiffies, timeout))
+                               break;
+
+                       __mwait(eax, ecx);
+               } while (1);
+       } else {
+               while (!need_resched()) {
+                       if (i915_gem_request_completed(req, true)) {
+                               ret = 0;
+                               goto out;
+                       }
+
+                       if (time_after_eq(jiffies, timeout))
+                               break;
+
+                       cpu_relax_lowlatency();
+               }
        }
        if (i915_gem_request_completed(req, false))
                ret = 0;
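
For readers unfamiliar with the pairing: __monitor() arms hardware address
monitoring on the cacheline passed to it - here the hardware status page
slot that the GPU writes the seqno into - and __mwait() then idles the CPU
until that line is written or an interrupt arrives, which is why the idea
depends on working cacheline snooping and on using the HWS seqno. Stripped
of the diff context, the pattern is (names as in the hunk above):

        unsigned long ecx = 1;  /* allow interrupts to break the wait */
        unsigned long eax = 0;  /* C-state hint, 0 = shallowest */

        __monitor((void *)&ring->status_page.page_addr[I915_GEM_HWS_INDEX], 0, 0);
        if (!i915_gem_request_completed(req, true))
                __mwait(eax, ecx);      /* wake on a write to the monitored line or an IRQ */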


-- 
Chris Wilson, Intel Open Source Technology Centre

* [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
                         ` (3 preceding siblings ...)
  2015-03-19 15:16       ` Chris Wilson
@ 2015-03-20 14:36       ` Chris Wilson
  2015-03-20 16:01         ` Tvrtko Ursulin
  2015-03-20 21:30         ` shuang.he
  4 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 14:36 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Eero Tamminen, Rantala, Valtteri

This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.

v2: Account for user timeouts
v3: Limit the spinning to a single jiffie (~1us) at most. On an
otherwise idle system, there is no scheduler contention and so without a
limit we would spin until the GPU is ready.
v4: Drop forcewake - the lazy coherent access doesn't require it, and we
have no reason to believe that the forcewake itself improves seqno
coherency - it only adds delay.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2e17a254aac1..9988e65c1440 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
+static int __i915_spin_request(struct drm_i915_gem_request *rq)
+{
+	unsigned long timeout;
+
+	if (i915_gem_request_get_ring(rq)->irq_refcount)
+		return -EBUSY;
+
+	timeout = jiffies + 1;
+	while (!need_resched()) {
+		if (i915_gem_request_completed(rq, true))
+			return 0;
+
+		if (time_after_eq(jiffies, timeout))
+			break;
+
+		cpu_relax_lowlatency();
+	}
+	if (i915_gem_request_completed(rq, false))
+		return 0;
+
+	return -EAGAIN;
+}
+
 /**
  * __i915_wait_request - wait until execution of request has finished
  * @req: duh!
@@ -1225,12 +1248,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
 		gen6_rps_boost(dev_priv, file_priv);
 
-	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
-		return -ENODEV;
-
 	/* Record current time in case interrupted by signal, or wedged */
 	trace_i915_gem_request_wait_begin(req);
 	before = ktime_get_raw_ns();
+
+	/* Optimistic spin for the next jiffie before touching IRQs */
+	ret = __i915_spin_request(req);
+	if (ret == 0)
+		goto out;
+
+	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	for (;;) {
 		struct timer_list timer;
 
@@ -1279,14 +1310,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			destroy_timer_on_stack(&timer);
 		}
 	}
-	now = ktime_get_raw_ns();
-	trace_i915_gem_request_wait_end(req);
-
 	if (!irq_test_in_progress)
 		ring->irq_put(ring);
 
 	finish_wait(&ring->irq_queue, &wait);
 
+out:
+	now = ktime_get_raw_ns();
+	trace_i915_gem_request_wait_end(req);
+
 	if (timeout) {
 		s64 tres = *timeout - (now - before);
 
-- 
2.1.4


* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-19 15:16       ` Chris Wilson
@ 2015-03-20 14:54         ` Daniel Vetter
  2015-03-20 15:27           ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-20 14:54 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter

On Thu, Mar 19, 2015 at 03:16:15PM +0000, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> > This provides a nice boost to mesa in swap bound scenarios (as mesa
> > throttles itself to the previous frame and given the scenario that will
> > complete shortly). It will also provide a good boost to systems running
> > with semaphores disabled and so frequently waiting on the GPU as it
> > switches rings. In the most favourable of microbenchmarks, this can
> > increase performance by around 15% - though in practice improvements
> > will be marginal and rarely noticeable.
> > 
> > v2: Account for user timeouts
> > v3: Limit the spinning to a single jiffie (~1us) at most. On an
> > otherwise idle system, there is no scheduler contention and so without a
> > limit we would spin until the GPU is ready.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Just recording ideas for the future. Replace the busy-spin with
> monitor/mwait. This requires Pentium4+, a cooperating GPU with working
> cacheline snooping and that we use HWS seqno.

Just for the record: Did it help with powersaving or was it all in the
noise?
-Daniel

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 85e71e0e2340..454a38d4caa3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -37,6 +37,7 @@
>  #include <linux/swap.h>
>  #include <linux/pci.h>
>  #include <linux/dma-buf.h>
> +#include <asm/mwait.h>
>  
>  #define RQ_BUG_ON(expr)
>  
> @@ -1187,18 +1188,42 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
>         unsigned long timeout;
>         int ret = -EBUSY;
>  
> +       if (ring->irq_refcount) /* IRQ is already active, keep using it */
> +               return ret;
> +
>         intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>         timeout = jiffies + 1;
> -       while (!need_resched()) {
> -               if (i915_gem_request_completed(req, true)) {
> -                       ret = 0;
> -                       goto out;
> -               }
> +       if (this_cpu_has(X86_FEATURE_MWAIT)) {
> +               do {
> +                       unsigned long ecx = 1; /* break on interrupt */
> +                       unsigned long eax = 0; /* cstate */
>  
> -               if (time_after_eq(jiffies, timeout))
> -                       break;
> +                       __monitor((void *)&ring->status_page.page_addr[I915_GEM_HWS_INDEX], 0, 0);
> +                       if (need_resched())
> +                               break;
> +
> +                       if (i915_gem_request_completed(req, true)) {
> +                               ret = 0;
> +                               goto out;
> +                       }
>  
> -               cpu_relax_lowlatency();
> +                       if (time_after_eq(jiffies, timeout))
> +                               break;
> +
> +                       __mwait(eax, ecx);
> +               } while (1);
> +       } else {
> +               while (!need_resched()) {
> +                       if (i915_gem_request_completed(req, true)) {
> +                               ret = 0;
> +                               goto out;
> +                       }
> +
> +                       if (time_after_eq(jiffies, timeout))
> +                               break;
> +
> +                       cpu_relax_lowlatency();
> +               }
>         }
>         if (i915_gem_request_completed(req, false))
>                 ret = 0;
> 
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
  2015-03-20 14:54         ` Daniel Vetter
@ 2015-03-20 15:27           ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 15:27 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx

On Fri, Mar 20, 2015 at 03:54:01PM +0100, Daniel Vetter wrote:
> On Thu, Mar 19, 2015 at 03:16:15PM +0000, Chris Wilson wrote:
> > On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> > > This provides a nice boost to mesa in swap bound scenarios (as mesa
> > > throttles itself to the previous frame and given the scenario that will
> > > complete shortly). It will also provide a good boost to systems running
> > > with semaphores disabled and so frequently waiting on the GPU as it
> > > switches rings. In the most favourable of microbenchmarks, this can
> > > increase performance by around 15% - though in practice improvements
> > > will be marginal and rarely noticeable.
> > > 
> > > v2: Account for user timeouts
> > > v3: Limit the spinning to a single jiffie (~1us) at most. On an
> > > otherwise idle system, there is no scheduler contention and so without a
> > > limit we would spin until the GPU is ready.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > 
> > Just recording ideas for the future. Replace the busy-spin with
> > monitor/mwait. This requires Pentium4+, a cooperating GPU with working
> > cacheline snooping and that we use HWS seqno.
> 
> Just for the record: Did it help with powersaving or was it all in the
> noise?

Unscientifically, I would say mwait(cstate=0) was worse. It gave a
marginally higher peak, but there was clearly worse thermal throttling
than the simple busy-wait. powertop suggests that with the mwait we were
not reaching as low a package cstate as often.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 14:36       ` [PATCH v4] " Chris Wilson
@ 2015-03-20 16:01         ` Tvrtko Ursulin
  2015-03-20 16:19           ` Chris Wilson
  2015-03-20 21:30         ` shuang.he
  1 sibling, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-20 16:01 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter, Eero Tamminen, Rantala, Valtteri


On 03/20/2015 02:36 PM, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
> v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> have no reason to believe that the forcewake itself improves seqno
> coherency - it only adds delay.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>

Still against a toggle switch like a simple module parameter?

> ---
>   drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
>   1 file changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2e17a254aac1..9988e65c1440 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>   	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
>   }
>
> +static int __i915_spin_request(struct drm_i915_gem_request *rq)
> +{
> +	unsigned long timeout;
> +
> +	if (i915_gem_request_get_ring(rq)->irq_refcount)
> +		return -EBUSY;

So if someone else is already waiting on this ring skip the spin and do 
the sleep-wait.

That would mean the earlier waiter didn't manage to spin to completion so 
for subsequent ones does it make sense to try to spin? If we assume 
waiters are arriving here in submission order then no, they should 
proceed to sleep-wait. But if waiters are arriving here in random order, 
and that is purely up to userspace I think, then I am not sure?

On the other hand if we allowed this "out-of-order waiters" that would 
potentially be too much spinning so maybe it is better like it is.

> +	timeout = jiffies + 1;
> +	while (!need_resched()) {
> +		if (i915_gem_request_completed(rq, true))
> +			return 0;
> +
> +		if (time_after_eq(jiffies, timeout))
> +			break;
> +
> +		cpu_relax_lowlatency();
> +	}
> +	if (i915_gem_request_completed(rq, false))
> +		return 0;
 > +
> +	return -EAGAIN;
> +}
> +
>   /**
>    * __i915_wait_request - wait until execution of request has finished
>    * @req: duh!
> @@ -1225,12 +1248,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   	if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
>   		gen6_rps_boost(dev_priv, file_priv);
>
> -	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
> -		return -ENODEV;
> -
>   	/* Record current time in case interrupted by signal, or wedged */
>   	trace_i915_gem_request_wait_begin(req);
>   	before = ktime_get_raw_ns();
> +
> +	/* Optimistic spin for the next jiffie before touching IRQs */
> +	ret = __i915_spin_request(req);
> +	if (ret == 0)
> +		goto out;
> +
> +	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
>   	for (;;) {
>   		struct timer_list timer;
>
> @@ -1279,14 +1310,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   			destroy_timer_on_stack(&timer);
>   		}
>   	}
> -	now = ktime_get_raw_ns();
> -	trace_i915_gem_request_wait_end(req);
> -
>   	if (!irq_test_in_progress)
>   		ring->irq_put(ring);
>
>   	finish_wait(&ring->irq_queue, &wait);
>
> +out:
> +	now = ktime_get_raw_ns();
> +	trace_i915_gem_request_wait_end(req);
> +
>   	if (timeout) {
>   		s64 tres = *timeout - (now - before);
>
>

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 16:01         ` Tvrtko Ursulin
@ 2015-03-20 16:19           ` Chris Wilson
  2015-03-20 16:31             ` Tvrtko Ursulin
  2015-03-20 22:59             ` Chris Wilson
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 16:19 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx, Rantala, Valtteri, Eero Tamminen

On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/20/2015 02:36 PM, Chris Wilson wrote:
> >This provides a nice boost to mesa in swap bound scenarios (as mesa
> >throttles itself to the previous frame and given the scenario that will
> >complete shortly). It will also provide a good boost to systems running
> >with semaphores disabled and so frequently waiting on the GPU as it
> >switches rings. In the most favourable of microbenchmarks, this can
> >increase performance by around 15% - though in practice improvements
> >will be marginal and rarely noticeable.
> >
> >v2: Account for user timeouts
> >v3: Limit the spinning to a single jiffie (~1us) at most. On an
> >otherwise idle system, there is no scheduler contention and so without a
> >limit we would spin until the GPU is ready.
> >v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> >have no reason to believe that the forcewake itself improves seqno
> >coherency - it only adds delay.
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> >Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
> 
> Still against a toggle switch like a simple module parameter?

Yes. I'd much rather tackle the corner cases than ignore them.
 
> >---
> >  drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
> >  1 file changed, 38 insertions(+), 6 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 2e17a254aac1..9988e65c1440 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
> >  	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
> >  }
> >
> >+static int __i915_spin_request(struct drm_i915_gem_request *rq)
> >+{
> >+	unsigned long timeout;
> >+
> >+	if (i915_gem_request_get_ring(rq)->irq_refcount)
> >+		return -EBUSY;
> 
> So if someone else is already waiting on this ring skip the spin and
> do the sleep-wait.
> 
> That would mean the earlier waiter didn't manage to spin to completion
> so for subsequent ones does it make sense to try to spin? If we
> assume waiters are arriving here in submission order then no, they
> should proceed to sleep-wait. But if waiters are arriving here in
> random order, and that is purely up to userspace I think, then I am
> not sure?

They arrive pretty much in random order.
 
> On the other hand if we allowed this "out-of-order waiters" that
> would potentially be too much spinning so maybe it is better like it
> is.

Also part of my premise is that it is the cost of the irq (setting it up and
handling all the intermediate ones) that is the crux of the issue. Once
we have enabled the irq for one, we may as well then use it for the
herd. Also with a herd we will want to err on the side of sleeping more,
or so my intuition says.

I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
Window, but rendering internally at 1080p] clients we can run in
parallel whilst hitting 60fps. And then whether allowing multiple
spinners helps or hinders.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 16:19           ` Chris Wilson
@ 2015-03-20 16:31             ` Tvrtko Ursulin
  2015-03-23  8:29               ` Daniel Vetter
  2015-03-20 22:59             ` Chris Wilson
  1 sibling, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-20 16:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala, Valtteri


On 03/20/2015 04:19 PM, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/20/2015 02:36 PM, Chris Wilson wrote:
>>> This provides a nice boost to mesa in swap bound scenarios (as mesa
>>> throttles itself to the previous frame and given the scenario that will
>>> complete shortly). It will also provide a good boost to systems running
>>> with semaphores disabled and so frequently waiting on the GPU as it
>>> switches rings. In the most favourable of microbenchmarks, this can
>>> increase performance by around 15% - though in practice improvements
>>> will be marginal and rarely noticeable.
>>>
>>> v2: Account for user timeouts
>>> v3: Limit the spinning to a single jiffie (~1ms) at most. On an
>>> otherwise idle system, there is no scheduler contention and so without a
>>> limit we would spin until the GPU is ready.
>>> v4: Drop forcewake - the lazy coherent access doesn't require it, and we
>>> have no reason to believe that the forcewake itself improves seqno
>>> coherency - it only adds delay.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>>> Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
>>
>> Still against a toggle switch like a simple module parameter?
>
> Yes. I'd much rather tackle the corner cases than ignore them.

I'll say it once more then leave it be - my point of view is that a module
param does not mean ignoring any issues. It rather means that, if a
pathological use case is found in the field, you can provide a better
user experience and then work in parallel on coming up with improvements.

Your view is probably that if there is a toggle, someone somewhere will 
put on some wiki "yeah if that happens just put this in modprobe.conf" 
and there won't even be a bug report.

It is a downside, yes, but to me much better than finding a bad corner
case and then saying "Yeah, please recompile your kernel", or "downgrade
your kernel and wait for X weeks/months until the fix propagates".
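
For the record, the kind of toggle I mean is trivial - something along
these lines (hypothetical sketch, parameter name and wiring invented for
illustration; this is not part of the patch):

#include <linux/module.h>
#include <linux/moduleparam.h>

static bool spin_request = true;
module_param(spin_request, bool, 0600);
MODULE_PARM_DESC(spin_request,
		 "Busy-wait briefly for request completion before sleeping (default: true)");

/*
 * The wait path would then simply gate the optimistic spin:
 *
 *	if (spin_request && __i915_spin_request(req) == 0)
 *		goto out;
 */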

>>> ---
>>>   drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
>>>   1 file changed, 38 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 2e17a254aac1..9988e65c1440 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>>>   	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
>>>   }
>>>
>>> +static int __i915_spin_request(struct drm_i915_gem_request *rq)
>>> +{
>>> +	unsigned long timeout;
>>> +
>>> +	if (i915_gem_request_get_ring(rq)->irq_refcount)
>>> +		return -EBUSY;
>>
>> So if someone else is already waiting on this ring skip the spin and
>> do the sleep-wait.
>>
>> That would mean earlier waiter didn't manage to spin to completion
>> so for subsequent ones does it make sense to try to spin? If we
>> assume waiters are arriving here in submission order then no, they
>> should proceed to sleep-wait. But if waiters are arriving here in
>> random order, and that is purely up to userspace I think, then I am
>> not sure?
>
> They arrive pretty much in random order.
>
>> On the other hand if we allowed this "out-of-order waiters" that
>> would potentially be too much spinning so maybe it is better like it
>> is.
>
> Also part of my premise is that it is the cost of the irq (setting it up
> and handling all the intermediate ones) that is the crux of the issue.
> Once we have enabled the irq for one, we may as well then use it for the
> herd. Also with a herd we will want to err on the side of sleeping more,
> or so my intuition says.

My gut feeling is the same. Waking a thundering herd sounds safer than 
a spinning one.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 14:36       ` [PATCH v4] " Chris Wilson
  2015-03-20 16:01         ` Tvrtko Ursulin
@ 2015-03-20 21:30         ` shuang.he
  1 sibling, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-20 21:30 UTC (permalink / raw)
  To: shuang.he, ethan.gao, intel-gfx, chris

Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6019
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                 -1              274/274              273/274
ILK                                  303/303              303/303
SNB                                  303/303              303/303
IVB                                  342/342              342/342
BYT                                  287/287              287/287
HSW                                  362/362              362/362
BDW                                  308/308              308/308
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*PNV  igt_gen3_render_linear_blits      PASS(5)      CRASH(1)PASS(1)
Note: You need to pay more attention to line start with '*'
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 16:19           ` Chris Wilson
  2015-03-20 16:31             ` Tvrtko Ursulin
@ 2015-03-20 22:59             ` Chris Wilson
  2015-03-21  9:49               ` Chris Wilson
  2015-03-23  8:31               ` Daniel Vetter
  1 sibling, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 22:59 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala,
	Valtteri

On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> Window, but rendering internally at 1080p] clients we can run in
> parallel whilst hitting 60fps. And then whether allowing multiple
> spinners helps or hinders.

I was thinking of a nice easy test that could demonstrate any advantage
for spinning over waiting, and realised we already had such an igt. The
trick is that it has to generate sufficient GPU load to actually require
a wait, but a low enough GPU load that we can still see the impact of
slow completion.

I present igt/gem_exec_blt (modified to repeat the measurement and do an
average over several runs):

Time to blt 16384 bytes x      1:        21.000µs -> 5.800µs
Time to blt 16384 bytes x      2:        11.500µs -> 4.500µs
Time to blt 16384 bytes x      4:         6.750µs -> 3.750µs
Time to blt 16384 bytes x      8:         4.950µs -> 3.375µs
Time to blt 16384 bytes x     16:         3.825µs -> 3.175µs
Time to blt 16384 bytes x     32:         3.356µs -> 3.000µs 
Time to blt 16384 bytes x     64:         3.259µs -> 2.909µs
Time to blt 16384 bytes x    128:         3.083µs -> 3.095µs
Time to blt 16384 bytes x    256:         3.104µs -> 2.979µs
Time to blt 16384 bytes x    512:         3.080µs -> 3.089µs
Time to blt 16384 bytes x   1024:         3.077µs -> 3.040µs 
Time to blt 16384 bytes x   2048:         3.127µs -> 3.304µs
Time to blt 16384 bytes x   4096:         3.279µs -> 3.265µs
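
(For reference, the repeat-and-average modification amounts to something
like the sketch below; submit_and_wait() is a hypothetical stand-in for
the test's per-iteration blit submission, not the real igt helper.)

#include <stdio.h>
#include <time.h>

static double elapsed_us(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) * 1e-3;
}

static double average_blt_us(void (*submit_and_wait)(int nblits),
			     int nblits, int runs)
{
	struct timespec start, end;
	double total = 0.0;
	int i;

	for (i = 0; i < runs; i++) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		submit_and_wait(nblits);	/* one batch of nblits copies */
		clock_gettime(CLOCK_MONOTONIC, &end);
		total += elapsed_us(&start, &end) / nblits;
	}

	return total / runs;	/* mean µs per blt over all runs */
}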
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 22:59             ` Chris Wilson
@ 2015-03-21  9:49               ` Chris Wilson
  2015-03-23  8:31               ` Daniel Vetter
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-21  9:49 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala,
	Valtteri

On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > Window, but rendering internally at 1080p] clients we can run in
> > parallel whilst hitting 60fps. And then whether allowing multiple
> > spinners helps or hinders.
> 
> I was thinking of a nice easy test that could demonstrate any advantage
> for spinning over waiting, and realised we already had such an igt. The
> trick is that it has to generate sufficient GPU load to actually require
> a wait, but not too high a GPU load such that we can see the impact from
> slow completion.
> 
> I present igt/gem_exec_blt (modified to repeat the measurement and do an
> average over several runs):
> 
> Time to blt 16384 bytes x      1:        21.000µs -> 5.800µs
> Time to blt 16384 bytes x      2:        11.500µs -> 4.500µs
> Time to blt 16384 bytes x      4:         6.750µs -> 3.750µs
> Time to blt 16384 bytes x      8:         4.950µs -> 3.375µs
> Time to blt 16384 bytes x     16:         3.825µs -> 3.175µs
> Time to blt 16384 bytes x     32:         3.356µs -> 3.000µs 
> Time to blt 16384 bytes x     64:         3.259µs -> 2.909µs
> Time to blt 16384 bytes x    128:         3.083µs -> 3.095µs
> Time to blt 16384 bytes x    256:         3.104µs -> 2.979µs
> Time to blt 16384 bytes x    512:         3.080µs -> 3.089µs
> Time to blt 16384 bytes x   1024:         3.077µs -> 3.040µs 
> Time to blt 16384 bytes x   2048:         3.127µs -> 3.304µs
> Time to blt 16384 bytes x   4096:         3.279µs -> 3.265µs

Then remembering to disable the cmdparser:

Time to blt 16384 bytes x      1:	  4.000µs, 3.8GiB/s
Time to blt 16384 bytes x      2:	  3.300µs, 4.6GiB/s
Time to blt 16384 bytes x      4:	  2.850µs, 5.4GiB/s
Time to blt 16384 bytes x      8:	  2.625µs, 5.8GiB/s
Time to blt 16384 bytes x     16:	  2.500µs, 6.1GiB/s
Time to blt 16384 bytes x     32:	  2.469µs, 6.2GiB/s
Time to blt 16384 bytes x     64:	  2.447µs, 6.2GiB/s
Time to blt 16384 bytes x    128:	  2.438µs, 6.3GiB/s
Time to blt 16384 bytes x    256:	  2.382µs, 6.4GiB/s
Time to blt 16384 bytes x    512:	  2.389µs, 6.4GiB/s
Time to blt 16384 bytes x   1024:	  2.386µs, 6.4GiB/s
Time to blt 16384 bytes x   2048:	  2.382µs, 6.4GiB/s
Time to blt 16384 bytes x   4096:	  2.378µs, 6.4GiB/s
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 16:31             ` Tvrtko Ursulin
@ 2015-03-23  8:29               ` Daniel Vetter
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Vetter @ 2015-03-23  8:29 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx, Rantala, Valtteri, Eero Tamminen

On Fri, Mar 20, 2015 at 04:31:29PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/20/2015 04:19 PM, Chris Wilson wrote:
> >On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/20/2015 02:36 PM, Chris Wilson wrote:
> >>>This provides a nice boost to mesa in swap bound scenarios (as mesa
> >>>throttles itself to the previous frame and given the scenario that will
> >>>complete shortly). It will also provide a good boost to systems running
> >>>with semaphores disabled and so frequently waiting on the GPU as it
> >>>switches rings. In the most favourable of microbenchmarks, this can
> >>>increase performance by around 15% - though in practice improvements
> >>>will be marginal and rarely noticeable.
> >>>
> >>>v2: Account for user timeouts
> >>>v3: Limit the spinning to a single jiffie (~1ms) at most. On an
> >>>otherwise idle system, there is no scheduler contention and so without a
> >>>limit we would spin until the GPU is ready.
> >>>v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> >>>have no reason to believe that the forcewake itself improves seqno
> >>>coherency - it only adds delay.
> >>>
> >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> >>>Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
> >>
> >>Still against a toggle switch like a simple module parameter?
> >
> >Yes. I'd much rather tackle the corner cases than ignore them.
> 
> I'll say it once more then leave it be - my point of view is that a module
> param does not mean ignoring any issues. It rather means that, if a
> pathological use case is found in the field, you can provide a better user
> experience and then work in parallel on coming up with improvements.
> 
> Your view is probably that if there is a toggle, someone somewhere will put
> on some wiki "yeah if that happens just put this in modprobe.conf" and there
> won't even be a bug report.

This is generally what indeed happens. We've suffered for years from
Google returning outdated wiki pages with "clever" suggestions ...

Nowadays all mod params are generally of the kernel tainting kind to
increase the odds that we get bug reports. I still don't really like them.
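
The pattern is simply to taint on use, e.g. (hypothetical parameter for
illustration; only add_taint() is the real interface):

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

static bool unsafe_tuning;
module_param(unsafe_tuning, bool, 0400);
MODULE_PARM_DESC(unsafe_tuning, "Enable an unsupported tuning knob (taints kernel)");

static int __init example_init(void)
{
	if (unsafe_tuning) {
		pr_warn("unsafe_tuning enabled, tainting kernel\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}
	return 0;
}
module_init(example_init);
MODULE_LICENSE("GPL");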

> It is a downside yes, but to me much better than finding a bad corner case
> and then saying "Yeah, please recompile your kernel", or downgrade your
> kernel and wait for X weeks/months until the fix propagates.

I agree, but only for stuff that causes crashes and hangs. There, a module
option makes sense for quicker debug. For general tuning I don't really
want them - the kernel should get things right, period.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-20 22:59             ` Chris Wilson
  2015-03-21  9:49               ` Chris Wilson
@ 2015-03-23  8:31               ` Daniel Vetter
  2015-03-23  9:09                 ` Chris Wilson
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-23  8:31 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, intel-gfx, Daniel Vetter,
	Eero Tamminen, Rantala, Valtteri

On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > Window, but rendering internally at 1080p] clients we can run in
> > parallel whilst hitting 60fps. And then whether allowing multiple
> > spinners helps or hinders.
> 
> I was thinking of a nice easy test that could demonstrate any advantage
> for spinning over waiting, and realised we already had such an igt. The
> trick is that it has to generate sufficient GPU load to actually require
> a wait, but not too high a GPU load such that we can see the impact from
> slow completion.
> 
> I present igt/gem_exec_blt (modified to repeat the measurement and do an
> average over several runs):
> 
> Time to blt 16384 bytes x      1:        21.000µs -> 5.800µs
> Time to blt 16384 bytes x      2:        11.500µs -> 4.500µs
> Time to blt 16384 bytes x      4:         6.750µs -> 3.750µs
> Time to blt 16384 bytes x      8:         4.950µs -> 3.375µs
> Time to blt 16384 bytes x     16:         3.825µs -> 3.175µs
> Time to blt 16384 bytes x     32:         3.356µs -> 3.000µs 
> Time to blt 16384 bytes x     64:         3.259µs -> 2.909µs
> Time to blt 16384 bytes x    128:         3.083µs -> 3.095µs
> Time to blt 16384 bytes x    256:         3.104µs -> 2.979µs
> Time to blt 16384 bytes x    512:         3.080µs -> 3.089µs
> Time to blt 16384 bytes x   1024:         3.077µs -> 3.040µs 
> Time to blt 16384 bytes x   2048:         3.127µs -> 3.304µs
> Time to blt 16384 bytes x   4096:         3.279µs -> 3.265µs

We probably need to revisit this when the scheduler lands - that one will
want to keep a short queue and generally will block for some request to
complete.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
  2015-03-23  8:31               ` Daniel Vetter
@ 2015-03-23  9:09                 ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-23  9:09 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, Rantala, Valtteri, Eero Tamminen, Daniel Vetter

On Mon, Mar 23, 2015 at 09:31:38AM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > > Window, but rendering internally at 1080p] clients we can run in
> > > parallel whilst hitting 60fps. And then whether allowing multiple
> > > spinners helps or hinders.
> > 
> > I was thinking of a nice easy test that could demonstrate any advantage
> > for spinning over waiting, and realised we already had such an igt. The
> > trick is that it has to generate sufficient GPU load to actually require
> > a wait, but not too high a GPU load such that we can see the impact from
> > slow completion.
> > 
> > I present igt/gem_exec_blt (modified to repeat the measurement and do an
> > average over several runs):
> > 
> > Time to blt 16384 bytes x      1:        21.000µs -> 5.800µs
> > Time to blt 16384 bytes x      2:        11.500µs -> 4.500µs
> > Time to blt 16384 bytes x      4:         6.750µs -> 3.750µs
> > Time to blt 16384 bytes x      8:         4.950µs -> 3.375µs
> > Time to blt 16384 bytes x     16:         3.825µs -> 3.175µs
> > Time to blt 16384 bytes x     32:         3.356µs -> 3.000µs 
> > Time to blt 16384 bytes x     64:         3.259µs -> 2.909µs
> > Time to blt 16384 bytes x    128:         3.083µs -> 3.095µs
> > Time to blt 16384 bytes x    256:         3.104µs -> 2.979µs
> > Time to blt 16384 bytes x    512:         3.080µs -> 3.089µs
> > Time to blt 16384 bytes x   1024:         3.077µs -> 3.040µs 
> > Time to blt 16384 bytes x   2048:         3.127µs -> 3.304µs
> > Time to blt 16384 bytes x   4096:         3.279µs -> 3.265µs
> 
> We probably need to revisit this when the scheduler lands - that one will
> want to keep a short queue and generally will block for some request to
> complete.

Speaking of which, execlists! You may have noticed that I
surreptitiously chose hsw to avoid the execlists overhead...

I was messing around over the weekend looking at the submission overhead
on bdw-u:
           -nightly  +spin     +hax      execlists=0
      x1:  23.600µs  18.400µs  15.200µs  6.800µs
      x2:  19.700µs  16.500µs  15.900µs  5.000µs
      x4:  15.600µs  12.250µs  12.500µs  4.450µs
      x8:  13.575µs  11.000µs  11.650µs  4.050µs
     x16:  10.812µs   9.738µs   9.875µs  3.900µs
     x32:   9.281µs   8.613µs   9.406µs  3.750µs
     x64:   8.088µs   7.988µs   8.806µs  3.703µs
    x128:   7.683µs   7.838µs   8.617µs  3.647µs
    x256:   9.481µs   7.301µs   8.091µs  3.409µs
    x512:   5.579µs   5.992µs   6.177µs  3.561µs
   x1024:  10.093µs   3.963µs   4.187µs  3.531µs
   x2048:  11.497µs   3.794µs   3.873µs  3.477µs
   x4096:   8.926µs   5.269µs   3.813µs  3.461µs

The hax are to remove the extra atomic ops and spinlocks imposed by
execlists. Steady state seems to be roughly on a par, with the difference
appearing to be interrupt latency + extra register writes. What's
interesting is the latency caused by the ELSP submission mechanism to an
idle GPU, a hard floor for us. It may even be worth papering over it by
starting execlists from a tasklet.
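
In case it isn't obvious what I mean by that, something like the
following (purely a sketch with invented struct and function names; the
real wiring into the execlists code would differ):

#include <linux/interrupt.h>

/* Hypothetical per-ring submission state; names invented for this sketch. */
struct elsp_submit_state {
	struct tasklet_struct tasklet;
	/* ... list of queued requests, protected by a lock ... */
};

static void elsp_submit_func(unsigned long data)
{
	struct elsp_submit_state *st = (struct elsp_submit_state *)data;

	/* Placeholder for the actual ELSP port writes (the expensive MMIO). */
	(void)st;
}

static void elsp_submit_init(struct elsp_submit_state *st)
{
	tasklet_init(&st->tasklet, elsp_submit_func, (unsigned long)st);
}

/*
 * Execbuf then only queues the request and kicks the tasklet, so the
 * forcewake + ELSP writes happen off the submitting thread.
 */
static void elsp_submit_kick(struct elsp_submit_state *st)
{
	tasklet_schedule(&st->tasklet);
}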

I do feel this sort of information is missing from the execlists
merge...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2015-03-23  9:10 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
2015-03-11 21:18 ` Chris Wilson
2015-03-12  9:07   ` Chris Wilson
2015-03-12  9:17   ` Daniel Vetter
2015-03-12 11:11     ` [PATCH v3] " Chris Wilson
2015-03-12 12:06       ` Chris Wilson
2015-03-12 13:14       ` Tvrtko Ursulin
2015-03-12 13:18         ` Chris Wilson
2015-03-12 15:18           ` Tvrtko Ursulin
2015-03-12 16:28             ` Chris Wilson
2015-03-12 16:41               ` Tvrtko Ursulin
2015-03-12 16:50                 ` Chris Wilson
2015-03-12 17:32                   ` Tvrtko Ursulin
2015-03-13  9:33                     ` Daniel Vetter
2015-03-12 19:27       ` shuang.he
2015-03-19 15:16       ` Chris Wilson
2015-03-20 14:54         ` Daniel Vetter
2015-03-20 15:27           ` Chris Wilson
2015-03-20 14:36       ` [PATCH v4] " Chris Wilson
2015-03-20 16:01         ` Tvrtko Ursulin
2015-03-20 16:19           ` Chris Wilson
2015-03-20 16:31             ` Tvrtko Ursulin
2015-03-23  8:29               ` Daniel Vetter
2015-03-20 22:59             ` Chris Wilson
2015-03-21  9:49               ` Chris Wilson
2015-03-23  8:31               ` Daniel Vetter
2015-03-23  9:09                 ` Chris Wilson
2015-03-20 21:30         ` shuang.he
2015-03-11 23:07 ` [PATCH v2] " shuang.he
