* [PATCH v2] drm/i915: Optimistically spin for the request completion
@ 2015-03-11 15:29 Chris Wilson
2015-03-11 21:18 ` Chris Wilson
2015-03-11 23:07 ` [PATCH v2] " shuang.he
0 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-11 15:29 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter
This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.
v2: Account for user timeouts
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index de15bd319bd0..f8895ed368cb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,6 +1191,32 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
}
+static int __i915_spin_request(struct drm_i915_gem_request *req,
+ unsigned long timeout)
+{
+ struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
+ struct drm_i915_private *dev_priv = to_i915(ring->dev);
+ int ret = -EBUSY;
+
+ intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+ while (!need_resched()) {
+ if (i915_gem_request_completed(req, true)) {
+ ret = 0;
+ goto out;
+ }
+
+ if (timeout && time_after_eq(jiffies, timeout))
+ break;
+
+ cpu_relax_lowlatency();
+ }
+ if (i915_gem_request_completed(req, false))
+ ret = 0;
+out:
+ intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+ return ret;
+}
+
/**
* __i915_wait_request - wait until execution of request has finished
* @req: duh!
@@ -1235,12 +1261,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
gen6_rps_boost(dev_priv, file_priv);
- if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
- return -ENODEV;
-
/* Record current time in case interrupted by signal, or wedged */
trace_i915_gem_request_wait_begin(req);
before = ktime_get_raw_ns();
+
+ /* Optimistic spin before touching IRQs */
+ ret = __i915_spin_request(req, timeout_expire);
+ if (ret == 0)
+ goto out;
+
+ if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+ ret = -ENODEV;
+ goto out;
+ }
+
for (;;) {
struct timer_list timer;
@@ -1289,14 +1323,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
destroy_timer_on_stack(&timer);
}
}
- now = ktime_get_raw_ns();
- trace_i915_gem_request_wait_end(req);
-
if (!irq_test_in_progress)
ring->irq_put(ring);
finish_wait(&ring->irq_queue, &wait);
+out:
+ now = ktime_get_raw_ns();
+ trace_i915_gem_request_wait_end(req);
+
if (timeout) {
s64 tres = *timeout - (now - before);
--
2.1.4
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
@ 2015-03-11 21:18 ` Chris Wilson
2015-03-12 9:07 ` Chris Wilson
2015-03-12 9:17 ` Daniel Vetter
2015-03-11 23:07 ` [PATCH v2] " shuang.he
1 sibling, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-11 21:18 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter
On Wed, Mar 11, 2015 at 03:29:19PM +0000, Chris Wilson wrote:
> + while (!need_resched()) {
> + if (i915_gem_request_completed(req, true)) {
> + ret = 0;
> + goto out;
> + }
> +
> + if (timeout && time_after_eq(jiffies, timeout))
> + break;
> +
> + cpu_relax_lowlatency();
> + }
Hmm. This unfortunately doesn't quite work the same as the optimistic
mutex spinner. The difference being that the scheduler knows that two
processes are contending for the mutex, but it doesn't know about the
contention between the thread and the GPU. The result is that unless
there is other activity on the system we simply degrade into a busy-spin
until the GPU is finished, rather than what I intended as spin for the
next ~10ms and then fallback to the interrupt.
Arguably busy-spinning on an idle system isn't totally evil, but it
certainly is likely to come at a power cost. On the other hand, spinning
is relatively rare outside of benchmarks. Rare enough to be useful?
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
2015-03-11 21:18 ` Chris Wilson
@ 2015-03-11 23:07 ` shuang.he
1 sibling, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-11 23:07 UTC (permalink / raw)
To: shuang.he, ethan.gao, intel-gfx, chris
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5934
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV -9 281/281 272/281
ILK -1 308/308 307/308
SNB -1 284/284 283/284
IVB 375/375 375/375
BYT 294/294 294/294
HSW 384/384 384/384
BDW 315/315 315/315
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
*PNV igt_gem_fence_thrash_bo-write-verify-none PASS(2) FAIL(1)NRUN(1)
*PNV igt_gem_fence_thrash_bo-write-verify-x PASS(2) FAIL(1)NO_RESULT(1)
*PNV igt_gem_fence_thrash_bo-write-verify-y PASS(2) FAIL(1)NO_RESULT(1)
PNV igt_gem_userptr_blits_coherency-sync CRASH(4)PASS(2) CRASH(1)PASS(1)
PNV igt_gen3_render_linear_blits FAIL(2)PASS(1) FAIL(1)PASS(1)
*PNV igt_gen3_render_mixed_blits FAIL(1)PASS(2) FAIL(1)NO_RESULT(1)
PNV igt_gen3_render_tiledx_blits FAIL(3)PASS(1) FAIL(1)PASS(1)
PNV igt_gen3_render_tiledy_blits FAIL(4)PASS(1) FAIL(1)PASS(1)
PNV igt_gem_fence_thrash_bo-write-verify-threaded-none CRASH(1)PASS(4) CRASH(1)PASS(1)
*ILK igt_kms_flip_wf_vblank-ts-check PASS(2) DMESG_WARN(1)PASS(1)
*SNB igt_gem_fenced_exec_thrash_too-many-fences PASS(4) DMESG_WARN(1)PASS(1)
Note: You need to pay more attention to line start with '*'
* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
2015-03-11 21:18 ` Chris Wilson
@ 2015-03-12 9:07 ` Chris Wilson
2015-03-12 9:17 ` Daniel Vetter
1 sibling, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 9:07 UTC (permalink / raw)
To: intel-gfx, Daniel Vetter
On Wed, Mar 11, 2015 at 09:18:19PM +0000, Chris Wilson wrote:
> Arguably busy-spinning on an idle system isn't totally evil, but it
> certainly is likely to come at a power cost. On the other hand, spinning
> is relatively rare outside of benchmarks. Rare enough to be useful?
As a counterpoint, I plugged in the qm_pos trick we use for lowlatency
dma interrupts (gmbus, dp-aux) and found no improvement at all over a
plain interrupt driven __i915_wait_request.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v2] drm/i915: Optimistically spin for the request completion
2015-03-11 21:18 ` Chris Wilson
2015-03-12 9:07 ` Chris Wilson
@ 2015-03-12 9:17 ` Daniel Vetter
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
1 sibling, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-12 9:17 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter
On Wed, Mar 11, 2015 at 09:18:19PM +0000, Chris Wilson wrote:
> On Wed, Mar 11, 2015 at 03:29:19PM +0000, Chris Wilson wrote:
> > + while (!need_resched()) {
> > + if (i915_gem_request_completed(req, true)) {
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > + if (timeout && time_after_eq(jiffies, timeout))
> > + break;
> > +
> > + cpu_relax_lowlatency();
> > + }
>
> Hmm. This unfortunately doesn't quite work the same as the optimistic
> mutex spinner. The difference being that the scheduler knows that two
> processes are contending for the mutex, but it doesn't know about the
> contention between the thread and the GPU. The result is that unless
> there is other activity on the system we simply degrade into a busy-spin
> until the GPU is finished, rather than what I intended as spin for the
> next ~10ms and then fallback to the interrupt.
>
> Arguably busy-spinning on an idle system isn't totally evil, but it
> certainly is likely to come at a power cost. On the other hand, spinning
> is relatively rare outside of benchmarks. Rare enough to be useful?
Maybe throw in a limit of a few useconds under the assumption that it's ok
to spin for roughly the average sleep+wakeup latency? And once spun out go
to sleep?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 9:17 ` Daniel Vetter
@ 2015-03-12 11:11 ` Chris Wilson
2015-03-12 12:06 ` Chris Wilson
` (4 more replies)
0 siblings, 5 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 11:11 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter
This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.
v2: Account for user timeouts
v3: Limit the spinning to a single jiffie (~1us) at most. On an
otherwise idle system, there is no scheduler contention and so without a
limit we would spin until the GPU is ready.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
drivers/gpu/drm/i915/i915_gem.c | 48 +++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1cc46558bf2a..df96f9704cf2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,6 +1191,33 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
}
+static int __i915_spin_request(struct drm_i915_gem_request *req)
+{
+ struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
+ struct drm_i915_private *dev_priv = to_i915(ring->dev);
+ unsigned long timeout;
+ int ret = -EBUSY;
+
+ intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+ timeout = jiffies + 1;
+ while (!need_resched()) {
+ if (i915_gem_request_completed(req, true)) {
+ ret = 0;
+ goto out;
+ }
+
+ if (time_after_eq(jiffies, timeout))
+ break;
+
+ cpu_relax_lowlatency();
+ }
+ if (i915_gem_request_completed(req, false))
+ ret = 0;
+out:
+ intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+ return ret;
+}
+
/**
* __i915_wait_request - wait until execution of request has finished
* @req: duh!
@@ -1235,12 +1262,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
gen6_rps_boost(dev_priv, file_priv);
- if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
- return -ENODEV;
-
/* Record current time in case interrupted by signal, or wedged */
trace_i915_gem_request_wait_begin(req);
before = ktime_get_raw_ns();
+
+ /* Optimistic spin for the next jiffie before touching IRQs */
+ ret = __i915_spin_request(req);
+ if (ret == 0)
+ goto out;
+
+ if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+ ret = -ENODEV;
+ goto out;
+ }
+
for (;;) {
struct timer_list timer;
@@ -1289,14 +1324,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
destroy_timer_on_stack(&timer);
}
}
- now = ktime_get_raw_ns();
- trace_i915_gem_request_wait_end(req);
-
if (!irq_test_in_progress)
ring->irq_put(ring);
finish_wait(&ring->irq_queue, &wait);
+out:
+ now = ktime_get_raw_ns();
+ trace_i915_gem_request_wait_end(req);
+
if (timeout) {
s64 tres = *timeout - (now - before);
--
2.1.4
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
@ 2015-03-12 12:06 ` Chris Wilson
2015-03-12 13:14 ` Tvrtko Ursulin
` (3 subsequent siblings)
4 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 12:06 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter
On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
I'm really happy with this tradeoff. Can we throw this to the vultures
to see if anyone notices?
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
2015-03-12 12:06 ` Chris Wilson
@ 2015-03-12 13:14 ` Tvrtko Ursulin
2015-03-12 13:18 ` Chris Wilson
2015-03-12 19:27 ` shuang.he
` (2 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 13:14 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter
On 03/12/2015 11:11 AM, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
Isn't one jiffie 1-10ms typically?
Regards,
Tvrtko
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 13:14 ` Tvrtko Ursulin
@ 2015-03-12 13:18 ` Chris Wilson
2015-03-12 15:18 ` Tvrtko Ursulin
0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 13:18 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx
On Thu, Mar 12, 2015 at 01:14:30PM +0000, Tvrtko Ursulin wrote:
> On 03/12/2015 11:11 AM, Chris Wilson wrote:
> >This provides a nice boost to mesa in swap bound scenarios (as mesa
> >throttles itself to the previous frame and given the scenario that will
> >complete shortly). It will also provide a good boost to systems running
> >with semaphores disabled and so frequently waiting on the GPU as it
> >switches rings. In the most favourable of microbenchmarks, this can
> >increase performance by around 15% - though in practice improvements
> >will be marginal and rarely noticeable.
> >
> >v2: Account for user timeouts
> >v3: Limit the spinning to a single jiffie (~1us) at most. On an
> >otherwise idle system, there is no scheduler contention and so without a
> >limit we would spin until the GPU is ready.
>
> Isn't one jiffie 1-10ms typically?
1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
was a jiffie, hence the confusion. At any rate, it is still the minimum
we can trivially wait for (without an expensive hrtimer).
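The HZ dependence Tvrtko raises is simple arithmetic (hypothetical helper shown; the kernel's own jiffies_to_usecs() does the equivalent conversion):

```c
#include <assert.h>

/* Microseconds per scheduler tick for a given CONFIG_HZ value. */
static long jiffie_usecs(long hz)
{
	return 1000000L / hz;
}
```

So with HZ=1000 one jiffie is 1 ms (not 1 us as the changelog says), with HZ=250 it is 4 ms, and with HZ=100 it is 10 ms.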
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 13:18 ` Chris Wilson
@ 2015-03-12 15:18 ` Tvrtko Ursulin
2015-03-12 16:28 ` Chris Wilson
0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 15:18 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter
On 03/12/2015 01:18 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 01:14:30PM +0000, Tvrtko Ursulin wrote:
>> On 03/12/2015 11:11 AM, Chris Wilson wrote:
>>> This provides a nice boost to mesa in swap bound scenarios (as mesa
>>> throttles itself to the previous frame and given the scenario that will
>>> complete shortly). It will also provide a good boost to systems running
>>> with semaphores disabled and so frequently waiting on the GPU as it
>>> switches rings. In the most favourable of microbenchmarks, this can
>>> increase performance by around 15% - though in practice improvements
>>> will be marginal and rarely noticeable.
>>>
>>> v2: Account for user timeouts
>>> v3: Limit the spinning to a single jiffie (~1us) at most. On an
>>> otherwise idle system, there is no scheduler contention and so without a
>>> limit we would spin until the GPU is ready.
>>
>> Isn't one jiffie 1-10ms typically?
>
> 1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
> was a jiffie, hence the confusion. At any rate, it is still the minimum
> we can trivially wait for (without an expensive hrtimer).
Unless I lost track with the times, that's CONFIG_HZ right?
I don't know what server distributions do, but this Ubuntu LTS I am
running has HZ=250 which means 4ms.
That would mean on a system where throughput is more important than
latency, you lose most throughput by spinning the longest. In theory at
least, no?
So perhaps this should be a tunable? Optionally auto-select the initial
state based on HZ.
Regards,
Tvrtko
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 15:18 ` Tvrtko Ursulin
@ 2015-03-12 16:28 ` Chris Wilson
2015-03-12 16:41 ` Tvrtko Ursulin
0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 16:28 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx
On Thu, Mar 12, 2015 at 03:18:01PM +0000, Tvrtko Ursulin wrote:
> On 03/12/2015 01:18 PM, Chris Wilson wrote:
> >1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
> >was a jiffie, hence the confusion. At any rate, it is still the minimum
> >we can trivially wait for (without an expensive hrtimer).
>
> Unless I lost track with the times, that's CONFIG_HZ right?
>
> I don't know what server distributions do, but this Ubuntu LTS I am
> running has HZ=250 which means 4ms.
>
> That would mean on a system where throughput is more important than
> latency, you lose most throughput by spinning the longest. In theory
> at least, no?
Only in theory, and only if you mean throughput of non-i915 workloads
with preemption disabled. Spinning here improves both latency and
throughput for gfx clients. Using up the timeslice for the client may
have secondary effects though - otherwise they would get iowait credits.
> So perhaps this should be a tunable? Optionally auto-select the
> initial state based on HZ.
Spinning less than a jiffie requires hrtimers at which point you may as
well just use the i915 interrupt (rather than setup a timer interrupt).
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 16:28 ` Chris Wilson
@ 2015-03-12 16:41 ` Tvrtko Ursulin
2015-03-12 16:50 ` Chris Wilson
0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 16:41 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter
On 03/12/2015 04:28 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 03:18:01PM +0000, Tvrtko Ursulin wrote:
>> On 03/12/2015 01:18 PM, Chris Wilson wrote:
>>> 1ms. I was just thinking of doing USECS_PER_SEC / HZ, then realised that
>>> was a jiffie, hence the confusion. At any rate, it is still the minimum
>>> we can trivially wait for (without an expensive hrtimer).
>>
>> Unless I lost track with the times, that's CONFIG_HZ right?
>>
>> I don't know what server distributions do, but this Ubuntu LTS I am
>> running has HZ=250 which means 4ms.
>>
>> That would mean on a system where throughput is more important than
>> latency, you lose most throughput by spinning the longest. In theory
>> at least, no?
>
> Only in theory, and only if you mean throughput of non-i915 workloads
> with preemption disabled. Spinning here improves both latency and
> throughput for gfx clients. Using up the timeslice for the client may
> have secondary effects though - otherwise they would get iowait credits.
Yes, I meant CPU workloads. And low HZ and no preemption usually go
together.
>> So perhaps this should be a tunable? Optionally auto-select the
>> initial state based on HZ.
>
> Spinning less than a jiffie requires hrtimers at which point you may as
> well just use the i915 interrupt (rather than setup a timer interrupt).
Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
Maybe default to "on" on HZ=1000 with preemption, or the opposite,
something like that.
Regards,
Tvrtko
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 16:41 ` Tvrtko Ursulin
@ 2015-03-12 16:50 ` Chris Wilson
2015-03-12 17:32 ` Tvrtko Ursulin
0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-12 16:50 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx
On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
> Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
> Maybe default to "on" on HZ=1000 with preemption, or the opposite,
> something like that.
I don't see the point in having the complication, until someone
complains. In my defense, I will point to the optimistic mutex spinning
equally having no configurable option. And since the idea is that you
only hit this if you are abusing i915 (e.g. benchmarking, or you have a
readback on the critical path, or if we haven't implemented
semaphores), I would rather fix those scenarios as they arrive rather
than giving the user an option to break.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 16:50 ` Chris Wilson
@ 2015-03-12 17:32 ` Tvrtko Ursulin
2015-03-13 9:33 ` Daniel Vetter
0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-12 17:32 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter
On 03/12/2015 04:50 PM, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
>> Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
>> Maybe default to "on" on HZ=1000 with preemption, or the opposite,
>> something like that.
>
> I don't see the point in having the complication, until someone
> complains. In my defense, I will point to the optimistic mutex spinning
> equally having no configurable option. And since the idea is that you
> only hit this if you are abusing i915 (e.g. benchmarking, or you have a
> readback on the critical patch, or if we haven't implemented
> semaphores), I would rather fix those scenarios as they arrive rather
> than giving the user an option to break.
I simply pointed out a theoretical potential to burn more CPU on
servers. If you know that is unlikely or only theoretical that's fine by me.
But I'll say I was more convinced before you mentioned "until someone
complains" and "option to break". :)
Regards,
Tvrtko
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
2015-03-12 12:06 ` Chris Wilson
2015-03-12 13:14 ` Tvrtko Ursulin
@ 2015-03-12 19:27 ` shuang.he
2015-03-19 15:16 ` Chris Wilson
2015-03-20 14:36 ` [PATCH v4] " Chris Wilson
4 siblings, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-12 19:27 UTC (permalink / raw)
To: shuang.he, ethan.gao, intel-gfx, chris
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5941
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV -7 274/274 267/274
ILK 301/301 301/301
SNB -1 277/277 276/277
IVB 341/341 341/341
BYT 287/287 287/287
HSW 360/360 360/360
BDW 308/308 308/308
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
*PNV igt_gem_fence_thrash_bo-write-verify-none PASS(4) FAIL(1)PASS(1)
*PNV igt_gem_fence_thrash_bo-write-verify-x PASS(4) FAIL(1)PASS(1)
*PNV igt_gem_fence_thrash_bo-write-verify-y PASS(4) FAIL(1)PASS(1)
PNV igt_gem_userptr_blits_coherency-sync CRASH(5)PASS(4) CRASH(1)PASS(1)
PNV igt_gem_userptr_blits_coherency-unsync CRASH(1)PASS(3) CRASH(1)PASS(1)
PNV igt_gen3_render_linear_blits FAIL(3)PASS(4) FAIL(1)PASS(1)
PNV igt_gen3_render_mixed_blits FAIL(3)PASS(4) FAIL(2)
*SNB igt_gem_mmap_gtt_write-read PASS(2) DMESG_WARN(1)PASS(1)
Note: You need to pay more attention to line start with '*'
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 17:32 ` Tvrtko Ursulin
@ 2015-03-13 9:33 ` Daniel Vetter
0 siblings, 0 replies; 29+ messages in thread
From: Daniel Vetter @ 2015-03-13 9:33 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx
On Thu, Mar 12, 2015 at 05:32:26PM +0000, Tvrtko Ursulin wrote:
>
> On 03/12/2015 04:50 PM, Chris Wilson wrote:
> >On Thu, Mar 12, 2015 at 04:41:10PM +0000, Tvrtko Ursulin wrote:
> >>Yes I didn't mean that - but to have a boolean spinning-wait=on/off.
> >>Maybe default to "on" on HZ=1000 with preemption, or the opposite,
> >>something like that.
> >
> >I don't see the point in having the complication, until someone
> >complains. In my defense, I will point to the optimistic mutex spinning
> >equally having no configurable option. And since the idea is that you
> >only hit this if you are abusing i915 (e.g. benchmarking, or you have a
> >readback on the critical path, or if we haven't implemented
> >semaphores), I would rather fix those scenarios as they arrive rather
> >than giving the user an option to break.
>
> I simply pointed out a theoretical potential to burn more CPU on servers. If
> you know that is unlikely or only theoretical that's fine by me.
>
> But I'll say I was more convinced before you mentioned "until someone
> complains" and "option to break". :)
Server workloads on intel machines are currently transcode stuff (and
there's more room in libva for better parallelism than this) and ocl
(which tends to go with the busy-loops for synchronization model anyway).
Also we'll stop spinning the second someone else wants the cpu, so there's
really just secondary effects of the scheduler due to fewer iowait credits
and more timeslices used up. I'll go with Chris and will start to care
when someone complains with real benchmark numbers attached ;-)
Of course we still want broader numbers for the benchmark benefits of this
one first.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
` (2 preceding siblings ...)
2015-03-12 19:27 ` shuang.he
@ 2015-03-19 15:16 ` Chris Wilson
2015-03-20 14:54 ` Daniel Vetter
2015-03-20 14:36 ` [PATCH v4] " Chris Wilson
4 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-03-19 15:16 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter
On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Just recording ideas for the future. Replace the busy-spin with
monitor/mwait. This requires Pentium4+, a cooperating GPU with working
cacheline snooping and that we use HWS seqno.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 85e71e0e2340..454a38d4caa3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -37,6 +37,7 @@
#include <linux/swap.h>
#include <linux/pci.h>
#include <linux/dma-buf.h>
+#include <asm/mwait.h>
#define RQ_BUG_ON(expr)
@@ -1187,18 +1188,42 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
unsigned long timeout;
int ret = -EBUSY;
+ if (ring->irq_refcount) /* IRQ is already active, keep using it */
+ return ret;
+
intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
timeout = jiffies + 1;
- while (!need_resched()) {
- if (i915_gem_request_completed(req, true)) {
- ret = 0;
- goto out;
- }
+ if (this_cpu_has(X86_FEATURE_MWAIT)) {
+ do {
+ unsigned long ecx = 1; /* break on interrupt */
+ unsigned long eax = 0; /* cstate */
- if (time_after_eq(jiffies, timeout))
- break;
+ __monitor((void *)&ring->status_page.page_addr[I915_GEM_HWS_INDEX], 0, 0);
+ if (need_resched())
+ break;
+
+ if (i915_gem_request_completed(req, true)) {
+ ret = 0;
+ goto out;
+ }
- cpu_relax_lowlatency();
+ if (time_after_eq(jiffies, timeout))
+ break;
+
+ __mwait(eax, ecx);
+ } while (1);
+ } else {
+ while (!need_resched()) {
+ if (i915_gem_request_completed(req, true)) {
+ ret = 0;
+ goto out;
+ }
+
+ if (time_after_eq(jiffies, timeout))
+ break;
+
+ cpu_relax_lowlatency();
+ }
}
if (i915_gem_request_completed(req, false))
ret = 0;
--
Chris Wilson, Intel Open Source Technology Centre
* [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
` (3 preceding siblings ...)
2015-03-19 15:16 ` Chris Wilson
@ 2015-03-20 14:36 ` Chris Wilson
2015-03-20 16:01 ` Tvrtko Ursulin
2015-03-20 21:30 ` shuang.he
4 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 14:36 UTC (permalink / raw)
To: intel-gfx; +Cc: Daniel Vetter, Eero Tamminen, Rantala, Valtteri
This provides a nice boost to mesa in swap bound scenarios (as mesa
throttles itself to the previous frame and given the scenario that will
complete shortly). It will also provide a good boost to systems running
with semaphores disabled and so frequently waiting on the GPU as it
switches rings. In the most favourable of microbenchmarks, this can
increase performance by around 15% - though in practice improvements
will be marginal and rarely noticeable.
v2: Account for user timeouts
v3: Limit the spinning to a single jiffie (~1us) at most. On an
otherwise idle system, there is no scheduler contention and so without a
limit we would spin until the GPU is ready.
v4: Drop forcewake - the lazy coherent access doesn't require it, and we
have no reason to believe that the forcewake itself improves seqno
coherency - it only adds delay.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
---
drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
1 file changed, 38 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2e17a254aac1..9988e65c1440 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
}
+static int __i915_spin_request(struct drm_i915_gem_request *rq)
+{
+ unsigned long timeout;
+
+ if (i915_gem_request_get_ring(rq)->irq_refcount)
+ return -EBUSY;
+
+ timeout = jiffies + 1;
+ while (!need_resched()) {
+ if (i915_gem_request_completed(rq, true))
+ return 0;
+
+ if (time_after_eq(jiffies, timeout))
+ break;
+
+ cpu_relax_lowlatency();
+ }
+ if (i915_gem_request_completed(rq, false))
+ return 0;
+
+ return -EAGAIN;
+}
+
/**
* __i915_wait_request - wait until execution of request has finished
* @req: duh!
@@ -1225,12 +1248,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
gen6_rps_boost(dev_priv, file_priv);
- if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
- return -ENODEV;
-
/* Record current time in case interrupted by signal, or wedged */
trace_i915_gem_request_wait_begin(req);
before = ktime_get_raw_ns();
+
+ /* Optimistic spin for the next jiffie before touching IRQs */
+ ret = __i915_spin_request(req);
+ if (ret == 0)
+ goto out;
+
+ if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
+ ret = -ENODEV;
+ goto out;
+ }
+
for (;;) {
struct timer_list timer;
@@ -1279,14 +1310,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
destroy_timer_on_stack(&timer);
}
}
- now = ktime_get_raw_ns();
- trace_i915_gem_request_wait_end(req);
-
if (!irq_test_in_progress)
ring->irq_put(ring);
finish_wait(&ring->irq_queue, &wait);
+out:
+ now = ktime_get_raw_ns();
+ trace_i915_gem_request_wait_end(req);
+
if (timeout) {
s64 tres = *timeout - (now - before);
--
2.1.4
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-19 15:16 ` Chris Wilson
@ 2015-03-20 14:54 ` Daniel Vetter
2015-03-20 15:27 ` Chris Wilson
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-20 14:54 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter
On Thu, Mar 19, 2015 at 03:16:15PM +0000, Chris Wilson wrote:
> On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> > This provides a nice boost to mesa in swap bound scenarios (as mesa
> > throttles itself to the previous frame and given the scenario that will
> > complete shortly). It will also provide a good boost to systems running
> > with semaphores disabled and so frequently waiting on the GPU as it
> > switches rings. In the most favourable of microbenchmarks, this can
> > increase performance by around 15% - though in practice improvements
> > will be marginal and rarely noticeable.
> >
> > v2: Account for user timeouts
> > v3: Limit the spinning to a single jiffie (~1us) at most. On an
> > otherwise idle system, there is no scheduler contention and so without a
> > limit we would spin until the GPU is ready.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Just recording ideas for the future. Replace the busy-spin with
> monitor/mwait. This requires Pentium4+, a cooperating GPU with working
> cacheline snooping and that we use HWS seqno.
Just for the record: Did it help with powersaving or was it all in the
noise?
-Daniel
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 85e71e0e2340..454a38d4caa3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -37,6 +37,7 @@
> #include <linux/swap.h>
> #include <linux/pci.h>
> #include <linux/dma-buf.h>
> +#include <asm/mwait.h>
>
> #define RQ_BUG_ON(expr)
>
> @@ -1187,18 +1188,42 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
> unsigned long timeout;
> int ret = -EBUSY;
>
> + if (ring->irq_refcount) /* IRQ is already active, keep using it */
> + return ret;
> +
> intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> timeout = jiffies + 1;
> - while (!need_resched()) {
> - if (i915_gem_request_completed(req, true)) {
> - ret = 0;
> - goto out;
> - }
> + if (this_cpu_has(X86_FEATURE_MWAIT)) {
> + do {
> + unsigned long ecx = 1; /* break on interrupt */
> + unsigned long eax = 0; /* cstate */
>
> - if (time_after_eq(jiffies, timeout))
> - break;
> + __monitor((void *)&ring->status_page.page_addr[I915_GEM_HWS_INDEX], 0, 0);
> + if (need_resched())
> + break;
> +
> + if (i915_gem_request_completed(req, true)) {
> + ret = 0;
> + goto out;
> + }
>
> - cpu_relax_lowlatency();
> + if (time_after_eq(jiffies, timeout))
> + break;
> +
> + __mwait(eax, ecx);
> + } while (1);
> + } else {
> + while (!need_resched()) {
> + if (i915_gem_request_completed(req, true)) {
> + ret = 0;
> + goto out;
> + }
> +
> + if (time_after_eq(jiffies, timeout))
> + break;
> +
> + cpu_relax_lowlatency();
> + }
> }
> if (i915_gem_request_completed(req, false))
> ret = 0;
>
>
> --
> Chris Wilson, Intel Open Source Technology Centre
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH v3] drm/i915: Optimistically spin for the request completion
2015-03-20 14:54 ` Daniel Vetter
@ 2015-03-20 15:27 ` Chris Wilson
0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 15:27 UTC (permalink / raw)
To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx
On Fri, Mar 20, 2015 at 03:54:01PM +0100, Daniel Vetter wrote:
> On Thu, Mar 19, 2015 at 03:16:15PM +0000, Chris Wilson wrote:
> > On Thu, Mar 12, 2015 at 11:11:17AM +0000, Chris Wilson wrote:
> > > This provides a nice boost to mesa in swap bound scenarios (as mesa
> > > throttles itself to the previous frame and given the scenario that will
> > > complete shortly). It will also provide a good boost to systems running
> > > with semaphores disabled and so frequently waiting on the GPU as it
> > > switches rings. In the most favourable of microbenchmarks, this can
> > > increase performance by around 15% - though in practice improvements
> > > will be marginal and rarely noticeable.
> > >
> > > v2: Account for user timeouts
> > > v3: Limit the spinning to a single jiffie (~1us) at most. On an
> > > otherwise idle system, there is no scheduler contention and so without a
> > > limit we would spin until the GPU is ready.
> > >
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > Just recording ideas for the future. Replace the busy-spin with
> > monitor/mwait. This requires Pentium4+, a cooperating GPU with working
> > cacheline snooping and that we use HWS seqno.
>
> Just for the record: Did it help with powersaving or was it all in the
> noise?
Unscientifically, I would say mwait(cstate=0) was worse. It gave a
marginally higher peak, but there was clearly worse thermal throttling
than the simple busy-wait. powertop suggests that with the mwait we were
not reaching as low a package cstate as often.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 14:36 ` [PATCH v4] " Chris Wilson
@ 2015-03-20 16:01 ` Tvrtko Ursulin
2015-03-20 16:19 ` Chris Wilson
2015-03-20 21:30 ` shuang.he
1 sibling, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-20 16:01 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter, Eero Tamminen, Rantala, Valtteri
On 03/20/2015 02:36 PM, Chris Wilson wrote:
> This provides a nice boost to mesa in swap bound scenarios (as mesa
> throttles itself to the previous frame and given the scenario that will
> complete shortly). It will also provide a good boost to systems running
> with semaphores disabled and so frequently waiting on the GPU as it
> switches rings. In the most favourable of microbenchmarks, this can
> increase performance by around 15% - though in practice improvements
> will be marginal and rarely noticeable.
>
> v2: Account for user timeouts
> v3: Limit the spinning to a single jiffie (~1us) at most. On an
> otherwise idle system, there is no scheduler contention and so without a
> limit we would spin until the GPU is ready.
> v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> have no reason to believe that the forcewake itself improves seqno
> coherency - it only adds delay.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Still against a toggle switch like a simple module parameter?
> ---
> drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2e17a254aac1..9988e65c1440 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
> return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
> }
>
> +static int __i915_spin_request(struct drm_i915_gem_request *rq)
> +{
> + unsigned long timeout;
> +
> + if (i915_gem_request_get_ring(rq)->irq_refcount)
> + return -EBUSY;
So if someone else is already waiting on this ring, skip the spin and
do the sleep-wait.
That would mean the earlier waiter didn't manage to spin to completion,
so does it make sense for subsequent ones to try to spin? If we assume
waiters arrive here in submission order then no, they should proceed to
sleep-wait. But if waiters arrive here in random order, and that is
purely up to userspace I think, then I am not sure?
On the other hand, if we allowed these "out-of-order waiters" to spin,
that would potentially be too much spinning, so maybe it is better like it is.
> + timeout = jiffies + 1;
> + while (!need_resched()) {
> + if (i915_gem_request_completed(rq, true))
> + return 0;
> +
> + if (time_after_eq(jiffies, timeout))
> + break;
> +
> + cpu_relax_lowlatency();
> + }
> + if (i915_gem_request_completed(rq, false))
> + return 0;
> +
> + return -EAGAIN;
> +}
> +
> /**
> * __i915_wait_request - wait until execution of request has finished
> * @req: duh!
> @@ -1225,12 +1248,20 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> if (ring->id == RCS && INTEL_INFO(dev)->gen >= 6)
> gen6_rps_boost(dev_priv, file_priv);
>
> - if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring)))
> - return -ENODEV;
> -
> /* Record current time in case interrupted by signal, or wedged */
> trace_i915_gem_request_wait_begin(req);
> before = ktime_get_raw_ns();
> +
> + /* Optimistic spin for the next jiffie before touching IRQs */
> + ret = __i915_spin_request(req);
> + if (ret == 0)
> + goto out;
> +
> + if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> for (;;) {
> struct timer_list timer;
>
> @@ -1279,14 +1310,15 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> destroy_timer_on_stack(&timer);
> }
> }
> - now = ktime_get_raw_ns();
> - trace_i915_gem_request_wait_end(req);
> -
> if (!irq_test_in_progress)
> ring->irq_put(ring);
>
> finish_wait(&ring->irq_queue, &wait);
>
> +out:
> + now = ktime_get_raw_ns();
> + trace_i915_gem_request_wait_end(req);
> +
> if (timeout) {
> s64 tres = *timeout - (now - before);
>
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Regards,
Tvrtko
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 16:01 ` Tvrtko Ursulin
@ 2015-03-20 16:19 ` Chris Wilson
2015-03-20 16:31 ` Tvrtko Ursulin
2015-03-20 22:59 ` Chris Wilson
0 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 16:19 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx, Rantala, Valtteri, Eero Tamminen
On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
>
> On 03/20/2015 02:36 PM, Chris Wilson wrote:
> >This provides a nice boost to mesa in swap bound scenarios (as mesa
> >throttles itself to the previous frame and given the scenario that will
> >complete shortly). It will also provide a good boost to systems running
> >with semaphores disabled and so frequently waiting on the GPU as it
> >switches rings. In the most favourable of microbenchmarks, this can
> >increase performance by around 15% - though in practice improvements
> >will be marginal and rarely noticeable.
> >
> >v2: Account for user timeouts
> >v3: Limit the spinning to a single jiffie (~1us) at most. On an
> >otherwise idle system, there is no scheduler contention and so without a
> >limit we would spin until the GPU is ready.
> >v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> >have no reason to believe that the forcewake itself improves seqno
> >coherency - it only adds delay.
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> >Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
>
> Still against a toggle switch like a simple module parameter?
Yes. I'd much rather tackle the corner cases than ignore them.
> >---
> > drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
> > 1 file changed, 38 insertions(+), 6 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 2e17a254aac1..9988e65c1440 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
> > return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
> > }
> >
> >+static int __i915_spin_request(struct drm_i915_gem_request *rq)
> >+{
> >+ unsigned long timeout;
> >+
> >+ if (i915_gem_request_get_ring(rq)->irq_refcount)
> >+ return -EBUSY;
>
> So if someone else is already waiting on this ring skip the spin and
> do the sleep-wait.
>
> That would mean earlier waiter didn't manage to spin to completion
> so for subsequent ones does it make sense to try to spin? If we
> assume waiters are arriving here in submission order then no, they
> should proceed to sleep-wait. But if waiters are arriving here in
> random order, and that is purely up to userspace I think, then I am
> not sure?
They arrive pretty much in random order.
> On the other hand if we allowed this "out-of-order waiters" that
> would potentially be too much spinning so maybe it is better like it
> is.
Also part of my premise is that it is the cost of the irq (setting it up and
handling all the intermediate ones) that is the crux of the issue. Once
we have enabled the irq for one, we may as well then use it for the
herd. Also with a herd we will want to err on the side of sleeping more,
or so my intuition says.
I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
Window, but rendering internally at 1080p] clients we can run in
parallel whilst hitting 60fps. And then whether allowing multiple
spinners helps or hinders.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 16:19 ` Chris Wilson
@ 2015-03-20 16:31 ` Tvrtko Ursulin
2015-03-23 8:29 ` Daniel Vetter
2015-03-20 22:59 ` Chris Wilson
1 sibling, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2015-03-20 16:31 UTC (permalink / raw)
To: Chris Wilson, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala, Valtteri
On 03/20/2015 04:19 PM, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/20/2015 02:36 PM, Chris Wilson wrote:
>>> This provides a nice boost to mesa in swap bound scenarios (as mesa
>>> throttles itself to the previous frame and given the scenario that will
>>> complete shortly). It will also provide a good boost to systems running
>>> with semaphores disabled and so frequently waiting on the GPU as it
>>> switches rings. In the most favourable of microbenchmarks, this can
>>> increase performance by around 15% - though in practice improvements
>>> will be marginal and rarely noticeable.
>>>
>>> v2: Account for user timeouts
>>> v3: Limit the spinning to a single jiffie (~1us) at most. On an
>>> otherwise idle system, there is no scheduler contention and so without a
>>> limit we would spin until the GPU is ready.
>>> v4: Drop forcewake - the lazy coherent access doesn't require it, and we
>>> have no reason to believe that the forcewake itself improves seqno
>>> coherency - it only adds delay.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>>> Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
>>
>> Still against a toggle switch like a simple module parameter?
>
> Yes. I'd much rather tackle the corner cases than ignore them.
I'll say it once more then leave it be - my point of view is that a module
param does not mean ignoring any issues. It rather means that, if a
pathological use case is found in the field, you can provide a better
user experience and then work in parallel on coming up with improvements.
Your view is probably that if there is a toggle, someone somewhere will
put on some wiki "yeah if that happens just put this in modprobe.conf"
and there won't even be a bug report.
It is a downside, yes, but to me much better than finding a bad corner
case and then saying "Yeah, please recompile your kernel", or downgrading
your kernel and waiting for X weeks/months until the fix propagates.
>>> ---
>>> drivers/gpu/drm/i915/i915_gem.c | 44 +++++++++++++++++++++++++++++++++++------
>>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 2e17a254aac1..9988e65c1440 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -1181,6 +1181,29 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>>> return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
>>> }
>>>
>>> +static int __i915_spin_request(struct drm_i915_gem_request *rq)
>>> +{
>>> + unsigned long timeout;
>>> +
>>> + if (i915_gem_request_get_ring(rq)->irq_refcount)
>>> + return -EBUSY;
>>
>> So if someone else is already waiting on this ring skip the spin and
>> do the sleep-wait.
>>
>> That would mean earlier waiter didn't manage to spin to completion
>> so for subsequent ones does it make sense to try to spin? If we
>> assume waiters are arriving here in submission order then no, they
>> should proceed to sleep-wait. But if waiters are arriving here in
>> random order, and that is purely up to userspace I think, then I am
>> not sure?
>
> They arrive pretty much in random order.
>
>> On the other hand if we allowed this "out-of-order waiters" that
>> would potentially be too much spinning so maybe it is better like it
>> is.
>
> Also part of my premise is that it the cost of the irq (setting it up and
> handling all the intermidate ones) that is the crux of the issue. Once
> we have enabled the irq for one, we may as well then use it for the
> herd. Also with a herd we will want to err on the side of sleeping more,
> or so my intuition says.
My gut feeling is the same. Waking thundering herd sounds safer than
spinning one.
Regards,
Tvrtko
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 14:36 ` [PATCH v4] " Chris Wilson
2015-03-20 16:01 ` Tvrtko Ursulin
@ 2015-03-20 21:30 ` shuang.he
1 sibling, 0 replies; 29+ messages in thread
From: shuang.he @ 2015-03-20 21:30 UTC (permalink / raw)
To: shuang.he, ethan.gao, intel-gfx, chris
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6019
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV -1 274/274 273/274
ILK 303/303 303/303
SNB 303/303 303/303
IVB 342/342 342/342
BYT 287/287 287/287
HSW 362/362 362/362
BDW 308/308 308/308
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
*PNV igt_gen3_render_linear_blits PASS(5) CRASH(1) PASS(1)
Note: You need to pay more attention to line start with '*'
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 16:19 ` Chris Wilson
2015-03-20 16:31 ` Tvrtko Ursulin
@ 2015-03-20 22:59 ` Chris Wilson
2015-03-21 9:49 ` Chris Wilson
2015-03-23 8:31 ` Daniel Vetter
1 sibling, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-20 22:59 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala, Valtteri
On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> Window, but rendering internally at 1080p] clients we can run in
> parallel whilst hitting 60fps. And then whether allowing multiple
> spinners helps or hinders.
I was thinking of a nice easy test that could demonstrate any advantage
for spinning over waiting, and realised we already had such an igt. The
trick is that it has to generate sufficient GPU load to actually require
a wait, but not too high a GPU load such that we can see the impact from
slow completion.
I present igt/gem_exec_blt (modified to repeat the measurement and do an
average over several runs):
Time to blt 16384 bytes x 1: 21.000µs -> 5.800µs
Time to blt 16384 bytes x 2: 11.500µs -> 4.500µs
Time to blt 16384 bytes x 4: 6.750µs -> 3.750µs
Time to blt 16384 bytes x 8: 4.950µs -> 3.375µs
Time to blt 16384 bytes x 16: 3.825µs -> 3.175µs
Time to blt 16384 bytes x 32: 3.356µs -> 3.000µs
Time to blt 16384 bytes x 64: 3.259µs -> 2.909µs
Time to blt 16384 bytes x 128: 3.083µs -> 3.095µs
Time to blt 16384 bytes x 256: 3.104µs -> 2.979µs
Time to blt 16384 bytes x 512: 3.080µs -> 3.089µs
Time to blt 16384 bytes x 1024: 3.077µs -> 3.040µs
Time to blt 16384 bytes x 2048: 3.127µs -> 3.304µs
Time to blt 16384 bytes x 4096: 3.279µs -> 3.265µs
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 22:59 ` Chris Wilson
@ 2015-03-21 9:49 ` Chris Wilson
2015-03-23 8:31 ` Daniel Vetter
1 sibling, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-21 9:49 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx, Daniel Vetter, Eero Tamminen, Rantala, Valtteri
On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > Window, but rendering internally at 1080p] clients we can run in
> > parallel whilst hitting 60fps. And then whether allowing multiple
> > spinners helps or hinders.
>
> I was thinking of a nice easy test that could demonstrate any advantage
> for spinning over waiting, and realised we already had such an igt. The
> trick is that it has to generate sufficient GPU load to actually require
> a wait, but not too high a GPU load such that we can see the impact from
> slow completion.
>
> I present igt/gem_exec_blt (modified to repeat the measurement and do an
> average over several runs):
>
> Time to blt 16384 bytes x 1: 21.000µs -> 5.800µs
> Time to blt 16384 bytes x 2: 11.500µs -> 4.500µs
> Time to blt 16384 bytes x 4: 6.750µs -> 3.750µs
> Time to blt 16384 bytes x 8: 4.950µs -> 3.375µs
> Time to blt 16384 bytes x 16: 3.825µs -> 3.175µs
> Time to blt 16384 bytes x 32: 3.356µs -> 3.000µs
> Time to blt 16384 bytes x 64: 3.259µs -> 2.909µs
> Time to blt 16384 bytes x 128: 3.083µs -> 3.095µs
> Time to blt 16384 bytes x 256: 3.104µs -> 2.979µs
> Time to blt 16384 bytes x 512: 3.080µs -> 3.089µs
> Time to blt 16384 bytes x 1024: 3.077µs -> 3.040µs
> Time to blt 16384 bytes x 2048: 3.127µs -> 3.304µs
> Time to blt 16384 bytes x 4096: 3.279µs -> 3.265µs
Then remembering to disable the cmdparser:
Time to blt 16384 bytes x 1: 4.000µs, 3.8GiB/s
Time to blt 16384 bytes x 2: 3.300µs, 4.6GiB/s
Time to blt 16384 bytes x 4: 2.850µs, 5.4GiB/s
Time to blt 16384 bytes x 8: 2.625µs, 5.8GiB/s
Time to blt 16384 bytes x 16: 2.500µs, 6.1GiB/s
Time to blt 16384 bytes x 32: 2.469µs, 6.2GiB/s
Time to blt 16384 bytes x 64: 2.447µs, 6.2GiB/s
Time to blt 16384 bytes x 128: 2.438µs, 6.3GiB/s
Time to blt 16384 bytes x 256: 2.382µs, 6.4GiB/s
Time to blt 16384 bytes x 512: 2.389µs, 6.4GiB/s
Time to blt 16384 bytes x 1024: 2.386µs, 6.4GiB/s
Time to blt 16384 bytes x 2048: 2.382µs, 6.4GiB/s
Time to blt 16384 bytes x 4096: 2.378µs, 6.4GiB/s
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 16:31 ` Tvrtko Ursulin
@ 2015-03-23 8:29 ` Daniel Vetter
0 siblings, 0 replies; 29+ messages in thread
From: Daniel Vetter @ 2015-03-23 8:29 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: Daniel Vetter, intel-gfx, Rantala, Valtteri, Eero Tamminen
On Fri, Mar 20, 2015 at 04:31:29PM +0000, Tvrtko Ursulin wrote:
>
> On 03/20/2015 04:19 PM, Chris Wilson wrote:
> >On Fri, Mar 20, 2015 at 04:01:52PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/20/2015 02:36 PM, Chris Wilson wrote:
> >>>This provides a nice boost to mesa in swap bound scenarios (as mesa
> >>>throttles itself to the previous frame and given the scenario that will
> >>>complete shortly). It will also provide a good boost to systems running
> >>>with semaphores disabled and so frequently waiting on the GPU as it
> >>>switches rings. In the most favourable of microbenchmarks, this can
> >>>increase performance by around 15% - though in practice improvements
> >>>will be marginal and rarely noticeable.
> >>>
> >>>v2: Account for user timeouts
> >>>v3: Limit the spinning to a single jiffie (~1us) at most. On an
> >>>otherwise idle system, there is no scheduler contention and so without a
> >>>limit we would spin until the GPU is ready.
> >>>v4: Drop forcewake - the lazy coherent access doesn't require it, and we
> >>>have no reason to believe that the forcewake itself improves seqno
> >>>coherency - it only adds delay.
> >>>
> >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> >>>Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
> >>
> >>Still against a toggle switch like a simple module parameter?
> >
> >Yes. I'd much rather tackle the corner cases than ignore them.
>
> I'll say it once more then leave it be - my point of view is that a module
> param does not mean ignoring any issues. It rather means that, if a
> pathological use case is found in the field, you can provide a better user
> experience and then work in parallel on coming up with improvements.
>
> Your view is probably that if there is a toggle, someone somewhere will put
> on some wiki "yeah if that happens just put this in modprobe.conf" and there
> won't even be a bug report.
This is indeed what generally happens. We've suffered for years from
google returning outdated wiki pages with "clever" suggestions ...
Nowadays all mod params are generally of the kernel tainting kind to
increase the odds that we get bug reports. I still don't really like them.
> It is a downside, yes, but to me much better than finding a bad corner case
> and then saying "Yeah, please recompile your kernel", or making users
> downgrade their kernel and wait for X weeks/months until the fix propagates.
I agree but only for stuff that causes crashes and hangs. There a module
option makes sense for quicker debug. For general tuning I don't really
want them - the kernel should get things right, period.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-20 22:59 ` Chris Wilson
2015-03-21 9:49 ` Chris Wilson
@ 2015-03-23 8:31 ` Daniel Vetter
2015-03-23 9:09 ` Chris Wilson
1 sibling, 1 reply; 29+ messages in thread
From: Daniel Vetter @ 2015-03-23 8:31 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, intel-gfx, Daniel Vetter,
Eero Tamminen, Rantala, Valtteri
On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > Window, but rendering internally at 1080p] clients we can run in
> > parallel whilst hitting 60fps. And then whether allowing multiple
> > spinners helps or hinders.
>
> I was thinking of a nice easy test that could demonstrate any advantage
> for spinning over waiting, and realised we already had such an igt. The
> trick is that it has to generate sufficient GPU load to actually require
> a wait, but not too high a GPU load such that we can see the impact from
> slow completion.
>
> I present igt/gem_exec_blt (modified to repeat the measurement and do an
> average over several runs):
>
> Time to blt 16384 bytes x 1: 21.000µs -> 5.800µs
> Time to blt 16384 bytes x 2: 11.500µs -> 4.500µs
> Time to blt 16384 bytes x 4: 6.750µs -> 3.750µs
> Time to blt 16384 bytes x 8: 4.950µs -> 3.375µs
> Time to blt 16384 bytes x 16: 3.825µs -> 3.175µs
> Time to blt 16384 bytes x 32: 3.356µs -> 3.000µs
> Time to blt 16384 bytes x 64: 3.259µs -> 2.909µs
> Time to blt 16384 bytes x 128: 3.083µs -> 3.095µs
> Time to blt 16384 bytes x 256: 3.104µs -> 2.979µs
> Time to blt 16384 bytes x 512: 3.080µs -> 3.089µs
> Time to blt 16384 bytes x 1024: 3.077µs -> 3.040µs
> Time to blt 16384 bytes x 2048: 3.127µs -> 3.304µs
> Time to blt 16384 bytes x 4096: 3.279µs -> 3.265µs
We probably need to revisit this when the scheduler lands - that one will
want to keep a short queue and generally will block for some request to
complete.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH v4] drm/i915: Optimistically spin for the request completion
2015-03-23 8:31 ` Daniel Vetter
@ 2015-03-23 9:09 ` Chris Wilson
0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2015-03-23 9:09 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, Rantala, Valtteri, Eero Tamminen, Daniel Vetter
On Mon, Mar 23, 2015 at 09:31:38AM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > > Window, but rendering internally at 1080p] clients we can run in
> > > parallel whilst hitting 60fps. And then whether allowing multiple
> > > spinners helps or hinders.
> >
> > I was thinking of a nice easy test that could demonstrate any advantage
> > for spinning over waiting, and realised we already had such an igt. The
> > trick is that it has to generate sufficient GPU load to actually require
> > a wait, but not too high a GPU load such that we can see the impact from
> > slow completion.
> >
> > I present igt/gem_exec_blt (modified to repeat the measurement and do an
> > average over several runs):
> >
> > Time to blt 16384 bytes x 1: 21.000µs -> 5.800µs
> > Time to blt 16384 bytes x 2: 11.500µs -> 4.500µs
> > Time to blt 16384 bytes x 4: 6.750µs -> 3.750µs
> > Time to blt 16384 bytes x 8: 4.950µs -> 3.375µs
> > Time to blt 16384 bytes x 16: 3.825µs -> 3.175µs
> > Time to blt 16384 bytes x 32: 3.356µs -> 3.000µs
> > Time to blt 16384 bytes x 64: 3.259µs -> 2.909µs
> > Time to blt 16384 bytes x 128: 3.083µs -> 3.095µs
> > Time to blt 16384 bytes x 256: 3.104µs -> 2.979µs
> > Time to blt 16384 bytes x 512: 3.080µs -> 3.089µs
> > Time to blt 16384 bytes x 1024: 3.077µs -> 3.040µs
> > Time to blt 16384 bytes x 2048: 3.127µs -> 3.304µs
> > Time to blt 16384 bytes x 4096: 3.279µs -> 3.265µs
>
> We probably need to revisit this when the scheduler lands - that one will
> want to keep a short queue and generally will block for some request to
> complete.
Speaking of which, execlists! You may have noticed that I
surreptitiously chose hsw to avoid the execlists overhead...
I was messing around over the weekend looking at the submission overhead
on bdw-u:
-nightly +spin +hax execlists=0
x1: 23.600µs 18.400µs 15.200µs 6.800µs
x2: 19.700µs 16.500µs 15.900µs 5.000µs
x4: 15.600µs 12.250µs 12.500µs 4.450µs
x8: 13.575µs 11.000µs 11.650µs 4.050µs
x16: 10.812µs 9.738µs 9.875µs 3.900µs
x32: 9.281µs 8.613µs 9.406µs 3.750µs
x64: 8.088µs 7.988µs 8.806µs 3.703µs
x128: 7.683µs 7.838µs 8.617µs 3.647µs
x256: 9.481µs 7.301µs 8.091µs 3.409µs
x512: 5.579µs 5.992µs 6.177µs 3.561µs
x1024: 10.093µs 3.963µs 4.187µs 3.531µs
x2048: 11.497µs 3.794µs 3.873µs 3.477µs
x4096: 8.926µs 5.269µs 3.813µs 3.461µs
The hax are to remove the extra atomic ops and spinlocks imposed by
execlists. Steady state seems to be roughly on a par with the difference
appearing to be interrupt latency + extra register writes. What's
interesting is the latency caused by the ELSP submission mechanism to an
idle GPU, a hard floor for us. It may even be worth papering over it by
starting execlists from a tasklet.
I do feel this sort of information is missing from the execlists
merge...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
end of thread, other threads:[~2015-03-23 9:10 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-11 15:29 [PATCH v2] drm/i915: Optimistically spin for the request completion Chris Wilson
2015-03-11 21:18 ` Chris Wilson
2015-03-12 9:07 ` Chris Wilson
2015-03-12 9:17 ` Daniel Vetter
2015-03-12 11:11 ` [PATCH v3] " Chris Wilson
2015-03-12 12:06 ` Chris Wilson
2015-03-12 13:14 ` Tvrtko Ursulin
2015-03-12 13:18 ` Chris Wilson
2015-03-12 15:18 ` Tvrtko Ursulin
2015-03-12 16:28 ` Chris Wilson
2015-03-12 16:41 ` Tvrtko Ursulin
2015-03-12 16:50 ` Chris Wilson
2015-03-12 17:32 ` Tvrtko Ursulin
2015-03-13 9:33 ` Daniel Vetter
2015-03-12 19:27 ` shuang.he
2015-03-19 15:16 ` Chris Wilson
2015-03-20 14:54 ` Daniel Vetter
2015-03-20 15:27 ` Chris Wilson
2015-03-20 14:36 ` [PATCH v4] " Chris Wilson
2015-03-20 16:01 ` Tvrtko Ursulin
2015-03-20 16:19 ` Chris Wilson
2015-03-20 16:31 ` Tvrtko Ursulin
2015-03-23 8:29 ` Daniel Vetter
2015-03-20 22:59 ` Chris Wilson
2015-03-21 9:49 ` Chris Wilson
2015-03-23 8:31 ` Daniel Vetter
2015-03-23 9:09 ` Chris Wilson
2015-03-20 21:30 ` shuang.he
2015-03-11 23:07 ` [PATCH v2] " shuang.he