* [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
@ 2017-09-14 9:58 Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Chris Wilson @ 2017-09-14 9:58 UTC (permalink / raw)
To: intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
An interesting discussion regarding "hybrid interrupt polling" for NVMe
came to the conclusion that the ideal busyspin before sleeping was half
of the expected request latency (and better if it was already halfway
through that request). This suggested that we too should look again at
our tradeoff between spinning and waiting. Currently, our spin simply
tries to hide the cost of enabling the interrupt, which is good to avoid
penalising nop requests (i.e. test throughput) and not much else.
Studying real world workloads suggests that a spin of up to 500us can
dramatically boost performance, but the suggestion is that this is not
from avoiding interrupt latency per se, but from secondary effects of
sleeping, such as allowing the CPU to reduce its cstate and to context switch away.
To offset those costs from penalising the active client, bump the initial
spin somewhat to 250us and the secondary spin to 20us to balance the cost
of another context switch following the interrupt.
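For reference, the wait being tuned here is structured roughly as below (a
simplified sketch only; the real flow is i915_wait_request() and includes the
signal and reset handling omitted here):

    if (i915_spin_request(req, state, 250))        /* optimistic first spin */
        goto complete;

    intel_engine_add_wait(req->engine, &wait);     /* arm the seqno interrupt */
    for (;;) {
        set_current_state(state);
        if (i915_gem_request_completed(req))
            break;
        timeout = io_schedule_timeout(timeout);    /* sleep until the irq wakes us */

        /* freshly woken: a short spin to absorb the scheduler latency */
        if (i915_spin_request(req, state, 20))
            break;
    }
    intel_engine_remove_wait(req->engine, &wait);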
Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 813a3b546d6e..ccbdaf6a0e4d 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
GEM_BUG_ON(!intel_wait_has_seqno(&wait));
GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
- /* Optimistic short spin before touching IRQs */
- if (i915_spin_request(req, state, 5))
+ /* Optimistic short spin before touching IRQs.
+ *
+ * We use a rather large value here to offset the penalty of switching
+ * away from the active task. Frequently, the client will wait upon
+ * an old swapbuffer to throttle itself to remain within a frame of
+ * the gpu. If the client is running in lockstep with the gpu, then
+ * it should not be waiting long at all, and a sleep now will incur
+ * extra scheduler latency in producing the next frame. So we sleep
+ * for longer to try and keep the client running.
+ *
+ * We need ~5us to enable the irq, ~20us to hide a context switch,
+ * we use 250us to keep the cache hot.
+ */
+ if (i915_spin_request(req, state, 250))
goto complete;
set_current_state(state);
@@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
__i915_wait_request_check_and_reset(req))
continue;
- /* Only spin if we know the GPU is processing this request */
- if (i915_spin_request(req, state, 2))
+ /*
+ * A quick spin now we are on the CPU to offset the cost of
+ * context switching away (and so spin for roughly the same as
+ * the scheduler latency). We only spin if we know the GPU is
+ * processing this request, and so likely to finish shortly.
+ */
+ if (i915_spin_request(req, state, 20))
break;
if (!intel_wait_check_request(&wait, req)) {
--
2.14.1
* ✓ Fi.CI.BAT: success for drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
@ 2017-09-14 10:18 ` Patchwork
2017-09-14 11:17 ` ✗ Fi.CI.IGT: failure " Patchwork
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2017-09-14 10:18 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Increase the busyspin durations for i915_wait_request
URL : https://patchwork.freedesktop.org/series/30352/
State : success
== Summary ==
Series 30352v1 drm/i915: Increase the busyspin durations for i915_wait_request
https://patchwork.freedesktop.org/api/1.0/series/30352/revisions/1/mbox/
Test chamelium:
Subgroup dp-crc-fast:
fail -> PASS (fi-kbl-7500u) fdo#102514
Test kms_cursor_legacy:
Subgroup basic-busy-flip-before-cursor-atomic:
fail -> PASS (fi-snb-2600) fdo#100215 +1
Test pm_rpm:
Subgroup basic-rte:
pass -> DMESG-WARN (fi-cfl-s) fdo#102294
fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#102294 https://bugs.freedesktop.org/show_bug.cgi?id=102294
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:441s
fi-blb-e6850 total:289 pass:224 dwarn:1 dfail:0 fail:0 skip:64 time:376s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:537s
fi-bwr-2160 total:289 pass:184 dwarn:0 dfail:0 fail:0 skip:105 time:268s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:505s
fi-byt-j1900 total:289 pass:254 dwarn:1 dfail:0 fail:0 skip:34 time:506s
fi-byt-n2820 total:289 pass:250 dwarn:1 dfail:0 fail:0 skip:38 time:498s
fi-cfl-s total:289 pass:222 dwarn:35 dfail:0 fail:0 skip:32 time:546s
fi-elk-e7500 total:289 pass:230 dwarn:0 dfail:0 fail:0 skip:59 time:446s
fi-glk-2a total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:588s
fi-hsw-4770 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:431s
fi-hsw-4770r total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:405s
fi-ilk-650 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:436s
fi-ivb-3520m total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:492s
fi-ivb-3770 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:465s
fi-kbl-7500u total:289 pass:264 dwarn:1 dfail:0 fail:0 skip:24 time:496s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:579s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:578s
fi-pnv-d510 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:557s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:459s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:528s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:502s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:466s
fi-skl-x1585l total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:494s
fi-snb-2520m total:289 pass:251 dwarn:0 dfail:0 fail:0 skip:38 time:579s
fi-snb-2600 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:424s
fi-bdw-gvtdvm failed to connect after reboot
b21142588a9a16f74988f6b524953d4382be4a81 drm-tip: 2017y-09m-14d-08h-23m-57s UTC integration manifest
68f1121717fe drm/i915: Increase the busyspin durations for i915_wait_request
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5695/
* ✗ Fi.CI.IGT: failure for drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-09-14 11:17 ` Patchwork
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
2017-09-15 10:01 ` Tvrtko Ursulin
3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2017-09-14 11:17 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Increase the busyspin durations for i915_wait_request
URL : https://patchwork.freedesktop.org/series/30352/
State : failure
== Summary ==
Test drv_missed_irq:
pass -> FAIL (shard-hsw)
Test kms_setmode:
Subgroup basic:
pass -> FAIL (shard-hsw) fdo#99912
Test perf:
Subgroup blocking:
fail -> PASS (shard-hsw) fdo#102252
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
shard-hsw total:2313 pass:1244 dwarn:0 dfail:0 fail:14 skip:1055 time:9368s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5695/shards.html
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-09-14 11:17 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2017-09-15 9:15 ` Kamble, Sagar A
2017-09-15 9:23 ` Chris Wilson
2017-09-15 10:01 ` Tvrtko Ursulin
3 siblings, 1 reply; 8+ messages in thread
From: Kamble, Sagar A @ 2017-09-15 9:15 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Eero Tamminen, Ben Widawsky
Thanks Chris. LGTM.
Minor inputs below
On 9/14/2017 3:28 PM, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
> To offset those costs from penalising the active client, bump the initial
> spin somewhat to 250us and the secondary spin to 20us to balance the cost
> of another context switch following the interrupt.
>
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 813a3b546d6e..ccbdaf6a0e4d 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>
> - /* Optimistic short spin before touching IRQs */
> - if (i915_spin_request(req, state, 5))
> + /* Optimistic short spin before touching IRQs.
> + *
> + * We use a rather large value here to offset the penalty of switching
> + * away from the active task. Frequently, the client will wait upon
> + * an old swapbuffer to throttle itself to remain within a frame of
> + * the gpu. If the client is running in lockstep with the gpu, then
> + * it should not be waiting long at all, and a sleep now will incur
> + * extra scheduler latency in producing the next frame. So we sleep
Typo? Should this be "So we spin for longer"
> + * for longer to try and keep the client running.
> + *
> + * We need ~5us to enable the irq, ~20us to hide a context switch,
> + * we use 250us to keep the cache hot.
> + */
> + if (i915_spin_request(req, state, 250))
Hybrid polling introduces a sleep of 1/2 the request duration prior to a poll
of the remaining 1/2.
Should we reorder here to achieve the same?
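Roughly, in pseudo-code (the duration estimate below is hypothetical; we do
not track anything like it today):

    /* illustrative only: NVMe-style hybrid polling */
    expected_us = estimate_request_duration_us(req);      /* hypothetical helper */
    usleep_range(expected_us / 2, expected_us / 2 + 10);  /* sleep the first half */
    i915_spin_request(req, state, expected_us / 2);       /* busy-poll the rest */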
> goto complete;
>
> set_current_state(state);
> @@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> __i915_wait_request_check_and_reset(req))
> continue;
>
> - /* Only spin if we know the GPU is processing this request */
> - if (i915_spin_request(req, state, 2))
> + /*
> + * A quick spin now we are on the CPU to offset the cost of
> + * context switching away (and so spin for roughly the same as
> + * the scheduler latency). We only spin if we know the GPU is
> + * processing this request, and so likely to finish shortly.
> + */
> + if (i915_spin_request(req, state, 20))
> break;
>
> if (!intel_wait_check_request(&wait, req)) {
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
@ 2017-09-15 9:23 ` Chris Wilson
0 siblings, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2017-09-15 9:23 UTC (permalink / raw)
To: Kamble, Sagar A, intel-gfx; +Cc: Eero Tamminen, Ben Widawsky
Quoting Kamble, Sagar A (2017-09-15 10:15:31)
> Thanks Chris. LGTM.
>
> Minor inputs below
>
>
> On 9/14/2017 3:28 PM, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
> > To offset those costs from penalising the active client, bump the initial
> > spin somewhat to 250us and the secondary spin to 20us to balance the cost
> > of another context switch following the interrupt.
> >
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 813a3b546d6e..ccbdaf6a0e4d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> > GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> > GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
> >
> > - /* Optimistic short spin before touching IRQs */
> > - if (i915_spin_request(req, state, 5))
> > + /* Optimistic short spin before touching IRQs.
> > + *
> > + * We use a rather large value here to offset the penalty of switching
> > + * away from the active task. Frequently, the client will wait upon
> > + * an old swapbuffer to throttle itself to remain within a frame of
> > + * the gpu. If the client is running in lockstep with the gpu, then
> > + * it should not be waiting long at all, and a sleep now will incur
> > + * extra scheduler latency in producing the next frame. So we sleep
> Typo? Should this be "So we spin for longer"
> > + * for longer to try and keep the client running.
> > + *
> > + * We need ~5us to enable the irq, ~20us to hide a context switch,
> > + * we use 250us to keep the cache hot.
> > + */
> > + if (i915_spin_request(req, state, 250))
> Hybrid polling introduces sleep of 1/2 request duration prior to poll of
> 1/2 request duration.
> Should we reorder here to achieve the same?
No, we are not tracking estimated time of completion for a request. We
don't know when the request was started on the hw, nor when it is likely
to complete. Now we could bookend the batch with TIMESTAMPs and with a
lot of extra work track an ewma per context per engine, but even with
some magic, I don't think that will work across preemption.
Yes, it makes a lot of sense not to spin if we don't expect the request
to complete soon. The closest we have at present (and currently use) is merely
a flag saying whether the request is currently executing.
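Purely to illustrate the scale of it, the per-context, per-engine bookkeeping
would be something like the below (all names invented, not part of the patch):

    /* hypothetical: fed from TIMESTAMP writes bookending each batch */
    struct ctx_engine_stats {
        u64 avg_batch_ns;       /* ewma of observed batch runtimes */
    };

    static void ctx_engine_stats_add(struct ctx_engine_stats *s, u64 sample_ns)
    {
        /* ewma, weighting the new sample at 1/8 */
        s->avg_batch_ns = (7 * s->avg_batch_ns + sample_ns) / 8;
    }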
-Chris
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
` (2 preceding siblings ...)
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
@ 2017-09-15 10:01 ` Tvrtko Ursulin
2017-09-15 10:18 ` Chris Wilson
3 siblings, 1 reply; 8+ messages in thread
From: Tvrtko Ursulin @ 2017-09-15 10:01 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
On 14/09/2017 10:58, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
What workloads, and what power/perf testing?
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
Maybe the second part of the sentence would be clearer if it were not phrased
in inverted form. Like longer spin = more performance = less sleeping =
less cstate switching? Or just add "but from _avoiding_ secondary
effects of sleeping"?
> To offset those costs from penalising the active client, bump the initial
> spin somewhat to 250us and the secondary spin to 20us to balance the cost
> of another context switch following the interrupt.
>
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 813a3b546d6e..ccbdaf6a0e4d 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>
> - /* Optimistic short spin before touching IRQs */
> - if (i915_spin_request(req, state, 5))
> + /* Optimistic short spin before touching IRQs.
So it's not short any more. "Optimistic busy spin" ?
> + *
> + * We use a rather large value here to offset the penalty of switching
> + * away from the active task. Frequently, the client will wait upon
> + * an old swapbuffer to throttle itself to remain within a frame of
> + * the gpu. If the client is running in lockstep with the gpu, then
> + * it should not be waiting long at all, and a sleep now will incur
> + * extra scheduler latency in producing the next frame. So we sleep
> + * for longer to try and keep the client running.
> + *
250us sounds quite long and worrying to me.
In the waiting on swapbuffer case, what are the clients waiting for? GPU
rendering to finish or previous vblank or something?
I am wondering whether it would be possible to add a special API just for this
sort of wait, one which internally knows how long the wait is likely to take
and then decides based on that whether to spin or sleep. For example, if the
next vblank is coming in 5ms, there is no point in busy spinning, or something
like that.
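Hand-waving in pseudo-code, with every name below hypothetical:

    /* hypothetical: caller-supplied hint about how far away completion is */
    expected_us = i915_wait_estimate_us(req);     /* e.g. time until next vblank */
    if (expected_us <= SPIN_BUDGET_US)
        i915_spin_request(req, state, expected_us);
    else
        wait_for_irq(req);    /* completion is far off, go straight to sleep */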
Regards,
Tvrtko
> + * We need ~5us to enable the irq, ~20us to hide a context switch,
> + * we use 250us to keep the cache hot.
> + */
> + if (i915_spin_request(req, state, 250))
> goto complete;
>
> set_current_state(state);
> @@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> __i915_wait_request_check_and_reset(req))
> continue;
>
> - /* Only spin if we know the GPU is processing this request */
> - if (i915_spin_request(req, state, 2))
> + /*
> + * A quick spin now we are on the CPU to offset the cost of
> + * context switching away (and so spin for roughly the same as
> + * the scheduler latency). We only spin if we know the GPU is
> + * processing this request, and so likely to finish shortly.
> + */
> + if (i915_spin_request(req, state, 20))
> break;
>
> if (!intel_wait_check_request(&wait, req)) {
>
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 10:01 ` Tvrtko Ursulin
@ 2017-09-15 10:18 ` Chris Wilson
2017-10-23 9:05 ` Sagar Arun Kamble
0 siblings, 1 reply; 8+ messages in thread
From: Chris Wilson @ 2017-09-15 10:18 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
Quoting Tvrtko Ursulin (2017-09-15 11:01:02)
>
> On 14/09/2017 10:58, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
>
> What workloads and and power/perf testing?
Classic glxgears ;) People on the cc I trust to give a solid yay/nay.
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
>
> Maybe the second part of the sentence would be clearer if not in a way
> in inverted form. Like longer spin = more performance = less sleeping =
> less cstate switching? Or just add "but from _avoiding_ secondary
> effects of sleeping"?
>
> > To offset those costs from penalising the active client, bump the initial
> > spin somewhat to 250us and the secondary spin to 20us to balance the cost
> > of another context switch following the interrupt.
> >
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 813a3b546d6e..ccbdaf6a0e4d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> > GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> > GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
> >
> > - /* Optimistic short spin before touching IRQs */
> > - if (i915_spin_request(req, state, 5))
> > + /* Optimistic short spin before touching IRQs.
>
> So it's not short any more. "Optimistic busy spin" ?
>
> > + *
> > + * We use a rather large value here to offset the penalty of switching
> > + * away from the active task. Frequently, the client will wait upon
> > + * an old swapbuffer to throttle itself to remain within a frame of
> > + * the gpu. If the client is running in lockstep with the gpu, then
> > + * it should not be waiting long at all, and a sleep now will incur
> > + * extra scheduler latency in producing the next frame. So we sleep
> > + * for longer to try and keep the client running.
> > + *
>
> 250us sounds quite long and worrying to me.
Yes. It's the sort of level where we are applying an artificial load to
the CPU to encourage turbo...
20us, I'm happier with; that does tally well with the latencies we can
measure from interrupt to process. 250us and we're well into secondary
effects that we should not be messing with (at least not like this).
Otoh, at 250us it might be interesting to play with monitor/mwait.
> In the waiting on swapbuffer case, what are the clients waiting for? GPU
> rendering to finish or previous vblank or something?
It's throttling the swap queue, so previous frame.
> I am thinking if it would be possible to add a special API just for this
> sort of waits and internally know how long it is likely to take. So then
> decide based on that whether to spin or sleep. Like next vblank is
> coming in 5ms, no point in busy spinning or something like that.
We have one; THROTTLE. For the user, it is too imprecise (we could
augment it with a timeout parameter, but...) as it doesn't wait for an
explicit event. GL ideally tries to ensure the render pipeline is full
but never more than a completed frame in advance. (That pipeline depth
could be specified by the client.) You've seen the same effect in wsim,
where if you just let a client fill up 20ms of batches, that can block
the GPU for seconds, but again a limit upon client batches enforces round-robin.
We've left it to the user to cooperate and balance their own pipeline;
the old scheduler enforced it at the ABI level, but it ideally just uses
a CFS-esque method, leaving the judgement of pipeline depth to the user.
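(For reference, from userspace the throttle is just an argument-less ioctl;
with libdrm it amounts to no more than this sketch, header paths depending on
the libdrm install:)

    #include <xf86drm.h>
    #include <i915_drm.h>

    /* block until this client's older outstanding requests have completed */
    static void throttle(int drm_fd)
    {
        drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_THROTTLE, NULL);
    }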
-Chris
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 10:18 ` Chris Wilson
@ 2017-10-23 9:05 ` Sagar Arun Kamble
0 siblings, 0 replies; 8+ messages in thread
From: Sagar Arun Kamble @ 2017-10-23 9:05 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
On 9/15/2017 3:48 PM, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-09-15 11:01:02)
>> On 14/09/2017 10:58, Chris Wilson wrote:
>>> An interesting discussion regarding "hybrid interrupt polling" for NVMe
>>> came to the conclusion that the ideal busyspin before sleeping was half
>>> of the expected request latency (and better if it was already halfway
>>> through that request). This suggested that we too should look again at
>>> our tradeoff between spinning and waiting. Currently, our spin simply
>>> tries to hide the cost of enabling the interrupt, which is good to avoid
>>> penalising nop requests (i.e. test throughput) and not much else.
>>> Studying real world workloads suggests that a spin of upto 500us can
>> What workloads and and power/perf testing?
> Classic glxgears ;) People on the cc I trust to give a solid yay/nay.
Perf runs on SKL show improvements on some of the CL and media benchmarks
without impacting power.
On APL, perf is not impacted, but a core power increase is observed. This
observation is present both with and without GuC.
More benefit of this approach with GuC can be seen by disabling RC6/CPG. I
tested the xcompbench offscreen benchmark with GuC on APL, with this change
and RC6 disabled, and observed substantial improvements.
>>> dramatically boost performance, but the suggestion is that this is not
>>> from avoiding interrupt latency per-se, but from secondary effects of
>>> sleeping such as allowing the CPU reduce cstate and context switch away.
>> Maybe the second part of the sentence would be clearer if not in a way
>> in inverted form. Like longer spin = more performance = less sleeping =
>> less cstate switching? Or just add "but from _avoiding_ secondary
>> effects of sleeping"?
>>
>>> To offset those costs from penalising the active client, bump the initial
>>> spin somewhat to 250us and the secondary spin to 20us to balance the cost
>>> of another context switch following the interrupt.
>>>
>>> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
>>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Ben Widawsky <ben@bwidawsk.net>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The change looks good to me (w.r.t old drm-tip)
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>>> index 813a3b546d6e..ccbdaf6a0e4d 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>>> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
>>> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>>> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>>>
>>> - /* Optimistic short spin before touching IRQs */
>>> - if (i915_spin_request(req, state, 5))
>>> + /* Optimistic short spin before touching IRQs.
>> So it's not short any more. "Optimistic busy spin" ?
>>
>>> + *
>>> + * We use a rather large value here to offset the penalty of switching
>>> + * away from the active task. Frequently, the client will wait upon
>>> + * an old swapbuffer to throttle itself to remain within a frame of
>>> + * the gpu. If the client is running in lockstep with the gpu, then
>>> + * it should not be waiting long at all, and a sleep now will incur
>>> + * extra scheduler latency in producing the next frame. So we sleep
>>> + * for longer to try and keep the client running.
>>> + *
>> 250us sounds quite long and worrying to me.
> Yes. It's the sort of level where we are applying an artificial load to
> the CPU to encourage turbo...
>
> 20us, I'm happier with; that does tally well with the latencies we can
> measure from interrupt to process. 250us and we're well into secondary
> effects that we should not be messing with (at least not like this).
>
> Otoh, at 250us it might be interesting to play with monitor/mwait.
>
>> In the waiting on swapbuffer case, what are the clients waiting for? GPU
>> rendering to finish or previous vblank or something?
> It's throttling the swap queue, so previous frame.
>
>> I am thinking if it would be possible to add a special API just for this
>> sort of waits and internally know how long it is likely to take. So then
>> decide based on that whether to spin or sleep. Like next vblank is
>> coming in 5ms, no point in busy spinning or something like that.
> We have one; THROTTLE. For the user, it is too imprecise (we could
> argument it with a timeout parameter, but...) as it doesn't wait for an
> explicit event. GL ideally tries to ensure the render pipeline is full
> but never more than a completed frame in advance. (That pipeline depth
> could be specified by the client.) You've seen the same effect in wsim,
> where if you just let a client fill up 20ms of batches, that can block
> the GPU for seconds, but again a limit upon client batches enforces rr.
> We've left it to the user to cooperate and balance their own pipeline;
> the old scheduler enforced it at the ABI level, but it ideally just uses
> a CFS-esque method, leaving the judgement of pipeline depth to the user.
> -Chris