* [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
@ 2017-09-14 9:58 Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Chris Wilson @ 2017-09-14 9:58 UTC (permalink / raw)
To: intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
An interesting discussion regarding "hybrid interrupt polling" for NVMe
came to the conclusion that the ideal busyspin before sleeping was half
of the expected request latency (and better if it was already halfway
through that request). This suggested that we too should look again at
our tradeoff between spinning and waiting. Currently, our spin simply
tries to hide the cost of enabling the interrupt, which is good to avoid
penalising nop requests (i.e. test throughput) and not much else.
Studying real world workloads suggests that a spin of up to 500us can
dramatically boost performance, but the suggestion is that this is not
from avoiding interrupt latency per se, but from secondary effects of
sleeping, such as allowing the CPU to reduce its cstate and to context switch away.
To offset those costs from penalising the active client, bump the initial
spin somewhat to 250us and the secondary spin to 20us to balance the cost
of another context switch following the interrupt.
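For reference, the wait being tuned here is structured roughly as below (a
simplified sketch only; the real flow is i915_wait_request() and includes the
signal and reset handling omitted here):

    if (i915_spin_request(req, state, 250))        /* optimistic first spin */
        goto complete;

    intel_engine_add_wait(req->engine, &wait);     /* arm the seqno interrupt */
    for (;;) {
        set_current_state(state);
        if (i915_gem_request_completed(req))
            break;
        timeout = io_schedule_timeout(timeout);    /* sleep until the irq wakes us */

        /* freshly woken: a short spin to absorb the scheduler latency */
        if (i915_spin_request(req, state, 20))
            break;
    }
    intel_engine_remove_wait(req->engine, &wait);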
Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 813a3b546d6e..ccbdaf6a0e4d 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
GEM_BUG_ON(!intel_wait_has_seqno(&wait));
GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
- /* Optimistic short spin before touching IRQs */
- if (i915_spin_request(req, state, 5))
+ /* Optimistic short spin before touching IRQs.
+ *
+ * We use a rather large value here to offset the penalty of switching
+ * away from the active task. Frequently, the client will wait upon
+ * an old swapbuffer to throttle itself to remain within a frame of
+ * the gpu. If the client is running in lockstep with the gpu, then
+ * it should not be waiting long at all, and a sleep now will incur
+ * extra scheduler latency in producing the next frame. So we sleep
+ * for longer to try and keep the client running.
+ *
+ * We need ~5us to enable the irq, ~20us to hide a context switch,
+ * we use 250us to keep the cache hot.
+ */
+ if (i915_spin_request(req, state, 250))
goto complete;
set_current_state(state);
@@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
__i915_wait_request_check_and_reset(req))
continue;
- /* Only spin if we know the GPU is processing this request */
- if (i915_spin_request(req, state, 2))
+ /*
+ * A quick spin now we are on the CPU to offset the cost of
+ * context switching away (and so spin for roughly the same as
+ * the scheduler latency). We only spin if we know the GPU is
+ * processing this request, and so likely to finish shortly.
+ */
+ if (i915_spin_request(req, state, 20))
break;
if (!intel_wait_check_request(&wait, req)) {
--
2.14.1
* ✓ Fi.CI.BAT: success for drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
@ 2017-09-14 10:18 ` Patchwork
2017-09-14 11:17 ` ✗ Fi.CI.IGT: failure " Patchwork
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2017-09-14 10:18 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Increase the busyspin durations for i915_wait_request
URL : https://patchwork.freedesktop.org/series/30352/
State : success
== Summary ==
Series 30352v1 drm/i915: Increase the busyspin durations for i915_wait_request
https://patchwork.freedesktop.org/api/1.0/series/30352/revisions/1/mbox/
Test chamelium:
Subgroup dp-crc-fast:
fail -> PASS (fi-kbl-7500u) fdo#102514
Test kms_cursor_legacy:
Subgroup basic-busy-flip-before-cursor-atomic:
fail -> PASS (fi-snb-2600) fdo#100215 +1
Test pm_rpm:
Subgroup basic-rte:
pass -> DMESG-WARN (fi-cfl-s) fdo#102294
fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#102294 https://bugs.freedesktop.org/show_bug.cgi?id=102294
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:441s
fi-blb-e6850 total:289 pass:224 dwarn:1 dfail:0 fail:0 skip:64 time:376s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:537s
fi-bwr-2160 total:289 pass:184 dwarn:0 dfail:0 fail:0 skip:105 time:268s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:505s
fi-byt-j1900 total:289 pass:254 dwarn:1 dfail:0 fail:0 skip:34 time:506s
fi-byt-n2820 total:289 pass:250 dwarn:1 dfail:0 fail:0 skip:38 time:498s
fi-cfl-s total:289 pass:222 dwarn:35 dfail:0 fail:0 skip:32 time:546s
fi-elk-e7500 total:289 pass:230 dwarn:0 dfail:0 fail:0 skip:59 time:446s
fi-glk-2a total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:588s
fi-hsw-4770 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:431s
fi-hsw-4770r total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:405s
fi-ilk-650 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:436s
fi-ivb-3520m total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:492s
fi-ivb-3770 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:465s
fi-kbl-7500u total:289 pass:264 dwarn:1 dfail:0 fail:0 skip:24 time:496s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:579s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:578s
fi-pnv-d510 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:557s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:459s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:528s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:502s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:466s
fi-skl-x1585l total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:494s
fi-snb-2520m total:289 pass:251 dwarn:0 dfail:0 fail:0 skip:38 time:579s
fi-snb-2600 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:424s
fi-bdw-gvtdvm failed to connect after reboot
b21142588a9a16f74988f6b524953d4382be4a81 drm-tip: 2017y-09m-14d-08h-23m-57s UTC integration manifest
68f1121717fe drm/i915: Increase the busyspin durations for i915_wait_request
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5695/
* ✗ Fi.CI.IGT: failure for drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-09-14 11:17 ` Patchwork
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
2017-09-15 10:01 ` Tvrtko Ursulin
3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2017-09-14 11:17 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Increase the busyspin durations for i915_wait_request
URL : https://patchwork.freedesktop.org/series/30352/
State : failure
== Summary ==
Test drv_missed_irq:
pass -> FAIL (shard-hsw)
Test kms_setmode:
Subgroup basic:
pass -> FAIL (shard-hsw) fdo#99912
Test perf:
Subgroup blocking:
fail -> PASS (shard-hsw) fdo#102252
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
shard-hsw total:2313 pass:1244 dwarn:0 dfail:0 fail:14 skip:1055 time:9368s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5695/shards.html
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
2017-09-14 10:18 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-09-14 11:17 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2017-09-15 9:15 ` Kamble, Sagar A
2017-09-15 9:23 ` Chris Wilson
2017-09-15 10:01 ` Tvrtko Ursulin
3 siblings, 1 reply; 8+ messages in thread
From: Kamble, Sagar A @ 2017-09-15 9:15 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Eero Tamminen, Ben Widawsky
Thanks Chris. LGTM.
Minor inputs below
On 9/14/2017 3:28 PM, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
> To offset those costs from penalising the active client, bump the initial
> spin somewhat to 250us and the secondary spin to 20us to balance the cost
> of another context switch following the interrupt.
>
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 813a3b546d6e..ccbdaf6a0e4d 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>
> - /* Optimistic short spin before touching IRQs */
> - if (i915_spin_request(req, state, 5))
> + /* Optimistic short spin before touching IRQs.
> + *
> + * We use a rather large value here to offset the penalty of switching
> + * away from the active task. Frequently, the client will wait upon
> + * an old swapbuffer to throttle itself to remain within a frame of
> + * the gpu. If the client is running in lockstep with the gpu, then
> + * it should not be waiting long at all, and a sleep now will incur
> + * extra scheduler latency in producing the next frame. So we sleep
Typo? Should this be "So we spin for longer"
> + * for longer to try and keep the client running.
> + *
> + * We need ~5us to enable the irq, ~20us to hide a context switch,
> + * we use 250us to keep the cache hot.
> + */
> + if (i915_spin_request(req, state, 250))
Hybrid polling introduces a sleep of 1/2 the request duration prior to a poll
of the remaining 1/2.
Should we reorder here to achieve the same?
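Roughly, in pseudo-code (the duration estimate below is hypothetical; we do
not track anything like it today):

    /* illustrative only: NVMe-style hybrid polling */
    expected_us = estimate_request_duration_us(req);      /* hypothetical helper */
    usleep_range(expected_us / 2, expected_us / 2 + 10);  /* sleep the first half */
    i915_spin_request(req, state, expected_us / 2);       /* busy-poll the rest */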
> goto complete;
>
> set_current_state(state);
> @@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> __i915_wait_request_check_and_reset(req))
> continue;
>
> - /* Only spin if we know the GPU is processing this request */
> - if (i915_spin_request(req, state, 2))
> + /*
> + * A quick spin now we are on the CPU to offset the cost of
> + * context switching away (and so spin for roughly the same as
> + * the scheduler latency). We only spin if we know the GPU is
> + * processing this request, and so likely to finish shortly.
> + */
> + if (i915_spin_request(req, state, 20))
> break;
>
> if (!intel_wait_check_request(&wait, req)) {
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
@ 2017-09-15 9:23 ` Chris Wilson
0 siblings, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2017-09-15 9:23 UTC (permalink / raw)
To: Kamble, Sagar A, intel-gfx; +Cc: Eero Tamminen, Ben Widawsky
Quoting Kamble, Sagar A (2017-09-15 10:15:31)
> Thanks Chris. LGTM.
>
> Minor inputs below
>
>
> On 9/14/2017 3:28 PM, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
> > To offset those costs from penalising the active client, bump the initial
> > spin somewhat to 250us and the secondary spin to 20us to balance the cost
> > of another context switch following the interrupt.
> >
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 813a3b546d6e..ccbdaf6a0e4d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> > GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> > GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
> >
> > - /* Optimistic short spin before touching IRQs */
> > - if (i915_spin_request(req, state, 5))
> > + /* Optimistic short spin before touching IRQs.
> > + *
> > + * We use a rather large value here to offset the penalty of switching
> > + * away from the active task. Frequently, the client will wait upon
> > + * an old swapbuffer to throttle itself to remain within a frame of
> > + * the gpu. If the client is running in lockstep with the gpu, then
> > + * it should not be waiting long at all, and a sleep now will incur
> > + * extra scheduler latency in producing the next frame. So we sleep
> Typo? Should this be "So we spin for longer"
> > + * for longer to try and keep the client running.
> > + *
> > + * We need ~5us to enable the irq, ~20us to hide a context switch,
> > + * we use 250us to keep the cache hot.
> > + */
> > + if (i915_spin_request(req, state, 250))
> Hybrid polling introduces sleep of 1/2 request duration prior to poll of
> 1/2 request duration.
> Should we reorder here to achieve the same?
No, we are not tracking estimated time of completion for a request. We
don't know when the request was started on the hw, nor when it is likely
to complete. Now we could bookend the batch with TIMESTAMPs and with a
lot of extra work track an ewma per context per engine, but even with
some magic, I don't think that will work across preemption.
Yes, it makes a lot of sense not to spin if we don't expect the request
to complete soon. The closest we have at present (and currently use) is merely
a flag saying whether the request is currently executing.
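Purely to illustrate the scale of it, the per-context, per-engine bookkeeping
would be something like the below (all names invented, not part of the patch):

    /* hypothetical: fed from TIMESTAMP writes bookending each batch */
    struct ctx_engine_stats {
        u64 avg_batch_ns;       /* ewma of observed batch runtimes */
    };

    static void ctx_engine_stats_add(struct ctx_engine_stats *s, u64 sample_ns)
    {
        /* ewma, weighting the new sample at 1/8 */
        s->avg_batch_ns = (7 * s->avg_batch_ns + sample_ns) / 8;
    }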
-Chris
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-14 9:58 [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request Chris Wilson
` (2 preceding siblings ...)
2017-09-15 9:15 ` [PATCH] " Kamble, Sagar A
@ 2017-09-15 10:01 ` Tvrtko Ursulin
2017-09-15 10:18 ` Chris Wilson
3 siblings, 1 reply; 8+ messages in thread
From: Tvrtko Ursulin @ 2017-09-15 10:01 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
On 14/09/2017 10:58, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
What workloads, and what power/perf testing?
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
Maybe the second part of the sentence would be clearer if it were not phrased
in inverted form. Like longer spin = more performance = less sleeping =
less cstate switching? Or just add "but from _avoiding_ secondary
effects of sleeping"?
> To offset those costs from penalising the active client, bump the initial
> spin somewhat to 250us and the secondary spin to 20us to balance the cost
> of another context switch following the interrupt.
>
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 813a3b546d6e..ccbdaf6a0e4d 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>
> - /* Optimistic short spin before touching IRQs */
> - if (i915_spin_request(req, state, 5))
> + /* Optimistic short spin before touching IRQs.
So it's not short any more. "Optimistic busy spin" ?
> + *
> + * We use a rather large value here to offset the penalty of switching
> + * away from the active task. Frequently, the client will wait upon
> + * an old swapbuffer to throttle itself to remain within a frame of
> + * the gpu. If the client is running in lockstep with the gpu, then
> + * it should not be waiting long at all, and a sleep now will incur
> + * extra scheduler latency in producing the next frame. So we sleep
> + * for longer to try and keep the client running.
> + *
250us sounds quite long and worrying to me.
In the waiting on swapbuffer case, what are the clients waiting for? GPU
rendering to finish or previous vblank or something?
I am wondering whether it would be possible to add a special API just for this
sort of wait, one which internally knows how long the wait is likely to take
and then decides based on that whether to spin or sleep. For example, if the
next vblank is coming in 5ms, there is no point in busy spinning, or something
like that.
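Hand-waving in pseudo-code, with every name below hypothetical:

    /* hypothetical: caller-supplied hint about how far away completion is */
    expected_us = i915_wait_estimate_us(req);     /* e.g. time until next vblank */
    if (expected_us <= SPIN_BUDGET_US)
        i915_spin_request(req, state, expected_us);
    else
        wait_for_irq(req);    /* completion is far off, go straight to sleep */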
Regards,
Tvrtko
> + * We need ~5us to enable the irq, ~20us to hide a context switch,
> + * we use 250us to keep the cache hot.
> + */
> + if (i915_spin_request(req, state, 250))
> goto complete;
>
> set_current_state(state);
> @@ -1212,8 +1224,13 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> __i915_wait_request_check_and_reset(req))
> continue;
>
> - /* Only spin if we know the GPU is processing this request */
> - if (i915_spin_request(req, state, 2))
> + /*
> + * A quick spin now we are on the CPU to offset the cost of
> + * context switching away (and so spin for roughly the same as
> + * the scheduler latency). We only spin if we know the GPU is
> + * processing this request, and so likely to finish shortly.
> + */
> + if (i915_spin_request(req, state, 20))
> break;
>
> if (!intel_wait_check_request(&wait, req)) {
>
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 10:01 ` Tvrtko Ursulin
@ 2017-09-15 10:18 ` Chris Wilson
2017-10-23 9:05 ` Sagar Arun Kamble
0 siblings, 1 reply; 8+ messages in thread
From: Chris Wilson @ 2017-09-15 10:18 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
Quoting Tvrtko Ursulin (2017-09-15 11:01:02)
>
> On 14/09/2017 10:58, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
>
> What workloads and and power/perf testing?
Classic glxgears ;) People on the cc I trust to give a solid yay/nay.
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
>
> Maybe the second part of the sentence would be clearer if not in a way
> in inverted form. Like longer spin = more performance = less sleeping =
> less cstate switching? Or just add "but from _avoiding_ secondary
> effects of sleeping"?
>
> > To offset those costs from penalising the active client, bump the initial
> > spin somewhat to 250us and the secondary spin to 20us to balance the cost
> > of another context switch following the interrupt.
> >
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 813a3b546d6e..ccbdaf6a0e4d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
> > GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> > GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
> >
> > - /* Optimistic short spin before touching IRQs */
> > - if (i915_spin_request(req, state, 5))
> > + /* Optimistic short spin before touching IRQs.
>
> So it's not short any more. "Optimistic busy spin" ?
>
> > + *
> > + * We use a rather large value here to offset the penalty of switching
> > + * away from the active task. Frequently, the client will wait upon
> > + * an old swapbuffer to throttle itself to remain within a frame of
> > + * the gpu. If the client is running in lockstep with the gpu, then
> > + * it should not be waiting long at all, and a sleep now will incur
> > + * extra scheduler latency in producing the next frame. So we sleep
> > + * for longer to try and keep the client running.
> > + *
>
> 250us sounds quite long and worrying to me.
Yes. It's the sort of level where we are applying an artificial load to
the CPU to encourage turbo...
20us, I'm happier with; that does tally well with the latencies we can
measure from interrupt to process. 250us and we're well into secondary
effects that we should not be messing with (at least not like this).
Otoh, at 250us it might be interesting to play with monitor/mwait.
> In the waiting on swapbuffer case, what are the clients waiting for? GPU
> rendering to finish or previous vblank or something?
It's throttling the swap queue, so previous frame.
> I am thinking if it would be possible to add a special API just for this
> sort of waits and internally know how long it is likely to take. So then
> decide based on that whether to spin or sleep. Like next vblank is
> coming in 5ms, no point in busy spinning or something like that.
We have one; THROTTLE. For the user, it is too imprecise (we could
augment it with a timeout parameter, but...) as it doesn't wait for an
explicit event. GL ideally tries to ensure the render pipeline is full
but never more than a completed frame in advance. (That pipeline depth
could be specified by the client.) You've seen the same effect in wsim,
where if you just let a client fill up 20ms of batches, that can block
the GPU for seconds, but again a limit upon client batches enforces round-robin.
We've left it to the user to cooperate and balance their own pipeline;
the old scheduler enforced it at the ABI level, but it ideally just uses
a CFS-esque method, leaving the judgement of pipeline depth to the user.
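(For reference, from userspace the throttle is just an argument-less ioctl;
with libdrm it amounts to no more than this sketch, header paths depending on
the libdrm install:)

    #include <xf86drm.h>
    #include <i915_drm.h>

    /* block until this client's older outstanding requests have completed */
    static void throttle(int drm_fd)
    {
        drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_THROTTLE, NULL);
    }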
-Chris
* Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request
2017-09-15 10:18 ` Chris Wilson
@ 2017-10-23 9:05 ` Sagar Arun Kamble
0 siblings, 0 replies; 8+ messages in thread
From: Sagar Arun Kamble @ 2017-10-23 9:05 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen
On 9/15/2017 3:48 PM, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-09-15 11:01:02)
>> On 14/09/2017 10:58, Chris Wilson wrote:
>>> An interesting discussion regarding "hybrid interrupt polling" for NVMe
>>> came to the conclusion that the ideal busyspin before sleeping was half
>>> of the expected request latency (and better if it was already halfway
>>> through that request). This suggested that we too should look again at
>>> our tradeoff between spinning and waiting. Currently, our spin simply
>>> tries to hide the cost of enabling the interrupt, which is good to avoid
>>> penalising nop requests (i.e. test throughput) and not much else.
>>> Studying real world workloads suggests that a spin of upto 500us can
>> What workloads and and power/perf testing?
> Classic glxgears ;) People on the cc I trust to give a solid yay/nay.
Perf runs on SKL show improvements on some of the CL and media benchmarks
without impacting power.
On APL, perf is not impacted, but a core power increase is observed. This
observation is present both with and without GuC.
More benefit of this approach with GuC can be seen by disabling RC6/CPG. I
tested the xcompbench offscreen benchmark with GuC on APL, with this change
and RC6 disabled, and observed substantial improvements.
>>> dramatically boost performance, but the suggestion is that this is not
>>> from avoiding interrupt latency per-se, but from secondary effects of
>>> sleeping such as allowing the CPU reduce cstate and context switch away.
>> Maybe the second part of the sentence would be clearer if not in a way
>> in inverted form. Like longer spin = more performance = less sleeping =
>> less cstate switching? Or just add "but from _avoiding_ secondary
>> effects of sleeping"?
>>
>>> To offset those costs from penalising the active client, bump the initial
>>> spin somewhat to 250us and the secondary spin to 20us to balance the cost
>>> of another context switch following the interrupt.
>>>
>>> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
>>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Ben Widawsky <ben@bwidawsk.net>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The change looks good to me (w.r.t old drm-tip)
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>>> index 813a3b546d6e..ccbdaf6a0e4d 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>>> @@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
>>> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>>> GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
>>>
>>> - /* Optimistic short spin before touching IRQs */
>>> - if (i915_spin_request(req, state, 5))
>>> + /* Optimistic short spin before touching IRQs.
>> So it's not short any more. "Optimistic busy spin" ?
>>
>>> + *
>>> + * We use a rather large value here to offset the penalty of switching
>>> + * away from the active task. Frequently, the client will wait upon
>>> + * an old swapbuffer to throttle itself to remain within a frame of
>>> + * the gpu. If the client is running in lockstep with the gpu, then
>>> + * it should not be waiting long at all, and a sleep now will incur
>>> + * extra scheduler latency in producing the next frame. So we sleep
>>> + * for longer to try and keep the client running.
>>> + *
>> 250us sounds quite long and worrying to me.
> Yes. It's the sort of level where we are applying an artificial load to
> the CPU to encourage turbo...
>
> 20us, I'm happier with; that does tally well with the latencies we can
> measure from interrupt to process. 250us and we're well into secondary
> effects that we should not be messing with (at least not like this).
>
> Otoh, at 250us it might be interesting to play with monitor/mwait.
>
>> In the waiting on swapbuffer case, what are the clients waiting for? GPU
>> rendering to finish or previous vblank or something?
> It's throttling the swap queue, so previous frame.
>
>> I am thinking if it would be possible to add a special API just for this
>> sort of waits and internally know how long it is likely to take. So then
>> decide based on that whether to spin or sleep. Like next vblank is
>> coming in 5ms, no point in busy spinning or something like that.
> We have one; THROTTLE. For the user, it is too imprecise (we could
> argument it with a timeout parameter, but...) as it doesn't wait for an
> explicit event. GL ideally tries to ensure the render pipeline is full
> but never more than a completed frame in advance. (That pipeline depth
> could be specified by the client.) You've seen the same effect in wsim,
> where if you just let a client fill up 20ms of batches, that can block
> the GPU for seconds, but again a limit upon client batches enforces rr.
> We've left it to the user to cooperate and balance their own pipeline;
> the old scheduler enforced it at the ABI level, but it ideally just uses
> a CFS-esque method, leaving the judgement of pipeline depth to the user.
> -Chris